diff examples/sbb-doc.txt @ 0:13d2b8934445

Import AnaGram (near-)release tree into Mercurial.
author David A. Holland
date Sat, 22 Dec 2007 17:52:45 -0500
parents
children
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/examples/sbb-doc.txt	Sat Dec 22 17:52:45 2007 -0500
@@ -0,0 +1,625 @@
+            ------------------------------------------------------
+                  Syntactic Building Blocks
+                  Copyright 1993-1998, Parsifal Software.
+                  All Rights Reserved.
+
+          This software is provided 'as-is', without any express or
+          implied warranty.  In no event will the authors be held
+          liable for any damages arising from the use of this software.
+
+          Permission is granted to anyone to use this software for any
+          purpose, including commercial applications, and to alter it
+          and redistribute it freely, subject to the restriction that
+          this notice may not be removed or altered from any source
+          distribution.
+            ------------------------------------------------------
+
+
+     Introduction
+     ---------------
+
+          This file contains examples of the productions necessary to
+          parse the more common syntactic elements in your grammar. It
+          should be of interest once you have begun to write your own
+          grammars. Each example is provided with sample reduction
+          procedures, which are adequate in the vast majority of cases.
+          You can use these productions like building blocks to quickly
+          assemble quite powerful grammars.  The text of this file is
+          also to be found in Appendix D of the AnaGram User's Guide.
+
+
+     End of File
+     --------------
+          If your parser is going to run under Unix or DOS and use
+          stream I/O to read its input, you can define end of file as
+          minus one:
+            eof = -1
+
+          If your parser is going to run under DOS and will use low
+          level I/O instead of stream I/O you might define eof as
+          Control Z:
+            eof = ^Z
+
+          If your parser is going to run under UNIX and will use low
+          level I/O instead of stream I/O you might define eof as
+          Control D:
+            eof = ^D
+
+          If your parser is simply going to parse a string in memory,
+          the definition of end of file should be a null character:
+            eof = 0
+
+          It is often convenient, however, to simply define end of
+          file so it will work in all of these contexts:
+            eof = -1 + 0 + ^D + ^Z
+
+
+     White Space
+     ----------------
+          It is convenient to have a representation for white space in
+          your grammar. Usually you do not wish to distinguish between
+          space characters and tab characters, so you can write:
+            blank = ' ' + '\t'
+
+          Using this definition, you can represent required white
+          space of indeterminate length with blank... and optional
+          white space with blank?...
+
+          It is common, however, to include comments (see below) as
+          white space. In this case, you can define the following
+          productions:
+            ws
+              -> blank
+              -> comment
+
+
+     End of Line
+     --------------
+          Because different systems use different representations for
+          end of line, it is wise to use an abstract end of line token
+          in your grammar, and define the token separately. If your
+          parser is going to use files that are known to use carriage
+          return alone or line feed alone as the end of line delimiter
+          you can use one of the following definitions:
+
+            eol = '\r'                          //carriage return only
+            eol = '\n'                          //line feed only
+
+          If your input files use the newline character as a line
+          terminator, but you wish to allow for optional carriage
+          returns, you might write:
+
+            eol
+              -> '\r'?, '\n'
+          or even
+            eol
+              -> '\r'?..., '\n'
+
+          If you wish to make a grammar that can deal with a file of
+          more or less arbitrary provenance, some caution is in order.
+          If you allow carriage return and line feed both to function
+          as end of line characters and you allow blank lines, you may
+          wind up with a very ambiguous grammar. For example, suppose
+          you use eol...  somewhere in your grammar and you have
+          defined eol thus:
+            eol
+              -> '\r'
+              -> '\n'
+              -> '\r', '\n'                        // trouble!
+
+          Your grammar is ambiguous since a carriage return, line feed
+          pair can be considered two line endings, according to the
+          first two production, or just one, according to the third.
+
+          If you really need to allow isolated carriage returns and
+          linefeeds to mark ends of line, but need to deal with the
+          pair as well, you can make good use of the idiosyncracies of
+          AnaGram's keyword logic by writing the following:
+            eol
+              -> '\r'
+              -> '\n'
+              -> "\r\n"
+
+          This will treat a carriage return followed by a line feed as
+          a single end of line.
+
+
+     Comments
+     -------------
+
+          There are two different categories of comments in common
+          use: Those that run to the end of line, and those that run
+          to a specific terminator.  The first can be conveniently
+          incorporated into your end of line token as a virtual
+          production:
+            eol
+              -> ["//", ~(eof + '\n')?...], '\n'
+
+          Thus, eol allows for an optional comment at the end of every
+          line. Note that you never want to allow end of file
+          characters in your comments.
+
+          Conventional C comments are slightly more complicated,
+          depending on your treatment of nesting. If you are not
+          interested in nesting comments, you can simply use the
+          following simple syntax:
+            comment
+              -> "/*", ~eof?..., "*/"
+
+          Note that this production works because keywords take
+          precedence over ordinary character input.
+
+          A somewhat more complex way to write this is useful:
+            comment
+              -> comment head, "*/"
+              -> comment head, comment
+            comment head
+              -> "/*"
+              -> comment head, ~eof
+
+          Note that where the previous grammar simply ignored the
+          beginning of any nested comment, this grammar recognizes a
+          nested comment, but then deliberately chooses to ignore
+          nesting.
+
+          If you want your comments to nest, you can easily rewrite
+          this grammar to allow nesting:
+
+            comment
+              -> comment head, "*/"
+            comment head
+              -> "/*"
+              -> comment head, ~eof
+              -> comment head, comment
+
+          Note that the only difference between nested and non-nested
+          comments is whether the grammar rule comment head, comment
+          reduces to comment head or to comment.
+
+          If you wish to defer the question of nesting to run-time,
+          you can use a semantically determined production to make the
+          decision:
+            comment
+              -> comment head, "*/"
+            comment head
+              -> "/*"
+              -> comment head, ~eof
+            comment, comment head
+              -> comment head, comment              =check_nesting();
+
+          Although the final production in this grammar has a somewhat
+          inscrutable air about it, a little study will show that if
+          check_nesting sets the reduction token to comment, the
+          effective grammar is the same as the grammar above for
+          non-nested comments. On the other hand, if check_nesting
+          sets the reduction token to comment head, the effective
+          grammar is the same as the grammar for nested comments.
+
+          Assuming you have a switch called nest_comments, you could
+          write check_nesting as follows:
+
+            check_nesting() {
+              if (nest_comments)
+                CHANGE_REDUCTION(comment_head);
+              else
+              CHANGE_REDUCTION(comment);
+            }
+
+          The else clause in this procedure is technically unnecessary
+          since comment is the default value of the reduction token.
+
+          If you wish to skip white space in your input, and wish to
+          consider comments as simple white space, you might want to
+          add the following production to your grammar:
+            ws
+              -> blank
+              -> comment
+          You can then use the disregard statement to ignore ws
+          appropriately.
+
+
+     Integers
+     ----------
+
+          It is quite easy to convert ordinary integers to binary
+          form. For instance:
+            (int) decimal integer
+              -> '0-9':d                                      =d-'0';
+              -> decimal integer:n, '0-9':d          =10*n + d - '0';
+
+          If necessary, you can specify that the value of decimal
+          integer be maintained as a long:
+            (long) decimal integer
+              -> '0-9':d                                      =d-'0';
+              -> decimal integer:n, '0-9':d          =10*n + d - '0';
+
+          Other representations are equally simple:
+            (long) octal integer
+              -> '0'                                              =0;
+              -> octal integer:n, '0-7':d             =8*n + d - '0';
+
+            (long) hex digit
+              -> '0-9':d                                      =d-'0';
+              -> 'A-F' + 'a-f':d                        =9 + (d & 7);
+
+            (long) hex integer
+              -> {"0x" + "0X"}, hex digit:d                       =d;
+              -> hex integer:n, hex digit:d                =16*n + d;
+
+          Note that if you use the C convention for octal integers,
+          you have to redefine decimal integer to avoid ambiguity:
+
+            (long) decimal integer
+              -> '1-9':d                                      =d-'0';
+              -> decimal integer:n, '0-9':d          =10*n + d - '0';
+
+          You can then represent the general case as follows:
+
+            (long) integer
+              -> decimal integer | octal integer | hex integer
+
+          You can allow for a signed integer with the following
+          productions:
+
+            (long) signed integer
+              -> '+'?, integer:n                                  =n;
+              -> '-', integer:n                                  =-n;
+
+          If you have included a disregard statement in your grammar
+          to skip over irrelevant white space in your input, you might
+          add the following to avoid skipping white space inside a
+          number:
+            [ lexeme {integer}]
+
+          Note that if you were to declare signed integer as a lexeme,
+          your parser would not allow space between a leading sign and
+          the integer.
+
+
+     Floating Point Numbers
+     ----------------------------
+
+          Floating point numbers are somewhat more complex than
+          integers, but not significantly so:
+
+            (double) real
+              -> simple real
+              -> integer part:r, exponent field:x        =r*pow10(x);
+              -> simple real:r, exponent field:x         =r*pow10(x);
+
+
+            (double) simple real
+              -> integer part:i, '.', fraction part:f           =i+f;
+              -> integer part, '.'
+              -> '.', fraction part:f                             =f;
+
+            (double) integer part
+              -> '0-9':d                                      =d-'0';
+              -> integer part:n, '0-9':d               =10*n + d-'0';
+
+            (double) fraction part
+              -> '0-9':d                                =(d-'0')/10.;
+              -> '0-9':d, fraction part:f             =(d-'0'+f)/10.;
+
+            exponent field
+              -> 'e' + 'E', signed exponent:x                     =x;
+
+            signed exponent
+              -> '+'?, exponent:n                                 =n;
+              -> '-', exponent:n                                 =-n;
+
+            exponent
+              -> '0-9':d                                      =d-'0';
+              -> exponent:n, '0-9':d                 =10*n + d - '0';
+
+          Note that fraction part uses right recursion rather than
+          left recursion.  exponent is effectively the same as decimal
+          integer, above, but allows for initial zeroes.
+
+          The situation becomes somewhat more complex if you wish to
+          allow both integer and floating point forms, particularly if
+          you wish to follow the C convention for octal integers.
+          First, you cannot have distinct productions for integer part
+          and decimal integer, since there is no way to distinguish
+          them until a decimal point or an exponent field is
+          encountered. Second, since 0129.3 looks for all the world
+          like an octal number until the '9' is encountered, one
+          either has to postpone all conversion calculations until the
+          issue is resolved or resort to trickery.  Here is a way to
+          resolve the problem by redefining integer part:
+            (double) integer part
+              -> confusion
+              -> octal integer:n                    =octal2decimal(n);
+              -> decimal integer:n                                 =n;
+
+            (double) confusion
+              -> octal integer:n, '8-9':d  =10*octal2decimal(n)+d-'0';
+              -> confusion:x, '0-9':d                   =10*x + d-'0';
+
+          where octal2decimal is defined thus:
+            double octal2decimal(int n) {
+              if (n) return 10*octal2decimal(n/8) + n%8;
+              else return 0;
+              }
+
+          Here confusion represents a number that starts off looking
+          like an octal integer, but then turns into a decimal number,
+          because an eight, a nine, a decimal pointer or an exponent
+          field is encountered. When this occurs, the function octal2-
+          decimal undoes the octal conversion that had already been
+          done, and redoes the conversion as decimal conversion.
+
+          If you have included a disregard statement in your grammar
+          to skip over irrelevant white space in your input, you might
+          add the following to avoid skipping white space inside a
+          real:
+            [ lexeme {real}]
+
+
+     Names
+     ---------
+
+          In almost all grammars, it is necessary to identify names.
+          To accumulate the characters that make up the name it is
+          convenient to use a stack. The reduction procedures in this
+          example use the functions ipn and pcn to accumulate the
+          characters. ipn initializes storage for the name and stores
+          the first character. pcn adds a single character to the
+          name. This grammar presumes the existence of a function
+          called identify_name which would look up the name in a
+          symbol table and return an identifying index.
+
+            letter = 'a-z' + 'A-Z' + '_'
+            digit  = '0-9'
+            (int) name
+              -> name string                        =identify_name();
+
+            name string
+              -> letter:c                                    =ins(c);
+              -> name string, letter+digit:c                 =pcn(c);
+
+            {/* embedded C to accumulate name */
+              char name[82];
+              int name_length;
+              void ipn(int c) {       /* Init and Put char to Name */
+                name[0] = c;
+                name_length = 1;
+              }
+              void pcn(int c) {                /* Put Char to Name */
+                assert(name_length < 82);
+                name[name_length++] = c;
+              }
+            } // End of embedded C
+
+          If you have included a disregard statement in your grammar
+          to skip over irrelevant white space in your input, you might
+          add the following to avoid skipping white space inside a
+          name:
+            [ lexeme {name}]
+
+
+     Names with Embedded White Space
+     ---------------------------------------------
+
+          It is sometimes convenient to allow symbol names to contain
+          embedded white space. This is easily done, although it
+          requires a bit of a trick:
+
+            name
+              -> name string, ws?...                =identify_name();
+
+            name string
+              -> letter:c                                    =ipn(c);
+              -> name string, letter+digit:c                 =pcn(c);
+              -> name string, ws..., letter+digit:c=pcn(' '), pcn(c);
+
+          Note that the last production reduces all contiguous white
+          space within a name to a single blank.
+
+          Allowing optional blanks following name string is important.
+          If you don't allow them there, any ws following a name will
+          be presumed to be embedded ws, requiring another letter or
+          digit, which is not what you intend. There are two ways to
+          accomplish this. The first, shown above, explicitly allows
+          for optional white space following name string. The second
+          method is to use the disregard and lexeme statements:
+
+            [
+              disregard ws
+              lexeme {name}
+            ]
+          If you use the disregard statement you should not include a
+          provision for white space in the production for name. Just
+          leave it as it was in the previous example.
+
+
+     Character Strings
+     ----------------------
+
+          Character strings are often required. The simplest way to
+          implement character strings is as follows:
+
+            character string
+              -> '"', ~(eof + '"')?..., '"'
+
+          This approach has the disadvantage that it makes no
+          provision for nonprinting characters.
+
+          There are numerous ways to provide for nonprinting
+          characters in your character strings. However, you can avoid
+          tedious documentation by using the same rules for
+          nonprinting characters that C uses. Unfortunately, the C
+          rules for octal and hexadecimal escape sequences complicate
+          the writing of the grammar quite substantially. For example,
+          if you wish to write a string that consists of ascii code 11
+          followed by the ascii digit '2', you must pad with a leading
+          zero, writing "\0132", since "\132" according to the rules
+          is a single three digit octal escape sequence designating
+          ascii code 90. The problem is that the rules allow for one,
+          two or three digit octal escape sequences, but sequences
+          shorter than three digits have to be followed by the end of
+          the string or a character that is not an ascii digit. There
+          is a similar, if not worse, problem with hex escape
+          sequences. There is no limit on the length of a hex escape
+          sequence, so there is no possible way to make the character
+          '2' follow a hex escape sequence without using another
+          escape sequence.
+
+          A straightforward approach to writing a grammar for strings
+          consistent with the C conventions yields a number of
+          conflicts which correspond exactly to the problems discussed
+          above. While it is certainly possible to write a grammar for
+          strings that has no conflicts, it is easier to proceed in a
+          straightforward manner and use a sticky declaration to
+          resolve the ambiguities.
+
+          Here is the complete grammar for a character string in
+          accordance with the C rules. In order to accumulate the
+          contents of the string, this grammar uses the functions ics
+          and acs, to initialize storage for a character string and to
+          append a character to that string respectively.
+
+            character string
+              -> string chars, '"'
+
+            string chars
+              -> '"'                                          =ics();
+              -> string chars, string char:c                 =acs(c);
+
+            string char
+              -> simple string char
+              -> escape sequence
+
+            simple string char = ~eof - ('"' + '\\' + '\n')
+
+            (int) escape sequence
+              -> "\\a"   ='\a';
+              -> "\\b"   ='\b';
+              -> "\\f"   ='\f';
+              -> "\\n"   ='\n';
+              -> "\\r"   ='\r';
+              -> "\\t"   ='\t';
+              -> "\\v"   ='\v';
+              -> "\\\\"  ='\\';
+              -> "\\?"   = '\?';
+              -> "\\'"   ='\'';
+              -> "\\\""  ='"';
+              -> octal escape
+              -> hex escape
+
+            (int) octal escape
+              -> one octal | two octal | three octal
+
+            (int) one octal
+              -> '\\', '0-7':d                                 =d-'0';
+
+            (int) two octal
+              -> one octal:n, '0-7':d                    =8*n + d-'0';
+
+            (int) three octal
+              -> two octal:n, '0-7':d                    =8*n + d-'0';
+
+            (int) hex escape
+              -> "\\x", hex number
+
+            (int) hex number
+              -> hex digit
+              -> hex number:n, hex digit:d                  =16*n + d;
+
+            [
+              sticky one octal, two octal, hex number
+            ]
+            {/* embedded C to define ics and acs */
+              char string_store[82];
+              int length;
+              void ics(void) {
+              length = 0;
+            }
+            void acs(int c) {
+              assert(length < 82);
+              string_store[length++] = c;
+            }
+
+          If you have included a disregard statement in your grammar
+          to skip over irrelevant white space in your input, you might
+          add the following to avoid skipping white space inside a
+          name:
+
+            [ lexeme {character string}]
+
+
+     Character Constants
+     -------------------------
+
+          It is almost trivial to extend the above syntax for
+          character strings to allow simple character constants:
+
+            (int) character constant
+              -> '\'', simple char:c, '\''                         =c;
+
+            (int) simple char
+              -> ~eof - ('\'' + '\\' + '\n')
+              -> escape sequence
+
+          The value of the character constant token is the character
+          code found in the input stream.
+
+          If you have included a disregard statement in your grammar
+          to skip over irrelevant white space in your input, you might
+          add the following to avoid skipping white space inside a
+          character constant:
+            [ lexeme {character constant}]
+
+
+     Simple Expressions
+     ------------------------
+
+          It is often convenient to allow for simple expressions in
+          your input. The syntax for expression logic is written in a
+          conventional way, using a separate token for each precedence
+          level desired. The following grammar will accept simple
+          addition, subtraction, multiplication and division of
+          floating point numbers:
+
+            (double) expression
+              -> term
+              -> expression:x, '+', term:t                       =x+t;
+              -> expression:x, '-', term:t                       =x-t;
+
+            (double) term
+              -> factor
+              -> term:t, '*', factor:f                           =t*f;
+              -> term:t, '/', factor:f                           =t/f;
+
+            (double) factor
+              -> real
+              -> '(', expression:x, ')'                            =x;
+              -> '-', factor:f                                    =-f;
+
+          An alternative way to write expression syntax is to write an
+          ambiguous grammar and use precedence declarations to resolve
+          the conflicts. This results in a slightly more compact and
+          faster parser:
+
+            (double) expression
+              -> expression:x, '+', expression:y                 =x+y;
+              -> expression:x, '-', expression:y                 =x-y;
+              -> expression:x, '*', expression:y                 =x*y;
+              -> expression:x, '/', expression:y                 =x/y;
+              -> unary minus, expression:x                        =-x;
+              -> '(', expression:x, ')'                            =x;
+              -> real
+
+              unary minus = '-'
+
+            [
+              left '+', '-'
+              left '*', '/'
+              right unary minus
+            ]
+
+          Note that we deal with the different precedence levels of
+          unary and binary minus by defining unary minus, which
+          AnaGram treats as distinct from the simple '-', and giving
+          it different precedence.
+