view doc/misc/html/examples/sbb-doc.html @ 24:a4899cdfc2d6 default tip

Obfuscate the regexps to strip off the IBM compiler's copyright banners. I don't want bots scanning github to think they're real copyright notices because that could cause real problems.
author David A. Holland
date Mon, 13 Jun 2022 00:40:23 -0400
parents 13d2b8934445
children
line wrap: on
line source

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<TITLE>Syntactic Building Blocks</TITLE>
</HEAD>


<BODY BGCOLOR="#ffffff" BACKGROUND="tilbl6h.gif"
 TEXT="#000000" LINK="#0033CC"
 VLINK="#CC0033" ALINK="#CC0099">

<P>

<IMG ALIGN="right" SRC="../images/agrsl6c.gif" ALT="AnaGram"
         WIDTH=124 HEIGHT=30 >
<BR CLEAR="all">
Back to <A HREF="../index.html">Index</A>
<P>
<IMG ALIGN="bottom" SRC="../images/rbline6j.gif" ALT="----------------------"
        WIDTH=1010 HEIGHT=2  >


<BR CLEAR="all">

<H1>Syntactic Building Blocks</H1>

<IMG ALIGN="bottom" SRC="../images/rbline6j.gif" ALT="----------------------"
        WIDTH=1010 HEIGHT=2  >
<P>
<BR>

<H2>Introduction</H2>

          <tt>examples/sbb-doc.txt</tt> contains examples of the
          productions necessary to
          parse the more common syntactic elements in your grammar. It
          should be of interest once you have begun to write your own
          grammars. Each example is provided with sample reduction
          procedures, which are adequate in the vast majority of cases.
          You can use these productions like building blocks to quickly
          assemble quite powerful grammars.
<P>
          You are reading an HTML version of <tt>sbb-doc.txt</tt>. The
          actual text file
          should be easier to copy sections from. The text of this file is
          also to be found in Appendix D of the AnaGram User's Guide.
<P>
<BR>

<H2>End of File</H2>

          If your parser is going to  use
          stream I/O to read its input, you can define end of file as
          minus one:
               <PRE>             eof = -1  </PRE>

          If your parser is going to run under Windows/DOS and will use low
          level I/O instead of stream I/O you might
          define <STRONG>eof</STRONG> as
          Control Z:
           <PRE>             eof = ^Z </PRE>
<P>
          If your parser is going to run under UNIX and will use low
          level I/O instead of stream I/O you might define
          <STRONG>eof</STRONG> as
          Control D:
           <PRE>             eof = ^D </PRE>
<P>
          If your parser is simply going to parse a string in memory,
          the definition of end of file should be a null character:
           <PRE>             eof = 0 </PRE>
<P>
          It is often convenient, however, to simply define end of
          file so it will work in all of these contexts:
           <PRE>             eof = -1 + 0 + ^D + ^Z </PRE>
<P>
<BR>

 <H2>White Space</H2>

          It is convenient to have a representation for white space in
          your grammar. Usually you do not wish to distinguish between
          space characters and tab characters, so you can write:
           <PRE>             blank = ' ' + '\t' </PRE>
<P>
          Using this definition, you can represent required white
          space of indeterminate length with
<PRE>             blank... </PRE> and optional
          white space with
<PRE>             blank?... </PRE>
<P>
          It is common, however, to include comments (see below) as
          white space. In this case, you can define the following
          productions:
<PRE>            ws
              -&gt: blank
              -&gt; comment
</PRE>
<P>
<BR>

 <H2>End of Line</H2>

          Because different systems use different representations for
          end of line, it is wise to use an abstract end of line token
          in your grammar, and define the token separately. If your
          parser is going to use files that are known to use carriage
          return alone or line feed alone as the end of line delimiter
          you can use one of the following definitions:
<PRE>            eol = '\r'                  //carriage return only
            eol = '\n'                  //line feed only
</PRE>
          If your input files use the newline character as a line
          terminator, but you wish to allow for optional carriage
          returns, you might write:
<PRE>
            eol
              -&gt; '\r'?, '\n'
</PRE>
          or even
 <PRE>
            eol
              -&gt; '\r'?..., '\n'
</PRE>
<P>
          If you wish to make a grammar that can deal with a file of
          more or less arbitrary provenance, some caution is in order.
          If you allow carriage return and line feed both to function
          as end of line characters and you allow blank lines, you may
          wind up with a very ambiguous grammar. For example, suppose
          you use <STRONG>eol...</STRONG>  somewhere in your grammar and you have
          defined <STRONG>eol</STRONG> thus:
<PRE>
            eol
              -&gt; '\r'
              -&gt; '\n'
              -&gt; '\r', '\n'                // trouble!
</PRE>
          Your grammar is ambiguous since a carriage return, line feed
          pair can be considered two line endings, according to the
          first two production, or just one, according to the third.
<P>
          If you really need to allow isolated carriage returns and
          linefeeds to mark ends of line, but need to deal with the
          pair as well, you can make good use of the idiosyncracies of
          AnaGram's keyword logic by writing the following:
<PRE>
            eol
              -&gt; '\r'
              -&gt; '\n'
              -&gt; "\r\n"
</PRE>
          This will treat a carriage return followed by a line feed as
          a single end of line.
<P>
<BR>

<H2> Comments</H2>

          There are two different categories of comments in common
          use: Those that run to the end of line, and those that run
          to a specific terminator.  The first can be conveniently
          incorporated into your end of line token as a virtual
          production:
<PRE>
            eol
              -&gt; ["//", ~(eof + '\n')?...], '\n'
</PRE>
          Thus, <STRONG>eol</STRONG> allows for an optional comment at the end of every
          line. Note that you never want to allow end of file
          characters in your comments.
<P>
          Conventional C comments are slightly more complicated,
          depending on your treatment of nesting. If you are not
          interested in nesting comments, you can simply use the
          following simple syntax:
<PRE>
            comment
              -&gt; "/*", ~eof?..., "*/"
</PRE>
          Note that this production works because keywords take
          precedence over ordinary character input.
<P>
          A somewhat more complex way to write this is useful:
<PRE>
            comment
              -&gt; comment head, "*/"
              -&gt; comment head, comment
            comment head
              -&gt; "/*"
              -&gt; comment head, ~eof
</PRE>
          Note that where the previous grammar simply ignored the
          beginning of any nested comment, this grammar recognizes a
          nested comment, but then deliberately chooses to ignore
          nesting.
<P>
          If you want your comments to nest, you can easily rewrite
          this grammar to allow nesting:
<PRE>
            comment
              -&gt; comment head, "*/"
            comment head
              -&gt; "/*"
              -&gt; comment head, ~eof
              -&gt; comment head, comment
</PRE>
          Note that the only difference between nested and non-nested
          comments is whether the grammar rule
          <STRONG>comment head, comment</STRONG>
          reduces to <STRONG>comment head</STRONG> or
          to <STRONG>comment</STRONG>.
<P>
          If you wish to defer the question of nesting to run-time,
          you can use a semantically determined production to make the
          decision:
<PRE>
            comment
              -&gt; comment head, "*/"
            comment head
              -&gt; "/*"
              -&gt; comment head, ~eof
            comment, comment head
              -&gt; comment head, comment      =check_nesting();
</PRE>
<P>
          Although the final production in this grammar has a somewhat
          inscrutable air about it, a little study will show that if
          <STRONG>check_nesting</STRONG> sets the reduction
          token to  <STRONG>comment</STRONG>, the
          effective grammar is the same as the grammar above for
          non-nested comments. On the other hand,
          if <STRONG>check_nesting</STRONG>
          sets the reduction token
          to <STRONG>comment head</STRONG>, the effective
          grammar is the same as the grammar for nested comments.
<P>
          Assuming you have a switch called
          <STRONG>nest_comments</STRONG>, you could
          write <STRONG>check_nesting</STRONG> as follows:
<PRE>
            check_nesting() {
              if (nest_comments)
                CHANGE_REDUCTION(comment_head);
              else
              CHANGE_REDUCTION(comment);
            }
</PRE>
          The else clause in this procedure is technically unnecessary
          since comment is the default value of the reduction token.
<P>
          If you wish to skip white space in your input, and wish to
          consider comments as simple white space, you might want to
          add the following production to your grammar:
<PRE>
            ws
              -&gt; blank
              -&gt; comment
</PRE>
          You can then use the <STRONG>disregard</STRONG> statement
          to ignore <STRONG>ws</STRONG>
          appropriately.
<P>
<BR>

<H2>Integers</H2>

          It is quite easy to convert ordinary integers to binary
          form. For instance:
<PRE>
            (int) decimal integer
              -&gt; '0-9':d                                 =d-'0';
              -&gt; decimal integer:n, '0-9':d     =10*n + d - '0';
</PRE>
          If necessary, you can specify that the value of <STRONG>decimal
          integer</STRONG> be maintained as a long:
<PRE>
            (long) decimal integer
              -&gt; '0-9':d                                 =d-'0';
              -&gt; decimal integer:n, '0-9':d     =10*n + d - '0';
</PRE>
          Other representations are equally simple:
<PRE>
            (long) octal integer
              -&gt; '0'                                         =0;
              -&gt; octal integer:n, '0-7':d        =8*n + d - '0';

            (long) hex digit
              -&gt; '0-9':d                                 =d-'0';
              -&gt; 'A-F' + 'a-f':d               =9 + (d &amp; 7);

            (long) hex integer
              -&gt; {"0x" | "0X"}, hex digit:d                  =d;
              -&gt; hex integer:n, hex digit:d           =16*n + d;
</PRE>
          Note that if you use the C convention for octal integers,
          you have to redefine <STRONG>decimal integer</STRONG>
          to avoid ambiguity:
<PRE>
            (long) decimal integer
              -&gt; '1-9':d                                 =d-'0';
              -&gt; decimal integer:n, '0-9':d     =10*n + d - '0';
</PRE>
          You can then represent the general case as follows:
<PRE>
            (long) integer
              -&gt; decimal integer | octal integer | hex integer
</PRE>
          You can allow for a signed integer with the following
          productions:
<PRE>
            (long) signed integer
              -&gt; '+'?, integer:n                     =n;
              -&gt; '-', integer:n                     =-n;
</PRE>
          If you have included a <STRONG>disregard</STRONG>
          statement in your grammar
          to skip over irrelevant white space in your input, you might
          add the following to avoid skipping white space inside a
          number:
<PRE>
            [ lexeme {integer}]
</PRE>
          Note that if you were to declare <STRONG>signed
          integer</STRONG> as a lexeme,
          your parser would not allow space between a leading sign and
          the integer.
<P>
<BR>

 <H2>Floating Point Numbers</H2>

          Floating point numbers are somewhat more complex than
          integers, but not significantly so:
<PRE>
            (double) real
              -&gt; simple real
              -&gt; integer part:r, exponent field:x     =r*pow10(x);
              -&gt; simple real:r, exponent field:x      =r*pow10(x);


            (double) simple real
              -&gt; integer part:i, '.', fraction part:f        =i+f;
              -&gt; integer part, '.'
              -&gt; '.', fraction part:f                          =f;

            (double) integer part
              -&gt; '0-9':d                                 =d-'0';
              -&gt; integer part:n, '0-9':d          =10*n + d-'0';

            (double) fraction part
              -&gt; '0-9':d                           =(d-'0')/10.;
              -&gt; '0-9':d, fraction part:f        =(d-'0'+f)/10.;

            exponent field
              -&gt; 'e' + 'E', signed exponent:x             =x;

            signed exponent
              -&gt; '+'?, exponent:n                         =n;
              -&gt; '-', exponent:n                         =-n;

            exponent
              -&gt; '0-9':d                              =d-'0';
              -&gt; exponent:n, '0-9':d         =10*n + d - '0';
</PRE>
          Note that <STRONG>fraction part</STRONG> uses
          right recursion rather than
          left recursion.  <STRONG>exponent</STRONG> is
          effectively the same as <STRONG>decimal
          integer</STRONG>, above, but allows for initial zeroes.
<P>
          The situation becomes somewhat more complex if you wish to
          allow both integer and floating point forms, particularly if
          you wish to follow the C convention for octal integers.
          First, you cannot have distinct productions
          for <STRONG>integer part</STRONG>
          and <STRONG>decimal integer</STRONG>, since
          there is no way to distinguish
          them until a decimal point or an exponent field is
          encountered. Second, since 0129.3 looks for all the world
          like an octal number until the '9' is encountered, one
          either has to postpone all conversion calculations until the
          issue is resolved or resort to trickery.  Here is a way to
          resolve the problem by redefining <STRONG>integer part</STRONG>:
<PRE>
            (double) integer part
              -&gt; confusion
              -&gt; octal integer:n             =octal2decimal(n);
              -&gt; decimal integer:n                          =n;

            (double) confusion
              -&gt; octal integer:n, '8-9':d  =10*octal2decimal(n)+d-'0';
              -&gt; confusion:x, '0-9':d                   =10*x + d-'0';
</PRE>
          where <STRONG>octal2decimal</STRONG> is defined thus:
<PRE>
            double octal2decimal(int n) {
              if (n) return 10*octal2decimal(n/8) + n%8;
              else return 0;
              }
</PRE>
          Here <STRONG>confusion</STRONG> represents
          a number that starts off looking
          like an octal integer, but then turns into a decimal number,
          because an eight, a nine, a decimal pointer or an exponent
          field is encountered. When this occurs, the function
         <STRONG>octal2decimal</STRONG> undoes the octal
          conversion that had already been
          done, and redoes the conversion as decimal conversion.
<P>
          If you have included a <STRONG>disregard</STRONG>
          statement in your grammar
          to skip over irrelevant white space in your input, you might
          add the following to avoid skipping white space inside a
          real:
<PRE>
            [ lexeme {real}]
</PRE>
<BR>

 <H2>Names</H2>

          In almost all grammars, it is necessary to identify names.
          To accumulate the characters that make up the name it is
          convenient to use a stack. The reduction procedures in this
          example use the functions <STRONG>ipn</STRONG> and
          <STRONG>pcn</STRONG> to accumulate the
          characters. <STRONG>ipn</STRONG> initializes storage
          for the name and stores
          the first character. <STRONG>pcn</STRONG> adds a single
          character to the
          name. This grammar presumes the existence of a function
          called <STRONG>identify_name</STRONG> which would
          look up the name in a
          symbol table and return an identifying index.
<PRE>
            letter = 'a-z' + 'A-Z' + '_'
            digit  = '0-9'
            (int) name
              -&gt; name string                 =identify_name();

            name string
              -&gt; letter:c                             =ins(c);
              -&gt; name string, letter+digit:c          =pcn(c);

            {/* embedded C to accumulate name */
              char name[82];
              int name_length;
              void ipn(int c) {       /* Init and Put char to Name */
                name[0] = c;
                name_length = 1;
              }
              void pcn(int c) {                /* Put Char to Name */
                assert(name_length &lt; 82);
                name[name_length++] = c;
              }
            } // End of embedded C
</PRE>
<P>
          If you have included a
          <STRONG>disregard</STRONG> statement in your grammar
          to skip over irrelevant white space in your input, you might
          add the following to avoid skipping white space inside a
          name:
<PRE>
            [ lexeme {name}]
</PRE>

<BR>

 <H2>Names with Embedded White Space</H2>

          It is sometimes convenient to allow symbol names to contain
          embedded white space. This is easily done, although it
          requires a bit of a trick:
<PRE>
       name
         -&gt; name string, ws?...         =identify_name();

       name string
         -&gt; letter:c                            =ipn(c);
         -&gt; name string, letter+digit:c         =pcn(c);
         -&gt; name string, ws..., letter+digit:c  =pcn(' '), pcn(c);
</PRE>
          Note that the last production reduces all contiguous white
          space within a name to a single blank.
<P>
          Allowing optional blanks following name string is important.
          If you don't allow them there, any <STRONG>ws</STRONG>
          following a name will
          be presumed to be embedded <STRONG>ws</STRONG>,
          requiring another letter or
          digit, which is not what you intend. There are two ways to
          accomplish this. The first, shown above, explicitly allows
          for optional white space following name string. The second
          method is to use the <STRONG>disregard</STRONG>
          and <STRONG>lexeme</STRONG> statements:
<PRE>
            [
              disregard ws
              lexeme {name}
            ]
</PRE>
          If you use the <STRONG>disregard</STRONG> statement you should not include a
          provision for white space in the production for name. Just
          leave it as it was in the previous example.
<P>
<BR>

 <H2>Character Strings</H2>

          Character strings are often required. The simplest way to
          implement character strings is as follows:
<PRE>
            character string
              -&gt; '"', ~(eof + '"')?..., '"'
</PRE>
          This approach has the disadvantage that it makes no
          provision for nonprinting characters.
<P>
          There are numerous ways to provide for nonprinting
          characters in your character strings. However, you can avoid
          tedious documentation by using the same rules for
          nonprinting characters that C uses. Unfortunately, the C
          rules for octal and hexadecimal escape sequences complicate
          the writing of the grammar quite substantially. For example,
          if you wish to write a string that consists of ascii code 11
          followed by the ascii digit '2', you must pad with a leading
          zero, writing "\0132", since "\132" according to the rules
          is a single three digit octal escape sequence designating
          ascii code 90. The problem is that the rules allow for one,
          two or three digit octal escape sequences, but sequences
          shorter than three digits have to be followed by the end of
          the string or a character that is not an ascii digit. There
          is a similar, if not worse, problem with hex escape
          sequences. There is no limit on the length of a hex escape
          sequence, so there is no possible way to make the character
          '2' follow a hex escape sequence without using another
          escape sequence.
<P>
          A straightforward approach to writing a grammar for strings
          consistent with the C conventions yields a number of
          conflicts which correspond exactly to the problems discussed
          above. While it is certainly possible to write a grammar for
          strings that has no conflicts, it is easier to proceed in a
          straightforward manner and use a <STRONG>sticky</STRONG>
         declaration to
          resolve the ambiguities.
<P>
          Here is the complete grammar for a character string in
          accordance with the C rules. In order to accumulate the
          contents of the string, this grammar uses the functions
         <STRONG>ics</STRONG>  and <STRONG>acs</STRONG>,
          to initialize storage for a character string and to
          append a character to that string respectively.
<PRE>
            character string
              -&gt; string chars, '"'

            string chars
              -&gt; '"'                                  =ics();
              -&gt; string chars, string char:c         =acs(c);

            string char
              -&gt; simple string char
              -&gt; escape sequence

            simple string char = ~eof - ('"' + '\\' + '\n')

            (int) escape sequence
              -&gt; "\\a"   ='\a';
              -&gt; "\\b"   ='\b';
              -&gt; "\\f"   ='\f';
              -&gt; "\\n"   ='\n';
              -&gt; "\\r"   ='\r';
              -&gt; "\\t"   ='\t';
              -&gt; "\\v"   ='\v';
              -&gt; "\\\\"  ='\\';
              -&gt; "\\?"   = '\?';
              -&gt; "\\'"   ='\'';
              -&gt; "\\\""  ='"';
              -&gt; octal escape
              -&gt; hex escape

            (int) octal escape
              -&gt; one octal | two octal | three octal

            (int) one octal
              -&gt; '\\', '0-7':d                         =d-'0';

            (int) two octal
              -&gt; one octal:n, '0-7':d            =8*n + d-'0';

            (int) three octal
              -&gt; two octal:n, '0-7':d            =8*n + d-'0';

            (int) hex escape
              -&gt; "\\x", hex number

            (int) hex number
              -&gt; hex digit
              -&gt; hex number:n, hex digit:d          =16*n + d;

            [
              sticky one octal, two octal, hex number
            ]
            {/* embedded C to define ics and acs */
              char string_store[82];
              int length;
              void ics(void) {
              length = 0;
            }
            void acs(int c) {
              assert(length &lt; 82);
              string_store[length++] = c;
            }
</PRE>
<P>
          If you have included a <STRONG>disregard</STRONG>
          statement in your grammar
          to skip over irrelevant white space in your input, you might
          add the following to avoid skipping white space inside a
          name:
<PRE>
            [ lexeme {character string}]
</PRE>

<BR>

 <H2>Character Constants</H2>

          It is almost trivial to extend the above syntax for
          character strings to allow simple character constants:
<PRE>
            (int) character constant
              -&gt; '\'', simple char:c, '\''           =c;

            (int) simple char
              -&gt; ~eof - ('\'' + '\\' + '\n')
              -&gt; escape sequence
</PRE>
          The value of the character constant token is the character
          code found in the input stream.
<P>
          If you have included a <STRONG>disregard</STRONG>
          statement in your grammar
          to skip over irrelevant white space in your input, you might
          add the following to avoid skipping white space inside a
          character constant:
<PRE>
            [ lexeme {character constant}]
</PRE>

<BR>

 <H2>Simple Expressions</H2>

          It is often convenient to allow for simple expressions in
          your input. The syntax for expression logic is written in a
          conventional way, using a separate token for each precedence
          level desired. The following grammar will accept simple
          addition, subtraction, multiplication and division of
          floating point numbers:
<PRE>
            (double) expression
              -&gt; term
              -&gt; expression:x, '+', term:t             =x+t;
              -&gt; expression:x, '-', term:t             =x-t;

            (double) term
              -&gt; factor
              -&gt; term:t, '*', factor:f                 =t*f;
              -&gt; term:t, '/', factor:f                 =t/f;

            (double) factor
              -&gt; real
              -&gt; '(', expression:x, ')'                  =x;
              -&gt; '-', factor:f                          =-f;
</PRE>
<P>
          An alternative way to write expression syntax is to write an
          ambiguous grammar and use precedence declarations to resolve
          the conflicts. This results in a slightly more compact and
          faster parser:
<PRE>
            (double) expression
              -&gt; expression:x, '+', expression:y            =x+y;
              -&gt; expression:x, '-', expression:y            =x-y;
              -&gt; expression:x, '*', expression:y            =x*y;
              -&gt; expression:x, '/', expression:y            =x/y;
              -&gt; unary minus, expression:x                   =-x;
              -&gt; '(', expression:x, ')'                       =x;
              -&gt; real

              unary minus = '-'

            [
              left '+', '-'
              left '*', '/'
              right unary minus
            ]
</PRE>
          Note that we deal with the different precedence levels of
          unary and binary minus by defining unary minus, which
          AnaGram treats as distinct from the simple '-', and giving
          it different precedence.

<P>

<IMG ALIGN="bottom" SRC="../images/rbline6j.gif" ALT="----------------------"
      WIDTH=1010 HEIGHT=2 >
<P>
<IMG ALIGN="right" SRC="../images/pslrb6d.gif" ALT="Parsifal Software"
                WIDTH=181 HEIGHT=25>
<BR CLEAR="right">
<P>
Back to <A HREF="../index.html">Index</A>
<P>
<ADDRESS><FONT SIZE="-1">
                  AnaGram parser generator - examples<BR>
                  Syntactic Building Blocks<BR>
                  Copyright &copy; 1993-1999, Parsifal Software. <BR>
                  All Rights Reserved.<BR>
</FONT></ADDRESS>

</BODY>
</HTML>