diff doc/manual/sbb.tex @ 0:13d2b8934445

Import AnaGram (near-)release tree into Mercurial.
author David A. Holland
date Sat, 22 Dec 2007 17:52:45 -0500
parents
children
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/manual/sbb.tex	Sat Dec 22 17:52:45 2007 -0500
@@ -0,0 +1,701 @@
+\chapter{Syntactic Building Blocks}
+
+This appendix contains examples of the productions necessary to parse
+the more common syntactic elements in your grammar.  Each is provided
+with sample reduction procedures, which are adequate in the vast
+majority of cases.  You can use these productions like building blocks
+to quickly assemble quite powerful grammars.  These productions can be
+also be found in the file \agfile{sbb-doc.txt} in the
+\agfile{examples} directory of your AnaGram distribution.  An
+HTML version may be found in the \agfile{html} directory.
+% XXX: s/disk//
+
+\index{End of file}
+\paragraph{End of File.}
+If your parser is going to use conventional stream I/O to read its
+input, you can define end of file as minus one:
+\begin{indentingcode}{0.4in}
+ eof = $-$1
+\end{indentingcode}
+
+If your parser is going to run under Windows/DOS and will use low level I/O
+instead of stream I/O you might define eof as Control Z:
+
+\begin{indentingcode}{0.4in}
+eof = \^{}Z
+\end{indentingcode}
+
+% XXX the following is crap
+If your parser is going to run under UNIX and will use low level I/O instead of
+stream I/O you might define eof as Control D:
+\begin{indentingcode}{0.4in}
+eof = \^{}D
+\end{indentingcode}
+
+If your parser is simply going to parse a string in memory, the
+definition of end of file should be a null character:
+
+\begin{indentingcode}{0.4in}
+eof = 0
+\end{indentingcode}
+
+It is often convenient, however, to simply define end of file so it
+will work in all of these contexts:
+
+\begin{indentingcode}{0.4in}
+eof = $-$1 + 0 + \^{}D + \^{}Z
+\end{indentingcode}
+
+\index{White space}
+\paragraph{White Space.}
+It is convenient to have a representation for white space in your
+grammar.  Usually you do not wish to distinguish between space
+characters and tab characters, so you can write:
+
+\begin{indentingcode}{0.4in}
+blank = ' ' + '{\bs}t'
+\end{indentingcode}
+
+Using this definition, you can represent required white space of
+indeterminate length with \agcode{blank...} and optional white space
+with \agcode{blank?...}
+
+It is common, however, to include comments (see below) as white space.
+In this case, you can define the following productions:
+
+\begin{indentingcode}{0.4in}
+ws
+  -> blank
+  -> comment
+\end{indentingcode}
+
+\index{End of line}
+\paragraph{End of Line.}
+
+Because different systems use different representations for end of
+line, it is wise to use an abstract end of line token in your grammar,
+and define the token separately.  If your parser is going to use files
+that are known to use carriage return alone or line feed alone as the
+end of line delimiter you can use one of the following definitions:
+
+\begin{indentingcode}{0.4in}
+eol = '{\bs}r' // carriage return only
+eol = '{\bs}n' // line feed only
+\end{indentingcode}
+
+If your input files use the newline character as a line terminator,
+but you wish to allow for optional carriage returns, you might write:
+
+\begin{indentingcode}{0.4in}
+eol
+  -> '{\bs}r'?, '{\bs}n'
+\end{indentingcode}
+
+or even
+
+\begin{indentingcode}{0.4in}
+eol
+  -> '{\bs}r'?..., '{\bs}n'
+\end{indentingcode}
+
+If you wish to make a grammar that can deal with a file of more or
+less arbitrary provenance, some caution is in order. If you allow
+carriage return and line feed both to function as end of line
+characters and you allow blank lines, you may wind up with a very
+ambiguous grammar.  For example, suppose you use \agcode{eol...}
+somewhere in your grammar and you have defined eol thus:
+
+\begin{indentingcode}{0.4in}
+eol
+  -> '{\bs}r'
+  -> '{\bs}n'
+  -> '{\bs}r', '{\bs}n'  // trouble!
+\end{indentingcode}
+
+Your grammar is ambiguous since a carriage return, line feed pair can
+be considered two line endings, according to the first two
+productions, or just one, according to the third. 
+
+If you really need to allow isolated carriage returns and linefeeds to
+mark ends of line, but need to deal with the pair as well, you can
+make good use of the idiosyncracies of AnaGram's keyword logic by
+writing the following:
+
+\begin{indentingcode}{0.4in}
+eol
+  -> '{\bs}r'
+  -> '{\bs}n'
+  -> "{\bs}r{\bs}n"
+\end{indentingcode}
+
+This will treat a carriage return followed by a line feed as a single
+end of line.
+
+\index{Comments}\index{Parsing}
+\paragraph{Comments.}
+There are two different categories of comments in common use: those
+that run to the end of line, and those that run to a specific
+terminator. The first can be conveniently incorporated into your end
+of line token as a virtual production:
+
+\begin{indentingcode}{0.4in}
+eol
+  -> ["//", \~{}(eof + '{\bs}n')?...], '{\bs}n'
+\end{indentingcode}
+
+Thus, \agcode{eol} allows for an optional comment at the end of every
+line.  Note that you never want to allow end of file characters in
+your comments.
+
+% XXX: simply/simple
+Conventional C comments are slightly more complicated, depending on
+your treatment of nesting.  If you are not interested in nesting
+comments, you can simply use the following simple syntax:
+
+\begin{indentingcode}{0.4in}
+comment
+  -> "/*", \~{}eof?..., "*/"
+\end{indentingcode}
+
+Note that this production works because keywords take precedence over
+ordinary character input.
+% XXX add: Otherwise, it would probably conflict with other uses of
+% ``*'' and ``/''.
+
+A somewhat more complex way to write this is useful:
+
+\begin{indentingcode}{0.4in}
+comment
+  -> comment head, "*/"
+  -> comment head, comment
+comment head
+  -> "/*"
+  -> comment head, \~{}eof
+\end{indentingcode}
+
+Note that where the previous grammar simply ignored the beginning of
+any nested comment, this grammar recognizes a nested comment, but then
+deliberately chooses to ignore nesting.
+% XXX add: 
+
+If you want your comments to nest, you can easily rewrite this grammar
+to allow nesting:
+
+\begin{indentingcode}{0.4in}
+comment
+  -> comment head, "*/"
+comment head
+  -> "/*"
+  -> comment head, \~{}eof
+  -> comment head, comment
+\end{indentingcode}
+
+Note that the only difference between nested and non-nested comments
+is whether the grammar rule \agcode{comment head, comment} reduces to
+\agcode{comment head} or to \agcode{comment}.
+
+If you wish to defer the question of nesting to run-time, you can
+use a semantically determined production to make the decision:
+
+\begin{indentingcode}{0.4in}
+comment
+  -> comment head, "*/"
+comment head
+  -> "/*"
+  -> comment head, \~{}eof
+comment, comment head
+  -> comment head, comment = check{\us}nesting();
+\end{indentingcode}
+
+Although the final production in this grammar has a somewhat
+inscrutable air about it, a little study will show that if
+\agcode{check{\us}nesting} sets the reduction token to \agcode{comment},
+the effective grammar is the same as the grammar above for non-nested
+comments.  On the other hand, if \agcode{check{\us}nesting} sets the
+reduction token to comment head, the effective grammar is the same as
+the grammar for nested comments.
+
+Assuming you have a switch called \agcode{nest{\us}comments}, you could
+write \agcode{check{\us}nesting} as follows:
+
+\begin{indentingcode}{0.4in}
+check{\us}nesting() \bra
+  if (nest{\us}comments)
+    CHANGE{\us}REDUCTION(comment{\us}head);
+  else
+    CHANGE{\us}REDUCTION(comment);
+\ket
+\end{indentingcode}
+% XXX add some void to that
+
+The \agcode{else} clause in this procedure is technically unnecessary
+since \agcode{comment} is the default value of the reduction token.
+
+If you wish to skip white space in your input, and wish to consider
+comments as simple white space, you might want to add the following
+production to your grammar:
+
+\begin{indentingcode}{0.4in}
+ws
+  -> blank
+  -> comment
+\end{indentingcode}
+
+You can then use the \agparam{disregard} statement to ignore
+\agcode{ws} appropriately.
+
+\index{Integers}\index{Parsing}
+\paragraph{Integers.}
+It is quite easy to convert ordinary integers to binary form. For
+instance:
+
+\begin{indentingcode}{0.4in}
+(int) decimal integer
+  -> '0-9':d = d-'0';
+  -> decimal integer:n, '0-9':d = 10*n + d-'0';
+\end{indentingcode}
+
+If necessary, you can specify that the value of decimal integer be
+maintained as a long:
+
+\begin{indentingcode}{0.4in}
+(long) decimal integer
+  -> '0-9':d = d-'0';
+  -> decimal integer:n, '0-9':d = 10*n + d-'0';
+\end{indentingcode}
+
+Other representations are equally simple:
+
+\begin{indentingcode}{0.4in}
+(long) octal integer
+  -> '0' = 0;
+  -> octal integer:n, '0-7':d = 8*n + d-'0';
+
+(long) hex digit
+  -> '0-9':d = d-'0';
+  -> 'A-F' + 'a-f':d = 9 + (d \& 7);
+
+(long) hex integer
+  -> {\bra}"0x" + "0X"{\ket}, hex digit:d = d;
+  -> hex integer:n, hex digit:d = 16*n + d;
+\end{indentingcode}
+
+% XXX: s/if/if as shown/
+Note that if you use the C convention for octal integers, you have to
+redefine decimal integer to avoid ambiguity:
+
+\begin{indentingcode}{0.4in}
+(long) decimal integer
+  -> '1-9':d = d-'0';
+  -> decimal integer:n, '0-9':d = 10*n + d-'0';
+\end{indentingcode}
+
+% XXX add: This is because 0 standing alone could be either octal or
+% decimal.
+
+You can then represent the general case as follows:
+
+\begin{indentingcode}{0.4in}
+(long) integer
+  -> decimal integer | octal integer | hex integer
+\end{indentingcode}
+
+You can allow for a signed integer with the following productions:
+
+\begin{indentingcode}{0.4in}
+(long) signed integer
+  -> '+'?, integer:n = n;
+  -> '-', integer:n = -n;
+\end{indentingcode}
+
+If you have included a disregard statement in your grammar to skip
+over irrelevant white space in your input, you might add the following
+to avoid skipping white space inside a number:
+
+\begin{indentingcode}{0.4in}
+{}[ lexeme \bra integer \ket ]
+\end{indentingcode}
+
+Note that if you were to declare signed integer as a lexeme, your
+parser would not allow space between a leading sign and the integer.
+
+% XXX. This should really at least mention dealing with integer
+% overflow.
+
+\index{Floating point numbers}
+\index{Parsing}
+\paragraph{Floating Point Numbers.}
+Floating point numbers are somewhat more complex than integers, but
+not significantly so:
+
+\begin{indentingcode}{0.4in}
+(double) real
+  -> simple real
+  -> integer part:r, exponent field:x = r*pow10(x);
+  -> simple real:r, exponent field:x = r*pow10(x);
+
+(double) simple real
+  -> integer part:i, '.', fraction part:f = i+f;
+  -> integer part, '.'
+  -> '.', fraction part:f = f;
+
+(double) integer part
+  -> '0-9':d = d-'0';
+  -> integer part:n, '0-9':d = 10*n + d-'0';
+
+(double) fraction part
+  -> '0-9':d = (d-'0')/10.;
+  -> '0-9':d, fraction part:f = (d-'0'+f)/10.;
+
+exponent field
+  -> 'e' + 'E', signed exponent:x = x;
+
+signed exponent
+  -> '+'?, exponent:n = n;
+  -> '-', exponent:n = -n;
+
+exponent
+  -> '0-9':d = d-'0';
+  -> exponent:n, '0-9':d = 10*n + d-'0';
+\end{indentingcode}
+
+% XXX this is all wet - you really need to collect it as a string and
+% then use strtod on it, or you lose precision.
+
+Note that fraction part uses right recursion rather than left
+recursion.  \agcode{exponent} is effectively the same as
+\agcode{decimal integer}, above, but allows for initial zeroes.
+
+The situation becomes somewhat more complex if you wish to allow both
+integer and floating point forms, particularly if you wish to follow
+the C convention for octal integers.  First, you cannot have distinct
+productions for integer part and decimal integer, since there is no
+way to distinguish them until a decimal point or an exponent field is
+encountered.  Second, since 0129.3 looks for all the world like an
+octal number until the ``9'' is encountered, one either has to
+postpone all conversion calculations until the issue is resolved or
+resort to trickery.  Here is a way to resolve the problem by
+redefining integer part:
+
+\begin{indentingcode}{0.4in}
+(double) integer part
+  -> confusion
+  -> octal integer:n = octal2decimal(n);
+  -> decimal integer:n = n;
+
+(double) confusion
+  -> octal integer:n, '8-9':d = 10*octal2decimal(n) + d-'0';
+  -> confusion:x, '0-9':d = 10*x + d-'0';
+\end{indentingcode}
+
+where octal2decimal is defined thus:
+
+\begin{indentingcode}{0.4in}
+	double octal2decimal(int n) \bra
+	  if (n) return 10*octal2decimal(n/8) + n\%8;
+	  else return 0;
+	\ket
+\end{indentingcode}
+
+Here confusion represents a number that starts off looking like an
+octal integer, but then turns into a decimal number, because an eight,
+a nine, a decimal pointer or an exponent field is encountered. When
+this occurs, the function octal2decimal undoes the octal conversion
+that had already been done, and redoes the conversion as decimal
+conversion.
+
+If you have included a disregard statement in your grammar to skip
+over irrelevant white space in your input, you might add the following
+to avoid skipping white space inside a real:
+
+\begin{indentingcode}{0.4in}
+{}[ lexeme \bra real \ket ]
+\end{indentingcode}
+
+\index{Names}\index{Parsing}
+\paragraph{Names.}
+In almost all grammars, it is necessary to identify names.  To
+accumulate the characters that make up the name it is convenient to
+use a stack.  The reduction procedures in this example use the
+functions \agcode{ipn} and \agcode{pcn} to accumulate the characters.
+\agcode{ipn} initializes storage for the name and stores the first
+character.  \agcode{pcn} adds a single character to the name.  This
+grammar presumes the existence of a function called
+\agcode{identify{\us}name} which would look up the name in a symbol table
+and return an identifying index.
+
+\begin{indentingcode}{0.4in}
+letter = 'a-z' + 'A-Z' + '{\us}'
+digit  = '0-9'
+
+(int) name
+  -> name string = identify{\us}name();
+
+name string
+  -> letter:c = ipn(c);
+  -> name string, letter+digit:c  = pcn(c);
+
+\bra // Embedded C to accumulate name
+  char name[82];
+  int name{\us}length;
+
+  /* Init and Put char to Name */
+  void ipn(int c) \bra
+    name[0] = c;
+    name{\us}length = 1;
+  \ket
+
+  /* Put Char to Name */
+  void pcn(int c) \bra
+    assert(name{\us}length < 82);
+    name[name{\us}length++] = c;
+  \ket
+\ket // End of embedded C
+\end{indentingcode}
+
+% XXX: can I please change the names of ipn/pcn, make the global data
+% static, and so on?
+
+If you have included a disregard statement in your grammar to skip
+over irrelevant white space in your input, you might add the following
+to avoid skipping white space inside a name:
+
+\begin{indentingcode}{0.4in}
+{}[ lexeme \bra name \ket ]
+\end{indentingcode}
+
+\index{Names with embedded white space}\index{Parsing}
+\paragraph{Names with Embedded White Space.}
+It is sometimes convenient to allow symbol names to contain embedded
+white space. This is easily done, although it requires a bit of a
+trick:
+
+\begin{indentingcode}{0.4in}
+name
+  -> name string, ws?... = identify{\us}name();
+
+name string
+  -> letter:c = ipn(c);
+  -> name string, letter+digit:c = pcn(c);
+  -> name string, ws..., letter+digit:c = pcn(' '), pcn(c);
+\end{indentingcode}
+
+Note that the last production reduces all contiguous white space
+within a name to a single blank.
+
+Allowing optional blanks following \agcode{name string} is important.
+If you don't allow them there, any \agcode{ws} following a name will
+be presumed to be embedded \agcode{ws}, requiring another
+\agcode{letter} or \agcode{digit}, which is not what you intend.
+There are two ways to accomplish this. The first, shown above,
+explicitly allows for optional white space following \agcode{name
+string}.  The second method is to use the disregard and lexeme
+statements:
+
+\begin{indentingcode}{0.4in}
+{}[
+  disregard ws
+  lexeme \bra name \ket
+]
+\end{indentingcode}
+
+If you use the \agparam{disregard} statement you should not include a
+provision for white space in the production for \agcode{name}.  Just
+leave it as it was in the previous example.
+% XXX: s/provision for/provision for trailing/
+
+\index{Character Strings}\index{Parsing}
+\paragraph{Character Strings.}
+Character strings are often required.  The simplest way to implement
+character strings is as follows:
+
+\begin{indentingcode}{0.4in}
+character string
+  -> '"', \~{}(eof + '"')?..., '"'
+\end{indentingcode}
+
+This approach has the disadvantage that it makes no provision for
+nonprinting characters.
+
+There are numerous ways to provide for nonprinting characters in your
+character strings.  However, you can avoid tedious documentation by
+using the same rules for nonprinting characters that C uses.
+Unfortunately, the C rules for octal and hexadecimal escape sequences
+complicate the writing of the grammar quite substantially.  For
+example, if you wish to write a string that consists of ASCII code 11
+followed by the ascii digit ``2'', you must pad with a leading zero,
+writing \agcode{{\bs}0132}, since \agcode{{\bs}132}
+according to the rules is a single three digit octal escape sequence
+designating ASCII code 90.  The problem is that the rules allow for
+one, two or three digit octal escape sequences, but sequences shorter
+than three digits have to be followed by the end of the string or a
+character that is not an ASCII digit.  There is a similar, if not
+worse, problem with hex escape sequences.  There is no limit on the
+length of a hex escape sequence, so there is no possible way to make
+the character ``2'' follow a hex escape sequence without using another
+escape sequence.
+
+A straightforward approach to writing a grammar for strings consistent
+with the C conventions yields a number of conflicts which correspond
+exactly to the problems discussed above. While it is certainly
+possible to write a grammar for strings that has no conflicts, it is
+easier to proceed in a straightforward manner and use a sticky
+declaration to resolve the ambiguities.
+
+Here is the complete grammar for a character string in accordance with
+the C rules.  In order to accumulate the contents of the string, this
+grammar uses the functions \agcode{ics} and \agcode{acs}, to
+initialize storage for a character string and to append a character to
+that string respectively.
+% XXX can I please change the names of ics and acs?
+
+\begin{indentingcode}{0.4in}
+character string
+  -> string chars, '"'
+
+string chars
+  -> '"' = ics();
+  -> string chars, string char:c = acs(c);
+
+string char
+ -> simple string char
+ -> escape sequence
+
+simple string char = \~{}eof - ('"' + '{\bs\bs}' + '{\bs}n')
+
+(int) escape sequence
+ -> "{\bs\bs}a" ='{\bs}a';
+ -> "{\bs\bs}b" ='{\bs}b';
+ -> "{\bs\bs}f" ='{\bs}f';
+ -> "{\bs\bs}n" ='{\bs}n';
+ -> "{\bs\bs}r" ='{\bs}r';
+ -> "{\bs\bs}t" ='{\bs}t';
+ -> "{\bs\bs}v" ='{\bs}v';
+ -> "{\bs\bs\bs\bs}" ='{\bs\bs}';
+ -> "{\bs\bs}?" = '{\bs}?';
+ -> "{\bs\bs}'" ='{\bs}'';
+ -> "{\bs\bs\bs}"" = '"';
+ -> octal escape
+ -> hex escape
+
+(int) octal escape
+ -> one octal | two octal | three octal
+
+(int) one octal
+ -> '{\bs\bs}', '0-7':d = d-'0';
+
+(int) two octal
+ -> one octal:n, '0-7':d = 8*n + d-'0';
+
+(int) three octal
+ -> two octal:n, '0-7':d = 8*n + d-'0';
+
+(int) hex escape
+ -> "{\bs\bs}x", hex number
+
+(int) hex number
+ -> hex digit
+ -> hex number:n, hex digit:d = 16*n + d;
+
+{}[
+  sticky one octal, two octal, hex number
+]
+\bra // Embedded C to define ics and acs
+  char string{\us}store[82];
+  int length;
+
+  void ics(void) \bra
+    length = 0;
+  \ket
+  void acs(int c) \bra
+    assert(length < 82);
+    string{\us}store[length++] = c;
+  \ket
+\ket // End of embedded C
+\end{indentingcode}
+
+
+If you have included a disregard statement in your grammar to skip
+over irrelevant white space in your input, you might add the following
+to avoid skipping white space inside a name:
+% XXX inside a what?
+
+\begin{indentingcode}{0.4in}
+{}[ lexeme \bra character string \ket ]
+\end{indentingcode}
+
+\index{Character constants}\index{Parsing}
+\paragraph{Character Constants.}
+It is almost trivial to extend the above syntax for character strings
+to allow simple character constants:
+
+\begin{indentingcode}{0.4in}
+(int) character constant
+  -> '{\bs}'', simple char:c, '{\bs}'' = c;
+
+(int) simple char
+ -> \~{}eof - ('{\bs}'' + '{\bs\bs}' + '{\bs}n')
+ -> escape sequence
+\end{indentingcode}
+
+The value of the character constant token is the character code found
+in the input stream.
+
+If you have included a disregard statement in your grammar to skip
+over irrelevant white space in your input, you might add the following
+to avoid skipping white space inside a character constant:
+
+\begin{indentingcode}{0.4in}
+{}[ lexeme \bra character constant \ket ]
+\end{indentingcode}
+
+\index{Expressions}\index{Parsing}
+\paragraph{Simple Expressions.}
+It is often convenient to allow for simple expressions in your input.
+The syntax for expression logic is written in a conventional way,
+using a separate token for each precedence level desired.  The
+following grammar will accept simple addition, subtraction,
+multiplication and division of floating point numbers:
+
+\begin{indentingcode}{0.4in}
+(double) expression
+  -> term
+  -> expression:x, '+', term:t = x+t;
+  -> expression:x, '-', term:t = x-t;
+
+(double) term
+  -> factor
+  -> term:t, '*', factor:f = t*f;
+  -> term:t, '/', factor:f = t/f;
+
+(double) factor
+  -> real
+  -> '(', expression:x, ')' = x;
+  -> '-', factor:f = -f;
+\end{indentingcode}
+
+An alternative way to write expression syntax is to write an ambiguous
+grammar and use precedence declarations to resolve the conflicts:
+
+\begin{indentingcode}{0.4in}
+(double) expression
+  -> expression:x, '+', expression:y = x+y;
+  -> expression:x, '-', expression:y = x-y;
+  -> expression:x, '*', expression:y = x*y;
+  -> expression:x, '/', expression:y = x/y;
+  -> unary minus, expression:x = -x;
+  -> '(', expression:x, ')' = x;
+  -> real
+
+unary minus = '-'
+{}[
+  left '+', '-'
+  left '*', '/'
+  right unary minus
+]
+\end{indentingcode}
+
+Note that we deal with the different precedence levels of unary and
+binary minus by defining \agcode{unary minus}, which AnaGram treats as
+distinct from the simple \agcode{'-'}, and giving it different
+precedence.