Mercurial > ~dholland > hg > ag > index.cgi

\chapter{Syntactic Building Blocks}

This appendix contains examples of the productions necessary to parse
the more common syntactic elements in your grammar.  Each is provided
with sample reduction procedures, which are adequate in the vast
majority of cases.  You can use these productions like building blocks
to quickly assemble quite powerful grammars.  These productions can be
also be found in the file \agfile{sbb-doc.txt} in the
\agfile{examples} directory of your AnaGram distribution.  An
HTML version may be found in the \agfile{html} directory.
% XXX: s/disk//

\index{End of file}
\paragraph{End of File.}
If your parser is going to use conventional stream I/O to read its
input, you can define end of file as minus one:
\begin{indentingcode}{0.4in}
 eof = $-$1
\end{indentingcode}

If your parser is going to run under Windows/DOS and will use low level I/O
instead of stream I/O you might define eof as Control Z:

\begin{indentingcode}{0.4in}
eof = \^{}Z
\end{indentingcode}

% XXX the following is crap
If your parser is going to run under UNIX and will use low level I/O instead of
stream I/O you might define eof as Control D:
\begin{indentingcode}{0.4in}
eof = \^{}D
\end{indentingcode}

If your parser is simply going to parse a string in memory, the
definition of end of file should be a null character:

\begin{indentingcode}{0.4in}
eof = 0
\end{indentingcode}

It is often convenient, however, to simply define end of file so it
will work in all of these contexts:

\begin{indentingcode}{0.4in}
eof = $-$1 + 0 + \^{}D + \^{}Z
\end{indentingcode}

\index{White space}
\paragraph{White Space.}
It is convenient to have a representation for white space in your
grammar.  Usually you do not wish to distinguish between space
characters and tab characters, so you can write:

\begin{indentingcode}{0.4in}
blank = ' ' + '{\bs}t'
\end{indentingcode}

Using this definition, you can represent required white space of
indeterminate length with \agcode{blank...} and optional white space
with \agcode{blank?...}

It is common, however, to include comments (see below) as white space.
In this case, you can define the following productions:

\begin{indentingcode}{0.4in}
ws
  -> blank
  -> comment
\end{indentingcode}

\index{End of line}
\paragraph{End of Line.}

Because different systems use different representations for end of
line, it is wise to use an abstract end of line token in your grammar,
and define the token separately.  If your parser is going to use files
that are known to use carriage return alone or line feed alone as the
end of line delimiter you can use one of the following definitions:

\begin{indentingcode}{0.4in}
eol = '{\bs}r' // carriage return only
eol = '{\bs}n' // line feed only
\end{indentingcode}

If your input files use the newline character as a line terminator,
but you wish to allow for optional carriage returns, you might write:

\begin{indentingcode}{0.4in}
eol
  -> '{\bs}r'?, '{\bs}n'
\end{indentingcode}

or even

\begin{indentingcode}{0.4in}
eol
  -> '{\bs}r'?..., '{\bs}n'
\end{indentingcode}

If you wish to make a grammar that can deal with a file of more or
less arbitrary provenance, some caution is in order. If you allow
carriage return and line feed both to function as end of line
characters and you allow blank lines, you may wind up with a very
ambiguous grammar.  For example, suppose you use \agcode{eol...}
somewhere in your grammar and you have defined eol thus:

\begin{indentingcode}{0.4in}
eol
  -> '{\bs}r'
  -> '{\bs}n'
  -> '{\bs}r', '{\bs}n'  // trouble!
\end{indentingcode}

Your grammar is ambiguous since a carriage return, line feed pair can
be considered two line endings, according to the first two
productions, or just one, according to the third.

If you really need to allow isolated carriage returns and linefeeds to
mark ends of line, but need to deal with the pair as well, you can
make good use of the idiosyncracies of AnaGram's keyword logic by
writing the following:

\begin{indentingcode}{0.4in}
eol
  -> '{\bs}r'
  -> '{\bs}n'
  -> "{\bs}r{\bs}n"
\end{indentingcode}

This will treat a carriage return followed by a line feed as a single
end of line.

\index{Comments}\index{Parsing}
\paragraph{Comments.}
There are two different categories of comments in common use: those
that run to the end of line, and those that run to a specific
terminator. The first can be conveniently incorporated into your end
of line token as a virtual production:

\begin{indentingcode}{0.4in}
eol
  -> ["//", \~{}(eof + '{\bs}n')?...], '{\bs}n'
\end{indentingcode}

Thus, \agcode{eol} allows for an optional comment at the end of every
line.  Note that you never want to allow end of file characters in
your comments.

% XXX: simply/simple
Conventional C comments are slightly more complicated, depending on
your treatment of nesting.  If you are not interested in nesting
comments, you can simply use the following simple syntax:

\begin{indentingcode}{0.4in}
comment
  -> "/*", \~{}eof?..., "*/"
\end{indentingcode}

Note that this production works because keywords take precedence over
ordinary character input.
% XXX add: Otherwise, it would probably conflict with other uses of
% ``*'' and ``/''.

A somewhat more complex way to write this is useful:

\begin{indentingcode}{0.4in}
comment
  -> comment head, "*/"
  -> comment head, comment
comment head
  -> "/*"
  -> comment head, \~{}eof
\end{indentingcode}

Note that where the previous grammar simply ignored the beginning of
any nested comment, this grammar recognizes a nested comment, but then
deliberately chooses to ignore nesting.
% XXX add:

If you want your comments to nest, you can easily rewrite this grammar
to allow nesting:

\begin{indentingcode}{0.4in}
comment
  -> comment head, "*/"
comment head
  -> "/*"
  -> comment head, \~{}eof
  -> comment head, comment
\end{indentingcode}

Note that the only difference between nested and non-nested comments
is whether the grammar rule \agcode{comment head, comment} reduces to
\agcode{comment head} or to \agcode{comment}.

If you wish to defer the question of nesting to run-time, you can
use a semantically determined production to make the decision:

\begin{indentingcode}{0.4in}
comment
  -> comment head, "*/"
comment head
  -> "/*"
  -> comment head, \~{}eof
comment, comment head
  -> comment head, comment = check{\us}nesting();
\end{indentingcode}

Although the final production in this grammar has a somewhat
inscrutable air about it, a little study will show that if
\agcode{check{\us}nesting} sets the reduction token to \agcode{comment},
the effective grammar is the same as the grammar above for non-nested
comments.  On the other hand, if \agcode{check{\us}nesting} sets the
reduction token to comment head, the effective grammar is the same as
the grammar for nested comments.

Assuming you have a switch called \agcode{nest{\us}comments}, you could
write \agcode{check{\us}nesting} as follows:

\begin{indentingcode}{0.4in}
check{\us}nesting() \bra
  if (nest{\us}comments)
    CHANGE{\us}REDUCTION(comment{\us}head);
  else
    CHANGE{\us}REDUCTION(comment);
\ket
\end{indentingcode}
% XXX add some void to that

The \agcode{else} clause in this procedure is technically unnecessary
since \agcode{comment} is the default value of the reduction token.

If you wish to skip white space in your input, and wish to consider
comments as simple white space, you might want to add the following
production to your grammar:

\begin{indentingcode}{0.4in}
ws
  -> blank
  -> comment
\end{indentingcode}

You can then use the \agparam{disregard} statement to ignore
\agcode{ws} appropriately.

\index{Integers}\index{Parsing}
\paragraph{Integers.}
It is quite easy to convert ordinary integers to binary form. For
instance:

\begin{indentingcode}{0.4in}
(int) decimal integer
  -> '0-9':d = d-'0';
  -> decimal integer:n, '0-9':d = 10*n + d-'0';
\end{indentingcode}

If necessary, you can specify that the value of decimal integer be
maintained as a long:

\begin{indentingcode}{0.4in}
(long) decimal integer
  -> '0-9':d = d-'0';
  -> decimal integer:n, '0-9':d = 10*n + d-'0';
\end{indentingcode}

Other representations are equally simple:

\begin{indentingcode}{0.4in}
(long) octal integer
  -> '0' = 0;
  -> octal integer:n, '0-7':d = 8*n + d-'0';

(long) hex digit
  -> '0-9':d = d-'0';
  -> 'A-F' + 'a-f':d = 9 + (d \& 7);

(long) hex integer
  -> {\bra}"0x" + "0X"{\ket}, hex digit:d = d;
  -> hex integer:n, hex digit:d = 16*n + d;
\end{indentingcode}

% XXX: s/if/if as shown/
Note that if you use the C convention for octal integers, you have to
redefine decimal integer to avoid ambiguity:

\begin{indentingcode}{0.4in}
(long) decimal integer
  -> '1-9':d = d-'0';
  -> decimal integer:n, '0-9':d = 10*n + d-'0';
\end{indentingcode}

% XXX add: This is because 0 standing alone could be either octal or
% decimal.

You can then represent the general case as follows:

\begin{indentingcode}{0.4in}
(long) integer
  -> decimal integer | octal integer | hex integer
\end{indentingcode}

You can allow for a signed integer with the following productions:

\begin{indentingcode}{0.4in}
(long) signed integer
  -> '+'?, integer:n = n;
  -> '-', integer:n = -n;
\end{indentingcode}

If you have included a disregard statement in your grammar to skip
over irrelevant white space in your input, you might add the following
to avoid skipping white space inside a number:

\begin{indentingcode}{0.4in}
{}[ lexeme \bra integer \ket ]
\end{indentingcode}

Note that if you were to declare signed integer as a lexeme, your
parser would not allow space between a leading sign and the integer.

% XXX. This should really at least mention dealing with integer
% overflow.

\index{Floating point numbers}
\index{Parsing}
\paragraph{Floating Point Numbers.}
Floating point numbers are somewhat more complex than integers, but
not significantly so:

\begin{indentingcode}{0.4in}
(double) real
  -> simple real
  -> integer part:r, exponent field:x = r*pow10(x);
  -> simple real:r, exponent field:x = r*pow10(x);

(double) simple real
  -> integer part:i, '.', fraction part:f = i+f;
  -> integer part, '.'
  -> '.', fraction part:f = f;

(double) integer part
  -> '0-9':d = d-'0';
  -> integer part:n, '0-9':d = 10*n + d-'0';

(double) fraction part
  -> '0-9':d = (d-'0')/10.;
  -> '0-9':d, fraction part:f = (d-'0'+f)/10.;

exponent field
  -> 'e' + 'E', signed exponent:x = x;

signed exponent
  -> '+'?, exponent:n = n;
  -> '-', exponent:n = -n;

exponent
  -> '0-9':d = d-'0';
  -> exponent:n, '0-9':d = 10*n + d-'0';
\end{indentingcode}

% XXX this is all wet - you really need to collect it as a string and
% then use strtod on it, or you lose precision.

Note that fraction part uses right recursion rather than left
recursion.  \agcode{exponent} is effectively the same as
\agcode{decimal integer}, above, but allows for initial zeroes.

The situation becomes somewhat more complex if you wish to allow both
integer and floating point forms, particularly if you wish to follow
the C convention for octal integers.  First, you cannot have distinct
productions for integer part and decimal integer, since there is no
way to distinguish them until a decimal point or an exponent field is
encountered.  Second, since 0129.3 looks for all the world like an
octal number until the ``9'' is encountered, one either has to
postpone all conversion calculations until the issue is resolved or
resort to trickery.  Here is a way to resolve the problem by
redefining integer part:

\begin{indentingcode}{0.4in}
(double) integer part
  -> confusion
  -> octal integer:n = octal2decimal(n);
  -> decimal integer:n = n;

(double) confusion
  -> octal integer:n, '8-9':d = 10*octal2decimal(n) + d-'0';
  -> confusion:x, '0-9':d = 10*x + d-'0';
\end{indentingcode}

where octal2decimal is defined thus:

\begin{indentingcode}{0.4in}
	double octal2decimal(int n) \bra
	  if (n) return 10*octal2decimal(n/8) + n\%8;
	  else return 0;
	\ket
\end{indentingcode}

Here confusion represents a number that starts off looking like an
octal integer, but then turns into a decimal number, because an eight,
a nine, a decimal pointer or an exponent field is encountered. When
this occurs, the function octal2decimal undoes the octal conversion
that had already been done, and redoes the conversion as decimal
conversion.

If you have included a disregard statement in your grammar to skip
over irrelevant white space in your input, you might add the following
to avoid skipping white space inside a real:

\begin{indentingcode}{0.4in}
{}[ lexeme \bra real \ket ]
\end{indentingcode}

\index{Names}\index{Parsing}
\paragraph{Names.}
In almost all grammars, it is necessary to identify names.  To
accumulate the characters that make up the name it is convenient to
use a stack.  The reduction procedures in this example use the
functions \agcode{ipn} and \agcode{pcn} to accumulate the characters.
\agcode{ipn} initializes storage for the name and stores the first
character.  \agcode{pcn} adds a single character to the name.  This
grammar presumes the existence of a function called
\agcode{identify{\us}name} which would look up the name in a symbol table
and return an identifying index.

\begin{indentingcode}{0.4in}
letter = 'a-z' + 'A-Z' + '{\us}'
digit  = '0-9'

(int) name
  -> name string = identify{\us}name();

name string
  -> letter:c = ipn(c);
  -> name string, letter+digit:c  = pcn(c);

\bra // Embedded C to accumulate name
  char name[82];
  int name{\us}length;

  /* Init and Put char to Name */
  void ipn(int c) \bra
    name[0] = c;
    name{\us}length = 1;
  \ket

  /* Put Char to Name */
  void pcn(int c) \bra
    assert(name{\us}length < 82);
    name[name{\us}length++] = c;
  \ket
\ket // End of embedded C
\end{indentingcode}

% XXX: can I please change the names of ipn/pcn, make the global data
% static, and so on?

If you have included a disregard statement in your grammar to skip
over irrelevant white space in your input, you might add the following
to avoid skipping white space inside a name:

\begin{indentingcode}{0.4in}
{}[ lexeme \bra name \ket ]
\end{indentingcode}

\index{Names with embedded white space}\index{Parsing}
\paragraph{Names with Embedded White Space.}
It is sometimes convenient to allow symbol names to contain embedded
white space. This is easily done, although it requires a bit of a
trick:

\begin{indentingcode}{0.4in}
name
  -> name string, ws?... = identify{\us}name();

name string
  -> letter:c = ipn(c);
  -> name string, letter+digit:c = pcn(c);
  -> name string, ws..., letter+digit:c = pcn(' '), pcn(c);
\end{indentingcode}

Note that the last production reduces all contiguous white space
within a name to a single blank.

Allowing optional blanks following \agcode{name string} is important.
If you don't allow them there, any \agcode{ws} following a name will
be presumed to be embedded \agcode{ws}, requiring another
\agcode{letter} or \agcode{digit}, which is not what you intend.
There are two ways to accomplish this. The first, shown above,
explicitly allows for optional white space following \agcode{name
string}.  The second method is to use the disregard and lexeme
statements:

\begin{indentingcode}{0.4in}
{}[
  disregard ws
  lexeme \bra name \ket
]
\end{indentingcode}

If you use the \agparam{disregard} statement you should not include a
provision for white space in the production for \agcode{name}.  Just
leave it as it was in the previous example.
% XXX: s/provision for/provision for trailing/

\index{Character Strings}\index{Parsing}
\paragraph{Character Strings.}
Character strings are often required.  The simplest way to implement
character strings is as follows:

\begin{indentingcode}{0.4in}
character string
  -> '"', \~{}(eof + '"')?..., '"'
\end{indentingcode}

This approach has the disadvantage that it makes no provision for
nonprinting characters.

There are numerous ways to provide for nonprinting characters in your
character strings.  However, you can avoid tedious documentation by
using the same rules for nonprinting characters that C uses.
Unfortunately, the C rules for octal and hexadecimal escape sequences
complicate the writing of the grammar quite substantially.  For
example, if you wish to write a string that consists of ASCII code 11
followed by the ascii digit ``2'', you must pad with a leading zero,
writing \agcode{{\bs}0132}, since \agcode{{\bs}132}
according to the rules is a single three digit octal escape sequence
designating ASCII code 90.  The problem is that the rules allow for
one, two or three digit octal escape sequences, but sequences shorter
than three digits have to be followed by the end of the string or a
character that is not an ASCII digit.  There is a similar, if not
worse, problem with hex escape sequences.  There is no limit on the
length of a hex escape sequence, so there is no possible way to make
the character ``2'' follow a hex escape sequence without using another
escape sequence.

A straightforward approach to writing a grammar for strings consistent
with the C conventions yields a number of conflicts which correspond
exactly to the problems discussed above. While it is certainly
possible to write a grammar for strings that has no conflicts, it is
easier to proceed in a straightforward manner and use a sticky
declaration to resolve the ambiguities.

Here is the complete grammar for a character string in accordance with
the C rules.  In order to accumulate the contents of the string, this
grammar uses the functions \agcode{ics} and \agcode{acs}, to
initialize storage for a character string and to append a character to
that string respectively.
% XXX can I please change the names of ics and acs?

\begin{indentingcode}{0.4in}
character string
  -> string chars, '"'

string chars
  -> '"' = ics();
  -> string chars, string char:c = acs(c);

string char
 -> simple string char
 -> escape sequence

simple string char = \~{}eof - ('"' + '{\bs\bs}' + '{\bs}n')

(int) escape sequence
 -> "{\bs\bs}a" ='{\bs}a';
 -> "{\bs\bs}b" ='{\bs}b';
 -> "{\bs\bs}f" ='{\bs}f';
 -> "{\bs\bs}n" ='{\bs}n';
 -> "{\bs\bs}r" ='{\bs}r';
 -> "{\bs\bs}t" ='{\bs}t';
 -> "{\bs\bs}v" ='{\bs}v';
 -> "{\bs\bs\bs\bs}" ='{\bs\bs}';
 -> "{\bs\bs}?" = '{\bs}?';
 -> "{\bs\bs}'" ='{\bs}'';
 -> "{\bs\bs\bs}"" = '"';
 -> octal escape
 -> hex escape

(int) octal escape
 -> one octal | two octal | three octal

(int) one octal
 -> '{\bs\bs}', '0-7':d = d-'0';

(int) two octal
 -> one octal:n, '0-7':d = 8*n + d-'0';

(int) three octal
 -> two octal:n, '0-7':d = 8*n + d-'0';

(int) hex escape
 -> "{\bs\bs}x", hex number

(int) hex number
 -> hex digit
 -> hex number:n, hex digit:d = 16*n + d;

{}[
  sticky one octal, two octal, hex number
]
\bra // Embedded C to define ics and acs
  char string{\us}store[82];
  int length;

  void ics(void) \bra
    length = 0;
  \ket
  void acs(int c) \bra
    assert(length < 82);
    string{\us}store[length++] = c;
  \ket
\ket // End of embedded C
\end{indentingcode}


If you have included a disregard statement in your grammar to skip
over irrelevant white space in your input, you might add the following
to avoid skipping white space inside a name:
% XXX inside a what?

\begin{indentingcode}{0.4in}
{}[ lexeme \bra character string \ket ]
\end{indentingcode}

\index{Character constants}\index{Parsing}
\paragraph{Character Constants.}
It is almost trivial to extend the above syntax for character strings
to allow simple character constants:

\begin{indentingcode}{0.4in}
(int) character constant
  -> '{\bs}'', simple char:c, '{\bs}'' = c;

(int) simple char
 -> \~{}eof - ('{\bs}'' + '{\bs\bs}' + '{\bs}n')
 -> escape sequence
\end{indentingcode}

The value of the character constant token is the character code found
in the input stream.

If you have included a disregard statement in your grammar to skip
over irrelevant white space in your input, you might add the following
to avoid skipping white space inside a character constant:

\begin{indentingcode}{0.4in}
{}[ lexeme \bra character constant \ket ]
\end{indentingcode}

\index{Expressions}\index{Parsing}
\paragraph{Simple Expressions.}
It is often convenient to allow for simple expressions in your input.
The syntax for expression logic is written in a conventional way,
using a separate token for each precedence level desired.  The
following grammar will accept simple addition, subtraction,
multiplication and division of floating point numbers:

\begin{indentingcode}{0.4in}
(double) expression
  -> term
  -> expression:x, '+', term:t = x+t;
  -> expression:x, '-', term:t = x-t;

(double) term
  -> factor
  -> term:t, '*', factor:f = t*f;
  -> term:t, '/', factor:f = t/f;

(double) factor
  -> real
  -> '(', expression:x, ')' = x;
  -> '-', factor:f = -f;
\end{indentingcode}

An alternative way to write expression syntax is to write an ambiguous
grammar and use precedence declarations to resolve the conflicts:

\begin{indentingcode}{0.4in}
(double) expression
  -> expression:x, '+', expression:y = x+y;
  -> expression:x, '-', expression:y = x-y;
  -> expression:x, '*', expression:y = x*y;
  -> expression:x, '/', expression:y = x/y;
  -> unary minus, expression:x = -x;
  -> '(', expression:x, ')' = x;
  -> real

unary minus = '-'
{}[
  left '+', '-'
  left '*', '/'
  right unary minus
]
\end{indentingcode}

Note that we deal with the different precedence levels of unary and
binary minus by defining \agcode{unary minus}, which AnaGram treats as
distinct from the simple \agcode{'-'}, and giving it different
precedence.
author	David A. Holland
date	Sun, 26 Apr 2009 17:58:26 -0400
parents	13d2b8934445
children