
\chapter{Syntax Files}
\index{Syntax file}\index{File}

Input files to AnaGram are called \agterm{syntax files}.  A syntax
file comprises a grammar and associated C or C++ code.  The grammar
consists of a number of productions along with supporting information
such as configuration sections and definitions of character sets.  The
associated code consists of reduction procedures (see \S 8.2.13) and
embedded C or C++ code (\S 8.2.17).  This chapter explains the rules
for writing syntax files acceptable to AnaGram.  The rules for
interfacing your parser to the balance of your program are given in
Chapter 9.


\section{Lexical Conventions}
\index{Lexical conventions}

\subsection{Statements}
\index{Statements}

For purposes of this manual, AnaGram statements are considered to be
productions, definition statements, configuration sections, and blocks
of embedded C or C++ code, all discussed individually below.  Each
statement must begin on a new line.  It is a good idea to separate
statements visually in your file by using blank lines freely.
There are generally no restrictions on the
\index{Statements}\index{Order of statements}order of statements
in a syntax file.  Good programming practice, however, suggests that
definitions and configuration sections should precede the grammar
itself.

\subsection{Spaces and Tabs}
\index{Spaces}\index{Tabs}

AnaGram allows spaces and tabs to be used freely to improve the
readability of grammars.  Spaces and tabs are ignored, except when
embedded in a token name, in a character set definition, or in a
keyword.  Within a token name, any sequence of spaces and tabs counts
as a single space.

\subsection{Continuation Lines}
\index{Continuation lines}

AnaGram statements normally end with a newline character or the end of
file.  If AnaGram encounters the end of a line and the statement it is
reading appears to be complete, it will not look for a continuation.
To continue a statement to another line, just make sure that what you
have on the first line is clearly incomplete.  For example,

\begin{indentingcode}{0.4in}
prep phrase -> preposition, "the", noun
\end{indentingcode}

looks complete to AnaGram, whereas

\begin{indentingcode}{0.4in}
prep phrase -> preposition, "the", noun,
\end{indentingcode}

looks incomplete because of the dangling comma at the end.

\subsection{Comments}
\index{Comments}

AnaGram accepts comments in accordance with the rules of C and C++,
that is, normal C comments bracketed with \agcode{/*} and \agcode{*/},
as well as comments which begin with \agcode{//} and continue to the
end of line.  AnaGram also observes these conventions when skipping
over embedded C code.

Since the ANSI standard for C insists that normal C comments do not
nest, AnaGram, by default, disallows nested comments.  You may,
however, set a configuration parameter,
\index{Nest comments}\index{Configuration switches}\index{Comments}
\agparam{nest comments},
to allow nested comments.  See Appendix A.  In any case, AnaGram will
use the same convention for embedded C as it uses for AnaGram proper.
You can change the convention in the middle of the file if necessary.

AnaGram treats each comment delimited with \agcode{/*} and \agcode{*/}
as though it were a single space. You can even put such comments in
the middle of token names if you should want to.  A comment that
begins with \agcode{//} is treated as though the end of line occurred
at the \agcode{//}.

\subsection{Blank Lines and Form Feeds}
\index{Blank lines}

Because blank lines and form feeds are visual separators, AnaGram will
not skip over either when looking for a continuation line. Therefore blank lines
and form feeds can occur only between AnaGram statements, not in the
middle of a statement.

It is a good idea to separate groups of productions with a blank line
or two, lest an accidental dangling comma make AnaGram think the
beginning of the next production is a continuation of the present one.


\section{Elements of Grammars}

\subsection{Names}
\index{Name}\index{Token}

You may use names to represent tokens, character sets, keywords and
\index{Virtual productions}\index{Production}virtual productions.
Names follow the same general rules as for any programming language,
with the notable exception that they may have embedded white space.
Names are made up of letters, digits, and underscores.  They may not
begin with a digit.  Any sequence of embedded spaces, tabs or comments
counts as a single space.  AnaGram distinguishes between upper and
lower case\index{Case sensitivity}, so that \agcode{Word} and
\agcode{word} are different names.  There is no particular limit to the
length of a name.  There are no reserved words as such, although
\agcode{grammar}, \agcode{eof}, and \agcode{error} will be treated as
reserved words unless you take special action by setting appropriate
configuration parameters.  The names AnaGram uses for
\index{Configuration parameters}configuration parameters
follow the same rules as for other names, except that
\index{Case sensitivity}case
is ignored.
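
The naming rules can be sketched in C.  The function below is purely
illustrative: \agcode{normalize{\us}name} is a hypothetical helper, not
part of AnaGram, and it handles only spaces and tabs, not embedded
comments.  It collapses each run of white space to a single space and
rejects a name that begins with a digit or contains any character other
than a letter, digit, or underscore.

\begin{verbatim}
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of AnaGram.  Copies the name in `in`
 * to `out`, collapsing each run of spaces and tabs to one space.
 * Returns 1 if the name is well formed and fits, 0 otherwise. */
int normalize_name(const char *in, char *out, size_t out_size)
{
    size_t n = 0;
    int pending_space = 0;

    while (*in == ' ' || *in == '\t') in++;      /* leading blanks */
    if (*in == '\0' || isdigit((unsigned char)*in)) return 0;

    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        if (c == ' ' || c == '\t') {             /* remember the gap */
            pending_space = 1;
            continue;
        }
        if (!isalnum(c) && c != '_') return 0;   /* illegal character */
        if (pending_space) {
            if (n + 1 >= out_size) return 0;
            out[n++] = ' ';
            pending_space = 0;
        }
        if (n + 1 >= out_size) return 0;
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return 1;
}
\end{verbatim}

Under these assumptions, \agcode{prep} followed by several blanks and
\agcode{phrase} normalizes to \agcode{prep phrase}, so both spellings
denote the same name.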

\subsection{Reserved Words}
\index{Reserved words}\index{Words}

% XXX shouldn't that be \index{Grammar token}?
AnaGram treats tokens with the names \index{Grammar}\agcode{grammar},
\index{Eof token}\index{Token}\agcode{eof}, and \index{Error
token}\index{Token}\agcode{error} in a special manner unless certain
measures are taken.  Since you can override AnaGram's use of these
names, they are not reserved words in the true sense.

If your grammar has a token named \agcode{grammar}, AnaGram will take
that token to be the grammar token for your grammar unless you set the
\index{Token}\index{Grammar token}\index{Configuration parameters}
\agparam{grammar token}
configuration parameter or mark some other token as the grammar token
using ``\index{ \_dol}\$''.% See below ???.

If your grammar has a token named \agcode{error} and you take no
further steps, AnaGram will assume you wish to use error token
resynchronization in case of
\index{Syntax error}\index{Errors}syntax error.  See Chapter 9.
If you wish to use some other token as an error token you
may select it using the
\index{Configuration parameters}\index{Token}\index{Error token}
\agparam{error token} configuration parameter.
If you wish to use \agcode{error} as a token name, but do not want
error token resynchronization, you may set the \agparam{error token}
configuration parameter to any name that is not used in your grammar.
You may then use \agcode{error} as a token name without causing
AnaGram to include error token resynchronization in your parser.

\index{Resynchronization}
If you select automatic resynchronization or error token
resynchronization (see Chapter 9), AnaGram will look for a token
called \agcode{eof} to use as an end of file indicator.  You may
either name your end of file token \agcode{eof} or you may set the
\agparam{eof token} configuration parameter with the name of your end
of file token.

\subsection{Variable Names}
\index{Name}\index{C variable names}

With AnaGram you can associate C/C++ variable names with the
\index{Semantic value}\index{Token}\index{Value}semantic values of
tokens for use in your \index{Reduction procedure}reduction
procedures.  Each name follows the corresponding token in the grammar
rule on the right of the production, separated from the token by a
colon.  AnaGram allows variable names made up of letters, digits, and
underscores.  They may not begin with a digit.  Embedded spaces, tabs,
or comments are not allowed.  AnaGram imposes no
restriction on length, but uses your variable names just as you have
written them in the code it generates to call reduction procedures.
Remember that your compiler may have a limit on the length of variable
names.  Also, AnaGram itself uses C variable names beginning with
\agcode{ag{\us}}.  It is therefore wise to avoid using names of this form.

\subsection{Terminal Tokens}
\index{Terminal token}\index{Token}

A \agterm{terminal token} is a token which does not appear on the left
side of a production.  It represents, therefore, a basic unit of input
to your parser.  You have several options with respect to terminal
tokens.  If the input to your parser consists of ASCII characters, you
may define terminal tokens explicitly as ASCII characters or as sets
of ASCII characters.  If you have an input procedure which produces
numeric codes, you may define the terminal tokens directly in terms of
these numeric codes.  On the other hand, you may leave the terminal
tokens completely undefined.  In this case, you must provide an input
procedure which can determine the appropriate
\index{Token}\index{Token number}\index{Number}token numbers.
It is an all or none situation.  If you provide any explicit
definitions, you must provide them for all terminal tokens.  Input
procedures and token input are discussed in Chapter 9.  Examples of
non-character input may be found in the Macro Preprocessor example in
the \agfile{examples/mpp} directory of your AnaGram
distribution.% Further examples are given in Chapter ???.

\subsection{Character Representations}
\index{Character representations}

In specifying admissible input characters you may use \index{Character
constants}character constants following the normal C conventions.
Remember that a character constant may specify only a single
character.  Although some C compilers will allow constructs such as
\agcode{'mv'}, AnaGram doesn't allow this.  AnaGram recognizes the
same escape sequences as C, including octal and hex sequences, even
though this is, strictly speaking, unnecessary.  The escape sequences
AnaGram recognizes are:

%
% It would be nice to be able to just write this and tell latex to set
% it in three columns. but no... that would be too easy.
%
%
%\begin{tabular}{ll}
%\agcode{{\bs}a}&alert (bell) character\\
%\agcode{{\bs}b}&backspace\\
%\agcode{{\bs}f}&formfeed\\
%\agcode{{\bs}n}&newline\\
%\agcode{{\bs}r}&carriage return\\
%\agcode{{\bs}t}&horizontal tab\\
%\agcode{{\bs}v}&vertical tab\\
%\agcode{{\bs\bs}}&backslash\\
%\agcode{{\bs}?}&question mark\\
%\agcode{{\bs}'}&single quote\\
%\agcode{{\bs}"}&double quote\\
%\agcode{{\bs}ooo}&octal number\\
%\agcode{{\bs}xhh}&hexadecimal number\\
%\end{tabular}

\begin{indenting}{0.4in}
\begin{tabular}{llllll}
\agcode{{\bs}a}&alert (bell) character&
\agcode{{\bs}t}&horizontal tab&
\agcode{{\bs}'}&single quote\\
\agcode{{\bs}b}&backspace&
\agcode{{\bs}v}&vertical tab&
\agcode{{\bs}"}&double quote\\
\agcode{{\bs}f}&formfeed&
\agcode{{\bs\bs}}&backslash&
\agcode{{\bs}\textit{ooo}}&octal number\\
\agcode{{\bs}n}&newline&
\agcode{{\bs}?}&question mark&
\agcode{{\bs}x\textit{hh}}&hexadecimal number\\
\agcode{{\bs}r}&carriage return\\
\end{tabular}
\end{indenting}
\bigskip

The octal escape sequence allows up to three octal digits, in
accordance with ANSI specifications for C.  The hexadecimal numbers
may contain an arbitrary number of digits; however, AnaGram will
truncate the result to sixteen bits.

A backslash followed by any character other than those listed above
will cause a syntax error.
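
The digit limits for numeric escapes can be sketched as follows.  Both
functions are hypothetical illustrations rather than AnaGram's own
code, and the sketch assumes that truncation means keeping the
low-order sixteen bits.

\begin{verbatim}
#include <ctype.h>

/* Hypothetical helper: s points just past the "\x" of a hex escape.
 * Consumes every hex digit and keeps only the low sixteen bits. */
unsigned hex_escape_value(const char *s)
{
    unsigned value = 0;
    for (; isxdigit((unsigned char)*s); s++) {
        unsigned digit = isdigit((unsigned char)*s)
            ? (unsigned)(*s - '0')
            : (unsigned)(tolower((unsigned char)*s) - 'a' + 10);
        value = ((value << 4) | digit) & 0xFFFFu;
    }
    return value;
}

/* Hypothetical helper: s points just past the backslash of an octal
 * escape.  Reads at most three octal digits. */
unsigned octal_escape_value(const char *s)
{
    unsigned value = 0;
    int i;
    for (i = 0; i < 3 && *s >= '0' && *s <= '7'; i++, s++)
        value = value * 8 + (unsigned)(*s - '0');
    return value;
}
\end{verbatim}

With this reading, a hex escape with the digits \agcode{12345} yields
\agcode{0x2345}, and an octal escape stops after its third digit.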

You may also represent characters by writing the numeric code
explicitly, in decimal, octal, or hexadecimal representations.
AnaGram follows the C conventions for integer constants: a leading
\agcode{0} means the number is octal, a leading \agcode{0x} or
\agcode{0X} means it is hexadecimal. The hex digits \agcode{a-f} may
be either upper or lower case\index{Case sensitivity}.  Numbers may be
preceded by an optional minus sign.
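
These are the same conventions that the standard C library function
\agcode{strtol} applies when its base argument is zero, so a short
wrapper is enough to show how a given numeric representation would be
read.  The wrapper name \agcode{char{\us}code} is hypothetical, and the
sketch illustrates the C conventions only, not AnaGram's
implementation.

\begin{verbatim}
#include <stdlib.h>

/* Illustrative only: with base 0, strtol accepts an optional sign,
 * treats a leading 0x or 0X as hexadecimal, a leading 0 as octal,
 * and anything else as decimal. */
long char_code(const char *text)
{
    return strtol(text, NULL, 0);
}
\end{verbatim}

For instance, \agcode{077}, \agcode{0x3f}, and \agcode{63} all yield
the same code.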

If your parser uses a pre-existing \index{Lexical scanner}lexical
scanner and you wish to use the code numbers it generates to identify
tokens, you may simply treat those code numbers as character numbers.
You may use the numbers directly in your productions, or you may use
definition statements to name them.  You may also use an
\agparam{enum} statement within a configuration section to attach
names to the code numbers.
% XXX shouldn't this use of enum be indexed?

AnaGram also allows a special notation for control characters.  You
may represent a control character by using the ``\^{}'' character
preceding any printing ASCII character. Thus you can write
\agcode{\^{}z} or \agcode{\^{}Z} to represent the DOS end-of-file
character.  Notice that quotation marks are not necessary.
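
A minimal sketch of the usual caret convention, assuming ASCII: the
control code is obtained by keeping the low five bits of the character
that follows the caret.  The function name is hypothetical.  This
manual confirms the mapping only for letters (\agcode{\^{}C} is 3 and
\agcode{\^{}Z} is 26); how AnaGram treats other printing characters is
not specified here.

\begin{verbatim}
/* Assumption: the conventional ASCII mapping.  Upper and lower case
 * letters give the same result because they differ only in a bit
 * that the mask discards. */
int control_code(char c)
{
    return c & 0x1F;
}
\end{verbatim}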

Examples of character representations:

\begin{indenting}{0.4in}
\begin{tabular}{cccc}
\agcode{'K'}&\agcode{-1}&\agcode{0}&\agcode{'{\bs}t'}\\
\agcode{\^{}J}&\agcode{'{\bs}xff'}&\agcode{077}&\agcode{0XF3}\\
\end{tabular}
\end{indenting}

\subsection{Character Ranges}
\index{Character range}\index{Range}

It is convenient to be able to specify ranges of characters when
writing a grammar.  AnaGram supports several ways of representing
ranges of characters.  The first is an extension of the notation for
character constants: \agcode{'a-z'} is the set of lower case
characters.  You can even use escape sequences such as
\agcode{'{\bs}n-{\bs}r'} if you like.  The order of
characters used to specify the range is immaterial: \agcode{'z-a'} is
the same as \agcode{'a-z'}.  AnaGram will, however, issue a warning
just in case the unusual order results from a clerical error.

The second way to specify a range is by using two arbitrary character
representations, as described above, separated by two dots.  For
example, \agcode{\^{}C..\^{}Z}, \agcode{3..26}, \agcode{3..032},
\agcode{3..0x1a}, and \agcode{\^{}C..0x1a}, all represent the same
range of characters.  Similarly, \agcode{'A-F'}, \agcode{'A'..'F'},
\agcode{0101..0106}, \agcode{0x41..0x46}, \agcode{65..70}, and
\agcode{65..'F'} all represent the same range of characters.

\subsection{Character Sets}
\index{Character sets}

If you provide explicit definitions for terminal tokens, the basic
input unit for your parser will be considered a character set, even if
your input procedure provides numeric codes that are not actually
characters.  As a terminal token, a character set will be matched by
any input character that is a member of the set.  Character sets may
be named in definition statements, but they may also appear on the
right sides of productions without being named.

A character set may consist of one or more characters.  You can
specify a character set that consists of a single character by using
any of the character representation methods described above.  You can
specify a set consisting of a range of characters by using any of the
representations of character ranges described above.
\index{Character sets}
To specify more complicated sets, you can write
\index{Expressions}\index{Set expressions}expressions
using conventional set theoretic operations.
In AnaGram input, these operations are specified as follows:

\index{Union}\index{Difference}\index{Intersection}\index{Complement}
\begin{indenting}{0.4in}
\begin{tabular}{cl}
\agcode{A + B}&(union)\\
\agcode{A - B}&(difference)\\
\agcode{A \& B}&(intersection)\\
\agcode{\~{}A}&(complement)\\
\end{tabular}
\end{indenting}

where \agcode{A} and \agcode{B} are arbitrary sets.  Union and
difference have the same precedence.  Intersection has higher
precedence and complement has the highest precedence.  Thus in the
expression

\begin{indentingcode}{0.4in}
A + \~{}B\&C
\end{indentingcode}

the complement operation is performed first, then the intersection,
and finally the union.

Watch out!  In an AnaGram syntax file \agcode{65 + 97} represents the
character set which consists of lower case \agcode{a} and upper case
\agcode{A}.  It does not represent 162, the sum of 65 and 97.

Parentheses may be used to force the order of evaluation:

\begin{indentingcode}{0.4in}
\~{}(A \& (B+C))
\end{indentingcode}

In this example the union of \agcode{B} and \agcode{C} is calculated,
then the intersection of this set with \agcode{A} is calculated, and
finally the complement is evaluated.
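
To make the operator semantics concrete, here is a small C sketch of
the four operations on sets of character codes.  All names are
hypothetical and the universe is fixed at the codes 0 through 255; it
is meant only to illustrate the set algebra, not to describe AnaGram's
internal representation.

\begin{verbatim}
#include <string.h>

#define UNIVERSE 256      /* assumed universe: codes 0..255 */

typedef struct { unsigned char member[UNIVERSE]; } CharSet;

/* All codes between a and b inclusive, given in either order.
 * Both codes must lie in 0..255. */
CharSet set_range(int a, int b)
{
    CharSet s;
    int lo = a < b ? a : b, hi = a < b ? b : a, c;
    memset(&s, 0, sizeof s);
    for (c = lo; c <= hi; c++) s.member[c] = 1;
    return s;
}

CharSet set_union(CharSet a, CharSet b)          /* A + B */
{
    int c;
    for (c = 0; c < UNIVERSE; c++)
        a.member[c] = a.member[c] || b.member[c];
    return a;
}

CharSet set_difference(CharSet a, CharSet b)     /* A - B */
{
    int c;
    for (c = 0; c < UNIVERSE; c++)
        a.member[c] = a.member[c] && !b.member[c];
    return a;
}

CharSet set_intersection(CharSet a, CharSet b)   /* A & B */
{
    int c;
    for (c = 0; c < UNIVERSE; c++)
        a.member[c] = a.member[c] && b.member[c];
    return a;
}

CharSet set_complement(CharSet a)                /* ~A */
{
    int c;
    for (c = 0; c < UNIVERSE; c++) a.member[c] = !a.member[c];
    return a;
}

int set_count(CharSet a)                         /* number of members */
{
    int c, n = 0;
    for (c = 0; c < UNIVERSE; c++) n += a.member[c];
    return n;
}
\end{verbatim}

With \agcode{A} as \agcode{'a-c'}, \agcode{B} as \agcode{'0-7'}, and
\agcode{C} as \agcode{'0-9'}, the expression \agcode{A + \~{}B\&C}
corresponds to taking the complement of \agcode{B}, intersecting it
with \agcode{C}, and forming the union with \agcode{A}.  The result
holds five codes: the letters \agcode{a}, \agcode{b}, \agcode{c} and
the digits \agcode{8} and \agcode{9}.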

The computation of the \index{Complement}complement of a
\index{Character sets}set requires a definition of the
\index{Universe}universe of set elements.  AnaGram will define the
universe to be the set of unsigned 8-bit characters, unless one or
more characters outside that range have been specified.  In that case,
the universe will consist of all characters on the interval defined by
the lesser of zero and the lowest character code used and the greater
of 255 and the highest character code used.  The complement of a
character set is everything in this universe except the characters in
the set.
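
The rule for the bounds of the universe can be written out directly.
In this sketch the type and function names are hypothetical, and
\agcode{lowest} and \agcode{highest} stand for the smallest and largest
character codes that appear anywhere in the grammar.

\begin{verbatim}
typedef struct { int low; int high; } Universe;

/* Bounds of the character universe: the lesser of zero and the lowest
 * code used, and the greater of 255 and the highest code used. */
Universe character_universe(int lowest, int highest)
{
    Universe u;
    u.low  = lowest  < 0   ? lowest  : 0;
    u.high = highest > 255 ? highest : 255;
    return u;
}
\end{verbatim}

A grammar that uses \agcode{-1} as an end of file indicator and nothing
above 126 would therefore have a universe running from $-1$ to 255.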

Characters which make up part of the character universe, but are not
legitimate input according to your grammar, are lumped together into a
special token which will cause an error if it occurs in your input.

When your parser reads an input character, it uses that character to
index a conversion table in order to determine the appropriate
\index{Token number}\index{Token}\index{Number}token number.  If the
\index{Range}\index{Test range}\index{Configuration switches}
\agparam{test range} configuration switch
is on, its default setting, your parser will include code to verify
that the character is in bounds before it indexes the conversion
table.  If you are satisfied that checking bounds is unnecessary, you
may turn the \agparam{test range} switch off and get a slightly higher
level of performance from your parser.
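
The trade-off can be pictured with the following sketch.  It is not the
code AnaGram generates: the table, the bounds, and the function names
are all hypothetical, and mapping an out-of-range character to an error
token is an assumption made only so that the example is complete.

\begin{verbatim}
#define FIRST_CODE  (-1)   /* hypothetical lowest code in the universe  */
#define LAST_CODE   255    /* hypothetical highest code in the universe */
#define ERROR_TOKEN 0      /* hypothetical token number for bad input   */

/* Conversion table: one token number per code in the universe. */
static unsigned char token_table[LAST_CODE - FIRST_CODE + 1];

/* With a range test: an out-of-bounds code never indexes the table. */
int token_for_checked(int code)
{
    if (code < FIRST_CODE || code > LAST_CODE) return ERROR_TOKEN;
    return token_table[code - FIRST_CODE];
}

/* Without a range test: one comparison pair cheaper, but the caller
 * must guarantee that code lies within the universe. */
int token_for_unchecked(int code)
{
    return token_table[code - FIRST_CODE];
}
\end{verbatim}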

For efficient processing, it is best to keep the number of tokens to a
minimum.  Therefore if you have a choice between defining a construct
as a token, with a production, or a set, with a definition, the set is
to be preferred.

Some useful character sets are:

\begin{indenting}{0.4in}
\begin{tabular}{ll}
\agcode{'a-z' + 'A-Z'}&Alphabetic characters\\
\agcode{'a-f' + 'A-F'}&Hex digits\\
\agcode{'0-9'}&Decimal digits\\
\agcode{0..127}&ASCII character set\\
\agcode{32..126}&Printing ASCII characters\\
\agcode{\~{}'{\bs}n'}&Anything but newline\\
\agcode{\^{}Z}&Windows/DOS end of file indicator\\
\agcode{-1}&Stream I/O end of file indicator\\
\agcode{0}&String terminator\\
\agcode{32..126 - 'a-z' - 'A-Z' - '0-9' - ' '}&Punctuation\\
\end{tabular}
\end{indenting}
\bigskip

Note that \agcode{'a-z'} is a range of characters but
\agcode{32..126 - 'a-z'} is a set difference.

When AnaGram encounters a character set in a grammar rule, it assigns
a token number to the character set.  If it has previously seen the
same character set it will assign the same token number; however, it
assigns the same token number only if the set expressions are
obviously the same.  Thus, AnaGram will assign the same token number
every time it sees \agcode{A + B}, but will assign a different token
number if it sees \agcode{B + A}.  Only when AnaGram has finished
scanning the entire syntax file can it actually evaluate the character
sets.  If it finds that several different tokens all refer to the same
character set, it will create a single token that represents the true
character set and create
\index{Shell productions}\index{Production}``shell productions'' for
the others.

\index{Character sets}If the character sets you use in your grammar
overlap, they do not properly represent
\index{Terminal token}\index{Token}terminal tokens.
To deal with this situation, AnaGram identifies all overlaps among
character sets and extends your grammar by adding a number of extra
productions.  For instance, suppose your grammar uses the following
character sets as though they were terminal tokens:

\begin{indentingcode}{0.4in}
'a-z' + 'A-Z'
'0-9'
'0-7'
'a-f' + 'A-F'
\end{indentingcode}

AnaGram will then modify your grammar by adding the following productions:

\begin{indentingcode}{0.4in}
'a-z' + 'A-Z'
  -> 'a-f' + 'A-F' | 'g-z' + 'G-Z'

'0-9'
  -> '0-7' | '8-9'
\end{indentingcode}

Although the tokens \agcode{'a-z' + 'A-Z'} and \agcode{'0-9'} are
technically now
\index{Nonterminal token}\index{Token}nonterminal tokens,
for purposes of determining the
\index{Token}\index{Data type}data type of their
\index{Semantic value}\index{Token}\index{Value}semantic values,
AnaGram continues to regard them as terminal tokens.

This \index{Partition}\index{Universe}\index{Character universe}
``partitioning'' of the character universe is described in Chapter 6.
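
One common way to compute such a partition, shown here only as a sketch
and not necessarily the method AnaGram uses, is to give every character
code a signature recording which sets contain it.  Codes with the same
signature fall into the same class.  All names below are hypothetical,
and the sketch is limited to eight sets and the codes 0 through 255.

\begin{verbatim}
#include <string.h>

#define NCODES   256
#define MAX_SETS 8      /* one signature bit per set */

/* sets[i][c] is nonzero when code c belongs to set i.  On return,
 * class_of[c] is 0 for a code that is in no set; otherwise it is a
 * positive class number shared by exactly those codes that belong to
 * the same collection of sets.  Returns the number of classes. */
int partition(unsigned char sets[][NCODES], int nsets,
              int class_of[NCODES])
{
    unsigned char class_for[1 << MAX_SETS];  /* signature -> class */
    int c, i, nclasses = 0;

    memset(class_for, 0, sizeof class_for);
    for (c = 0; c < NCODES; c++) {
        unsigned sig = 0;
        for (i = 0; i < nsets && i < MAX_SETS; i++)
            if (sets[i][c]) sig |= 1u << i;
        if (sig == 0) {                      /* not used by the grammar */
            class_of[c] = 0;
            continue;
        }
        if (class_for[sig] == 0)             /* first code of its kind */
            class_for[sig] = (unsigned char)++nclasses;
        class_of[c] = class_for[sig];
    }
    return nclasses;
}
\end{verbatim}

Applied to the four sets in the example above, this yields four
classes: the letters \agcode{a} through \agcode{f} in either case, the
remaining letters, the digits \agcode{0} through \agcode{7}, and the
digits \agcode{8} and \agcode{9}.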

\subsection{Keyword Strings}
\index{Keywords}

In your grammar, AnaGram recognizes character strings within double
quotes (e.g., \agcode{"IF"}) as keywords.  The strings follow the same
syntactic rules as strings in C.  The same escape sequences are
honored.  AnaGram does not, however, allow for the concatenation of
adjacent strings.  Note that AnaGram strings are used only for the
definition of keywords in your grammar, not for messages to be
displayed or printed.

Keyword strings may not include null characters and must be at least
one character long.  You may have any number of keywords.  Each is
treated as a single terminal token.  A keyword may be given a name by
means of a definition statement.  Keywords may appear in virtual
productions.

AnaGram's keyword recognition works in the following way.  First, for
each state in your parser, AnaGram prepares a list of all the keywords
that are admissible in that state.  Your parser will recognize a
keyword \emph{only} if it is in an appropriate state; otherwise it
will appear to be an anonymous sequence of characters.  Your parser,
in any state, checks for keywords it expects before it checks for
acceptable characters.  That is, \emph{keywords take precedence} over
simple characters.  It does not look for keywords that would not be
acceptable input.  The parser will do whatever lookahead is necessary
in order to pick up the entire keyword.  Thus if the character
\agcode{I} and the keyword \agcode{IF} are both legitimate input at
some point, \agcode{IF} will be recognized, if present, in preference
to \agcode{I}.  If several admissible keywords match the input, such
as \agcode{IF} and \agcode{IFF}, the parser will select the longest
match, \agcode{IFF} in this example.
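
The longest-match rule can be sketched as follows.  The function is a
hypothetical illustration, not the recognizer AnaGram generates; the
array \agcode{admissible} stands for the list of keywords that are
acceptable in the current parser state.

\begin{verbatim}
#include <string.h>

/* Returns the longest keyword among admissible[0..n-1] that matches
 * the start of input, or NULL if none matches.  Keywords that are not
 * in the list are never considered, whatever the input contains. */
const char *match_keyword(const char *input,
                          const char *admissible[], int n)
{
    const char *best = NULL;
    size_t best_len = 0;
    int i;

    for (i = 0; i < n; i++) {
        size_t len = strlen(admissible[i]);
        if (len > best_len && strncmp(input, admissible[i], len) == 0) {
            best = admissible[i];
            best_len = len;
        }
    }
    return best;
}
\end{verbatim}

If the function returns \agcode{NULL}, a parser built along these lines
would fall back to treating the next input character as a simple
character.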

AnaGram does not incorporate keywords into its character sets.
Keywords stand apart and should not appear in definitions of character
sets.  In particular, they are not considered as belonging to the
complement of a character set.  Thus for the production

\begin{indentingcode}{0.4in}
next char -> \~{}('{\bs}n' + \^{}Z)
\end{indentingcode}
a keyword would not be considered legitimate input.

Note also that a keyword consisting of a single character does not
belong to the character universe.  Because of this fact, AnaGram's
treatment of \agcode{'X'} and \agcode{"X"} is very different.  If this
seems confusing at first, try using only keywords which are at least
two characters long until you have some experience with them.

AnaGram's keyword recognition logic normally does not make any
assumptions about what precedes or follows a keyword.  Thus if
\agcode{int} is a keyword, your parser will be capable of plucking it
out of a string of characters such as \agcode{disintegrate} if,
according to your grammar, it could follow \agcode{dis}.  The
\agparam{sticky} declaration and the \agparam{distinguish keywords}
statement, described below, can prevent such unwanted recognition of
keywords.  A keyword following a \agparam{sticky} token will not be
recognized if the first character of the keyword can be shifted in as
part of the \agparam{sticky} token.  The \agparam{distinguish
keywords} statement prevents recognition of a keyword if it is
followed immediately by a character of the sort that makes up the
keyword.
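
The effect of distinguishing keywords can be sketched with a small
check on the character that follows a candidate match.  The function
name is hypothetical, and the sketch assumes that the characters that
make up keywords are letters and digits.

\begin{verbatim}
#include <ctype.h>
#include <string.h>

/* Returns 1 if keyword matches at the start of input and is not
 * followed immediately by a letter or digit, 0 otherwise. */
int keyword_at(const char *input, const char *keyword)
{
    size_t len = strlen(keyword);
    if (strncmp(input, keyword, len) != 0) return 0;
    /* input[len] is safe to read: it is either the terminating
     * null character or the character just after the match. */
    return !isalnum((unsigned char)input[len]);
}
\end{verbatim}

With this check, \agcode{int} is not recognized at the start of
\agcode{integrate}, because the following \agcode{e} could itself be
part of a keyword.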

\subsection{Type Specifications For Tokens}
\index{Token}\index{Token type}\index{Type declarations}

When you write productions or token declarations (see below), AnaGram
allows you to specify the data type\index{Token}\index{Data type} of
the \index{Semantic value}\index{Token}\index{Value}semantic value of
a token by using a C or C++ data type specification.  The restrictions
are that AnaGram does not allow specification of array or function
types, nor explicit structure types.  Types that are defined with
typedef statements, structure definitions, or class definitions,
including template classes, in your embedded C or C++ are acceptable.
Thus the following specifications, for example, are acceptable:

\begin{indentingcode}{0.4in}
void
int
char *
unsigned long *near
static float *far
my{\us}type
double *
struct descriptor
struct widget *
vector <double> *
\end{indentingcode}

On the other hand, the following specifications are \emph{not} valid:

\begin{indentingcode}{0.4in}
int[20]
int *(int, unsigned char)
\bra int x,y; float z; \ket
struct \bra int k; float z; \ket
\end{indentingcode}

Note that AnaGram itself does nothing with the type specifications. It
simply passes them on to your compiler as appropriate.

\subsection{Productions}
\index{Production}

Productions are the basic units of a grammar.  A production consists
of a left side and a right side.  \index{Left side}The left side of a
production consists of one or more token names, joined by commas,
optionally preceded by a type specification enclosed in parentheses.
\index{Right side}The right side begins with an arrow and may either
begin on the same line as the left side or on a new line.  For
example:

\begin{indentingcode}{0.4in}
program -> statement list, eof
expression
  -> expression, plus, term

(int) variable name, function name
  -> name:n = look{\us}up(n);
\end{indentingcode}

The part of the right side of a production following the arrow is
called a \index{Grammar rule}\index{Rule}\agterm{grammar rule},
discussed below.  A production need not have a right side at all.  In
this case, it is simply called a
\index{Declaration}\index{Token}\agterm{token declaration}.
AnaGram assigns
\index{Token number}\index{Token}\index{Number}token numbers
to the token names on the left side, and, if there is a type
specification, records the data type for each of the tokens declared.
Declarations of this sort are most useful when using input from a
\index{Lexical scanner}lexical scanner.  See Chapter 9 for a discussion
of techniques for interfacing a lexical scanner to your parser.  If
you do not intend to use a lexical scanner you will have no need for
token declarations.

If you do not explicitly specify the type for the
\index{Semantic value}\index{Token}\index{Value}semantic value
of a token, it will be determined by the configuration parameter
\index{Default token type}\index{Configuration parameters}\index{Token}
\agparam{default token type}
if it is a \index{Nonterminal token}\index{Token}nonterminal token or
by the \index{Configuration parameters}configuration parameter
\index{Input token type}\index{Default input type}\agparam{default input type}
if it is a \index{Token}terminal token.
\agparam{Default token type} defaults to \agcode{void}.
\agparam{Default input type} defaults to \agcode{int}.

If a production has more than one token on the left side, as in the
third example above, it is called a
\index{Semantically determined production}\index{Production}
\agterm{semantically determined production}.  Semantically determined
productions are a useful tool for exerting semantic control over
syntactic analysis.  A semantically determined production should have
a reduction procedure which determines on a case by case basis which
of the tokens on the left side should be taken as the reduction token.
If there is no reduction procedure, or if the reduction procedure does
not make a choice, the reduction token will be the first syntactically
correct token on the left side of the production.  In the example
above, \agcode{variable name} will be the reduction token unless
\agcode{look{\us}up} changes it to \agcode{function name}.  Semantically
determined productions are discussed more fully in Chapter 9.

If several productions have the same left side, it does not need to be
repeated.  Subsequent right hand sides must each start on a new
line.  For example:

\begin{indentingcode}{0.4in}
integer
  -> digit
  -> integer, digit

name
  -> letter
  -> name, letter
  -> name, digit
\end{indentingcode}

On the other hand, you do not have to group productions with the same
left side.  You could write the above productions as follows, although
it would certainly not be good programming practice:

\begin{indentingcode}{0.4in}
name -> name, digit
integer -> integer, digit
name -> name, letter
integer -> digit
name -> letter
\end{indentingcode}

Nevertheless, there are a few occasions involving complex cross
recursions and semantically determined productions where it is not
possible to group productions neatly.

The right side of a production can be empty.  Such a production is
called a
\index{Null productions}\index{Production}\agterm{null production}.
Null productions are useful to denote an optional element in a
grammar, or a list that may be empty.  For example:

\begin{indentingcode}{0.4in}
optional widget
  ->
  -> widget

optional qualifiers
  ->
  -> optional qualifiers, qualifier
\end{indentingcode}

A second way to write multiple productions with the same left side
uses the \index{Vertical bar}\index{|}vertical bar character, ``$|$'',
to separate the grammar rules.  The productions given above for
\agcode{name}, \agcode{optional widget}, and \agcode{optional
qualifiers} can also be written:

\begin{indentingcode}{0.4in}
name -> letter | name, letter | name, digit
optional widget
  -> | widget

optional qualifiers
  -> | optional qualifiers, qualifier
\end{indentingcode}

Note that a null production cannot \emph{follow} a vertical bar.

A token that has a null production is called a
\index{Zero length token}\index{Token}\agterm{zero length token},
since it can be represented by an empty sequence of input characters,
that is to say, by nothing at all.  Furthermore, even if a token
doesn't have any null productions, if it has at least one rule
consisting entirely of zero length tokens it is also a zero length
token.  In the Token Table window, AnaGram notes which tokens are zero
length, because they can be a source of conflicts.
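
The definition above is recursive, and it can be evaluated by repeating
a pass over the rules until nothing changes.  The following C sketch
shows this standard technique; the data layout and names are
hypothetical and are not taken from AnaGram.

\begin{verbatim}
#define MAX_RHS 8

typedef struct {
    int lhs;            /* token number on the left side            */
    int length;         /* number of rule elements; 0 for a null    */
                        /* production                               */
    int rhs[MAX_RHS];   /* token numbers of the rule elements       */
} Rule;

/* Sets zero_length[t] to 1 for each token t, numbered from 0 to
 * ntokens - 1, that can be matched by empty input, and to 0 for
 * every other token. */
void find_zero_length(const Rule rules[], int nrules,
                      int zero_length[], int ntokens)
{
    int changed = 1, r, k;

    for (k = 0; k < ntokens; k++) zero_length[k] = 0;
    while (changed) {
        changed = 0;
        for (r = 0; r < nrules; r++) {
            if (zero_length[rules[r].lhs]) continue;
            for (k = 0; k < rules[r].length; k++)
                if (!zero_length[rules[r].rhs[k]]) break;
            if (k == rules[r].length) {   /* all elements zero length */
                zero_length[rules[r].lhs] = 1;
                changed = 1;
            }
        }
    }
}
\end{verbatim}

A null production has no elements, so its left side is marked on the
first pass.  Later passes pick up tokens whose rules consist entirely
of tokens that have already been marked.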

\subsection{Grammar Token}

Every grammar must have a single token which produces the entire
grammar.  This token is variously called the
\index{Token}\index{Grammar token}\agterm{grammar token}, the
\index{Goal token}\agterm{goal token} or the
\index{Start token}\agterm{start token}.
AnaGram provides several methods you may use to specify which token in
your grammar is the grammar token.

You may simply use the name \agcode{grammar} for the grammar token.
If you wish to use some other more descriptive name for your grammar
token, you may mark it with a following dollar sign when it appears on
the left side of a production.  Alternatively, you may set the
\index{Grammar token}\index{Configuration parameters}\agparam{grammar token}
configuration parameter to specify the grammar token.  Here are
examples of the methods:

\begin{indentingcode}{0.4in}
grammar
  -> [statement | newline]/...

program \$
  -> [statement | newline]/...

{}[ grammar token = program ]
program
  -> [statement | newline]/...
\end{indentingcode}

If you should use more than one of these techniques, AnaGram resolves
the issue in the following manner: A marked token or a configuration
parameter setting always takes precedence over simply naming a token
\agcode{grammar}.  If you mark more than one token or set the
configuration parameter more than once, the last setting or mark wins.

\subsection{Grammar Rules}
\index{Rule}\index{Grammar rule}

The part of a production to the right of the arrow is more often
called a \agterm{grammar rule}, or simply \agterm{rule}.  A grammar
rule is a sequence of \index{Rule elements}\agterm{rule elements},
joined by commas, as in the examples of productions given above.  Rule
elements are token names, character set expressions, virtual
productions, or immediate actions (see below).  Each rule element may
be optionally followed by a parameter assignment.  The entire rule may
be followed by an optional reduction procedure.  A \index{Parameter
assignment}parameter assignment is a colon followed by a C variable
name.  Here are some examples of rule elements with parameter
assignments:

\begin{indentingcode}{0.4in}
'0-9':d
integer:n
expression:x
declaration:declaration{\us}descriptor
\end{indentingcode}

The parameters you assign to tokens in your grammar rule become the
formal parameters for your \index{Reduction procedure}reduction
procedure.  The data type\index{Data type}\index{Reduction procedure
arguments} of the parameter is determined by the data type for the
semantic value of the token to which it is assigned.  If your grammar
rule has parameter assignments, but does not have a reduction
procedure, AnaGram will give you a warning in case the lack of a
reduction procedure is an oversight.  If you don't need a reduction
procedure you may safely ignore the warning.  On the other hand,
AnaGram has no way to determine whether you have failed to make
necessary parameter assignments.  You won't find out until you compile
your parser, when your compiler will give you error messages for
undefined symbols.
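
For illustration, here is a sketch of a pair of productions whose
parameter assignments supply the formal parameters for short form
reduction procedures.  It assumes that the semantic values of
\agcode{integer} and of the character set \agcode{'0-9'} have integer
types:

\begin{indentingcode}{0.4in}
integer
  -> '0-9':d               = d - '0';
  -> integer:n, '0-9':d    = 10*n + d - '0';
\end{indentingcode}

In the second rule, \agcode{n} and \agcode{d} would become the formal
parameters of the reduction procedure.  Omitting \agcode{:n} would
leave \agcode{n} undefined when the parser is compiled, unless a
variable of that name happens to be declared elsewhere.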

AnaGram assigns a unique rule number to each rule in your grammar.
Rules are numbered sequentially as they are encountered in the syntax
file.  AnaGram constructs rule zero itself.  Rule zero normally has a
single element, the grammar token, unless you have a
\agparam{disregard} statement in your grammar.  In this case there
will be two elements.

\subsection{Reduction Procedures}
\index{Reduction procedure}

% XXX somewhere in here it ought to say something like
% ``in the parsing literature reduction procedures are often known as
% \agterm{semantic actions}.''
% Note that R. says there's some subtle difference between the usual
% concept of semantic action and AG's concept of reduction procedure.
% I don't know what this difference is and I hope she can recall it.
%
% D. thinks this note ought to be at the end; R. wants it at the top.

A \agterm{reduction procedure} is a piece of C code which optionally
follows a production.  The code is executed when your parser
identifies the production in its input.  There are two forms for
reduction procedures, a short form and a long form.  The short form
consists of a single C expression.  The long form consists of an
arbitrary block of C code.  When AnaGram builds a parser, it inspects
the grammar rule to which the procedure is attached and identifies the
parameters for the procedure.  It uses these parameters as the formal
parameters for the procedure.
If the
\index{Macros}\index{Allow macros}\index{Configuration switches}
\agparam{allow macros}
configuration switch has not been turned off, AnaGram codes the
reduction procedure as a macro definition whenever possible.
Otherwise AnaGram codes it as a function definition.  AnaGram builds
the name for a reduction procedure by appending its internal procedure
number to the string \agcode{ag{\us}rp{\us}}.  Procedure numbers are
assigned in the order in which the reduction procedures are encountered
in the syntax file.

Both long and short form reduction procedures are preceded by an equal
sign which follows the production.  The short form consists of a C or
C++ expression terminated by a semicolon.  When the grammar rule is
reduced, the expression will be evaluated and its value will become
the value of the reduction token.  The expression and the terminating
semicolon must be entirely on a single line.  Note that, if you really
need to make the expression longer than will fit on one line, you can
embed a newline in a comment.  Some examples of short form reduction
procedures are:

% XXX is there anything we can do about the ugly underscores?
\begin{indentingcode}{0.4in}
=0;

=1;

=10*n + d-'0';

=
special{\us}processor(first{\us}parameter, second{\us}parameter);

=word{\us}count++;

=widget(constant{\us}1*parameter{\us}1 + constant{\us}2*parameter{\us}2  /*
{} */ + constant{\us}3*parameter{\us}3);
\end{indentingcode}

A long form reduction procedure consists of an arbitrary block of C or
C++ code, enclosed in braces (\bra \ket).  AnaGram will code the reduction
procedure as a function.  To return a value for the reduction token,
simply use the \agcode{return} statement.  There are effectively no
restrictions on the content or length of a reduction procedure.  Of
course, if there are unbalanced braces, unterminated comments or
unterminated string literals, AnaGram will not be able to determine
where the reduction procedure ends.  AnaGram treats
\index{Comments}nested comments within a reduction procedure according
to the value of the \index{Nest comments}\index{Configuration
switches}\agparam{nest comments} configuration switch at the point
where it encounters the reduction procedure.

In practice, a reduction procedure more than a few lines long is
usually a bad idea, since a long procedure will hamper your overall
view of your grammar.  Long reduction procedures should be written as
separate named functions, and should be placed either in the embedded
C portion of your syntax file or in a wholly separate module.  Here is
an example of a long form reduction procedure:

\begin{indentingcode}{0.4in}
=\bra
   if (flag) \bra
     total += x;
     return identify(x);
   \ket
   else \bra
     total = 0;
     flag = 1;
     return init{\us}table(x);
   \ket
 \ket
\end{indentingcode}
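
The same logic could instead be moved into a named function in a block
of embedded C, leaving only a short form reduction procedure in the
grammar.  The following is a sketch only: \agcode{accumulate},
\agcode{item} and the \agcode{int} types are assumptions made for
illustration, and \agcode{identify} and \agcode{init{\us}table} are
taken from the example above and assumed to be declared elsewhere.

\begin{indentingcode}{0.4in}
\bra
  static int total = 0, flag = 0;

  static int accumulate(int x) \bra
    if (flag) \bra
      total += x;
      return identify(x);
    \ket
    total = 0;
    flag = 1;
    return init{\us}table(x);
  \ket
\ket

item
  -> expression:x   = accumulate(x);
\end{indentingcode}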

If a rule does not have a reduction procedure, the semantic value of
the reduction token will be set to the \index{Semantic
value}\index{Token}\index{Value}semantic value of the first token in
the rule, unless the rule is a \index{Null productions}null
production.  In the latter case, the value of the reduction token will
be set to zero.
% XXX and what if zero isn't a valid value for the type? a compiler
% error will occur.
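
As a sketch of the default, consider the following productions, where
\agcode{factor}, \agcode{integer} and \agcode{expression} are
hypothetical tokens:

\begin{indentingcode}{0.4in}
factor
  -> integer
  -> '(', expression:x, ')'   = x;
\end{indentingcode}

The first rule needs no reduction procedure, since the value of
\agcode{factor} defaults to the value of \agcode{integer}.  The second
rule does need one, since otherwise \agcode{factor} would take the
value of the first token in the rule, the left parenthesis.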

% XXX add something like
%
% Variables appearing in reduction procedures which do not have a
% parameter assignment in the corresponding grammar rule can be
% declared globally or (file)-statically in your embedded C, or
% alternatively could be added to the parser control block using
% the \agparam{extend pcb} statement (q.v. | See Section ....).
% (Reword this.)
%
% Should also discuss the sequencing of reduction procedure calls
% so that people understand what happens if you use such variables.
%
% also ``A reduction procedure can be used to terminate parsing for
% semantic reasons''.
%

\subsection{Immediate Actions}
\index{Immediate action}\index{Action}

An immediate action is a rule element that consists of executable C or
C++ code embedded within a grammar rule to be executed when it is
encountered.  An immediate action is denoted by the use of an
exclamation point, \index{!}``!''.  The content of an immediate action
may be written following the rules for either long form or short form
reduction procedures.  As with any other rule element, it must be
separated from preceding and following rule elements by commas.  In
the grammar for a simple desk calculator, one might write

\begin{indentingcode}{0.4in}
transaction
  -> !printf("\#");, expression:x = printf("\%d{\bs}n", x);
\end{indentingcode}

Notice that the only visible difference between an immediate action
and a reduction procedure is that the immediate action is preceded by
``!'' instead of ``=''.  The immediate action must be followed by a
comma to separate it from the following rule element.

Immediate actions may also be used in definitions:

\begin{indentingcode}{0.4in}
prompt = !printf("\#");
\end{indentingcode}

AnaGram implements an immediate action by creating a special token for
it.  AnaGram then creates a single null production for the
token. Finally, the immediate action is implemented as the reduction
procedure for the null production.

For example, you could implement \agcode{prompt} by writing a null production
with a reduction procedure:

\begin{indentingcode}{0.4in}
prompt
  ->      = printf("\#");
\end{indentingcode}

This production would be equivalent to the definition above.

There are two ways, however, in which immediate actions differ from
the equivalent null production.  Immediate actions may access any
parameter assignments which precede them in the rule in which they
occur.  On the other hand, there is no way to assign a data type to
the semantic value, if any, returned by the immediate action.
Therefore, the type is determined by your setting of the
\index{Default token type}\index{Configuration parameters}
\agparam{default token type} configuration parameter.
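
As a sketch of the first difference, the immediate action in the
following rule uses the parameter \agcode{x}, which is assigned
earlier in the same rule.  The rule itself is hypothetical, and
\agcode{x} is assumed to have type \agcode{int}:

\begin{indentingcode}{0.4in}
transaction
  -> expression:x, !printf("\%d{\bs}n", x);, '{\bs}n'
\end{indentingcode}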

\subsection{Virtual Productions}
\index{Virtual productions}\index{Production}

Virtual productions are a convenient short form notation for common
grammatical constructs involving choice and repetition.  The notation
represents an extension of notation commonly used in programming
manuals.  A virtual production may be written in a grammar rule at any
place where you could write a token name, even within another virtual
production.  Note that use of virtual productions is never
\emph{required}, since the equivalent productions can always be
written out explicitly instead.

When AnaGram encounters a virtual production, it replaces the virtual
production with a new token and writes appropriate productions for the
new token.  When you look at your syntax tables using AnaGram windows,
you will see the productions that AnaGram generates.  AnaGram keeps a
record of virtual productions, so that generally if you use the same
virtual production a second time, you get the same set of tokens and
productions that were generated the first time it was used.  This is
not the case if the virtual productions contain reduction procedures
or immediate actions, since AnaGram is not equipped to determine
whether two pieces of C code are equivalent.  Thus, a virtual
production that contains a reduction procedure will be unique and will
not be reused.

One disadvantage of virtual productions is that there is no way to
specify the data type of the \index{Semantic value}\index{Virtual
production}semantic value of a virtual production.  Therefore, if you
have a reduction procedure within a virtual production, its return
value must be consistent with the type defined by the \index{Default
token type}\index{Configuration parameters}\agparam{default token type}
configuration parameter.

The simplest virtual production is the \index{Token}\index{Optional
token}\agterm{optional token}.  If \agcode{x} is an arbitrary token
name or set expression, you can indicate an optional \agcode{x} by
writing \index{?}\agcode{x?}.  You may also indicate a repetition of
\agcode{x} by using the ellipsis with either \agcode{x} or \agcode{x?}.
\index{...}\index{Ellipsis}Thus \agcode{x...} represents
one or more instances of \agcode{x} and \index{?...}\agcode{x?...}
represents zero or more instances of \agcode{x}. For example:

\begin{indentingcode}{0.4in}
'+'?
\end{indentingcode}

can be used to represent an optional plus sign, that is, a choice
between a plus sign and nothing at all.  Similarly,

\begin{indentingcode}{0.4in}
'{\bs}n'?...
\end{indentingcode}

represents an optional sequence of newline characters.
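
Since a virtual production can always be written out explicitly, the
same construct could be coded by hand.  The following sketch uses a
hypothetical token name, \agcode{newlines}; the productions AnaGram
itself generates for this virtual production may be organized
differently:

\begin{indentingcode}{0.4in}
newlines
  ->
  -> newlines, '{\bs}n'
\end{indentingcode}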

\index{Brackets}\index{Braces}\index{\_opb\_clb}\index{[]}
The next category of virtual productions uses brackets or braces to
indicate a choice among a number of enclosed grammar rules separated
by vertical bars.  A single rule may also be enclosed.  Note that
\emph{rules}, with following reduction procedures, are allowed, not
simply tokens.

Braces are used to indicate that one option must be chosen.  Brackets
are used to indicate the choice is optional, i.e. may be omitted
altogether.  The ellipsis following a set of options within brackets
or braces indicates the option may be repeated an indefinite number of
times.

You can use braces to indicate a simple choice among a number of
options.  A Cobol grammar offers the following choice of equivalent
keywords:

\begin{indentingcode}{0.4in}
\bra "RECORD", "IS"? | "RECORDS", "ARE"? \ket
\end{indentingcode}

\index{\_opb\_clb...}\index{ []...}
You may use the ellipsis with braces to indicate an arbitrary positive
number of repetitions of the choice:

\begin{indentingcode}{0.4in}
{\bra}type specifier | storage class specifier{\ket}...
\end{indentingcode}

This expression requires at least one type specifier or storage class
specifier, but will accept any number.

\index{[]}
To make a choice optional, use brackets instead of braces.  An
example, again drawn from a Cobol grammar, is:

\begin{indentingcode}{0.4in}
{}["LIMIT", "IS"? | "LIMITS", "ARE"?]
\end{indentingcode}

\index{[]...}
Ellipses may be used with brackets to indicate an arbitrary number of
choices that may be omitted altogether:

\begin{indentingcode}{0.4in}
{}[argument, [',', argument]...]
\end{indentingcode}

This expression describes an optional argument list with arguments
separated by commas.
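
Written out explicitly, an equivalent optional argument list might
look like the following sketch.  The token names \agcode{argument
list} and \agcode{arguments} are hypothetical:

\begin{indentingcode}{0.4in}
argument list
  ->
  -> arguments

arguments
  -> argument
  -> arguments, ',', argument
\end{indentingcode}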

If you use a null production within braces, it must be the first option:

\begin{indentingcode}{0.4in}
\bra | '+' | '-' \ket
\end{indentingcode}

Normally, you would do this only if you wanted to attach a reduction
procedure to the null production.  Note that if you include a null
production within braces, and add an ellipsis after the closing brace
for repetition, your grammar will be ambiguous.  Just exactly how many
times does the null production occur?  Use brackets instead, and omit
the null production.
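
For example, a repeated optional sign could be sketched as follows,
rather than by adding an ellipsis to the braces in the example above:

\begin{indentingcode}{0.4in}
{}['+' | '-']...
\end{indentingcode}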

Null productions are not allowed with brackets, since they would be
intrinsically ambiguous.

The options within braces or brackets may be grammar rules of any
length or complexity and may themselves contain virtual productions of
arbitrary complexity.  Nevertheless, in practice, clarity suffers as
soon as the options get very complex.  Virtual productions are most
important and useful when used in simple situations.  In those
situations they will enhance the clarity of your grammar.

Here is an example that is moderately complex, even though each rule
consists of a single token:

\begin{indentingcode}{0.4in}
\bra{\bra}"on" | "true"\ket = 1; | {\bra}"off" | "false"\ket = 0; | integer\ket
\end{indentingcode}

This example can be used to allow as input either an integer or, for
special cases, keywords.  You could write this option out in the
following way:

\begin{indentingcode}{0.4in}
p1
  -> p2   = 1;
  -> p3   = 0;
  -> integer

p2
  -> "on"
  -> "true"

p3
  -> "off"
  -> "false"
\end{indentingcode}

The final category of virtual production provides a notation for
\index{Alternating sequence}\agterm{alternating sequences}.  An
alternating sequence is a set of choices which may be repeated
arbitrarily subject to the side condition that no choice may follow
itself, in other words, that the choices must alternate.  Alternating
sequences are written with either brackets or braces depending on
whether the sequence is optional or not, followed by
\index{/...}``\agcode{/...}''.  Note that the choices themselves may
allow sequences.  For example:

\begin{indentingcode}{0.4in}
program
  -> [statement | newline...]/..., eof
\end{indentingcode}

represents a sequence of statements in which any two statements must
be separated by one or more newline characters.  Newlines may also
appear at the beginning and the end of the program.

Null productions are not allowed within alternating sequences, since
they are intrinsically ambiguous in all cases.
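
Written with braces instead of brackets, the sequence is no longer
optional.  In the following sketch, a variant of the example above,
at least one statement or run of newlines must precede \agcode{eof}:

\begin{indentingcode}{0.4in}
program
  -> \bra statement | newline...\ket/..., eof
\end{indentingcode}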

\subsection{Definition Statements}
\index{Definitions}\index{Definition statement}\index{Statement}

A definition statement is simply a shorthand way of naming a character
set, a \index{Virtual productions}\index{Production}virtual
production, a keyword string, or an immediate action.  It can also be
used for providing an alternate name for a token. Definitions have the
form:

\begin{indentingcode}{0.4in}
name = \codemeta{character set}
name = \codemeta{virtual production}
name = \codemeta{keyword}
name = \codemeta{immediate action}
name = \codemeta{token name}
\end{indentingcode}

The name may be any name acceptable to AnaGram.  The name can then be
used anywhere you might have used the expression on the right
side.  \index{!}For example:

\begin{indentingcode}{0.4in}
upper case letter = 'A-Z'
lower case letter = 'a-z'
letter = upper case letter + lower case letter
statement list = statement?...
while keyword = "WHILE"
prompt = !printf("Please enter name:");
\end{indentingcode}

It is important to recognize that a definition statement that names a
set does not define a token.  A token is defined only when the set is
used in a grammar rule, and then only if the set is used directly, not
in combination with some other set.  Furthermore, if you use a
character set directly in a grammar rule, and in some other rule you
use a name that refers to the same set of characters, you will get two
different tokens.  For example, if you have defined \agcode{upper case
letter} as in the above example and use both \agcode{upper case
letter} and \agcode{'A-Z'} in grammar rules, AnaGram will assign
different \index{Token number}\index{Token}\index{Number}token numbers
to accommodate any differences in attributes you may assign to the
tokens.

Renaming tokens is a convenient way to connect two independently
written portions of a grammar.
% See the C grammar in the EXAMPLES directory of your distribution
% disk for an example.
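
As a sketch, suppose that one portion of a grammar refers to a token
called \agcode{condition} while another, written separately, defines
\agcode{logical expression}.  Both names are hypothetical.  A single
definition statement would connect the two:

\begin{indentingcode}{0.4in}
condition = logical expression
\end{indentingcode}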

\subsection{Embedded C}
\index{Embedded C}

You may encapsulate C or C++ code in your syntax file by enclosing it
in braces (\bra \ket).  Such pieces of code are copied to the parser file
untouched, in the order they are found in the syntax file.  There may
be any number of such pieces of embedded C.  The only restriction is
that they must not start on the same line as some other AnaGram
statement, and following AnaGram statements must also start on fresh
lines.

Normally, the blocks of embedded C in your syntax file are copied to
the parser file \emph{following} a set of definitions and declarations
AnaGram needs for the code it generates.  However, if the \emph{first}
statement in your \index{Syntax file}syntax file is a block of
embedded C, it will \emph{precede} AnaGram's definitions and
declarations.  This block of embedded C is called the
\index{Prologue}\index{C prologue}``C prologue''.  There are two main
reasons for this special treatment.  First, you may want to have a
title and \index{Copyright notice}copyright notice in your parser.  If
you include them in an initial block of embedded C they will be right
at the beginning of both your syntax file and your parser file.
Second, if some of your tokens have data type\index{Token}\index{Data
type}s other than those predefined in C or C++, you may include the
definitions here, so they will be available to the code AnaGram
generates.
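
A C prologue might look like the following sketch.  The comment text
and the \agcode{position} type are invented for illustration; the
block must be the first statement in the syntax file to be treated as
the prologue:

\begin{indentingcode}{0.4in}
\bra
/* Parser for widget descriptions */

typedef struct \bra
  int line;
  int column;
\ket position;
\ket
\end{indentingcode}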

AnaGram scans embedded C only insofar as is necessary to find the
closing right brace.  Therefore any braces used within embedded C must
balance properly.  AnaGram skips braces enclosed in character
constants and string literals, as well as braces enclosed in
comments.  It also recognizes C++ style comments that begin with
\agcode{//}.  \index{Comments}Treatment of nested versus non-nested comments
is controlled by the
\index{Nest comments}\index{Configuration switches}\agparam{nest comments}
configuration switch.  AnaGram will use the setting of this switch
in effect at the beginning of the section of embedded C.

AnaGram, of course, can be confused by unterminated strings,
unbalanced braces, and unterminated comments.  The most likely
outcome, in such a situation, is that AnaGram will encounter the end
of file looking for the end of the embedded C.  Should this happen,
AnaGram will identify the beginning of the piece of embedded C which
caused the problem.

The code you include as embedded C, of course, has to coexist with the
code AnaGram generates.  In order to keep the potential for conflicts
to a minimum, all variables and functions which AnaGram defines begin
either with the name of your parser or with the letters
\agcode{ag{\us}}.  You should avoid variable and function names which
begin with either prefix.

Reduction procedures are copied to the \index{Parser
file}\index{File}parser file \emph{following} all of the embedded C,
in the order in which they are defined.  Thus your reduction
procedures may freely use variables and macros defined anywhere in
your embedded C.

\subsection{Configuration Sections}
\index{Configuration section}

A configuration section is a special section of your syntax file
enclosed in brackets.  Within a configuration section you may set the
values of configuration parameters or switches, or you may use one or
more of several available attribute statements to specify special
treatment for certain tokens.  There can be as many or as few
configuration sections in your syntax file as you wish.  Each
configuration section must begin on a new line.  Any AnaGram statement
which follows a configuration section must also begin on a new line.

Within a configuration section, each parameter setting and each
attribute statement must begin on a new line.  The rules for using
comments and continuation lines are the same as for the rest of
AnaGram.
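
For instance, a configuration section that combines several settings
with an attribute statement might be laid out as in this sketch, which
reuses names from examples elsewhere in this chapter:

\begin{indentingcode}{0.4in}
{}[
  grammar token = program
  parser name = widget
  \~{}nest comments
  hidden \bra comment head \ket
]
\end{indentingcode}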

Configuration parameters control the way AnaGram interprets your
syntax file and the way it builds your parser.  A full discussion of
the use of configuration parameters, including a complete discussion
of each parameter and its default value, is given in Appendix A.

\index{Attribute statements}\index{Statement}
Attribute statements comprise the
\index{Precedence declarations}precedence declarations \agparam{left},
\agparam{right}, and \agparam{nonassoc}; the \agparam{sticky}
declaration; the \agparam{distinguish keywords} statement; the
\agparam{hidden} declaration; the \agparam{disregard} and
\agparam{lexeme} statements; the \agparam{enum} statement; the
\index{Reserve keywords}\agparam{reserve keywords} declaration; and
the \index{Rename macro}\agparam{rename macro} statement.

The precedence declarations and the
\index{Sticky declaration}\index{Declaration}\agparam{sticky}
declaration may be used to resolve conflicts in your grammar.  The
\agparam{distinguish keywords} statement may be used to control
keyword recognition.  The
\index{Hidden declaration}\index{Declaration}\agparam{hidden}
declaration causes certain token names not to be used when your parser
produces
\index{Syntax error}\index{Errors}\index{Error messages}syntax error
messages.  You may use the \agparam{disregard} and \agparam{lexeme}
statements to cause your parser to skip automatically over certain
tokens in its input.  The \agparam{enum} statement is almost identical
to the enum statement in C.  It can be used to assign names to input
codes in grammars which are taking input from a \index{Lexical
scanner}lexical scanner or another parser.  The
\index{Reserve keywords}\agparam{reserve keywords} declaration allows
you to specify certain keywords as reserved words.  The
\index{Rename macro}\agparam{rename macro} statement allows you to
override the names AnaGram uses for various macro definitions it
creates in the code it generates.

Attribute statements are discussed below. Except for
\agparam{disregard} and \agparam{rename macro} statements, attribute
statements accept lists of operands enclosed in braces (\bra \ket)
and separated by commas.  A dangling comma following the last item in
a list will be ignored.
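
Under this rule, the following declaration, shown as a sketch, should
be accepted in spite of the comma after the last operand:

\begin{indentingcode}{0.4in}
{}[
  left \bra '+', '-', \ket
]
\end{indentingcode}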

\subsection{Setting Configuration Parameters}
\index{Configuration parameters}\index{Parameters}

Each configuration parameter has a name that follows the AnaGram
conventions for symbol names, except that AnaGram ignores
case\index{Case sensitivity} when looking up configuration parameter
names.

There are a number of varieties of configuration parameters.  The
simplest,
\index{Configuration switches}\index{Switches}configuration switches,
simply turn some feature of AnaGram on or off.  These parameters need
simply be stated to turn the feature on, or negated with the tilde
(\agcode{\~{}}) to turn the feature off:

\begin{indentingcode}{0.4in}
nest comments
\end{indentingcode}

causes AnaGram to allow nested comments, and

\begin{indentingcode}{0.4in}
\~{}nest comments
\end{indentingcode}

causes AnaGram to disallow nested comments.

You may also set or reset configuration switches with explicit on or
off values:

\begin{indentingcode}{0.4in}
nest comments = on
nest comments = off
\end{indentingcode}

The remaining configuration parameters are assigned values using a
simple assignment statement.  Depending on the parameter, the value it
takes may be the name of a token, a C variable name, a C or C++ data
type, a string constant or an integer.  String constants are written
using the same rules as keyword strings, described above.

\begin{indentingcode}{0.4in}
grammar token      = program
parser name        = widget
default token type = void *
header file name   = "widget.h"
parser stack size  = 50
\end{indentingcode}

A number of string-valued \index{Configuration
parameters}configuration parameters are used to determine file
names and variable names.  In these parameters, the \index{\#}``\#'',
\index{\_dol}``\$'', and ``\index{ \_prc}\%'' characters
are used as wild cards.  In file name specifications and the
specification of the name of your parser, ``\#'' will be replaced by
the name of your syntax file.  In other function and variable names
AnaGram creates while building your parser, ``\$'' will be replaced by
the name of your parser.  When building enumeration constants for the
names of the tokens in your grammar, ``\%'' will be replaced by the
name of the token.
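
As a sketch of the first wild card, consider the setting below.  It
assumes a syntax file whose name, less its extension, is
\agcode{widget}; whether the extension is in fact dropped is an
assumption made here, not something stated above.

\begin{indentingcode}{0.4in}
header file name = "\#.h"
\end{indentingcode}

Under that assumption, the header file would be named
\agcode{widget.h}.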

Note that when entering a Windows/DOS path name as a
value for a file name parameter you must escape each backslash in the
path name by doubling it.  For example,

\begin{indentingcode}{0.4in}
coverage file name = "f:{\bs\bs}sna{\bs\bs}foo.nrc"
\end{indentingcode}

\subsection{Precedence Declarations}
\index{Precedence declarations}

AnaGram allows you to resolve shift-reduce conflicts by assigning
precedence levels to operators.  There are three precedence
declarations available, beginning with the keywords
\index{Left}\agparam{left}, \index{Right}\agparam{right}, and
\index{Nonassoc}\agparam{nonassoc} respectively.  Each such
declaration consists of the appropriate keyword and a list of tokens
enclosed in braces (\bra \ket). All the tokens in the list have the same
precedence, higher than tokens in any previous declaration and lower
than in any subsequent declaration.  If the keyword is \agparam{left},
the tokens will group to the left.  If it is \agparam{right}, they
will group to the right.  If it is \agparam{nonassoc} (for
non-associative) no grouping will be assumed.  Precedence declarations
must be included in a configuration section.  Here are precedence
declarations appropriate to a simple desk calculator program:

\begin{indentingcode}{0.4in}
{}[
  left  \bra '+', '-' \ket
  left  \bra star, '/', '\%' \ket
  right \bra unary minus \ket
]
unary minus = '-'
\end{indentingcode}

Note that \agcode{unary minus} and \agcode{'-'} can have different
precedence.
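
These declarations might accompany ambiguous productions such as those
in the following sketch, in which \agcode{star} is assumed to be
defined as \agcode{'*'}, and \agcode{expression} and \agcode{integer}
are assumed to have integer semantic values:

\begin{indentingcode}{0.4in}
star = '*'

expression
  -> integer
  -> expression:x, '+', expression:y    = x + y;
  -> expression:x, '-', expression:y    = x - y;
  -> expression:x, star, expression:y   = x * y;
  -> expression:x, '/', expression:y    = x / y;
  -> unary minus, expression:x          = -x;
\end{indentingcode}

Without the precedence declarations, rules written this way would give
rise to shift-reduce conflicts.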

Precedence declarations are one of the few instances in AnaGram where
the \index{Statements}\index{Order of statements}order of statements
is significant.

The use of precedence declarations is discussed in Chapter 9.

\subsection{``Sticky'' Declarations}
\index{Sticky declaration}\index{Declaration}

AnaGram provides another means for resolving shift-reduce conflicts.
You may characterize any token as ``sticky''.  Then, in the case of a
\index{Shift-reduce conflict}\index{Conflicts}shift-reduce conflict
where a ``sticky'' token is the last token in the input buffer, the
conflict will be resolved by selecting the shift operation.
Intuitively, you may think of this as though the ``sticky'' token
adheres to and draws in any subsequent input that it can.  ``Sticky''
declarations are included in configuration sections.  They begin with
the keyword \agcode{sticky} followed by a list of tokens, separated by
commas inside braces (\bra \ket).  Suppose, for instance, you wished to
pick up a line of text, skipping any leading space or tab
characters. You might write the following syntax:

\begin{indentingcode}{0.4in}
white space = ' ' + '{\bs}t'

text char
  -> \~{}'{\bs}n':c  = do{\us}something(c);

line
  -> leading white space, text char?..., '{\bs}n'

leading white space
  ->
  -> leading white space, white space
\end{indentingcode}

Unfortunately, this syntax is ambiguous, since space and tab are
legitimate instances of both leading white space and text char.  What
you really want to do is to skip white space until you find a
non-blank character and then you want to accept all characters to the
end of the line.  There are two ways to address the problem.  The
first is to define a special token for the first non-blank character
and, using it, to write an unambiguous grammar.  This approach, while
laudable, is tedious and prolix.  Instead, use \agparam{sticky} to
resolve the problem:

\begin{indentingcode}{0.4in}
{}[ sticky \bra leading white space \ket ]
\end{indentingcode}

Now when AnaGram analyzes your grammar, and encounters the ambiguity,
it will understand that a blank or tab that could be treated either as
leading white space or as the first text character should be
treated as white space.  Since \agcode{leading white space} is
``sticky'', any subsequent white space adheres to it.

As with conflicts resolved with precedence levels, AnaGram lists all
conflicts that it resolves using \agcode{sticky} in the
\index{Resolved Conflicts}\index{Window}\agwindow{Resolved Conflicts
Table}, so you can verify that the conflicts have been correctly
resolved.

An important use of sticky tokens is to inhibit the recognition of
following \index{Keywords}keywords.  Following a sticky token, a
keyword, which, according to your grammar, would otherwise be
legitimate input, will not be recognized if a shift action is possible
for the first character of the keyword.  For example, imagine that
\agcode{name} has been defined in the conventional way, and there
exists a production with name followed immediately by the keyword
\agcode{int}.  Then if, in your input, the word \agcode{print} were to
occur, your grammar would parse it as a name, \agcode{pr}, followed by
the keyword \agcode{int}.  If you make \agcode{name} sticky, however,
the first letter of \agcode{int} will be seen to be an acceptable
character for \agcode{name} and the keyword will not be
recognized. Your parser will then recognize the \agcode{name} as
\agcode{print}.

\subsection{Distinguish Keywords Statement}
\index{Distinguish keywords}\index{Keywords}

Distinguish keywords statements are occasionally needed to prevent
keyword recognition.  You may, for example, wish to prevent the
recognition of the keyword \agcode{int} when it occurs embedded in a
word such as \agcode{interval}.  Of course, you need to do this only
if the keyword and the other word are both legitimate input at
the same point in your grammar.

A distinguish keywords statement can prevent recognition of a keyword
which is embedded in another word provided at least one character of
the other word follows the keyword.

The distinguish keywords statement has the form:

\begin{indentingcode}{0.4in}
distinguish keywords \bra \codemeta{list of character sets} \ket
\end{indentingcode}

AnaGram compares all the characters in each keyword to the characters
included in each character set in turn.  If it finds that all the
characters in a keyword are members of a particular set, it tells the
keyword recognition logic to try to match the keyword only against the
longest sequence of characters drawn from the specified set.  In other
words, in order for a keyword to be recognized, the keyword
\emph{must} be followed by a character \emph{not} in the set.  The set
associated with a keyword is the first one in the list which contains
all the characters found in the keyword.  If you have more than one
\agparam{distinguish keywords} statement in your grammar, the lists
are tried in the order in which they appear in the grammar.

The purpose of the \agparam{distinguish keywords} statement is to
enable your parser to distinguish a keyword from the same sequence of
characters embedded within another sequence.  Thus suppose that
\agcode{int} is a keyword, and, according to your grammar, could
appear in the same place as the word \agcode{integral}.  If you don't
want it to be recognized as a keyword in these circumstances, you
would write the following distinguish keywords statement:

\begin{indentingcode}{0.4in}
distinguish keywords \bra 'a-z'+'A-Z' \ket
\end{indentingcode}

To also inhibit recognition of \agcode{int} within \agcode{print}, you
would combine the use of the distinguish keywords statement with the
\agparam{sticky} declaration.
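
A minimal sketch of that combination follows.  It assumes that the
token in question is called \agcode{name}, as in the discussion of the
\agparam{sticky} declaration, and that names consist only of letters:

\begin{indentingcode}{0.4in}
{}[
  sticky \bra name \ket
  distinguish keywords \bra 'a-z'+'A-Z' \ket
]
\end{indentingcode}

Here the \agparam{sticky} declaration keeps a \agcode{name} that has
already begun from being cut short by \agcode{int}, as in
\agcode{print}, while the distinguish keywords statement keeps
\agcode{int} from being recognized at the start of a longer word such
as \agcode{integral}.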

\subsection{``Hidden'' Declarations}
\index{Hidden declaration}\index{Declaration}

AnaGram provides an optional \index{Error diagnosis}error diagnosis
feature for your parser (see Chapter 9).  The \agparam{hidden}
declaration allows you to identify tokens that you do not wish to be
used in making up \index{Diagnostic messages}diagnostic messages.
These are tokens whose names would not mean anything to your
users.  The format of a ``hidden'' declaration is the same as that of
precedence and ``sticky'' declarations.  Within a configuration
section, the keyword ``hidden'' is followed by a list of tokens. For
example:

\begin{indentingcode}{0.4in}
{}[ hidden \bra comment head \ket ]
comment
  -> comment head, "*/"

comment head
  -> "/*"
  -> comment head, \~{}eof
\end{indentingcode}

This is an AnaGram representation of ANSI standard C comments
(non-nested).  In this example the token \agcode{comment head} exists
only for convenience in writing the grammar and has no particular
meaning to an end user.  On the other hand, the user knows what the word
\agcode{comment} refers to.  The ``hidden'' attribute causes AnaGram's
diagnostic builder to back up the stack until it finds a non-hidden
token, so that it avoids \agcode{comment head} in favor of
\agcode{comment}.

\subsection{Disregard Statement}

The purpose of the
\index{Disregard statement}\index{Statement}\agparam{disregard}
statement is to skip over uninteresting \index{White space}white space
and comments in your input files. The disregard statement allows you
to specify a token that should be passed over in the input to your
parser.  The statement takes the form:

\begin{indentingcode}{0.4in}
disregard ws
\end{indentingcode}

where \agcode{ws} is a token name or character set.  Disregard
statements may be placed in any configuration section.

You may have more than one disregard statement in your grammar.  If
you do, AnaGram will create a shell production. For example, suppose
you write:

\begin{indentingcode}{0.4in}
{}[
  disregard alpha
  disregard beta
]
\end{indentingcode}

AnaGram will proceed as though you had written:

\begin{indentingcode}{0.4in}
gamma
  -> alpha | beta
{}[ disregard gamma ]
\end{indentingcode}

It frequently happens that you wish your parser to disregard blanks or
comments, except that white space within names, numbers, strings, and
other elementary constructs is subject to special rules and thus
should not be disregarded blindly.  In this case, you can use the
\agparam{lexeme} statement to declare these constructs off limits
for the disregard statement.  Within these constructs, the disregard
statement will be inoperative and the admissibility of white space
will be determined solely by the productions which define these
constructs.

Outside those productions which define lexemes, you should not
generally use a token which is supposed to be disregarded.  If you do,
your grammar will have conflicts, since the token could satisfy both
the explicit usage and the implicit rules set up by the disregard
statement.  Such conflicts, however, are resolved automatically in
favor of your explicit use of the token.  The conflicts will appear in
the \agwindow{Resolved Conflicts} window.
% XXX I'm not sure that's still true.

In order to implement the disregard statement AnaGram will redefine
some tokens in your grammar.  For example, \agcode{+} may be redefined
to consist of a simple plus sign followed by optional white space:

\begin{indentingcode}{0.4in}
'+'
  -> '+'\%, white space?...
\end{indentingcode}

The percent sign is used to indicate the original, simple plus sign
without the optional white space attached.  You will probably notice
the percent sign appearing in some windows and traces.  In earlier
versions of AnaGram, the degree sign, ``\agcode{\degrees}'', was used rather
than ``\agcode{\%}''.

\subsection{Lexeme Statement}

The ``lexeme'' \index{Statement}\index{Lexeme statement}statement is
used to fine-tune the disregard statement.
The lexeme statement takes the form:

\begin{indentingcode}{0.4in}
{}[ lexeme \bra \codemeta{nonterminal token list} \ket ]
\end{indentingcode}

where \textit{nonterminal token list} is a list of nonterminal tokens
separated by commas.
Lexeme statements may be placed in any configuration section, and
there may be any number of them.

When you specify that a token is to be disregarded, AnaGram rewrites
your grammar so that the token will be passed over whenever it occurs
at the beginning of a file or following a lexical unit, or
\agterm{lexeme}.  If you have no \agparam{lexeme} statement, then the
lexemes in your grammar are just the terminal tokens.

The \agparam{lexeme} statement allows you to specify that certain
nonterminal tokens are also to be treated as lexemes.  This means that
the disregard token will be skipped following the lexeme, but not
between the characters that constitute the lexeme.
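
As an illustrative sketch (the names \agcode{space}, \agcode{digit},
and \agcode{number} are hypothetical), consider:

\begin{indentingcode}{0.4in}
{}[
  disregard space
  lexeme \bra number \ket
]
space = ' ' + '{\bs}t'
digit = '0-9'

number
  -> digit
  -> number, digit
\end{indentingcode}

Without the \agparam{lexeme} statement, each digit would be a lexeme
in its own right, so by the rules above blanks could be skipped
between the digits of a \agcode{number}.  With it, blanks should be
skipped only after a complete \agcode{number}.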

Lexemes correspond to the tokens that a lexical scanner, if you were
using one, would commonly identify and pass to a parser as single
tokens.  You don't usually wish to disregard white space within these
tokens.  For example, in a grammar for a conventional programming
language where blank characters are to be disregarded, you might
include:

\begin{indentingcode}{0.4in}
{}[ lexeme \bra string, character constant, name, number \ket ]
\end{indentingcode}

since blank characters must not be overlooked within strings and
character constants and should not be permitted within names or
numbers.

Normally, AnaGram considers the disregard token to be optional.
However, there are circumstances where treating the disregard token as
optional would lead to conflicts: two successive names, or two
successive numbers, for example.  In such cases, you would like to
require that the lexemes be separated by instances of the disregard
token.  To do this, simply set the
\index{Distinguish lexemes}\index{Configuration switches}
\agparam{distinguish lexemes}
configuration switch.
When this switch is set, AnaGram requires disregard tokens in those
situations where making them optional would lead to conflicts.
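
A sketch of a configuration section that sets the switch is shown
below.  It assumes that, like other configuration switches,
\agparam{distinguish lexemes} is turned on simply by naming it in a
configuration section; the token names are hypothetical:

\begin{indentingcode}{0.4in}
{}[
  disregard white space
  lexeme \bra name, number \ket
  distinguish lexemes
]
\end{indentingcode}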

White space may be used explicitly within definitions of lexeme tokens
in your grammar if desired, without causing conflicts. Thus, if you
wish to allow embedded space in variable names, you might write:

\begin{indentingcode}{0.4in}
{}[
  disregard space
  lexeme \bra variable name \ket
]
space = ' ' + '{\bs}t'
letter = 'a-z' + 'A-Z'
digit = '0-9'

variable name
  -> letter
  -> variable name, letter + digit
  -> variable name, space..., letter + digit
\end{indentingcode}

\subsection{Enum Statement}
\index{Enum statement}\index{Enumeration}\index{Token}

The \agparam{enum} statement follows rules nearly identical to those
for C and C++.  This makes it possible to copy an enum statement from
your syntax file to a program file written in either C or C++, without
any need for editing.  The only differences are that AnaGram makes no
provision for blank lines within the enumeration list, nor does it
accept a type name.  The \agparam{enum} statement is equivalent to a
corresponding set of definition statements.  It is especially useful
when a parser is accepting token input from another program, a
\index{Lexical scanner}lexical scanner, for example.  Using
the enum statement you may conveniently define all the identification
codes for the input tokens.

Each entry in an enum statement may be either a name, or a name
followed by an ``='' sign and a character representation.  If there is
a character representation the name is assigned the value of the
specified character.  Otherwise it is assigned a value one more than
that assigned to the previous name.  If the first name in the list is
not given an explicit value, it will be given the value zero.  For
example:

\begin{indentingcode}{0.4in}
{}[
  enum \bra
    eof, a,b,c,
    blank = '\ ', x, y
  \ket
]
\end{indentingcode}

is equivalent to the following definition statements:

\begin{indentingcode}{0.4in}
eof = 0
a = 1
b = 2
c = 3
blank = '\ '
x = 33
y = 34
\end{indentingcode}
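
For the lexical scanner case mentioned above, a sketch might look like
the following.  The token names are hypothetical, and the scanner is
assumed to use the same codes:

\begin{indentingcode}{0.4in}
{}[
  enum \bra
    eof, identifier, number, string
  \ket
]
\end{indentingcode}

By the rules above, this assigns \agcode{eof} the value 0,
\agcode{identifier} the value 1, \agcode{number} the value 2, and
\agcode{string} the value 3.  Using the same enumeration in the
scanner's C or C++ source keeps the two sets of codes in step.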

\subsection{Subgrammar Declarations}
\index{Subgrammar declaration}\index{Declaration}

A \agparam{subgrammar} declaration can be a useful way to deal with
conflicts in certain situations.  It tells AnaGram to treat each token
listed in the declaration as though it were a grammar token,
specifying a complete subgrammar in itself, and, in determining
shift and reduction actions, to ignore the usage of the token in the
larger grammar.

In some cases it is perfectly reasonable to ignore usage.  The most
common example occurs when building a lexical scanner for a language
such as C, as in the example in Section 7.4.4.  In this case, you can
write a complete grammar for a C token with no difficulty.  But if you
try to extend it to a sequence of tokens, you get scores of conflicts.
This situation arises because you specify that any C token can follow
another, when in actual practice, an identifier, for example, cannot
follow another identifier without some intervening space or
punctuation.

It is theoretically possible, but in practice quite awkward, to write
a grammar for a sequence of tokens so that there are no conflicts.
The subgrammar declaration provides a way around this problem by
telling AnaGram that when it is looking for reducing tokens for any
rule produced directly or indirectly by a subgrammar token, it should
disregard the usage of the token and only consider usage internal to
the definition of the subgrammar token, as though the subgrammar token
were the start token of the grammar.

The subgrammar declaration is made in a configuration section and
consists of the keyword \agcode{subgrammar} followed by a list of one
or more nonterminal token names, separated by commas and enclosed in
braces (\bra \ket). For example:

\begin{indentingcode}{0.4in}
{}[ subgrammar \bra C token, word \ket ]
\end{indentingcode}

Since the subgrammar declaration changes the way AnaGram determines
reducing tokens, it should be used with caution.  You should be sure
that the conflicts you are eliminating are really inconsequential.

\subsection{Reserve Keywords Declaration}
\index{Reserve keywords}\index{Keywords}\index{Keyword anomalies}

The \agparam{reserve keywords} declaration can be used to specify a
list of keywords that are reserved and cannot be used except as
explicitly specified in the grammar.  This enables AnaGram to avoid
issuing meaningless keyword anomaly diagnostics (see \S 7.5).  AnaGram
does not automatically presume that keywords are also reserved words,
since in many grammars there is no need to specify reserved words.

The reserve keywords declaration is made in a configuration section
and consists of the words \agcode{reserve keywords} followed by a list
of one or more keyword strings, separated by commas and enclosed in
braces (\bra \ket). For example:

\begin{indentingcode}{0.4in}
{}[ reserve keywords \bra "int", "char", "float", "double" \ket ]
\end{indentingcode}

\subsection{Rename Macro Statement}
\index{Rename macro}\index{Macros}

AnaGram uses a number of macros in its generated code.  It is
possible, therefore, to run into naming collisions with other
components of your program.  The \agparam{rename macro} statement
allows you to change the name AnaGram uses for a particular macro to
avoid these problems.  For example, the Windows NT operating system
uses \agcode{CONTEXT} structures to perform various internal
operations.  If you use the context tracking option (see \S 9.5.4)
your parser will have a macro called \agcode{CONTEXT}.  To avoid the
name collision, add the following statement to any configuration
section in your grammar:

\begin{indentingcode}{0.4in}
rename macro CONTEXT AG{\us}CONTEXT
\end{indentingcode}

Then, simply use \agcode{AG{\us}CONTEXT} where you would otherwise have
used \agcode{CONTEXT}.
author	David A. Holland
date	Mon, 13 Jun 2022 00:06:39 -0400