AnaGram interim repo (temporary): doc/manual/sf.tex comparison

comparison doc/manual/sf.tex @ 0:13d2b8934445

Import AnaGram (near-)release tree into Mercurial.

author	David A. Holland
date	Sat, 22 Dec 2007 17:52:45 -0500

--1:000000000000
+:13d2b8934445
+\chapter{Syntax Files}
+\index{Syntax file}\index{File}
+Input files to AnaGram are called \agterm{syntax files}.  A syntax
+file comprises a grammar and associated C or C++ code.  The grammar
+consists of a number of productions along with supporting information
+such as configuration sections and definitions of character sets.  The
+associated code consists of reduction procedures (see \S 8.2.13) and
+embedded C or C++ code (\S 8.2.17).  This chapter explains the rules
+for writing syntax files acceptable to AnaGram.  The rules for
+interfacing your parser to the balance of your program are given in
+Chapter 9.
+\section{Lexical Conventions}
+\index{Lexical conventions}
+\subsection{Statements}
+\index{Statements}
+For purposes of this manual, AnaGram statements are considered to be
+productions, definition statements, configuration sections, and blocks
+of embedded C or C++ code, all discussed individually below.  Each
+statement must begin on a new line.  It is a good idea to separate
+statements visually in your file by using blank lines freely.
+There are generally no restrictions on the
+\index{Statements}\index{Order of statements}order of statements
+in a syntax file.  Good programming practice, however, suggests that
+definitions and configuration sections should precede the grammar
+itself.
+\subsection{Spaces and Tabs}
+\index{Spaces}\index{Tabs}
+AnaGram allows spaces and tabs to be used freely to improve the
+readability of grammars.  Spaces and tabs are ignored, except when
+embedded in a token name, in a character set definition, or in a
+keyword.  Within a token name, any sequence of spaces and tabs counts
+as a single space.
+\subsection{Continuation Lines}
+\index{Continuation lines}
+AnaGram statements normally end with a newline character or the end of
+file.  If AnaGram encounters the end of a line and the statement it is
+reading appears to be complete, it will not look for a continuation.
+To continue a statement to another line, just make sure that what you
+have on the first line is clearly incomplete.  For example,
+\begin{indentingcode}{0.4in}
+prep phrase -> preposition, "the", noun
+\end{indentingcode}
+looks complete to AnaGram, whereas
+\begin{indentingcode}{0.4in}
+prep phrase -> preposition, "the", noun,
+\end{indentingcode}
+looks incomplete because of the dangling comma at the end.
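+A production can thus be spread over several lines, provided every
+line but the last ends in a way that is clearly incomplete.  As a
+sketch (the token \agcode{modifier} is made up for illustration),
+AnaGram should read the following three lines as a single production:
+\begin{indentingcode}{0.4in}
+prep phrase -> preposition, "the",
+modifier,
+noun
+\end{indentingcode}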
+\subsection{Comments}
+\index{Comments}
+AnaGram accepts comments in accordance with the rules of C and C++,
+that is, normal C comments bracketed with \agcode{/*} and \agcode{*/},
+as well as comments which begin with \agcode{//} and continue to the
+end of line.  AnaGram also observes these conventions when skipping
+over embedded C code.
+Since the ANSI standard for C insists that normal C comments do not
+nest, AnaGram, by default, disallows nested comments.  You may,
+however, set a configuration parameter,
+\index{Nest comments}\index{Configuration switches}\index{Comments}
+\agparam{nest comments},
+to allow nested comments.  See Appendix A.  In any case, AnaGram will
+use the same convention for embedded C as it uses for AnaGram proper.
+You can change the convention in the middle of the file if necessary.
+AnaGram treats each comment delimited with \agcode{/*} and \agcode{*/}
+as though it were a single space. You can even put such comments in
+the middle of token names if you should want to.  A comment that
+begins with \agcode{//} is treated as though the end of line occurred
+at the \agcode{//}.
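+The following lines sketch these conventions.  The definition
+statement form \agcode{name = set} and the practice of turning a
+switch on by naming it in a configuration section are assumptions
+made for illustration (see Appendix A for the switch itself); only
+the comment handling is the point:
+\begin{indentingcode}{0.4in}
+/* A block comment counts as a single space. */
+digit = '0-9' // ignored from the slashes to the end of line
+prep /* even here */ phrase -> preposition, "the", noun
+{}[ nest comments ]
+/* outer /* inner */ still inside the outer comment */
+\end{indentingcode}
+The third line names the same token, \agcode{prep phrase}, as the
+examples above.  The last line is a single comment only because the
+preceding configuration section is assumed to turn on
+\agparam{nest comments}.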
+\subsection{Blank Lines and Form Feeds}
+\index{Blank lines}
+Because blank lines and form feeds are visual separators, AnaGram will
+not skip over either when looking for a continuation line.  Therefore
+blank lines and form feeds can occur only between AnaGram statements,
+not in the middle of a statement.
+It is a good idea to separate groups of productions with a blank line
+or two, lest an accidental dangling comma make AnaGram think the
+beginning of the next production is a continuation of the present one.
+\section{Elements of Grammars}
+\subsection{Names}
+\index{Name}\index{Token}
+You may use names to represent tokens, character sets, keywords and
+\index{Virtual productions}\index{Production}virtual productions.
+Names follow the same general rules as for any programming language,
+with the notable exception that they may have embedded white space.
+Names are made up of letters, digits, and underscores.  They may not
+begin with a digit.  Any sequence of embedded spaces, tabs or comments
+counts as a single space.  AnaGram distinguishes between upper and
+lower case\index{Case sensitivity}, so that \agcode{Word} and
+\agcode{word} are different names.  There is no particular limit to the
+length of a name.  There are no reserved words as such, although
+\agcode{grammar}, \agcode{eof}, and \agcode{error} will be treated as
+reserved words unless you take special action by setting appropriate
+configuration parameters.  The names AnaGram uses for
+\index{Configuration parameters}configuration parameters
+follow the same rules as for other names, except that
+\index{Case sensitivity}case
+is ignored.
+\subsection{Reserved Words}
+\index{Reserved words}\index{Words}
+% XXX shouldn't that be \index{Grammar token}?
+AnaGram treats tokens with the names \index{Grammar}\agcode{grammar},
+\index{Eof token}\index{Token}\agcode{eof}, and \index{Error
+token}\index{Token}\agcode{error} in a special manner unless certain
+measures are taken.  Since you can override AnaGram's use of these
+names, they are not reserved words in the true sense.
+If your grammar has a token named \agcode{grammar}, AnaGram will take
+that token to be the grammar token for your grammar unless you set the
+\index{Token}\index{Grammar token}\index{Configuration parameters}
+\agparam{grammar token}
+configuration parameter or mark some other token as the grammar token
+using ``\index{ \_dol}\$''.% See below ???.
+If your grammar has a token named \agcode{error} and you take no
+further steps, AnaGram will assume you wish to use error token
+resynchronization in case of
+\index{Syntax error}\index{Errors}syntax error.  See Chapter 9.
+If you wish to use some other token as an error token you
+may select it using the
+\index{Configuration parameters}\index{Token}\index{Error token}
+\agparam{error token} configuration parameter.
+If you wish to use \agcode{error} as a token name, but do not want
+error token resynchronization, you may set the \agparam{error token}
+configuration parameter to any name that is not used in your grammar.
+You may then use \agcode{error} as a token name without causing
+AnaGram to include error token resynchronization in your parser.
+\index{Resynchronization}
+If you select automatic resynchronization or error token
+resynchronization (see Chapter 9), AnaGram will look for a token
+called \agcode{eof} to use as an end of file indicator.  You may
+either name your end of file token \agcode{eof} or you may set the
+\agparam{eof token} configuration parameter with the name of your end
+of file token.
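+As a sketch of two of the overrides described above, written in the
+same configuration section notation this chapter uses for
+\agparam{grammar token} (the token names are invented for
+illustration):
+\begin{indentingcode}{0.4in}
+{}[ error token = no such token ]
+{}[ eof token = end of input ]
+\end{indentingcode}
+Provided no token in the grammar is actually called \agcode{no such
+token}, the first line leaves \agcode{error} free for use as an
+ordinary token name.  The second tells AnaGram to use \agcode{end of
+input} rather than \agcode{eof} as the end of file token.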
+\subsection{Variable Names}
+\index{Name}\index{C variable names}
+With AnaGram you can associate C/C++ variable names with the
+\index{Semantic value}\index{Token}\index{Value}semantic values of
+tokens for use in your \index{Reduction procedure}reduction
+procedures.  Each name follows the corresponding token in the grammar
+rule on the right of the production, separated from the token by a
+colon.  AnaGram allows variable names made up of letters, digits, and
+underscores.  They may not begin with a digit.  Embedded spaces, tabs,
+or comments are not allowed, of course.  AnaGram imposes no
+restriction on length, but uses your variable names just as you have
+written them in the code it generates to call reduction procedures.
+Remember that your compiler may have a limit on the length of variable
+names.  Also, AnaGram itself uses C variable names beginning with
+\agcode{ag{\us}}.  It is therefore wise to avoid using names of this form.
+\subsection{Terminal Tokens}
+\index{Terminal token}\index{Token}
+A \agterm{terminal token} is a token which does not appear on the left
+side of a production.  It represents, therefore, a basic unit of input
+to your parser.  You have several options with respect to terminal
+tokens.  If the input to your parser consists of ASCII characters, you
+may define terminal tokens explicitly as ASCII characters or as sets
+of ASCII characters.  If you have an input procedure which produces
+numeric codes, you may define the terminal tokens directly in terms of
+these numeric codes.  On the other hand, you may leave the terminal
+tokens completely undefined.  In this case, you must provide an input
+procedure which can determine the appropriate
+\index{Token}\index{Token number}\index{Number}token numbers.
+It is an all or none situation.  If you provide any explicit
+definitions, you must provide them for all terminal tokens.  Input
+procedures and token input are discussed in Chapter 9.  Examples of
+non-character input may be found in the Macro Preprocessor example in
+the \agfile{examples/mpp} directory in your AnaGram
+distribution.% Further examples are given in Chapter ???.
+\subsection{Character Representations}
+\index{Character representations}
+In specifying admissible input characters you may use \index{Character
+constants}character constants following the normal C conventions.
+Remember that a character constant may specify only a single
+character.  Although some C compilers will allow constructs such as
+\agcode{'mv'}, AnaGram doesn't allow this.  AnaGram recognizes the
+same escape sequences as C, including octal and hex sequences, even
+though this is, strictly speaking, unnecessary.  The escape sequences
+AnaGram recognizes are:
+%
+% It would be nice to be able to just write this and tell latex to set
+% it in three columns. but no... that would be too easy.
+%
+%
+%\begin{tabular}{ll}
+%\agcode{{\bs}a}&alert (bell) character\\
+%\agcode{{\bs}b}&backspace\\
+%\agcode{{\bs}f}&formfeed\\
+%\agcode{{\bs}n}&newline\\
+%\agcode{{\bs}r}&carriage return\\
+%\agcode{{\bs}t}&horizontal tab\\
+%\agcode{{\bs}v}&vertical tab\\
+%\agcode{{\bs\bs}}&backslash\\
+%\agcode{{\bs}?}&question mark\\
+%\agcode{{\bs}'}&single quote\\
+%\agcode{{\bs}"}&double quote\\
+%\agcode{{\bs}ooo}&octal number\\
+%\agcode{{\bs}xhh}&hexadecimal number\\
+%\end{tabular}
+\begin{indenting}{0.4in}
+\begin{tabular}{llllll}
+\agcode{{\bs}a}&alert (bell) character&
+\agcode{{\bs}t}&horizontal tab&
+\agcode{{\bs}'}&single quote\\
+\agcode{{\bs}b}&backspace&
+\agcode{{\bs}v}&vertical tab&
+\agcode{{\bs}"}&double quote\\
+\agcode{{\bs}f}&formfeed&
+\agcode{{\bs\bs}}&backslash&
+\agcode{{\bs}\textit{ooo}}&octal number\\
+\agcode{{\bs}n}&newline&
+\agcode{{\bs}?}&question mark&
+\agcode{{\bs}x\textit{hh}}&hexadecimal number\\
+\agcode{{\bs}r}&carriage return\\
+\end{tabular}
+\end{indenting}
+\bigskip
+The octal escape sequence allows up to three octal digits, in
+accordance with ANSI specifications for C.  The hexadecimal numbers
+may contain an arbitrary number of digits; however, AnaGram will
+truncate the result to sixteen bits.
+A backslash followed by any character other than those listed above
+will cause a syntax error.
+You may also represent characters by writing the numeric code
+explicitly, in decimal, octal, or hexadecimal representations.
+AnaGram follows the C conventions for integer constants: a leading
+\agcode{0} means the number is octal, a leading \agcode{0x} or
+\agcode{0X} means it is hexadecimal. The hex digits \agcode{a-f} may
+be either upper or lower case\index{Case sensitivity}.  Numbers may be
+preceded by an optional minus sign.
+If your parser uses a pre-existing \index{Lexical scanner}lexical
+scanner and you wish to use the code numbers it generates to identify
+tokens, you may simply treat those code numbers as character numbers.
+You may use the numbers directly in your productions, or you may use
+definition statements to name them.  You may also use an
+\agparam{enum} statement within a configuration section to attach
+names to the code numbers.
+% XXX shouldn't this use of enum be indexed?
+AnaGram also allows a special notation for control characters.  You
+may represent a control character by using the ``\^{}'' character
+preceding any printing ASCII character. Thus you can write
+\agcode{\^{}z} or \agcode{\^{}Z} to represent the DOS end-of-file
+character.  Notice that quotation marks are not necessary.
+Examples of character representations:
+\begin{indenting}{0.4in}
+\begin{tabular}{cccc}
+\agcode{'K'}&\agcode{-1}&\agcode{0}&\agcode{'{\bs}t'}\\
+\agcode{\^{}J}&\agcode{'{\bs}xff'}&\agcode{077}&\agcode{0XF3}\\
+\end{tabular}
+\end{indenting}
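+Assuming the ASCII character set, the examples above stand for the
+following character codes, given here in decimal:
+\begin{indenting}{0.4in}
+\begin{tabular}{lrlr}
+\agcode{'K'}&75&\agcode{\^{}J}&10\\
+\agcode{-1}&$-1$&\agcode{'{\bs}xff'}&255\\
+\agcode{0}&0&\agcode{077}&63\\
+\agcode{'{\bs}t'}&9&\agcode{0XF3}&243\\
+\end{tabular}
+\end{indenting}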
+\subsection{Character Ranges}
+\index{Character range}\index{Range}
+It is convenient to be able to specify ranges of characters when
+writing a grammar.  AnaGram supports several ways of representing
+ranges of characters.  The first is an extension of the notation for
+character constants: \agcode{'a-z'} is the set of lower case
+characters.  You can even use escape sequences such as
+\agcode{'{\bs}n-{\bs}r'} if you like.  The order of
+characters used to specify the range is immaterial: \agcode{'z-a'} is
+the same as \agcode{'a-z'}.  AnaGram will, however, issue a warning
+just in case the unusual order results from a clerical error.
+The second way to specify a range is by using two arbitrary character
+representations, as described above, separated by two dots.  For
+example, \agcode{\^{}C..\^{}Z}, \agcode{3..26}, \agcode{3..032},
+\agcode{3..0x1a}, and \agcode{\^{}C..0x1a}, all represent the same
+range of characters.  Similarly, \agcode{'A-F'}, \agcode{'A'..'F'},
+\agcode{0101..0106}, \agcode{0x41..0x46}, \agcode{65..70}, and
+\agcode{65..'F'} all represent the same range of characters.
+\subsection{Character Sets}
+\index{Character sets}
+If you provide explicit definitions for terminal tokens, the basic
+input unit for your parser will be considered a character set, even if
+your input procedure provides numeric codes that are not actually
+characters.  As a terminal token, a character set will be matched by
+any input character that is a member of the set.  Character sets may
+be named in definition statements, but they may also appear on the
+right sides of productions without being named.
+A character set may consist of one or more characters.  You can
+specify a character set that consists of a single character by using
+any of the character representation methods described above.  You can
+specify a set consisting of a range of characters by using any of the
+representations of character ranges described above.
+\index{Character sets}
+To specify more complicated sets, you can write
+\index{Expressions}\index{Set expressions}expressions
+using conventional set theoretic operations.
+In AnaGram input, these operations are specified as follows:
+\index{Union}\index{Difference}\index{Intersection}\index{Complement}
+\begin{indenting}{0.4in}
+\begin{tabular}{cl}
+\agcode{A + B}&(union)\\
+\agcode{A - B}&(difference)\\
+\agcode{A \& B}&(intersection)\\
+\agcode{\~{}A}&(complement)\\
+\end{tabular}
+\end{indenting}
+where \agcode{A} and \agcode{B} are arbitrary sets.  Union and
+difference have the same precedence.  Intersection has higher
+precedence and complement has the highest precedence.  Thus in the
+expression
+\begin{indentingcode}{0.4in}
+A + \~{}B\&C
+\end{indentingcode}
+the complement operation is performed first, then the intersection,
+and finally the union.
+Watch out!  In an AnaGram syntax file \agcode{65 + 97} represents the
+character set which consists of lower case \agcode{a} and upper case
+\agcode{A}.  It does not represent 162, the sum of 65 and 97.
+Parentheses may be used to force the order of evaluation:
+\begin{indentingcode}{0.4in}
+\~{}(A \& (B+C))
+\end{indentingcode}
+In this example the union of \agcode{B} and \agcode{C} is calculated,
+then the intersection of this set with \agcode{A} is calculated, and
+finally the complement is evaluated.
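+By the same precedence rules, the unparenthesized expression
+\agcode{A + \~{}B\&C} discussed earlier should denote the same set as
+its fully parenthesized form:
+\begin{indentingcode}{0.4in}
+A + ((\~{}B) \& C)
+\end{indentingcode}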
+The computation of the \index{Complement}complement of a
+\index{Character sets}set requires a definition of the
+\index{Universe}universe of set elements.  AnaGram will define the
+universe to be the set of unsigned 8-bit characters, unless one or
+more characters outside that range have been specified.  In that case,
+the universe will be the interval that runs from the lesser of zero
+and the lowest character code used up to the greater of 255 and the
+highest character code used.  The complement of a
+character set is everything in this universe except the characters in
+the set.
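+Two worked cases may help.  Both are illustrations that assume ASCII
+codes; they are not output from AnaGram.  If the character codes a
+grammar mentions all lie between \agcode{-1} (say, an end of file
+indicator) and \agcode{'z'}, the universe is \agcode{-1..255}, and
+\begin{indentingcode}{0.4in}
+\~{}'{\bs}n'
+\end{indentingcode}
+is the same set as
+\begin{indentingcode}{0.4in}
+-1..9 + 11..255
+\end{indentingcode}
+since newline is character 10.  If instead the codes mentioned run
+from \agcode{'a'} up to \agcode{0x3ff}, the universe is
+\agcode{0..0x3ff} and the same complement is
+\agcode{0..9 + 11..0x3ff}.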
+Characters which make up part of the character universe, but are not
+legitimate input according to your grammar, are lumped together into a
+special token which will cause an error if it occurs in your input.
+When your parser reads an input character, it uses that character to
+index a conversion table in order to determine the appropriate
+\index{Token number}\index{Token}\index{Number}token number.  If the
+\index{Range}\index{Test range}\index{Configuration switches}
+\agparam{test range} configuration switch
+is on, its default setting, your parser will include code to verify
+that the character is in bounds before it indexes the conversion
+table.  If you are satisfied that checking bounds is unnecessary, you
+may turn the \agparam{test range} switch off and get a slightly higher
+level of performance from your parser.
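+A sketch of turning the switch off follows.  It assumes a convention
+not described in this chapter, namely that a configuration switch is
+turned off by writing a tilde before its name; check Appendix A
+before relying on it:
+\begin{indentingcode}{0.4in}
+{}[ \~{}test range ]
+\end{indentingcode}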
+For efficient processing, it is best to keep the number of tokens to a
+minimum.  Therefore, if you can define a construct either as a token
+(with a production) or as a set (with a definition), prefer the set.
+Some useful character sets are:
+\begin{indenting}{0.4in}
+\begin{tabular}{ll}
+\agcode{'a-z' + 'A-Z'}&Alphabetic characters\\
+\agcode{'a-f' + 'A-F'}&Hex digits\\
+\agcode{'0-9'}&Decimal digits\\
+\agcode{0..127}&ASCII character set\\
+\agcode{32..126}&Printing ASCII characters\\
+\agcode{\~{}'{\bs}n'}&Anything but newline\\
+\agcode{\^{}Z}&Windows/DOS end of file indicator\\
+\agcode{-1}&Stream I/O end of file indicator\\
+\agcode{0}&String terminator\\
+\agcode{33..126 - 'a-z' - 'A-Z' - '0-9'}&Punctuation\\
+\end{tabular}
+\end{indenting}
+\bigskip
+Note that \agcode{'a-z'} is a range of characters but
+\agcode{32..126 - 'a-z'} is a set difference.
+When AnaGram encounters a character set in a grammar rule, it assigns
+a token number to the character set.  If it has previously seen the
+same character set it will assign the same token number; however, it
+assigns the same token number only if the set expressions are
+obviously the same.  Thus, AnaGram will assign the same token number
+every time it sees \agcode{A + B}, but will assign a different token
+number if it sees \agcode{B + A}.  Only when AnaGram has finished
+scanning the entire syntax file can it actually evaluate the character
+sets.  If it finds that several different tokens all refer to the same
+character set, it will create a single token that represents the true
+character set and create
+\index{Shell productions}\index{Production}``shell productions'' for
+the others.
+\index{Character sets}If the character sets you use in your grammar
+overlap, they do not properly represent
+\index{Terminal token}\index{Token}terminal tokens.
+To deal with this situation, AnaGram identifies all overlaps among
+character sets and extends your grammar by adding a number of extra
+productions.  For instance, suppose your grammar uses the following
+character sets as though they were terminal tokens:
+\begin{indentingcode}{0.4in}
+'a-z' + 'A-Z'
+'0-9'
+'0-7'
+'a-f' + 'A-F'
+\end{indentingcode}
+AnaGram will then modify your grammar by adding the following productions:
+\begin{indentingcode}{0.4in}
+'a-z' + 'A-Z'
+-> 'a-f' + 'A-F' | 'g-z' + 'G-Z'
+'0-9'
+-> '0-7' | '8-9'
+\end{indentingcode}
+Although the tokens \agcode{'a-z' + 'A-Z'} and \agcode{'0-9'} are
+technically now
+\index{Nonterminal token}\index{Token}nonterminal tokens,
+for purposes of determining the
+\index{Token}\index{Data type}data type of their
+\index{Semantic value}\index{Token}\index{Value}semantic values,
+AnaGram continues to regard them as terminal tokens.
+This \index{Partition}\index{Universe}\index{Character universe}
+``partitioning'' of the character universe is described in Chapter 6.
+\subsection{Keyword Strings}
+\index{Keywords}
+In your grammar, AnaGram recognizes character strings within double
+quotes (e.g., \agcode{"IF"}) as keywords.  The strings follow the same
+syntactic rules as strings in C.  The same escape sequences are
+honored.  AnaGram does not, however, allow for the concatenation of
+adjacent strings.  Note that AnaGram strings are used only for the
+definition of keywords in your grammar, not for messages to be
+displayed or printed.
+Keyword strings may not include null characters and must be at least
+one character long.  You may have any number of keywords.  Each is
+treated as a single terminal token.  A keyword may be given a name by
+means of a definition statement.  Keywords may appear in virtual
+productions.
+AnaGram's keyword recognition works in the following way.  First, for
+each state in your parser, AnaGram prepares a list of all the keywords
+that are admissible in that state.  Your parser will recognize a
+keyword \emph{only} if it is in an appropriate state; otherwise it
+will appear to be an anonymous sequence of characters.  Your parser,
+in any state, checks for keywords it expects before it checks for
+acceptable characters.  That is, \emph{keywords take precedence} over
+simple characters.  It does not look for keywords that would not be
+acceptable input.  The parser will do whatever lookahead is necessary
+in order to pick up the entire keyword.  Thus if the character
+\agcode{I} and the keyword \agcode{IF} are both legitimate input at
+some point, \agcode{IF} will be recognized, if present, in preference
+to \agcode{I}.  If several admissible keywords match the input, such
+as \agcode{IF} and \agcode{IFF}, the parser will select the longest
+match, \agcode{IFF} in this example.
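+To illustrate, suppose that at some point a grammar admits the
+keywords \agcode{"IF"} and \agcode{"IFF"} as well as the character
+\agcode{'I'}, as in this made-up production:
+\begin{indentingcode}{0.4in}
+connective -> "IF" | "IFF" | 'I'
+\end{indentingcode}
+By the rules above, the parser's choice should depend on the upcoming
+input as follows:
+\begin{indenting}{0.4in}
+\begin{tabular}{ll}
+Input begins with&Recognized as\\
+\agcode{IFF}&keyword \agcode{"IFF"} (longest match)\\
+\agcode{IF} followed by anything but \agcode{F}&keyword \agcode{"IF"}\\
+\agcode{I} followed by anything but \agcode{F}&character \agcode{'I'}\\
+\end{tabular}
+\end{indenting}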
+AnaGram does not incorporate keywords into its character sets.
+Keywords stand apart and should not appear in definitions of character
+sets.  In particular, they are not considered as belonging to the
+complement of a character set.  Thus for the production
+\begin{indentingcode}{0.4in}
+next char -> \~{}('{\bs}n' + \^{}Z)
+\end{indentingcode}
+a keyword would not be considered legitimate input.
+Note also that a keyword consisting of a single character does not
+belong to the character universe.  Because of this fact, AnaGram's
+treatment of \agcode{'X'} and \agcode{"X"} is very different.  If this
+seems confusing at first, try using only keywords which are at least
+two characters long until you have some experience with them.
+AnaGram's keyword recognition logic normally does not make any
+assumptions about what precedes or follows a keyword.  Thus if
+\agcode{int} is a keyword, your parser will be capable of plucking it
+out of a string of characters such as \agcode{disintegrate} if,
+according to your grammar, it could follow \agcode{dis}.  The
+\agparam{sticky} declaration and the \agparam{distinguish keywords}
+statement, described below, can prevent such unwanted recognition of
+keywords.  A keyword following a \agparam{sticky} token will not be
+recognized if the first character of the keyword can be shifted in as
+part of the \agparam{sticky} token.  The \agparam{distinguish
+keywords} statement prevents recognition of a keyword if it is
+followed immediately by a character of the sort that makes up the
+keyword.
+\subsection{Type Specifications For Tokens}
+\index{Token}\index{Token type}\index{Type declarations}
+When you write productions or token declarations (see below), AnaGram
+allows you to specify the data type\index{Token}\index{Data type} of
+the \index{Semantic value}\index{Token}\index{Value}semantic value of
+a token by using a C or C++ data type specification.  The restrictions
+are that AnaGram does not allow specification of array or function
+types, nor explicit structure types.  Types that are defined with
+typedef statements, structure definitions, or class definitions,
+including template classes, in your embedded C or C++ are acceptable.
+Thus the following specifications, for example, are acceptable:
+\begin{indentingcode}{0.4in}
+void
+int
+char *
+unsigned long *near
+static float *far
+my{\us}type
+double *
+struct descriptor
+struct widget *
+vector <double> *
+\end{indentingcode}
+On the other hand, the following specifications are \emph{not} valid:
+\begin{indentingcode}{0.4in}
+int[20]
+int *(int, unsigned char)
+\bra int x,y; float z; \ket
+struct \bra int k; float z; \ket
+\end{indentingcode}
+Note that AnaGram itself does nothing with the type specifications. It
+simply passes them on to your compiler as appropriate.
+\subsection{Productions}
+\index{Production}
+Productions are the basic units of a grammar.  A production consists
+of a left side and a right side.  \index{Left side}The left side of a
+production consists of one or more token names, joined by commas,
+optionally preceded by a type specification enclosed in parentheses.
+\index{Right side}The right side begins with an arrow and may either
+begin on the same line as the left side or on a new line.  For
+example:
+\begin{indentingcode}{0.4in}
+program -> statement list, eof
+expression
+-> expression, plus, term
+(int) variable name, function name
+-> name:n = look{\us}up(n);
+\end{indentingcode}
+The part of the right side of a production following the arrow is
+called a \index{Grammar rule}\index{Rule}\agterm{grammar rule},
+discussed below.  A production need not have a right side at all.  In
+this case, it is simply called a
+\index{Declaration}\index{Token}\agterm{token declaration}.
+AnaGram assigns
+\index{Token number}\index{Token}\index{Number}token numbers
+to the token names on the left side, and, if there is a type
+specification, records the data type for each of the tokens declared.
+Declarations of this sort are most useful when using input from a
+\index{Lexical scanner}lexical scanner.  See Chapter 9 for a discussion
+of techniques for interfacing a lexical scanner to your parser.  If
+you do not intend to use a lexical scanner you will have no need for
+token declarations.
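+For instance, the following token declarations (the token names are
+invented for illustration) declare three tokens that a lexical
+scanner might supply, and record the type of each semantic value:
+\begin{indentingcode}{0.4in}
+(double) real constant
+(char *) identifier, string literal
+\end{indentingcode}
+Neither line has an arrow, so neither has a grammar rule.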
+If you do not explicitly specify the type for the
+\index{Semantic value}\index{Token}\index{Value}semantic value
+of a token, it will be determined by the configuration parameter
+\index{Default token type}\index{Configuration parameters}\index{Token}
+\agparam{default token type}
+if it is a \index{Nonterminal token}\index{Token}nonterminal token or
+by the \index{Configuration parameters}configuration parameter
+\index{Input token type}\index{Default input type}\agparam{default input type}
+if it is a \index{Token}terminal token.
+\agparam{Default token type} defaults to \agcode{void}.
+\agparam{Default input type} defaults to \agcode{int}.
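+As a sketch, a grammar whose nonterminal tokens mostly carry
+\agcode{double} values and whose input consists of ordinary
+characters might override both defaults.  The notation below follows
+the configuration section shown in this chapter for
+\agparam{grammar token}; the choice of types is only an example:
+\begin{indentingcode}{0.4in}
+{}[ default token type = double ]
+{}[ default input type = char ]
+\end{indentingcode}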
+If a production has more than one token on the left side, as in the
+third example above, it is called a
+\index{Semantically determined production}\index{Production}
+\agterm{semantically determined production}.  Semantically determined
+productions are a useful tool for exerting semantic control over
+syntactic analysis.  A semantically determined production should have
+a reduction procedure which determines on a case by case basis which
+of the tokens on the left side should be taken as the reduction token.
+If there is no reduction procedure, or if the reduction procedure does
+not make a choice, the reduction token will be the first syntactically
+correct token on the left side of the production.  In the example
+above, \agcode{variable name} will be the reduction token unless
+\agcode{look{\us}up} changes it to \agcode{function name}.  Semantically
+determined productions are discussed more fully in Chapter 9.
+If several productions have the same left side, it does not need to be
+repeated.  Subsequent right hand sides must each start on a new
+line.  For example:
+\begin{indentingcode}{0.4in}
+integer
+-> digit
+-> integer, digit
+name
+-> letter
+-> name, letter
+-> name, digit
+\end{indentingcode}
+On the other hand, you do not have to group productions with the same
+left side.  You could write the above productions as follows, although
+it would certainly not be good programming practice:
+\begin{indentingcode}{0.4in}
+name -> name, digit
+integer -> integer, digit
+name -> name, letter
+integer -> digit
+name -> letter
+\end{indentingcode}
+Nevertheless, there are a few occasions involving complex cross
+recursions and semantically determined productions where it is not
+possible to group productions neatly.
+The right side of a production can be empty.  Such a production is
+called a
+\index{Null productions}\index{Production}\agterm{null production}.
+Null productions are useful to denote an optional element in a
+grammar, or a list that may be empty.  For example:
+\begin{indentingcode}{0.4in}
+optional widget
+->
+-> widget
+optional qualifiers
+->
+-> optional qualifiers, qualifier
+\end{indentingcode}
+A second way to write multiple productions with the same left side
+uses the \index{Vertical bar}\index{|}vertical bar character, ``$|$'',
+to separate the grammar rules.  The productions given above for
+\agcode{name}, \agcode{optional widget}, and \agcode{optional
+qualifiers} can also be written:
+\begin{indentingcode}{0.4in}
+name -> letter | name, letter | name, digit
+optional widget
+-> | widget
+optional qualifiers
+-> | optional qualifiers, qualifier
+\end{indentingcode}
+Note that a null production cannot \emph{follow} a vertical bar.
+A token that has a null production is called a
+\index{Zero length token}\index{Token}\agterm{zero length token},
+since it can be represented by an empty sequence of input characters,
+that is to say, by nothing at all.  Furthermore, even if a token
+doesn't have any null productions, if it has at least one rule
+consisting entirely of zero length tokens it is also a zero length
+token.  In the Token Table window, AnaGram notes which tokens are zero
+length, because they can be a source of conflicts.
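+For example, given the productions for \agcode{optional widget} and
+\agcode{optional qualifiers} above, a token defined only by
+\begin{indentingcode}{0.4in}
+decoration -> optional widget, optional qualifiers
+\end{indentingcode}
+is also a zero length token, although it has no null production of
+its own.  The name \agcode{decoration} is made up for this
+illustration.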
+\subsection{Grammar Token}
+Every grammar must have a single token which produces the entire
+grammar.  This token is variously called the
+\index{Token}\index{Grammar token}\agterm{grammar token}, the
+\index{Goal token}\agterm{goal token} or the
+\index{Start token}\agterm{start token}.
+AnaGram provides several methods you may use to specify which token in
+your grammar is the grammar token.
+You may simply use the name \agcode{grammar} for the grammar token.
+If you wish to use some other more descriptive name for your grammar
+token, you may mark it with a following dollar sign when it appears on
+the left side of a production.  Alternatively, you may set the
+\index{Grammar token}\index{Configuration parameters}\agparam{grammar token}
+configuration parameter to specify the grammar token.  Here are
+examples of the methods:
+\begin{indentingcode}{0.4in}
+grammar
+-> [statement | newline]/...
+program \$
+-> [statement | newline]/...
+{}[ grammar token = program ]
+program
+-> [statement | newline]/...
+\end{indentingcode}
+If you should use more than one of these techniques, AnaGram resolves
+the issue in the following manner: A marked token or a configuration
+parameter setting always takes precedence over simply naming a token
+\agcode{grammar}.  If you mark more than one token or set the
+configuration parameter more than once, the last setting or mark wins.
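+For instance, in the following sketch (which assumes that
+\agcode{statement}, \agcode{newline} and \agcode{eof} are defined
+elsewhere) the grammar token is \agcode{program}, not
+\agcode{grammar}, because the mark takes precedence over the name:
+\begin{indentingcode}{0.4in}
+grammar
+-> [statement | newline]/...
+program \$
+-> grammar, eof
+\end{indentingcode}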
+\subsection{Grammar Rules}
+\index{Rule}\index{Grammar rule}
+The part of a production to the right of the arrow is more often
+called a \agterm{grammar rule}, or simply \agterm{rule}.  A grammar
+rule is a sequence of \index{Rule elements}\agterm{rule elements},
+joined by commas, as in the examples of productions given above.  Rule
+elements are token names, character set expressions, virtual
+productions, or immediate actions (see below).  Each rule element may
+be optionally followed by a parameter assignment.  The entire rule may
+be followed by an optional reduction procedure.  A \index{Parameter
+assignment}parameter assignment is a colon followed by a C variable
+name.  Here are some examples of rule elements with parameter
+assignments:
+\begin{indentingcode}{0.4in}
+'0-9':d
+integer:n
+expression:x
+declaration:declaration{\us}descriptor
+\end{indentingcode}
+The parameters you assign to tokens in your grammar rule become the
+formal parameters for your \index{Reduction procedure}reduction
+procedure.  The data type\index{Data type}\index{Reduction procedure
+arguments} of the parameter is determined by the data type for the
+semantic value of the token to which it is assigned.  If your grammar
+rule has parameter assignments, but does not have a reduction
+procedure, AnaGram will give you a warning in case the lack of a
+reduction procedure is an oversight.  If you don't need a reduction
+procedure you may safely ignore the warning.  On the other hand,
+AnaGram has no way to determine whether you have failed to make
+necessary parameter assignments.  You won't find out until you compile
+your parser, when your compiler will give you error messages for
+undefined symbols.
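+As an illustration, the following productions, a sketch that assumes
+the semantic value of a digit is its character code, assign the
+parameters \agcode{n} and \agcode{d} and use them in short form
+reduction procedures (described below):
+\begin{indentingcode}{0.4in}
+integer
+-> '0-9':d             = d - '0';
+-> integer:n, '0-9':d  = 10*n + d - '0';
+\end{indentingcode}
+If the parameter assignment \agcode{:n} were omitted from the second
+rule, AnaGram could not detect the omission, but your compiler would
+report \agcode{n} as an undefined symbol.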
+AnaGram assigns a unique rule number to each rule in your grammar.
+Rules are numbered sequentially as they are encountered in the syntax
+file.  AnaGram constructs rule zero itself.  Rule zero normally has a
+single element, the grammar token, unless you have a
+\agparam{disregard} statement in your grammar.  In this case there
+will be two elements.
+\subsection{Reduction Procedures}
+\index{Reduction procedure}
+% XXX somewhere in here it ought to say something like
+% ``in the parsing literature reduction procedures are often known as
+% \agterm{semantic actions}.''
+% Note that R. says there's some subtle difference between the usual
+% concept of semantic action and AG's concept of reduction procedure.
+% I don't know what this difference is and I hope she can recall it.
+%
+% D. thinks this note ought to be at the end; R. wants it at the top.
+A \agterm{reduction procedure} is a piece of C code which optionally
+follows a production.  The code is executed when your parser
+identifies the production in its input.  There are two forms for
+reduction procedures, a short form and a long form.  The short form
+consists of a single C expression.  The long form consists of an
+arbitrary block of C code.  When AnaGram builds a parser, it inspects
+the grammar rule to which the procedure is attached and identifies the
+parameters for the procedure.  It uses these parameters as the formal
+parameters for the procedure.
+If the
+\index{Macros}\index{Allow macros}\index{Configuration switches}
+\agparam{allow macros}
+configuration switch has not been turned off, AnaGram codes the
+reduction procedure as a macro definition whenever possible.
+Otherwise AnaGram codes it as a function definition.  AnaGram builds
+the name for a reduction procedure by appending its internal procedure
+number to the string \agcode{ag{\us}rp{\us}}.  Procedure numbers are
+assigned in the order in which the reduction procedures are
+encountered in the syntax file.
+Both long and short form reduction procedures are preceded by an equal
+sign which follows the production.  The short form consists of a C or
+C++ expression terminated by a semicolon.  When the grammar rule is
+reduced, the expression will be evaluated and its value will become
+the value of the reduction token.  The expression and the terminating
+semicolon must be entirely on a single line.  Note that, if you really
+need to make the expression longer than will fit on one line, you can
+embed a newline in a comment.  Some examples of short form reduction
+procedures are:
+% XXX is there anything we can do about the ugly underscores?
+\begin{indentingcode}{0.4in}
+=0;
+=1;
+=10*n + d-'0';
+=
+special{\us}processor(first{\us}parameter, second{\us}parameter);
+=word{\us}count++;
+=widget(constant{\us}1*parameter{\us}1 + constant{\us}2*parameter{\us}2  /*
+{} */ + constant{\us}3*parameter{\us}3);
+\end{indentingcode}
+A long form reduction procedure consists of an arbitrary block of C or
+C++ code, enclosed in braces (\bra \ket).  AnaGram will code the reduction
+procedure as a function.  To return a value for the reduction token,
+simply use the \agcode{return} statement.  There are effectively no
+restrictions on the content or length of a reduction procedure.  Of
+course, if there are unbalanced braces, unterminated comments or
+unterminated string literals, AnaGram will not be able to determine
+where the reduction procedure ends.  AnaGram treats
+\index{Comments}nested comments within a reduction procedure according
+to the value of the \index{Nest comments}\index{Configuration
+switches}\agparam{nest comments} configuration switch at the point
+where it encounters the reduction procedure.
+In practice, a reduction procedure should rarely be more than a few
+lines long, since a long procedure will hamper your overall view of
+your grammar.  Long
+reduction procedures should be written as separate named functions,
+and should either be included in the embedded C portion of your syntax
+file or should be included in a wholly separate module.  Here is an
+example of a long form reduction procedure:
+\begin{indentingcode}{0.4in}
+=\bra
+if (flag) \bra
+total += x;
+return identify(x);
+\ket
+else \bra
+total = 0;
+flag = 1;
+return init{\us}table(x);
+\ket
+\ket
+\end{indentingcode}
+If a rule does not have a reduction procedure, the semantic value of
+the reduction token will be set to the \index{Semantic
+value}\index{Token}\index{Value}semantic value of the first token in
+the rule, unless the rule is a \index{Null productions}null
+production.  In the latter case, the value of the reduction token will
+be set to zero.
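+The following sketch illustrates both cases; the token names are
+illustrative, and it assumes that \agcode{term} is defined elsewhere:
+\begin{indentingcode}{0.4in}
+expression
+-> term
+-> expression:x, '+', term:t  = x + t;
+optional minus
+->
+-> '-'
+\end{indentingcode}
+The first rule for \agcode{expression} has no reduction procedure, so
+the value of \agcode{expression} is simply the value of \agcode{term}.
+The null production for \agcode{optional minus} yields zero, and the
+second rule yields the value of the \agcode{'-'} token.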
+% XXX and what if zero isn't a valid value for the type? a compiler
+% error will occur.
+% XXX add something like
+%
+% Variables appearing in reduction procedures which do not have a
+% parameter assignment in the corresponding grammar rule can be
+% declared globally or (file)-statically in your embedded C, or
+% alternatively could be added to the parser control block using
+% the \agparam{extend pcb} statement (q.v. | See Section ....).
+% (Reword this.)
+%
+% Should also discuss the sequencing of reduction procedure calls
+% so that people understand what happens if you use such variables.
+%
+% also ``A reduction procedure can be used to terminate parsing for
+% semantic reasons''.
+%
+\subsection{Immediate Actions}
+\index{Immediate action}\index{Action}
+An immediate action is a rule element that consists of executable C or
+C++ code embedded within a grammar rule to be executed when it is
+encountered.  An immediate action is denoted by the use of an
+exclamation point, \index{!}``!''.  The content of an immediate action
+may be written following the rules for either long form or short form
+reduction procedures.  As with any other rule element, it must be
+separated from preceding and following rule elements by commas.  In
+the grammar for a simple desk calculator, one might write
+\begin{indentingcode}{0.4in}
+transaction
+-> !printf("\#");, expression:x = printf("\%d{\bs}n", x);
+\end{indentingcode}
+% XXX s/apparent/visible/
+Notice that the only apparent difference between an immediate action
+and a reduction procedure is that the immediate action is preceded by
+``!'' instead of ``=''.  The immediate action must be followed by a
+comma to separate it from the following rule element.
+Immediate actions may also be used in definitions:
+\begin{indentingcode}{0.4in}
+prompt = !printf("\#");
+\end{indentingcode}
+AnaGram implements an immediate action by creating a special token for
+it.  AnaGram then creates a single null production for the
+token. Finally, the immediate action is implemented as the reduction
+procedure for the null production.
+For example, you could implement \agcode{prompt} by writing a null production
+with a reduction procedure:
+\begin{indentingcode}{0.4in}
+prompt
+->      = printf("\#");
+\end{indentingcode}
+This production would be equivalent to the definition above.
+There are two ways, however, in which immediate actions differ from
+the equivalent null production.  Immediate actions may access any
+parameter assignments which precede them in the rule in which they
+occur.  On the other hand, there is no way to assign a data type to
+the semantic value, if any, returned by the immediate action.
+Therefore, the type is determined by your setting of the
+\index{Default token type}\index{Configuration parameters}
+\agparam{default token type} configuration parameter.
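+As a sketch of the first difference, the immediate action below uses
+the parameter \agcode{x}, which is assigned earlier in the same rule
+(the example assumes that \agcode{expression} has an integer value):
+\begin{indentingcode}{0.4in}
+transaction
+-> expression:x, !printf("\%d{\bs}n", x);, ';'
+\end{indentingcode}
+A null production written out separately, like the one for
+\agcode{prompt} above, would have no access to \agcode{x}.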
+\subsection{Virtual Productions}
+\index{Virtual productions}\index{Production}
+Virtual productions are a convenient short form notation for common
+grammatical constructs involving choice and repetition.  The notation
+represents an extension of notation commonly used in programming
+manuals.  A virtual production may be written in a grammar rule at any
+place where you could write a token name, even within another virtual
+production.  Note that use of virtual productions is never
+\emph{required}, since the equivalent productions can always be
+written out explicitly instead.
+When AnaGram encounters a virtual production, it replaces the virtual
+production with a new token and writes appropriate productions for the
+new token.  When you look at your syntax tables using AnaGram windows,
+you will see the productions that AnaGram generates.  AnaGram keeps a
+record of virtual productions, so that generally if you use the same
+virtual production a second time, you get the same set of tokens and
+productions that were generated the first time it was used.  This is
+not the case if the virtual productions contain reduction procedures
+or immediate actions, since AnaGram is not equipped to determine
+whether two pieces of C code are equivalent.  Thus, a virtual
+production that contains a reduction procedure will be unique and will
+not be reused.
+One disadvantage of virtual productions is that there is no way to
+specify the data type of the \index{Semantic value}\index{Virtual
+production}semantic value of a virtual production.  Therefore, if you
+have a reduction procedure within a virtual production, its return
+value must be consistent with the type defined by the \index{Default
+token type}\index{Configuration parameters}\agparam{default token type}
+configuration parameter.
+The simplest virtual production is the \index{Token}\index{Optional
+token}\agterm{optional token}.  If \agcode{x} is an arbitrary token
+name or set expression, you can indicate an optional \agcode{x} by
+writing \index{?}\agcode{x?}.  You may also indicate a repetition of
+\agcode{x} by using the ellipsis with either \agcode{x} or \agcode{x?}.
+\index{...}\index{Ellipsis}Thus \agcode{x...} represents
+one or more instances of \agcode{x} and \index{?...}\agcode{x?...}
+represents zero or more instances of \agcode{x}. For example:
+\begin{indentingcode}{0.4in}
+'+'?
+\end{indentingcode}
+can be used to represent an optional plus sign, that is, a choice
+between a plus sign and nothing at all.  Similarly,
+\begin{indentingcode}{0.4in}
+'{\bs}n'?...
+\end{indentingcode}
+represents an optional sequence of newline characters.
+\index{Brackets}\index{Braces}\index{\_opb\_clb}\index{[]}
+The next category of virtual productions uses brackets or braces to
+indicate a choice among a number of enclosed grammar rules separated
+by vertical bars.  A single rule may also be enclosed.  Note that
+\emph{rules}, with following reduction procedures, are allowed, not
+simply tokens.
+Braces are used to indicate that one option must be chosen.  Brackets
+are used to indicate the choice is optional, i.e. may be omitted
+altogether.  The ellipsis following a set of options within brackets
+or braces indicates the option may be repeated an indefinite number of
+times.
+You can use braces to indicate a simple choice among a number of
+options.  A Cobol grammar offers the following choice of equivalent
+keywords:
+\begin{indentingcode}{0.4in}
+\bra "RECORD", "IS"? | "RECORDS", "ARE"? \ket
+\end{indentingcode}
+\index{\_opb\_clb...}\index{ []...}
+You may use the ellipsis with braces to indicate an arbitrary positive
+number of repetitions of the choice:
+\begin{indentingcode}{0.4in}
+{\bra}type specifier | storage class specifier{\ket}...
+\end{indentingcode}
+This expression requires at least one type specifier or storage class
+specifier, but will accept any number.
+\index{[]}
+To make a choice optional, use brackets instead of braces.  An
+example, again drawn from a Cobol grammar, is:
+\begin{indentingcode}{0.4in}
+{}["LIMIT", "IS"? | "LIMITS", "ARE"?]
+\end{indentingcode}
+\index{[]...}
+Ellipses may be used with brackets to indicate an arbitrary number of
+choices that may be omitted altogether:
+\begin{indentingcode}{0.4in}
+{}[argument, [',', argument]...]
+\end{indentingcode}
+This expression describes an optional argument list with arguments
+separated by commas.
+If you use a null production within braces, it must be the first option:
+\begin{indentingcode}{0.4in}
+\bra | '+' | '-' \ket
+\end{indentingcode}
+Normally, you would do this only if you wanted to attach a reduction
+procedure to the null production.  Note that if you include a null
+production within braces, and add an ellipsis after the closing brace
+for repetition, your grammar will be ambiguous.  Just exactly how many
+times does the null production occur?  Use brackets instead, and omit
+the null production.
+Null productions are not allowed with brackets, since they would be
+intrinsically ambiguous.
+The options within braces or brackets may be grammar rules of any
+length or complexity and may themselves contain virtual productions of
+arbitrary complexity.  Nevertheless, in practice, clarity suffers as
+soon as the options get very complex.  Virtual productions are most
+important and useful when used in simple situations.  In those
+situations they will enhance the clarity of your grammar.
+Here is an example that is moderately complex, even though each rule
+consists of a single token:
+\begin{indentingcode}{0.4in}
+\bra{\bra}"on" | "true"\ket = 1; | {\bra}"off" | "false"\ket = 0; | integer\ket
+\end{indentingcode}
+This example can be used to allow as input either an integer or, for
+special cases, keywords.  You could write this option out in the
+following way:
+\begin{indentingcode}{0.4in}
+p1
+-> p2   = 1;
+-> p3   = 0;
+-> integer
+p2
+-> "on"
+-> "true"
+p3
+-> "off"
+-> "false"
+\end{indentingcode}
+The final category of virtual production provides a notation for
+\index{Alternating sequence}\agterm{alternating sequences}.  An
+alternating sequence is a set of choices which may be repeated
+arbitrarily subject to the side condition that no choice may follow
+itself, in other words, that the choices must alternate.  Alternating
+sequences are written with either brackets or braces depending on
+whether the sequence is optional or not, followed by
+\index{/...}``\agcode{/...}''.  Note that the choices themselves may
+allow sequences.  For example:
+\begin{indentingcode}{0.4in}
+program
+-> [statement | newline...]/..., eof
+\end{indentingcode}
+represents a sequence of statements separated by one or more newlines.
+Any two statements must be separated by one or more newline
+characters, and newlines may also appear at the beginning and the end
+of the program.
+Null productions are not allowed within alternating sequences, since
+they are intrinsically ambiguous in all cases.
+\subsection{Definition Statements}
+\index{Definitions}\index{Definition statement}\index{Statement}
+A definition statement is simply a shorthand way of naming a character
+set, a \index{Virtual productions}\index{Production}virtual
+production, a keyword string, or an immediate action.  It can also be
+used for providing an alternate name for a token. Definitions have the
+form:
+\begin{indentingcode}{0.4in}
+name = \codemeta{character set}
+name = \codemeta{virtual production}
+name = \codemeta{keyword}
+name = \codemeta{immediate action}
+name = \codemeta{token name}
+\end{indentingcode}
+The name may be any name acceptable to AnaGram.  The name can then be
+used anywhere you might have used the expression on the right
+side.  \index{!}For example:
+\begin{indentingcode}{0.4in}
+upper case letter = 'A-Z'
+lower case letter = 'a-z'
+letter = upper case letter + lower case letter
+statement list = statement?...
+while keyword = "WHILE"
+prompt = !printf("Please enter name:");
+\end{indentingcode}
+It is important to recognize that a definition statement that names a
+set does not define a token.  A token is defined only when the set is
+used in a grammar rule, and then only if the set is used directly, not
+in combination with some other set.  Furthermore, if you use a
+character set directly in a grammar rule, and in some other rule you
+use a name that refers to the same set of characters, you will get two
+different tokens.  For example, if you have defined \agcode{upper case
+letter} as in the above example and use both \agcode{upper case
+letter} and \agcode{'A-Z'} in grammar rules, AnaGram will assign
+different \index{Token number}\index{Token}\index{Number}token numbers
+to accommodate any differences in attributes you may assign to the
+tokens.
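+For example, in the following sketch (the token names \agcode{initial}
+and \agcode{grade} are illustrative), AnaGram creates two distinct
+tokens, one for \agcode{upper case letter} and one for \agcode{'A-Z'},
+even though both match the same characters:
+\begin{indentingcode}{0.4in}
+upper case letter = 'A-Z'
+initial
+-> upper case letter
+grade
+-> 'A-Z'
+\end{indentingcode}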
+Renaming tokens is a convenient way to connect two independently
+written portions of a grammar.
+% See the C grammar in the EXAMPLES directory of your distribution
+% disk for an example.
+\subsection{Embedded C}
+\index{Embedded C}
+You may encapsulate C or C++ code in your syntax file by enclosing it
+in braces (\bra \ket).  Such pieces of code are copied to the parser file
+untouched, in the order they are found in the syntax file.  There may
+be any number of such pieces of embedded C.  The only restriction is
+that they must not start on the same line as some other AnaGram
+statement, and following AnaGram statements must also start on fresh
+lines.
+Normally, the blocks of embedded C in your syntax file are copied to
+the parser file \emph{following} a set of definitions and declarations
+AnaGram needs for the code it generates.  However, if the \emph{first}
+statement in your \index{Syntax file}syntax file is a block of
+embedded C, it will \emph{precede} AnaGram's definitions and
+declarations.  This block of embedded C is called the
+\index{Prologue}\index{C prologue}``C prologue''.  There are two main
+reasons for this special treatment.  First, you may want to have a
+title and \index{Copyright notice}copyright notice in your parser.  If
+you include them in an initial block of embedded C they will be right
+at the beginning of both your syntax file and your parser file.
+Second, if some of your tokens have data type\index{Token}\index{Data
+type}s other than those predefined in C or C++, you may include the
+definitions here, so they will be available to the code AnaGram
+generates.
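+A C prologue along the following lines would serve both purposes.
+This is only a sketch, and the type name \agcode{position} is purely
+illustrative:
+\begin{indentingcode}{0.4in}
+\bra
+/* Desk calculator grammar.  Copyright notice goes here. */
+typedef struct \bra int line, column; \ket position;
+\ket
+\end{indentingcode}
+Provided this block is the first statement in the syntax file, the
+definition of \agcode{position} precedes AnaGram's own definitions and
+declarations in the parser file.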
+AnaGram scans embedded C only insofar as is necessary to find the
+closing right brace.  Therefore any braces used within embedded C must
+balance properly.  AnaGram skips braces enclosed in character
+constants and string literals, as well as braces enclosed in
+comments.  It also recognizes C++ style comments that begin with
+\agcode{//}.  \index{Comments}Treatment of nested versus non-nested comments
+is controlled by the
+\index{Nest comments}\index{Configuration switches}\agparam{nest comments}
+configuration switch.  AnaGram will use the setting of this
+switch in effect at the beginning of the section of embedded C.
+AnaGram, of course, can be confused by unterminated strings,
+unbalanced braces, and unterminated comments.  The most likely
+outcome, in such a situation, is that AnaGram will encounter the end
+of file looking for the end of the embedded C.  Should this happen,
+AnaGram will identify the beginning of the piece of embedded C which
+caused the problem.
+The code you include as embedded C, of course, has to coexist with the
+code AnaGram generates.  In order to keep the potential for conflicts
+to a minimum, all variables and functions which AnaGram defines begin
+either with the name of your parser or with the letters
+\agcode{ag{\us}}.  You should avoid variable names which begin with these
+letters.
+Reduction procedures are copied to the \index{Parser
+file}\index{File}parser file in the order in which they are defined
+\emph{following} all of the embedded C.  Thus your reduction
+procedures may freely use variables and macros defined anywhere in
+your embedded C.
+\subsection{Configuration Sections}
+\index{Configuration section}
+A configuration section is a special section of your syntax file
+enclosed in brackets.  Within a configuration section you may set the
+values of configuration parameters or switches, or you may use one or
+more of several available attribute statements to specify special
+treatment for certain tokens.  There can be as many or as few
+configuration sections in your syntax file as you wish.  Each
+configuration section must begin on a new line.  Any AnaGram statement
+which follows a configuration section must also begin on a new line.
+Within a configuration section, each parameter setting and each
+attribute statement must begin on a new line.  The rules for using
+comments and continuation lines are the same as for the rest of
+AnaGram.
+Configuration parameters control the way AnaGram interprets your
+syntax file and the way it builds your parser.  A full discussion of
+the use of configuration parameters, including a complete discussion
+of each parameter and its default value, is given in Appendix A.
+\index{Attribute statements}\index{Statement}
+Attribute statements comprise the
+\index{Precedence declarations}precedence declarations \agparam{left},
+\agparam{right}, and \agparam{nonassoc}; the \agparam{sticky}
+declaration; the \agparam{distinguish keywords} statement; the
+\agparam{hidden} declaration; the \agparam{disregard} and
+\agparam{lexeme} statements; the \agparam{enum} statement; the
+\index{Reserve keywords}\agparam{reserve keywords} declaration; and
+the \index{Rename macro}\agparam{rename macro} statement.
+The precedence declarations and the
+\index{Sticky declaration}\index{Declaration}\agparam{sticky}
+declaration may be used to resolve conflicts in your grammar.  The
+\agparam{distinguish keywords} statement may be used to control
+keyword recognition.  The
+\index{Hidden declaration}\index{Declaration}\agparam{hidden}
+declaration causes certain token names not to be used when your parser
+produces
+\index{Syntax error}\index{Errors}\index{Error messages}syntax error
+messages.  You may use the \agparam{disregard} and \agparam{lexeme}
+statements to cause your parser to skip automatically over certain
+tokens in its input.  The \agparam{enum} statement is almost identical
+to the enum statement in C.  It can be used to assign names to input
+codes in grammars which are taking input from a \index{Lexical
+scanner}lexical scanner or another parser.  The
+\index{Reserve keywords}\agparam{reserve keywords} declaration allows
+you to specify certain keywords as reserved words.  The
+\index{Rename macro}\agparam{rename macro} statement allows you to
+override the names AnaGram uses for various macro definitions it
+creates in the code it generates.
+Attribute statements are discussed below. Except for
+\agparam{disregard} and \agparam{rename macro} statements, attribute
+statements accept lists of operands enclosed in braces (\bra \ket)
+and separated by commas.  A dangling comma following the last item in
+a list will be ignored.
+\subsection{Setting Configuration Parameters}
+\index{Configuration parameters}\index{Parameters}
+Each configuration parameter has a name that follows the AnaGram
+conventions for symbol names, except that AnaGram ignores
+case\index{Case sensitivity} when looking up configuration parameter
+names.
+There are a number of varieties of configuration parameters.  The
+simplest,
+\index{Configuration switches}\index{Switches}configuration switches,
+simply turn some feature of AnaGram on or off.  These parameters need
+simply be stated to turn the feature on, or negated with the tilde
+(\agcode{\~{}}) to turn the feature off:
+\begin{indentingcode}{0.4in}
+nest comments
+\end{indentingcode}
+causes AnaGram to allow nested comments, and
+\begin{indentingcode}{0.4in}
+\~{}nest comments
+\end{indentingcode}
+causes AnaGram to disallow nested comments.
+You may also set or reset configuration switches with explicit on or
+off values:
+\begin{indentingcode}{0.4in}
+nest comments = on
+nest comments = off
+\end{indentingcode}
+The remaining configuration parameters are assigned values using a
+simple assignment statement.  Depending on the parameter, the value it
+takes may be the name of a token, a C variable name, a C or C++ data
+type, a string constant or an integer.  String constants are written
+using the same rules as keyword strings, described above.
+\begin{indentingcode}{0.4in}
+grammar token      = program
+parser name        = widget
+default token type = void *
+header file name   = "widget.h"
+parser stack size  = 50
+\end{indentingcode}
+A number of string-valued \index{Configuration
+parameters}configuration parameters are used to determine file
+names and variable names.  In these parameters, the \index{\#}``\#'',
+\index{\_dol}``\$'', and ``\index{ \_prc}\%'' characters
+are used as wild cards.  In file name specifications and the
+specification of the name of your parser, ``\#'' will be replaced by
+the name of your syntax file.  In other function and variable names
+AnaGram creates while building your parser, ``\$'' will be replaced by
+the name of your parser.  When building enumeration constants for the
+names of the tokens in your grammar, ``\%'' will be replaced by the
+name of the token.
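+For instance, assuming a syntax file whose name is \agcode{widget}
+(a hypothetical name), the setting
+\begin{indentingcode}{0.4in}
+header file name = "\#.h"
+\end{indentingcode}
+would be expected to produce a header file named \agcode{widget.h}.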
+Note that when entering a Windows/DOS path name as a
+value for a file name parameter you must escape each backslash in the
+path name by doubling it.  For example,
+\begin{indentingcode}{0.4in}
+coverage file name = "f:{\bs\bs}sna{\bs\bs}foo.nrc"
+\end{indentingcode}
+\subsection{Precedence Declarations}
+\index{Precedence declarations}
+AnaGram allows you to resolve shift-reduce conflicts by assigning
+precedence levels to operators.  There are three precedence
+declarations available, beginning with the keywords
+\index{Left}\agparam{left}, \index{Right}\agparam{right}, and
+\index{Nonassoc}\agparam{nonassoc} respectively.  Each such
+declaration consists of the appropriate keyword and a list of tokens
+enclosed in braces (\bra \ket). All the tokens in the list have the same
+precedence, higher than tokens in any previous declaration and lower
+than in any subsequent declaration.  If the keyword is \agparam{left},
+the tokens will group to the left.  If it is \agparam{right}, they
+will group to the right.  If it is \agparam{nonassoc} (for
+non-associative) no grouping will be assumed.  Precedence declarations
+must be included in a configuration section.  Here are precedence
+declarations appropriate to a simple desk calculator program:
+\begin{indentingcode}{0.4in}
+{}[
+left  \bra '+', '-' \ket
+left  \bra star, '/', '\%' \ket
+right \bra unary minus \ket
+]
+unary minus = '-'
+\end{indentingcode}
+Note that \agcode{unary minus} and \agcode{'-'} can have different
+precedence.
+Precedence declarations are one of the few instances in AnaGram where
+the \index{Statements}\index{Order of statements}order of statements
+is significant.
+The use of precedence declarations is discussed in Chapter 9.
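+As a sketch of how the declarations above might be used, consider the
+following deliberately ambiguous productions, which assume that
+\agcode{integer} is defined elsewhere:
+\begin{indentingcode}{0.4in}
+expression
+-> integer
+-> expression:x, '+', expression:y  = x + y;
+-> expression:x, '-', expression:y  = x - y;
+-> expression:x, '/', expression:y  = x / y;
+-> unary minus, expression:x        = -x;
+\end{indentingcode}
+With those declarations, \agcode{8 - 4 - 2} groups to the left as
+\agcode{(8 - 4) - 2}, and in \agcode{2 + 6 / 3} the division is
+performed first, since \agcode{'/'} is declared after \agcode{'+'} and
+therefore has higher precedence.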
+\subsection{``Sticky'' Declarations}
+\index{Sticky declaration}\index{Declaration}
+AnaGram provides another means for resolving shift-reduce conflicts.
+You may characterize any token as ``sticky''.  Then, in the case of a
+\index{Shift-reduce conflict}\index{Conflicts}shift-reduce conflict
+where a ``sticky'' token is the last token in the input buffer, the
+conflict will be resolved by selecting the shift operation.
+Intuitively, you may think of this as though the ``sticky'' token
+adheres to and draws in any subsequent input that it can.  ``Sticky''
+declarations are included in configuration sections.  They begin with
+the keyword \agcode{sticky} followed by a list of tokens, separated by
+commas inside braces (\bra \ket).  Suppose, for instance, you wished to
+pick up a line of text, skipping any leading space or tab
+characters. You might write the following syntax:
+\begin{indentingcode}{0.4in}
+white space = ' ' + '{\bs}t'
+text char
+-> \~{}'{\bs}n':c  = do{\us}something(c);
+line
+-> leading white space, text char?..., '{\bs}n'
+leading white space
+->
+-> leading white space, white space
+\end{indentingcode}
+Unfortunately, this syntax is ambiguous, since space and tab are
+legitimate instances of both leading white space and text char.  What
+you really want to do is to skip white space until you find a
+non-blank character and then you want to accept all characters to the
+end of the line.  There are two ways to address the problem.  The
+first is to define a special token for the first non-blank character
+and, using it, to write an unambiguous grammar.  This approach, while
+laudable, is tedious and prolix.  Instead, use \agparam{sticky} to
+resolve the problem:
+\begin{indentingcode}{0.4in}
+{}[ sticky \bra leading white space \ket ]
+\end{indentingcode}
+Now when AnaGram analyzes your grammar, and encounters the ambiguity,
+it will understand that a blank or tab that could be treated either as
+leading white space or as the first text character should be
+treated as white space.  Since \agcode{leading white space} is
+``sticky'', any subsequent white space adheres to it.
+As with conflicts resolved with precedence levels, AnaGram lists all
+conflicts that it resolves using \agcode{sticky} in the
+\index{Resolved Conflicts}\index{Window}\agwindow{Resolved Conflicts
+Table}, so you can verify that the conflicts have been correctly
+resolved.
+An important use of sticky tokens is to inhibit the recognition of
+following \index{Keywords}keywords.  Following a sticky token, a
+keyword, which, according to your grammar, would otherwise be
+legitimate input, will not be recognized if a shift action is possible
+for the first character of the keyword.  For example, imagine that
+\agcode{name} has been defined in the conventional way, and there
+exists a production with name followed immediately by the keyword
+\agcode{int}.  Then if, in your input, the word \agcode{print} were to
+occur, your grammar would parse it as a name, \agcode{pr}, followed by
+the keyword \agcode{int}.  If you make \agcode{name} sticky, however,
+the first letter of \agcode{int} will be seen to be an acceptable
+character for \agcode{name} and the keyword will not be
+recognized. Your parser will then recognize the \agcode{name} as
+\agcode{print}.
+\subsection{Distinguish Keywords Statement}
+\index{Distinguish keywords}\index{Keywords}
+Distinguish keywords statements are occasionally needed to prevent
+keyword recognition.  You may, for example, wish to prevent the
+recognition of the keyword \agcode{int} when it occurs embedded in a
+word such as \agcode{interval}.  Of course, you need to do this only
+if the keyword and the other word are both legitimate input at
+the same point in your grammar.
+A distinguish keywords statement can prevent recognition of a keyword
+which is embedded in another word provided at least one character of
+the other word follows the keyword.
+The distinguish keywords statement has the form:
+\begin{indentingcode}{0.4in}
+distinguish keywords \bra \codemeta{list of character sets} \ket
+\end{indentingcode}
+AnaGram compares all the characters in each keyword to the characters
+included in each character set in turn.  If it finds that all the
+characters in a keyword are members of a particular set, it tells the
+keyword recognition logic to try to match the keyword only against the
+longest sequence of characters drawn from the specified set.  In other
+words, in order for a keyword to be recognized, the keyword
+\emph{must} be followed by a character \emph{not} in the set.  The set
+associated with a keyword is the first one in the list which contains
+all the characters found in the keyword.  If you have more than one
+\agparam{distinguish keywords} statement in your grammar, the lists
+are tried in the order in which they appear in the grammar.
+The purpose of the \agparam{distinguish keywords} statement is to
+enable your parser to distinguish a keyword from the same sequence of
+characters embedded within another sequence.  Thus suppose that
+\agcode{int} is a keyword, and, according to your grammar, could
+appear in the same place as the word \agcode{integral}.  If you don't
+want it to be recognized as a keyword in these circumstances, you
+would write the following distinguish statement:
+\begin{indentingcode}{0.4in}
+distinguish keywords \bra 'a-z'+'A-Z' \ket
+\end{indentingcode}
+To also inhibit recognition of \agcode{int} within \agcode{print}, you
+would combine the use of the distinguish keywords statement with the
+\agparam{sticky} declaration.
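+The following-character rule can be pictured with a few lines of C.
+This is a sketch of the rule as stated above, not AnaGram's actual
+keyword recognition logic: the helper names
+\agcode{in{\us}letter{\us}set} and \agcode{keyword{\us}matches} are
+hypothetical, the set is assumed to be \agcode{'a-z'+'A-Z'}, and the
+range tests assume an ASCII-compatible character set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the character set 'a-z' + 'A-Z'. */
static bool in_letter_set(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/*
 * Model of the rule: the keyword is recognized at the start of
 * `input` only if the text begins with the keyword AND the
 * character that follows it is not a member of the set.
 * The end of the input (the terminating NUL) is not in the set.
 */
static bool keyword_matches(const char *input, const char *keyword)
{
    size_t n = strlen(keyword);

    if (strncmp(input, keyword, n) != 0)
        return false;               /* text does not start with keyword */
    return !in_letter_set(input[n]);
}
```

+In this model \agcode{int} is recognized when a blank or a semicolon
+follows it, but not at the start of \agcode{integral}.  Applied to the
+tail of \agcode{print}, beginning at the \agcode{i}, the keyword still
+matches, which is why the \agparam{sticky} declaration is needed for
+that case.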
+\subsection{``Hidden'' Declarations}
+\index{Hidden declaration}\index{Declaration}
+AnaGram provides an optional \index{Error diagnosis}error diagnosis
+feature for your parser (see Chapter 9).  The \agparam{hidden}
+declaration allows you to identify tokens that you do not wish to be
+used in making up \index{Diagnostic messages}diagnostic messages.
+These tokens are tokens whose names would not mean anything to your
+users.  The format of a ``hidden'' declaration is the same as that of
+precedence and ``sticky'' declarations.  Within a configuration
+section, the keyword ``hidden'' is followed by a list of tokens. For
+example:
+\begin{indentingcode}{0.4in}
+{}[ hidden \bra comment head \ket ]
+comment
+-> comment head, "*/"
+comment head
+-> "/*"
+-> comment head, \~{}eof
+\end{indentingcode}
+This is an AnaGram representation of ANSI standard C comments
+(non-nested).  In this example the token \agcode{comment head} exists
+only for convenience in writing the grammar and has no particular
+meaning to an end user.  The user does, however, know what the word
+\agcode{comment} refers to.  The ``hidden'' attribute will cause AnaGram's
+diagnostic builder, by backing up the stack until it finds a
+non-hidden token, to eschew \agcode{comment head} in favor of
+\agcode{comment}.
+% XXX eschew obfuscation. how about ``avoid''?
+\subsection{Disregard Statement}
+The purpose of the
+\index{Disregard statement}\index{Statement}\agparam{disregard}
+statement is to skip over uninteresting \index{White space}white space
+and comments in your input files. The disregard statement allows you
+to specify a token that should be passed over in the input to your
+parser.  The statement takes the form:
+\begin{indentingcode}{0.4in}
+disregard ws
+\end{indentingcode}
+where \agcode{ws} is a token name or character set.  Disregard
+statements may be placed in any configuration section.
+You may have more than one disregard statement in your grammar.  If
+you do, AnaGram will create a shell production. For example, suppose
+you write:
+\begin{indentingcode}{0.4in}
+{}[
+disregard alpha
+disregard beta
+]
+\end{indentingcode}
+AnaGram will proceed as though you had written:
+\begin{indentingcode}{0.4in}
+gamma
+-> alpha | beta
+{}[ disregard gamma ]
+\end{indentingcode}
+It frequently happens that you wish your parser to disregard blanks or
+comments, except that white space within names, numbers, strings, and
+other elementary constructs is subject to special rules and thus
+should not be disregarded blindly.  In this case, you can use the
+\agparam{lexeme} statement to declare these constructs off limits
+for the disregard statement.  Within these constructs, the disregard
+statement will be inoperative and the admissibility of white space
+will be determined solely by the productions which define these
+constructs.
+Outside those productions which define lexemes, you should not
+generally use a token which is supposed to be disregarded.  If you do,
+your grammar will have conflicts, since the token could satisfy both
+the explicit usage and the implicit rules set up by the disregard
+statement.  Such conflicts, however, are resolved automatically in
+favor of your explicit use of the token.  The conflicts will appear in
+the \agwindow{Resolved Conflicts} window.
+% XXX I'm not sure that's still true.
+In order to implement the disregard statement AnaGram will redefine
+some tokens in your grammar.  For example, \agcode{+} may be redefined
+to consist of a simple plus sign followed by optional white space:
+\begin{indentingcode}{0.4in}
+'+'
+-> '+'\%, white space?...
+\end{indentingcode}
+The percent sign is used to indicate the original, simple plus sign
+without the optional white space attached.  You will probably notice
+the percent sign appearing in some windows and traces.  In earlier
+versions of AnaGram, the degree sign, ``\agcode{\degrees}'', was used rather
+than ``\agcode{\%}''.
+\subsection{Lexeme Statement}
+The ``lexeme'' \index{Statement}\index{Lexeme statement}statement is
+used to fine-tune the disregard statement.
+The lexeme statement takes the form:
+\begin{indentingcode}{0.4in}
+{}[ lexeme \bra \codemeta{nonterminal token list} \ket ]
+\end{indentingcode}
+where \textit{nonterminal token list} is a list of nonterminal tokens
+separated by commas.
+Lexeme statements may be placed in any configuration section, and
+there may be any number of them.
+When you specify that a token is to be disregarded, AnaGram rewrites
+your grammar so that the token will be passed over whenever it occurs
+at the beginning of a file or following a lexical unit, or
+\agterm{lexeme}.  If you have no \agparam{lexeme} statement, then the
+lexemes in your grammar are just the terminal tokens.
+The \agparam{lexeme} statement allows you to specify that certain
+nonterminal tokens are also to be treated as lexemes.  This means that
+the disregard token will be skipped following the lexeme, but not
+between the characters that constitute the lexeme.
+Lexemes correspond to the tokens that a lexical scanner, if you were
+using one, would commonly identify and pass to a parser as single
+tokens.  You don't usually wish to disregard white space within these
+tokens.  For example, in a grammar for a conventional programming
+language where blank characters are to be disregarded, you might
+include:
+\begin{indentingcode}{0.4in}
+{}[ lexeme \bra string, character constant, name, number \ket ]
+\end{indentingcode}
+since blank characters must not be overlooked within strings and
+character constants and should not be permitted within names or
+numbers.
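+For comparison, here is how a hand-written lexical scanner would apply
+the same policy.  The C code is a sketch for illustration only and is
+not what AnaGram generates: \agcode{skip{\us}space} and
+\agcode{count{\us}lexemes} are hypothetical names, the disregarded
+token is assumed to be a run of blanks and tabs, and the only lexemes
+modelled are names, numbers, and single characters.

```c
#include <ctype.h>
#include <stddef.h>

/* Pass over the disregarded token: any run of blanks or tabs. */
static const char *skip_space(const char *p)
{
    while (*p == ' ' || *p == '\t')
        p++;
    return p;
}

/*
 * Count the lexemes in `p`.  White space is skipped at the beginning
 * of the input and following each lexeme, never inside one.
 */
static size_t count_lexemes(const char *p)
{
    size_t n = 0;

    p = skip_space(p);                      /* beginning of the input */
    while (*p != '\0') {
        if (isalpha((unsigned char)*p)) {   /* name: letter, then letters or digits */
            while (isalnum((unsigned char)*p))
                p++;
        } else if (isdigit((unsigned char)*p)) {    /* number: run of digits */
            while (isdigit((unsigned char)*p))
                p++;
        } else {                            /* any other single character */
            p++;
        }
        n++;
        p = skip_space(p);                  /* following a lexeme */
    }
    return n;
}
```

+Because a blank ends a name in this model, \agcode{a b} counts as two
+lexemes while \agcode{ab} counts as one.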
+Normally, AnaGram considers the disregard token to be optional;
+however, there are circumstances where treating the disregard token as
+optional would lead to conflicts: two successive names, or two
+successive numbers, for example. In this case, you would like to
+require that the lexemes be separated by instances of the disregard
+token.  To do this, simply set the
+\index{Distinguish lexemes}\index{Configuration switches}
+\agparam{distinguish lexemes}
+configuration switch.
+When this switch is set, AnaGram will ensure that disregard tokens
+will be required in those situations where making them optional would
+lead to conflicts.
+White space may be used explicitly within definitions of lexeme tokens
+in your grammar if desired, without causing conflicts. Thus, if you
+wish to allow embedded space in variable names, you might write:
+\begin{indentingcode}{0.4in}
+{}[
+disregard space
+lexeme \bra variable name \ket
+]
+space = ' ' + '{\bs}t'
+letter = 'a-z' + 'A-Z'
+digit = '0-9'
+variable name
+-> letter
+-> variable name, letter + digit
+-> variable name, space..., letter + digit
+\end{indentingcode}
+\subsection{Enum Statement}
+\index{Enum statement}\index{Enumeration}\index{Token}
+The \agparam{enum} statement follows rules nearly identical to those
+for C and C++.  This makes it possible to copy an enum statement from
+your syntax file to a program file written in either C or C++, without
+any need for editing.  The only differences are that AnaGram makes no
+provision for blank lines within the enumeration list, nor does it
+accept a type name.  The \agparam{enum} statement is equivalent to a
+corresponding set of definition statements.  It is especially useful
+when a parser is accepting token input from another program, a
+\index{Lexical scanner}lexical scanner, for example.  Using
+the enum statement you may conveniently define all the identification
+codes for the input tokens.
+Each entry in an enum statement may be either a name, or a name
+followed by an ``='' sign and a character representation.  If there is
+a character representation the name is assigned the value of the
+specified character.  Otherwise it is assigned a value one more than
+that assigned to the previous name.  If the first name in the list is
+not given an explicit value, it will be given the value zero.  For
+example:
+\begin{indentingcode}{0.4in}
+{}[
+enum \bra
+eof, a,b,c,
+blank = '\ ', x, y
+\ket
+]
+\end{indentingcode}
+is equivalent to the following definition statements:
+\begin{indentingcode}{0.4in}
+eof = 0
+a = 1
+b = 2
+c = 3
+blank = '\ '
+x = 33
+y = 34
+\end{indentingcode}
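+Since the rules are nearly identical to those of C, the same list can
+be placed in a C source file, where the compiler assigns the same
+values.  The declaration below is a sketch of that use; the numeric
+values in the comments assume an ASCII-compatible character set, in
+which a blank has the value 32.

```c
/*
 * The enumeration list from the syntax file, copied into C.
 * No type name is given, matching what AnaGram accepts.
 */
enum {
    eof, a, b, c,           /* 0, 1, 2, 3 */
    blank = ' ', x, y       /* 32, 33, 34 with ASCII */
};
```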
+\subsection{Subgrammar Declarations}
+\index{Subgrammar declaration}\index{Declaration}
+A \agparam{subgrammar} declaration can be a useful way to deal with
+conflicts in certain situations.  It tells AnaGram to treat the tokens
+listed in the declaration as though each were a grammar token,
+each specifying a complete subgrammar in itself, and, in determining
+shift and reduction actions, to ignore the usage of the tokens in the
+larger grammar.
+In some cases it is perfectly reasonable to ignore usage.  The most
+common example occurs when building a lexical scanner for a language
+such as C as in the example in Section 7.4.4.  In this case, you can
+write a complete grammar for a C token with no difficulty.  But if you
+try to extend it to a sequence of tokens, you get scores of conflicts.
+This situation arises because you specify that any C token can follow
+another, when in actual practice, an identifier, for example, cannot
+follow another identifier without some intervening space or
+punctuation.
+It is theoretically possible, but in practice quite awkward, to write
+a grammar for a sequence of tokens so that there are no conflicts.
+The subgrammar declaration provides a way around this problem by
+telling AnaGram that when it is looking for reducing tokens for any
+rule produced directly or indirectly by a subgrammar token, it should
+disregard the usage of the token and only consider usage internal to
+the definition of the subgrammar token, as though the subgrammar token
+were the start token of the grammar.
+The subgrammar declaration is made in a configuration section and
+consists of the keyword \agcode{subgrammar} followed by a list of one
+or more nonterminal token names, separated by commas and enclosed in
+braces (\bra \ket). For example:
+\begin{indentingcode}{0.4in}
+{}[ subgrammar \bra C token, word \ket ]
+\end{indentingcode}
+Since the subgrammar declaration changes the way AnaGram determines
+reducing tokens, it should be used with caution.  You should be sure
+that the conflicts you are eliminating are really inconsequential.
+\subsection{Reserve Keywords Declaration}
+\index{Reserve keywords}\index{Keywords}\index{Keyword anomalies}
+The \agparam{reserve keywords} declaration can be used to specify a
+list of keywords that are reserved and cannot be used except as
+explicitly specified in the grammar.  This enables AnaGram to avoid
+issuing meaningless keyword anomaly diagnostics (see \S 7.5).  AnaGram
+does not automatically presume that keywords are also reserved words,
+since in many grammars there is no need to specify reserved words.
+The reserve keywords declaration is made in a configuration section
+and consists of the words \agcode{reserve keywords} followed by a list
+of one or more keyword strings, separated by commas and enclosed in
+braces (\bra \ket). For example:
+\begin{indentingcode}{0.4in}
+{}[ reserve keywords \bra "int", "char", "float", "double" \ket ]
+\end{indentingcode}
+\subsection{Rename Macro Statement}
+\index{Rename macro}\index{Macros}
+AnaGram uses a number of macros in its generated code.  It is
+possible, therefore, to run into naming collisions with other
+components of your program.  The \agparam{rename macro} statement
+allows you to change the name AnaGram uses for a particular macro to
+avoid these problems.  For example, the Windows NT operating system
+uses \agcode{CONTEXT} structures to perform various internal
+operations.  If you use the context tracking option (see \S 9.5.4)
+your parser will have a macro called \agcode{CONTEXT}.  To avoid the
+name collision, add the following statement to any configuration
+section in your grammar:
+\begin{indentingcode}{0.4in}
+rename macro CONTEXT AG{\us}CONTEXT
+\end{indentingcode}
+Then, simply use \agcode{AG{\us}CONTEXT} where you would otherwise have
+used \agcode{CONTEXT}.
