AnaGram interim repo (temporary): doc/manual/dd.tex comparison

comparison doc/manual/dd.tex @ 0:13d2b8934445

Import AnaGram (near-)release tree into Mercurial.

author	David A. Holland
date	Sat, 22 Dec 2007 17:52:45 -0500


--1:000000000000
+:13d2b8934445
+\chapter{Programming With AnaGram}
+Although AnaGram has many options and features which enable you to
+build a parser that meets your needs precisely, it has well-defined
+defaults so that you do not generally need to learn about an option
+until you need the facility it provides.  The purpose of this chapter
+is to show you how to use the options and features effectively.
+The options and features of AnaGram can be divided roughly into three
+groups: those that control the general aspects of your parser, those
+that control input to the parser and those that control error
+handling.  After dealing with these three groups of options and
+features, this chapter concludes with a discussion of various advanced
+techniques.
+Many aspects of your parser are controlled by setting configuration
+parameters, either in a configuration file or in your syntax file.
+This chapter presumes you are familiar with setting configuration
+parameters.  The names of configuration parameters, as they occur in
+the text, are printed in \agparam{bold face type}.  Appendix A
+describes the use of configuration parameters and provides a detailed
+discussion of each configuration parameter.
+\section{General Aspects}
+\subsection{Program Development}
+The first step in writing a program is to write a grammar in AnaGram
+notation which describes the input the program expects.  The file
+containing the grammar, called the syntax file, conventionally has the
+extension \agfile{.syn}.  You could also make up a few sample input
+files at this time, but it is not necessary to write reduction
+procedures at this stage.
+Run AnaGram and use the \index{Analyze Grammar}Analyze Grammar command
+to create parse tables.  If there are syntax errors in the grammar at
+this point, you will have to correct them before proceeding, but you
+do not necessarily have to eliminate conflicts, if there are any, at
+this time.  There are, however, many aids available to help you with
+conflicts.  These aids are described in Chapters 5 through 7, and
+somewhat more briefly in the Online Help topics.
+Once syntax errors are corrected, you can try out your grammar on the
+sample input files using the File Trace facility.  With File Trace,
+you can see interactively just how your grammar operates on your test
+files.  You can also use Grammar Trace to answer ``what if'' questions
+concerning input to the grammar.  The Grammar Trace does not use a
+test file, but rather allows you to make input choices interactively.
+At any time, you can write reduction procedures to process your input
+data as its components are identified in the input stream.  Each
+procedure is associated with a grammar rule.  The reduction procedures
+will be incorporated into your parser when you create it with the
+\index{Build Parser}Build Parser command.
+By default, unless you specify an input procedure, parser input will
+be read from \agcode{stdin}, using the default \agcode{GET{\us}INPUT}
+macro.  You will probably wish to redefine \agcode{GET{\us}INPUT}, or
+configure your parser to use \agparam{pointer input} or \agparam{event
+driven} input.
+\subsection{The Default Parser}
+\index{Parser}
+If you apply the Build Parser command to a syntax file which contains
+only a grammar, with no reduction procedures and no embedded C code,
+AnaGram will still produce a complete C command line program which you
+can compile and run.  \index{Input procedures}This parser will parse
+character input from \agcode{stdin}.  If the input does not satisfy
+the rules of your grammar, the parser will issue a syntax error
+diagnostic to \agcode{stderr} identifying the exact line and column
+numbers of the error.  If the parser should overflow its stack, it
+will abort with an error message to \agcode{stderr}.  If the parse is
+successful, that is, if the parser succeeds in identifying the grammar
+token without encountering an error, it will simply return to the
+command line.
+You can extend such a simple parser, often quite effectively, by
+adding only reduction procedures.  If the reduction procedures write
+output to \agcode{stdout}, you can produce a conventional ``filter''
+program without having to pay any attention to input handling, error
+handling, or any of the other options AnaGram provides.
+%CALC, in the EXAMPLES directory, is an example of such a program.
+\subsection{The Content of the Parser and Header Files}
+AnaGram creates two \index{Output files}\index{File}output files: a
+parser file and a header file.  \index{Parser file}\index{File}The
+parser file contains the C code you need to compile and link before
+you can run your parser.  It begins with the \index{C
+prologue}\index{Prologue}C prologue, if any, from your syntax file.
+The C prologue is an optional block of \index{Embedded C}embedded C or
+C++ which precedes everything else in your syntax file.  Although it
+can contain anything you wish, normally it is used to place
+identification information, \index{Copyright notice}copyright notices,
+etc., at the beginning of your parser file.  If your parser uses token
+types that require definition, the appropriate \agcode{\#include}
+statements and definitions should be placed in the C prologue.  See
+``Defining Token Types'', below.
+Following the C prologue, AnaGram places a number of definitions of
+variables and macros that you might need to refer to in your embedded
+C, and in your reduction procedures.  Not the least of these
+definitions is the parser control block, described below.  Following
+these definitions, AnaGram inserts all your embedded C, in the order
+in which it occurred in your syntax file.  Following the embedded C
+come all your reduction procedures.  Finally, AnaGram adds the tables
+which summarize your grammar and a parsing engine customized to your
+requirements.
+The \index{Header file}\index{File}header file contains definitions
+needed by your parser.  These include definitions of the \index{Parser
+value stack}\index{Value stack}\index{Stack}value stack type, the
+input token type, the \index{Parser control block}parser control block
+type, and token name enumeration constants.  The definitions are
+placed in a header file so that you can make them available to other
+modules if necessary.
+\subsection{Naming Output Files}
+\index{Output files}\index{File}
+Unless you specify otherwise, AnaGram names the parser and header
+files following conventional programming practice.  Both \index{File
+name}\index{File name}files have the same name as your syntax file,
+with extensions \agfile{.c} and \agfile{.h} respectively.  These
+names, however, are controlled by the configuration parameters
+\index{Configuration parameters}\index{Name}
+\index{Parser file name}\agparam{parser file name} and
+\index{Header file name}\agparam{header file name}
+respectively, so you can override AnaGram's defaults if you wish.  If
+you normally use C++ rather than C, for example, you might want to
+include the following statement in your configuration file:
+\begin{indentingcode}{0.4in}
+parser file name = "\#.cpp"
+\end{indentingcode}
+When AnaGram names the parser file it substitutes the name of your
+syntax file for the ``\#'' character in the file name template.
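+The substitution can be pictured with a short C sketch.  The helper
+\agcode{expand{\us}file{\us}name} is hypothetical and is not part of AnaGram; it
+only illustrates replacing ``\#'' in a template with the syntax file
+name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of AnaGram: copy `tmpl` into `out`,
 * replacing each '#' with `stem` (the syntax file name without its
 * extension).  Returns 0 on success, -1 if the result does not fit. */
static int expand_file_name(const char *tmpl, const char *stem,
                            char *out, size_t out_size)
{
    size_t used = 0;
    size_t stem_len;

    assert(tmpl != NULL && stem != NULL && out != NULL);
    if (out_size == 0) return -1;
    stem_len = strlen(stem);

    for (; *tmpl != '\0'; tmpl++) {
        if (*tmpl == '#') {
            /* Leave room for the terminating NUL. */
            if (used + stem_len >= out_size) return -1;
            memcpy(out + used, stem, stem_len);
            used += stem_len;
        } else {
            if (used + 1 >= out_size) return -1;
            out[used++] = *tmpl;
        }
    }
    out[used] = '\0';
    return 0;
}
```

+With the template \agcode{\#.cpp} and a syntax file named
+\agfile{ana.syn}, this sketch produces \agfile{ana.cpp}.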
+\subsection{Compiling Your Parser}
+\index{Parser}
+Although AnaGram was designed primarily with ANSI C in mind, a good
+deal of care has been taken to ensure that its output is consistent
+with older C compilers and with newer C++ compilers.  If your compiler
+does not support ANSI function prototypes, you should set the
+\index{Old style}\index{Configuration switches}\agparam{old style}
+switch in your configuration file.  If you are intending to compile
+your parser using a 16-bit compiler, you might want to turn on the
+\index{Near functions}\index{Configuration switches}\agparam{near functions}
+switch in your configuration file.  If you are building a parser for
+use in an embedded system, you might want to make sure the
+\index{Const data}\index{Configuration switch}\agparam{const data}
+configuration switch is set so that all the tables AnaGram generates
+will be declared \agcode{const}.
+\subsection{Naming Your Parser}
+\index{Parser}
+In the default case, AnaGram creates a main program for you.
+Generally, however, you will probably want a parser function which you
+can call from your own main program.  You won't want AnaGram to define
+\agcode{main} for you.  You can stop AnaGram from defining
+\agcode{main} in any of several ways: include some embedded C in your
+syntax file, turn off
+\index{Main program}the \index{Configuration switches}\agparam{main program}
+configuration switch, or turn on either the \agparam{event driven} or
+\agparam{pointer input} switches.  Since you almost always will have
+some embedded C in your syntax file, you will seldom have to use the
+\agparam{main program} switch.
+Normally, AnaGram simply uses the name of your syntax file to create
+the name of your parser.  Thus if your syntax file is called
+\agfile{ana.syn} your parser will have the name \agcode{ana}.  AnaGram
+does not check the parser name for compliance with the rules of C.  If
+you use strange characters in your file name, you will get strange
+characters in the name of your parser, and you will get unpleasant
+remarks from your C compiler when you try to compile your parser.
+Thus, for example, if you were to name your syntax file
+\agfile{!@\#.syn}, AnaGram would call your parser \agcode{!@\#}.  Your
+compiler will doubtless choke.
+\index{Parser}
+If you wish AnaGram to give your parser a name other than the file
+name, you may set the
+\index{Parser name}\index{Name}\index{Configuration parameters}
+\agparam{parser name}
+configuration parameter.  Thus, to make sure your parser is called
+\agcode{periwinkle} you would include the following line in a
+configuration section in your syntax file:
+% Note: this is not actually required to be in double quotes.
+% It'll also accept anything that's syntactically acceptable to it
+% as a C data type, which also lets you give it things like
+% ``periwinkle *'' that result in uncompilable code.
+\begin{indentingcode}{0.4in}
+parser name = "periwinkle"
+\end{indentingcode}
+Besides the parser itself, AnaGram generates a number of other
+functions, variables and type definitions when it creates your parser.
+All these entities are named using the parser name as the base.  The
+templates and their usages are as follows:
+\begin{indenting}{0.4in}
+\begin{tabular}{ll}
+\index{Parser}\index{Initializer}\index{Name}
+\agcode{init{\us}\$}&initializer for parser\\
+\index{Grammar token}\index{Value}
+\agcode{\${\us}value}&returns value of grammar token\\
+\index{Parser value stack}\index{Value stack}\index{Stack}
+\agcode{\${\us}vs{\us}type}&value stack type\\
+\agcode{\${\us}it{\us}type}&input token union\\
+\agcode{\${\us}token{\us}type}&token name enumeration typedef\\
+\agcode{\${\us}\%{\us}token}&token name enumeration constants\\
+\agcode{\${\us}pcb{\us}type}&typedef of parser control block\\
+\index{Parser control block}
+\agcode{\${\us}pcb}&parser control block\\
+\index{Rule Count}
+\agcode{\${\us}nrc}&rule count table\\
+\agcode{\${\us}nrpc}&reduction procedure count table\\
+\\
+\end{tabular}
+\end{indenting}
+When AnaGram defines these entities it substitutes the parser name for
+the dollar sign.  In the token name enumeration constants it
+substitutes the token name for the \index{{\us}prc}``\%'' character.
+Embedded space characters are replaced with underscore characters.
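+The naming rule can be sketched in C as follows.  The function
+\agcode{expand{\us}entity{\us}name} is a hypothetical illustration, not code
+that AnaGram exposes: it replaces ``\$'' with the parser name and
+``\%'' with the token name, and turns spaces into underscores.

```c
#include <assert.h>
#include <stddef.h>

/* Append `text` to `out`, turning spaces into underscores.
 * Returns 0 on success, -1 if `out` is too small. */
static int append_name(char *out, size_t out_size, size_t *used,
                       const char *text)
{
    for (; *text != '\0'; text++) {
        if (*used + 1 >= out_size) return -1;   /* keep room for NUL */
        out[(*used)++] = (*text == ' ') ? '_' : *text;
    }
    return 0;
}

/* Hypothetical illustration of the templates in the table:
 * '$' stands for the parser name, '%' for the token name. */
static int expand_entity_name(const char *tmpl, const char *parser,
                              const char *token, char *out, size_t out_size)
{
    size_t used = 0;
    int status = 0;

    assert(tmpl != NULL && parser != NULL && token != NULL);
    assert(out != NULL && out_size > 0);

    for (; *tmpl != '\0' && status == 0; tmpl++) {
        if (*tmpl == '$') {
            status = append_name(out, out_size, &used, parser);
        } else if (*tmpl == '%') {
            status = append_name(out, out_size, &used, token);
        } else {
            char literal[2];
            literal[0] = *tmpl;
            literal[1] = '\0';
            status = append_name(out, out_size, &used, literal);
        }
    }
    out[used] = '\0';
    return status;
}
```

+For a parser named \agcode{ana} and a token named \agcode{integer
+constant}, the template \agcode{\${\us}\%{\us}token} expands to
+\agcode{ana{\us}integer{\us}constant{\us}token} in this sketch.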
+\subsection{The Parser Control Block}
+\index{Parser control block}
+The complete status of a parse is kept in a structure called a
+\agterm{parser control block}.  As a default, AnaGram defines a parser
+control block for you, and provides a macro, \index{PCB}\agcode{PCB},
+which enables you to access it simply.  The name AnaGram assigns to
+the parser control block is
+% XXX
+%\agcode{\${\us}pcb}, where as above ``\$'' is replaced with the name of
+%your parser.
+\agcode{\textit{$<$parser name$>$}{\us}pcb}.
+If you need to refer to the parser control block from some module
+other than the parser module, use an \agcode{\#include} statement to
+include the header file for your parser and refer to the parser
+control block by its name as above.  The structure of the parser
+control block is described in Appendix E.  In this chapter, particular
+fields will be discussed as necessary.
+Since the parser control block contains the complete status of a
+parse, you may interrupt a parse and continue it later by saving and
+restoring the control block.  If you have multiple input streams, all
+controlled by the same grammar, you may have a separate control block
+for each stream.  If you wish to call your parser recursively, you may
+define a fresh control block for each level of recursion.  To make
+best use of these capabilities, you will need to declare the parser
+control block yourself.  This is discussed below under ``Advanced
+Techniques''.
+\subsection{Calling Your Parser}
+% XXX should have an example of actually calling the thing.
+% XXX should also have ``terminating your parser'' or something like
+% that.
+The parser function AnaGram defines is a simple function which takes
+no arguments and returns no values.  All communication with the parser
+takes place via the parser control block.  When your parser returns,
+\index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} contains an exit
+code describing the outcome of the parse.  Symbols for the
+exit codes are defined in the header file AnaGram generates.
+\index{Exit codes}\index{Error codes}These symbols, their values,
+and their meanings are:
+\index{AG{\us}RUNNING{\us}CODE}
+\index{AG{\us}SUCCESS{\us}CODE}
+\index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}
+\index{AG{\us}REDUCTION{\us}ERROR{\us}CODE}
+\index{AG{\us}STACK{\us}ERROR{\us}CODE}
+\index{AG{\us}SEMANTIC{\us}ERROR{\us}CODE}
+\begin{indenting}{0.4in}
+\begin{tabular}{lll}
+\agcode{AG{\us}RUNNING{\us}CODE}&0&Parse is not yet complete\\
+\agcode{AG{\us}SUCCESS{\us}CODE}&1&Parse terminated successfully\\
+\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE}&2&Syntax error was encountered\\
+\agcode{AG{\us}REDUCTION{\us}ERROR{\us}CODE}&3&Bad reduction token encountered\\
+\agcode{AG{\us}STACK{\us}ERROR{\us}CODE}&4&Parser stack overflowed\\
+\agcode{AG{\us}SEMANTIC{\us}ERROR{\us}CODE}&5&Semantic error\\
+\\
+\end{tabular}
+\end{indenting}
+Only an event driven parser will return the value
+\agcode{AG{\us}RUNNING{\us}CODE}, since any other parser continues executing
+until it terminates successfully or encounters an unrecoverable error.
+Syntax errors, reduction token errors, and stack errors are discussed
+below under ``Error Handling''.
+% XXX: this bit belongs somewhere else
+\agcode{AG{\us}SEMANTIC{\us}ERROR{\us}CODE} is a special case.  AnaGram never
+sets it itself; it is available for you to use in your reduction
+procedures to terminate a parse for semantic reasons.
+If, in a reduction procedure, you determine that parsing should not
+continue, you need only include the statement:
+\begin{indentingcode}{0.4in}
+PCB.exit{\us}flag = AG{\us}SEMANTIC{\us}ERROR{\us}CODE;
+\end{indentingcode}
+When your reduction procedure returns, the parse will then terminate
+and the parser will return control to the calling program.
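+After the parser returns, the calling program typically inspects
+\agcode{PCB.exit{\us}flag}.  The following is a minimal sketch of such a
+check.  The enumeration is a stand-in for the symbols in the generated
+header, using the values from the table above, and the helper names
+\agcode{describe{\us}exit{\us}flag} and \agcode{parse{\us}succeeded} are
+hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the exit code symbols AnaGram defines in the generated
 * header; the values are those listed in the exit code table. */
enum {
    AG_RUNNING_CODE         = 0,
    AG_SUCCESS_CODE         = 1,
    AG_SYNTAX_ERROR_CODE    = 2,
    AG_REDUCTION_ERROR_CODE = 3,
    AG_STACK_ERROR_CODE     = 4,
    AG_SEMANTIC_ERROR_CODE  = 5
};

/* Hypothetical helper: translate an exit code into a message. */
static const char *describe_exit_flag(int exit_flag)
{
    switch (exit_flag) {
    case AG_RUNNING_CODE:         return "parse is not yet complete";
    case AG_SUCCESS_CODE:         return "parse terminated successfully";
    case AG_SYNTAX_ERROR_CODE:    return "syntax error was encountered";
    case AG_REDUCTION_ERROR_CODE: return "bad reduction token encountered";
    case AG_STACK_ERROR_CODE:     return "parser stack overflowed";
    case AG_SEMANTIC_ERROR_CODE:  return "semantic error";
    default:                      return "unknown exit code";
    }
}

/* Hypothetical helper: nonzero when the parse result may be used. */
static int parse_succeeded(int exit_flag)
{
    return exit_flag == AG_SUCCESS_CODE;
}
```

+In a real program the argument would be \agcode{PCB.exit{\us}flag}, read
+immediately after the call to the parser function.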
+\subsection{Parser Return Value}
+\index{Value}
+If, in your grammar, there is a value assigned to the grammar token,
+you may retrieve it, after the parse is complete, by calling the
+parser value function, the name of which is given by
+\agcode{\${\us}value} where ``\$'' is the name of your parser.
+\agcode{\${\us}value} takes no arguments, and returns a value of the type
+assigned to the grammar token in your syntax file.
+Although in theoretical discussions of parsing the result of the parse
+is contained in the value of the grammar token, in practice, more
+often than not, results are communicated to other procedures by
+setting the values of global variables.  Thus the value of the grammar
+token is often of little interest.
+Since the parser per se takes no arguments, it is usually convenient
+to write a small interface function with a calling sequence
+appropriate to the problem.  The interface function can then take care
+of appropriate initializations, call the parser, and retrieve results.
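+A minimal sketch of such an interface function follows.  Everything
+named \agcode{demo} here is a stand-in written by hand so that the
+example compiles without AnaGram: the control block, the parser
+function and the value function only imitate what a generated parser
+would provide, and the stand-in reads its input through a pointer
+field purely for brevity.

```c
#include <assert.h>
#include <stddef.h>

enum { AG_RUNNING_CODE = 0, AG_SUCCESS_CODE = 1, AG_SYNTAX_ERROR_CODE = 2 };

/* Stand-in for a generated parser control block (illustrative only). */
static struct {
    const unsigned char *pointer;
    int exit_flag;
    long value;
} demo_pcb;
#define PCB demo_pcb

/* Stand-in for a generated parser: accepts one or more decimal digits.
 * Like the real parser function, it takes no arguments and returns
 * nothing; all communication goes through the control block. */
static void demo(void)
{
    int digits = 0;

    PCB.exit_flag = AG_RUNNING_CODE;
    PCB.value = 0;
    while (*PCB.pointer >= '0' && *PCB.pointer <= '9') {
        PCB.value = PCB.value * 10 + (*PCB.pointer++ - '0');
        digits++;
    }
    PCB.exit_flag = (digits > 0 && *PCB.pointer == '\0')
        ? AG_SUCCESS_CODE : AG_SYNTAX_ERROR_CODE;
}

/* Stand-in for the generated value function. */
static long demo_value(void)
{
    return PCB.value;
}

/* The interface function: initializes the input, calls the parser and
 * retrieves the result.  Returns 1 on success, 0 on failure. */
static int parse_number(const char *text, long *result)
{
    assert(text != NULL && result != NULL);
    PCB.pointer = (const unsigned char *)text;
    demo();
    if (PCB.exit_flag != AG_SUCCESS_CODE) return 0;
    *result = demo_value();
    return 1;
}
```

+The caller sees only \agcode{parse{\us}number}; the control block and the
+argument-free parser function stay hidden behind it.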
+\subsection{Defining Token Types}
+When you add reduction procedures to your grammar, you will often find
+it convenient to add type declarations for the \index{Semantic
+value}\index{Token}\index{Value}semantic values of some of the tokens
+in your grammar.  As long as the types you use are conventional C data
+types\index{Data type}\index{Token}, you don't have to do anything
+special.  If, however, you have used types or classes that you have
+defined yourself, you need to make sure that the appropriate
+definition statements precede their use in the code AnaGram generates.
+To do this, you need to have a C prologue in your syntax file.  In the
+C prologue, you should place the definition statements your parser
+will need, or at least an \agcode{\#include} statement that will cause
+the types or classes to be defined.
+\subsection{Debugging Your Parser}
+Because the ``flow of control'' of your parser is algorithmically
+derived from your grammar, debugging your parser divides into two
+separate exercises: debugging your grammar, discussed in Chapter 7,
+and debugging your reduction procedures.
+When debugging, it is usually a good idea to turn off the
+\index{Macros}\index{Allow macros}\index{Configuration switches}
+\agparam{allow macros}
+switch.  This switch is normally on and causes simple reduction
+procedures to be implemented as macros.  When you turn it off, you get
+a proper function definition for each reduction procedure, so you can
+put a breakpoint in any reduction procedure you choose.  If the
+\index{Line numbers}\index{Configuration switches}
+\agparam{line numbers} switch
+is on, each reduction procedure will contain a
+\index{\#line}\agcode{\#line} directive to show where the reduction
+procedure is found in your syntax file.  Once you have acquired
+confidence in your reduction procedures you may turn the
+\agparam{allow macros} switch back on for slightly improved
+performance.
+If your debugger allows you to inspect entire structures, you will
+find it convenient to look at the parser control block while you are
+debugging.  The contents of the parser control block are described in
+Appendix E.
+A good way to begin debugging a new parser is to simply put a
+breakpoint in each reduction procedure.  Start your parser and step
+through the reduction procedures one by one, verifying that they
+perform as expected.  After you have stepped through a reduction
+procedure, turn off its breakpoint.  If there are multiple paths,
+leave breakpoints on the paths not taken.  Liberal use of the assert
+macro helps assure that your fixes don't break procedures you have
+already tested.
+\section{Providing Input to Your Parser}
+\index{Parser}\index{Input}\index{Input procedures}
+This section describes three methods for providing input to your
+parser.  In the first method your program calls the parser which then
+requests input tokens as it needs them.  It returns only when it has
+completed the parse.  The parser requests input tokens by invoking a
+macro called \agcode{GET{\us}INPUT}, described below.
+The second method for providing input can be used when the entire
+sequence of input tokens is available in memory.  This method is
+controlled by the \index{Pointer input}\index{Configuration
+switches}\agparam{pointer input} configuration switch.  It is
+discussed below.
+The third method for providing input is especially convenient when
+using \index{Lexical scanner}lexical scanners or multi-stage parsing.
+It is controlled by the \index{Event driven}\index{Configuration
+switches}\agparam{event driven} configuration switch.
+\subsection{The \agcode{GET{\us}INPUT} Macro}
+\index{GET{\us}INPUT}\index{Macros}
+The default parser simply reads characters from \agcode{stdin}.  It
+does this by invoking a macro called \agcode{GET{\us}INPUT} every time it
+needs an input character.  The default definition of
+\agcode{GET{\us}INPUT} is:
+\index{PCB}\index{input{\us}code}
+\begin{indentingcode}{0.4in}
+\#define GET{\us}INPUT (PCB.input{\us}code = getchar())
+\end{indentingcode}
+\agcode{PCB.input{\us}code} is an integer field in the parser control
+block which is used to hold the current input \index{Character
+codes}character code.
+By including your own definition of \agcode{GET{\us}INPUT} in your
+embedded C, you override the default definition provided by AnaGram.
+The only requirement for \agcode{GET{\us}INPUT} is that it store a
+character in \agcode{PCB.input{\us}code}.  Suppose you wish to make a
+parser that reads characters from a file provided by the calling
+program.  You could include the following in your embedded C:
+\begin{indentingcode}{0.4in}
+extern FILE *file;
+\#define GET{\us}INPUT (PCB.input{\us}code = fgetc(file))
+\end{indentingcode}
+Now your parser, when invoked, will read characters from the specified
+file instead of reading them from \agcode{stdin}.  Of course,
+\agcode{GET{\us}INPUT} is not constrained to reading a file or data
+stream.  You may implement \agcode{GET{\us}INPUT} in any manner you
+choose.  You may implement it as a function call, or you may choose to
+define \agcode{GET{\us}INPUT} so that it expands into inline code for
+faster execution.
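+As an illustration of the function call approach, here is a sketch in
+which \agcode{GET{\us}INPUT} draws characters from a string in memory.
+The control block is a hand-written stand-in, and the names
+\agcode{set{\us}input} and \agcode{next{\us}input{\us}code} are hypothetical.
+Returning \agcode{EOF} at the end of the string is an assumption that
+mirrors what \agcode{getchar} does at end of input.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the generated parser control block (illustrative only). */
static struct { int input_code; } demo_pcb;
#define PCB demo_pcb

/* Hypothetical input source: a cursor into a NUL-terminated string. */
static const unsigned char *input_cursor;

static void set_input(const char *text)
{
    assert(text != NULL);
    input_cursor = (const unsigned char *)text;
}

/* Return the next character code, or EOF once the terminating NUL is
 * reached.  Reading through unsigned char keeps codes above 127
 * positive. */
static int next_input_code(void)
{
    assert(input_cursor != NULL);
    return *input_cursor != '\0' ? *input_cursor++ : EOF;
}

/* The only requirement: store a character code in PCB.input_code. */
#define GET_INPUT (PCB.input_code = next_input_code())
```

+Each expansion of \agcode{GET{\us}INPUT} stores one character code in
+\agcode{PCB.input{\us}code}, which is all the parser requires of it.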
+\subsection{Pointer Input}
+\index{Pointer input}\index{Input procedures}
+It often happens that the data you wish to parse are already in memory
+when you are ready to call the parser.  While you could rewrite
+\agcode{GET{\us}INPUT} to simply scan the array by incrementing a
+pointer, AnaGram provides an alternative approach since this is such a
+common situation.  In a configuration section in your syntax file
+simply turn on the \index{Pointer input}\index{Configuration
+switches}\agparam{pointer input} switch.  Then before you call your
+parser, load \index{pointer}\index{PCB}\agcode{PCB.pointer}, the
+pointer field in the parser control block, with a pointer to your
+array.  Assuming your parser is called \agcode{ana}, and you wish to
+call an interface function with an argument consisting of a character
+string, here's what you do:
+\begin{indentingcode}{0.4in}
+{}[
+pointer input
+]
+\bra
+void ana{\us}shell(char *source{\us}text) \bra
+PCB.pointer = (unsigned char *)source{\us}text;
+ana();
+\ket
+\ket
+\end{indentingcode}
+The type of \agcode{PCB.pointer} defaults to
+\agcode{unsigned char *} to
+minimize difficulty with full 256-character sets.  If your compiler is
+fussy, you should use a cast, as above, when you set the value.  If
+your data requires more than 256
+\index{Character codes}character codes, you may still use pointer
+input by using the \index{Pointer type}\index{Configuration
+parameters}\agparam{pointer type} configuration parameter to change
+the definition of the field in the parser control block.  Normally,
+the value of \agparam{pointer type} should be a C data type that
+converts to integer.  If \agparam{pointer type} does not convert to
+integer, you must provide an
+\index{INPUT{\us}CODE}\index{Macros}\agcode{INPUT{\us}CODE} macro, as
+described below, to extract a token identifier.  Do not change
+\agparam{pointer type} to \agcode{signed char} in order to avoid the
+cast in the above example.  That will have the effect of making all
+character codes above 127 inaccessible to your parser.
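+The effect of signedness can be seen in a few lines of C.  This sketch
+is independent of AnaGram; the two function names are hypothetical.  It
+reads the same byte once through an \agcode{unsigned char} pointer and
+once through a \agcode{signed char} pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Read the first byte as an unsigned character code (0..255). */
static int code_through_unsigned(const char *text)
{
    assert(text != NULL);
    return *(const unsigned char *)text;
}

/* Read the first byte as a signed character.  On the usual
 * two's-complement platforms, bytes above 127 come out negative, so
 * they no longer match any character code in the range 128..255. */
static int code_through_signed(const char *text)
{
    assert(text != NULL);
    return *(const signed char *)text;
}
```

+For a byte with the value 0xE9, the first function yields 233, while
+the second yields a negative number on typical platforms.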
+Note that if you use pointer input your parser does not need a
+\agcode{GET{\us}INPUT} macro.  Parsers that use pointer input usually
+run somewhat faster than those that use \agcode{GET{\us}INPUT}.  The
+improvement is most noticeable in parsers that use keywords, because
+keyword recognition in particular benefits from pointer input.
+\subsection{Event Driven Parsers}
+\index{Event driven parser}\index{Parser}
+There are many situations where the input to a parser is developed by
+an independent process and the linkage required to implement a
+\agcode{GET{\us}INPUT} macro is unduly cumbersome.  In these
+circumstances, it is convenient to use an \agparam{event driven}
+parser.  With an event driven parser, you do not simply call the
+parser and wait for it to finish.  Instead, you call its
+\index{Initializer}initializer first, and then call it each time you
+have a character for it.  The parser processes the character and
+returns as soon as it needs more input, encounters an error or finds
+the parse complete.  You can interrogate
+\index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} to determine
+whether the parser can accept more input.
+To create an event driven parser, set the \index{Event
+driven}\index{Configuration switches}\agparam{event driven} switch in
+your syntax file.  Then, to initialize the parser, call the
+initialization procedure, or \index{Initializer}initializer, provided
+by AnaGram.  The name of this procedure is \agcode{init{\us}\$} where
+``\agcode{\$}'' represents the name of your parser.  If your parser is named
+\agcode{ana}, the
+\index{Parser}initialization procedure is named \agcode{init{\us}ana}.
+To process a single character, store the character in
+\index{input{\us}code}\index{PCB}\agcode{PCB.input{\us}code}, then call
+\agcode{ana}.  When it returns, check
+\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to see if the
+parser is still running.  When the parse is successful, you may
+retrieve the value of the grammar token, if you wish, by using the
+\index{Parser value function}parser value function, in this case
+\agcode{ana{\us}value}.
+As an example, let us imagine we are to write an interface function
+for our parser which takes a list of string pointers, a count, and a
+pointer to a location into which we may store an error flag.  The
+input to our parser is to be the concatenation of all the character
+strings.  We will set up a loop which will call the parser for all the
+characters of the strings in turn.  We will assume that the function
+will return the value of the grammar token, which we will assume to be
+of type \agcode{double}:
+\begin{indentingcode}{0.4in}
+{}[
+event driven
+]
+\bra
+double parse{\us}strings(char **ptr, int n{\us}strings, int *error) \bra
+init{\us}ana();
+while (PCB.exit{\us}flag == AG{\us}RUNNING{\us}CODE \&\&
+				n{\us}strings--) \bra
+char *p = *ptr++;
+while (PCB.exit{\us}flag == AG{\us}RUNNING{\us}CODE \&\& *p) \bra
+PCB.input{\us}code = *p++;
+ana();
+\ket
+\ket
+assert(error);
+*error = PCB.exit{\us}flag != AG{\us}SUCCESS{\us}CODE;
+return ana{\us}value();
+\ket
+\ket
+\end{indentingcode}
+The purpose of this example is simply to show how to use an event
+driven parser.  Of course it would be possible, as far as this example
+is concerned, to concatenate the strings and use pointer input
+instead.  A problem sufficiently complex to \emph{require} an event
+driven parser would be too complex to serve as a simple example.
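+To experiment with the calling protocol without generating a parser,
+one can use a hand-written stand-in.  In the sketch below,
+\agcode{init{\us}demo} and \agcode{demo} only imitate an event driven
+parser: they accept a run of decimal digits terminated by a newline.
+These names, the control block layout and the toy grammar are
+assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { AG_RUNNING_CODE = 0, AG_SUCCESS_CODE = 1, AG_SYNTAX_ERROR_CODE = 2 };

/* Stand-in for a generated parser control block (illustrative only). */
static struct {
    int input_code;
    int exit_flag;
    long value;
    int digits;
} demo_pcb;
#define PCB demo_pcb

/* Stand-in for the generated initializer. */
static void init_demo(void)
{
    PCB.exit_flag = AG_RUNNING_CODE;
    PCB.value = 0;
    PCB.digits = 0;
}

/* Stand-in for the generated parser: consumes exactly one input
 * character per call and returns as soon as it needs more input. */
static void demo(void)
{
    int c = PCB.input_code;

    if (c >= '0' && c <= '9') {
        PCB.value = PCB.value * 10 + (c - '0');
        PCB.digits++;
    } else if (c == '\n' && PCB.digits > 0) {
        PCB.exit_flag = AG_SUCCESS_CODE;
    } else {
        PCB.exit_flag = AG_SYNTAX_ERROR_CODE;
    }
}

/* Feed the concatenation of the strings to the parser, one character
 * at a time, stopping as soon as the parser stops running. */
static long feed_strings(const char *const *strings, int n_strings,
                         int *error)
{
    int i;

    assert(strings != NULL && error != NULL);
    init_demo();
    for (i = 0; i < n_strings && PCB.exit_flag == AG_RUNNING_CODE; i++) {
        const unsigned char *p = (const unsigned char *)strings[i];
        while (PCB.exit_flag == AG_RUNNING_CODE && *p != '\0') {
            PCB.input_code = *p++;
            demo();
        }
    }
    *error = PCB.exit_flag != AG_SUCCESS_CODE;
    return PCB.value;
}
```

+If the input runs out while the stand-in is still running, the error
+flag is set, because the parse never reached
+\agcode{AG{\us}SUCCESS{\us}CODE}.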
+\subsection{Token Input}
+\index{Token input}\index{Input procedures}
+Thus far in this chapter, we have assumed that the input to your
+parser consisted of ordinary characters.  There are many situations
+where it is convenient to have a
+\index{Preprocessor}\index{Token}\index{Token}preprocessor, or
+\index{Lexical scanner}lexical scanner, which identifies basic tokens
+and hands them over to your parser for further processing.  Accepting
+input from such preprocessors is discussed in the remainder of this
+section.
+Sometimes preprocessors simply pass on text characters, acting as
+filters to remove unwanted characters, such as white space or
+comments, and to insert other text, such as macro expansions.  In such
+situations, there is no need to treat the preprocessor differently
+from any other character source.  The input methods described above
+are sufficient to deal with the input provided by the preprocessor.
+In what follows, we deal with situations where the preprocessor passes
+on \index{Token number}\index{Token}\index{Number}\agterm{token
+numbers} rather than character codes.  The preprocessor may also pass
+on token \emph{values}, which need accommodation of some sort.
+There are two principal interfacing problems to deal with.  The first
+has to do with identifying the tokens to your parser.  The second has
+to do with providing the semantic values of the tokens.
+%
+%If your preprocessor does not provide values with its tokens, your parser
+%may use any of the input techniques described above for character input,
+%the only difference being that instead of setting PCB.input{\us}code to a
+%character value, you set it to the token identifier.
+%
+%If your preprocessor does provide token values, then you have to use either
+%a GET{\us}INPUT macro, or configure your parser to be event driven.  If you wish
+%to use pointer input, you must provide an INPUT{\us}CODE macro.
+%
+\subsection{Identifying Tokens using Predefined Token Numbers}
+\index{Token}\index{Number}\index{Token number}
+If you have a pre-existing \index{Lexical scanner}lexical scanner,
+written for use with some other parsing system, it probably outputs
+its own set of token numbers.  The most robust way of interfacing such
+a lexical scanner is to include, in your syntax file, either an
+\index{Enum statement}\agparam{enum} statement or a set of definition
+statements
+for the terminal tokens, equating
+\index{Terminal token}\index{Token}terminal token names with the
+numeric values output by the lexical scanner, so that AnaGram treats
+them as character codes.  In this situation, you simply set
+\index{PCB}\index{input{\us}code}\agcode{PCB.input{\us}code} to the token
+number determined by the lexical scanner.  Generally, lexical scanners
+written for other parsing systems expect to be called for each token.
+Therefore, you would normally use a \agcode{GET{\us}INPUT} macro to call
+the lexical scanner and provide input to your parser.
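+A sketch of this arrangement follows.  The scanner
+\agcode{next{\us}token}, its token numbers and the control block are all
+hand-written stand-ins for the illustration; a real syntax file would
+equate its terminal token names with the same numbers.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token numbers produced by a pre-existing scanner. */
enum { TOK_END = 0, TOK_NUMBER = 257, TOK_IDENT = 258 };

/* Stand-in for the generated parser control block (illustrative only). */
static struct { int input_code; } demo_pcb;
#define PCB demo_pcb

/* Position of the stand-in scanner in its input text. */
static const unsigned char *scan_cursor;

/* Stand-in scanner, called once per token: skips blanks, classifies a
 * run of digits or an identifier, and passes any other character
 * through as its own code. */
static int next_token(void)
{
    assert(scan_cursor != NULL);
    while (*scan_cursor == ' ') scan_cursor++;
    if (*scan_cursor == '\0') return TOK_END;
    if (isdigit(*scan_cursor)) {
        while (isdigit(*scan_cursor)) scan_cursor++;
        return TOK_NUMBER;
    }
    if (isalpha(*scan_cursor)) {
        while (isalnum(*scan_cursor)) scan_cursor++;
        return TOK_IDENT;
    }
    return *scan_cursor++;
}

/* The parser asks for input; the macro calls the scanner. */
#define GET_INPUT (PCB.input_code = next_token())
```

+Each time the parser needs input, the macro calls the scanner once and
+stores the resulting token number in \agcode{PCB.input{\us}code}.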
+% XXX as far as I know, lex expects to call yacc, not vice versa.
+\subsection{Identifying Tokens using AnaGram's Token Numbers}
+If you are writing a new preprocessor, you have more freedom.  You
+could simply create a set of codes as above.  On the other hand, you
+can save a level of translation and make your system run faster by
+providing your parser with internal token numbers directly.  Here's
+what you have to do.
+First, when you write your syntax file, leave all the terminal tokens
+undefined.  That means, of course, that you have to have a name for
+each terminal token.  You can't use a literal character or a number
+for the token.  AnaGram will generate a unique token number for each
+token in your grammar.  In the header file it generates, AnaGram
+always provides a set of
+\index{Enumeration constants}\index{Constants}enumeration constants
+for all the named tokens in your grammar.  The names for these
+constants are controlled by the
+\index{Configuration parameters}\index{Enum constant name}
+\agparam{enum constant name}
+parameter.  (See Appendix A.) These constants normally have the form
+\agcode{\textit{$<$parser name$>$}{\us}\textit{$<$token name$>$}{\us}token}.
+Note that embedded spaces in the token name are replaced with
+underscore characters.  Assume your parser is called \agcode{ana}, and
+in your grammar you have a token called \agcode{integer constant}.
+The enumeration constant identifying the token is then
+\agcode{ana{\us}integer{\us}constant{\us}token}.  Now, to hand off an integer
+constant to your parser you write:
+\begin{indentingcode}{0.4in}
+PCB.input{\us}code = ana{\us}integer{\us}constant{\us}token;
+\end{indentingcode}
+\subsection{Providing Token Values}
+If your \index{Preprocessor}preprocessor provides \index{Semantic
+value}\index{Token}\index{Value}semantic values for input tokens, you
+must inform AnaGram by setting the
+\index{Input values}\index{Configuration switches}\index{Value}
+\agparam{input values}
+configuration switch in your syntax file. Then, whenever you provide a
+token, you must also store a value in
+\index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}.
+You can do this as part of your \agcode{GET{\us}INPUT} macro, or, if you
+have an \agparam{event driven} parser, when you set
+\index{input{\us}code}\index{PCB}\agcode{PCB.input{\us}code} prior to
+calling the parser function.  If you are using \index{Pointer
+input}\index{Configuration switches}\agparam{pointer input}, the
+pointer will presumably identify the token value.  You must provide an
+\index{INPUT{\us}CODE}\index{Macros}\agcode{INPUT{\us}CODE} macro to extract
+the identification code from the token value.  For example, if the
+token value is a structure and the appropriate member field is called
+\agcode{id}, you would write:
+\begin{indentingcode}{0.4in}
+\#define INPUT{\us}CODE(t) (t).id
+\end{indentingcode}
+Generally, the simplest way to interface the preprocessor and your
+parser, when you are passing token values, is to use an event driven
+parser.  In this situation, the preprocessor, when it identifies a
+token, simply loads the token identifier into
+\agcode{PCB.input{\us}code}, loads the value into
+\index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}, and calls
+the parser.
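+The hand-off to an event driven parser might look like the following
+sketch.  The control block, the token numbers and the stub standing in
+for the parser function are all assumptions made so that the fragment
+compiles on its own; in practice they come from the files AnaGram
+generates.

```c
#include <assert.h>

/* Stand-in control block; the field names follow the text above. */
struct demo_pcb { int input_code; long input_value; };
static struct demo_pcb ana_pcb;
#define PCB ana_pcb

/* Hypothetical token numbers; the generated header would supply these. */
enum { ana_integer_constant_token = 1, ana_eof_token = 2 };

/* Stub in place of the generated parser: it just sums integer constants. */
static long total;
static void ana(void) {
    assert(PCB.input_code == ana_integer_constant_token ||
           PCB.input_code == ana_eof_token);
    if (PCB.input_code == ana_integer_constant_token) total += PCB.input_value;
}

/* What the preprocessor does each time it has identified a token. */
static void deliver(int code, long value) {
    PCB.input_code = code;    /* token identifier */
    PCB.input_value = value;  /* semantic value */
    ana();                    /* one call to the parser per token */
}
```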
+\index{Token}
+If the values of your input tokens are all of the same type, and that
+type is not \agcode{int}, you must
+set the
+\index{Default input type}\index{Configuration parameters}
+\index{Input type}\agparam{default input type}
+configuration parameter so that AnaGram can declare
+\index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}
+appropriately.  \index{Token type}\agparam{Default input type} will
+default to \agcode{int} if you do not set it either in your configuration file
+or in your syntax file.
+Some \index{Lexical scanner}lexical scanners simply provide a pointer
+to the text of the token they have identified.  In this situation, you
+would set \agparam{default input type} to \agcode{char *}.  When you
+provide a token to the parser you would set \agcode{PCB.input{\us}value}
+to point to the text of the token.
+If different tokens have values of different types, the situation
+becomes slightly more complex.  First, you must tell AnaGram about the
+types of your input tokens.  You do this by including a
+\index{Declaration}\index{Type declarations}\agterm{type declaration}
+in your syntax file.  A type declaration is a token declaration
+preceded by a C data type\index{Data type}\index{Token} in
+parentheses.  Assume that your \index{Preprocessor}preprocessor
+identifies, among others, the following tokens: \agcode{name},
+\agcode{string}, \agcode{real constant}, \agcode{integer constant},
+and \agcode{unsigned constant}.  You might then include the following
+in your syntax file:
+\begin{indentingcode}{0.4in}
+{}[
+input values
+]
+(char *) name, string
+(double) real constant
+(long) integer constant, unsigned constant
+\end{indentingcode}
+AnaGram will then create, in the parser control block, an input value
+field which can accommodate any of these terminal tokens in your
+grammar.
+To enable you to store data into the input value field of the parser
+control block, AnaGram provides a convenient macro called
+\index{INPUT{\us}VALUE}\index{Macros}\agcode{INPUT{\us}VALUE} to serve as
+the destination of an assignment statement.  \agcode{INPUT{\us}VALUE}
+takes the type of the data as a parameter.  Thus one could write:
+\begin{indentingcode}{0.4in}
+INPUT{\us}VALUE(char *) = text{\us}pointer;
+INPUT{\us}VALUE(long) = constant{\us}value;
+\end{indentingcode}
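+One way to picture what happens is sketched below.  The union, the
+control block and the definition of \agcode{INPUT{\us}VALUE} shown here
+are assumptions chosen for illustration and need not match what
+AnaGram actually generates; they show only how a single field can be
+used as the destination for values of several types.

```c
#include <assert.h>

/* Stand-in input value field, large enough for any declared type. */
union demo_value { char *text; double real; long integer; };
struct demo_pcb { int input_code; union demo_value input_value; };
static struct demo_pcb demo_pcb;
#define PCB demo_pcb

/* Assumed shape of the macro: treat the field as the requested type. */
#define INPUT_VALUE(type) (*(type *) &PCB.input_value)

/* Store the text of a name or string token. */
static void put_text(char *s) {
    assert(s != 0);
    INPUT_VALUE(char *) = s;
}

/* Store the value of an integer or unsigned constant. */
static void put_long(long v) {
    INPUT_VALUE(long) = v;
}
```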
+\section{Error Handling}
+There are two classes of errors your parser needs to be able to deal
+with.  The first consists of \agterm{implementation errors} and the second
+consists of \agterm{syntax errors}.  Syntax errors arise because the input to
+the parser does not conform to the definition of the language it is
+designed to parse.  Implementation errors arise because the programs
+we write are never perfect and because the environment in which our
+programs run is often something less than ideal.
+\subsection{Implementation Errors}
+\index{Implementation errors}\index{Errors}
+% XXX parser stack overflow is not really an ``implementation error''
+There are two implementation errors which your parser needs to be able
+to deal with.  The first is \agterm{parser stack overflow}.  The
+second comes from a bad \agterm{reduction token}.
+\index{Stack}
+\paragraph{Stack Overflow.}
+Stack overflow is an error which your parser must be able to deal
+with.  In general, no matter how big you make your parser stack, it is
+possible for legitimate input to cause it to overflow.  The size of
+the stack for your parser is controlled by the configuration parameter
+\agparam{parser stack size}.  This parameter defaults to a value of
+32.  This value has been found to be adequate for ordinary usage.
+If your parser has only left recursive constructs, then there is a
+maximum depth beyond which the parser stack will never grow.  If your
+parser has center recursive or right recursive productions, then no
+matter how much stack space you allocate, there will always be a
+syntactically correct input file which causes the stack to overflow.
+This can be illustrated by the following set of C statements:
+\begin{indentingcode}{0.4in}
+x = y;
+x = (y);
+x = ((y));
+x = (((y)));
+.
+.
+.
+\end{indentingcode}
+Each set of parentheses requires another level on the parser stack.
+When this set of statements was tried with Borland C++, it ran out of
+stack space at 127 sets of parentheses and diagnosed the problem as
+``Expression is too complicated''.
+AnaGram calculates the actual size of the parser stack by calculating
+the maximum depth for left recursive constructs and adding half the
+value of
+\index{Parser stack size}\index{Configuration parameters}\index{Stack}
+\index{Parser state stack}\index{State stack}
+\agparam{parser stack size}.  It then uses the larger of the calculated
+value and \agparam{parser stack size} to allocate stack storage.  You
+may check the value actually used in your parser by inspecting the
+definition of
+\index{AG{\us}PARSER{\us}STACK{\us}SIZE}\agcode{AG{\us}PARSER{\us}STACK{\us}SIZE}.
+If your parser runs out of stack space, it will set
+\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to
+\index{AG{\us}STACK{\us}ERROR{\us}CODE}\agcode{AG{\us}STACK{\us}ERROR{\us}CODE}, invoke
+the
+\index{Macros}\index{PARSER{\us}STACK{\us}OVERFLOW}\agcode{PARSER{\us}STACK{\us}OVERFLOW}
+macro and return to the calling program.  The default definition of
+this macro is:
+\begin{indentingcode}{0.4in}
+\#define PARSER{\us}STACK{\us}OVERFLOW \bra fprintf(stderr, {\bs}
+"{\bs}nParser stack overflow{\bs}n"); \ket
+\end{indentingcode}
+If this definition is not consistent with your needs, you may provide
+your own definition in any block of embedded C in your syntax file.
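+As an illustration, a definition of your own might count overflows and
+report where they occurred.  In this sketch the control block is a
+stand-in, \agcode{overflow{\us}count} is a hypothetical variable
+belonging to your application, and the use of \agcode{PCB.line}
+assumes that line tracking is enabled.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in control block; AnaGram generates the real one. */
struct demo_pcb { int line; };
static struct demo_pcb demo_pcb;
#define PCB demo_pcb

static int overflow_count;   /* hypothetical counter kept by the application */

/* Your own definition: say where the overflow happened and count it. */
#define PARSER_STACK_OVERFLOW { \
    overflow_count++; \
    fprintf(stderr, "\nInput nested too deeply near line %d\n", PCB.line); \
}

/* Mimics the point in the parser at which the macro is invoked. */
static void simulate_overflow(int line) {
    assert(line >= 1);
    PCB.line = line;
    PARSER_STACK_OVERFLOW
}
```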
+\index{Reduction token error}
+\paragraph{Reduction Token Error.}
+A properly functioning parser should never encounter a reduction token
+error.  Therefore, reduction token errors should be taken quite
+seriously.  The only way to cause a reduction token error in an
+otherwise properly functioning parser is to set the reduction token
+for a semantically determined production incorrectly.
+Before your parser calls a reduction procedure, it stores the token
+number of the token to which the production would normally reduce in
+\index{reduction{\us}token}\index{PCB}\agcode{PCB.reduction{\us}token}.  If
+the production is a semantically determined production, you may, in
+your reduction procedure, change the value of
+\agcode{PCB.reduction{\us}token} to one of the alternative tokens on
+the left side of the production.  When your reduction procedure
+returns, your parser checks to verify that
+\agcode{PCB.reduction{\us}token} is a valid token number for the
+current state of the parser.  If it is not, it sets
+\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to
+\index{AG{\us}REDUCTION{\us}ERROR{\us}CODE}\agcode{AG{\us}REDUCTION{\us}ERROR{\us}CODE}
+and invokes
+\index{REDUCTION{\us}TOKEN{\us}ERROR}\index{Macros}\agcode{REDUCTION{\us}TOKEN{\us}ERROR}.
+The default definition of this macro is:
+\begin{indentingcode}{0.4in}
+\#define REDUCTION{\us}TOKEN{\us}ERROR \bra fprintf(stderr,{\bs}
+"{\bs}nReduction{\us}token error{\bs}n"); \ket
+\end{indentingcode}
+\subsection{Syntax Errors}
+\index{Syntax error}\index{Errors}
+If the input data to your parser does not conform to the rules you
+have specified in your grammar, your parser will detect a syntax
+error.  There are two basic aspects of dealing with syntax errors:
+\index{Error diagnosis}\agterm{diagnosing} the error and
+\agterm{recovering} from the error, that is, restarting the parse, or
+``resynchronizing'' the parser.
+If you use the default settings for syntax error handling, then on
+encountering a syntax error your parser will call a diagnostic
+procedure which will create an error message and store a pointer to it
+in
+\index{Error messages}\index{error{\us}message}\index{PCB}
+\agcode{PCB.error{\us}message}.
+Then, it will set
+\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to
+\index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE} and
+call a macro called
+\index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR}.  The
+default definition of \agcode{SYNTAX{\us}ERROR} will print the error
+message on \agcode{stderr}.  Finally, in lieu of trying to continue
+the parse, it will return to the calling program.
+AnaGram has several options which allow you to tailor diagnostic
+messages to your requirements or help you to create your own.  It also
+provides several options for continuing the parse.
+The options available to help you diagnose errors are:
+\begin{itemize}
+\item line and column tracking
+\item creation of a diagnostic message
+\item identification of the error frame
+\end{itemize}
+\index{Numbers}\index{Lines and columns}\index{Configuration switches}
+\paragraph{Line and Column Tracking.}
+Your parser will automatically track lines and columns in its input if
+the \agparam{lines and columns} configuration switch is on.  Since
+this is a common requirement, \agparam{lines and columns} defaults to
+on.  If you don't want your parser to spend time counting lines and
+columns you should turn the switch off, thus:
+\begin{indentingcode}{0.4in}
+\agcode{
+\~{}lines and columns
+}
+\end{indentingcode}
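+Normally, if you are using a \index{Lexical scanner}lexical scanner,
+you would turn \agparam{lines and columns} off, since the parser then
+sees token codes rather than the characters of the original text and
+its counts would not correspond to positions in that text.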
+The line and column counts are maintained in
+\index{line}\index{PCB}\agcode{PCB.line} and
+\index{column}\index{PCB}\agcode{PCB.column} respectively.
+\agcode{PCB.line} and \agcode{PCB.column} are initialized with the
+values of the \index{FIRST{\us}LINE}\index{Macros}\agcode{FIRST{\us}LINE}
+and \index{Macros}\index{FIRST{\us}COLUMN}\agcode{FIRST{\us}COLUMN} macros
+respectively.  These macros provide default initial values of 1 for
+both line and column numbers.  To override these definitions, simply
+include definitions for these macros in your syntax file.  If tab
+characters are encountered, they are expanded in accordance with the
+\index{Tab spacing}\agparam{tab spacing} parameter.
+When your parser is executing a reduction procedure, \agcode{PCB.line} and
+\agcode{PCB.column} refer to the first input character following the
+rule that is being reduced.  When your parser has encountered a syntax
+error, and is executing your \agcode{SYNTAX{\us}ERROR} macro,
+\agcode{PCB.line} and \agcode{PCB.column} refer to the erroneous input
+character.
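+The bookkeeping involved can be sketched as follows.  This is not
+AnaGram's own code: the structure, the function names and the fixed
+\agcode{TAB{\us}SPACING} value are assumptions, and the tab rule shown
+(advance to the next tab stop) is one plausible reading of how
+\agparam{tab spacing} is applied.

```c
#include <assert.h>

#define FIRST_LINE 1
#define FIRST_COLUMN 1
#define TAB_SPACING 8   /* stands in for the tab spacing parameter */

struct position { int line, column; };

/* Advance the position past one input character. */
static void track(struct position *p, int c) {
    assert(p != 0);
    if (c == '\n') {
        p->line++;
        p->column = FIRST_COLUMN;
    } else if (c == '\t') {
        /* move to the next tab stop */
        p->column += TAB_SPACING - (p->column - FIRST_COLUMN) % TAB_SPACING;
    } else {
        p->column++;
    }
}

/* Position of the first character following the text s. */
static struct position position_after(const char *s) {
    struct position p = { FIRST_LINE, FIRST_COLUMN };
    for (; *s; s++) track(&p, (unsigned char) *s);
    return p;
}
```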
+\paragraph{Diagnostic Messages.}
+If the \index{Diagnose errors}\index{Configuration switches}
+\agparam{diagnose errors} switch is on, its default setting, AnaGram
+will include an error diagnostic procedure in your parser.  When your
+parser encounters a syntax error, this procedure will create a simple
+diagnostic message and store a pointer to it in
+\index{error{\us}message}\index{PCB}\agcode{PCB.error{\us}message} before
+your \agcode{SYNTAX{\us}ERROR} macro is executed.  The default definition
+of \agcode{SYNTAX{\us}ERROR} prints this message on \agcode{stderr}.
+If your parser was in a state where there was a single input character
+expected or a simple named token expected, it will create a message of
+the form:
+\begin{indentingcode}{0.4in}
+Missing ';'
+\end{indentingcode}
+or
+\begin{indentingcode}{0.4in}
+Missing semicolon
+\end{indentingcode}
+If there was more than one possible input, your parser will check to
+see if it can identify the erroneous input.  If it can, it will create
+a message of the form:
+\begin{indentingcode}{0.4in}
+Unexpected ';'
+\end{indentingcode}
+or
+\begin{indentingcode}{0.4in}
+Unexpected semicolon
+\end{indentingcode}
+Otherwise, the diagnostic message will be simply:
+\begin{indentingcode}{0.4in}
+Unexpected input
+\end{indentingcode}
+If you do not need a diagnostic message, or choose to create your own,
+you should turn \agparam{diagnose errors} off.
+% XXX Somewhere there should be a discussion of what ``creating your
+% own'' would entail.
+\index{Error frame}
+\paragraph{Error Frame.}
+Often it is desirable to know the ``frame'' of an error, that is, what
+the parser thought it was doing when it encountered the error.  If,
+for instance, you forget to terminate a comment in a C program, your C
+compiler sees an unexpected end of file.  When you look simply at the
+alleged error, of course, you can't see any problem.  In order to
+understand the error, you need to know that the parser was trying to
+find a complete comment.  In this case, we can say that the comment is
+the ``frame'' of the error.
+AnaGram provides an optional facility in its error diagnostic
+procedure, controlled by the
+\index{Error frame}\index{Configuration switches}\agparam{error frame}
+switch, for identifying the frame of a syntax error.  The
+\agparam{diagnose errors} switch must also be on to enable the
+diagnostic procedure.  If you enable \agparam{error frame} in your
+syntax file, AnaGram will include a procedure which will scan
+backwards on the state stack looking for the frame of the error.  When
+it finds what appears to be the error frame, it will store the stack
+index in
+\index{error{\us}frame{\us}ssx}\index{PCB}\agcode{PCB.error{\us}frame{\us}ssx} and
+the token number of the nonterminal token the parser was looking for
+in
+\index{error{\us}frame{\us}token}\index{PCB}\agcode{PCB.error{\us}frame{\us}token}.
+%
+% XXX. Why is the discussion of ``hidden'' inside the discussion of
+% ``error frame''? hidden applies to ordinary error diagnosis also.
+%
+% Furthermore, this discussion of error frame needs an example, or
+% nobody will ever figure out how to do it.
+%
+If, in your grammar, there are nonterminal tokens that are not
+suitable for diagnostic use, usually because they name an intermediate
+stage in the parse that means nothing to your user, you can make sure
+that AnaGram ignores them in doing its analysis by declaring them as
+\index{Declaration}\index{Hidden declaration}\agparam{hidden}.  To
+declare tokens as hidden, include a \agparam{hidden} declaration in a
+configuration section.  (See Chapter 8.) For instance, consider:
+\begin{indentingcode}{0.4in}
+comment
+-> comment head, "*/"
+comment head
+-> "/*"
+-> comment head, \~{}end of file
+{}[ hidden \bra comment head \ket ]
+\end{indentingcode}
+We mark comment head as hidden, because we only wish to talk about
+complete comments with our users.
+In order to use the error frame effectively in your diagnostics, you
+need to have an ASCII representation of the name of the token as well
+as its token number.  If you turn the
+\index{Token names}\index{Configuration switches}\agparam{token names}
+configuration switch on in your syntax file, AnaGram will provide an
+array of ASCII strings, indexed by token number, which you may use in
+your diagnostics.  The name of the array is created by appending
+\agcode{{\us}token{\us}names} to the name of your parser.  If your parser is
+called \agcode{ana}, your token name array will have the name
+\agcode{ana{\us}token{\us}names}.  As a convenience, AnaGram
+also defines a macro,
+\index{TOKEN{\us}NAMES}\index{Macros}\agcode{TOKEN{\us}NAMES}, which
+evaluates to the name of the token name array.  Note that
+\agparam{token names}
+controls the generation of an array of ASCII strings and should not be
+confused with the \agcode{typedef enum} statement in the parser header
+file which provides you with a set of enumeration constants.
+% XXX maybe it means the *strings* should not be confused?
+If you are tracking context, using the techniques described below, you
+can use the macro
+\index{ERROR{\us}CONTEXT}\index{Macros}\agcode{ERROR{\us}CONTEXT} or
+\index{PERROR{\us}CONTEXT}\index{Macros}\agcode{PERROR{\us}CONTEXT} to
+determine the context of the error frame token.
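+To make the use of the error frame more concrete, here is a minimal
+sketch of a diagnostic that names the frame.  The control block, the
+contents of the token name array and the function
+\agcode{describe{\us}error} are stand-ins invented for this
+illustration; it assumes that \agparam{diagnose errors},
+\agparam{error frame}, \agparam{token names} and
+\agparam{lines and columns} are all on.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-ins for generated items; the real ones come from AnaGram's output. */
struct demo_pcb { const char *error_message; int error_frame_token; int line; };
static struct demo_pcb ana_pcb;
#define PCB ana_pcb
static const char *ana_token_names[] = { "", "comment", "statement" };
#define TOKEN_NAMES ana_token_names

/* Writes "<message> in <frame>, line <n>" into buf; returns its length. */
static int describe_error(char *buf, size_t size) {
    assert(buf != 0 && size > 0);
    return snprintf(buf, size, "%s in %s, line %d",
                    PCB.error_message,
                    TOKEN_NAMES[PCB.error_frame_token],
                    PCB.line);
}
```

+A \agcode{SYNTAX{\us}ERROR} macro of your own could call such a
+function and print the result.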
+\index{SYNTAX{\us}ERROR}\index{Macros}
+\paragraph{SYNTAX{\us}ERROR Macro.}
+When your parser finds a syntax error, it first executes any of the
+diagnostic procedures described above that you have enabled, sets
+\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to
+\index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE},
+and then invokes the \agcode{SYNTAX{\us}ERROR} macro.  If you have not
+defined \agcode{SYNTAX{\us}ERROR}, AnaGram defines it as follows when the
+\index{Lines and columns}\index{Configuration switches}
+\agparam{lines and columns} switch is on:
+\begin{indentingcode}{0.4in}
+\#define SYNTAX{\us}ERROR {\bs}
+fprintf(stderr,"\%s,line \%d,column \%d{\bs}n", {\bs}
+PCB.error{\us}message, PCB.line, PCB.column)
+\end{indentingcode}
+and as follows when it is off:
+\begin{indentingcode}{0.4in}
+\#define SYNTAX{\us}ERROR {\bs}
+fprintf(stderr, "\%s{\bs}n", PCB.error{\us}message)
+\end{indentingcode}
+In most circumstances, you will probably want to write your own
+\agcode{SYNTAX{\us}ERROR} macro, since this diagnostic is one your users
+will see with some frequency.
+% XXX yes and why exactly? is there something we have in mind better
+% than just printing PCB.error_message?
+The default macro simply returns to the parser.  Your macro doesn't
+have to.  If you wish, you could call \agcode{abort} or \agcode{exit}
+directly from the macro.  If the \agcode{SYNTAX{\us}ERROR} macro returns
+control to the parser, subsequent events depend on your choices for
+error recovery.
+\section{Error Recovery}
+\index{Error recovery}\index{Syntax error}\index{Errors}
+Syntax errors can be caused by any of a number of problems.  Some come
+from simple typographic errors: the user skips a character or types
+the wrong one.  Others come from true errors: the user types something that
+might be correct in its place, but in context is totally wrong.
+Usually, if your parser is reading a file, you will want to continue
+parsing the input, checking for other syntax errors at the very least.
+The problem with doing this is getting the parser restarted, or
+``resynchronized'', in some reasonable manner.
+AnaGram provides a number of ways for your parser to recover from a
+syntax error.  The least graceful, of course, is simply to call
+\agcode{abort} or \agcode{exit} from the \agcode{SYNTAX{\us}ERROR} macro.
+If you don't do this you have several options:
+\begin{itemize}
+\item error token resynchronization
+\item auto resynchronization
+\item simple return to calling program
+\item ignore the error
+\end{itemize}
+\subsection{Error Token Resynchronization}
+\index{Resynchronization}
+When AnaGram builds your parser it checks to see if you have used a
+token called \agcode{error} in your grammar or if you have assigned a
+token name as the value of the configuration parameter
+\index{Error token}\index{token}\index{Configuration parameters}
+\agparam{error token}.  If so, it includes a call to an error token
+resynchronization procedure immediately after the invocation of
+\index{SYNTAX{\us}ERROR}\agcode{SYNTAX{\us}ERROR}.  The error token
+resynchronization procedure works in the following way: It scans the
+state stack backwards looking for the most recent state in which
+\agcode{error} or the token named by \agparam{error token} was valid
+input.  It then truncates the stack to this level, and jumps to the
+state indicated by the error token.  It then passes over any input it
+sees until it sees valid input for the state in which it finds itself.
+At this point, it returns to the parser which continues as though
+nothing had happened.  Since this is substantially easier than it
+sounds, let's look at an example.  Suppose we are writing a C
+compiler, and we wish to catch errors in ordinary statements.  We add
+the following production to our grammar:
+\begin{indentingcode}{0.4in}
+statement
+-> error, ';'
+\end{indentingcode}
+Now, if the parser encounters a syntax error at any time while it is
+parsing any statement, it will pop back to the most recent state where
+it was looking for a statement, jump forward to the state indicated by
+the token \agcode{error} in the new production, and then skip input
+until it sees a semicolon.  At this point it will continue a normal
+parse.  The effect of continuing at this point is to recognize and
+reduce the above production, i.e., the parser will proceed as if it
+had found a complete, correct ``statement''.  This production could
+even have a reduction procedure to do any clean-up that an error might
+require.
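+The procedure just described can be sketched as a small simulation.
+The three-state tables, the token codes and the function
+\agcode{resynchronize} are invented for this illustration and are far
+simpler than anything AnaGram generates; the sketch shows only the
+order of the steps and the role an end of file plays in stopping the
+scan.

```c
#include <assert.h>

#define NO_STATE (-1)
#define STACK_MAX 16
enum { TOK_EOF, TOK_SEMI, TOK_OTHER };   /* hypothetical token codes */

/* Hypothetical tables for a toy parser.  State 0 expects a statement,
   so "error" is valid there and leads to state 2, which accepts ';'. */
static const int error_goto[3] = { 2, NO_STATE, NO_STATE };
static int accepts(int state, int token) {
    return state == 2 && token == TOK_SEMI;
}

struct parser {
    int stack[STACK_MAX];
    int depth;          /* number of states on the stack */
    const int *input;   /* next unread token */
};

/* Returns 1 if resynchronized, 0 if no recovery point or input ran out. */
static int resynchronize(struct parser *p) {
    assert(p != 0 && p->depth > 0 && p->depth < STACK_MAX);
    /* 1. Scan back for the most recent state where "error" is valid. */
    while (p->depth > 0 && error_goto[p->stack[p->depth - 1]] == NO_STATE)
        p->depth--;
    if (p->depth == 0) return 0;
    /* 2. Truncate to that level and jump to the state "error" leads to. */
    p->stack[p->depth] = error_goto[p->stack[p->depth - 1]];
    p->depth++;
    /* 3. Pass over input until something valid in this state turns up. */
    while (!accepts(p->stack[p->depth - 1], *p->input)) {
        if (*p->input == TOK_EOF) return 0;   /* end of file stops the scan */
        p->input++;
    }
    return 1;
}
```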
+If you use error token resynchronization, you must identify an end of
+file token to guarantee that the resynchronization procedure can
+always terminate.  To do this, either name your end of file token
+\agcode{eof} or use the
+\index{Eof token}\index{Configuration parameters}\index{Token}
+\agparam{eof token} configuration parameter to specify it.
+For example, if your parser is reading conventional stream input, the
+end of file will be denoted by a $-1$ value.  You can define the end
+of file token thus:
+\begin{indentingcode}{0.4in}
+eof = -1
+\end{indentingcode}
+% XXX as ``finally'' means something in Java, let's change this to
+% ``at last''
+On the other hand, if you have already defined a token named
+\agcode{finally}, you can add the following line to any configuration
+segment:
+\begin{indentingcode}{0.4in}
+eof token = finally
+\end{indentingcode}
+The end of file token must be a terminal token.
+\subsection{Automatic Resynchronization}
+\index{Resynchronization}\index{Automatic resynchronization}
+If you have not specified an \agcode{error} token in your syntax file,
+AnaGram checks to see if you have turned on the
+\index{Auto resynch}\index{Configuration switches}
+\agparam{auto resynch} configuration switch.
+If so, it includes a call to an automatic resynchronization procedure
+immediately after the call to \agcode{SYNTAX{\us}ERROR}.  The automatic
+resynchronization procedure uses a heuristic based on your grammar to
+get back in step with the input.  To use it you need do only two
+things: You need to turn on the \index{Auto resynch}\agparam{auto
+resynch} switch, and you need to specify an end of file token as for
+error token resynchronization, above.
+The primary advantage of the automatic resynchronization is that it is
+easy to use.  The disadvantage is that it turns off all reduction
+procedures, so that your parser is reduced to being a syntax checker
+after it encounters an error.  If your grammar uses semantically
+determined productions, your reduction procedures will not be invoked
+so the primary reduction token will be used in all cases.
+% XXX *why* does it do this?
+\subsection{Other Ways to Continue}
+% XXX the example of ``reading input from a keyboard'' should be
+% clarified to indicate that this means something like an application
+% where you press F10 for the menu, not typing at a command line.
+%
+If you do not wish to use either of the above resynchronization
+procedures, you still have a number of options.  If your parser is
+reading input from a keyboard, for instance, it is probably sufficient
+to ignore bad input characters.  You can do this by
+resetting \index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} to
+\agcode{AG{\us}RUNNING{\us}CODE} in your
+\index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR} macro.
+Your parser will then continue, passing over the bad input as though
+it had never occurred.  If you do this, you should, of course, notify
+your user somehow that you're skipping a character.  Issuing a beep on
+the computer's speaker from the \agcode{SYNTAX{\us}ERROR} macro is
+usually enough.
+If you do not wish to continue the parse, but want your main program
+to continue, you need do nothing special.  \agcode{PCB.exit{\us}flag} is
+set to \agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE} before the
+\agcode{SYNTAX{\us}ERROR} macro is called.  If your
+macro does not change \agcode{PCB.exit{\us}flag}, when it relinquishes
+control to your parser, your parser will simply return to the calling
+program.  The calling program can determine that the parse was
+unsuccessful by inspecting \agcode{PCB.exit{\us}flag} and take whatever
+action you deem appropriate.
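+Both choices can be seen in the following sketch, in which a toy
+``parser'' accepts a string of digits.  The control block, the
+\agcode{skip{\us}bad{\us}input} switch and the parsing loop are
+assumptions made for illustration, and the numeric values given to the
+exit flag codes are placeholders for those in the generated header
+file.

```c
#include <assert.h>

/* Exit flag codes; placeholder values, the real ones are generated. */
enum { AG_RUNNING_CODE, AG_SUCCESS_CODE, AG_SYNTAX_ERROR_CODE };

struct demo_pcb { int exit_flag; int input_code; };
static struct demo_pcb demo_pcb;
#define PCB demo_pcb

static int skip_bad_input;   /* hypothetical application policy switch */
static int errors_seen;

/* Count the error; optionally tell the parser to carry on regardless. */
#define SYNTAX_ERROR { \
    errors_seen++; \
    if (skip_bad_input) PCB.exit_flag = AG_RUNNING_CODE; \
}

/* Toy parser: accepts a string of digits, one character per token. */
static int parse_digits(const char *s) {
    assert(s != 0);
    PCB.exit_flag = AG_RUNNING_CODE;
    while (PCB.exit_flag == AG_RUNNING_CODE) {
        PCB.input_code = (unsigned char) *s++;
        if (PCB.input_code == 0) {
            PCB.exit_flag = AG_SUCCESS_CODE;
        } else if (PCB.input_code < '0' || PCB.input_code > '9') {
            PCB.exit_flag = AG_SYNTAX_ERROR_CODE;  /* set before the macro */
            SYNTAX_ERROR
        }
    }
    return PCB.exit_flag;   /* the caller inspects this to judge the parse */
}
```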
+\section{Advanced Techniques}
+\subsection{Semantically Determined Productions}
+\index{Semantically determined production}\index{Production}
+A semantically determined production is one which has more than one
+token on the left side.  The reduction procedure then determines which
+token has in fact been identified, using whatever criteria are
+necessary.  In some cases where the purpose is simply to provide
+multiple syntactic options to be chosen at execution time, the
+determination is made simply by interrogating a switch.  Other
+situations may require a more complex determination, such as a symbol
+table look-up, for instance.
+\index{Production}
+The tokens on the left side of the production can be used just like
+any other tokens in your grammar.  Their semantic values, however,
+must all be of the same \index{Data type}\index{Token}data type.
+Depending on how you have defined your grammar, it may be that
+whenever any one of the tokens on the left side is syntactically
+acceptable input, all the tokens on the left are syntactically
+acceptable.  That is, the production could reduce to any of the tokens
+on the left without causing an immediate error condition.  In many
+circumstances, however, this is not the case.  In a Pascal grammar,
+for example, a semantically determined production might be used to
+allow a reduction procedure to determine whether a particular
+identifier is a constant identifier, a type identifier, a variable
+identifier, and so on.  In any particular context, only a subset of the
+tokens on the left may be syntactically acceptable.
+Before your reduction procedure is called, your parser will set the
+reduction token to the first token on the left side which is
+syntactically correct. If you need to change this assignment you have
+several options.  From within your reduction procedure, you may simply
+set
+\index{reduction{\us}token}\index{PCB}\index{Token}\agcode{PCB.reduction{\us}token}
+to the semantically correct value.  For this purpose, it is convenient
+to use the token name enumeration constants provided in the header
+file for your parser.  Note that if you select a reduction token that
+is not syntactically correct, after your reduction procedure returns,
+your parser will encounter a \index{Reduction token
+error}\agterm{reduction token error}, described above.
+AnaGram provides several tools to help you set the reduction token
+correctly.  First, it provides a \agterm{change reduction} function
+which will set the reduction token to a specified token only if the
+specified token is syntactically correct.  It will return a flag to
+indicate the outcome: non-zero on success, zero on failure.  The name
+of this function is given by appending \agcode{{\us}change{\us}reduction} to
+the name of your parser.  Thus, if your parser is named \agcode{ana},
+the name of the function would be \agcode{ana{\us}change{\us}reduction}.  In
+those cases where the semantically correct reduction token is not
+syntactically correct, you will want to provide error diagnostics for
+your user.  If you wish the parse to continue, so you can check for
+further errors, you may simply return from the reduction procedure.  Since the
+default reduction is syntactically correct, the parse can continue as
+though there had been no error.
+To simplify use of the change reduction function, AnaGram provides a macro,
+\index{CHANGE{\us}REDUCTION}\index{Macros}\agcode{CHANGE{\us}REDUCTION}.
+Simply call the macro with the name of the desired token as the
+argument, replacing embedded blanks in the token name with
+underscores.
+For example, in writing a grammar for the C language, it is quite
+convenient to write the following production:
+\begin{indentingcode}{0.4in}
+identifier, typedef name
+-> name                = check{\us}typedef();
+\end{indentingcode}
+The reduction procedure can then check the symbol table to see
+whether the name that has been found is a typedef name.  If so, it can
+use the \agcode{CHANGE{\us}REDUCTION} macro to change the reduction token
+to \agcode{typedef name} and verify that this is acceptable:
+\begin{indentingcode}{0.4in}
+if (!CHANGE{\us}REDUCTION(typedef{\us}name)) diagnose{\us}error();
+\end{indentingcode}
+Note that the embedded space in the token name must be replaced with
+an underscore character.
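+A fuller sketch of such a reduction procedure is given below.  The
+symbol table, the token numbers, the stub for
+\agcode{ana{\us}change{\us}reduction} and the definition of
+\agcode{CHANGE{\us}REDUCTION} are assumptions standing in for what
+AnaGram generates; here the procedure takes the name as an argument
+and returns zero when a diagnostic is called for.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical token numbers; the generated header supplies the real ones. */
enum { ana_identifier_token = 1, ana_typedef_name_token = 2 };

struct demo_pcb { int reduction_token; };
static struct demo_pcb ana_pcb;
#define PCB ana_pcb

/* Simulates the parser state: is a typedef name acceptable here? */
static int typedef_name_allowed;

/* Stub for the generated function: change only if syntactically valid. */
static int ana_change_reduction(int token) {
    if (token == ana_typedef_name_token && !typedef_name_allowed) return 0;
    PCB.reduction_token = token;
    return 1;
}
/* Assumed form of the macro: build the constant name from the token name. */
#define CHANGE_REDUCTION(x) ana_change_reduction(ana_##x##_token)

/* Hypothetical symbol table holding a single typedef. */
static int is_typedef(const char *name) {
    return strcmp(name, "size_t") == 0;
}

/* Reduction procedure body; returns 0 when the caller should diagnose. */
static int check_typedef(const char *name) {
    assert(name != 0);
    if (!is_typedef(name)) return 1;          /* remains an identifier */
    return CHANGE_REDUCTION(typedef_name);    /* may be refused */
}
```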
+Under some circumstances, in your reduction procedure, you might wish
+to know precisely which reduction tokens are syntactically correct.
+For instance, you might wish, in an error diagnostic, to tell your
+user what you expected to see.  If you set the
+\index{Reduction choices}\index{Configuration switches}
+\agparam{reduction choices} switch,
+AnaGram will include in your parser file a function which will
+identify the acceptable choices for the reduction token in the current
+state.  The prototype of this function is:
+\begin{indentingcode}{0.4in}
+int \${\us}reduction{\us}choices(int *);
+\end{indentingcode}
+where ``\agcode{\$}'' represents the name of your parser.  You must provide an
+integer array whose length is at least the maximum number
+of reduction choices you might have.  The function will fill the array
+with the token numbers of those which are acceptable in the current
+state and return a count of the number of acceptable choices it found.
+You can call this function from any reduction procedure.  AnaGram also
+provides a macro to invoke this procedure:
+\index{REDUCTION{\us}CHOICES}\index{Macros}\agcode{REDUCTION{\us}CHOICES}.
+For example, to provide a diagnostic which lists the acceptable
+tokens, you might combine the use of the \agparam{reduction choices}
+switch with the
+\index{Token names}\index{Configuration switches}\agparam{token names}
+switch described above:
+\begin{indentingcode}{0.4in}
+int ok{\us}tokens[20], n{\us}ok{\us}tokens, i;
+n{\us}ok{\us}tokens = REDUCTION{\us}CHOICES(ok{\us}tokens);
+printf("Acceptable input comprises: {\bs}n");
+for (i = 0; i $<$ n{\us}ok{\us}tokens; i++) \bra
+printf("  \%s{\bs}n", TOKEN{\us}NAMES[ok{\us}tokens[i]]);
+\ket
+\end{indentingcode}
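+The same pattern can be tried out without a generated parser.  The
+following is a minimal sketch, not AnaGram output:
+\agcode{demo{\us}reduction{\us}choices} and
+\agcode{demo{\us}token{\us}names} are hypothetical stand-ins for the
+function and name table a real parser would supply, and the token
+numbers are invented for illustration.  It shows the name table being
+indexed by the token numbers returned in the array.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for what a generated parser would supply. */
static const char *demo_token_names[] = {
    "", "identifier", "typedef name", "number"
};

/* Fills the caller's array with acceptable token numbers and returns
   the count, mimicking the contract described in the text. */
static int demo_reduction_choices(int *choices) {
    choices[0] = 1;   /* identifier */
    choices[1] = 2;   /* typedef name */
    return 2;
}

/* Builds the diagnostic into buf; returns the number of choices listed. */
static int describe_choices(char *buf, size_t size) {
    int ok_tokens[20];
    int n = demo_reduction_choices(ok_tokens);
    int i;
    size_t used = 0;

    buf[0] = '\0';
    for (i = 0; i < n; i++) {
        /* index the name table by token number, not by loop counter */
        int written = snprintf(buf + used, size - used, "  %s\n",
                               demo_token_names[ok_tokens[i]]);
        if (written < 0 || (size_t)written >= size - used) break;
        used += (size_t)written;
    }
    return i;
}
```

+Writing into a caller-supplied buffer rather than straight to the
+terminal is only a convenience of the sketch; it makes the result easy
+to check.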
+A semantically determined production can even be a null production.
+You can use a semantically determined null production to interrogate
+the settings of parameters and control parsing accordingly:
+\begin{indentingcode}{0.4in}
+condition false, condition true
+-> = \bra if (condition) CHANGE{\us}REDUCTION(condition{\us}true); \ket
+\end{indentingcode}
+Numerous examples of semantically determined productions can be found
+in the \index{examples}\agfile{examples} directory of your AnaGram
+distribution.
+% XXX too much anaphora
+% XXX s/disk//
+\subsection{Defining Parser Control Blocks}
+\index{Parser control block}
+All references to the parser control block in your parser are made
+using the macro \index{PCB}\agcode{PCB}.  The only intrinsic
+requirement on PCB is that it evaluate to an \agterm{lvalue} (see
+Kernighan and Ritchie) that identifies a parser control block.  The
+actual access may be direct, indirect through a pointer, subscripted,
+or even more complex, although if the access is too complex, the
+performance of your parser could suffer.  Simple indirect or
+subscripted references are usually enough to enable you to build a
+system with multiple parallel parsing processes.  If you wish to
+define \agcode{PCB} in some way other than a simple, direct access to
+a compiled-in control block, you will have to declare the control
+block yourself.
+When AnaGram builds a parser, it checks the status of the
+\index{Declare pcb}\index{Configuration switches}\agparam{declare pcb}
+configuration switch.  If it is on, the default setting, AnaGram
+declares a parser control block for you.  AnaGram creates the name of
+the parser control block variable by appending \agcode{{\us}pcb} to the
+name of your parser.  Thus if the name of your parser is
+\agcode{ana}, the parser control block is \agcode{ana{\us}pcb}.
+In the header file AnaGram generates, a typedef statement defines the
+structure of the parser control block.  The typedef name is given by
+appending \agcode{{\us}pcb{\us}type} to the name of your parser.  Thus if
+the name of your parser is \agcode{ana}, the type of the parser
+control block is given by \agcode{ana{\us}pcb{\us}type}.  When AnaGram
+defines the parser control block for \agcode{ana}, it therefore does so
+by including the following two lines of code:
+\begin{indentingcode}{0.4in}
+ana{\us}pcb{\us}type ana{\us}pcb;
+\#define PCB ana{\us}pcb
+\end{indentingcode}
+If you wish to declare the parser control block yourself, you should
+turn off the \agparam{declare pcb} switch.  To turn \agparam{declare
+pcb} off, include the following line in a configuration segment in
+your syntax file:
+\begin{indentingcode}{0.4in}
+\~{}declare pcb
+\end{indentingcode}
+Suppose your program needs to serve up to sixteen ``clients'', each
+with its own input stream.  You might turn \agparam{declare pcb} off
+and declare the parser control block in the following manner:
+\begin{indentingcode}{0.4in}
+ana{\us}pcb{\us}type ana{\us}pcb[16];    /* declare control blocks */
+int client;
+\#define PCB ana{\us}pcb[client]  /* tell parser about it */
+\end{indentingcode}
+Perhaps you need to parse a number of input streams, but you don't
+know exactly how many until run time.  You might make the following
+declarations:
+\begin{indentingcode}{0.4in}
+ana{\us}pcb{\us}type *ana{\us}pcb;       /* pointer to control block */
+\#define PCB (*ana{\us}pcb)       /* tell parser about it */
+\end{indentingcode}
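+Note that when you define \agcode{PCB} as a dereferenced pointer, you
+should enclose the definition in parentheses.  Otherwise an expression
+such as \agcode{PCB.ssx} would expand to \agcode{*ana{\us}pcb.ssx}, which
+your compiler reads as \agcode{*(ana{\us}pcb.ssx)}.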
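+The two approaches can be combined: an array of control blocks, with a
+pointer selecting the active one.  The sketch below assumes a
+drastically reduced, hypothetical \agcode{ana{\us}pcb{\us}type} so that it
+compiles on its own; the \agcode{blocks} array and the
+\agcode{set{\us}line} helper are likewise invented for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for the generated control block type; a real
   ana_pcb_type has many more fields. */
typedef struct { int exit_flag; int line; } ana_pcb_type;

static ana_pcb_type blocks[16];      /* one control block per client */
static ana_pcb_type *ana_pcb;        /* selects the active block */

/* Parenthesized, so that PCB.line means (*ana_pcb).line. */
#define PCB (*ana_pcb)

/* Makes the given client's control block the active one and records a
   line number in it through the PCB macro. */
static void set_line(int client, int line) {
    ana_pcb = &blocks[client];
    PCB.line = line;
}
```

+Because every access goes through \agcode{PCB}, switching clients costs
+no more than one pointer assignment.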
+There are many situations where it is convenient for a parser to be
+reentrant.  A parser used for evaluating formulas in a spreadsheet
+program, for instance, needs to be able to call itself recursively if
+it is to use natural order recalculation.  A parser used to implement
+macro substitutions may need to be recursive to deal with embedded
+macros.
+Here is an example of an interface function which is designed for
+recursive calls to a parser, using the definitions above:
+% XXX can I please at least remove the nonstandard <alloc.h>?
+% And fix the misuse of assert, and check malloc for failure?
+% And use AG_SUCCESS_CODE instead of 1?
+\begin{indentingcode}{0.4in}
+\#include <assert.h>
+\#include <stdlib.h>
+\#define PCB (*ana{\us}pcb)
+ana{\us}pcb{\us}type *ana{\us}pcb;
+void do{\us}ana(void) \bra
+ana{\us}pcb{\us}type *save{\us}ana = ana{\us}pcb;
+ana{\us}pcb = malloc(sizeof(ana{\us}pcb{\us}type));
+if (ana{\us}pcb == NULL) abort();   /* out of memory */
+ana();
+assert(PCB.exit{\us}flag == AG{\us}SUCCESS{\us}CODE);
+free(ana{\us}pcb);
+ana{\us}pcb = save{\us}ana;
+\ket
+\end{indentingcode}
+Here is another way to accomplish the same end, this time using stack
+storage rather than heap storage:
+% XXX ditto
+\begin{indentingcode}{0.4in}
+\#include <assert.h>
+\#define PCB (*ana{\us}pcb)
+ana{\us}pcb{\us}type *ana{\us}pcb;
+void do{\us}ana(void) \bra
+ana{\us}pcb{\us}type *save{\us}ana = ana{\us}pcb;
+ana{\us}pcb{\us}type local{\us}pcb;
+ana{\us}pcb = \&local{\us}pcb;
+ana();\\
+assert(PCB.exit{\us}flag == AG{\us}SUCCESS{\us}CODE);
+ana{\us}pcb = save{\us}ana;
+\ket
+\end{indentingcode}
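+The save-and-restore discipline can be exercised without a generated
+parser.  In the sketch below, \agcode{ana} is a hypothetical stand-in
+that simply calls the interface function again until it is three levels
+deep, and the control block type is reduced to two invented fields.
+Unlike the interface function above, \agcode{do{\us}ana} here takes a
+depth argument, purely so that the nesting can be observed.

```c
#include <assert.h>

/* Hypothetical stand-in for the generated control block type. */
typedef struct { int exit_flag; int depth; } ana_pcb_type;

#define PCB (*ana_pcb)
static ana_pcb_type *ana_pcb = 0;
static int max_depth_seen = 0;

static void do_ana(int depth);

/* Stand-in for the generated parser: notes its nesting depth from its
   own control block, starts a nested parse, then reports success. */
static void ana(void) {
    int my_depth = PCB.depth;
    if (my_depth > max_depth_seen) max_depth_seen = my_depth;
    if (my_depth < 3) do_ana(my_depth + 1);   /* nested parse */
    assert(PCB.depth == my_depth);            /* our block was restored */
    PCB.exit_flag = 1;
}

/* Interface function: gives each invocation its own control block on
   the stack and restores the caller's block afterwards. */
static void do_ana(int depth) {
    ana_pcb_type *save_ana = ana_pcb;
    ana_pcb_type local_pcb = { 0, 0 };
    local_pcb.depth = depth;
    ana_pcb = &local_pcb;
    ana();
    assert(PCB.exit_flag == 1);
    ana_pcb = save_ana;
}
```

+Each level sees only its own control block while it runs, and the
+pointer is back to its original value once the outermost call returns.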
+% XXX and here we should discuss \agparam{reentrant parser}, too.
+\subsection{Multi-stage Parsing}
+\index{Parsing}\index{Multi-stage parsing}
+Multi-stage parsing consists of chaining together a number of parsers
+in series so that each parser provides input to the following one.
+Users of \agfile{lex} and \agfile{yacc} are accustomed to using
+two-level parsing, since the ``\index{Lexical scanner}lexical
+scanner'', or ``lexer'' they write in \agfile{lex} is really a very
+simple parser whose output becomes the input to the parser written in
+\agfile{yacc}.  AnaGram has been developed so that you may use as many
+levels as are appropriate to your problem, and so that, if you wish,
+you may write all of the parsers in AnaGram.
+Many problems that do not lend themselves conveniently to solution
+with a simple grammar can be neatly solved by using multi-stage
+parsing.  In many cases this is because multi-stage parsing can be
+used to parse constructs that are not context-free.  A first level
+parser can use semantic information to decide which tokens to pass on
+to the next level.  Thus, a first level parser for a C compiler can
+use semantic information to distinguish typedef names from variable
+names.
+% XXX I believe this is referring to QPL. Nowadays there's Python...
+As another example, a proprietary programming language used indents to
+control its block structure.  A first level parser looked only at
+lines and indents, passing the text through to the second level
+parser.  When it encountered changes in indentation level, it inserted
+block start and block end tokens as necessary.
+Using AnaGram it is extremely easy to set up multi-stage parses.
+Simply configure the second level parser as an event-driven parser.
+The first level parser can then hand over tokens or characters to it
+as it develops them.
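+The indentation example can be sketched in plain C.  This is an
+illustration of the division of labor only, not AnaGram-generated code:
+\agcode{second{\us}stage} is a hypothetical stand-in for an event-driven
+second level parser, the token codes \agcode{BLOCK{\us}START} and
+\agcode{BLOCK{\us}END} are invented, and blank lines and tabs are not
+treated specially.

```c
#include <assert.h>

enum { BLOCK_START = 256, BLOCK_END = 257 };   /* hypothetical token codes */

static int received[256];   /* tokens seen by the second stage */
static int n_received = 0;

/* Stand-in for the event-driven second level parser: it is called once
   per token.  Here it only records what it is given. */
static void second_stage(int token) {
    if (n_received < 256) received[n_received++] = token;
}

/* First level: looks only at lines and indents.  Text is passed
   through; a change of indentation becomes block tokens. */
static void first_stage(const char *text) {
    int levels[32];          /* stack of open indentation widths */
    int depth = 0;
    const char *p = text;

    levels[0] = 0;
    while (*p) {
        int indent = 0;
        while (*p == ' ') { indent++; p++; }
        if (indent > levels[depth] && depth < 31) {
            levels[++depth] = indent;
            second_stage(BLOCK_START);
        }
        while (depth > 0 && indent < levels[depth]) {
            depth--;
            second_stage(BLOCK_END);
        }
        while (*p && *p != '\n') second_stage((unsigned char)*p++);
        if (*p == '\n') second_stage(*p++);
    }
    while (depth-- > 0) second_stage(BLOCK_END);   /* close open blocks */
}
```

+The second level never sees leading spaces; it sees only text and
+explicit block tokens, so its grammar can remain context-free.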
+The C macro preprocessor example, found in the
+\index{examples}\agfile{examples} directory of your AnaGram
+distribution, illustrates the use of multi-stage parsing.
+\subsection{Context Tracking}
+\index{Context tracking}
+When you are writing a reduction procedure for a particular grammar
+rule, you often need to know the value that one of your program
+variables had at the time the first token in the rule was encountered.
+Examples of such variables are:
+\begin{itemize}
+\item Line or column number
+\item Index in an input file
+\item Index into an array
+\item Counters, such as the number of symbols defined
+\end{itemize}
+Such variables can be thought of as representing the ``context'' of
+the rule you are reducing.  Sometimes it is possible to incorporate
+the values of such variables into the values of reduction tokens, but
+this can become quite cumbersome.  AnaGram provides an optional
+feature known as ``context tracking'' to deal with this problem.
+Here's how it works:
+First, you identify the variables which you want to track.  Second,
+you write a typedef statement in the \index{C prologue}C prologue of
+your parser which defines a data structure with fields to accommodate
+values for all of these variables.  Third, you tell AnaGram what the
+name of the type of your data structure is, using the
+\index{Context type}\index{Configuration parameters}\agparam{context type}
+configuration parameter.  This causes AnaGram to add a field called
+\index{PCB}\index{input{\us}context}\agcode{input{\us}context} and a stack,
+the \index{Context stack}\index{Stack}\agterm{context stack}, called
+\index{PCB}\index{cs}\agcode{cs}, both of the type you have specified,
+to your parser control block.  Fourth, you write code to gather the
+context information for each input character.
+There are several ways to provide the initial context information.
+You may write a
+\index{GET{\us}CONTEXT}\index{Macros}\agcode{GET{\us}CONTEXT} macro which
+sets the context stack variables directly.  Using the
+\index{CONTEXT}\index{Macros}\agcode{CONTEXT} macro defined below, and
+assuming your context type has line, column and pointer fields, you
+could define \agcode{GET{\us}CONTEXT} as follows:
+\begin{indentingcode}{0.4in}
+\#define GET{\us}CONTEXT CONTEXT.pointer = PCB.pointer,{\bs}
+CONTEXT.line = PCB.line,{\bs}
+CONTEXT.column = PCB.column
+\end{indentingcode}
+If you are using \agparam{pointer input}, you must write a
+\agcode{GET{\us}CONTEXT} macro to save context information.  If you use a
+\index{GET{\us}INPUT}\index{Macros}\agcode{GET{\us}INPUT} macro or have an
+event-driven parser, you may either store values directly into
+\index{input{\us}context}\index{PCB}\agcode{PCB.input{\us}context} when you
+develop the input token, or you may write a \agcode{GET{\us}CONTEXT}
+macro.  The macro will provide a slight increment in performance.
+% XXX say why it's faster (I assume because it won't look up context
+% for inputs that don't need it?)
+AnaGram provides six macros to enable you to read values in a
+convenient manner from the context stack,
+\index{cs}\index{PCB}\agcode{PCB.cs}.  Three of these macros are
+designed to be used from your parser itself, and three are available
+to use from other modules.  The three designed for use in your parser
+are:
+\begin{itemize}
+\item \agcode{CONTEXT}
+\item \agcode{RULE{\us}CONTEXT}
+\item \agcode{ERROR{\us}CONTEXT}
+\end{itemize}
+These macros are defined at the beginning of your parser file, so they
+may be used anywhere within your parser.
+\index{CONTEXT}\index{Macros}\agcode{CONTEXT}
+can be used to read or write the current top of the context stack as
+indexed by \index{PCB}\agcode{PCB.ssx}.  When your parser is executing
+a reduction procedure for a particular grammar rule, \agcode{CONTEXT}
+will evaluate to the value of the input context as it was just before
+the very first token in the rule.  The definition of \agcode{CONTEXT}
+is:
+\begin{indentingcode}{0.4in}
+\#define CONTEXT (PCB.cs[PCB.ssx])
+\end{indentingcode}
+\index{RULE{\us}CONTEXT}\index{Macros}\agcode{RULE{\us}CONTEXT} can be used
+within a reduction procedure to get the context for any element within
+the rule being reduced.  For example, \agcode{RULE{\us}CONTEXT[0]} is the
+context of the first element in the rule, \agcode{RULE{\us}CONTEXT[1]} is
+the context of the second element in the rule, and so on.
+\agcode{RULE{\us}CONTEXT[0]} is exactly the same as \agcode{CONTEXT}.
+% XXX There should be a way to address the context of tokens in a
+% rule by the symbolic names we've bound to them.
+The definition of \agcode{RULE{\us}CONTEXT} is:
+\begin{indentingcode}{0.4in}
+\#define RULE{\us}CONTEXT (\&(PCB.cs[PCB.ssx]))
+\end{indentingcode}
+As an example, let us suppose that we are writing a parser to read a
+parameter file for a program.  Let us imagine the following statements
+make up a part of our syntax file:
+\begin{indentingcode}{0.4in}
+\bra
+typedef struct \bra int line, column; \ket location;
+\#define GET{\us}INPUT {\bs}
+PCB.input{\us}code = fgetc(input{\us}file); {\bs}
+PCB.input{\us}context.line = PCB.line; {\bs}
+PCB.input{\us}context.column = PCB.column;
+\ket
+{}[ context type = location ]\\
+parameter assignment
+-> parameter name, '=', number
+\end{indentingcode}
+Let us suppose that for each parameter we have stored a range of
+admissible values.  We have to diagnose an attempt to use an incorrect
+value.  We could write our diagnostic message as follows:
+\begin{indentingcode}{0.4in}
+fprintf(stderr, "Bad value at line \%d, column \%d in "
+"parameter assignment at line \%d, column \%d",
+RULE{\us}CONTEXT[2].line,
+RULE{\us}CONTEXT[2].column,
+CONTEXT.line,
+CONTEXT.column);
+\end{indentingcode}
+This diagnostic message would give our user the exact location both of
+the bad value and of the beginning of the statement that contained the
+bad value.
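+To see why \agcode{RULE{\us}CONTEXT[2]} and \agcode{CONTEXT} pick out
+these two locations, it may help to model the context stack directly.
+The sketch below is a simplified, hypothetical model: the
+\agcode{demo{\us}pcb} structure holds only the three fields discussed
+here, and \agcode{shift} and \agcode{begin{\us}reduction} are invented
+helpers that imitate what a parser does to the stack index.  The line
+and column values are made up.

```c
#include <assert.h>

typedef struct { int line, column; } location;

/* Simplified, hypothetical model of the relevant control block fields;
   the generated structure has many more. */
static struct {
    location input_context;   /* context of the token being read */
    location cs[32];          /* context stack */
    int ssx;                  /* stack index */
} demo_pcb;

#define PCB          demo_pcb
#define CONTEXT      (PCB.cs[PCB.ssx])
#define RULE_CONTEXT (&(PCB.cs[PCB.ssx]))

/* Model of a shift: remember where the token began. */
static void shift(int line, int column) {
    PCB.input_context.line = line;
    PCB.input_context.column = column;
    PCB.cs[PCB.ssx++] = PCB.input_context;
}

/* Model of the start of a reduction: point the stack index back at the
   first element of a rule with the given number of elements. */
static void begin_reduction(int rule_length) {
    PCB.ssx -= rule_length;
}
```

+After three shifts and a reduction of length three, the stack index
+rests on the entry saved for the parameter name, and the entry two
+places above it is the one saved for the number.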
+\index{ERROR{\us}CONTEXT}\index{Macros}\agcode{ERROR{\us}CONTEXT} can be
+used within a
+\index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR} macro to
+find the context of an error if you have turned on the
+\index{Error frame}\index{Configuration switches}\agparam{error frame}
+and
+\index{Diagnose errors}\index{Configuration switches}
+\agparam{diagnose errors}
+switches.  AnaGram's own syntax file parser, for example, tracks
+context using a structure consisting of line and column numbers.  When
+it encounters an error such as an end of file inside a comment, it
+uses the \agcode{ERROR{\us}CONTEXT} macro to determine the line and
+column number at which the comment began.
+% XXX that sounds like something AG does with your grammar, not
+% what AG does reading its own input, which is what it is. rephrase...
+The definition of \agcode{ERROR{\us}CONTEXT} is:
+\begin{indentingcode}{0.4in}
+\#define ERROR{\us}CONTEXT (PCB.cs[PCB.error{\us}frame{\us}ssx])
+\end{indentingcode}
+Three similar macros are also available for more general use:
+\begin{itemize}
+\item \index{PCONTEXT}\index{Macros}\agcode{PCONTEXT(pcb)}
+\item \index{PRULE{\us}CONTEXT}\index{Macros}\agcode{PRULE{\us}CONTEXT(pcb)}
+\item \index{PERROR{\us}CONTEXT}\index{Macros}\agcode{PERROR{\us}CONTEXT(pcb)}
+\end{itemize}
+% XXX repeating ``modules other than'' is bad
+These macros are identical in function to the corresponding macros in
+the first group.  The only difference is that they take the name of a
+parser control block, \agcode{pcb}, as an argument.  AnaGram includes
+their definitions in the parser header file, so they can be used in
+modules other than the parser itself.  Since these macros are not
+specific to any one parser, the definitions are conditional so that
+they will only be defined once in a given module, even if you include
+header files corresponding to several parsers.
+The definitions of these macros are as follows:
+\begin{indentingcode}{0.4in}
+\#define PCONTEXT(pcb) (pcb.cs[pcb.ssx])
+\#define PRULE{\us}CONTEXT(pcb) (\&(pcb.cs[pcb.ssx]))
+\#define PERROR{\us}CONTEXT(pcb) (pcb.cs[pcb.error{\us}frame{\us}ssx])
+\end{indentingcode}
+Note that since the context macros only make sense when called from a
+reduction procedure or an error procedure, there are not many
+occasions to use these macros.  The most common situation would be
+when you have compiled the bulk of the code for your reduction
+procedures in a separate module.
+Remember that \agcode{PRULE{\us}CONTEXT}, because it identifies an array
+rather than a value, requires a subscript.  For an example, let us
+rewrite the diagnostic message given above for
+\agcode{RULE{\us}CONTEXT} using \agcode{PRULE{\us}CONTEXT}, assuming
+that the name of our parser control block is \agcode{ana{\us}pcb}:
+\begin{indentingcode}{0.4in}
+fprintf(stderr, "Bad value at line \%d, column \%d in "
+"parameter assignment at line \%d, column \%d",
+PRULE{\us}CONTEXT(ana{\us}pcb)[2].line,
+PRULE{\us}CONTEXT(ana{\us}pcb)[2].column,
+PCONTEXT(ana{\us}pcb).line,
+PCONTEXT(ana{\us}pcb).column);
+\end{indentingcode}
+\subsection{Coverage Analysis}
+\index{Coverage analysis}
+AnaGram has simple facilities for helping you determine the adequacy
+of your test suites.  The
+\index{Rule coverage}\index{Configuration switches}
+\agparam{rule coverage} configuration switch
+controls these facilities.  When you set \agparam{rule coverage},
+AnaGram includes code in your parser to count the number of times the
+parser identifies each rule in your grammar.  AnaGram also provides
+procedures you can use to write these counts to a file and accumulate
+them over multiple executions of your parser.  Finally, it provides a
+window where you may inspect the counts to see the extent to which
+your tests have covered the options in your grammar.
+To maintain the counts, AnaGram declares, at the beginning of your
+parser, an integer array, whose name is created by appending
+\agcode{{\us}nrc} to the name of your parser.  The array contains one
+counter for each rule you have defined in your grammar.  There are no
+entries for the auxiliary rules that AnaGram creates to deal with set
+overlaps or disregard statements.  In order to identify positively all
+the rules that the parser reduces, AnaGram turns off certain
+optimization features in your parser.  Therefore, a parser that has
+the \agparam{rule coverage} switch enabled will run slightly slower
+than one with the switch off.
+AnaGram also provides procedures to write the counts to a file and to
+initialize the counts from a file.  The procedures are named by
+appending \agcode{{\us}write{\us}counts} and \agcode{{\us}read{\us}counts}
+respectively to the name of your parser.  Thus, if your parser is
+called \agcode{ana}, the procedures are called
+\agcode{ana{\us}write{\us}counts} and \agcode{ana{\us}read{\us}counts}.
+Neither takes any arguments nor returns a value.  To accumulate counts
+correctly, you should include calls to the
+\index{read{\us}counts}\agcode{read{\us}counts} and
+\index{write{\us}counts}\agcode{write{\us}counts} procedures in your
+program.  A convenient way to do this is to include statements such as
+the following in your main program:
+% XXX perhaps this means ``atexit''
+\begin{indentingcode}{0.4in}
+ana{\us}read{\us}counts();            /* before calling parser */
+atexit(ana{\us}write{\us}counts);
+\end{indentingcode}
+For your convenience, AnaGram defines two macros,
+\index{READ{\us}COUNTS}\index{Macros}\agcode{READ{\us}COUNTS} and
+\index{WRITE{\us}COUNTS}\index{Macros}\agcode{WRITE{\us}COUNTS}, in your
+parser.  They call the \agcode{read{\us}counts} and
+\agcode{write{\us}counts} procedures respectively when \agparam{rule
+coverage} is set.  Otherwise they are null.  Thus you may code them
+into your main program and it will work whether or not the
+\agparam{rule coverage} switch is set.  For example,
+\begin{indentingcode}{0.4in}
+READ{\us}COUNTS;         /* read counts if coverage enabled */
+my{\us}parser();         /* call parser */
+WRITE{\us}COUNTS;        /* write updated counts */
+\end{indentingcode}
+The \agcode{write{\us}counts} procedure writes an identifier code and the
+counts to a count file.  The name of the count file is given by the
+\index{Coverage file name}\index{Configuration parameters}
+\agparam{coverage file name} parameter, which defaults to the same name as your
+syntax file but with the extension
+\index{File extension}\index{nrc}\agfile{.nrc}.  The identifier code
+changes each time you modify your syntax file.  The
+\agcode{read{\us}counts} procedure attempts to read the count file.  If
+it cannot find it, or the identifier code is out of date, it simply
+initializes the counter array to zeroes.  Otherwise, it initializes
+the counter array to the values found in the file.
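+The accumulate-across-runs behavior just described can be sketched as
+follows.  Everything here is an assumption made for illustration: the
+file layout (an identifier code followed by raw counters), the
+\agcode{GRAMMAR{\us}ID} value, and the \agcode{demo{\us}} names.  The
+procedures AnaGram generates take no arguments and manage the file
+themselves; these take a stream so that the sketch is easy to test.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define N_RULES 4
#define GRAMMAR_ID 0x5A17L          /* hypothetical identifier code */

static unsigned demo_nrc[N_RULES];  /* one counter per grammar rule */

/* Loads saved counts; starts from zero if there is no file (f is NULL),
   it is too short, or it was written for a different grammar. */
static void demo_read_counts(FILE *f) {
    long id = 0;
    unsigned saved[N_RULES];

    memset(demo_nrc, 0, sizeof demo_nrc);
    if (f == NULL) return;
    if (fread(&id, sizeof id, 1, f) == 1 && id == GRAMMAR_ID &&
        fread(saved, sizeof saved[0], N_RULES, f) == N_RULES)
        memcpy(demo_nrc, saved, sizeof demo_nrc);
}

/* Writes the identifier code followed by the counters; returns nonzero
   on success. */
static int demo_write_counts(FILE *f) {
    long id = GRAMMAR_ID;

    return fwrite(&id, sizeof id, 1, f) == 1 &&
           fwrite(demo_nrc, sizeof demo_nrc[0], N_RULES, f) == N_RULES;
}
```

+Checking the identifier before trusting the counters is what keeps
+counts from one version of a grammar from being attributed to the
+renumbered rules of another.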
+When you run AnaGram and analyze your syntax file, if
+\agparam{rule coverage} is set, AnaGram will enable the \agmenu{Rule
+Coverage} option on the \agmenu{Browse} menu.  If you select
+\agmenu{Rule Coverage}, AnaGram will prepare a \agwindow{Rule
+Coverage} window from the rule count file you select.  AnaGram will
+warn you if the file you selected is older than the syntax file, since
+under those conditions, the coverage file might be invalid.
+The \index{Rule Coverage}\index{Window}\agwindow{Rule Coverage} window
+shows the count for each rule, the rule number and the text of the
+rule.  It is also synchronized with the syntax file so that you can see the
+rule in context.  AnaGram also modifies the display of the
+\index{Reduction Procedures}\index{Window}\agwindow{Reduction
+Procedures} window so that each procedure descriptor is preceded by
+the number of times it has been called.  You can use this display to
+verify that all your reduction procedures have been tried.
+% XXX having this paragraph here seems confusing
+The \index{Trace Coverage}\index{Window}\agwindow{Trace Coverage}
+window, created when you use the \agwindow{File Trace} or
+\agwindow{Grammar Trace} option, provides information similar to that
+provided by \agwindow{Rule Coverage}.  The differences are these:
+Optimizations are not turned off for the \agwindow{Trace Coverage}, so
+that some rules of length zero or one will not be properly counted.
+Also, the \agwindow{Trace Coverage} does not tell you about the
+reduction procedures you have tested.
+\agwindow{File Trace} can become quite tedious to use if you have very
+many semantically determined productions, so in these cases the
+\agparam{rule coverage} approach can give you the information you need
+more quickly.
+\subsection{Using Precedence Operators}
+The conventional syntax for arithmetic expressions used in most
+programming languages can be parsed simply by reference to
+\index{Operator precedence}\index{Precedence operators}
+\agterm{operator precedence}.  Operator precedence refers to
+the rules we use to determine the order in which arithmetic operations
+should be carried out.  In normal usage, this means that
+multiplication and division take precedence over addition and
+subtraction, which in turn take precedence over comparison operations.
+One can formalize this usage by assigning a numeric \index{Precedence
+level}\agterm{precedence level} to each operator, so that the
+operations are carried out starting with those of highest precedence
+and continuing in order of declining precedence.  When operators have
+the same precedence level, such as addition and subtraction operators,
+one can decide the order of operation to be left to right or right to
+left.  Operators of equal precedence which are to be evaluated left to
+right are called \agterm{left associative}.  Those which should be
+evaluated right to left are called \agterm{right associative}.  If the
+nature of the operators is such that the question should never arise,
+they are called \agterm{non-associative}.
+AnaGram provides three declarations,
+\index{Precedence declarations}\index{Left}\index{Right}\index{Nonassoc}
+\agparam{left}, \agparam{right}, and \agparam{nonassoc}, which you can
+use to associate precedence levels and associativity with tokens in
+your grammar.  The syntax of these statements is given in Chapter 8.
+When AnaGram encounters a shift-reduce \index{Conflicts}conflict in
+your grammar, it looks to see if the conflict can be resolved by using
+precedence and associativity rules.  If so, it applies the rules to
+the conflict and records the resolution in the \index{Resolved
+Conflicts}\index{Window}\agwindow{Resolved Conflicts} table.
+There are two occasions where you should consider using precedence
+declarations in your grammar: Where rewriting the grammar to get rid
+of a conflict would obscure and complicate the grammar, and where you
+wish to try to get a more compact, slightly faster parser by using
+precedence rules for parsing arithmetic expressions.
+Here is an example of using precedence declarations to parse simple
+arithmetic expressions:
+\begin{indentingcode}{0.4in}
+unary minus = '-'
+{}[
+left \bra '+', '-' \ket
+left \bra '*', '/' \ket
+right \bra unary minus \ket
+]
+exp
+-> number
+-> unary minus, exp
+-> exp, '+', exp
+-> exp, '-', exp
+-> exp, '*', exp
+-> exp, '/', exp
+\end{indentingcode}
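+To make the effect of these declarations concrete, here is a small
+hand-written evaluator that applies the same precedence levels and
+associativity.  It uses the precedence climbing technique, which is
+\emph{not} how AnaGram resolves conflicts in its parse tables; it is
+offered only as a sketch of what the declarations mean.  All names are
+invented, input is restricted to digits and operators with no spaces,
+and division by zero simply ends evaluation with a result of zero.

```c
#include <assert.h>
#include <ctype.h>

static const char *cursor;   /* next unread character */

/* Precedence levels mirroring the declarations above:
   '+' '-' lowest, '*' '/' next, unary minus highest. */
static int level(char op) {
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;   /* not a binary operator */
    }
}

static long parse_operand(void) {
    long value = 0;
    if (*cursor == '-') {            /* unary minus binds tightest */
        cursor++;
        return -parse_operand();     /* right associative: --x is -(-x) */
    }
    while (isdigit((unsigned char)*cursor))
        value = value * 10 + (*cursor++ - '0');
    return value;
}

/* Parses operators whose level is at least min_level.  Recursing with
   level + 1 for the right operand makes each operator left associative. */
static long parse_exp(int min_level) {
    long left = parse_operand();
    for (;;) {
        char op = *cursor;
        int lv = level(op);
        long right;
        if (lv == 0 || lv < min_level) return left;
        cursor++;
        right = parse_exp(lv + 1);
        switch (op) {
        case '+': left += right; break;
        case '-': left -= right; break;
        case '*': left *= right; break;
        case '/': if (right == 0) return 0; left /= right; break;
        }
    }
}

static long evaluate(const char *text) {
    cursor = text;
    return parse_exp(1);
}
```

+Left associativity is what makes \agcode{8-3-2} evaluate as
+\agcode{(8-3)-2}, and the higher level assigned to multiplication is
+what makes \agcode{2+3*4} evaluate as \agcode{2+(3*4)}.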
+A complete working calculator grammar using this syntax,
+\agfile{ffcalcx}, can be found in the
+\index{examples}\agfile{examples/ffcalc} directory of your
+AnaGram distribution.
+% XXX s/disk//
+\subsection{Parser Performance}
+The parsers AnaGram generates have been engineered to provide maximum
+performance subject to constraints of reliability and robustness.
+There are a number of steps you may take, however, to optimize
+the performance of your parser.
+\paragraph{Standard Stack Frame.}  If your compiler has a switch that
+allows you to turn \emph{off} the standard stack frame when you
+compile your parser, do so.  Your parser uses a large number of very
+small functions which run fastest when your compiler does not use the
+standard stack frame.
+\paragraph{Error Diagnostic Features.}  If your parser does not need
+to diagnose errors, turn off the
+\index{Diagnose errors}\index{Configuration switches}
+\agparam{diagnose errors} switch.
+Turn off the
+\index{Lines and columns}\index{Configuration switches}
+\agparam{lines and columns} switch if you don't need this information.
+If your parser doesn't need a diagnostic, and halts on syntax error,
+turn off the
+\index{Backtrack}\index{Configuration switches}\agparam{backtrack} switch.
+\paragraph{Anti-optimization Switches.}  Certain switches de-optimize
+your parser for various reasons.  These switches,
+\index{Traditional engine}\index{Configuration switches}
+\agparam{traditional engine} and
+\index{Rule coverage}\index{Configuration switches}
+\agparam{rule coverage},
+should be turned off once you no longer need their effects.
+\paragraph{Other Switches.}  For maximum performance you should use
+\index{Pointer input}\index{Configuration switches}\agparam{pointer
+input}.  If you can guarantee that your input will not contain
+out-of-range characters or tokens, you can turn off
+\index{Test range}\index{Configuration switches}\index{Range}
+\agparam{test range}.
+% XXX s/out-of-range input/out-of-range characters or tokens/
