\chapter{Programming With AnaGram}

Although AnaGram has many options and features which enable you to
build a parser that meets your needs precisely, it has well-defined
defaults so that you do not generally need to learn about an option
until you need the facility it provides.  The purpose of this chapter
is to show you how to use the options and features effectively.

The options and features of AnaGram can be divided roughly into three
groups: those that control the general aspects of your parser, those
that control input to the parser and those that control error
handling.  After dealing with these three groups of options and
features, this chapter concludes with a discussion of various advanced
techniques.

Many aspects of your parser are controlled by setting configuration
parameters, either in a configuration file or in your syntax file.
This chapter presumes you are familiar with setting configuration
parameters.  The names of configuration parameters, as they occur in
the text, are printed in \agparam{bold face type}.  Appendix A
describes the use of configuration parameters and provides a detailed
discussion of each configuration parameter.


\section{General Aspects}

\subsection{Program Development}

The first step in writing a program is to write a grammar in AnaGram
notation which describes the input the program expects.  The file
containing the grammar, called the syntax file, conventionally has the
extension \agfile{.syn}.  You could also make up a few sample input
files at this time, but it is not necessary to write reduction
procedures at this stage.

Run AnaGram and use the \index{Analyze Grammar}Analyze Grammar command
to create parse tables.  If there are syntax errors in the grammar at
this point, you will have to correct them before proceeding, but you
do not necessarily have to eliminate conflicts, if there are any, at
this time.  There are, however, many aids available to help you with
conflicts.  These aids are described in Chapters 5 through 7, and
somewhat more briefly in the Online Help topics.

Once syntax errors are corrected, you can try out your grammar on the
sample input files using the File Trace facility.  With File Trace,
you can see interactively just how your grammar operates on your test
files.  You can also use Grammar Trace to answer ``what if'' questions
concerning input to the grammar.  The Grammar Trace does not use a
test file, but rather allows you to make input choices interactively.

At any time, you can write reduction procedures to process your input
data as its components are identified in the input stream.  Each
procedure is associated with a grammar rule.  The reduction procedures
will be incorporated into your parser when you create it with the
\index{Build Parser}Build Parser command.

By default, unless you specify an input procedure, parser input will
be read from \agcode{stdin}, using the default \agcode{GET{\us}INPUT}
macro.  You will probably wish to redefine \agcode{GET{\us}INPUT}, or
configure your parser to use \agparam{pointer input} or \agparam{event
driven} input.

\subsection{The Default Parser}
\index{Parser}

If you apply the Build Parser command to a syntax file which contains
only a grammar, with no reduction procedures and no embedded C code,
AnaGram will still produce a complete C command line program which you
can compile and run.  \index{Input procedures}This parser will parse
character input from \agcode{stdin}.  If the input does not satisfy
the rules of your grammar, the parser will issue a syntax error
diagnostic to \agcode{stderr} identifying the exact line and column
numbers of the error.  If the parser should overflow its stack, it
will abort with an error message to \agcode{stderr}.  If the parse is
successful, that is, if the parser succeeds in identifying the grammar
token without encountering an error, it will simply return to the
command line.

You can extend such a simple parser, often quite effectively, by
adding only reduction procedures.  If the reduction procedures write
output to \agcode{stdout}, you can produce a conventional ``filter''
program without having to pay any attention to input handling, error
handling, or any of the other options AnaGram provides.
%CALC, in the EXAMPLES directory, is an example of such a program.

\subsection{The Content of the Parser and Header Files}

AnaGram creates two \index{Output files}\index{File}output files: a
parser file and a header file.  \index{Parser file}\index{File}The
parser file contains the C code you need to compile and link before
you can run your parser.  It begins with the \index{C
prologue}\index{Prologue}C prologue, if any, from your syntax file.
The C prologue is an optional block of \index{Embedded C}embedded C or
C++ which precedes everything else in your syntax file.  Although it
can contain anything you wish, normally it is used to place
identification information, \index{Copyright notice}copyright notices,
etc., at the beginning of your parser file.  If your parser uses token
types that require definition, the appropriate \agcode{\#include}
statements and definitions should be placed in the C prologue.  See
``Defining Token Types'', below.

Following the C prologue, AnaGram places a number of definitions of
variables and macros that you might need to refer to in your embedded
C, and in your reduction procedures.  Not the least of these
definitions is the parser control block, described below.  Following
these definitions, AnaGram inserts all your embedded C, in the order
in which it occurred in your syntax file.  Following the embedded C
come all your reduction procedures.  Finally, AnaGram adds the tables
which summarize your grammar and a parsing engine customized to your
requirements.

The \index{Header file}\index{File}header file contains definitions
needed by your parser.  These include definitions of the \index{Parser
value stack}\index{Value stack}\index{Stack}value stack type, the
input token type, the \index{Parser control block}parser control block
type, and token name enumeration constants.  The definitions are
placed in a header file so that you can make them available to other
modules if necessary.

\subsection{Naming Output Files}
\index{Output files}\index{File}

Unless you specify otherwise, AnaGram names the parser and header
files following conventional programming practice.  Both \index{File
name}\index{File name}files have the same name as your syntax file,
with extensions \agfile{.c} and \agfile{.h} respectively.  These
names, however, are controlled by the configuration parameters
\index{Configuration parameters}\index{Name}
\index{Parser file name}\agparam{parser file name} and
\index{Header file name}\agparam{header file name}
respectively, so you can override AnaGram's defaults if you wish.  If
you normally use C++ rather than C, for example, you might want to
include the following statement in your configuration file:

\begin{indentingcode}{0.4in}
parser file name = "\#.cpp"
\end{indentingcode}

When AnaGram names the parser file it substitutes the name of your
syntax file for the ``\#'' character in the file name template.

\subsection{Compiling Your Parser}
\index{Parser}

Although AnaGram was designed primarily with ANSI C in mind, a good
deal of care has been taken to ensure that its output is consistent
with older C compilers and with newer C++ compilers.  If your compiler
does not support ANSI function prototypes, you should set the
\index{Old style}\index{Configuration switches}\agparam{old style}
switch in your configuration file.  If you are intending to compile
your parser using a 16-bit compiler, you might want to turn on the
\index{Near functions}\index{Configuration switches}\agparam{near functions}
switch in your configuration file.  If you are building a parser for
use in an embedded system, you might want to make sure the
\index{Const data}\index{Configuration switches}\agparam{const data}
configuration switch is set so that all the tables AnaGram generates
will be declared \agcode{const}.
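
For example, a configuration file for an embedded target built with an
older compiler might contain lines such as these; use whichever
switches actually apply to your compiler and target:

\begin{indentingcode}{0.4in}
old style
const data
\end{indentingcode}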

\subsection{Naming Your Parser}
\index{Parser}

In the default case, AnaGram creates a main program for you.
Generally, however, you will probably want a parser function which you
can call from your own main program.  You won't want AnaGram to define
\agcode{main} for you.  You can stop AnaGram from defining
\agcode{main} in any of several ways: Include some embedded C in your
syntax file, turn off 
\index{Main program}the \index{Configuration switches}\agparam{main program}
configuration switch, or turn on either the \agparam{event driven} or
\agparam{pointer input} switches.  Since you almost always will have
some embedded C in your syntax file, you will seldom have to use the
\agparam{main program} switch.

Normally, AnaGram simply uses the name of your syntax file to create
the name of your parser.  Thus if your syntax file is called
\agfile{ana.syn} your parser will have the name \agcode{ana}.  AnaGram
does not check the parser name for compliance with the rules of C.  If
you use strange characters in your file name, you will get strange
characters in the name of your parser, and you will get unpleasant
remarks from your C compiler when you try to compile your parser.
Thus, for example, if you were to name your syntax file
\agfile{!@\#.syn}, AnaGram would call your parser \agcode{!@\#}.  Your
compiler would doubtless choke.

\index{Parser}
If you wish AnaGram to give your parser a name other than the file
name, you may set the
\index{Parser name}\index{Name}\index{Configuration parameters}
\agparam{parser name}
configuration parameter.  Thus, to make sure your parser is called
\agcode{periwinkle} you would include the following line in a
configuration section in your syntax file:

% Note: this is not actually required to be in double quotes.
% It'll also accept anything that's syntactically acceptable to it
% as a C data type, which also lets you give it things like 
% ``periwinkle *'' that result in uncompilable code.

\begin{indentingcode}{0.4in}
parser name = "periwinkle"
\end{indentingcode}

Besides the parser itself, AnaGram generates a number of other
functions, variables and type definitions when it creates your parser.
All these entities are named using the parser name as the base.  The
templates and their usages are as follows:

\begin{indenting}{0.4in}
\begin{tabular}{ll}

\index{Parser}\index{Initializer}\index{Name}
\agcode{init{\us}\$}&initializer for parser\\

\index{Grammar token}\index{Value}
\agcode{\${\us}value}&returns value of grammar token\\

\index{Parser value stack}\index{Value stack}\index{Stack}
\agcode{\${\us}vs{\us}type}&value stack type\\

\agcode{\${\us}it{\us}type}&input token union\\
\agcode{\${\us}token{\us}type}&token name enumeration typedef\\
\agcode{\${\us}\%{\us}token}&token name enumeration constants\\
\agcode{\${\us}pcb{\us}type}&typedef of parser control block\\

\index{Parser control block} 
\agcode{\${\us}pcb}&parser control block\\

\index{Rule Count} 
\agcode{\${\us}nrc}&rule count table\\

\agcode{\${\us}nrpc}&reduction procedure count table\\
\\
\end{tabular}
\end{indenting}

When AnaGram defines these entities it substitutes the parser name for
the dollar sign.  In the token name enumeration constants it
substitutes the token name for the \index{{\us}prc}``\%'' character.
Embedded space characters are replaced with underscore characters.

\subsection{The Parser Control Block}
\index{Parser control block}

The complete status of a parse is kept in a structure called a
\agterm{parser control block}.  As a default, AnaGram defines a parser
control block for you, and provides a macro, \index{PCB}\agcode{PCB},
which enables you to access it simply.  The name AnaGram assigns to
the parser control block is
% XXX
%\agcode{\${\us}pcb}, where as above ``\$'' is replaced with the name of
%your parser.
\agcode{\textit{$<$parser name$>$}{\us}pcb}.
If you need to refer to the parser control block from some module
other than the parser module, use an \agcode{\#include} statement to
include the header file for your parser and refer to the parser
control block by its name as above.  The structure of the parser
control block is described in Appendix E.  In this chapter, particular
fields will be discussed as necessary.

Since the parser control block contains the complete status of a
parse, you may interrupt a parse and continue it later by saving and
restoring the control block.  If you have multiple input streams, all
controlled by the same grammar, you may have a separate control block
for each stream.  If you wish to call your parser recursively, you may
define a fresh control block for each level of recursion.  To make
best use of these capabilities, you will need to declare the parser
control block yourself.  This is discussed below under ``Advanced
Techniques''.

\subsection{Calling Your Parser}
% XXX should have an example of actually calling the thing.
% XXX should also have ``terminating your parser'' or something like
% that.

The parser function AnaGram defines is a simple function which takes
no arguments and returns no values.  All communication with the parser
takes place via the parser control block.  When your parser returns,
\index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} contains an exit
code describing the outcome of the parse.  Symbols for the
exit codes are defined in the header file AnaGram generates.
\index{Exit codes}\index{Error codes}These symbols, their values,
and their meanings are:

\index{AG{\us}RUNNING{\us}CODE}
\index{AG{\us}SUCCESS{\us}CODE}
\index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}
\index{AG{\us}REDUCTION{\us}ERROR{\us}CODE}
\index{AG{\us}STACK{\us}ERROR{\us}CODE}
\index{AG{\us}SEMANTIC{\us}ERROR{\us}CODE}
\begin{indenting}{0.4in}
\begin{tabular}{lll}
\agcode{AG{\us}RUNNING{\us}CODE}&0&Parse is not yet complete\\
\agcode{AG{\us}SUCCESS{\us}CODE}&1&Parse terminated successfully\\
\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE}&2&Syntax error was encountered\\
\agcode{AG{\us}REDUCTION{\us}ERROR{\us}CODE}&3&Bad reduction token encountered\\
\agcode{AG{\us}STACK{\us}ERROR{\us}CODE}&4&Parser stack overflowed\\
\agcode{AG{\us}SEMANTIC{\us}ERROR{\us}CODE}&5&Semantic error\\
\\
\end{tabular}
\end{indenting}

Only an event driven parser will return the value
\agcode{AG{\us}RUNNING{\us}CODE}, since any other parser continues executing
until it terminates successfully or encounters an unrecoverable error.
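
For example, a minimal sketch of calling a parser named \agcode{ana}
and acting on the exit code might look like this:

\begin{indentingcode}{0.4in}
ana();
switch (PCB.exit{\us}flag) \bra
case AG{\us}SUCCESS{\us}CODE:
  /* use the results of the parse */
  break;
case AG{\us}SYNTAX{\us}ERROR{\us}CODE:
  /* the input did not conform to the grammar */
  break;
default:
  /* stack overflow, reduction token error, etc. */
  break;
\ket
\end{indentingcode}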

Syntax errors, reduction token errors, and stack errors are discussed
below under ``Error Handling''.

% XXX: this bit belongs somewhere else
\agcode{AG{\us}SEMANTIC{\us}ERROR{\us}CODE} is a special case.  It is available
for you to use in your reduction procedures to terminate a parse for
semantic reasons; AnaGram itself never sets it.
If, in a reduction procedure, you determine that parsing should not
continue, you need only include the statement:

\begin{indentingcode}{0.4in}
PCB.exit{\us}flag = AG{\us}SEMANTIC{\us}ERROR{\us}CODE;
\end{indentingcode}

When your reduction procedure returns, the parse will then terminate
and the parser will return control to the calling program.

\subsection{Parser Return Value}
\index{Value}

If, in your grammar, there is a value assigned to the grammar token,
you may retrieve it, after the parse is complete, by calling the
parser value function, the name of which is given by
\agcode{\${\us}value}, where ``\$'' is replaced with the name of your parser.
\agcode{\${\us}value} takes no arguments, and returns a value of the type
assigned to the grammar token in your syntax file.

Although in theoretical discussions of parsing the result of the parse
is contained in the value of the grammar token, in practice, more
often than not, results are communicated to other procedures by
setting the values of global variables.  Thus the value of the grammar
token is often of little interest.

Since the parser per se takes no arguments, it is usually convenient
to write a small interface function with a calling sequence
appropriate to the problem.  The interface function can then take care
of appropriate initializations, call the parser, and retrieve results.
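
For instance, for a parser named \agcode{ana} whose grammar token has
type \agcode{double}, a small interface function might look something
like this.  This is only a sketch; any initialization depends on your
grammar:

\begin{indentingcode}{0.4in}
double evaluate(void) \bra
  /* perform any initializations your grammar requires here */
  ana();
  if (PCB.exit{\us}flag != AG{\us}SUCCESS{\us}CODE) return 0;
  return ana{\us}value();     /* value of the grammar token */
\ket
\end{indentingcode}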

\subsection{Defining Token Types}

When you add reduction procedures to your grammar, you will often find
it convenient to add type declarations for the \index{Semantic
value}\index{Token}\index{Value}semantic values of some of the tokens
in your grammar.  As long as the types you use are conventional C data
types\index{Data type}\index{Token}, you don't have to do anything
special.  If, however, you have used types or classes that you have
defined yourself, you need to make sure that the appropriate
definition statements precede their use in the code AnaGram generates.
To do this, you need to have a C prologue in your syntax file.  In the
C prologue, you should place the definition statements your parser
will need, or at least an \agcode{\#include} statement that will cause
the types or classes to be defined.
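
For example, if some of your tokens carry values of a type of your
own, the C prologue in your syntax file might look something like
this; the header and type names here are, of course, only
illustrative:

\begin{indentingcode}{0.4in}
\bra
  /* C prologue: copyright notice, includes, type definitions */
  \#include "symbol.h"             /* defines struct symbol */
  typedef struct symbol *symbol{\us}pointer;
\ket
\end{indentingcode}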

\subsection{Debugging Your Parser}

Because the ``flow of control'' of your parser is algorithmically
derived from your grammar, debugging your parser separates into two
separate exercises: debugging your grammar, discussed in Chapter 7,
and debugging your reduction procedures.

When debugging, it is usually a good idea to turn off the
\index{Macros}\index{Allow macros}\index{Configuration switches}
\agparam{allow macros}
switch.  This switch is normally on and causes simple reduction
procedures to be implemented as macros.  When you turn it off, you get
a proper function definition for each reduction procedure, so you can
put a breakpoint in any reduction procedure you choose.  If the
\index{Line numbers}\index{Configuration switches}
\agparam{line numbers} switch
is on each reduction procedure will contain a
\index{\#line}\agcode{\#line} directive to show where the reduction
procedure is found in your syntax file.  Once you have acquired
confidence in your reduction procedures you may turn the
\agparam{allow macros} switch back on for slightly improved
performance.
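
For instance, while debugging you might include a configuration
section such as the following in your syntax file, and remove it, or
turn \agparam{allow macros} back on, once your reduction procedures
are working.  This is only a sketch:

\begin{indentingcode}{0.4in}
{}[
  \~{}allow macros
  line numbers
]
\end{indentingcode}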

If your debugger allows you to inspect entire structures, you will
find it convenient to look at the parser control block while you are
debugging.  The contents of the parser control block are described in
Appendix E.

A good way to begin debugging a new parser is to simply put a
breakpoint in each reduction procedure.  Start your parser and step
through the reduction procedures one by one, verifying that they
perform as expected.  After you have stepped through a reduction
procedure, turn off its breakpoint.  If there are multiple paths,
leave breakpoints on the paths not taken.  Liberal use of the assert
macro helps ensure that your fixes don't break procedures you have
already tested.

\section{Providing Input to Your Parser}
\index{Parser}\index{Input}\index{Input procedures}

This section describes three methods for providing input to your
parser.  In the first method your program calls the parser which then
requests input tokens as it needs them.  It returns only when it has
completed the parse.  The parser requests input tokens by invoking a
macro called \agcode{GET{\us}INPUT}, described below.

The second method for providing input can be used when the entire
sequence of input tokens is available in memory.  This method is
controlled by the \index{Pointer input}\index{Configuration
switches}\agparam{pointer input} configuration switch.  It is
discussed below.

The third method for providing input is especially convenient when
using \index{Lexical scanner}lexical scanners or multi-stage parsing.
It is controlled by the \index{Event driven}\index{Configuration
switches}\agparam{event driven} configuration switch.

\subsection{The \agcode{GET{\us}INPUT} Macro}
\index{GET{\us}INPUT}\index{Macros}

The default parser simply reads characters from \agcode{stdin}.  It
does this by invoking a macro called \agcode{GET{\us}INPUT} every time it
needs an input character.  The default definition of
\agcode{GET{\us}INPUT} is:

\index{PCB}\index{input{\us}code}
\begin{indentingcode}{0.4in}
\#define GET{\us}INPUT (PCB.input{\us}code = getchar())
\end{indentingcode}

\agcode{PCB.input{\us}code} is an integer field in the parser control
block which is used to hold the current input \index{Character
codes}character code.

By including your own definition of \agcode{GET{\us}INPUT} in your
embedded C, you override the default definition provided by AnaGram.
The only requirement for \agcode{GET{\us}INPUT} is that it store a
character in \agcode{PCB.input{\us}code}.  Suppose you wish to make a
parser that reads characters from a file provided by the calling
program.  You could include the following in your embedded C:

\begin{indentingcode}{0.4in}
extern FILE *file;
\#define GET{\us}INPUT (PCB.input{\us}code = fgetc(file))
\end{indentingcode}

Now your parser, when invoked, will read characters from the specified
file instead of reading them from \agcode{stdin}.  Of course,
\agcode{GET{\us}INPUT} is not constrained to reading a file or data
stream.  You may implement \agcode{GET{\us}INPUT} in any manner you
choose.  You may implement it as a function call, or you may choose to
define \agcode{GET{\us}INPUT} so that it expands into inline code for
faster execution.

\subsection{Pointer Input}
\index{Pointer input}\index{Input procedures}

It often happens that the data you wish to parse are already in memory
when you are ready to call the parser.  While you could rewrite
\agcode{GET{\us}INPUT} to simply scan the array by incrementing a
pointer, AnaGram provides an alternative approach since this is such a
common situation.  In a configuration section in your syntax file
simply turn on the \index{Pointer input}\index{Configuration
switches}\agparam{pointer input} switch.  Then before you call your
parser, load \index{pointer}\index{PCB}\agcode{PCB.pointer}, the
pointer field in the parser control block, with a pointer to your
array.  Assuming your parser is called \agcode{ana}, and you wish to
call an interface function with an argument consisting of a character
string, here's what you do:

\begin{indentingcode}{0.4in}
{}[
  pointer input
]

\bra
  void ana{\us}shell(char *source{\us}text) \bra
    PCB.pointer = (unsigned char *)source{\us}text;
    ana();
  \ket
\ket
\end{indentingcode}

The type of \agcode{PCB.pointer} defaults to
\agcode{unsigned char *} to
minimize difficulty with full 256-character sets.  If your compiler is
fussy, you should use a cast, as above, when you set the value.  If
your data requires more than 256
\index{Character codes}character codes, you may still use pointer
input by using the \index{Pointer type}\index{Configuration
parameters}\agparam{pointer type} configuration parameter to change
the definition of the field in the parser control block.  Normally,
the value of \agparam{pointer type} should be a C data type that
converts to integer.  If \agparam{pointer type} does not convert to
integer, you must provide an
\index{INPUT{\us}CODE}\index{Macros}\agcode{INPUT{\us}CODE} macro, as
described below, to extract a token identifier.  Do not change
\agparam{pointer type} to \agcode{signed char} in order to avoid the
cast in the above example.  That will have the effect of making all
character codes above 127 inaccessible to your parser.

Note that if you use pointer input your parser does not need a
\agcode{GET{\us}INPUT} macro.  Parsers that use pointer input usually
run somewhat faster than those that use \agcode{GET{\us}INPUT}; the
keyword matching logic, in particular, benefits from pointer input, so
the improvement is most noticeable in grammars that use keywords.

\subsection{Event Driven Parsers}
\index{Event driven parser}\index{Parser}

There are many situations where the input to a parser is developed by
an independent process and the linkage required to implement a
\agcode{GET{\us}INPUT} macro is unduly cumbersome.  In these
circumstances, it is convenient to use an \agparam{event driven}
parser.  With an event driven parser, you do not simply call the
parser and wait for it to finish.  Instead, you call its
\index{Initializer}initializer first, and then call it each time you
have a character for it.  The parser processes the character and
returns as soon as it needs more input, encounters an error or finds
the parse complete.  You can interrogate
\index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} to determine
whether the parser can accept more input.

To create an event driven parser, set the \index{Event
driven}\index{Configuration switches}\agparam{event driven} switch in
your syntax file.  Then, to initialize the parser, call the
initialization procedure, or \index{Initializer}initializer, provided
by AnaGram.  The name of this procedure is \agcode{init{\us}\$} where
``\agcode{\$}'' represents the name of your parser.  If your parser is named
\agcode{ana}, the
\index{Parser}initialization procedure is named \agcode{init{\us}ana}.
To process a single character, store the character in
\index{input{\us}code}\index{PCB}\agcode{PCB.input{\us}code}, then call
\agcode{ana}.  When it returns, check
\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to see if the
parser is still running.  When the parse is successful, you may
retrieve the value of the grammar token, if you wish, by using the
\index{Parser value function}parser value function, in this case
\agcode{ana{\us}value}.

As an example, let us imagine we are to write an interface function
for our parser which takes a list of string pointers, a count, and a
pointer to a location into which we may store an error flag.  The
input to our parser is to be the concatenation of all the character
strings.  We will set up a loop which will call the parser for all the
characters of the strings in turn.  We will assume that the function
will return the value of the grammar token, which we assume to be of
type double:

\begin{indentingcode}{0.4in}
{}[
  event driven
]

\bra
  double parse{\us}strings(char **ptr, int n{\us}strings, int *error) \bra
    init{\us}ana();
    while (PCB.exit{\us}flag == AG{\us}RUNNING{\us}CODE \&\& 
				n{\us}strings--) \bra
      char *p = *ptr++;
      while (PCB.exit{\us}flag == AG{\us}RUNNING{\us}CODE \&\& *p) \bra
        PCB.input{\us}code = *p++;
        ana();
      \ket
    \ket
    assert(error);
    *error = PCB.exit{\us}flag != AG{\us}SUCCESS{\us}CODE;
    return ana{\us}value();
  \ket
\ket
\end{indentingcode}

The purpose of this example is simply to show how to use an event
driven parser.  Of course it would be possible, as far as this example
is concerned, to concatenate the strings and use pointer input
instead.  A problem sufficiently complex to \emph{require} an event
driven parser would be too complex to serve as a simple example.

\subsection{Token Input}
\index{Token input}\index{Input procedures}

Thus far in this chapter, we have assumed that the input to your
parser consisted of ordinary characters.  There are many situations
where it is convenient to have a
\index{Preprocessor}\index{Token}\index{Token}preprocessor, or
\index{Lexical scanner}lexical scanner, which identifies basic tokens
and hands them over to your parser for further processing.  Accepting
input from such preprocessors is discussed in the remainder of this
section.

Sometimes preprocessors simply pass on text characters, acting as
filters to remove unwanted characters, such as white space or
comments, and to insert other text, such as macro expansions.  In such
situations, there is no need to treat the preprocessor differently
from any other character source.  The input methods described above
are sufficient to deal with the input provided by the preprocessor.

In what follows, we deal with situations where the preprocessor passes
on \index{Token number}\index{Token}\index{Number}\agterm{token
numbers} rather than character codes.  The preprocessor may also pass
on token \emph{values}, which need accommodation of some sort.

There are two principal interfacing problems to deal with.  The first
has to do with identifying the tokens to your parser.  The second has
to do with providing the semantic values of the tokens.
%
%If your preprocessor does not provide values with its tokens, your parser
%may use any of the input techniques described above for character input,
%the only difference being that instead of setting PCB.input{\us}code to a
%character value, you set it to the token identifier.
%
%If your preprocessor does provide token values, then you have to use either
%a GET{\us}INPUT macro, or configure your parser to be event driven.  If you wish
%to use pointer input, you must provide an INPUT{\us}CODE macro.
%

\subsection{Identifying Tokens using Predefined Token Numbers}
\index{Token}\index{Number}\index{Token number}

If you have a pre-existing \index{Lexical scanner}lexical scanner,
written for use with some other parsing system, it probably outputs
its own set of token numbers.  The most robust way of interfacing such
a lexical scanner is to include, in your syntax file, either an
\index{Enum statement}\agparam{enum} statement or a set of definition
statements
for the terminal tokens, equating
\index{Terminal token}\index{Token}terminal token names with the
numeric values output by the lexical scanner, so that AnaGram treats
them as character codes.  In this situation, you simply set
\index{PCB}\index{input{\us}code}\agcode{PCB.input{\us}code} to the token
number determined by the lexical scanner.  Generally, lexical scanners
written for other parsing systems expect to be called for each token.
Therefore, you would normally use a \agcode{GET{\us}INPUT} macro to call
the lexical scanner and provide input to your parser.
% XXX as far as I know, lex expects to call yacc, not vice versa.
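
As a sketch, suppose the existing scanner is the familiar
\agcode{yylex()} function and that it returns 257 for a semicolon and
258 for an identifier; the names and code values here are only
illustrative:

\begin{indentingcode}{0.4in}
semicolon = 257
identifier = 258

\bra
  extern int yylex(void);
  \#define GET{\us}INPUT (PCB.input{\us}code = yylex())
\ket
\end{indentingcode}

The definition statements make the scanner's codes known to AnaGram,
and the \agcode{GET{\us}INPUT} macro calls the scanner each time the
parser needs another token.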

\subsection{Identifying Tokens using AnaGram's Token Numbers}

If you are writing a new preprocessor, you have more freedom.  You
could simply create a set of codes as above.  On the other hand, you
can save a level of translation and make your system run faster by
providing your parser with internal token numbers directly.  Here's
what you have to do.

First, when you write your syntax file, leave all the terminal tokens
undefined.  That means, of course, that you have to have a name for
each terminal token.  You can't use a literal character or a number
for the token.  AnaGram will generate a unique token number for each
token in your grammar.  In the header file it generates, AnaGram
always provides a set of
\index{Enumeration constants}\index{Constants}enumeration constants
for all the named tokens in your grammar.  The names for these
constants are controlled by the
\index{Configuration parameters}\index{Enum constant name}
\agparam{enum constant name}
parameter.  (See Appendix A.) These constants normally have the form
\agcode{\textit{$<$parser name$>$}{\us}\textit{$<$token name$>$}{\us}token}.
Note that embedded space in the token name will be replaced with
underscore characters.  Assume your parser is called \agcode{ana}, and
in your grammar you have a token called \agcode{integer constant}.
The enumeration constant identifying the token is then
\agcode{ana{\us}integer{\us}constant{\us}token}.  Now, to hand off an integer
constant to your parser you write:

\begin{indentingcode}{0.4in}
PCB.input{\us}code = ana{\us}integer{\us}constant{\us}token;
\end{indentingcode}

\subsection{Providing Token Values}

If your \index{Preprocessor}preprocessor provides \index{Semantic
value}\index{Token}\index{Value}semantic values for input tokens, you
must inform AnaGram by setting the
\index{Input values}\index{Configuration switches}\index{Value}
\agparam{input values}
configuration switch in your syntax file. Then, whenever you provide a
token, you must also store a value in
\index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}.
You can do this as part of your \agcode{GET{\us}INPUT} macro, or, if you
have an \agparam{event driven} parser, when you set
\index{input{\us}code}\index{PCB}\agcode{PCB.input{\us}code} prior to
calling the parser function.  If you are using \index{Pointer
input}\index{Configuration switches}\agparam{pointer input}, the
pointer will presumably identify the token value.  You must provide an
\index{INPUT{\us}CODE}\index{Macros}\agcode{INPUT{\us}CODE} macro to extract
the identification code from the token value.  For example, if the
token value is a structure and the appropriate member field is called
\agcode{id}, you would write:

\begin{indentingcode}{0.4in}
\#define INPUT{\us}CODE(t) (t).id
\end{indentingcode}

Generally, the simplest way to interface the preprocessor and your
parser, when you are passing token values, is to use an event driven
parser.  In this situation, the preprocessor, when it identifies a
token, simply loads the token identifier into
\agcode{PCB.input{\us}code}, loads the value into
\index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}, and calls
the parser.

\index{Token}
If the values of your input tokens are all of the same type, you must
set the
\index{Default input type}\index{Configuration parameters}
\index{Input type}\agparam{default input type}
configuration parameter so that AnaGram can declare
\index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}
appropriately.  \index{Token type}\agparam{Default input type} will
default to \agcode{int} if you do not set it either in your configuration file
or in your syntax file.

Some \index{Lexical scanner}lexical scanners simply provide a pointer
to the text of the token they have identified.  In this situation, you
would set \agparam{default input type} to \agcode{char *}.  When you
provide a token to the parser you would set \agcode{PCB.input{\us}value}
to point to the text of the token.

If different tokens have values of different types, the situation
becomes slightly more complex.  First, you must tell AnaGram about the
types of your input tokens.  You do this by including a
\index{Declaration}\index{Type declarations}\agterm{type declaration}
in your syntax file.  A type declaration is a token declaration
preceded by a C data type\index{Data type}\index{Token} in
parentheses.  Assume that your \index{Preprocessor}preprocessor
identifies, among others, the following tokens: \agcode{name},
\agcode{string}, \agcode{real constant}, \agcode{integer constant},
and \agcode{unsigned constant}.  You might then include the following
in your syntax file:

\begin{indentingcode}{0.4in}
{}[
  input values
]

(char *) name, string
(double) real constant
(long) integer constant, unsigned constant
\end{indentingcode}

AnaGram will then create, in the parser control block, an input value
field which can accommodate any of these terminal tokens in your
grammar.

To enable you to store data into the input value field of the parser
control block, AnaGram provides a convenient macro called
\index{INPUT{\us}VALUE}\index{Macros}\agcode{INPUT{\us}VALUE} to serve as
the destination of an assignment statement.  \agcode{INPUT{\us}VALUE}
takes the type of the data as a parameter.  Thus one could write:

\begin{indentingcode}{0.4in}
INPUT{\us}VALUE(char *) = text{\us}pointer;
INPUT{\us}VALUE(long) = constant{\us}value;
\end{indentingcode}
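
For example, given the declarations above, an event driven parser
named \agcode{ana} could be handed a \agcode{name} token in a manner
something like this, where \agcode{name{\us}text} is a hypothetical
buffer maintained by your preprocessor:

\begin{indentingcode}{0.4in}
PCB.input{\us}code = ana{\us}name{\us}token;
INPUT{\us}VALUE(char *) = name{\us}text;
ana();
\end{indentingcode}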

\section{Error Handling}

There are two classes of errors your parser needs to be able to deal
with.  The first consists of \agterm{implementation errors} and the second
consists of \agterm{syntax errors}.  Syntax errors arise because the input to
the parser does not conform to the definition of the language it is
designed to parse.  Implementation errors arise because the programs
we write are never perfect and because the environment in which our
programs run is often something less than ideal.

\subsection{Implementation Errors}
\index{Implementation errors}\index{Errors}

% XXX parser stack overflow is not really an ``implementation error''

There are two implementation errors which your parser needs to be able
to deal with.  The first is \agterm{parser stack overflow}.  The
second comes from a bad \agterm{reduction token}.

\index{Stack}
\paragraph{Stack Overflow.}
Stack overflow is an error which your parser must be able to deal
with.  In general, no matter how big you make your parser stack, it is
possible for legitimate input to cause it to overflow.  The size of
the stack for your parser is controlled by the configuration parameter
\agparam{parser stack size}.  This parameter defaults to a value of
32.  This value has been found to be adequate for ordinary usage.
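
If you expect deeply nested input, you can enlarge the stack by
setting the parameter in a configuration section; the value below is
simply illustrative:

\begin{indentingcode}{0.4in}
parser stack size = 128
\end{indentingcode}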

If your parser has only left recursive constructs, then there is a
maximum depth beyond which the parser stack will never grow.  If your
parser has center recursive or right recursive productions, then no
matter how much stack space you allocate, there will always be a
syntactically correct input file which causes the stack to overflow.
This can be illustrated by the following set of C statements:

\begin{indentingcode}{0.4in}
    x = y;
    x = (y);
    x = ((y));
    x = (((y)));
    .
    .
    .
\end{indentingcode}

Each set of parentheses requires another level on the parser stack.
When this set of statements was tried with Borland C++, it ran out of
stack space at 127 sets of parentheses and diagnosed the problem as
``Expression is too complicated''.

AnaGram calculates the actual size of the parser stack by calculating
the maximum depth for left recursive constructs and adding half the
value of
\index{Parser stack size}\index{Configuration parameters}\index{Stack}
\index{Parser state stack}\index{State stack}
\agparam{parser stack size}.  It then uses the larger of the calculated
value and \agparam{parser stack size} to allocate stack storage.  You
may check the value actually used in your parser by inspecting the
definition of
\index{AG{\us}PARSER{\us}STACK{\us}SIZE}\agcode{AG{\us}PARSER{\us}STACK{\us}SIZE}.

If your parser runs out of stack space, it will set
\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to
\index{AG{\us}STACK{\us}ERROR{\us}CODE}\agcode{AG{\us}STACK{\us}ERROR{\us}CODE}, invoke
the
\index{Macros}\index{PARSER{\us}STACK{\us}OVERFLOW}\agcode{PARSER{\us}STACK{\us}OVERFLOW}
macro and return to the calling program.  The default definition of
this macro is:

\begin{indentingcode}{0.4in}
\#define PARSER{\us}STACK{\us}OVERFLOW \bra fprintf(stderr, {\bs}
  "{\bs}nParser stack overflow{\bs}n"); \ket
\end{indentingcode}

If this definition is not consistent with your needs, you may provide
your own definition in any block of embedded C in your syntax file.
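
For example, you might provide a definition that reports where the
overflow occurred.  The following sketch assumes \agparam{lines and
columns} is on and that \agcode{input{\us}file{\us}name} is a variable
maintained by your own code:

\begin{indentingcode}{0.4in}
\#define PARSER{\us}STACK{\us}OVERFLOW \bra fprintf(stderr, {\bs}
  "\%s: line \%d: statement is too complicated{\bs}n", {\bs}
  input{\us}file{\us}name, PCB.line); \ket
\end{indentingcode}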

\index{Reduction token error}
\paragraph{Reduction Token Error.}
A properly functioning parser should never encounter a reduction token
error.  Therefore, reduction token errors should be taken quite
seriously.  The only way to cause a reduction token error in an
otherwise properly functioning parser is to incorrectly set the
reduction token for a semantically determined production.

Before your parser calls a reduction procedure, it stores the token
number of the token to which the production would normally reduce in
\index{reduction{\us}token}\index{PCB}\agcode{PCB.reduction{\us}token}.  If
the production is a semantically determined production, you may, in
your reduction procedure, change the value of
\agcode{PCB.reduction{\us}token} to one of the alternative tokens on
the left side of the production.  When your reduction procedure
returns, your parser checks to verify that
\agcode{PCB.reduction{\us}token} is a valid token number for the
current state of the parser.  If it is not, it sets
\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to
\index{AG{\us}REDUCTION{\us}ERROR{\us}CODE}\agcode{AG{\us}REDUCTION{\us}ERROR{\us}CODE}
and invokes
\index{REDUCTION{\us}TOKEN{\us}ERROR}\index{Macros}\agcode{REDUCTION{\us}TOKEN{\us}ERROR}.
The default definition of this macro is:

\begin{indentingcode}{0.4in}
\#define REDUCTION{\us}TOKEN{\us}ERROR \bra fprintf(stderr,{\bs}
  "{\bs}nReduction{\us}token error{\bs}n"); \ket
\end{indentingcode}

\subsection{Syntax Errors}
\index{Syntax error}\index{Errors}

If the input data to your parser does not conform to the rules you
have specified in your grammar, your parser will detect a syntax
error.  There are two basic aspects of dealing with syntax errors:
\index{Error diagnosis}\agterm{diagnosing} the error and
\agterm{recovering} from the error, that is, restarting the parse, or
``resynchronizing'' the parser.

If you use the default settings for syntax error handling, then on
encountering a syntax error your parser will call a diagnostic
procedure which will create an error message and store a pointer to it
in
\index{Error messages}\index{error{\us}message}\index{PCB}
\agcode{PCB.error{\us}message}.
Then, it will set
\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to 
\index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE} and
call a macro called
\index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR}.  The
default definition of \agcode{SYNTAX{\us}ERROR} will print the error
message on \agcode{stderr}.  Finally, in lieu of trying to continue
the parse, it will return to the calling program.

AnaGram has several options which allow you to tailor diagnostic
messages to your requirements or help you to create your own.  It also
provides several options for continuing the parse.

The options available to help you diagnose errors are:

\begin{itemize}
\item line and column tracking
\item creation of a diagnostic message
\item identification of the error frame
\end{itemize}

\index{Numbers}\index{Lines and columns}\index{Configuration switches}
\paragraph{Line and Column Tracking.}
Your parser will automatically track lines and columns in its input if
the \agparam{lines and columns} configuration switch is on.  Since
this is a common requirement, \agparam{lines and columns} defaults to
on.  If you don't want your parser to spend time counting lines and
columns you should turn the switch off, thus:

\begin{indentingcode}{0.4in}
\~{}lines and columns
\end{indentingcode}

Normally, if you are using a \index{Lexical scanner}lexical scanner,
you would turn lines and columns off, since the parser's input then
consists of token numbers rather than text characters and the counts
would be meaningless; any position tracking is better done in the
scanner itself.

The line and column counts are maintained in
\index{line}\index{PCB}\agcode{PCB.line} and
\index{column}\index{PCB}\agcode{PCB.column} respectively.  
\agcode{PCB.line} and \agcode{PCB.column} are initialized with the
values of the \index{FIRST{\us}LINE}\index{Macros}\agcode{FIRST{\us}LINE}
and \index{Macros}\index{FIRST{\us}COLUMN}\agcode{FIRST{\us}COLUMN} macros
respectively.  These macros provide default initial values of 1 for
both line and column numbers.  To override these definitions, simply
include definitions for these macros in your syntax file.  If tab
characters are encountered, they are expanded in accordance with the
\index{Tab spacing}\agparam{tab spacing} parameter.
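
For example, to use zero-based column numbers you might place the
following in a block of embedded C in your syntax file:

\begin{indentingcode}{0.4in}
\#define FIRST{\us}COLUMN 0
\end{indentingcode}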

When your parser is executing a reduction procedure, \agcode{PCB.line} and
\agcode{PCB.column} refer to the first input character following the
rule that is being reduced.  When your parser has encountered a syntax
error, and is executing your \agcode{SYNTAX{\us}ERROR} macro, 
\agcode{PCB.line} and \agcode{PCB.column} refer to the erroneous input
character.

\paragraph{Diagnostic Messages.}
If the \index{Diagnose errors}\index{Configuration switches}
\agparam{diagnose errors} switch is on, its default setting, AnaGram
will include an error diagnostic procedure in your parser.  When your
parser encounters a syntax error, this procedure will create a simple
diagnostic message and store a pointer to it in
\index{error{\us}message}\index{PCB}\agcode{PCB.error{\us}message} before
your \agcode{SYNTAX{\us}ERROR} macro is executed.  The default definition
of \agcode{SYNTAX{\us}ERROR} prints this message on \agcode{stderr}.

If your parser is in a state where a single input character or a
simple named token is expected, it will create a message of the
form:

\begin{indentingcode}{0.4in}
Missing ';'
\end{indentingcode}
or
\begin{indentingcode}{0.4in}
Missing semicolon
\end{indentingcode}

If there is more than one possible input, your parser will check to
see if it can identify the erroneous input.  If it can, it will create
a message of the form:

\begin{indentingcode}{0.4in}
Unexpected ';'
\end{indentingcode}
or
\begin{indentingcode}{0.4in}
Unexpected semicolon
\end{indentingcode}

Otherwise, the diagnostic message will be simply:

\begin{indentingcode}{0.4in}
Unexpected input
\end{indentingcode}

If you do not need a diagnostic message, or choose to create your own,
you should turn \agparam{diagnose errors} off.

% XXX Somewhere there should be a discussion of what ``creating your
% own'' would entail.

\index{Error frame}
\paragraph{Error Frame.}
Often it is desirable to know the ``frame'' of an error, that is, what
the parser thought it was doing when it encountered the error.  If,
for instance, you forget to terminate a comment in a C program, your C
compiler sees an unexpected end of file.  When you look simply at the
alleged error, of course, you can't see any problem.  In order to
understand the error, you need to know that the parser was trying to
find a complete comment.  In this case, we can say that the comment is
the ``frame'' of the error.

AnaGram provides an optional facility in its error diagnostic
procedure, controlled by the
\index{Error frame}\index{Configuration switches}\agparam{error frame}
switch, for identifying the frame of a syntax error.  The
\agparam{diagnose errors} switch must also be on to enable the
diagnostic procedure.  If you enable \agparam{error frame} in your
syntax file, AnaGram will include a procedure which will scan
backwards on the state stack looking for the frame of the error.  When
it finds what appears to be the error frame, it will store the stack
index in 
\index{error{\us}frame{\us}ssx}\index{PCB}\agcode{PCB.error{\us}frame{\us}ssx} and
the token number of the nonterminal token the parser was looking for
in
\index{error{\us}frame{\us}token}\index{PCB}\agcode{PCB.error{\us}frame{\us}token}.

%
% XXX. Why is the discussion of ``hidden'' inside the discussion of
% ``error frame''? hidden applies to ordinary error diagnosis also.
%
% Furthermore, this discussion of error frame needs an example, or
% nobody will ever figure out how to do it.
%

If, in your grammar, there are nonterminal tokens that are not
suitable for diagnostic use, usually because they name an intermediate
stage in the parse that means nothing to your user, you can make sure
that AnaGram ignores them in doing its analysis by declaring them as
\index{Declaration}\index{Hidden declaration}\agparam{hidden}.  To
declare tokens as hidden, include a \agparam{hidden} declaration in a
configuration section.  (See Chapter 8.) For instance, consider:

\begin{indentingcode}{0.4in}
comment
  -> comment head, "*/"
comment head
  -> "/*"
  -> comment head, \~{}end of file
{}[ hidden \bra comment head \ket ]
\end{indentingcode}

We mark comment head as hidden, because we only wish to talk about
complete comments with our users.

In order to use the error frame effectively in your diagnostics, you
need to have an ASCII representation of the name of the token as well
as its token number.  If you turn the
\index{Token names}\index{Configuration switches}\agparam{token names} 
configuration switch on in your syntax file, AnaGram will provide an
array of ASCII strings, indexed by token number, which you may use in
your diagnostics.  The name of the array is created by appending
\agcode{{\us}token{\us}names} to the name of your parser.  If your parser is
called \agcode{ana}, your token name array will have the name
\agcode{ana{\us}token{\us}names}.  As a convenience, AnaGram
also defines a macro,
\index{TOKEN{\us}NAMES}\index{Macros}\agcode{TOKEN{\us}NAMES}, which
evaluates to the name of the token name array.  Note that
\agparam{token names}
controls the generation of an array of ASCII strings and should not be
confused with the \agcode{typedef enum} statement in the parser header
file which provides you with a set of enumeration constants.
% XXX maybe it means the *strings* should not be confused?
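
For example, with \agparam{diagnose errors}, \agparam{error frame},
and \agparam{token names} all on, your \agcode{SYNTAX{\us}ERROR} macro
might mention the error frame token in a manner something like this
(a sketch only):

\begin{indentingcode}{0.4in}
\#define SYNTAX{\us}ERROR fprintf(stderr, {\bs}
  "\%s while scanning \%s{\bs}n", PCB.error{\us}message, {\bs}
  TOKEN{\us}NAMES[PCB.error{\us}frame{\us}token])
\end{indentingcode}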

If you are tracking context, using the techniques described below, you
can use the macro
\index{ERROR{\us}CONTEXT}\index{Macros}\agcode{ERROR{\us}CONTEXT} or
\index{PERROR{\us}CONTEXT}\index{Macros}\agcode{PERROR{\us}CONTEXT} to
determine the context of the error frame token.

\index{SYNTAX{\us}ERROR}\index{Macros}
\paragraph{SYNTAX{\us}ERROR Macro.}
When your parser finds a syntax error, it first executes any of the
diagnostic procedures described above that you have enabled, sets
\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to
\index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE},
and then invokes the \agcode{SYNTAX{\us}ERROR} macro.  If you have not
defined \agcode{SYNTAX{\us}ERROR} it will be defined thus if you have set
\index{Lines and columns}\index{Configuration switches}
\agparam{lines and columns}:

\begin{indentingcode}{0.4in}
\#define SYNTAX{\us}ERROR {\bs}
  fprintf(stderr,"\%s,line \%d,column \%d{\bs}n", {\bs}
      PCB.error{\us}message, PCB.line, PCB.column)
\end{indentingcode}

and thus if you have not:

\begin{indentingcode}{0.4in}
\#define SYNTAX{\us}ERROR {\bs}
  fprintf(stderr, "\%s{\bs}n", PCB.error{\us}message)
\end{indentingcode}

In most circumstances, you will probably want to write your own 
\agcode{SYNTAX{\us}ERROR} macro, since this diagnostic is one your users
will see with some frequency.
% XXX yes and why exactly? is there something we have in mind better
% than just printing PCB.error_message?
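
As an illustration, a replacement that also reports the name of the
file being parsed might look something like this, where
\agcode{input{\us}file{\us}name} is a hypothetical variable maintained by
your program:

\begin{indentingcode}{0.4in}
\#define SYNTAX{\us}ERROR fprintf(stderr, {\bs}
  "\%s(\%d): \%s near column \%d{\bs}n", input{\us}file{\us}name, {\bs}
  PCB.line, PCB.error{\us}message, PCB.column)
\end{indentingcode}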

The default macro simply returns to the parser.  Your macro doesn't
have to.  If you wish, you could call \agcode{abort} or \agcode{exit}
directly from the macro.  If the \agcode{SYNTAX{\us}ERROR} macro returns
control to the parser, subsequent events depend on your choices for
error recovery.

\section{Error Recovery}
\index{Error recovery}\index{Syntax error}\index{Errors}

Syntax errors can be caused by any of a number of problems.  Some come
from simple typographic errors: the user skips a character or types
the wrong one.  Others come from true errors: the user types something
that might be correct elsewhere, but is wrong in the context at hand.
Usually, if your parser is reading a file, you will want to continue
parsing the input, checking for other syntax errors at the very least.
The problem with doing this is getting the parser restarted, or
``resynchronized'', in some reasonable manner.

AnaGram provides a number of ways for your parser to recover from a
syntax error.  The least graceful, of course, is simply to call
\agcode{abort} or \agcode{exit} from the \agcode{SYNTAX{\us}ERROR} macro.
If you don't do this you have several options:

\begin{itemize}
\item error token resynchronization
\item auto resynchronization
\item simple return to calling program
\item ignore the error
\end{itemize}

\subsection{Error Token Resynchronization}
\index{Resynchronization}

When AnaGram builds your parser it checks to see if you have used a
token called \agcode{error} in your grammar or if you have assigned a
token name as the value of the configuration parameter
\index{Error token}\index{token}\index{Configuration parameters}
\agparam{error token}.  If so, it includes a call to an error token
resynchronization procedure immediately after the invocation of
\index{SYNTAX{\us}ERROR}\agcode{SYNTAX{\us}ERROR}.  The error token
resynchronization procedure works in the following way: It scans the
state stack backwards looking for the most recent state in which
\agcode{error} or the token named by \agparam{error token} was valid
input.  It then truncates the stack to this level, and jumps to the
state indicated by the error token.  It then passes over any input it
sees until it sees valid input for the state in which it finds itself.
At this point, it returns to the parser which continues as though
nothing had happened.  Since this is substantially easier than it
sounds, let's look at an example.  Suppose we are writing a C
compiler, and we wish to catch errors in ordinary statements.  We add
the following production to our grammar:

\begin{indentingcode}{0.4in}
statement
  -> error, ';'
\end{indentingcode}

Now, if the parser encounters a syntax error anytime while it is
parsing any statement, it will pop back to the most recent state where
it was looking for a statement, jump forward to the state indicated by
the token \agcode{error} in the new production, and then skip input
until it sees a semicolon.  At this point it will continue a normal
parse.  The effect of continuing at this point is to recognize and
reduce the above production, i.e., the parser will proceed as if it
had found a complete, correct ``statement''.  This production could
even have a reduction procedure to do any clean-up that an error might
require.

If you use error token resynchronization, you must identify an end of
file token to guarantee that the resynchronization procedure can
always terminate.  To do this, either name your end of file token
\agcode{eof} or use the
\index{Eof token}\index{Configuration parameters}\index{Token}
\agparam{eof token} configuration parameter to specify it.

For example, if your parser is reading conventional stream input, the
end of file will be denoted by a $-1$ value.  You can define the end
of file token thus:

\begin{indentingcode}{0.4in}
eof = -1
\end{indentingcode}

% XXX as ``finally'' means something in Java, let's change this to
% ``at last''
On the other hand, if you have already defined a token named
\agcode{finally}, you can add the following line to any configuration
segment:

\begin{indentingcode}{0.4in}
eof token = finally
\end{indentingcode}

The end of file token, of course, must be a terminal token.
% XXX this is not ``of course'' to a casual observer.

\subsection{Automatic Resynchronization}
\index{Resynchronization}\index{Automatic resynchronization}

If you have not specified an \agcode{error} token in your syntax file,
AnaGram checks to see if you have turned on the
\index{Auto resynch}\index{Configuration switches}
\agparam{auto resynch} configuration switch.
If so, it includes a call to an automatic resynchronization procedure
immediately after the call to \agcode{SYNTAX{\us}ERROR}.  The automatic
resynchronization procedure uses a heuristic based on your grammar to
get back in step with the input.  To use it you need do only two
things: You need to turn on the \index{Auto resynch}\agparam{auto
resynch} switch, and you need to specify an end of file token as for
error token resynchronization, above.

The primary advantage of the automatic resynchronization is that it is
easy to use.  The disadvantage is that it turns off all reduction
procedures, so that your parser is reduced to being a syntax checker
after it encounters an error.  If your grammar uses semantically
determined productions, your reduction procedures will not be invoked
so the primary reduction token will be used in all cases.

% XXX *why* does it do this?

\subsection{Other Ways to Continue}

If you do not wish to use either of the above resynchronization
procedures, you still have a number of options.  If your parser is
reading keystrokes interactively, as in a menu-driven application
rather than a command line or a file, it is probably sufficient
to simply ignore bad input characters.  You can do this by simply
resetting \index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} to
\agcode{AG{\us}RUNNING{\us}CODE} in your
\index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR} macro.
Your parser will then continue, passing over the bad input as though
it had never occurred.  If you do this, you should, of course, notify
your user somehow that you're skipping a character.  Issuing a beep on
the computer's speaker from the \agcode{SYNTAX{\us}ERROR} macro is
usually enough.
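
Here, for instance, is a sketch of a \agcode{SYNTAX{\us}ERROR} macro
that simply beeps and lets the parse continue:

\begin{indentingcode}{0.4in}
\#define SYNTAX{\us}ERROR \bra putc('{\bs}a', stderr); {\bs}
  PCB.exit{\us}flag = AG{\us}RUNNING{\us}CODE; \ket
\end{indentingcode}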

If you do not wish to continue the parse, but want your main program
to continue, you need do nothing special.  \agcode{PCB.exit{\us}flag} is
set to \agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE} before the
\agcode{SYNTAX{\us}ERROR} macro is called.  If your
macro does not change \agcode{PCB.exit{\us}flag}, when it relinquishes
control to your parser, your parser will simply return to the calling
program.  The calling program can determine that the parse was
unsuccessful by inspecting \agcode{PCB.exit{\us}flag} and take whatever
action you deem appropriate.
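
For instance, with the default parser and control block names for a
parser called \agcode{ana}, the calling program might check the
outcome along these lines.  This is only a sketch; the value 1
indicates normal completion, as in the reentrancy examples later in
this chapter, while 2 indicates a syntax error as noted above:

\begin{indentingcode}{0.4in}
ana();                             /* run the parser                 */
if (ana{\us}pcb.exit{\us}flag != 1) \bra   /* 1 indicates a successful parse */
  fprintf(stderr, "parse failed{\bs}n");
  return 1;                        /* e.g., from main()              */
\ket
\end{indentingcode}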


\section{Advanced Techniques}

\subsection{Semantically Determined Productions}
\index{Semantically determined production}\index{Production}

A semantically determined production is one which has more than one
token on the left side.  The reduction procedure then determines which
token has in fact been identified, using whatever criteria are
necessary.  In some cases, where the purpose is simply to provide
multiple syntactic options to be chosen at execution time, the
determination is made by interrogating a switch.  Other situations
may require a more complex determination, such as a symbol table
look-up.

\index{Production}
The tokens on the left side of the production can be used just like
any other tokens in your grammar.  Their semantic values, however,
must all be of the same \index{Data type}\index{Token}data type.

Depending on how you have defined your grammar, it may be that
whenever any one of the tokens on the left side is syntactically
acceptable input, all the tokens on the left are syntactically
acceptable.  That is, the production could reduce to any of the tokens
on the left without causing an immediate error condition.  In many
circumstances, however, this is not the case.  In a Pascal grammar,
for example, a semantically determined production might be used to
allow a reduction procedure to determine whether a particular
identifier is a constant identifier, a type identifier, a variable
identifier, and so on.  In any particular context, only a subset of the
tokens on the left may be syntactically acceptable.

Before your reduction procedure is called, your parser will set the
reduction token to the first token on the left side which is
syntactically correct. If you need to change this assignment you have
several options.  From within your reduction procedure, you may simply
set
\index{reduction{\us}token}\index{PCB}\index{Token}\agcode{PCB.reduction{\us}token}
to the semantically correct value.  For this purpose, it is convenient
to use the token name enumeration constants provided in the header
file for your parser.  Note that if you select a reduction token that
is not syntactically correct, after your reduction procedure returns,
your parser will encounter a \index{Reduction token
error}\agterm{reduction token error}, described above.

AnaGram provides several tools to help you set the reduction token
correctly.  First, it provides a \agterm{change reduction} function
which will set the reduction token to a specified token only if the
specified token is syntactically correct.  It will return a flag to
indicate the outcome: non-zero on success, zero on failure.  The name
of this function is given by appending \agcode{{\us}change{\us}reduction} to
the name of your parser.  Thus, if your parser is named \agcode{ana},
the name of the function would be \agcode{ana{\us}change{\us}reduction}.  In
those cases where the semantically correct reduction token is not
syntactically correct, you will want to provide error diagnostics for
your user.  If you wish the parse to continue, so that you can check
for further errors, you may simply return from the reduction
procedure.  Since the
default reduction is syntactically correct, the parse can continue as
though there had been no error.

To simplify use of the change reduction function, AnaGram provides a macro,
\index{CHANGE{\us}REDUCTION}\index{Macros}\agcode{CHANGE{\us}REDUCTION}.
Simply call the macro with the name of the desired token as the
argument, replacing embedded blanks in the token name with
underscores.

For example, in writing a grammar for the C language, it is quite
convenient to write the following production:

\begin{indentingcode}{0.4in}
identifier, typedef name
  -> name                = check{\us}typedef();
\end{indentingcode}

The reduction procedure can then check the symbol table to see
whether the name that has been found is a typedef name.  If so, it can
use the \agcode{CHANGE{\us}REDUCTION} macro to change the reduction token
to \agcode{typedef name} and verify that this is acceptable:

\begin{indentingcode}{0.4in}
if (!CHANGE{\us}REDUCTION(typedef{\us}name)) diagnose{\us}error();
\end{indentingcode}

Note that the embedded space in the token name must be replaced with
an underscore character.

Under some circumstances, in your reduction procedure, you might wish
to know precisely which reduction tokens are syntactically correct.
For instance, you might wish, in an error diagnostic, to tell your
user what you expected to see.  If you set the
\index{Reduction choices}\index{Configuration switches}
\agparam{reduction choices} switch,
AnaGram will include in your parser file a function which will
identify the acceptable choices for the reduction token in the current
state.  The prototype of this function is:

\begin{indentingcode}{0.4in}
int \${\us}reduction{\us}choices(int *);
\end{indentingcode}

where ``\agcode{\$}'' represents the name of your parser.  You must provide an
integer array whose length is at least the maximum number
of reduction choices you might have.  The function will fill the array
with the token numbers of those which are acceptable in the current
state and return a count of the number of acceptable choices it found.
You can call this function from any reduction procedure.  AnaGram also
provides a macro to invoke this procedure:
\index{REDUCTION{\us}CHOICES}\index{Macros}\agcode{REDUCTION{\us}CHOICES}.
For example, to provide a diagnostic which lists the acceptable
tokens, you might combine the use of the \agparam{reduction choices}
switch with the
\index{Token names}\index{Configuration switches}\agparam{token names}
switch described above:

\begin{indentingcode}{0.4in}
int ok{\us}tokens[20], n{\us}ok{\us}tokens, i;
n{\us}ok{\us}tokens = REDUCTION{\us}CHOICES(ok{\us}tokens);
printf("Acceptable input comprises: {\bs}n");
for (i = 0; i $<$ n{\us}ok{\us}tokens; i++) \bra
  printf("  \%s{\bs}n", TOKEN{\us}NAMES[i]);
\ket
\end{indentingcode}

A semantically determined production can even be a null production.
You can use a semantically determined null production to interrogate
the settings of parameters and control parsing accordingly:

\begin{indentingcode}{0.4in}
condition false, condition true
  -> = \bra if (condition) CHANGE{\us}REDUCTION(condition{\us}true); \ket
\end{indentingcode}

The \index{examples}\agfile{examples} directory of your AnaGram
distribution contains numerous grammars that use semantically
determined productions.

\subsection{Defining Parser Control Blocks}
\index{Parser control block}

All references to the parser control block in your parser are made
using the macro \index{PCB}\agcode{PCB}.  The only intrinsic
requirement on \agcode{PCB} is that it evaluate to an \agterm{lvalue} (see
Kernighan and Ritchie) that identifies a parser control block.  The
actual access may be direct, indirect through a pointer, subscripted,
or even more complex, although if the access is too complex, the
performance of your parser could suffer.  Simple indirect or
subscripted references are usually enough to enable you to build a
system with multiple parallel parsing processes.  If you wish to
define \agcode{PCB} in some way other than a simple, direct access to
a compiled-in control block, you will have to declare the control
block yourself.

When AnaGram builds a parser, it checks the status of the
\index{Declare pcb}\index{Configuration switches}\agparam{declare pcb}
configuration switch.  If it is on (the default setting), AnaGram
declares a parser control block for you.  AnaGram creates the name of
the parser control block variable by appending \agcode{{\us}pcb} to the
name of your parser.  Thus if the name of your parser is
\agcode{ana}, the parser control block is \agcode{ana{\us}pcb}.

In the header file AnaGram generates, a typedef statement defines the
structure of the parser control block.  The typedef name is given by
appending \agcode{{\us}pcb{\us}type} to the name of your parser.  Thus if
the name of your parser is \agcode{ana}, the type of the parser
control block is given by \agcode{ana{\us}pcb{\us}type}.  When AnaGram
defines the parser control block for \agcode{ana}, it does so by
including the following two lines of code:

\begin{indentingcode}{0.4in}
ana{\us}pcb{\us}type ana{\us}pcb;
\#define PCB ana{\us}pcb
\end{indentingcode}

If you wish to declare the parser control block yourself, you should
turn off the \agparam{declare pcb} switch.  To turn \agparam{declare
pcb} off, include the following line in a configuration segment in
your syntax file:

\begin{indentingcode}{0.4in}
\~{}declare pcb
\end{indentingcode}

Suppose your program needs to serve up to sixteen ``clients'', each
with its own input stream.  You might turn \agparam{declare pcb} off
and declare the parser control block in the following manner:

\begin{indentingcode}{0.4in}
ana{\us}pcb{\us}type ana{\us}pcb[16];    /* declare control blocks */
int client;
\#define PCB ana{\us}pcb[client]  /* tell parser about it */
\end{indentingcode}

Perhaps you need to parse a number of input streams, but you don't
know exactly how many until run time.  You might make the following
declarations:

\begin{indentingcode}{0.4in}
ana{\us}pcb{\us}type *ana{\us}pcb;       /* pointer to control block */
\#define PCB (*ana{\us}pcb)       /* tell parser about it */
\end{indentingcode}

Note that when you define \agcode{PCB} as an indirection through a
pointer, you should put parentheses around the expression so that
member references such as \agcode{PCB.exit{\us}flag} expand with the
correct operator precedence.

There are many situations where it is convenient for a parser to be
reentrant.  A parser used for evaluating formulas in a spreadsheet
program, for instance, needs to be able to call itself recursively if
it is to use natural order recalculation.  A parser used to implement
macro substitutions may need to be recursive to deal with embedded
macros.

Here is an example of an interface function which is designed for
recursive calls to a parser, using the definitions above:

% XXX fix the misuse of assert, and check malloc for failure?
% And use AG_SUCCESS_CODE instead of 1?
\begin{indentingcode}{0.4in}
\#include <assert.h>
\#include <stdlib.h>

\#define PCB (*ana{\us}pcb)
ana{\us}pcb{\us}type *ana{\us}pcb;

void do{\us}ana(void) \bra
  ana{\us}pcb{\us}type *save{\us}ana = ana{\us}pcb;
  ana{\us}pcb = malloc(sizeof(ana{\us}pcb{\us}type));
  ana();
  assert(PCB.exit{\us}flag == 1);
  free(ana{\us}pcb);
  ana{\us}pcb = save{\us}ana;
\ket
\end{indentingcode}

Here is another way to accomplish the same end, this time using stack
storage rather than heap storage:

% XXX ditto: fix the misuse of assert, and use AG_SUCCESS_CODE instead of 1?
\begin{indentingcode}{0.4in}
\#include <assert.h>

\#define PCB (*ana{\us}pcb)
ana{\us}pcb{\us}type *ana{\us}pcb;

void do{\us}ana(void) \bra
  ana{\us}pcb{\us}type *save{\us}ana = ana{\us}pcb;
  ana{\us}pcb{\us}type local{\us}pcb;
  ana{\us}pcb = \&local{\us}pcb;
  ana();
  assert(PCB.exit{\us}flag == 1);
  ana{\us}pcb = save{\us}ana;
\ket
\end{indentingcode}

% XXX and here we should discuss \agparam{reentrant parser}, too.


\subsection{Multi-stage Parsing}
\index{Parsing}\index{Multi-stage parsing}

Multi-stage parsing consists of chaining together a number of parsers
in series so that each parser provides input to the following one.
Users of \agfile{lex} and \agfile{yacc} are accustomed to using
two-level parsing, since the ``\index{Lexical scanner}lexical
scanner'', or ``lexer'' they write in \agfile{lex} is really a very
simple parser whose output becomes the input to the parser written in
\agfile{yacc}.  AnaGram has been developed so that you may use as many
levels as are appropriate to your problem, and so that, if you wish,
you may write all of the parsers in AnaGram.

Many problems that do not lend themselves conveniently to solution
with a simple grammar can be neatly solved by using multi-stage
parsing.  In many cases this is because multi-stage parsing can be
used to parse constructs that are not context-free.  A first level
parser can use semantic information to decide which tokens to pass on
to the next level.  Thus, a first level parser for a C compiler can
use semantic information to distinguish typedef names from variable
names.

% XXX I believe this is referring to QPL. Nowadays there's Python...
As another example, a proprietary programming language used indents to
control its block structure.  A first level parser looked only at
lines and indents, passing the text through to the second level
parser.  When it encountered changes in indentation level, it inserted
block start and block end tokens as necessary.

Using AnaGram it is extremely easy to set up multi-stage parses.
Simply configure the second level parser as an event-driven parser.
The first level parser can then hand over tokens or characters to it
as it develops them.

The C macro preprocessor example, found in the
\index{examples}\agfile{examples} directory of your AnaGram
distribution, illustrates the use of multi-stage parsing.

\subsection{Context Tracking}
\index{Context tracking}

When you are writing a reduction procedure for a particular grammar
rule, you often need to know the value one or another of your program
variables had at the time the first token in the rule was encountered.
Examples of such variables are:

\begin{itemize}
\item Line or column number
\item Index in an input file
\item Index into an array
\item Counters, as of symbols defined, etc.
\end{itemize}

Such variables can be thought of as representing the ``context'' of
the rule you are reducing.  Sometimes it is possible to incorporate
the values of such variables into the values of reduction tokens, but
this can become quite cumbersome.  AnaGram provides an optional
feature known as ``context tracking'' to deal with this problem.
Here's how it works:

First, you identify the variables which you want to track.  Second,
you write a typedef statement in the \index{C prologue}C prologue of
your parser which defines a data structure with fields to accommodate
values for all of these variables.  Third, you tell AnaGram what the
name of the type of your data structure is, using the
\index{Context type}\index{Configuration parameters}\agparam{context type}
configuration parameter.  This causes AnaGram to add a field called
\index{PCB}\index{input{\us}context}\agcode{input{\us}context} and a stack,
the \index{Context stack}\index{Stack}\agterm{context stack}, called
\index{PCB}\index{cs}\agcode{cs}, both of the type you have specified,
to your parser control block.  Fourth, you write code to gather the
context information for each input character.
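
For instance, the typedef and the \agparam{context type} setting
(steps two and three) might look like this for a context that records
a file offset as well as a line and column number; the type and field
names here are only illustrative:

\begin{indentingcode}{0.4in}
\bra
  typedef struct \bra long offset; int line, column; \ket tracked{\us}context;
\ket
{}[ context type = tracked{\us}context ]
\end{indentingcode}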

There are several ways to provide the initial context information.
You may write a
\index{GET{\us}CONTEXT}\index{Macros}\agcode{GET{\us}CONTEXT} macro which
sets the context stack variables directly.  Using the
\index{CONTEXT}\index{Macros}\agcode{CONTEXT} macro defined below, and
assuming your context type has line, column and pointer fields, you
could define \agcode{GET{\us}CONTEXT} as follows:

\begin{indentingcode}{0.4in}
\#define GET{\us}CONTEXT CONTEXT.pointer = PCB.pointer,{\bs}
         CONTEXT.line = PCB.line,{\bs}
         CONTEXT.column = PCB.column
\end{indentingcode}

If you are using \agparam{pointer input}, you must write a
\agcode{GET{\us}CONTEXT} macro to save context information.  If you use a
\index{GET{\us}INPUT}\index{Macros}\agcode{GET{\us}INPUT} macro or have an
event-driven parser, you may either store values directly into
\index{input{\us}context}\index{PCB}\agcode{PCB.input{\us}context} when you
develop the input token, or you may write a \agcode{GET{\us}CONTEXT}
macro.  The macro will provide a slight increment in performance.
% XXX say why it's faster (I assume because it won't look up context
% for inputs that don't need it?)

AnaGram provides six macros to enable you to read values in a
convenient manner from the context stack,
\index{cs}\index{PCB}\agcode{PCB.cs}.  Three of these macros are
designed to be used from your parser itself, and three are available
for use from other modules.  The three intended for use in your
parser are:

\begin{itemize}
\item \agcode{CONTEXT}
\item \agcode{RULE{\us}CONTEXT}
\item \agcode{ERROR{\us}CONTEXT}
\end{itemize}

These macros are defined at the beginning of your parser file, so they
may be used anywhere within your parser.

\index{CONTEXT}\index{Macros}\agcode{CONTEXT}
can be used to read or write the current top of the context stack as
indexed by \index{PCB}\agcode{PCB.ssx}.  When your parser is executing
a reduction procedure for a particular grammar rule, \agcode{CONTEXT}
will evaluate to the value of the input context as it was just before
the very first token in the rule.  The definition of \agcode{CONTEXT}
is:

\begin{indentingcode}{0.4in}
\#define CONTEXT (PCB.cs[PCB.ssx])
\end{indentingcode}

\index{RULE{\us}CONTEXT}\index{Macros}\agcode{RULE{\us}CONTEXT} can be used
within a reduction procedure to get the context for any element within
the rule being reduced.  For example, \agcode{RULE{\us}CONTEXT[0]} is the
context of the first element in the rule, \agcode{RULE{\us}CONTEXT[1]} is
the context of the second element in the rule, and so on.
\agcode{RULE{\us}CONTEXT[0]} is exactly the same as \agcode{CONTEXT}.

% XXX There should be a way to address the context of tokens in a 
% rule by the symbolic names we've bound to them.

The definition of \agcode{RULE{\us}CONTEXT} is:

\begin{indentingcode}{0.4in}
\#define RULE{\us}CONTEXT (\&(PCB.cs[PCB.ssx]))
\end{indentingcode}

As an example, let us suppose that we are writing a parser to read a
parameter file for a program.  Let us imagine the following statements
make up a part of our syntax file:

\begin{indentingcode}{0.4in}
\bra
  typedef struct \bra int line, column; \ket location;
  \#define GET{\us}INPUT {\bs}
  PCB.input{\us}code = fgetc(input{\us}file); {\bs}
  PCB.input{\us}context.line = PCB.line; {\bs}
  PCB.input{\us}context.column = PCB.column;
\ket
{}[ context type = location ]\\

parameter assignment
  -> parameter name, '=', number
\end{indentingcode}

Let us suppose that for each parameter we have stored a range of
admissible values.  We have to diagnose an attempt to use an incorrect
value.  We could write our diagnostic message as follows:

\begin{indentingcode}{0.4in}
fprintf(stderr, "Bad value at line \%d, column \%d in "
   "parameter assignment at line \%d, column \%d",
   RULE{\us}CONTEXT[2].line,
   RULE{\us}CONTEXT[2].column,
   CONTEXT.line,
   CONTEXT.column);
\end{indentingcode}

This diagnostic message would give our user the exact location both of
the bad value and of the beginning of the statement that contained the
bad value.

\index{ERROR{\us}CONTEXT}\index{Macros}\agcode{ERROR{\us}CONTEXT} can be
used within a
\index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR} macro to
find the context of an error if you have turned on the
\index{Error frame}\index{Configuration switches}\agparam{error frame}
and
\index{Diagnose errors}\index{Configuration switches}
\agparam{diagnose errors}
switches.  AnaGram itself uses this facility when reading your syntax
file: it tracks context with a structure consisting of line and column
numbers, and when it encounters an error such as an end of file inside
a comment, it uses the \agcode{ERROR{\us}CONTEXT} macro to report the
line and column at which the comment began.
The definition of \agcode{ERROR{\us}CONTEXT} is:

\begin{indentingcode}{0.4in}
\#define ERROR{\us}CONTEXT (PCB.cs[PCB.error{\us}frame{\us}ssx])
\end{indentingcode}
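
As an illustration, assuming the \agparam{error frame} and
\agparam{diagnose errors} switches are on and that your context type
has \agcode{line} and \agcode{column} fields as in the examples above,
a \agcode{SYNTAX{\us}ERROR} macro might report where the offending
construct began.  This is only a sketch:

\begin{indentingcode}{0.4in}
\#define SYNTAX{\us}ERROR {\bs}
  fprintf(stderr, "Error in construct beginning at line \%d, " {\bs}
     "column \%d{\bs}n", ERROR{\us}CONTEXT.line, ERROR{\us}CONTEXT.column)
\end{indentingcode}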

Three similar macros are also available for more general use:

\begin{itemize}
\item \index{PCONTEXT}\index{Macros}\agcode{PCONTEXT(pcb)}
\item \index{PRULE{\us}CONTEXT}\index{Macros}\agcode{PRULE{\us}CONTEXT(pcb)}
\item \index{PERROR{\us}CONTEXT}\index{Macros}\agcode{PERROR{\us}CONTEXT(pcb)}
\end{itemize}

These macros are identical in function to the corresponding macros in
the first class.  The only difference is that they take the name of a
parser control block, \agcode{pcb}, as an argument, and that AnaGram
puts their definitions in the parser header file, so that they can be
used in modules other than the parser itself.  Since these macros are
not specific to any one parser, the definitions are conditional, so
they will only be defined once in a given module, even if you include
header files corresponding to several parsers.  The definitions of
these macros are as follows:

\begin{indentingcode}{0.4in}
\#define PCONTEXT(pcb) (pcb.cs[pcb.ssx])
\#define PRULE{\us}CONTEXT(pcb) (\&(pcb.cs[pcb.ssx]))
\#define PERROR{\us}CONTEXT(pcb) (pcb.cs[pcb.error{\us}frame{\us}ssx])
\end{indentingcode}

Note that since the context macros only make sense when called from a
reduction procedure or an error procedure, there are not many
occasions to use these macros.  The most common situation would be
when you have compiled the bulk of the code for your reduction
procedures in a separate module.

Remember that \agcode{PRULE{\us}CONTEXT}, because it identifies an array
rather than a value, requires a subscript.  For an example, let us
rewrite the diagnostic message given above for
\agcode{RULE{\us}CONTEXT} using \agcode{PRULE{\us}CONTEXT}, assuming
that the name of our parser control block is \agcode{ana{\us}pcb}:

\begin{indentingcode}{0.4in}
fprintf(stderr, "Bad value at line \%d, column \%d in "
   "parameter assignment at line \%d, column \%d",
   PRULE{\us}CONTEXT(ana{\us}pcb)[2].line,
   PRULE{\us}CONTEXT(ana{\us}pcb)[2].column,
   PCONTEXT(ana{\us}pcb).line,
   PCONTEXT(ana{\us}pcb).column);
\end{indentingcode}

\subsection{Coverage Analysis}
\index{Coverage analysis}

AnaGram has simple facilities for helping you determine the adequacy
of your test suites.  The
\index{Rule coverage}\index{Configuration switches}
\agparam{rule coverage} configuration switch
controls these facilities.  When you set \agparam{rule coverage},
AnaGram includes code in your parser to count the number of times the
parser identifies each rule in your grammar.  AnaGram also provides
procedures you can use to write these counts to a file and accumulate
them over multiple executions of your parser.  Finally, it provides a
window where you may inspect the counts to see the extent to which
your tests have covered the options in your grammar.

To maintain the counts, AnaGram declares, at the beginning of your
parser, an integer array, whose name is created by appending
\agcode{{\us}nrc} to the name of your parser.  The array contains one
counter for each rule you have defined in your grammar.  There are no
entries for the auxiliary rules that AnaGram creates to deal with set
overlaps or disregard statements.  In order to identify positively all
the rules that the parser reduces, AnaGram turns off certain
optimization features in your parser.  Therefore, a parser that has
the \agparam{rule coverage} switch enabled will run slightly slower
than one with the switch off.

AnaGram also provides procedures to write the counts to a file and to
initialize the counts from a file.  The procedures are named by
appending \agcode{{\us}write{\us}counts} and \agcode{{\us}read{\us}counts}
respectively to the name of your parser.  Thus, if your parser is
called \agcode{ana}, the procedures are called
\agcode{ana{\us}write{\us}counts} and \agcode{ana{\us}read{\us}counts}.
Neither takes any arguments nor returns a value.  To accumulate counts
correctly, you should include calls to the
\index{read{\us}counts}\agcode{read{\us}counts} and
\index{write{\us}counts}\agcode{write{\us}counts} procedures in your
program.  A convenient way to do this is to include statements such as
the following in your main program:

\begin{indentingcode}{0.4in}
ana{\us}read{\us}counts();            /* before calling parser */
atexit(ana{\us}write{\us}counts);     /* write counts at exit  */
\end{indentingcode}

For your convenience, AnaGram defines two macros,
\index{READ{\us}COUNTS}\index{Macros}\agcode{READ{\us}COUNTS} and
\index{WRITE{\us}COUNTS}\index{Macros}\agcode{WRITE{\us}COUNTS}, in your
parser.  They call the \agcode{read{\us}counts} and
\agcode{write{\us}counts} procedures respectively when \agparam{rule
coverage} is set.  Otherwise they are null.  Thus you may code them
into your main program and it will work whether or not the
\agparam{rule coverage} switch is set.  For example,

\begin{indentingcode}{0.4in}
READ{\us}COUNTS;         /* read counts if coverage enabled */
my{\us}parser();         /* call parser */
WRITE{\us}COUNTS;        /* write updated counts */
\end{indentingcode}

The \agcode{write{\us}counts} procedure writes an identifier code and the
counts to a count file.  The name of the count file is given by the
\index{Coverage file name}\index{Configuration parameters}
\agparam{coverage file name} parameter, which defaults to the same name as your
syntax file but with the extension
\index{File extension}\index{nrc}\agfile{.nrc}.  The identifier code
changes each time you modify your syntax file.  The
\agcode{read{\us}counts} procedure attempts to read the count file.  If
it cannot find it, or if the identifier code is out of date, it simply
initializes the counter array to zeroes.  Otherwise, it initializes
the counter array to the values found in the file.

When you run AnaGram and analyze your syntax file, if
\agparam{rule coverage} is set, AnaGram will enable the \agmenu{Rule
Coverage} option on the \agmenu{Browse} menu.  If you select
\agmenu{Rule Coverage}, AnaGram will prepare a \agwindow{Rule
Coverage} window from the rule count file you select.  AnaGram will
warn you if the file you selected is older than the syntax file, since
under those conditions, the coverage file might be invalid.

The \index{Rule Coverage}\index{Window}\agwindow{Rule Coverage} window
shows the count for each rule, the rule number and the text of the
rule.  It is also synched to the syntax file so that you can see the
rule in context.  AnaGram also modifies the display of the
\index{Reduction Procedures}\index{Window}\agwindow{Reduction
Procedures} window so that each procedure descriptor is preceded by
the number of times it has been called.  You can use this display to
verify that all your reduction procedures have been tried.

% XXX having this paragraph here seems confusing
The \index{Trace Coverage}\index{Window}\agwindow{Trace Coverage}
window, created when you use the \agwindow{File Trace} or
\agwindow{Grammar Trace} option, provides information similar to that
provided by \agwindow{Rule Coverage}.  The differences are these:
Optimizations are not turned off for the \agwindow{Trace Coverage}, so
that some rules of length zero or one will not be properly counted.
Also, the \agwindow{Trace Coverage} does not tell you about the
reduction procedures you have tested.

\agwindow{File Trace} can become quite tedious to use if you have very
many semantically determined productions, so in these cases the 
\agparam{rule coverage} approach can give you the information you need
more quickly.

\subsection{Using Precedence Operators}

The conventional syntax for arithmetic expressions used in most
programming languages can be parsed simply by reference to
\index{Operator precedence}\index{Precedence operators}
\agterm{operator precedence}.  Operator precedence refers to
the rules we use to determine the order in which arithmetic operations
should be carried out.  In normal usage, this means that
multiplication and division take precedence over addition and
subtraction, which in turn take precedence over comparison operations.
One can formalize this usage by assigning a numeric \index{Precedence
level}\agterm{precedence level} to each operator, so that the
operations are carried out starting with those of highest precedence
and continuing in order of declining precedence.  When operators have
the same precedence level, such as addition and subtraction operators,
one can decide the order of operation to be left to right or right to
left.  Operators of equal precedence which are to be evaluated left to
right are called \agterm{left associative}.  Those which should be
evaluated right to left are called \agterm{right associative}.  If the
nature of the operators is such that the question should never arise,
they are called \agterm{non-associative}.

AnaGram provides three declarations,
\index{Precedence declarations}\index{Left}\index{Right}\index{Nonassoc}
\agparam{left}, \agparam{right}, and \agparam{nonassoc}, which you can
use to associate precedence levels and associativity with tokens in
your grammar.  The syntax of these statements is given in Chapter 8.

When AnaGram encounters a shift-reduce \index{Conflicts}conflict in
your grammar, it looks to see if the conflict can be resolved by using
precedence and associativity rules.  If so, it applies the rules to
the conflict and records the resolution in the \index{Resolved
Conflicts}\index{Window}\agwindow{Resolved Conflicts} table.

There are two occasions where you should consider using precedence
declarations in your grammar: Where rewriting the grammar to get rid
of a conflict would obscure and complicate the grammar, and where you
wish to try to get a more compact, slightly faster parser by using
precedence rules for parsing arithmetic expressions.

Here is an example of using precedence declarations to parse simple
arithmetic expressions:

\begin{indentingcode}{0.4in}
unary minus = '-'
{}[
  left \bra '+', '-' \ket
  left \bra '*', '/' \ket
  right \bra unary minus \ket
]

exp
  -> number
  -> unary minus, exp
  -> exp, '+', exp
  -> exp, '-', exp
  -> exp, '*', exp
  -> exp, '/', exp
\end{indentingcode}
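
If you want your parser to evaluate the expressions as well, you can
attach reduction procedures to the same rules.  Here is a sketch along
the lines of the \agfile{ffcalc} examples; it assumes a suitable
numeric token type has been configured for the semantic values, and
uses the \agcode{:x} notation to bind the value of a rule element to a
C name:

\begin{indentingcode}{0.4in}
exp
  -> number:n                = n;
  -> unary minus, exp:x      = -x;
  -> exp:x, '+', exp:y       = x + y;
  -> exp:x, '-', exp:y       = x - y;
  -> exp:x, '*', exp:y       = x * y;
  -> exp:x, '/', exp:y       = x / y;
\end{indentingcode}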

A complete working calculator grammar using this syntax,
\agfile{ffcalcx}, can be found in the
\index{examples}\agfile{examples/ffcalc} directory of your
AnaGram distribution.

\subsection{Parser Performance}

The parsers AnaGram generates have been engineered to provide maximum
performance subject to constraints of reliability and robustness.
There are a number of steps you may take, however, to optimize the
performance of your parser.

\paragraph{Standard Stack Frame.}  If your compiler has a switch that
allows you to turn \emph{off} the standard stack frame when you
compile your parser, do so.  Your parser uses a large number of very
small functions which run fastest when your compiler does not use the
standard stack frame.

\paragraph{Error Diagnostic Features.}  If your parser does not need
to diagnose errors, turn off the
\index{Diagnose errors}\index{Configuration switches}
\agparam{diagnose errors} switch.
Turn off the
\index{Lines and columns}\index{Configuration switches}
\agparam{lines and columns} switch if you don't need this information.
If your parser doesn't need a diagnostic, and halts on syntax error,
turn off the
\index{Backtrack}\index{Configuration switches}\agparam{backtrack} switch.
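
If all three conditions apply, a configuration segment along the
following lines turns the switches off:

\begin{indentingcode}{0.4in}
{}[
  \~{}diagnose errors
  \~{}lines and columns
  \~{}backtrack
]
\end{indentingcode}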

\paragraph{Anti-optimization Switches.}  Certain switches de-optimize
your parser for various reasons.  These switches,
\index{Traditional engine}\index{Configuration switches}
\agparam{traditional engine} and
\index{Rule coverage}\index{Configuration switches}
\agparam{rule coverage},
should be turned off once you no longer need their effects.

\paragraph{Other Switches.}  For maximum performance you should use
\index{Pointer input}\index{Configuration switches}\agparam{pointer
input}.  If you can guarantee that your input will not contain
out-of-range characters or tokens, you can turn off
\index{Test range}\index{Configuration switches}\index{Range}
\agparam{test range}.