Mercurial > ~dholland > hg > ag > index.cgi

\chapter{Introduction to Syntax Directed Parsing}

Every programmer has to deal with input data.  The units of data may
consist of single characters from a keyboard or file, unit records
from a file, mouse events in a windowing system, or even more complex
data constructs.  When the processing of a unit of data depends only on
the value of that data, it is usually straightforward.  Usually,
however, the processing depends also on what has preceded, and often
even on what follows, the input under consideration.  Keeping track of
these dependencies in order to know how to process the data is called
\index{Parsing}\agterm{parsing}.

It is often relatively easy to keep track of simple dependencies when
first writing a program.  But as the program develops, as new features
are added, as bugs are fixed, the dependencies often stop being
simple.  Then input processing becomes a headache, since it is hard to
keep track of or even identify all the particular cases.  Changes to
the program cause unexpected problems.  Program maintenance threatens
to get out of control.

Syntax directed parsing is a technique for dealing with input data
streams which keeps you in control of the problem, gives you a high
degree of confidence in your program, makes bugs less likely and makes
program maintenance and the addition of new features an easy task.

To use syntax directed parsing you create a high level description of
the structure of your input data, called a
\index{Grammar}\agterm{grammar}.
A file which contains a grammar is called a
\index{Syntax file}\index{File}\agterm{syntax file}.
A
\index{Parser generator}\agterm{parser generator},
such as AnaGram,
can create, from a syntax file, a function (or program) called a
\index{Parser}\agterm{parser},
written in C or C++.  The parser keeps track of all the dependencies
in your input, and calls certain functions,
\index{Reduction procedure}\agterm{reduction procedures},
to deal with specific units or sequences of data as they are
encountered.

Reduction procedures are functions you write to process your data.
They are linked, in an obvious way in the grammar, to the structures
in your input, so that your parser will call them at precisely the
right times with precisely the right data.  When you write reduction
procedures, you can concentrate entirely on what you have to do with
the data.  You don't have to encumber your code with switches and
tests to determine the structure of your input.

This chapter describes in a general way how you write a grammar and
how a parser works.  In particular, it defines a number of terms which
are useful in talking about grammars and syntax directed parsing.  It
then introduces a number of special features of AnaGram.  More
detailed treatment of writing grammars is given in Chapter 8.
Techniques for building a complete functioning parser are discussed in
Chapter 9.


\section{Describing an Input Sequence}

Writing a grammar consists of describing the acceptable input
sequences for your program.  The vehicle for describing an input
sequence is called a
\index{Production}\agterm{production}.
Productions show how a logical component of the input can be made up
of a sequence of more elementary components.  A production that
describes a date might be written:

\begin{indentingcode}{0.4in}
date
  -> name of month, day, comma, year
\end{indentingcode}

The components of the input are called
\index{Token}\agterm{tokens}.
The sequence of tokens on the \index{Right side}right side of the
production is called a
\index{Grammar rule}\agterm{grammar rule},
or \index{Rule}\agterm{rule} for short.
The individual tokens on the right side of the rule are also called
\index{Rule elements}\index{Rule}\agterm{rule elements}.
The \agterm{rule length} is the number of elements in the rule.

The token on the \index{Left side}left side of the production is
called the
\index{Reduction token}\index{Token}\agterm{reduction token}
for the rule.  \index{Token}Tokens may have
\index{Value}\index{Semantic value}\agterm{semantic values},
as distinguished from \index{Value}\agterm{syntactic values}, which
you can use in your reduction procedures.  For instance, the value of
\agcode{name of month} could be an integer in the range zero to
eleven, or it could be a pointer to an ASCII string.  The value of day
could be an integer in the range one to thirty-one.

A grammar consists of a number of such productions, each of which
describes some component of the input in terms of other components.
It does not take very many productions to describe quite complex input
streams.  A grammar for the C language, for instance, requires about
two hundred productions.

Many people find the term \agterm{production} quite confusing.  It
comes from theoretical linguistics where it is used to describe how
one may produce sequences which correspond to a set of grammatical
rules.  Ironically, the major usage of the idea has been in parsing
where the interest is not so much in creating sequences which satisfy
the grammatical rules as in decoding and analyzing such sequences.
Nonetheless, it is convenient, in the above example, to say that the
token \agcode{date} \agterm{produces} a sequence of tokens consisting
of \agcode{name of month}, \agcode{day}, \agcode{comma}, and
\agcode{year}.  We also say that the sequence \agterm{reduces} to
\agcode{date}.

There may be more than one production to describe a given component,
if there is more than one way it may be represented.  For instance,

\begin{indentingcode}{0.4in}
date
  -> day, name of month, year
\end{indentingcode}

describes another common way of writing a date.  In other words, a
reduction token may produce a number of different grammar rules.

Tokens which appear on the left side of one or more productions are called
\index{Token}\index{Nonterminal token}\agterm{nonterminal tokens}.
Those which appear \emph{only} on the right sides of productions are
called
\index{Token}\index{Terminal token}\agterm{terminal tokens}.
Terminal tokens are the units which actually appear physically in the
input.  Nonterminal tokens are identified when a sequence of tokens
that matches the right side of a production is seen in the input.
When AnaGram analyzes a grammar, it assigns a unique
\index{Token number}\index{Number}\index{Token}\agterm{token number}
to each token it finds in the grammar.

Nonterminal tokens, such as \agcode{date} in the example above, may
appear in any grammar rule just as though they were input tokens.  The
token on the left side of a production can even appear on the right
side as well.  Such a production is called a
\index{Production}\index{Recursive productions}\agterm{recursive}
production.  When a nonterminal token appears on the right side of a
production, it may be represented in this context by \emph{any} of
the grammar rules it produces.  Grammars described in this manner are
called
\index{Context free grammar}\index{Grammar}\agterm{context free grammars}
since there is no contextual constraint on which of the rules that a
token produces can appear in any given context.

Recursive productions may be either left recursive or right recursive.
\agterm{Left recursive} productions are those where the recursively
defined nonterminal appears as the first element in the recursive
rule.  \agterm{Right recursive} productions are those where it is the
last element.  If it appears anywhere between, the production is said
to be \agterm{center recursive}.

Any nonterminal token which has a recursive production must also have
at least one simple, non-recursive production.  Otherwise, it is not
possible to create a finite sequence of terminal tokens from the
nonterminal token.

Recursion may also occur implicitly in a grammar when one of the
tokens on the right side of a production itself has a production
involving the token on the left.  Such implicit recursion sometimes
may involve numerous levels of productions.  Implicit recursion occurs
most commonly in describing constructs such as arithmetic expressions
or the block structure of programming languages.

Clearly, grammars can accommodate multiple levels of structure in the
input sequences they describe.  There must, at the top, be a single
token which encompasses the entire input.  This special token is
variously called the
\index{Grammar token}\index{Configuration parameters}\index{Token}
\agterm{grammar token},
the
\index{Goal token}\agterm{goal token}
or the
\index{Start token}\agterm{start token}.

In AnaGram grammars, terminal tokens need not be declared or otherwise
defined.  Any token name that appears only on the right side of a
production is taken to be a terminal token, and AnaGram presumes that
the input procedure that provides tokens to the parser will be able to
identify them to the parser appropriately.  The details on how to do
this are explained in Chapter 9.

On the other hand, AnaGram allows you to specify terminal tokens
explicitly as ASCII characters, or even as sets of ASCII characters,
right in the grammar.  Thus, you may write \agcode{'0-9'} to represent
the set of ASCII digits, or \agcode{'A-Z'} to represent the set of
upper case letters.  The
\index{Semantic value}\index{Value}\index{Token}semantic value
of such a token is the ASCII \index{Character codes}character code
that actually appears in the input stream.  The rules for representing
\index{Character sets}character sets are given in Chapter 8.  If the
various sets you use in your grammar overlap, they may not properly
represent terminal tokens.  In this case, AnaGram automatically
extends your grammar appropriately.  This ``set partition'' logic is
described in Chapter 6.


\section{How a Parser Works}

The aim of a \index{Parser}parser is to match its input with the full
syntactic structure specified by the productions which make up the
grammar.  The primary component of a parser is an input buffer,
sometimes thought of as a stack, into which tokens are
\agterm{shifted}, or stored sequentially, as they are encountered in
the input.  At the same time that a token is shifted into the input
buffer, its
\index{Semantic value}\index{Token}\index{Value}semantic value is
pushed onto the
\index{Parser value stack}\index{Value stack}\index{Stack}
\agterm{value stack}.
A token is not shifted into the buffer unless it ``makes sense'', that
is, unless it is consistent with the rules of the grammar and with the
input that has preceded it.  If a token does not make sense, the
parser signals a
\index{Syntax error}\index{Errors}\agterm{syntax error}.
In order to determine whether a token makes sense, the parser has a
sort of decision table which provides a list of acceptable tokens for
each of a number of
\index{Parser}\index{State}\agterm{states}.
The table also specifies what the parser is to do with each acceptable
token.  When the table indicates that a token is to be shifted into
the buffer, it also specifies a new state.  The parser stacks the
current state on a
\index{Parser state stack}\index{State stack}\index{Stack}
\agterm{state stack}
and jumps to the new state.  Thus every time a token is shifted into
the input buffer, a state number is pushed onto the state stack.  For
each state of the parser, excepting only the initial state, there is a
unique token which will cause a jump to that state.  This token is
called the
\index{Characteristic token}\index{Token}\agterm{characteristic token}
for the state.

When the rightmost, or most recent, tokens in the input buffer match
the right side of a production precisely, the parser \emph{may}
replace the tokens that match the rule with a single token, the token
on the left side of the production.  This process of replacing a
sequence of tokens with a single token is called
\index{Reduction}\agterm{reduction}.
The token that replaces the sequence of tokens is called the
\index{Reduction token}\index{Token}\agterm{reduction token}.

The actual mechanism of the reduction is quite important.  At the same
time that input tokens are removed from the input buffer, state
numbers are popped from the state stack, so that when all input tokens
matching the rule have been removed, the parser
\index{Parser}\index{State}state has been restored to the value it had
at the time the first token in the rule was seen.  As state numbers
are popped from the state stack, token values are popped from the
value stack.  If the rule has a reduction procedure, temporary
variables are loaded with the values popped from the stack and the
reduction procedure is called.  The reduction token is now shifted
into the input buffer just as though it were an input token.  If the
reduction procedure returned a result it is shifted into the value
stack as the value of the reduction token.  The parser stacks the
current state again and jumps to a new state as determined by the
parser tables.

If there are no errors in your input, when the last token has been
read from the input, shifted into the input buffer, and reductions
performed, there will be precisely one token in the input buffer: the
\index{Grammar token}\index{Goal token}grammar, or goal, token which
describes your entire input.  At this point your parser declares
itself finished.

Reductions do not necessarily occur every time a rule matches the
tokens in the input buffer.  If the reduction token does not make
sense, that is, if it is not consistent with the rules of the grammar
and with the input that has preceded it, the parser will not perform
the reduction.  Suppose there are tokens in the input which match one
of the rules given above for \agcode{date}.  A reduction will not
occur unless \agcode{date} is one of the tokens the parser is actually
looking for at that stage in the input.

Even when the reduction token would make sense, there are still
situations where the reduction would not take place.  Suppose a
grammar includes the two productions given above for \agcode{date} as
well as the following:

\begin{indentingcode}{0.4in}
date
  -> name of month, day
\end{indentingcode}

This production is the same as the first, but with no year
specification.  We often write dates without specifying the year
explicitly.  The year is usually understood from context.  When a
parser directed by this grammar has encountered \agcode{name of month}
and \agcode{day}, it can't tell without looking further whether it has
a short form or a long form date.  In such a circumstance, the parser
looks at the next following token, which is called a
\index{Lookahead token}\index{Token}\agterm{lookahead token}.
If the lookahead token is a comma, then, in the absence of other
productions, the input is a long form date.  If the lookahead token is
not a comma, then the input is certainly not a long form date, and it
is proper to reduce the short form production.

Suppose the lookahead token were a comma and the grammar were also to
contain the following production:

\begin{indentingcode}{0.4in}
appointment
  -> date, comma, time
\end{indentingcode}

Since a comma can follow date, according to this rule, and can also
follow day according to the first production, it is impossible to
determine, simply by looking at the lookahead token, whether the date
was being given in short form or long form.  One would have to look
beyond the comma to see if what follows the comma matches the rules
for time or for year.  Although it is possible to build parsers which
can do this, it is not generally feasible.  This situation is called a
\index{Shift-reduce conflict}\index{Conflicts}
\agterm{shift-reduce conflict},
because the parser cannot decide simply on the basis of the lookahead
token whether to reduce the short form rule or to ignore it and shift
the comma into the input buffer.

AnaGram finds and warns you about the conflicts in your grammar when
it analyzes your grammar.  To do this, it checks all states to find
all completed rules.  A
\index{Completed rule}\index{Rule}\agterm{completed rule}
is a rule that is completely matched by the rightmost tokens in the
input buffer.  For each such rule, AnaGram determines all of the
terminal tokens which could follow once that rule has been reduced.
These tokens are sometimes called the
\index{Token}\agterm{reducing tokens}
for the given rule in the given state.  (Note that ``reducing tokens''
are quite different from ``reduction tokens''.)  AnaGram checks each
state for shift-reduce conflicts by comparing the set of tokens which
can be shifted against all sets of reducing tokens for that state.  If
there is an overlap, there is a conflict.  When AnaGram finds a
conflict, it prepares a detailed report on the conflict.  This
information is then available in the \agwindow{Conflicts} window.  A
shift-reduce conflict does not prevent you from building a parser.  If
you build a parser for a grammar which has such a shift-reduce
conflict, the parser will resolve the issue by shifting the lookahead
token into the input buffer and ignoring the reduction.

There is another type of conflict involving reductions.  Suppose a
given state has more than one
\index{Completed rule}\index{Rule}completed rule.
AnaGram then checks the sets of reducing tokens for all the completed
rules.  If there is an overlap, there is no way for AnaGram to
determine which rule should be reduced simply on the basis of the
\index{Token}\index{Lookahead token}lookahead token.
This situation is called a
\index{Reduce-reduce conflicts}\index{Conflicts}
\agterm{reduce-reduce conflict}.
AnaGram will also diagnose reduce-reduce conflicts in the Conflicts
window.  If you decide to ignore a reduce-reduce conflict and ask
AnaGram to build a parser, the parser will choose the rule AnaGram
encountered first when it read your grammar.

\index{Conflicts}Conflicts are not necessarily errors.  Sometimes you
can build a more compact, faster parser by deliberately introducing
conflicts.  You can rely on AnaGram's default resolution of the
conflicts, or you may use operator precedence to resolve the
conflicts.  The use of operator precedence is described in Chapter 9.

AnaGram has extensive facilities for identifying and correcting
unwanted conflicts.  These facilities are described in Chapter 7.

Traditionally, parsers are built using only shift and reduce actions
as described above.  The parsers AnaGram builds normally use a number
of more complex actions, each of which is equivalent to a number of
shift and reduce actions.  These complex actions permit AnaGram to
build faster, more compact parsers.  If you wish, however, you may
restrict AnaGram to using only shift and reduce actions by setting the
\index{Traditional engine}\index{Configuration switches}
\agparam{traditional engine} configuration switch.  See Appendix A,
Configuration Parameters.


\section{A Note on Notation}

\index{Context free grammar}\index{Grammar}Context free grammars have
been traditionally represented in the literature using
\index{Backus-Naur Form}\agterm{Backus-Naur Form}, or
\index{BNF}\agterm{BNF}.
In Backus-Naur Form, certain characters, called metacharacters, are
used to punctuate productions and all other printable characters are
taken to represent themselves literally.  Named tokens are denoted by
enclosing the name within angle brackets ($< >$).  The left side of a
production is distinguished from the right side by the
% XXX s/characters/symbol/
characters $::=$.
If several productions have the same left side, the vertical
bar $|$ is used to separate them.  The elements of a grammar
rule are simply juxtaposed to indicate that one follows another.
Blanks are ignored.  Thus, in BNF, the first production given for
\agcode{date}, above, would be:

% ~ is a hard space
$$
<date>~~::=~~<name~of~month>~~<day>~~,~~<year>
$$

AnaGram uses a notation more consonant with ordinary programming
usage.  Thus token names need not be bracketed and literal characters
must be appropriately quoted.  The elements of rules are joined by
commas.  Using this approach, there is no need for metacharacters and
it becomes possible to make a number of useful extensions to the
notation.  The detailed rules for writing grammars using AnaGram's
notation are given in Chapter 8.


\section{Reduction Procedures}
\index{Reduction procedure}

Of course, the reason for parsing an input stream is to interpret the
data in the stream and to process it in some useful manner.  The
primary tool for doing this is the \agterm{reduction procedure}.  A
reduction procedure is a piece of C code that is executed when a
particular grammar rule is reduced.  Often, a reduction procedure
calculates a value which becomes the
\index{Value}\index{Semantic value}semantic value
of the
\index{Token}\index{Reduction token}reduction token.
The input to the reduction procedure consists of the values of the
tokens that make up the grammar rule.

AnaGram allows you to assign C variable names to the tokens in the
grammar rule, so that you can refer to them in the reduction
procedure.  To assign a C variable name, simply follow the token in
the rule with a colon and the C variable name.  Simple reduction
procedures can be written as C or C++ expressions.  At the end of the
rule, write an equal sign and an expression, terminated by a
semicolon.  For example:

\begin{indentingcode}{0.4in}
(int) hex digit
  -> '0-9':d = d-'0';
  -> 'a-f':d = d-'a'+10;
  -> 'A-F':d = d-'A'+10;
\end{indentingcode}

When any one of these rules is matched, the value of the token in the
rule is assigned to the temporary variable \agcode{d}.  The expression
to the right of the equal sign is evaluated and the value of the
expression is stored as the value of \agcode{hex digit}, which has
been declared to be an \agcode{int}.  These productions define
hexadecimal digits as ASCII characters, and calculate the binary value
of the digit in terms of the ASCII character code, \agcode{d}.  The
binary value becomes the value of \agcode{hex digit}.  Hexadecimal
digits can be combined to make hexadecimal numbers by writing the
following productions:

\begin{indentingcode}{0.4in}
(int) hex number
  -> hex digit
  -> hex number:n, hex digit:d = 16*n+d;
\end{indentingcode}

%In this example, note that if you do not specify a reduction procedure for
%a grammar rule, the value of the reduction token will be taken to be the
%value of the first token in the rule.  Thus, it is not necessary to provide
%a reduction procedure in the first production for hex number.  The second
%production recursively defines the value of hexadecimal numbers that have
%more than one digit.

There are several important points to notice in this example.  First,
reduction procedures are executed ``from the bottom up''.  That is,
the reduction procedure for \agcode{hex digit} is executed before any
reduction procedure for \agcode{hex number}.  Second, if there is no
reduction procedure for a production, the value of the first token in
the rule is assigned to the reduction token.  Thus, it is not
necessary to provide a reduction procedure in the first production for
\agcode{hex number}.  Third, the reduction procedures for recursive
productions are always executed \emph{after} the reduction procedure,
if any, for the non-recursive production which begins the recursion.
Finally, when an input sequence is described using left recursion, as
in this example, the elements of the sequence are processed left to
right.

%
% XXX this is a bad example as you never want to read floats this way!
% Can we think of something else?
%
If you wish to process the elements of a sequence right to left, you
may use right recursion.  For example, it is sometimes convenient to
define the fraction part of a decimal number thus:

\begin{indentingcode}{0.4in}
(double) fraction part
  -> '0-9':d                  = (d - '0')/10.0;
  -> '0-9':d, fraction part:f = (d - '0' + f)/10.0;
\end{indentingcode}

In this case the leading digits are stored temporarily in the parse
stack, and then the fraction part is evaluated right to left only when
the last digit has been found.

Reduction procedures can be more complex than simple expressions.  After the
equal sign you may include an arbitrary block of C or C++ code, enclosed in
braces, \agcode{\bra \ket}.  To return a
\index{Semantic value}\index{Token}\index{Value}semantic value
for the reduction token simply use a return statement.  Of course,
reduction procedures have the full resources of C or C++ at their
disposal.  They may set and interrogate global variables and may call
functions freely.  Reduction procedures are usually implemented as C
functions.  Very simple procedures may be implemented as macros,
unless you have disabled this option by turning off the
\index{Macros}\index{Allow macros}\agparam{allow macros} configuration
switch.  See Chapter 8 for the rules for writing reduction procedures.
When AnaGram builds a parser it copies all of the reduction procedures
you have defined to the parser file, and includes code to call them at
the right time.

Since the reduction procedures you write will probably need some
support code, such as \agcode{\#include} statements and declarations,
you may incorporate C or C++ code into your syntax file at any point.
You need only enclose it in braces, \agcode{\bra \ket}.  Such code
is called
\index{Embedded C}\agterm{embedded C}.
All embedded C code is also copied to the \index{File}parser file, and
\emph{precedes} all of your reduction procedures.


\section{Building a Parser}
\index{Building a Parser}\index{Parser}

In order to round out a parser into a functioning program it needs
input procedures, as well as error diagnosis and recovery
capabilities.  AnaGram has a number of options available which give
you a high degree of flexibility in configuring a parser to suit your
particular needs.  All of the options are provided with reasonable
defaults, so that you can safely disregard any option until you need
the features it provides.  The following paragraphs summarize the
options AnaGram provides.  These topics are discussed in greater
detail in Chapter 9.

\paragraph{Invoking a Parser.}
\index{Parser}
Normally, AnaGram configures parsers as functions which you can call
from elsewhere in your program.  In this situation, you call the
parser, it processes its input, and returns either when it has
finished or cannot proceed because of errors.  Alternatively, if you
set the
\index{Event driven}\agparam{event driven}
configuration switch, your parser will be configured so that you have
two procedures to call: an \index{Initializer}\agterm{initializer} and
a \index{Parser}\agterm{parser}.  In the event driven configuration
you start the parse by calling the initializer and then you call the
parser once for each unit of input.  Using the event driven mode makes
it quite easy to configure a parser as a filter and to chain several
parsers together so that the output from one parser is the input to
the next.  Such
\index{Parsing}\index{Multi-stage parsing}\agterm{multi-stage parsing}
is a convenient way to deal with complex input that is not context free.

\paragraph{Communicating with a Parser.}
The complete status of your parser is contained in a single data
structure called a
\index{Parser control block}\index{Control block}
\agterm{parser control block}.
All communications with a parser take place via the parser control
block.  \index{Input procedures}Input procedures must place input data
in the appropriate field in the parser control block.  When the parse
is complete or encounters an error, the results of the parse may be
found in the parser control block.  When AnaGram builds a parser it
includes, in the \index{File}\index{Header file}header file it
generates, a \agcode{typedef} statement which defines the structure of
the parser control block.

\paragraph{Parser Input.}
\index{Input}\index{Parser}\index{Input procedures}
The input to your parser may be either characters read directly from
an input stream, or tokens accumulated by a pre-processor or lexical
scanner.  The way you provide input to your parser depends on how your
grammar defines input tokens and also on whether or not you have
requested an event driven parser.

If your parser is event driven, you provide its input by storing the
input code and the input value, if any, into the parser control block
and calling the parser.

If you have set the
\index{Pointer input}\agparam{pointer input}
\index{Configuration switches}configuration switch
in your syntax file, you simply initialize the pointer field in your
parser control block before you call your parser.  Your parser will
then read its input directly from memory by simply incrementing the
pointer as necessary.

Otherwise, your parser will invoke a macro called
\index{\agcode{GET{\us}INPUT}}\index{Macros}\agcode{GET{\us}INPUT} every
time it needs more input.  You may define \agcode{GET{\us}INPUT}
according to your needs.  You can define it so that it calls an input
function, or you can define it so that it executes in-line code each
time it is invoked.  Your \agcode{GET{\us}INPUT} macro should store its
input code in the
\index{\agcode{input{\us}code}}\agcode{input{\us}code}
field of the
\index{PCB}parser control block.
If you do not write a \agcode{GET{\us}INPUT} macro, AnaGram will provide
one which will read characters from \agcode{stdin}.

If your grammar does not define terminal tokens in terms of ASCII
characters or external token numbers, your \agcode{GET{\us}INPUT} will
have to determine the appropriate internal
\index{Token number}\index{Token}\index{Number}token number
for each input token.  To assist you in determining these token
numbers AnaGram provides a \agcode{typedef enum} statement in the
\index{File}\index{Header file}header file.  You can then use named
constants to specify the internal token numbers for the input tokens.

\paragraph{Error Handling.}
\index{Error handling}\index{Error diagnosis}\index{Error recovery}
Your parser must be prepared to deal with erroneous input.  There are
two aspects to error handling: diagnosing the error, and recovering
from the error.  On encountering an error, your parser will invoke a
macro called
\index{\agcode{SYNTAX{\us}ERROR}}\index{Macros}\agcode{SYNTAX{\us}ERROR}.
If you do not provide a definition for \agcode{SYNTAX{\us}ERROR}, AnaGram
will provide a simple error diagnostic.  AnaGram can also provide
automatic error diagnoses which pinpoint the location of the error.

\index{Resynchronization}
AnaGram provides two options for error recovery:
\agterm{error token resynchronization} and
\agterm{automatic resynchronization}.  These are
techniques for getting your parser back in synch with its input so it
can proceed after encountering an error.  Normally, if you do not
select one of these recovery techniques, your parser will terminate
when it encounters an error; however, you may override this default if
you wish and provide your own recovery technique.


\section{Special Features of AnaGram}

AnaGram provides a number of special features not generally available
in other parsing systems.  These features extend the applicability of
syntax directed parsing, make it easier for you to write and debug
your grammars, and make it easier to interface your parser to other
parsers and to your main program.  The following paragraphs provide a
brief introduction to these features.  Their use is described more
fully in the appropriate chapters in this manual.

\paragraph{Configuration Parameters.}
\index{Configuration parameters}\index{Parameters}
AnaGram has a number of configuration switches and parameters which
you can use to control the way AnaGram works in various circumstances.
Since programmers often have different requirements, AnaGram strives
to be as flexible as possible.

Configuration parameters may be set in
\index{Configuration file}\index{File}configuration files, or in your
syntax file.  The rules for writing configuration parameters are given
in Chapter 8.  A summary of all configuration parameters is given in
Appendix A.  The use of particular configuration parameters is
described as appropriate throughout this manual.

\paragraph{Character Sets}.
\index{Character sets}
In your grammar, if you write terminal tokens explicitly as characters
or as sets of ASCII characters, AnaGram will set up your parser to
accept ASCII characters directly, so you will not need to build a
\index{Lexical scanner}lexical scanner and interface it to
your parser.  Character sets may be specified in terms of ranges of
characters, such as \agcode{'a-z'}, as unions, intersections or
differences of sets, denoted by \agcode{+}, \agcode{\&}, or \agcode{-}
respectively, or as the complement of a set, denoted by \agcode{\~{}}.
You may also specify a set consisting of a single character either
with a literal character or with a numeric specification.  The
detailed rules for character set expressions are given in Chapter 8.
Some useful character sets are:

\begin{itemize}
\item Letters: \agcode{'a-z' + 'A-Z'}
\item Digits: \agcode{'0-9'}
\item DOS end of file: \agcode{-1 + \^{}Z}
\item Neither end of file nor end of line:
\agcode{\~{}(-1 + \^{}Z
 + '{\bs}n')}
\end{itemize}

\paragraph{Keywords.}
\index{Keywords}
AnaGram makes special provision for the use of keywords in the parsers
it builds.  In your grammar, you may specify a keyword as a quoted
string.  AnaGram will incorporate special string matching logic in
your parser to recognize keywords in the parser input.  When a keyword
is recognized, it is treated as an individual token, completely
independent of the characters which make up the keyword.  Some
examples of keywords: \agcode{"let"}, \agcode{".EQ."},
\agcode{"<>"}, \agcode{"/*"}.  The
rules for the use of keyword strings are given in Chapter 8.

Keywords are a powerful and useful tool, but they are not completely
consistent with the assumptions underlying the use of context-free
grammars.  Thus under certain circumstances it is possible for a
parser built from a grammar that uses keywords to fail
% XXX s/parse correctly/correctly parse/
to parse correctly text that
appears to be consistent with the grammar.  Such a situation is called a
\index{Keyword anomalies}\index{Anomaly}keyword anomaly.
AnaGram detects and diagnoses keyword anomalies in your grammar, and
provides tools to help you find and correct the problem.  Many such
apparent anomalies, however, are completely innocuous.  Keyword
anomalies are discussed in greater detail in Chapter 7.

\paragraph{Virtual Productions.}
\index{Virtual productions}\index{Production}
Virtual productions are a powerful, shorthand notation for specifying
optional or repetitive input.  Using a virtual production can often
spare you the trouble of writing out four or five regular productions
and naming them.

Most of the notation for virtual productions should be familiar since
it has been used in programming manuals for years.  The basic forms
are these:

\index{\agcode{[]}}
\index{\_opb\_clb}  % {}, apparently. XXX?
\index{\agcode{...}}\index{Ellipsis}
\index{\agcode{?}}
\index{\agcode{?...}}
\index{\agcode{/...}}

% XXX figure out how to not have to encode the line breaks manually.
% I think we need one of the fancier extension forms of tabular.

\begin{tabular}[t]{ll}

\agcode{[]}&enclosing a list of one or more grammar rules separated
by \agcode{|} characters denotes an optional\\
\phantom{}&choice of one of the rules.\\

\agcode{\bra \ket}&enclosing a list of one or more grammar rules separated by
\agcode{|} characters indicates a required\\
\phantom{}&choice of one of the rules.\\

\agcode{...}&following a token, or following \agcode{[]} or \agcode{\bra \ket},
indicates arbitrary repetition of the previous choice or\\
\phantom{}&token.\\

\agcode{?}&following a token or set expression makes the token optional.\\

\agcode{?...}&following a token or set expression denotes zero or more
repetitions of the token.\\

\agcode{/...}&following \agcode{[]} or \agcode{\bra \ket} indicates arbitrary
repetition of the choice subject to the constraint that\\
\phantom{}&choices must alternate.\\

\end{tabular}

\index{Production}\index{Virtual productions}Virtual productions
are simply syntactic devices.  When AnaGram encounters a virtual
production it translates the virtual production into conventional
productions which show up in your grammar tables.  Virtual productions
are described in Chapter 8.

\paragraph{Definition Statements.}
\index{Statement}\index{Definition statement}
AnaGram provides a mechanism for naming particular character sets,
keywords or virtual productions in your syntax file.  The name you
define can be used freely in place of the thing defined.  Definition
statements have the form:

% XXX why the hell is the first line offset by a small amount?
\begin{indentingcode}{0.4in}
name = \codemeta{character set}
name = \codemeta{virtual production}
name = \codemeta{keyword}
name = \codemeta{immediate action}
name = \codemeta{token name}
\end{indentingcode}

The name may be any name acceptable to AnaGram.  The name can then be
used anywhere you might have used the expression on the right side.
For example:

\begin{indentingcode}{0.4in}
upper case letter = 'A-Z'
lower case letter = 'a-z'
letter = upper case letter + lower case letter
statement list = statement?...
while = "WHILE"
dos eof = \^{}Z
\end{indentingcode}

\paragraph{Disregard Statements.}
\index{Disregard statement}\index{Statement}
If you wish your parser generally to skip over certain characters or
constructs in its input, such as blanks, newlines, or comments, you
may use disregard statements to specify precisely what is to
be skipped and under what circumstances.  You may use any number of
disregard statements and you may cause your parser to disregard any
number of tokens, of whatever complexity.  Any \agparam{disregard}
statements must
be placed in a configuration section.  They may appear anywhere in
your syntax file.

The format of the \agparam{disregard} statement is as follows:

\begin{indentingcode}{0.4in}
disregard \codemeta{token name}
\end{indentingcode}

The rules for using disregard statements are given in Chapter 8.

\paragraph{Lexeme Statements.}
\index{Lexeme statement}\index{Statement}
The lexeme statement is used to fine tune the disregard statement.
Any \agparam{lexeme} statements must be placed in a configuration
section.  They may appear anywhere in your syntax file.

The format of the \agparam{lexeme} statement is as follows:

\begin{indentingcode}{0.4in}
lexeme \bra \codemeta{token list} \ket
\end{indentingcode}

where \textit{token list} is a list of nonterminal tokens, separated
by commas.

The lexeme statement allows you to specify that for the purposes of
the disregard statement, the listed tokens are to be treated as
% XXX make this: indivisible lexical units
lexical units, so that the disregard statement will be inoperative
within the tokens specified.  Lexeme statements are discussed in
Chapter 8.

\paragraph{Semantically Determined Productions.}
\index{Semantically determined production}\index{Production}
Traditionally, syntax directed parsing systems have enforced a rigid
distinction between syntactic and semantic information.  In these
systems, it is not possible to use the knowledge that, for example, a
symbol is a function name and not a variable name in the syntax of a
programming language.  AnaGram provides semantically determined
productions as a mechanism to allow semantic information to control
syntactic analysis.

A semantically determined production has more than one
\index{Token}\index{Reduction token}reduction token
on the left side.  The reduction procedure determines which reduction
token the parser should use.  For example:

\begin{indentingcode}{0.4in}
variable name, function name
  -> symbol:s = identify{\us}symbol(s);
\end{indentingcode}

In this situation, the parser will set the reduction token to the
leftmost token, \agcode{variable name}, before calling the reduction
procedure.  If the reduction procedure finds that \agcode{s} in fact
identifies a function name, it can change the reduction token to
\agcode{function name}.  Using this capability, you can write your grammar in
terms of \agcode{variable name} and \agcode{function name}, just as
though they were syntactically distinguishable.  Further discussion of
semantically determined productions may be found in Chapter 9.

\paragraph{Event Driven Parsers.}
\index{Parser}\index{Event driven parser}
Traditionally, parsers have been constructed as subroutines which are
called to analyze a stream of input and return only when they have
finished parsing the input or have encountered an unrecoverable error.
Such a parser itself calls a subroutine to obtain each unit of input.
There are a number of circumstances where this traditional
architecture is inconvenient.  The most obvious is in situations where
programs must be event-driven, as in many popular windowing systems.
A somewhat less obvious situation is where one parser, often called a
\index{Lexical scanner}\agterm{lexical scanner}, provides input to a
second. In these situations it is convenient to invert the role of the
calling and the called programs.

AnaGram provides the capability of specifying that a parser it builds
should be event-driven.  In this case, the parser consists of an
initialization procedure and an event handler.  The main program calls
the initialization procedure and then calls the event handler with
each input token in turn.  The event handler returns every time the
parser needs more input.  Details on event-driven parsers are given in
Chapter 9.

\paragraph{Context tracking.}
\index{Context tracking}
When you are writing a reduction procedure for a particular grammar
rule, you often need to know the value one or another of your program
variables had at the time the first token in the rule was encountered.
Examples of such variables are:

\begin{itemize}
\item Line or column number
\item Index in an input file
\item Index into an array
\item Counters, as of symbols defined, etc.
\end{itemize}

Such variables can be thought of as representing the ``context'' of
the rule you are reducing.  Sometimes it is possible to incorporate
the values of such variables into the values of reduction tokens, but
this can become quite cumbersome.  AnaGram provides a feature known as
``context tracking'' to deal with this problem.  Details on context
tracking can be found in Chapter 9.

\paragraph{Coverage Analysis.}
\index{Coverage analysis}
AnaGram has facilities to help you determine the adequacy of your test
suites.  If you set appropriate configuration switches, AnaGram will
include code in your parser to count the number of times each rule in
your grammar is reduced or the number of times each reduction
procedure in your grammar is executed.  You can use these counts to
determine the completeness of your testing.  The use of these
facilities is discussed in Chapter 9.

\paragraph{Immediate Actions.}
\index{Action}\index{Immediate action}
AnaGram provides for executing particular functions, called
``immediate actions'', at any point in a grammar rule.  Thus,
computation does not need to wait until the rule is complete.
Immediate actions are most useful when using AnaGram to control an
interactive process.  Immediate actions are discussed in Chapter 8.

\paragraph{Grammar Trace.}
\index{Grammar Trace}\index{Window}\index{Trace}
The \agwindow{Grammar Trace} facility of AnaGram allows you to examine
the workings of your parser in detail.  You can set up a
representation of the
\index{Parser state stack}\index{State stack}\index{Stack}parser state stack
and parser state as they might appear in the course of execution of
your parser.  You can then examine the possible inputs and see how the
state and the state stack change in response to any input you choose.
You can back up and try other options.  You can have several grammar
traces active simultaneously so you can compare the results of
different sequences of input.  For any configuration of the state
stack, you can see what grammar rules the parser is in the process of
matching.

The use of the Grammar Trace is described in Chapter 5.

\paragraph{File Trace.}
\index{Trace}\index{File Trace}\index{Window}
The \agwindow{File Trace} is a facility which allows you to see how
your parser will work on real data.  File Trace is an interactive,
interpretive parser governed by the rules in your grammar.  It lets
you see precisely how your grammar parses a test file.  You can see
the contents of the parse stack at any point in the parse.  You can
watch the parse progress at any level of detail you choose.  If you
wish, you may back up and try again.  When AnaGram runs the
\agwindow{File Trace} or \agwindow{Grammar Trace} it also builds a
\index{Coverage}\index{Trace Coverage}\index{Window}
\agwindow{Trace Coverage} table.

Note that AnaGram normally uses a number of short-cut parsing actions.
To make the parse look like a textbook parse, set the
\index{Traditional engine}\agparam{traditional engine}
\index{Configuration parameters}configuration parameter
in your syntax file.

\paragraph{Aids to Debugging.}
AnaGram provides a number of facilities to help you debug your parser.
It provides numerous tables that summarize your grammar and a
windowing environment to assist you in inspecting and comparing this
data.  To assist in identifying a conflict in your grammar it can
provide a
\index{Trace}\index{Conflict Trace}\index{Window}\agwindow{Conflict Trace},
a pre-built grammar trace showing a sequence of input which will
trigger the conflict.  If your parser encounters a
\index{Syntax error}\index{Errors}syntax error,
you can arrange to have AnaGram build an \agwindow{Error Trace}, a
pre-built grammar trace showing the sequence of input tokens that led
to the syntax error.  These facilities are described in Chapter 5.

\paragraph{Conflict Resolution.}
\index{Conflict}
In order to help you deal with conflicts in your grammar, AnaGram
provides several methods for resolving conflicts.  You may include
\agterm{precedence declarations} to assist in the construction of
simple precedence grammars, you may declare tokens to be
\agparam{sticky}, or you may declare tokens to be
\agterm{subgrammar} tokens.
These techniques are described in Chapter 7.
author	David A. Holland
date	Sun, 26 Apr 2009 17:58:26 -0400
parents	13d2b8934445
children