Mercurial > ~dholland > hg > ag > index.cgi

diff doc/manual/isdp.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author: David A. Holland
date: Sat, 22 Dec 2007 17:52:45 -0500
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/manual/isdp.tex	Sat Dec 22 17:52:45 2007 -0500
@@ -0,0 +1,977 @@
+\chapter{Introduction to Syntax Directed Parsing}
+
+Every programmer has to deal with input data.  The units of data may
+consist of single characters from a keyboard or file, unit records
+from a file, mouse events in a windowing system, or even more complex
+data constructs.  When the processing of a unit of data depends only on
+the value of that data, it is usually straightforward.  Usually,
+however, the processing depends also on what has preceded, and often
+even on what follows, the input under consideration.  Keeping track of
+these dependencies in order to know how to process the data is called
+\index{Parsing}\agterm{parsing}.
+
+It is often relatively easy to keep track of simple dependencies when
+first writing a program.  But as the program develops, as new features
+are added, as bugs are fixed, the dependencies often stop being
+simple.  Then input processing becomes a headache, since it is hard to
+keep track of or even identify all the particular cases.  Changes to
+the program cause unexpected problems.  Program maintenance threatens
+to get out of control.
+
+Syntax directed parsing is a technique for dealing with input data
+streams which keeps you in control of the problem, gives you a high
+degree of confidence in your program, makes bugs less likely and makes
+program maintenance and the addition of new features an easy task.
+
+To use syntax directed parsing you create a high level description of
+the structure of your input data, called a
+\index{Grammar}\agterm{grammar}.
+A file which contains a grammar is called a
+\index{Syntax file}\index{File}\agterm{syntax file}.
+A
+\index{Parser generator}\agterm{parser generator},
+such as AnaGram,
+can create, from a syntax file, a function (or program) called a
+\index{Parser}\agterm{parser},
+written in C or C++.  The parser keeps track of all the dependencies
+in your input, and calls certain functions,
+\index{Reduction procedure}\agterm{reduction procedures},
+to deal with specific units or sequences of data as they are
+encountered.
+
+Reduction procedures are functions you write to process your data.
+They are linked, in an obvious way in the grammar, to the structures
+in your input, so that your parser will call them at precisely the
+right times with precisely the right data.  When you write reduction
+procedures, you can concentrate entirely on what you have to do with
+the data.  You don't have to encumber your code with switches and
+tests to determine the structure of your input.
+
+This chapter describes in a general way how you write a grammar and
+how a parser works.  In particular, it defines a number of terms which
+are useful in talking about grammars and syntax directed parsing.  It
+then introduces a number of special features of AnaGram.  More
+detailed treatment of writing grammars is given in Chapter 8.
+Techniques for building a complete functioning parser are discussed in
+Chapter 9.
+
+
+\section{Describing an Input Sequence}
+
+Writing a grammar consists of describing the acceptable input
+sequences for your program.  The vehicle for describing an input
+sequence is called a
+\index{Production}\agterm{production}.
+Productions show how a logical component of the input can be made up
+of a sequence of more elementary components.  A production that
+describes a date might be written:
+
+\begin{indentingcode}{0.4in}
+date
+  -> name of month, day, comma, year
+\end{indentingcode}
+
+The components of the input are called
+\index{Token}\agterm{tokens}.
+The sequence of tokens on the \index{Right side}right side of the
+production is called a
+\index{Grammar rule}\agterm{grammar rule},
+or \index{Rule}\agterm{rule} for short.
+The individual tokens on the right side of the rule are also called
+\index{Rule elements}\index{Rule}\agterm{rule elements}.
+The \agterm{rule length} is the number of elements in the rule.
+
+The token on the \index{Left side}left side of the production is
+called the
+\index{Reduction token}\index{Token}\agterm{reduction token}
+for the rule.  \index{Token}Tokens may have
+\index{Value}\index{Semantic value}\agterm{semantic values},
+as distinguished from \index{Value}\agterm{syntactic values}, which
+you can use in your reduction procedures.  For instance, the value of
+\agcode{name of month} could be an integer in the range zero to
+eleven, or it could be a pointer to an ASCII string.  The value of day
+could be an integer in the range one to thirty-one.
+
+A grammar consists of a number of such productions, each of which
+describes some component of the input in terms of other components.
+It does not take very many productions to describe quite complex input
+streams.  A grammar for the C language, for instance, requires about
+two hundred productions.
+
+Many people find the term \agterm{production} quite confusing.  It
+comes from theoretical linguistics where it is used to describe how
+one may produce sequences which correspond to a set of grammatical
+rules.  Ironically, the major usage of the idea has been in parsing
+where the interest is not so much in creating sequences which satisfy
+the grammatical rules as in decoding and analyzing such sequences.
+Nonetheless, it is convenient, in the above example, to say that the
+token \agcode{date} \agterm{produces} a sequence of tokens consisting
+of \agcode{name of month}, \agcode{day}, \agcode{comma}, and
+\agcode{year}.  We also say that the sequence \agterm{reduces} to
+\agcode{date}.
+
+There may be more than one production to describe a given component,
+if there is more than one way it may be represented.  For instance,
+
+\begin{indentingcode}{0.4in}
+date
+  -> day, name of month, year
+\end{indentingcode}
+
+describes another common way of writing a date.  In other words, a
+reduction token may produce a number of different grammar rules.
+
+Tokens which appear on the left side of one or more productions are called
+\index{Token}\index{Nonterminal token}\agterm{nonterminal tokens}.
+Those which appear \emph{only} on the right sides of productions are
+called
+\index{Token}\index{Terminal token}\agterm{terminal tokens}.
+Terminal tokens are the units which actually appear physically in the
+input.  Nonterminal tokens are identified when a sequence of tokens
+that matches the right side of a production is seen in the input.
+When AnaGram analyzes a grammar, it assigns a unique
+\index{Token number}\index{Number}\index{Token}\agterm{token number}
+to each token it finds in the grammar.
+
+Nonterminal tokens, such as \agcode{date} in the example above, may
+appear in any grammar rule just as though they were input tokens.  The
+token on the left side of a production can even appear on the right
+side as well.  Such a production is called a
+\index{Production}\index{Recursive productions}\agterm{recursive}
+production.  When a nonterminal token appears on the right side of a
+production, it may be represented in this context by \emph{any} of
+the grammar rules it produces.  Grammars described in this manner are
+called
+\index{Context free grammar}\index{Grammar}\agterm{context free grammars}
+since there is no contextual constraint on which of the rules that a
+token produces can appear in any given context.
+
+Recursive productions may be either left recursive or right recursive.
+\agterm{Left recursive} productions are those where the recursively
+defined nonterminal appears as the first element in the recursive
+rule.  \agterm{Right recursive} productions are those where it is the
+last element.  If it appears anywhere between, the production is said
+to be \agterm{center recursive}.
+
+Any nonterminal token which has a recursive production must also have
+at least one simple, non-recursive production.  Otherwise, it is not
+possible to create a finite sequence of terminal tokens from the
+nonterminal token.
+
+Recursion may also occur implicitly in a grammar when one of the
+tokens on the right side of a production itself has a production
+involving the token on the left.  Such implicit recursion sometimes
+may involve numerous levels of productions.  Implicit recursion occurs
+most commonly in describing constructs such as arithmetic expressions
+or the block structure of programming languages.
+
+Clearly, grammars can accommodate multiple levels of structure in the
+input sequences they describe.  There must, at the top, be a single
+token which encompasses the entire input.  This special token is
+variously called the
+\index{Grammar token}\index{Configuration parameters}\index{Token}
+\agterm{grammar token},
+the
+\index{Goal token}\agterm{goal token}
+or the
+\index{Start token}\agterm{start token}.
+
+In AnaGram grammars, terminal tokens need not be declared or otherwise
+defined.  Any token name that appears only on the right side of a
+production is taken to be a terminal token, and AnaGram presumes that
+the input procedure that provides tokens to the parser will be able to
+identify them to the parser appropriately.  The details on how to do
+this are explained in Chapter 9.
+
+On the other hand, AnaGram allows you to specify terminal tokens
+explicitly as ASCII characters, or even as sets of ASCII characters,
+right in the grammar.  Thus, you may write \agcode{'0-9'} to represent
+the set of ASCII digits, or \agcode{'A-Z'} to represent the set of
+upper case letters.  The
+\index{Semantic value}\index{Value}\index{Token}semantic value
+of such a token is the ASCII \index{Character codes}character code
+that actually appears in the input stream.  The rules for representing
+\index{Character sets}character sets are given in Chapter 8.  If the
+various sets you use in your grammar overlap, they may not properly
+represent terminal tokens.  In this case, AnaGram automatically
+extends your grammar appropriately.  This ``set partition'' logic is
+described in Chapter 6.
+
+
+\section{How a Parser Works}
+
+The aim of a \index{Parser}parser is to match its input with the full
+syntactic structure specified by the productions which make up the
+grammar.  The primary component of a parser is an input buffer,
+sometimes thought of as a stack, into which tokens are
+\agterm{shifted}, or stored sequentially, as they are encountered in
+the input.  At the same time that a token is shifted into the input
+buffer, its
+\index{Semantic value}\index{Token}\index{Value}semantic value is
+pushed onto the
+\index{Parser value stack}\index{Value stack}\index{Stack}
+\agterm{value stack}.
+A token is not shifted into the buffer unless it ``makes sense'', that
+is, unless it is consistent with the rules of the grammar and with the
+input that has preceded it.  If a token does not make sense, the
+parser signals a
+\index{Syntax error}\index{Errors}\agterm{syntax error}.
+In order to determine whether a token makes sense, the parser has a
+sort of decision table which provides a list of acceptable tokens for
+each of a number of
+\index{Parser}\index{State}\agterm{states}.
+The table also specifies what the parser is to do with each acceptable
+token.  When the table indicates that a token is to be shifted into
+the buffer, it also specifies a new state.  The parser stacks the
+current state on a
+\index{Parser state stack}\index{State stack}\index{Stack}
+\agterm{state stack}
+and jumps to the new state.  Thus every time a token is shifted into
+the input buffer, a state number is pushed onto the state stack.  For
+each state of the parser, excepting only the initial state, there is a
+unique token which will cause a jump to that state.  This token is
+called the
+\index{Characteristic token}\index{Token}\agterm{characteristic token}
+for the state.
+
+When the rightmost, or most recent, tokens in the input buffer match
+the right side of a production precisely, the parser \emph{may}
+replace the tokens that match the rule with a single token, the token
+on the left side of the production.  This process of replacing a
+sequence of tokens with a single token is called
+\index{Reduction}\agterm{reduction}.
+The token that replaces the sequence of tokens is called the
+\index{Reduction token}\index{Token}\agterm{reduction token}.
+
+The actual mechanism of the reduction is quite important.  At the same
+time that input tokens are removed from the input buffer, state
+numbers are popped from the state stack, so that when all input tokens
+matching the rule have been removed, the parser
+\index{Parser}\index{State}state has been restored to the value it had
+at the time the first token in the rule was seen.  As state numbers
+are popped from the state stack, token values are popped from the
+value stack.  If the rule has a reduction procedure, temporary
+variables are loaded with the values popped from the stack and the
+reduction procedure is called.  The reduction token is now shifted
+into the input buffer just as though it were an input token.  If the
+reduction procedure returned a result it is shifted into the value
+stack as the value of the reduction token.  The parser stacks the
+current state again and jumps to a new state as determined by the
+parser tables.
+
+If there are no errors in your input, when the last token has been
+read from the input, shifted into the input buffer, and reductions
+performed, there will be precisely one token in the input buffer: the
+\index{Grammar token}\index{Goal token}grammar, or goal, token which
+describes your entire input.  At this point your parser declares
+itself finished.
+
+Reductions do not necessarily occur every time a rule matches the
+tokens in the input buffer.  If the reduction token does not make
+sense, that is, if it is not consistent with the rules of the grammar
+and with the input that has preceded it, the parser will not perform
+the reduction.  Suppose there are tokens in the input which match one
+of the rules given above for \agcode{date}.  A reduction will not
+occur unless \agcode{date} is one of the tokens the parser is actually
+looking for at that stage in the input.
+
+Even when the reduction token would make sense, there are still
+situations where the reduction would not take place.  Suppose a
+grammar includes the two productions given above for \agcode{date} as
+well as the following:
+
+\begin{indentingcode}{0.4in}
+date
+  -> name of month, day
+\end{indentingcode}
+
+This production is the same as the first, but with no year
+specification.  We often write dates without specifying the year
+explicitly.  The year is usually understood from context.  When a
+parser directed by this grammar has encountered \agcode{name of month}
+and \agcode{day}, it can't tell without looking further whether it has
+a short form or a long form date.  In such a circumstance, the parser
+looks at the next following token, which is called a
+\index{Lookahead token}\index{Token}\agterm{lookahead token}.
+If the lookahead token is a comma, then, in the absence of other
+productions, the input is a long form date.  If the lookahead token is
+not a comma, then the input is certainly not a long form date, and it
+is proper to reduce the short form production.
+
+Suppose the lookahead token were a comma and the grammar were also to
+contain the following production:
+
+\begin{indentingcode}{0.4in}
+appointment
+  -> date, comma, time
+\end{indentingcode}
+
+Since a comma can follow date, according to this rule, and can also
+follow day according to the first production, it is impossible to
+determine, simply by looking at the lookahead token, whether the date
+was being given in short form or long form.  One would have to look
+beyond the comma to see if what follows the comma matches the rules
+for time or for year.  Although it is possible to build parsers which
+can do this, it is not generally feasible.  This situation is called a
+\index{Shift-reduce conflict}\index{Conflicts}
+\agterm{shift-reduce conflict},
+because the parser cannot decide simply on the basis of the lookahead
+token whether to reduce the short form rule or to ignore it and shift
+the comma into the input buffer.
+
+AnaGram finds and warns you about the conflicts in your grammar when
+it analyzes your grammar.  To do this, it checks all states to find
+all completed rules.  A
+\index{Completed rule}\index{Rule}\agterm{completed rule}
+is a rule that is completely matched by the rightmost tokens in the
+input buffer.  For each such rule, AnaGram determines all of the
+terminal tokens which could follow once that rule has been reduced.
+These tokens are sometimes called the
+\index{Token}\agterm{reducing tokens}
+for the given rule in the given state.  (Note that ``reducing tokens''
+are quite different from ``reduction tokens''.)  AnaGram checks each
+state for shift-reduce conflicts by comparing the set of tokens which
+can be shifted against all sets of reducing tokens for that state.  If
+there is an overlap, there is a conflict.  When AnaGram finds a
+conflict, it prepares a detailed report on the conflict.  This
+information is then available in the \agwindow{Conflicts} window.  A
+shift-reduce conflict does not prevent you from building a parser.  If
+you build a parser for a grammar which has such a shift-reduce
+conflict, the parser will resolve the issue by shifting the lookahead
+token into the input buffer and ignoring the reduction.
+
+There is another type of conflict involving reductions.  Suppose a
+given state has more than one
+\index{Completed rule}\index{Rule}completed rule.
+AnaGram then checks the sets of reducing tokens for all the completed
+rules.  If there is an overlap, there is no way for AnaGram to
+determine which rule should be reduced simply on the basis of the
+\index{Token}\index{Lookahead token}lookahead token.
+This situation is called a
+\index{Reduce-reduce conflicts}\index{Conflicts}
+\agterm{reduce-reduce conflict}.
+AnaGram will also diagnose reduce-reduce conflicts in the Conflicts
+window.  If you decide to ignore a reduce-reduce conflict and ask
+AnaGram to build a parser, the parser will choose the rule AnaGram
+encountered first when it read your grammar.
+
+\index{Conflicts}Conflicts are not necessarily errors.  Sometimes you
+can build a more compact, faster parser by deliberately introducing
+conflicts.  You can rely on AnaGram's default resolution of the
+conflicts, or you may use operator precedence to resolve the
+conflicts.  The use of operator precedence is described in Chapter 9.
+
+AnaGram has extensive facilities for identifying and correcting
+unwanted conflicts.  These facilities are described in Chapter 7.
+
+Traditionally, parsers are built using only shift and reduce actions
+as described above.  The parsers AnaGram builds normally use a number
+of more complex actions, each of which is equivalent to a number of
+shift and reduce actions.  These complex actions permit AnaGram to
+build faster, more compact parsers.  If you wish, however, you may
+restrict AnaGram to using only shift and reduce actions by setting the
+\index{Traditional engine}\index{Configuration switches}
+\agparam{traditional engine} configuration switch.  See Appendix A,
+Configuration Parameters.
+
+
+\section{A Note on Notation}
+
+\index{Context free grammar}\index{Grammar}Context free grammars have
+been traditionally represented in the literature using
+\index{Backus-Naur Form}\agterm{Backus-Naur Form}, or
+\index{BNF}\agterm{BNF}.
+In Backus-Naur Form, certain characters, called metacharacters, are
+used to punctuate productions and all other printable characters are
+taken to represent themselves literally.  Named tokens are denoted by
+enclosing the name within angle brackets ($< >$).  The left side of a
+production is distinguished from the right side by the
+% XXX s/characters/symbol/
+characters $::=$.
+If several productions have the same left side, the vertical
+bar $|$ is used to separate them.  The elements of a grammar
+rule are simply juxtaposed to indicate that one follows another.
+Blanks are ignored.  Thus, in BNF, the first production given for
+\agcode{date}, above, would be:
+
+% ~ is a hard space
+$$
+<date>~~::=~~<name~of~month>~~<day>~~,~~<year>
+$$
+
+AnaGram uses a notation more consonant with ordinary programming
+usage.  Thus token names need not be bracketed and literal characters
+must be appropriately quoted.  The elements of rules are joined by
+commas.  Using this approach, there is no need for metacharacters and
+it becomes possible to make a number of useful extensions to the
+notation.  The detailed rules for writing grammars using AnaGram's
+notation are given in Chapter 8.
+
+
+\section{Reduction Procedures}
+\index{Reduction procedure}
+
+Of course, the reason for parsing an input stream is to interpret the
+data in the stream and to process it in some useful manner.  The
+primary tool for doing this is the \agterm{reduction procedure}.  A
+reduction procedure is a piece of C code that is executed when a
+particular grammar rule is reduced.  Often, a reduction procedure
+calculates a value which becomes the
+\index{Value}\index{Semantic value}semantic value
+of the 
+\index{Token}\index{Reduction token}reduction token.
+The input to the reduction procedure consists of the values of the
+tokens that make up the grammar rule.
+
+AnaGram allows you to assign C variable names to the tokens in the
+grammar rule, so that you can refer to them in the reduction
+procedure.  To assign a C variable name, simply follow the token in
+the rule with a colon and the C variable name.  Simple reduction
+procedures can be written as C or C++ expressions.  At the end of the
+rule, write an equal sign and an expression, terminated by a
+semicolon.  For example:
+
+\begin{indentingcode}{0.4in}
+(int) hex digit
+  -> '0-9':d = d-'0';
+  -> 'a-f':d = d-'a'+10;
+  -> 'A-F':d = d-'A'+10;
+\end{indentingcode}
+
+When any one of these rules is matched, the value of the token in the
+rule is assigned to the temporary variable \agcode{d}.  The expression
+to the right of the equal sign is evaluated and the value of the
+expression is stored as the value of \agcode{hex digit}, which has
+been declared to be an \agcode{int}.  These productions define
+hexadecimal digits as ASCII characters, and calculate the binary value
+of the digit in terms of the ASCII character code, \agcode{d}.  The
+binary value becomes the value of \agcode{hex digit}.  Hexadecimal
+digits can be combined to make hexadecimal numbers by writing the
+following productions:
+
+\begin{indentingcode}{0.4in}
+(int) hex number
+  -> hex digit
+  -> hex number:n, hex digit:d = 16*n+d;
+\end{indentingcode}
+
+%In this example, note that if you do not specify a reduction procedure for
+%a grammar rule, the value of the reduction token will be taken to be the
+%value of the first token in the rule.  Thus, it is not necessary to provide
+%a reduction procedure in the first production for hex number.  The second
+%production recursively defines the value of hexadecimal numbers that have
+%more than one digit.
+
+There are several important points to notice in this example.  First,
+reduction procedures are executed ``from the bottom up''.  That is,
+the reduction procedure for \agcode{hex digit} is executed before any
+reduction procedure for \agcode{hex number}.  Second, if there is no
+reduction procedure for a production, the value of the first token in
+the rule is assigned to the reduction token.  Thus, it is not
+necessary to provide a reduction procedure in the first production for
+\agcode{hex number}.  Third, the reduction procedures for recursive
+productions are always executed \emph{after} the reduction procedure,
+if any, for the non-recursive production which begins the recursion.
+Finally, when an input sequence is described using left recursion, as
+in this example, the elements of the sequence are processed left to
+right.
+
+%
+% XXX this is a bad example as you never want to read floats this way!
+% Can we think of something else?
+%
+If you wish to process the elements of a sequence right to left, you
+may use right recursion.  For example, it is sometimes convenient to
+define the fraction part of a decimal number thus:
+
+\begin{indentingcode}{0.4in}
+(double) fraction part
+  -> '0-9':d                  = (d - '0')/10.0;
+  -> '0-9':d, fraction part:f = (d - '0' + f)/10.0;
+\end{indentingcode}
+
+In this case the leading digits are stored temporarily in the parse
+stack, and then the fraction part is evaluated right to left only when
+the last digit has been found.
+
+Reduction procedures can be more complex than simple expressions.  After the
+equal sign you may include an arbitrary block of C or C++ code, enclosed in
+braces, \agcode{\bra \ket}.  To return a
+\index{Semantic value}\index{Token}\index{Value}semantic value
+for the reduction token simply use a return statement.  Of course,
+reduction procedures have the full resources of C or C++ at their
+disposal.  They may set and interrogate global variables and may call
+functions freely.  Reduction procedures are usually implemented as C
+functions.  Very simple procedures may be implemented as macros,
+unless you have disabled this option by turning off the
+\index{Macros}\index{Allow macros}\agparam{allow macros} configuration
+switch.  See Chapter 8 for the rules for writing reduction procedures.
+When AnaGram builds a parser it copies all of the reduction procedures
+you have defined to the parser file, and includes code to call them at
+the right time.
+
+Since the reduction procedures you write will probably need some
+support code, such as \agcode{\#include} statements and declarations,
+you may incorporate C or C++ code into your syntax file at any point.
+You need only enclose it in braces, \agcode{\bra \ket}.  Such code
+is called
+\index{Embedded C}\agterm{embedded C}.
+All embedded C code is also copied to the \index{File}parser file, and
+\emph{precedes} all of your reduction procedures.
+
+
+\section{Building a Parser}
+\index{Building a Parser}\index{Parser}
+
+In order to round out a parser into a functioning program it needs
+input procedures, as well as error diagnosis and recovery
+capabilities.  AnaGram has a number of options available which give
+you a high degree of flexibility in configuring a parser to suit your
+particular needs.  All of the options are provided with reasonable
+defaults, so that you can safely disregard any option until you need
+the features it provides.  The following paragraphs summarize the
+options AnaGram provides.  These topics are discussed in greater
+detail in Chapter 9.
+
+\paragraph{Invoking a Parser.}
+\index{Parser}
+Normally, AnaGram configures parsers as functions which you can call
+from elsewhere in your program.  In this situation, you call the
+parser, it processes its input, and returns either when it has
+finished or cannot proceed because of errors.  Alternatively, if you
+set the
+\index{Event driven}\agparam{event driven}
+configuration switch, your parser will be configured so that you have
+two procedures to call: an \index{Initializer}\agterm{initializer} and
+a \index{Parser}\agterm{parser}.  In the event driven configuration
+you start the parse by calling the initializer and then you call the
+parser once for each unit of input.  Using the event driven mode makes
+it quite easy to configure a parser as a filter and to chain several
+parsers together so that the output from one parser is the input to
+the next.  Such
+\index{Parsing}\index{Multi-stage parsing}\agterm{multi-stage parsing}
+is a convenient way to deal with complex input that is not context free.
+
+\paragraph{Communicating with a Parser.}
+The complete status of your parser is contained in a single data
+structure called a
+\index{Parser control block}\index{Control block}
+\agterm{parser control block}.
+All communications with a parser take place via the parser control
+block.  \index{Input procedures}Input procedures must place input data
+in the appropriate field in the parser control block.  When the parse
+is complete or encounters an error, the results of the parse may be
+found in the parser control block.  When AnaGram builds a parser it
+includes, in the \index{File}\index{Header file}header file it
+generates, a \agcode{typedef} statement which defines the structure of
+the parser control block.
+
+\paragraph{Parser Input.}
+\index{Input}\index{Parser}\index{Input procedures}
+The input to your parser may be either characters read directly from
+an input stream, or tokens accumulated by a pre-processor or lexical
+scanner.  The way you provide input to your parser depends on how your
+grammar defines input tokens and also on whether or not you have
+requested an event driven parser.
+
+If your parser is event driven, you provide its input by storing the
+input code and the input value, if any, into the parser control block
+and calling the parser.
+
+If you have set the
+\index{Pointer input}\agparam{pointer input}
+\index{Configuration switches}configuration switch
+in your syntax file, you simply initialize the pointer field in your
+parser control block before you call your parser.  Your parser will
+then read its input directly from memory by simply incrementing the
+pointer as necessary.
+
+Otherwise, your parser will invoke a macro called
+\index{\agcode{GET{\us}INPUT}}\index{Macros}\agcode{GET{\us}INPUT} every
+time it needs more input.  You may define \agcode{GET{\us}INPUT}
+according to your needs.  You can define it so that it calls an input
+function, or you can define it so that it executes in-line code each
+time it is invoked.  Your \agcode{GET{\us}INPUT} macro should store its
+input code in the
+\index{\agcode{input{\us}code}}\agcode{input{\us}code}
+field of the
+\index{PCB}parser control block.
+If you do not write a \agcode{GET{\us}INPUT} macro, AnaGram will provide
+one which will read characters from \agcode{stdin}.
+
+If your grammar does not define terminal tokens in terms of ASCII
+characters or external token numbers, your \agcode{GET{\us}INPUT} will
+have to determine the appropriate internal 
+\index{Token number}\index{Token}\index{Number}token number
+for each input token.  To assist you in determining these token
+numbers AnaGram provides a \agcode{typedef enum} statement in the
+\index{File}\index{Header file}header file.  You can then use named
+constants to specify the internal token numbers for the input tokens.
+
+\paragraph{Error Handling.}
+\index{Error handling}\index{Error diagnosis}\index{Error recovery}
+Your parser must be prepared to deal with erroneous input.  There are
+two aspects to error handling: diagnosing the error, and recovering
+from the error.  On encountering an error, your parser will invoke a
+macro called
+\index{\agcode{SYNTAX{\us}ERROR}}\index{Macros}\agcode{SYNTAX{\us}ERROR}.
+If you do not provide a definition for \agcode{SYNTAX{\us}ERROR}, AnaGram
+will provide a simple error diagnostic.  AnaGram can also provide
+automatic error diagnoses which pinpoint the location of the error.
+
+\index{Resynchronization}
+AnaGram provides two options for error recovery:
+\agterm{error token resynchronization} and
+\agterm{automatic resynchronization}.  These are
+techniques for getting your parser back in synch with its input so it
+can proceed after encountering an error.  Normally, if you do not
+select one of these recovery techniques, your parser will terminate
+when it encounters an error; however, you may override this default if
+you wish and provide your own recovery technique.
+
+
+\section{Special Features of AnaGram}
+
+AnaGram provides a number of special features not generally available
+in other parsing systems.  These features extend the applicability of
+syntax directed parsing, make it easier for you to write and debug
+your grammars, and make it easier to interface your parser to other
+parsers and to your main program.  The following paragraphs provide a
+brief introduction to these features.  Their use is described more
+fully in the appropriate chapters in this manual.
+
+\paragraph{Configuration Parameters.}
+\index{Configuration parameters}\index{Parameters}
+AnaGram has a number of configuration switches and parameters which
+you can use to control the way AnaGram works in various circumstances.
+Since programmers often have different requirements, AnaGram strives
+to be as flexible as possible.
+
+Configuration parameters may be set in
+\index{Configuration file}\index{File}configuration files, or in your
+syntax file.  The rules for writing configuration parameters are given
+in Chapter 8.  A summary of all configuration parameters is given in
+Appendix A.  The use of particular configuration parameters is
+described as appropriate throughout this manual.
+
+\paragraph{Character Sets}.
+\index{Character sets}
+In your grammar, if you write terminal tokens explicitly as characters
+or as sets of ASCII characters, AnaGram will set up your parser to
+accept ASCII characters directly, so you will not need to build a
+\index{Lexical scanner}lexical scanner and interface it to
+your parser.  Character sets may be specified in terms of ranges of
+characters, such as \agcode{'a-z'}, as unions, intersections or
+differences of sets, denoted by \agcode{+}, \agcode{\&}, or \agcode{-}
+respectively, or as the complement of a set, denoted by \agcode{\~{}}.
+You may also specify a set consisting of a single character either
+with a literal character or with a numeric specification.  The
+detailed rules for character set expressions are given in Chapter 8.
+Some useful character sets are:
+
+\begin{itemize}
+\item Letters: \agcode{'a-z' + 'A-Z'}
+\item Digits: \agcode{'0-9'}
+\item DOS end of file: \agcode{-1 + \^{}Z}
+\item Neither end of file nor end of line:
+\agcode{\~{}(-1 + \^{}Z
+ + '{\bs}n')}
+\end{itemize}
+
+\paragraph{Keywords.}
+\index{Keywords}
+AnaGram makes special provision for the use of keywords in the parsers
+it builds.  In your grammar, you may specify a keyword as a quoted
+string.  AnaGram will incorporate special string matching logic in
+your parser to recognize keywords in the parser input.  When a keyword
+is recognized, it is treated as an individual token, completely
+independent of the characters which make up the keyword.  Some
+examples of keywords: \agcode{"let"}, \agcode{".EQ."}, 
+\agcode{"<>"}, \agcode{"/*"}.  The
+rules for the use of keyword strings are given in Chapter 8.
+
+Keywords are a powerful and useful tool, but they are not completely
+consistent with the assumptions underlying the use of context-free
+grammars.  Thus under certain circumstances it is possible for a
+parser built from a grammar that uses keywords to fail
+% XXX s/parse correctly/correctly parse/
+to parse correctly text that
+appears to be consistent with the grammar.  Such a situation is called a
+\index{Keyword anomalies}\index{Anomaly}keyword anomaly.
+AnaGram detects and diagnoses keyword anomalies in your grammar, and
+provides tools to help you find and correct the problem.  Many such
+apparent anomalies, however, are completely innocuous.  Keyword
+anomalies are discussed in greater detail in Chapter 7.
+
+\paragraph{Virtual Productions.}
+\index{Virtual productions}\index{Production}
+Virtual productions are a powerful, shorthand notation for specifying
+optional or repetitive input.  Using a virtual production can often
+spare you the trouble of writing out four or five regular productions
+and naming them.
+
+Most of the notation for virtual productions should be familiar since
+it has been used in programming manuals for years.  The basic forms
+are these:
+
+\index{\agcode{[]}}
+\index{\_opb\_clb}  % {}, apparently. XXX?
+\index{\agcode{...}}\index{Ellipsis}
+\index{\agcode{?}}
+\index{\agcode{?...}}
+\index{\agcode{/...}}
+
+% XXX figure out how to not have to encode the line breaks manually.
+% I think we need one of the fancier extension forms of tabular.
+
+\begin{tabular}[t]{ll}
+
+\agcode{[]}&enclosing a list of one or more grammar rules separated
+by \agcode{|} characters denotes an optional\\
+\phantom{}&choice of one of the rules.\\
+
+\agcode{\bra \ket}&enclosing a list of one or more grammar rules separated by 
+\agcode{|} characters indicates a required\\
+\phantom{}&choice of one of the rules.\\
+
+\agcode{...}&following a token, or following \agcode{[]} or \agcode{\bra \ket},
+indicates arbitrary repetition of the previous choice or\\
+\phantom{}&token.\\
+
+\agcode{?}&following a token or set expression makes the token optional.\\
+
+\agcode{?...}&following a token or set expression denotes zero or more
+repetitions of the token.\\
+
+\agcode{/...}&following \agcode{[]} or \agcode{\bra \ket} indicates arbitrary
+repetition of the choice subject to the constraint that\\
+\phantom{}&choices must alternate.\\
+
+\end{tabular}
+
+\index{Production}\index{Virtual productions}Virtual productions
+are simply syntactic devices.  When AnaGram encounters a virtual
+production it translates the virtual production into conventional
+productions which show up in your grammar tables.  Virtual productions
+are described in Chapter 8.
+
+\paragraph{Definition Statements.}
+\index{Statement}\index{Definition statement}
+AnaGram provides a mechanism for naming particular character sets,
+keywords or virtual productions in your syntax file.  The name you
+define can be used freely in place of the thing defined.  Definition
+statements have the form:
+
+% XXX why the hell is the first line offset by a small amount?
+\begin{indentingcode}{0.4in}
+name = \codemeta{character set}
+name = \codemeta{virtual production}
+name = \codemeta{keyword}
+name = \codemeta{immediate action}
+name = \codemeta{token name}
+\end{indentingcode}
+
+The name may be any name acceptable to AnaGram.  The name can then be
+used anywhere you might have used the expression on the right side.
+For example:
+
+\begin{indentingcode}{0.4in}
+upper case letter = 'A-Z'
+lower case letter = 'a-z'
+letter = upper case letter + lower case letter
+statement list = statement?...
+while = "WHILE"
+dos eof = \^{}Z
+\end{indentingcode}
+
+\paragraph{Disregard Statements.}
+\index{Disregard statement}\index{Statement}
+If you wish your parser generally to skip over certain characters or
+constructs in its input, such as blanks, newlines, or comments, you
+may use disregard statements to specify precisely what is to
+be skipped and under what circumstances.  You may use any number of
+disregard statements and you may cause your parser to disregard any
+number of tokens, of whatever complexity.  Any \agparam{disregard}
+statements must
+be placed in a configuration section.  They may appear anywhere in
+your syntax file.
+
+The format of the \agparam{disregard} statement is as follows:
+
+\begin{indentingcode}{0.4in}
+disregard \codemeta{token name}
+\end{indentingcode}
+
+The rules for using disregard statements are given in Chapter 8.
+
+\paragraph{Lexeme Statements.}
+\index{Lexeme statement}\index{Statement}
+The lexeme statement is used to fine tune the disregard statement.
+Any \agparam{lexeme} statements must be placed in a configuration
+section.  They may appear anywhere in your syntax file.
+
+The format of the \agparam{lexeme} statement is as follows:
+
+\begin{indentingcode}{0.4in}
+lexeme \bra \codemeta{token list} \ket
+\end{indentingcode}
+
+where \textit{token list} is a list of nonterminal tokens, separated
+by commas.
+
+The lexeme statement allows you to specify that for the purposes of
+the disregard statement, the listed tokens are to be treated as
+% XXX make this: indivisible lexical units
+lexical units, so that the disregard statement will be inoperative
+within the tokens specified.  Lexeme statements are discussed in
+Chapter 8.
+
+\paragraph{Semantically Determined Productions.}
+\index{Semantically determined production}\index{Production}
+Traditionally, syntax directed parsing systems have enforced a rigid
+distinction between syntactic and semantic information.  In these
+systems, it is not possible to use the knowledge that, for example, a
+symbol is a function name and not a variable name in the syntax of a
+programming language.  AnaGram provides semantically determined
+productions as a mechanism to allow semantic information to control
+syntactic analysis.
+
+A semantically determined production has more than one
+\index{Token}\index{Reduction token}reduction token
+on the left side.  The reduction procedure determines which reduction
+token the parser should use.  For example:
+
+\begin{indentingcode}{0.4in}
+variable name, function name
+  -> symbol:s = identify{\us}symbol(s);
+\end{indentingcode}
+
+In this situation, the parser will set the reduction token to the
+leftmost token, \agcode{variable name}, before calling the reduction
+procedure.  If the reduction procedure finds that \agcode{s} in fact
+identifies a function name, it can change the reduction token to
+\agcode{function name}.  Using this capability, you can write your grammar in
+terms of \agcode{variable name} and \agcode{function name}, just as
+though they were syntactically distinguishable.  Further discussion of
+semantically determined productions may be found in Chapter 9.
+
+\paragraph{Event Driven Parsers.}
+\index{Parser}\index{Event driven parser}
+Traditionally, parsers have been constructed as subroutines which are
+called to analyze a stream of input and return only when they have
+finished parsing the input or have encountered an unrecoverable error.
+Such a parser itself calls a subroutine to obtain each unit of input.
+There are a number of circumstances where this traditional
+architecture is inconvenient.  The most obvious is in situations where
+programs must be event-driven, as in many popular windowing systems.
+A somewhat less obvious situation is where one parser, often called a
+\index{Lexical scanner}\agterm{lexical scanner}, provides input to a
+second. In these situations it is convenient to invert the role of the
+calling and the called programs.
+
+AnaGram provides the capability of specifying that a parser it builds
+should be event-driven.  In this case, the parser consists of an
+initialization procedure and an event handler.  The main program calls
+the initialization procedure and then calls the event handler with
+each input token in turn.  The event handler returns every time the
+parser needs more input.  Details on event-driven parsers are given in
+Chapter 9.
+
+\paragraph{Context tracking.}
+\index{Context tracking}
+When you are writing a reduction procedure for a particular grammar
+rule, you often need to know the value one or another of your program
+variables had at the time the first token in the rule was encountered.
+Examples of such variables are:
+
+\begin{itemize}
+\item Line or column number
+\item Index in an input file
+\item Index into an array
+\item Counters, as of symbols defined, etc.
+\end{itemize}
+
+Such variables can be thought of as representing the ``context'' of
+the rule you are reducing.  Sometimes it is possible to incorporate
+the values of such variables into the values of reduction tokens, but
+this can become quite cumbersome.  AnaGram provides a feature known as
+``context tracking'' to deal with this problem.  Details on context
+tracking can be found in Chapter 9.
+
+\paragraph{Coverage Analysis.}
+\index{Coverage analysis}
+AnaGram has facilities to help you determine the adequacy of your test
+suites.  If you set appropriate configuration switches, AnaGram will
+include code in your parser to count the number of times each rule in
+your grammar is reduced or the number of times each reduction
+procedure in your grammar is executed.  You can use these counts to
+determine the completeness of your testing.  The use of these
+facilities is discussed in Chapter 9.
+
+\paragraph{Immediate Actions.}
+\index{Action}\index{Immediate action}
+AnaGram provides for executing particular functions, called
+``immediate actions'', at any point in a grammar rule.  Thus,
+computation does not need to wait until the rule is complete.
+Immediate actions are most useful when using AnaGram to control an
+interactive process.  Immediate actions are discussed in Chapter 8.
+
+\paragraph{Grammar Trace.}
+\index{Grammar Trace}\index{Window}\index{Trace}
+The \agwindow{Grammar Trace} facility of AnaGram allows you to examine
+the workings of your parser in detail.  You can set up a
+representation of the
+\index{Parser state stack}\index{State stack}\index{Stack}parser state stack
+and parser state as they might appear in the course of execution of
+your parser.  You can then examine the possible inputs and see how the
+state and the state stack change in response to any input you choose.
+You can back up and try other options.  You can have several grammar
+traces active simultaneously so you can compare the results of
+different sequences of input.  For any configuration of the state
+stack, you can see what grammar rules the parser is in the process of
+matching.
+
+The use of the Grammar Trace is described in Chapter 5.
+
+\paragraph{File Trace.}
+\index{Trace}\index{File Trace}\index{Window}
+The \agwindow{File Trace} is a facility which allows you to see how
+your parser will work on real data.  File Trace is an interactive,
+interpretive parser governed by the rules in your grammar.  It lets
+you see precisely how your grammar parses a test file.  You can see
+the contents of the parse stack at any point in the parse.  You can
+watch the parse progress at any level of detail you choose.  If you
+wish, you may back up and try again.  When AnaGram runs the
+\agwindow{File Trace} or \agwindow{Grammar Trace} it also builds a
+\index{Coverage}\index{Trace Coverage}\index{Window}
+\agwindow{Trace Coverage} table.
+
+Note that AnaGram normally uses a number of short-cut parsing actions.
+To make the parse look like a textbook parse, set the
+\index{Traditional engine}\agparam{traditional engine}
+\index{Configuration parameters}configuration parameter
+in your syntax file.
+
+\paragraph{Aids to Debugging.}
+AnaGram provides a number of facilities to help you debug your parser.
+It provides numerous tables that summarize your grammar and a
+windowing environment to assist you in inspecting and comparing this
+data.  To assist in identifying a conflict in your grammar it can
+provide a
+\index{Trace}\index{Conflict Trace}\index{Window}\agwindow{Conflict Trace},
+a pre-built grammar trace showing a sequence of input which will
+trigger the conflict.  If your parser encounters a
+\index{Syntax error}\index{Errors}syntax error,
+you can arrange to have AnaGram build an \agwindow{Error Trace}, a
+pre-built grammar trace showing the sequence of input tokens that led
+to the syntax error.  These facilities are described in Chapter 5.
+
+\paragraph{Conflict Resolution.}
+\index{Conflict}
+In order to help you deal with conflicts in your grammar, AnaGram
+provides several methods for resolving conflicts.  You may include
+\agterm{precedence declarations} to assist in the construction of
+simple precedence grammars, you may declare tokens to be
+\agparam{sticky}, or you may declare tokens to be
+\agterm{subgrammar} tokens.
+These techniques are described in Chapter 7.
author	David A. Holland
date	Sat, 22 Dec 2007 17:52:45 -0500
parents
children