Mercurial > ~dholland > hg > ag > index.cgi
diff doc/manual/isdp.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/manual/isdp.tex Sat Dec 22 17:52:45 2007 -0500 @@ -0,0 +1,977 @@ +\chapter{Introduction to Syntax Directed Parsing} + +Every programmer has to deal with input data. The units of data may +consist of single characters from a keyboard or file, unit records +from a file, mouse events in a windowing system, or even more complex +data constructs. When the processing of a unit of data depends only on +the value of that data, it is usually straightforward. Usually, +however, the processing depends also on what has preceded, and often +even on what follows, the input under consideration. Keeping track of +these dependencies in order to know how to process the data is called +\index{Parsing}\agterm{parsing}. + +It is often relatively easy to keep track of simple dependencies when +first writing a program. But as the program develops, as new features +are added, as bugs are fixed, the dependencies often stop being +simple. Then input processing becomes a headache, since it is hard to +keep track of or even identify all the particular cases. Changes to +the program cause unexpected problems. Program maintenance threatens +to get out of control. + +Syntax directed parsing is a technique for dealing with input data +streams which keeps you in control of the problem, gives you a high +degree of confidence in your program, makes bugs less likely and makes +program maintenance and the addition of new features an easy task. + +To use syntax directed parsing you create a high level description of +the structure of your input data, called a +\index{Grammar}\agterm{grammar}. +A file which contains a grammar is called a +\index{Syntax file}\index{File}\agterm{syntax file}. +A +\index{Parser generator}\agterm{parser generator}, +such as AnaGram, +can create, from a syntax file, a function (or program) called a +\index{Parser}\agterm{parser}, +written in C or C++. The parser keeps track of all the dependencies +in your input, and calls certain functions, +\index{Reduction procedure}\agterm{reduction procedures}, +to deal with specific units or sequences of data as they are +encountered. + +Reduction procedures are functions you write to process your data. +They are linked, in an obvious way in the grammar, to the structures +in your input, so that your parser will call them at precisely the +right times with precisely the right data. When you write reduction +procedures, you can concentrate entirely on what you have to do with +the data. You don't have to encumber your code with switches and +tests to determine the structure of your input. + +This chapter describes in a general way how you write a grammar and +how a parser works. In particular, it defines a number of terms which +are useful in talking about grammars and syntax directed parsing. It +then introduces a number of special features of AnaGram. More +detailed treatment of writing grammars is given in Chapter 8. +Techniques for building a complete functioning parser are discussed in +Chapter 9. + + +\section{Describing an Input Sequence} + +Writing a grammar consists of describing the acceptable input +sequences for your program. The vehicle for describing an input +sequence is called a +\index{Production}\agterm{production}. +Productions show how a logical component of the input can be made up +of a sequence of more elementary components. A production that +describes a date might be written: + +\begin{indentingcode}{0.4in} +date + -> name of month, day, comma, year +\end{indentingcode} + +The components of the input are called +\index{Token}\agterm{tokens}. +The sequence of tokens on the \index{Right side}right side of the +production is called a +\index{Grammar rule}\agterm{grammar rule}, +or \index{Rule}\agterm{rule} for short. +The individual tokens on the right side of the rule are also called +\index{Rule elements}\index{Rule}\agterm{rule elements}. +The \agterm{rule length} is the number of elements in the rule. + +The token on the \index{Left side}left side of the production is +called the +\index{Reduction token}\index{Token}\agterm{reduction token} +for the rule. \index{Token}Tokens may have +\index{Value}\index{Semantic value}\agterm{semantic values}, +as distinguished from \index{Value}\agterm{syntactic values}, which +you can use in your reduction procedures. For instance, the value of +\agcode{name of month} could be an integer in the range zero to +eleven, or it could be a pointer to an ASCII string. The value of day +could be an integer in the range one to thirty-one. + +A grammar consists of a number of such productions, each of which +describes some component of the input in terms of other components. +It does not take very many productions to describe quite complex input +streams. A grammar for the C language, for instance, requires about +two hundred productions. + +Many people find the term \agterm{production} quite confusing. It +comes from theoretical linguistics where it is used to describe how +one may produce sequences which correspond to a set of grammatical +rules. Ironically, the major usage of the idea has been in parsing +where the interest is not so much in creating sequences which satisfy +the grammatical rules as in decoding and analyzing such sequences. +Nonetheless, it is convenient, in the above example, to say that the +token \agcode{date} \agterm{produces} a sequence of tokens consisting +of \agcode{name of month}, \agcode{day}, \agcode{comma}, and +\agcode{year}. We also say that the sequence \agterm{reduces} to +\agcode{date}. + +There may be more than one production to describe a given component, +if there is more than one way it may be represented. For instance, + +\begin{indentingcode}{0.4in} +date + -> day, name of month, year +\end{indentingcode} + +describes another common way of writing a date. In other words, a +reduction token may produce a number of different grammar rules. + +Tokens which appear on the left side of one or more productions are called +\index{Token}\index{Nonterminal token}\agterm{nonterminal tokens}. +Those which appear \emph{only} on the right sides of productions are +called +\index{Token}\index{Terminal token}\agterm{terminal tokens}. +Terminal tokens are the units which actually appear physically in the +input. Nonterminal tokens are identified when a sequence of tokens +that matches the right side of a production is seen in the input. +When AnaGram analyzes a grammar, it assigns a unique +\index{Token number}\index{Number}\index{Token}\agterm{token number} +to each token it finds in the grammar. + +Nonterminal tokens, such as \agcode{date} in the example above, may +appear in any grammar rule just as though they were input tokens. The +token on the left side of a production can even appear on the right +side as well. Such a production is called a +\index{Production}\index{Recursive productions}\agterm{recursive} +production. When a nonterminal token appears on the right side of a +production, it may be represented in this context by \emph{any} of +the grammar rules it produces. Grammars described in this manner are +called +\index{Context free grammar}\index{Grammar}\agterm{context free grammars} +since there is no contextual constraint on which of the rules that a +token produces can appear in any given context. + +Recursive productions may be either left recursive or right recursive. +\agterm{Left recursive} productions are those where the recursively +defined nonterminal appears as the first element in the recursive +rule. \agterm{Right recursive} productions are those where it is the +last element. If it appears anywhere between, the production is said +to be \agterm{center recursive}. + +Any nonterminal token which has a recursive production must also have +at least one simple, non-recursive production. Otherwise, it is not +possible to create a finite sequence of terminal tokens from the +nonterminal token. + +Recursion may also occur implicitly in a grammar when one of the +tokens on the right side of a production itself has a production +involving the token on the left. Such implicit recursion sometimes +may involve numerous levels of productions. Implicit recursion occurs +most commonly in describing constructs such as arithmetic expressions +or the block structure of programming languages. + +Clearly, grammars can accommodate multiple levels of structure in the +input sequences they describe. There must, at the top, be a single +token which encompasses the entire input. This special token is +variously called the +\index{Grammar token}\index{Configuration parameters}\index{Token} +\agterm{grammar token}, +the +\index{Goal token}\agterm{goal token} +or the +\index{Start token}\agterm{start token}. + +In AnaGram grammars, terminal tokens need not be declared or otherwise +defined. Any token name that appears only on the right side of a +production is taken to be a terminal token, and AnaGram presumes that +the input procedure that provides tokens to the parser will be able to +identify them to the parser appropriately. The details on how to do +this are explained in Chapter 9. + +On the other hand, AnaGram allows you to specify terminal tokens +explicitly as ASCII characters, or even as sets of ASCII characters, +right in the grammar. Thus, you may write \agcode{'0-9'} to represent +the set of ASCII digits, or \agcode{'A-Z'} to represent the set of +upper case letters. The +\index{Semantic value}\index{Value}\index{Token}semantic value +of such a token is the ASCII \index{Character codes}character code +that actually appears in the input stream. The rules for representing +\index{Character sets}character sets are given in Chapter 8. If the +various sets you use in your grammar overlap, they may not properly +represent terminal tokens. In this case, AnaGram automatically +extends your grammar appropriately. This ``set partition'' logic is +described in Chapter 6. + + +\section{How a Parser Works} + +The aim of a \index{Parser}parser is to match its input with the full +syntactic structure specified by the productions which make up the +grammar. The primary component of a parser is an input buffer, +sometimes thought of as a stack, into which tokens are +\agterm{shifted}, or stored sequentially, as they are encountered in +the input. At the same time that a token is shifted into the input +buffer, its +\index{Semantic value}\index{Token}\index{Value}semantic value is +pushed onto the +\index{Parser value stack}\index{Value stack}\index{Stack} +\agterm{value stack}. +A token is not shifted into the buffer unless it ``makes sense'', that +is, unless it is consistent with the rules of the grammar and with the +input that has preceded it. If a token does not make sense, the +parser signals a +\index{Syntax error}\index{Errors}\agterm{syntax error}. +In order to determine whether a token makes sense, the parser has a +sort of decision table which provides a list of acceptable tokens for +each of a number of +\index{Parser}\index{State}\agterm{states}. +The table also specifies what the parser is to do with each acceptable +token. When the table indicates that a token is to be shifted into +the buffer, it also specifies a new state. The parser stacks the +current state on a +\index{Parser state stack}\index{State stack}\index{Stack} +\agterm{state stack} +and jumps to the new state. Thus every time a token is shifted into +the input buffer, a state number is pushed onto the state stack. For +each state of the parser, excepting only the initial state, there is a +unique token which will cause a jump to that state. This token is +called the +\index{Characteristic token}\index{Token}\agterm{characteristic token} +for the state. + +When the rightmost, or most recent, tokens in the input buffer match +the right side of a production precisely, the parser \emph{may} +replace the tokens that match the rule with a single token, the token +on the left side of the production. This process of replacing a +sequence of tokens with a single token is called +\index{Reduction}\agterm{reduction}. +The token that replaces the sequence of tokens is called the +\index{Reduction token}\index{Token}\agterm{reduction token}. + +The actual mechanism of the reduction is quite important. At the same +time that input tokens are removed from the input buffer, state +numbers are popped from the state stack, so that when all input tokens +matching the rule have been removed, the parser +\index{Parser}\index{State}state has been restored to the value it had +at the time the first token in the rule was seen. As state numbers +are popped from the state stack, token values are popped from the +value stack. If the rule has a reduction procedure, temporary +variables are loaded with the values popped from the stack and the +reduction procedure is called. The reduction token is now shifted +into the input buffer just as though it were an input token. If the +reduction procedure returned a result it is shifted into the value +stack as the value of the reduction token. The parser stacks the +current state again and jumps to a new state as determined by the +parser tables. + +If there are no errors in your input, when the last token has been +read from the input, shifted into the input buffer, and reductions +performed, there will be precisely one token in the input buffer: the +\index{Grammar token}\index{Goal token}grammar, or goal, token which +describes your entire input. At this point your parser declares +itself finished. + +Reductions do not necessarily occur every time a rule matches the +tokens in the input buffer. If the reduction token does not make +sense, that is, if it is not consistent with the rules of the grammar +and with the input that has preceded it, the parser will not perform +the reduction. Suppose there are tokens in the input which match one +of the rules given above for \agcode{date}. A reduction will not +occur unless \agcode{date} is one of the tokens the parser is actually +looking for at that stage in the input. + +Even when the reduction token would make sense, there are still +situations where the reduction would not take place. Suppose a +grammar includes the two productions given above for \agcode{date} as +well as the following: + +\begin{indentingcode}{0.4in} +date + -> name of month, day +\end{indentingcode} + +This production is the same as the first, but with no year +specification. We often write dates without specifying the year +explicitly. The year is usually understood from context. When a +parser directed by this grammar has encountered \agcode{name of month} +and \agcode{day}, it can't tell without looking further whether it has +a short form or a long form date. In such a circumstance, the parser +looks at the next following token, which is called a +\index{Lookahead token}\index{Token}\agterm{lookahead token}. +If the lookahead token is a comma, then, in the absence of other +productions, the input is a long form date. If the lookahead token is +not a comma, then the input is certainly not a long form date, and it +is proper to reduce the short form production. + +Suppose the lookahead token were a comma and the grammar were also to +contain the following production: + +\begin{indentingcode}{0.4in} +appointment + -> date, comma, time +\end{indentingcode} + +Since a comma can follow date, according to this rule, and can also +follow day according to the first production, it is impossible to +determine, simply by looking at the lookahead token, whether the date +was being given in short form or long form. One would have to look +beyond the comma to see if what follows the comma matches the rules +for time or for year. Although it is possible to build parsers which +can do this, it is not generally feasible. This situation is called a +\index{Shift-reduce conflict}\index{Conflicts} +\agterm{shift-reduce conflict}, +because the parser cannot decide simply on the basis of the lookahead +token whether to reduce the short form rule or to ignore it and shift +the comma into the input buffer. + +AnaGram finds and warns you about the conflicts in your grammar when +it analyzes your grammar. To do this, it checks all states to find +all completed rules. A +\index{Completed rule}\index{Rule}\agterm{completed rule} +is a rule that is completely matched by the rightmost tokens in the +input buffer. For each such rule, AnaGram determines all of the +terminal tokens which could follow once that rule has been reduced. +These tokens are sometimes called the +\index{Token}\agterm{reducing tokens} +for the given rule in the given state. (Note that ``reducing tokens'' +are quite different from ``reduction tokens''.) AnaGram checks each +state for shift-reduce conflicts by comparing the set of tokens which +can be shifted against all sets of reducing tokens for that state. If +there is an overlap, there is a conflict. When AnaGram finds a +conflict, it prepares a detailed report on the conflict. This +information is then available in the \agwindow{Conflicts} window. A +shift-reduce conflict does not prevent you from building a parser. If +you build a parser for a grammar which has such a shift-reduce +conflict, the parser will resolve the issue by shifting the lookahead +token into the input buffer and ignoring the reduction. + +There is another type of conflict involving reductions. Suppose a +given state has more than one +\index{Completed rule}\index{Rule}completed rule. +AnaGram then checks the sets of reducing tokens for all the completed +rules. If there is an overlap, there is no way for AnaGram to +determine which rule should be reduced simply on the basis of the +\index{Token}\index{Lookahead token}lookahead token. +This situation is called a +\index{Reduce-reduce conflicts}\index{Conflicts} +\agterm{reduce-reduce conflict}. +AnaGram will also diagnose reduce-reduce conflicts in the Conflicts +window. If you decide to ignore a reduce-reduce conflict and ask +AnaGram to build a parser, the parser will choose the rule AnaGram +encountered first when it read your grammar. + +\index{Conflicts}Conflicts are not necessarily errors. Sometimes you +can build a more compact, faster parser by deliberately introducing +conflicts. You can rely on AnaGram's default resolution of the +conflicts, or you may use operator precedence to resolve the +conflicts. The use of operator precedence is described in Chapter 9. + +AnaGram has extensive facilities for identifying and correcting +unwanted conflicts. These facilities are described in Chapter 7. + +Traditionally, parsers are built using only shift and reduce actions +as described above. The parsers AnaGram builds normally use a number +of more complex actions, each of which is equivalent to a number of +shift and reduce actions. These complex actions permit AnaGram to +build faster, more compact parsers. If you wish, however, you may +restrict AnaGram to using only shift and reduce actions by setting the +\index{Traditional engine}\index{Configuration switches} +\agparam{traditional engine} configuration switch. See Appendix A, +Configuration Parameters. + + +\section{A Note on Notation} + +\index{Context free grammar}\index{Grammar}Context free grammars have +been traditionally represented in the literature using +\index{Backus-Naur Form}\agterm{Backus-Naur Form}, or +\index{BNF}\agterm{BNF}. +In Backus-Naur Form, certain characters, called metacharacters, are +used to punctuate productions and all other printable characters are +taken to represent themselves literally. Named tokens are denoted by +enclosing the name within angle brackets ($< >$). The left side of a +production is distinguished from the right side by the +% XXX s/characters/symbol/ +characters $::=$. +If several productions have the same left side, the vertical +bar $|$ is used to separate them. The elements of a grammar +rule are simply juxtaposed to indicate that one follows another. +Blanks are ignored. Thus, in BNF, the first production given for +\agcode{date}, above, would be: + +% ~ is a hard space +$$ +<date>~~::=~~<name~of~month>~~<day>~~,~~<year> +$$ + +AnaGram uses a notation more consonant with ordinary programming +usage. Thus token names need not be bracketed and literal characters +must be appropriately quoted. The elements of rules are joined by +commas. Using this approach, there is no need for metacharacters and +it becomes possible to make a number of useful extensions to the +notation. The detailed rules for writing grammars using AnaGram's +notation are given in Chapter 8. + + +\section{Reduction Procedures} +\index{Reduction procedure} + +Of course, the reason for parsing an input stream is to interpret the +data in the stream and to process it in some useful manner. The +primary tool for doing this is the \agterm{reduction procedure}. A +reduction procedure is a piece of C code that is executed when a +particular grammar rule is reduced. Often, a reduction procedure +calculates a value which becomes the +\index{Value}\index{Semantic value}semantic value +of the +\index{Token}\index{Reduction token}reduction token. +The input to the reduction procedure consists of the values of the +tokens that make up the grammar rule. + +AnaGram allows you to assign C variable names to the tokens in the +grammar rule, so that you can refer to them in the reduction +procedure. To assign a C variable name, simply follow the token in +the rule with a colon and the C variable name. Simple reduction +procedures can be written as C or C++ expressions. At the end of the +rule, write an equal sign and an expression, terminated by a +semicolon. For example: + +\begin{indentingcode}{0.4in} +(int) hex digit + -> '0-9':d = d-'0'; + -> 'a-f':d = d-'a'+10; + -> 'A-F':d = d-'A'+10; +\end{indentingcode} + +When any one of these rules is matched, the value of the token in the +rule is assigned to the temporary variable \agcode{d}. The expression +to the right of the equal sign is evaluated and the value of the +expression is stored as the value of \agcode{hex digit}, which has +been declared to be an \agcode{int}. These productions define +hexadecimal digits as ASCII characters, and calculate the binary value +of the digit in terms of the ASCII character code, \agcode{d}. The +binary value becomes the value of \agcode{hex digit}. Hexadecimal +digits can be combined to make hexadecimal numbers by writing the +following productions: + +\begin{indentingcode}{0.4in} +(int) hex number + -> hex digit + -> hex number:n, hex digit:d = 16*n+d; +\end{indentingcode} + +%In this example, note that if you do not specify a reduction procedure for +%a grammar rule, the value of the reduction token will be taken to be the +%value of the first token in the rule. Thus, it is not necessary to provide +%a reduction procedure in the first production for hex number. The second +%production recursively defines the value of hexadecimal numbers that have +%more than one digit. + +There are several important points to notice in this example. First, +reduction procedures are executed ``from the bottom up''. That is, +the reduction procedure for \agcode{hex digit} is executed before any +reduction procedure for \agcode{hex number}. Second, if there is no +reduction procedure for a production, the value of the first token in +the rule is assigned to the reduction token. Thus, it is not +necessary to provide a reduction procedure in the first production for +\agcode{hex number}. Third, the reduction procedures for recursive +productions are always executed \emph{after} the reduction procedure, +if any, for the non-recursive production which begins the recursion. +Finally, when an input sequence is described using left recursion, as +in this example, the elements of the sequence are processed left to +right. + +% +% XXX this is a bad example as you never want to read floats this way! +% Can we think of something else? +% +If you wish to process the elements of a sequence right to left, you +may use right recursion. For example, it is sometimes convenient to +define the fraction part of a decimal number thus: + +\begin{indentingcode}{0.4in} +(double) fraction part + -> '0-9':d = (d - '0')/10.0; + -> '0-9':d, fraction part:f = (d - '0' + f)/10.0; +\end{indentingcode} + +In this case the leading digits are stored temporarily in the parse +stack, and then the fraction part is evaluated right to left only when +the last digit has been found. + +Reduction procedures can be more complex than simple expressions. After the +equal sign you may include an arbitrary block of C or C++ code, enclosed in +braces, \agcode{\bra \ket}. To return a +\index{Semantic value}\index{Token}\index{Value}semantic value +for the reduction token simply use a return statement. Of course, +reduction procedures have the full resources of C or C++ at their +disposal. They may set and interrogate global variables and may call +functions freely. Reduction procedures are usually implemented as C +functions. Very simple procedures may be implemented as macros, +unless you have disabled this option by turning off the +\index{Macros}\index{Allow macros}\agparam{allow macros} configuration +switch. See Chapter 8 for the rules for writing reduction procedures. +When AnaGram builds a parser it copies all of the reduction procedures +you have defined to the parser file, and includes code to call them at +the right time. + +Since the reduction procedures you write will probably need some +support code, such as \agcode{\#include} statements and declarations, +you may incorporate C or C++ code into your syntax file at any point. +You need only enclose it in braces, \agcode{\bra \ket}. Such code +is called +\index{Embedded C}\agterm{embedded C}. +All embedded C code is also copied to the \index{File}parser file, and +\emph{precedes} all of your reduction procedures. + + +\section{Building a Parser} +\index{Building a Parser}\index{Parser} + +In order to round out a parser into a functioning program it needs +input procedures, as well as error diagnosis and recovery +capabilities. AnaGram has a number of options available which give +you a high degree of flexibility in configuring a parser to suit your +particular needs. All of the options are provided with reasonable +defaults, so that you can safely disregard any option until you need +the features it provides. The following paragraphs summarize the +options AnaGram provides. These topics are discussed in greater +detail in Chapter 9. + +\paragraph{Invoking a Parser.} +\index{Parser} +Normally, AnaGram configures parsers as functions which you can call +from elsewhere in your program. In this situation, you call the +parser, it processes its input, and returns either when it has +finished or cannot proceed because of errors. Alternatively, if you +set the +\index{Event driven}\agparam{event driven} +configuration switch, your parser will be configured so that you have +two procedures to call: an \index{Initializer}\agterm{initializer} and +a \index{Parser}\agterm{parser}. In the event driven configuration +you start the parse by calling the initializer and then you call the +parser once for each unit of input. Using the event driven mode makes +it quite easy to configure a parser as a filter and to chain several +parsers together so that the output from one parser is the input to +the next. Such +\index{Parsing}\index{Multi-stage parsing}\agterm{multi-stage parsing} +is a convenient way to deal with complex input that is not context free. + +\paragraph{Communicating with a Parser.} +The complete status of your parser is contained in a single data +structure called a +\index{Parser control block}\index{Control block} +\agterm{parser control block}. +All communications with a parser take place via the parser control +block. \index{Input procedures}Input procedures must place input data +in the appropriate field in the parser control block. When the parse +is complete or encounters an error, the results of the parse may be +found in the parser control block. When AnaGram builds a parser it +includes, in the \index{File}\index{Header file}header file it +generates, a \agcode{typedef} statement which defines the structure of +the parser control block. + +\paragraph{Parser Input.} +\index{Input}\index{Parser}\index{Input procedures} +The input to your parser may be either characters read directly from +an input stream, or tokens accumulated by a pre-processor or lexical +scanner. The way you provide input to your parser depends on how your +grammar defines input tokens and also on whether or not you have +requested an event driven parser. + +If your parser is event driven, you provide its input by storing the +input code and the input value, if any, into the parser control block +and calling the parser. + +If you have set the +\index{Pointer input}\agparam{pointer input} +\index{Configuration switches}configuration switch +in your syntax file, you simply initialize the pointer field in your +parser control block before you call your parser. Your parser will +then read its input directly from memory by simply incrementing the +pointer as necessary. + +Otherwise, your parser will invoke a macro called +\index{\agcode{GET{\us}INPUT}}\index{Macros}\agcode{GET{\us}INPUT} every +time it needs more input. You may define \agcode{GET{\us}INPUT} +according to your needs. You can define it so that it calls an input +function, or you can define it so that it executes in-line code each +time it is invoked. Your \agcode{GET{\us}INPUT} macro should store its +input code in the +\index{\agcode{input{\us}code}}\agcode{input{\us}code} +field of the +\index{PCB}parser control block. +If you do not write a \agcode{GET{\us}INPUT} macro, AnaGram will provide +one which will read characters from \agcode{stdin}. + +If your grammar does not define terminal tokens in terms of ASCII +characters or external token numbers, your \agcode{GET{\us}INPUT} will +have to determine the appropriate internal +\index{Token number}\index{Token}\index{Number}token number +for each input token. To assist you in determining these token +numbers AnaGram provides a \agcode{typedef enum} statement in the +\index{File}\index{Header file}header file. You can then use named +constants to specify the internal token numbers for the input tokens. + +\paragraph{Error Handling.} +\index{Error handling}\index{Error diagnosis}\index{Error recovery} +Your parser must be prepared to deal with erroneous input. There are +two aspects to error handling: diagnosing the error, and recovering +from the error. On encountering an error, your parser will invoke a +macro called +\index{\agcode{SYNTAX{\us}ERROR}}\index{Macros}\agcode{SYNTAX{\us}ERROR}. +If you do not provide a definition for \agcode{SYNTAX{\us}ERROR}, AnaGram +will provide a simple error diagnostic. AnaGram can also provide +automatic error diagnoses which pinpoint the location of the error. + +\index{Resynchronization} +AnaGram provides two options for error recovery: +\agterm{error token resynchronization} and +\agterm{automatic resynchronization}. These are +techniques for getting your parser back in synch with its input so it +can proceed after encountering an error. Normally, if you do not +select one of these recovery techniques, your parser will terminate +when it encounters an error; however, you may override this default if +you wish and provide your own recovery technique. + + +\section{Special Features of AnaGram} + +AnaGram provides a number of special features not generally available +in other parsing systems. These features extend the applicability of +syntax directed parsing, make it easier for you to write and debug +your grammars, and make it easier to interface your parser to other +parsers and to your main program. The following paragraphs provide a +brief introduction to these features. Their use is described more +fully in the appropriate chapters in this manual. + +\paragraph{Configuration Parameters.} +\index{Configuration parameters}\index{Parameters} +AnaGram has a number of configuration switches and parameters which +you can use to control the way AnaGram works in various circumstances. +Since programmers often have different requirements, AnaGram strives +to be as flexible as possible. + +Configuration parameters may be set in +\index{Configuration file}\index{File}configuration files, or in your +syntax file. The rules for writing configuration parameters are given +in Chapter 8. A summary of all configuration parameters is given in +Appendix A. The use of particular configuration parameters is +described as appropriate throughout this manual. + +\paragraph{Character Sets}. +\index{Character sets} +In your grammar, if you write terminal tokens explicitly as characters +or as sets of ASCII characters, AnaGram will set up your parser to +accept ASCII characters directly, so you will not need to build a +\index{Lexical scanner}lexical scanner and interface it to +your parser. Character sets may be specified in terms of ranges of +characters, such as \agcode{'a-z'}, as unions, intersections or +differences of sets, denoted by \agcode{+}, \agcode{\&}, or \agcode{-} +respectively, or as the complement of a set, denoted by \agcode{\~{}}. +You may also specify a set consisting of a single character either +with a literal character or with a numeric specification. The +detailed rules for character set expressions are given in Chapter 8. +Some useful character sets are: + +\begin{itemize} +\item Letters: \agcode{'a-z' + 'A-Z'} +\item Digits: \agcode{'0-9'} +\item DOS end of file: \agcode{-1 + \^{}Z} +\item Neither end of file nor end of line: +\agcode{\~{}(-1 + \^{}Z + + '{\bs}n')} +\end{itemize} + +\paragraph{Keywords.} +\index{Keywords} +AnaGram makes special provision for the use of keywords in the parsers +it builds. In your grammar, you may specify a keyword as a quoted +string. AnaGram will incorporate special string matching logic in +your parser to recognize keywords in the parser input. When a keyword +is recognized, it is treated as an individual token, completely +independent of the characters which make up the keyword. Some +examples of keywords: \agcode{"let"}, \agcode{".EQ."}, +\agcode{"<>"}, \agcode{"/*"}. The +rules for the use of keyword strings are given in Chapter 8. + +Keywords are a powerful and useful tool, but they are not completely +consistent with the assumptions underlying the use of context-free +grammars. Thus under certain circumstances it is possible for a +parser built from a grammar that uses keywords to fail +% XXX s/parse correctly/correctly parse/ +to parse correctly text that +appears to be consistent with the grammar. Such a situation is called a +\index{Keyword anomalies}\index{Anomaly}keyword anomaly. +AnaGram detects and diagnoses keyword anomalies in your grammar, and +provides tools to help you find and correct the problem. Many such +apparent anomalies, however, are completely innocuous. Keyword +anomalies are discussed in greater detail in Chapter 7. + +\paragraph{Virtual Productions.} +\index{Virtual productions}\index{Production} +Virtual productions are a powerful, shorthand notation for specifying +optional or repetitive input. Using a virtual production can often +spare you the trouble of writing out four or five regular productions +and naming them. + +Most of the notation for virtual productions should be familiar since +it has been used in programming manuals for years. The basic forms +are these: + +\index{\agcode{[]}} +\index{\_opb\_clb} % {}, apparently. XXX? +\index{\agcode{...}}\index{Ellipsis} +\index{\agcode{?}} +\index{\agcode{?...}} +\index{\agcode{/...}} + +% XXX figure out how to not have to encode the line breaks manually. +% I think we need one of the fancier extension forms of tabular. + +\begin{tabular}[t]{ll} + +\agcode{[]}&enclosing a list of one or more grammar rules separated +by \agcode{|} characters denotes an optional\\ +\phantom{}&choice of one of the rules.\\ + +\agcode{\bra \ket}&enclosing a list of one or more grammar rules separated by +\agcode{|} characters indicates a required\\ +\phantom{}&choice of one of the rules.\\ + +\agcode{...}&following a token, or following \agcode{[]} or \agcode{\bra \ket}, +indicates arbitrary repetition of the previous choice or\\ +\phantom{}&token.\\ + +\agcode{?}&following a token or set expression makes the token optional.\\ + +\agcode{?...}&following a token or set expression denotes zero or more +repetitions of the token.\\ + +\agcode{/...}&following \agcode{[]} or \agcode{\bra \ket} indicates arbitrary +repetition of the choice subject to the constraint that\\ +\phantom{}&choices must alternate.\\ + +\end{tabular} + +\index{Production}\index{Virtual productions}Virtual productions +are simply syntactic devices. When AnaGram encounters a virtual +production it translates the virtual production into conventional +productions which show up in your grammar tables. Virtual productions +are described in Chapter 8. + +\paragraph{Definition Statements.} +\index{Statement}\index{Definition statement} +AnaGram provides a mechanism for naming particular character sets, +keywords or virtual productions in your syntax file. The name you +define can be used freely in place of the thing defined. Definition +statements have the form: + +% XXX why the hell is the first line offset by a small amount? +\begin{indentingcode}{0.4in} +name = \codemeta{character set} +name = \codemeta{virtual production} +name = \codemeta{keyword} +name = \codemeta{immediate action} +name = \codemeta{token name} +\end{indentingcode} + +The name may be any name acceptable to AnaGram. The name can then be +used anywhere you might have used the expression on the right side. +For example: + +\begin{indentingcode}{0.4in} +upper case letter = 'A-Z' +lower case letter = 'a-z' +letter = upper case letter + lower case letter +statement list = statement?... +while = "WHILE" +dos eof = \^{}Z +\end{indentingcode} + +\paragraph{Disregard Statements.} +\index{Disregard statement}\index{Statement} +If you wish your parser generally to skip over certain characters or +constructs in its input, such as blanks, newlines, or comments, you +may use disregard statements to specify precisely what is to +be skipped and under what circumstances. You may use any number of +disregard statements and you may cause your parser to disregard any +number of tokens, of whatever complexity. Any \agparam{disregard} +statements must +be placed in a configuration section. They may appear anywhere in +your syntax file. + +The format of the \agparam{disregard} statement is as follows: + +\begin{indentingcode}{0.4in} +disregard \codemeta{token name} +\end{indentingcode} + +The rules for using disregard statements are given in Chapter 8. + +\paragraph{Lexeme Statements.} +\index{Lexeme statement}\index{Statement} +The lexeme statement is used to fine tune the disregard statement. +Any \agparam{lexeme} statements must be placed in a configuration +section. They may appear anywhere in your syntax file. + +The format of the \agparam{lexeme} statement is as follows: + +\begin{indentingcode}{0.4in} +lexeme \bra \codemeta{token list} \ket +\end{indentingcode} + +where \textit{token list} is a list of nonterminal tokens, separated +by commas. + +The lexeme statement allows you to specify that for the purposes of +the disregard statement, the listed tokens are to be treated as +% XXX make this: indivisible lexical units +lexical units, so that the disregard statement will be inoperative +within the tokens specified. Lexeme statements are discussed in +Chapter 8. + +\paragraph{Semantically Determined Productions.} +\index{Semantically determined production}\index{Production} +Traditionally, syntax directed parsing systems have enforced a rigid +distinction between syntactic and semantic information. In these +systems, it is not possible to use the knowledge that, for example, a +symbol is a function name and not a variable name in the syntax of a +programming language. AnaGram provides semantically determined +productions as a mechanism to allow semantic information to control +syntactic analysis. + +A semantically determined production has more than one +\index{Token}\index{Reduction token}reduction token +on the left side. The reduction procedure determines which reduction +token the parser should use. For example: + +\begin{indentingcode}{0.4in} +variable name, function name + -> symbol:s = identify{\us}symbol(s); +\end{indentingcode} + +In this situation, the parser will set the reduction token to the +leftmost token, \agcode{variable name}, before calling the reduction +procedure. If the reduction procedure finds that \agcode{s} in fact +identifies a function name, it can change the reduction token to +\agcode{function name}. Using this capability, you can write your grammar in +terms of \agcode{variable name} and \agcode{function name}, just as +though they were syntactically distinguishable. Further discussion of +semantically determined productions may be found in Chapter 9. + +\paragraph{Event Driven Parsers.} +\index{Parser}\index{Event driven parser} +Traditionally, parsers have been constructed as subroutines which are +called to analyze a stream of input and return only when they have +finished parsing the input or have encountered an unrecoverable error. +Such a parser itself calls a subroutine to obtain each unit of input. +There are a number of circumstances where this traditional +architecture is inconvenient. The most obvious is in situations where +programs must be event-driven, as in many popular windowing systems. +A somewhat less obvious situation is where one parser, often called a +\index{Lexical scanner}\agterm{lexical scanner}, provides input to a +second. In these situations it is convenient to invert the role of the +calling and the called programs. + +AnaGram provides the capability of specifying that a parser it builds +should be event-driven. In this case, the parser consists of an +initialization procedure and an event handler. The main program calls +the initialization procedure and then calls the event handler with +each input token in turn. The event handler returns every time the +parser needs more input. Details on event-driven parsers are given in +Chapter 9. + +\paragraph{Context tracking.} +\index{Context tracking} +When you are writing a reduction procedure for a particular grammar +rule, you often need to know the value one or another of your program +variables had at the time the first token in the rule was encountered. +Examples of such variables are: + +\begin{itemize} +\item Line or column number +\item Index in an input file +\item Index into an array +\item Counters, as of symbols defined, etc. +\end{itemize} + +Such variables can be thought of as representing the ``context'' of +the rule you are reducing. Sometimes it is possible to incorporate +the values of such variables into the values of reduction tokens, but +this can become quite cumbersome. AnaGram provides a feature known as +``context tracking'' to deal with this problem. Details on context +tracking can be found in Chapter 9. + +\paragraph{Coverage Analysis.} +\index{Coverage analysis} +AnaGram has facilities to help you determine the adequacy of your test +suites. If you set appropriate configuration switches, AnaGram will +include code in your parser to count the number of times each rule in +your grammar is reduced or the number of times each reduction +procedure in your grammar is executed. You can use these counts to +determine the completeness of your testing. The use of these +facilities is discussed in Chapter 9. + +\paragraph{Immediate Actions.} +\index{Action}\index{Immediate action} +AnaGram provides for executing particular functions, called +``immediate actions'', at any point in a grammar rule. Thus, +computation does not need to wait until the rule is complete. +Immediate actions are most useful when using AnaGram to control an +interactive process. Immediate actions are discussed in Chapter 8. + +\paragraph{Grammar Trace.} +\index{Grammar Trace}\index{Window}\index{Trace} +The \agwindow{Grammar Trace} facility of AnaGram allows you to examine +the workings of your parser in detail. You can set up a +representation of the +\index{Parser state stack}\index{State stack}\index{Stack}parser state stack +and parser state as they might appear in the course of execution of +your parser. You can then examine the possible inputs and see how the +state and the state stack change in response to any input you choose. +You can back up and try other options. You can have several grammar +traces active simultaneously so you can compare the results of +different sequences of input. For any configuration of the state +stack, you can see what grammar rules the parser is in the process of +matching. + +The use of the Grammar Trace is described in Chapter 5. + +\paragraph{File Trace.} +\index{Trace}\index{File Trace}\index{Window} +The \agwindow{File Trace} is a facility which allows you to see how +your parser will work on real data. File Trace is an interactive, +interpretive parser governed by the rules in your grammar. It lets +you see precisely how your grammar parses a test file. You can see +the contents of the parse stack at any point in the parse. You can +watch the parse progress at any level of detail you choose. If you +wish, you may back up and try again. When AnaGram runs the +\agwindow{File Trace} or \agwindow{Grammar Trace} it also builds a +\index{Coverage}\index{Trace Coverage}\index{Window} +\agwindow{Trace Coverage} table. + +Note that AnaGram normally uses a number of short-cut parsing actions. +To make the parse look like a textbook parse, set the +\index{Traditional engine}\agparam{traditional engine} +\index{Configuration parameters}configuration parameter +in your syntax file. + +\paragraph{Aids to Debugging.} +AnaGram provides a number of facilities to help you debug your parser. +It provides numerous tables that summarize your grammar and a +windowing environment to assist you in inspecting and comparing this +data. To assist in identifying a conflict in your grammar it can +provide a +\index{Trace}\index{Conflict Trace}\index{Window}\agwindow{Conflict Trace}, +a pre-built grammar trace showing a sequence of input which will +trigger the conflict. If your parser encounters a +\index{Syntax error}\index{Errors}syntax error, +you can arrange to have AnaGram build an \agwindow{Error Trace}, a +pre-built grammar trace showing the sequence of input tokens that led +to the syntax error. These facilities are described in Chapter 5. + +\paragraph{Conflict Resolution.} +\index{Conflict} +In order to help you deal with conflicts in your grammar, AnaGram +provides several methods for resolving conflicts. You may include +\agterm{precedence declarations} to assist in the construction of +simple precedence grammars, you may declare tokens to be +\agparam{sticky}, or you may declare tokens to be +\agterm{subgrammar} tokens. +These techniques are described in Chapter 7.