diff doc/manual/dd.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/manual/dd.tex Sat Dec 22 17:52:45 2007 -0500 @@ -0,0 +1,1964 @@ +\chapter{Programming With AnaGram} + +Although AnaGram has many options and features which enable you to +build a parser that meets your needs precisely, it has well-defined +defaults so that you do not generally need to learn about an option +until you need the facility it provides. The purpose of this chapter +is to show you how to use the options and features effectively. + +The options and features of AnaGram can be divided roughly into three +groups: those that control the general aspects of your parser, those +that control input to the parser and those that control error +handling. After dealing with these three groups of options and +features, this chapter concludes with a discussion of various advanced +techniques. + +Many aspects of your parser are controlled by setting configuration +parameters, either in a configuration file or in your syntax file. +This chapter presumes you are familiar with setting configuration +parameters. The names of configuration parameters, as they occur in +the text, are printed in \agparam{bold face type}. Appendix A +describes the use of configuration parameters and provides a detailed +discussion of each configuration parameter. + + +\section{General Aspects} + +\subsection{Program Development} + +The first step in writing a program is to write a grammar in AnaGram +notation which describes the input the program expects. The file +containing the grammar, called the syntax file, conventionally has the +extension \agfile{.syn}. You could also make up a few sample input +files at this time, but it is not necessary to write reduction +procedures at this stage. + +Run AnaGram and use the \index{Analyze Grammar}Analyze Grammar command +to create parse tables. 
If there are syntax errors in the grammar at +this point, you will have to correct them before proceeding, but you +do not necessarily have to eliminate conflicts, if there are any, at +this time. There are, however, many aids available to help you with +conflicts. These aids are described in Chapters 5 through 7, and +somewhat more briefly in the Online Help topics. + +Once syntax errors are corrected, you can try out your grammar on the +sample input files using the File Trace facility. With File Trace, +you can see interactively just how your grammar operates on your test +files. You can also use Grammar Trace to answer ``what if'' questions +concerning input to the grammar. The Grammar Trace does not use a +test file, but rather allows you to make input choices interactively. + +At any time, you can write reduction procedures to process your input +data as its components are identified in the input stream. Each +procedure is associated with a grammar rule. The reduction procedures +will be incorporated into your parser when you create it with the +\index{Build Parser}Build Parser command. + +By default, unless you specify an input procedure, parser input will +be read from \agcode{stdin}, using the default \agcode{GET{\us}INPUT} +macro. You will probably wish to redefine \agcode{GET{\us}INPUT}, or +configure your parser to use \agparam{pointer input} or \agparam{event +driven} input. + +\subsection{The Default Parser} +\index{Parser} + +If you apply the Build Parser command to a syntax file which contains +only a grammar, with no reduction procedures and no embedded C code, +AnaGram will still produce a complete C command line program which you +can compile and run. \index{Input procedures}This parser will parse +character input from \agcode{stdin}. If the input does not satisfy +the rules of your grammar, the parser will issue a syntax error +diagnostic to \agcode{stderr} identifying the exact line and column +numbers of the error. 
If the parser should overflow its stack, it +will abort with an error message to \agcode{stderr}. If the parse is +successful, that is, if the parser succeeds in identifying the grammar +token without encountering an error, it will simply return to the +command line. + +You can extend such a simple parser, often quite effectively, by +adding only reduction procedures. If the reduction procedures write +output to \agcode{stdout}, you can produce a conventional ``filter'' +program without having to pay any attention to input handling, error +handling, or any of the other options AnaGram provides. +%CALC, in the EXAMPLES directory, is an example of such a program. + +\subsection{The Content of the Parser and Header Files} + +AnaGram creates two \index{Output files}\index{File}output files: a +parser file and a header file. \index{Parser file}\index{File}The +parser file contains the C code you need to compile and link before +you can run your parser. It begins with the \index{C +prologue}\index{Prologue}C prologue, if any, from your syntax file. +The C prologue is an optional block of \index{Embedded C}embedded C or +C++ which precedes everything else in your syntax file. Although it +can contain anything you wish, normally it is used to place +identification information, \index{Copyright notice}copyright notices, +etc., at the beginning of your parser file. If your parser uses token +types that require definition, the appropriate \agcode{\#include} +statements and definitions should be placed in the C prologue. See +``Defining Token Types'', below. + +Following the C prologue, AnaGram places a number of definitions of +variables and macros that you might need to refer to in your embedded +C, and in your reduction procedures. Not the least of these +definitions is the parser control block, described below. 
Following +these definitions, AnaGram inserts all your embedded C, in the order +in which it occurred in your syntax file. Following the embedded C +come all your reduction procedures. Finally, AnaGram adds the tables +which summarize your grammar and a parsing engine customized to your +requirements. + +The \index{Header file}\index{File}header file contains definitions +needed by your parser. These include definitions of the \index{Parser +value stack}\index{Value stack}\index{Stack}value stack type, the +input token type, the \index{Parser control block}parser control block +type, and token name enumeration constants. The definitions are +placed in a header file so that you can make them available to other +modules if necessary. + +\subsection{Naming Output Files} +\index{Output files}\index{File} + +Unless you specify otherwise, AnaGram names the parser and header +files following conventional programming practice. Both \index{File +name}\index{File name}files have the same name as your syntax file, +with extensions \agfile{.c} and \agfile{.h} respectively. These +names, however, are controlled by the configuration parameters +\index{Configuration parameters}\index{Name} +\index{Parser file name}\agparam{parser file name} and +\index{Header file name}\agparam{header file name} +respectively, so you can override AnaGram's defaults if you wish. If +you normally use C++ rather than C, for example, you might want to +include the following statement in your configuration file: + +\begin{indentingcode}{0.4in} +parser file name = "\#.cpp" +\end{indentingcode} + +When AnaGram names the parser file it substitutes the name of your +syntax file for the ``\#'' character in the file name template. + +\subsection{Compiling Your Parser} +\index{Parser} + +Although AnaGram was designed primarily with ANSI C in mind, a good +deal of care has been taken to ensure that its output is consistent +with older C compilers and with newer C++ compilers. 
If your compiler +does not support ANSI function prototypes, you should set the +\index{Old style}\index{Configuration switches}\agparam{old style} +switch in your configuration file. If you are intending to compile +your parser using a 16-bit compiler, you might want to turn on the +\index{Near functions}\index{Configuration switches}\agparam{near functions} +switch in your configuration file. If you are building a parser for +use in an embedded system, you might want to make sure the +\index{Const data}\index{Configuration switches}\agparam{const data} +configuration switch is set so that all the tables AnaGram generates +will be declared \agcode{const}. + +\subsection{Naming Your Parser} +\index{Parser} + +In the default case, AnaGram creates a main program for you. +Generally, however, you will probably want a parser function which you +can call from your own main program. You won't want AnaGram to define +\agcode{main} for you. You can stop AnaGram from defining +\agcode{main} in any of several ways: include some embedded C in your +syntax file, turn off +\index{Main program}the \index{Configuration switches}\agparam{main program} +configuration switch, or turn on either the \agparam{event driven} or +the \agparam{pointer input} switch. Since you almost always will have +some embedded C in your syntax file, you will seldom have to use the +\agparam{main program} switch. + +Normally, AnaGram simply uses the name of your syntax file to create +the name of your parser. Thus, if your syntax file is called +\agfile{ana.syn}, your parser will have the name \agcode{ana}. AnaGram +does not check the parser name for compliance with the rules of C. If +you use strange characters in your file name, you will get strange +characters in the name of your parser, and you will get unpleasant +remarks from your C compiler when you try to compile your parser. +Thus, for example, if you were to name your syntax file +\agfile{!@\#.syn}, AnaGram would call your parser \agcode{!@\#}. 
Your +compiler will doubtless choke. + +\index{Parser} +If you wish AnaGram to give your parser a name other than the file +name, you may set the +\index{Parser name}\index{Name}\index{Configuration parameters} +\agparam{parser name} +configuration parameter. Thus, to make sure your parser is called +\agcode{periwinkle} you would include the following line in a +configuration section in your syntax file: + +% Note: this is not actually required to be in double quotes. +% It'll also accept anything that's syntactically acceptable to it +% as a C data type, which also lets you give it things like +% ``periwinkle *'' that result in uncompilable code. + +\begin{indentingcode}{0.4in} +parser name = "periwinkle" +\end{indentingcode} + +Besides the parser itself, AnaGram generates a number of other +functions, variables and type definitions when it creates your parser. +All these entities are named using the parser name as the base. The +templates and their usages are as follows: + +\begin{indenting}{0.4in} +\begin{tabular}{ll} + +\index{Parser}\index{Initializer}\index{Name} +\agcode{init{\us}\$}&initializer for parser\\ + +\index{Grammar token}\index{Value} +\agcode{\${\us}value}&returns value of grammar token\\ + +\index{Parser value stack}\index{Value stack}\index{Stack} +\agcode{\${\us}vs{\us}type}&value stack type\\ + +\agcode{\${\us}it{\us}type}&input token union\\ +\agcode{\${\us}token{\us}type}&token name enumeration typedef\\ +\agcode{\${\us}\%{\us}token}&token name enumeration constants\\ +\agcode{\${\us}pcb{\us}type}&typedef of parser control block\\ + +\index{Parser control block} +\agcode{\${\us}pcb}&parser control block\\ + +\index{Rule Count} +\agcode{\${\us}nrc}&rule count table\\ + +\agcode{\${\us}nrpc}&reduction procedure count table\\ +\\ +\end{tabular} +\end{indenting} + +When AnaGram defines these entities it substitutes the parser name for +the dollar sign. 
In the token name enumeration constants it +substitutes the token name for the \index{{\us}prc}``\%'' character. +Embedded space characters are replaced with underscore characters. + +\subsection{The Parser Control Block} +\index{Parser control block} + +The complete status of a parse is kept in a structure called a +\agterm{parser control block}. As a default, AnaGram defines a parser +control block for you, and provides a macro, \index{PCB}\agcode{PCB}, +which enables you to access it simply. The name AnaGram assigns to +the parser control block is +% XXX +%\agcode{\${\us}pcb}, where as above ``\$'' is replaced with the name of +%your parser. +\agcode{\textit{$<$parser name$>$}{\us}pcb}. +If you need to refer to the parser control block from some module +other than the parser module, use an \agcode{\#include} statement to +include the header file for your parser and refer to the parser +control block by its name as above. The structure of the parser +control block is described in Appendix E. In this chapter, particular +fields will be discussed as necessary. + +Since the parser control block contains the complete status of a +parse, you may interrupt a parse and continue it later by saving and +restoring the control block. If you have multiple input streams, all +controlled by the same grammar, you may have a separate control block +for each stream. If you wish to call your parser recursively, you may +define a fresh control block for each level of recursion. To make +best use of these capabilities, you will need to declare the parser +control block yourself. This is discussed below under ``Advanced +Techniques''. + +\subsection{Calling Your Parser} +% XXX should have an example of actually calling the thing. +% XXX should also have ``terminating your parser'' or something like +% that. + +The parser function AnaGram defines is a simple function which takes +no arguments and returns no values. 
All communication with the parser +takes place via the parser control block. When your parser returns, +\index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} contains an exit +code describing the outcome of the parse. Symbols for the +exit codes are defined in the header file AnaGram generates. +\index{Exit codes}\index{Error codes}These symbols, their values, +and their meanings are: + +\index{AG{\us}RUNNING{\us}CODE} +\index{AG{\us}SUCCESS{\us}CODE} +\index{AG{\us}SYNTAX{\us}ERROR{\us}CODE} +\index{AG{\us}REDUCTION{\us}ERROR{\us}CODE} +\index{AG{\us}STACK{\us}ERROR{\us}CODE} +\index{AG{\us}SEMANTIC{\us}ERROR{\us}CODE} +\begin{indenting}{0.4in} +\begin{tabular}{lll} +\agcode{AG{\us}RUNNING{\us}CODE}&0&Parse is not yet complete\\ +\agcode{AG{\us}SUCCESS{\us}CODE}&1&Parse terminated successfully\\ +\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE}&2&Syntax error was encountered\\ +\agcode{AG{\us}REDUCTION{\us}ERROR{\us}CODE}&3&Bad reduction token encountered\\ +\agcode{AG{\us}STACK{\us}ERROR{\us}CODE}&4&Parser stack overflowed\\ +\agcode{AG{\us}SEMANTIC{\us}ERROR{\us}CODE}&5&Semantic error\\ +\\ +\end{tabular} +\end{indenting} + +Only an event driven parser will return the value +\agcode{AG{\us}RUNNING{\us}CODE}, since any other parser continues executing +until it terminates successfully or encounters an unrecoverable error. + +Syntax errors, reduction token errors, and stack errors are discussed +below under ``Error Handling''. + +% XXX: this bit belongs somewhere else +\agcode{AG{\us}SEMANTIC{\us}ERROR{\us}CODE} is a special case. It is available +for you to use in your reduction procedures to terminate a parse for +semantic reasons. +% XXX add: AnaGram will never set it itself. 
+If, in a reduction procedure, you determine that parsing should not +continue, you need only include the statement: + +\begin{indentingcode}{0.4in} +PCB.exit{\us}flag = AG{\us}SEMANTIC{\us}ERROR{\us}CODE; +\end{indentingcode} + +When your reduction procedure returns, the parse will then terminate +and the parser will return control to the calling program. + +\subsection{Parser Return Value} +\index{Value} + +If, in your grammar, there is a value assigned to the grammar token, +you may retrieve it, after the parse is complete, by calling the +parser value function, the name of which is given by +\agcode{\${\us}value} where ``\$'' is the name of your parser. +\agcode{\${\us}value} takes no arguments, and returns a value of the type +assigned to the grammar token in your syntax file. + +Although in theoretical discussions of parsing the result of the parse +is contained in the value of the grammar token, in practice, more +often than not, results are communicated to other procedures by +setting the values of global variables. Thus the value of the grammar +token is often of little interest. + +Since the parser per se takes no arguments, it is usually convenient +to write a small interface function with a calling sequence +appropriate to the problem. The interface function can then take care +of appropriate initializations, call the parser, and retrieve results. + +\subsection{Defining Token Types} + +When you add reduction procedures to your grammar, you will often find +it convenient to add type declarations for the \index{Semantic +value}\index{Token}\index{Value}semantic values of some of the tokens +in your grammar. As long as the types you use are conventional C data +types\index{Data type}\index{Token}, you don't have to do anything +special. If, however, you have used types or classes that you have +defined yourself, you need to make sure that the appropriate +definition statements precede their use in the code AnaGram generates. 
+To do this, you need to have a C prologue in your syntax file. In the +C prologue, you should place the definition statements your parser +will need, or at least an \agcode{\#include} statement that will cause +the types or classes to be defined. + +\subsection{Debugging Your Parser} + +Because the ``flow of control'' of your parser is algorithmically +derived from your grammar, debugging your parser divides into two +separate exercises: debugging your grammar, discussed in Chapter 7, +and debugging your reduction procedures. + +When debugging, it is usually a good idea to turn off the +\index{Macros}\index{Allow macros}\index{Configuration switches} +\agparam{allow macros} +switch. This switch is normally on and causes simple reduction +procedures to be implemented as macros. When you turn it off, you get +a proper function definition for each reduction procedure, so you can +put a breakpoint in any reduction procedure you choose. If the +\index{Line numbers}\index{Configuration switches} +\agparam{line numbers} switch +is on, each reduction procedure will contain a +\index{\#line}\agcode{\#line} directive to show where the reduction +procedure is found in your syntax file. Once you have acquired +confidence in your reduction procedures, you may turn the +\agparam{allow macros} switch back on for slightly improved +performance. + +If your debugger allows you to inspect entire structures, you will +find it convenient to look at the parser control block while you are +debugging. The contents of the parser control block are described in +Appendix E. + +A good way to begin debugging a new parser is to simply put a +breakpoint in each reduction procedure. Start your parser and step +through the reduction procedures one by one, verifying that they +perform as expected. After you have stepped through a reduction +procedure, turn off its breakpoint. If there are multiple paths, +leave breakpoints on the paths not taken. 
Liberal use of the assert +macro helps assure that your fixes don't break procedures you have +already tested. + +\section{Providing Input to Your Parser} +\index{Parser}\index{Input}\index{Input procedures} + +This section describes three methods for providing input to your +parser. In the first method your program calls the parser which then +requests input tokens as it needs them. It returns only when it has +completed the parse. The parser requests input tokens by invoking a +macro called \agcode{GET{\us}INPUT}, described below. + +The second method for providing input can be used when the entire +sequence of input tokens is available in memory. This method is +controlled by the \index{Pointer input}\index{Configuration +switches}\agparam{pointer input} configuration switch. It is +discussed below. + +The third method for providing input is especially convenient when +using \index{Lexical scanner}lexical scanners or multi-stage parsing. +It is controlled by the \index{Event driven}\index{Configuration +switches}\agparam{event driven} configuration switch. + +\subsection{The \agcode{GET{\us}INPUT} Macro} +\index{GET{\us}INPUT}\index{Macros} + +The default parser simply reads characters from \agcode{stdin}. It +does this by invoking a macro called \agcode{GET{\us}INPUT} every time it +needs an input character. The default definition of +\agcode{GET{\us}INPUT} is: + +\index{PCB}\index{input{\us}code} +\begin{indentingcode}{0.4in} +\#define GET{\us}INPUT (PCB.input{\us}code = getchar()) +\end{indentingcode} + +\agcode{PCB.input{\us}code} is an integer field in the parser control +block which is used to hold the current input \index{Character +codes}character code. + +By including your own definition of \agcode{GET{\us}INPUT} in your +embedded C, you override the default definition provided by AnaGram. +The only requirement for \agcode{GET{\us}INPUT} is that it store a +character in \agcode{PCB.input{\us}code}. 
Suppose you wish to make a +parser that reads characters from a file provided by the calling +program. You could include the following in your embedded C: + +\begin{indentingcode}{0.4in} +extern FILE *file; +\#define GET{\us}INPUT (PCB.input{\us}code = fgetc(file)) +\end{indentingcode} + +Now your parser, when invoked, will read characters from the specified +file instead of reading them from \agcode{stdin}. Of course, +\agcode{GET{\us}INPUT} is not constrained to reading a file or data +stream. You may implement \agcode{GET{\us}INPUT} in any manner you +choose. You may implement it as a function call, or you may choose to +define \agcode{GET{\us}INPUT} so that it expands into inline code for +faster execution. + +\subsection{Pointer Input} +\index{Pointer input}\index{Input procedures} + +It often happens that the data you wish to parse are already in memory +when you are ready to call the parser. While you could rewrite +\agcode{GET{\us}INPUT} to simply scan the array by incrementing a +pointer, AnaGram provides an alternative approach since this is such a +common situation. In a configuration section in your syntax file +simply turn on the \index{Pointer input}\index{Configuration +switches}\agparam{pointer input} switch. Then before you call your +parser, load \index{pointer}\index{PCB}\agcode{PCB.pointer}, the +pointer field in the parser control block, with a pointer to your +array. Assuming your parser is called \agcode{ana}, and you wish to +call an interface function with an argument consisting of a character +string, here's what you do: + +\begin{indentingcode}{0.4in} +{}[ + pointer input +] + +\bra + void ana{\us}shell(char *source{\us}text) \bra + PCB.pointer = (unsigned char *)source{\us}text; + ana(); + \ket +\ket +\end{indentingcode} + +The type of \agcode{PCB.pointer} defaults to +\agcode{unsigned char *} to +minimize difficulty with full 256-character sets. 
If your compiler is +fussy, you should use a cast, as above, when you set the value. If +your data requires more than 256 +\index{Character codes}character codes, you may still use pointer +input by using the \index{Pointer type}\index{Configuration +parameters}\agparam{pointer type} configuration parameter to change +the definition of the field in the parser control block. Normally, +the value of \agparam{pointer type} should be a C data type that +converts to integer. If \agparam{pointer type} does not convert to +integer, you must provide an +\index{INPUT{\us}CODE}\index{Macros}\agcode{INPUT{\us}CODE} macro, as +described below, to extract a token identifier. Do not change +\agparam{pointer type} to \agcode{signed char} in order to avoid the +cast in the above example. That will have the effect of making all +character codes above 127 inaccessible to your parser. + +Note that if you use pointer input your parser does not need a +\agcode{GET{\us}INPUT} macro. Parsers that use pointer input usually +run somewhat faster than those that use \agcode{GET{\us}INPUT}. +The improvement is greatest in the logic that recognizes keywords. + +\subsection{Event Driven Parsers} +\index{Event driven parser}\index{Parser} + +There are many situations where the input to a parser is developed by +an independent process and the linkage required to implement a +\agcode{GET{\us}INPUT} macro is unduly cumbersome. In these +circumstances, it is convenient to use an \agparam{event driven} +parser. With an event driven parser, you do not simply call the +parser and wait for it to finish. Instead, you call its +\index{Initializer}initializer first, and then call it each time you +have a character for it. 
The parser processes the character and +returns as soon as it needs more input, encounters an error or finds +the parse complete. You can interrogate +\index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} to determine +whether the parser can accept more input. + +To create an event driven parser, set the \index{Event +driven}\index{Configuration switches}\agparam{event driven} switch in +your syntax file. Then, to initialize the parser, call the +initialization procedure, or \index{Initializer}initializer, provided +by AnaGram. The name of this procedure is \agcode{init{\us}\$} where +``\agcode{\$}'' represents the name of your parser. If your parser is named +\agcode{ana}, the +\index{Parser}initialization procedure is named \agcode{init{\us}ana}. +To process a single character, store the character in +\index{input{\us}code}\index{PCB}\agcode{PCB.input{\us}code}, then call +\agcode{ana}. When it returns, check +\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to see if the +parser is still running. When the parse is successful, you may +retrieve the value of the grammar token, if you wish, by using the +\index{Parser value function}parser value function, in this case +\agcode{ana{\us}value}. + +As an example, let us imagine we are to write an interface function +for our parser which takes a list of string pointers, a count, and a +pointer to a location into which we may store an error flag. The +input to our parser is to be the concatenation of all the character +strings. We will set up a loop which will call the parser for all the +characters of the strings in turn. 
We will assume that the grammar +token has a value of type \agcode{double}, and that the function +returns this value: + +\begin{indentingcode}{0.4in} +{}[ + event driven +] + +\bra + double parse{\us}strings(char **ptr, int n{\us}strings, int *error) \bra + init{\us}ana(); + while (PCB.exit{\us}flag == AG{\us}RUNNING{\us}CODE \&\& + n{\us}strings--) \bra + char *p = *ptr++; + while (PCB.exit{\us}flag == AG{\us}RUNNING{\us}CODE \&\& *p) \bra + PCB.input{\us}code = *p++; + ana(); + \ket + \ket + assert(error); + *error = PCB.exit{\us}flag != AG{\us}SUCCESS{\us}CODE; + return ana{\us}value(); + \ket +\ket +\end{indentingcode} + +The purpose of this example is simply to show how to use an event +driven parser. Of course, it would be possible, as far as this example +is concerned, to concatenate the strings and use pointer input +instead. A problem sufficiently complex to \emph{require} an event +driven parser would be too complex to serve as a simple example. + +\subsection{Token Input} +\index{Token input}\index{Input procedures} + +Thus far in this chapter, we have assumed that the input to your +parser consisted of ordinary characters. There are many situations +where it is convenient to have a +\index{Preprocessor}\index{Token}preprocessor, or +\index{Lexical scanner}lexical scanner, which identifies basic tokens +and hands them over to your parser for further processing. Accepting +input from such preprocessors is discussed in the remainder of this +section. + +Sometimes preprocessors simply pass on text characters, acting as +filters to remove unwanted characters, such as white space or +comments, and to insert other text, such as macro expansions. In such +situations, there is no need to treat the preprocessor differently +from any other character source. The input methods described above +are sufficient to deal with the input provided by the preprocessor. 
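As a rough illustration of a filtering preprocessor used as an ordinary character source, the sketch below strips comments from text held in memory. Everything in it is hypothetical: the function names \agcode{filtered{\us}char} and \agcode{filter{\us}all}, the choice of ``\#'' as a comment character, and the use of a static pointer as the source are assumptions made for the example, not part of AnaGram.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical character source: text in memory, terminated by '\0'. */
static const char *src;

/* Return the next character, dropping each '#' comment up to (but not
   including) the newline that ends it. Returns EOF at end of text. */
static int filtered_char(void) {
  if (*src == '#')
    while (*src != '\0' && *src != '\n') src++;
  if (*src == '\0') return EOF;
  return (unsigned char)*src++;
}

/* In a syntax file, embedded C could then hook the filter in with
   something like:
     #define GET_INPUT (PCB.input_code = filtered_char())
   (an assumption modelled on the default GET_INPUT definition). */

/* Demonstration helper: collect the filtered stream into out. */
static size_t filter_all(const char *text, char *out, size_t cap) {
  size_t n = 0;
  int c;
  assert(cap > 0);
  src = text;
  while ((c = filtered_char()) != EOF && n + 1 < cap) out[n++] = (char)c;
  out[n] = '\0';
  return n;
}
```

The parser itself never learns that a filter is present; it simply sees a character stream with the comments already removed.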
+ +In what follows, we deal with situations where the preprocessor passes +on \index{Token number}\index{Token}\index{Number}\agterm{token +numbers} rather than character codes. The preprocessor may also pass +on token \emph{values}, which need accommodation of some sort. + +There are two principal interfacing problems to deal with. The first +has to do with identifying the tokens to your parser. The second has +to do with providing the semantic values of the tokens. +% +%If your preprocessor does not provide values with its tokens, your parser +%may use any of the input techniques described above for character input, +%the only difference being that instead of setting PCB.input{\us}code to a +%character value, you set it to the token identifier. +% +%If your preprocessor does provide token values, then you have to use either +%a GET{\us}INPUT macro, or configure your parser to be event driven. If you wish +%to use pointer input, you must provide an INPUT{\us}CODE macro. +% + +\subsection{Identifying Tokens using Predefined Token Numbers} +\index{Token}\index{Number}\index{Token number} + +If you have a pre-existing \index{Lexical scanner}lexical scanner, +written for use with some other parsing system, it probably outputs +its own set of token numbers. The most robust way of interfacing such +a lexical scanner is to include, in your syntax file, either an +\index{Enum statement}\agparam{enum} statement or a set of definition +statements +for the terminal tokens, equating +\index{Terminal token}\index{Token}terminal token names with the +numeric values output by the lexical scanner, so that AnaGram treats +them as character codes. In this situation, you simply set +\index{PCB}\index{input{\us}code}\agcode{PCB.input{\us}code} to the token +number determined by the lexical scanner. Generally, lexical scanners +written for other parsing systems expect to be called for each token. 
+Therefore, you would normally use a \agcode{GET{\us}INPUT} macro to call +the lexical scanner and provide input to your parser. +% XXX as far as I know, lex expects to call yacc, not vice versa. + +\subsection{Identifying Tokens using AnaGram's Token Numbers} + +If you are writing a new preprocessor, you have more freedom. You +could simply create a set of codes as above. On the other hand, you +can save a level of translation and make your system run faster by +providing your parser with internal token numbers directly. Here's +what you have to do. + +First, when you write your syntax file, leave all the terminal tokens +undefined. That means, of course, that you have to have a name for +each terminal token. You can't use a literal character or a number +for the token. AnaGram will generate a unique token number for each +token in your grammar. In the header file it generates, AnaGram +always provides a set of +\index{Enumeration constants}\index{Constants}enumeration constants +for all the named tokens in your grammar. The names for these +constants are controlled by the +\index{Configuration parameters}\index{Enum constant name} +\agparam{enum constant name} +parameter. (See Appendix A.) These constants normally have the form +\agcode{\textit{$<$parser name$>$}{\us}\textit{$<$token name$>$}{\us}token}. +Note that embedded space in the token name will be replaced with +underscore characters. Assume your parser is called \agcode{ana}, and +in your grammar you have a token called \agcode{integer constant}. +The enumeration constant identifying the token is then +\agcode{ana{\us}integer{\us}constant{\us}token}. 
Now, to hand off an integer +constant to your parser you write: + +\begin{indentingcode}{0.4in} +PCB.input{\us}code = ana{\us}integer{\us}constant{\us}token; +\end{indentingcode} + +\subsection{Providing Token Values} + +If your \index{Preprocessor}preprocessor provides \index{Semantic +value}\index{Token}\index{Value}semantic values for input tokens, you +must inform AnaGram by setting the +\index{Input values}\index{Configuration switches}\index{Value} +\agparam{input values} +configuration switch in your syntax file. Then, whenever you provide a +token, you must also store a value in +\index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}. +You can do this as part of your \agcode{GET{\us}INPUT} macro, or, if you +have an \agparam{event driven} parser, when you set +\index{input{\us}code}\index{PCB}\agcode{PCB.input{\us}code} prior to +calling the parser function. If you are using \index{Pointer +input}\index{Configuration switches}\agparam{pointer input}, the +pointer will presumably identify the token value. You must provide an +\index{INPUT{\us}CODE}\index{Macros}\agcode{INPUT{\us}CODE} macro to extract +the identification code from the token value. For example, if the +token value is a structure and the appropriate member field is called +\agcode{id}, you would write: + +\begin{indentingcode}{0.4in} +\#define INPUT{\us}CODE(t) (t).id +\end{indentingcode} + +Generally, the simplest way to interface the preprocessor and your +parser, when you are passing token values, is to use an event driven +parser. In this situation, the preprocessor, when it identifies a +token, simply loads the token identifier into +\agcode{PCB.input{\us}code}, loads the value into +\index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}, and calls +the parser. 
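As a rough illustration of the event driven arrangement just described, the following self-contained C sketch mimics the hand-off. Everything in it is a stand-in assumed for the example: \agcode{ana{\us}pcb{\us}type} is a cut-down imitation of a parser control block, \agcode{ana()} is a stub rather than an AnaGram-generated parser, the token numbers are placeholders, and \agcode{scan{\us}and{\us}parse()} is a hypothetical preprocessor. Only the pattern of loading \agcode{PCB.input{\us}code} and \agcode{PCB.input{\us}value} before each call to the parser is taken from the text.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Stand-ins (assumptions) for declarations AnaGram would generate for a
   parser named "ana"; the real ones come from the generated header file. */
enum { ana_integer_constant_token = 1, ana_eof_token = 2 };  /* placeholder numbers */

typedef struct {
    int  input_code;    /* token identifier supplied by the preprocessor */
    long input_value;   /* semantic value of that token */
} ana_pcb_type;

static ana_pcb_type ana_pcb;
#define PCB ana_pcb

static long ana_total;  /* hypothetical: what the stub "parser" computes */

/* Stub for the event driven parser function: consumes one token per call. */
static void ana(void) {
    if (PCB.input_code == ana_integer_constant_token)
        ana_total += PCB.input_value;
}

/* Hypothetical preprocessor: finds unsigned integers in the text and hands
   each one to the parser as a (token number, value) pair. */
static long scan_and_parse(const char *text) {
    assert(text != NULL);
    ana_total = 0;
    while (*text != '\0') {
        if (isdigit((unsigned char)*text)) {
            char *end;
            PCB.input_value = strtol(text, &end, 10);   /* load the value   */
            PCB.input_code  = ana_integer_constant_token; /* load the code  */
            ana();                                      /* call the parser  */
            text = end;
        } else {
            text++;                                     /* skip separators  */
        }
    }
    PCB.input_code  = ana_eof_token;                    /* signal end of input */
    PCB.input_value = 0;
    ana();
    return ana_total;
}
```

The point of the arrangement is that the preprocessor keeps control of the input loop: the parser is called once per token and returns as soon as it has absorbed it.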
+ +\index{Token} +If the values of your input tokens are all of the same type, you must +set the +\index{Default input type}\index{Configuration parameters} +\index{Input type}\agparam{default input type} +configuration parameter so that AnaGram can declare +\index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value} +appropriately. \index{Token type}\agparam{Default input type} will +default to \agcode{int} if you do not set it either in your configuration file +or in your syntax file. + +Some \index{Lexical scanner}lexical scanners simply provide a pointer +to the text of the token they have identified. In this situation, you +would set \agparam{default input type} to \agcode{char *}. When you +provide a token to the parser you would set \agcode{PCB.input{\us}value} +to point to the text of the token. + +If different tokens have values of different types, the situation +becomes slightly more complex. First, you must tell AnaGram about the +types of your input tokens. You do this by including a +\index{Declaration}\index{Type declarations}\agterm{type declaration} +in your syntax file. A type declaration is a token declaration +preceded by a C data type\index{Data type}\index{Token} in +parentheses. Assume that your \index{Preprocessor}preprocessor +identifies, among others, the following tokens: \agcode{name}, +\agcode{string}, \agcode{real constant}, \agcode{integer constant}, +and \agcode{unsigned constant}. You might then include the following +in your syntax file: + +\begin{indentingcode}{0.4in} +{}[ + input values +] + +(char *) name, string +(double) real constant +(long) integer constant, unsigned constant +\end{indentingcode} + +AnaGram will then create, in the parser control block, an input value +field which can accommodate any of these terminal tokens in your +grammar. 
+ +To enable you to store data into the input value field of the parser +control block, AnaGram provides a convenient macro called +\index{INPUT{\us}VALUE}\index{Macros}\agcode{INPUT{\us}VALUE} to serve as +the destination of an assignment statement. \agcode{INPUT{\us}VALUE} +takes the type of the data as a parameter. Thus one could write: + +\begin{indentingcode}{0.4in} +INPUT{\us}VALUE(char *) = text{\us}pointer; +INPUT{\us}VALUE(long) = constant{\us}value; +\end{indentingcode} + +\section{Error Handling} + +There are two classes of errors your parser needs to be able to deal +with. The first consists of \agterm{implementation errors} and the second +consists of \agterm{syntax errors}. Syntax errors arise because the input to +the parser does not conform to the definition of the language it is +designed to parse. Implementation errors arise because the programs +we write are never perfect and because the environment in which our +programs run is often something less than ideal. + +\subsection{Implementation Errors} +\index{Implementation errors}\index{Errors} + +% XXX parser stack overflow is not really an ``implementation error'' + +There are two implementation errors which your parser needs to be able +to deal with. The first is \agterm{parser stack overflow}. The +second comes from a bad \agterm{reduction token}. + +\index{Stack} +\paragraph{Stack Overflow.} +Stack overflow is an error which your parser must be able to deal +with. In general, no matter how big you make your parser stack, it is +possible for legitimate input to cause it to overflow. The size of +the stack for your parser is controlled by the configuration parameter +\agparam{parser stack size}. This parameter defaults to a value of +32. This value has been found to be adequate for ordinary usage. + +If your parser has only left recursive constructs, then there is a +maximum depth beyond which the parser stack will never grow. 
If your +parser has center recursive or right recursive productions, then no +matter how much stack space you allocate, there will always be a +syntactically correct input file which causes the stack to overflow. +This can be illustrated by the following set of C statements: + +\begin{indentingcode}{0.4in} + x = y; + x = (y); + x = ((y)); + x = (((y))); + . + . + . +\end{indentingcode} + +Each set of parentheses requires another level on the parser stack. +When this set of statements was tried with Borland C++, it ran out of +stack space at 127 sets of parentheses and diagnosed the problem as +``Expression is too complicated''. + +AnaGram determines the actual size of the parser stack by calculating +the maximum depth for left recursive constructs and adding half the +value of +\index{Parser stack size}\index{Configuration parameters}\index{Stack} +\index{Parser state stack}\index{State stack} +\agparam{parser stack size}. It then uses the larger of the calculated +value and \agparam{parser stack size} to allocate stack storage. You +may check the value actually used in your parser by inspecting the +definition of +\index{AG{\us}PARSER{\us}STACK{\us}SIZE}\agcode{AG{\us}PARSER{\us}STACK{\us}SIZE}. + +If your parser runs out of stack space, it will set +\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to +\index{AG{\us}STACK{\us}ERROR{\us}CODE}\agcode{AG{\us}STACK{\us}ERROR{\us}CODE}, invoke +the +\index{Macros}\index{PARSER{\us}STACK{\us}OVERFLOW}\agcode{PARSER{\us}STACK{\us}OVERFLOW} +macro and return to the calling program. The default definition of +this macro is: + +\begin{indentingcode}{0.4in} +\#define PARSER{\us}STACK{\us}OVERFLOW \bra fprintf(stderr, {\bs} + "{\bs}nParser stack overflow{\bs}n"); \ket +\end{indentingcode} + +If this definition is not consistent with your needs, you may provide +your own definition in any block of embedded C in your syntax file.
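A minimal sketch of supplying a definition for \agcode{PARSER{\us}STACK{\us}OVERFLOW} follows. It is only an imitation: the control block, the \agcode{push{\us}state()} stand-in for the parser's stack handling, the deliberately tiny stack size, the numeric values of the exit codes, and the handler \agcode{report{\us}overflow()} are all assumptions made for the example. What it takes from the text is the macro name and the order of events: \agcode{PCB.exit{\us}flag} is set, the macro is invoked, and the parser stops.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed stand-ins; a generated parser supplies its own versions. */
#define AG_PARSER_STACK_SIZE 4                    /* deliberately tiny */
enum { DEMO_RUNNING = 0, AG_STACK_ERROR_CODE = 1 }; /* placeholder values */

typedef struct {
    int exit_flag;
    int ssx;                          /* hypothetical stack index */
    int stack[AG_PARSER_STACK_SIZE];  /* hypothetical state stack */
} demo_pcb_type;

static demo_pcb_type demo_pcb;
#define PCB demo_pcb

static int overflow_count;            /* hypothetical bookkeeping */

/* Hypothetical handler: report in the application's own terms. */
static void report_overflow(void) {
    overflow_count++;
    fprintf(stderr, "input nested too deeply (limit %d levels)\n",
            AG_PARSER_STACK_SIZE);
}

/* Your own definition, as it might appear in a block of embedded C. */
#define PARSER_STACK_OVERFLOW { report_overflow(); }

/* Stand-in for the parser pushing a state; returns 0 on overflow. */
static int push_state(int state) {
    if (PCB.ssx >= AG_PARSER_STACK_SIZE) {
        PCB.exit_flag = AG_STACK_ERROR_CODE;
        PARSER_STACK_OVERFLOW;
        return 0;
    }
    PCB.stack[PCB.ssx++] = state;
    return 1;
}

/* Simulate input that needs `depth` stack levels; returns the exit flag. */
static int parse_nested(int depth) {
    int i;
    assert(depth >= 0);
    PCB.ssx = 0;
    PCB.exit_flag = DEMO_RUNNING;
    for (i = 0; i < depth; i++)
        if (!push_state(i)) break;
    return PCB.exit_flag;
}
```

Because the macro is invoked before the parser returns, the calling program still sees the error code in \agcode{PCB.exit{\us}flag} and can decide what to do next.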
+ +\index{Reduction token error} +\paragraph{Reduction Token Error.} +A properly functioning parser should never encounter a reduction token +error. Therefore, reduction token errors should be taken quite +seriously. The only way to cause a reduction token error in an +otherwise properly functioning parser is to set incorrectly the +reduction token for a semantically determined production. +% XXX ``to incorrectly set'' + +Before your parser calls a reduction procedure, it stores the token +number of the token to which the production would normally reduce in +\index{reduction{\us}token}\index{PCB}\agcode{PCB.reduction{\us}token}. If +the production is a semantically determined production, you may, in +your reduction procedure, change the value of +\agcode{PCB.reduction{\us}token} to one of the alternative tokens on +the left side of the production. When your reduction procedure +returns, your parser checks to verify that +\agcode{PCB.reduction{\us}token} is a valid token number for the +current state of the parser. If it is not, it sets +\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to +\index{AG{\us}REDUCTION{\us}ERROR{\us}CODE}\agcode{AG{\us}REDUCTION{\us}ERROR{\us}CODE} +and invokes +\index{REDUCTION{\us}TOKEN{\us}ERROR}\index{Macros}\agcode{REDUCTION{\us}TOKEN{\us}ERROR}. +The default definition of this macro is: + +\begin{indentingcode}{0.4in} +\#define REDUCTION{\us}TOKEN{\us}ERROR \bra fprintf(stderr,{\bs} + "{\bs}nReduction{\us}token error{\bs}n"); \ket +\end{indentingcode} + +\subsection{Syntax Errors} +\index{Syntax error}\index{Errors} + +If the input data to your parser does not conform to the rules you +have specified in your grammar, your parser will detect a syntax +error. There are two basic aspects of dealing with syntax errors: +\index{Error diagnosis}\agterm{diagnosing} the error and +\agterm{recovering} from the error, that is, restarting the parse, or +``resynchronizing'' the parser. 
+ +If you use the default settings for syntax error handling, then on +encountering a syntax error your parser will call a diagnostic +procedure which will create an error message and store a pointer to it +in +\index{Error messages}\index{error{\us}message}\index{PCB} +\agcode{PCB.error{\us}message}. +Then, it will set +\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to +\index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE} and +call a macro called +\index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR}. The +default definition of \agcode{SYNTAX{\us}ERROR} will print the error +message on \agcode{stderr}. Finally, in lieu of trying to continue +the parse, it will return to the calling program. + +AnaGram has several options which allow you to tailor diagnostic +messages to your requirements or help you to create your own. It also +provides several options for continuing the parse. + +The options available to help you diagnose errors are: + +\begin{itemize} +\item line and column tracking +\item creation of a diagnostic message +\item identification of the error frame +\end{itemize} + +\index{Numbers}\index{Lines and columns}\index{Configuration switches} +\paragraph{Line and Column Tracking.} +Your parser will automatically track lines and columns in its input if +the \agparam{lines and columns} configuration switch is on. Since +this is a common requirement, \agparam{lines and columns} defaults to +on. If you don't want your parser to spend time counting lines and +columns you should turn the switch off, thus: + +\begin{indentingcode}{0.4in} +\agcode{ +\~{}lines and columns +} +\end{indentingcode} + +Normally, if you are using a \index{Lexical scanner}lexical scanner, +you would turn lines and columns off. +% XXX: this should say *why*. + +The line and column counts are maintained in +\index{line}\index{PCB}\agcode{PCB.line} and +\index{column}\index{PCB}\agcode{PCB.column} respectively. 
+\agcode{PCB.line} and \agcode{PCB.column} are initialized with the +values of the \index{FIRST{\us}LINE}\index{Macros}\agcode{FIRST{\us}LINE} +and \index{Macros}\index{FIRST{\us}COLUMN}\agcode{FIRST{\us}COLUMN} macros +respectively. These macros provide default initial values of 1 for +both line and column numbers. To override these definitions, simply +include definitions for these macros in your syntax file. If tab +characters are encountered, they are expanded in accordance with the +\index{Tab spacing}\agparam{tab spacing} parameter. + +When your parser is executing a reduction procedure, \agcode{PCB.line} and +\agcode{PCB.column} refer to the first input character following the +rule that is being reduced. When your parser has encountered a syntax +error, and is executing your \agcode{SYNTAX{\us}ERROR} macro, +\agcode{PCB.line} and \agcode{PCB.column} refer to the erroneous input +character. + +\paragraph{Diagnostic Messages.} +If the \index{Diagnose errors}\index{Configuration switches} +\agparam{diagnose errors} switch is on, its default setting, AnaGram +will include an error diagnostic procedure in your parser. When your +parser encounters a syntax error, this procedure will create a simple +diagnostic message and store a pointer to it in +\index{error{\us}message}\index{PCB}\agcode{PCB.error{\us}message} before +your \agcode{SYNTAX{\us}ERROR} macro is executed. The default definition +of \agcode{SYNTAX{\us}ERROR} prints this message on \agcode{stderr}. + +If your parser was in a state where there was a single input character +expected or a simple named token expected, it will create a message of +the form: + +\begin{indentingcode}{0.4in} +Missing ';' +\end{indentingcode} +or +\begin{indentingcode}{0.4in} +Missing semicolon +\end{indentingcode} + +If there was more than one possible input your parser will check to +see if it can identify the erroneous input. 
If it can it will create +a message of the form: + +\begin{indentingcode}{0.4in} +Unexpected ';' +\end{indentingcode} +or +\begin{indentingcode}{0.4in} +Unexpected semicolon +\end{indentingcode} + +Otherwise, the diagnostic message will be simply: + +\begin{indentingcode}{0.4in} +Unexpected input +\end{indentingcode} + +If you do not need a diagnostic message, or choose to create your own, +you should turn \agparam{diagnose errors} off. + +% XXX Somewhere there should be a discussion of what ``creating your +% own'' would entail. + +\index{Error frame} +\paragraph{Error Frame.} +Often it is desirable to know the ``frame'' of an error, that is, what +the parser thought it was doing when it encountered the error. If, +for instance, you forget to terminate a comment in a C program, your C +compiler sees an unexpected end of file. When you look simply at the +alleged error, of course, you can't see any problem. In order to +understand the error, you need to know that the parser was trying to +find a complete comment. In this case, we can say that the comment is +the ``frame'' of the error. + +AnaGram provides an optional facility in its error diagnostic +procedure, controlled by the +\index{Error frame}\index{Configuration switches}\agparam{error frame} +switch, for identifying the frame of a syntax error. The +\agparam{diagnose errors} switch must also be on to enable the +diagnostic procedure. If you enable \agparam{error frame} in your +syntax file, AnaGram will include a procedure which will scan +backwards on the state stack looking for the frame of the error. When +it finds what appears to be the error frame, it will store the stack +index in +\index{error{\us}frame{\us}ssx}\index{PCB}\agcode{PCB.error{\us}frame{\us}ssx} and +the token number of the nonterminal token the parser was looking for +in +\index{error{\us}frame{\us}token}\index{PCB}\agcode{PCB.error{\us}frame{\us}token}. + +% +% XXX. 
Why is the discussion of ``hidden'' inside the discussion of +% ``error frame''? hidden applies to ordinary error diagnosis also. +% +% Furthermore, this discussion of error frame needs an example, or +% nobody will ever figure out how to do it. +% + +If, in your grammar, there are nonterminal tokens that are not +suitable for diagnostic use, usually because they name an intermediate +stage in the parse that means nothing to your user, you can make sure +that AnaGram ignores them in doing its analysis by declaring them as +\index{Declaration}\index{Hidden declaration}\agparam{hidden}. To +declare tokens as hidden, include a \agparam{hidden} declaration in a +configuration section. (See Chapter 8.) For instance, consider: + +\begin{indentingcode}{0.4in} +comment + -> comment head, "*/" +comment head + -> "/*" + -> comment head, \~{}end of file +{}[ hidden \bra comment head \ket ] +\end{indentingcode} + +We mark comment head as hidden, because we only wish to talk about +complete comments with our users. + +In order to use the error frame effectively in your diagnostics, you +need to have an ASCII representation of the name of the token as well +as its token number. If you turn the +\index{Token names}\index{Configuration switches}\agparam{token names} +configuration switch on in your syntax file, AnaGram will provide an +array of ASCII strings, indexed by token number, which you may use in +your diagnostics. The name of the array is created by appending +\agcode{{\us}token{\us}names} to the name of your parser. If your parser is +called \agcode{ana}, your token name array will have the name +\agcode{ana{\us}token{\us}names}. As a convenience, AnaGram +also defines a macro, +\index{TOKEN{\us}NAMES}\index{Macros}\agcode{TOKEN{\us}NAMES}, which +evaluates to the name of the token name array. 
Note that +\agparam{token names} +controls the generation of an array of ASCII strings and should not be +confused with the \agcode{typedef enum} statement in the parser header +file which provides you with a set of enumeration constants. +% XXX maybe it means the *strings* should not be confused? + +If you are tracking context, using the techniques described below, you +can use the macro +\index{ERROR{\us}CONTEXT}\index{Macros}\agcode{ERROR{\us}CONTEXT} or +\index{PERROR{\us}CONTEXT}\index{Macros}\agcode{PERROR{\us}CONTEXT} to +determine the context of the error frame token. + +\index{SYNTAX{\us}ERROR}\index{Macros} +\paragraph{SYNTAX{\us}ERROR Macro.} +When your parser finds a syntax error, it first executes any of the +diagnostic procedures described above that you have enabled, sets +\index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to +\index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE}, +and then invokes the \agcode{SYNTAX{\us}ERROR} macro. If you have not +defined \agcode{SYNTAX{\us}ERROR} it will be defined thus if you have set +\index{Lines and columns}\index{Configuration switches} +\agparam{lines and columns}: + +\begin{indentingcode}{0.4in} +\#define SYNTAX{\us}ERROR {\bs} + fprintf(stderr,"\%s,line \%d,column \%d{\bs}n", {\bs} + PCB.error{\us}message, PCB.line, PCB.column) +\end{indentingcode} + +and thus if you have not: + +\begin{indentingcode}{0.4in} +\#define SYNTAX{\us}ERROR {\bs} + fprintf(stderr, "\%s{\bs}n", PCB.error{\us}message) +\end{indentingcode} + +In most circumstances, you will probably want to write your own +\agcode{SYNTAX{\us}ERROR} macro, since this diagnostic is one your users +will see with some frequency. +% XXX yes and why exactly? is there something we have in mind better +% than just printing PCB.error_message? + +The default macro simply returns to the parser. Your macro doesn't +have to. If you wish, you could call \agcode{abort} or \agcode{exit} +directly from the macro. 
If the \agcode{SYNTAX{\us}ERROR} macro returns +control to the parser, subsequent events depend on your choices for +error recovery. + +\section{Error Recovery} +\index{Error recovery}\index{Syntax error}\index{Errors} + +Syntax errors can be caused by any of a number of problems. Some come +from simple typographic errors: the user skips a character or types +the wrong one. Others come from true errors: he types something that +might be correct in its place, but in context is totally wrong. +Usually, if your parser is reading a file, you will want to continue +parsing the input, checking for other syntax errors at the very least. +The problem with doing this is getting the parser restarted, or +``resynchronized'', in some reasonable manner. + +AnaGram provides a number of ways for your parser to recover from a +syntax error. The least graceful, of course, is simply to call +\agcode{abort} or \agcode{exit} from the \agcode{SYNTAX{\us}ERROR} macro. +If you don't do this you have several options: + +\begin{itemize} +\item error token resynchronization +\item auto resynchronization +\item simple return to calling program +\item ignore the error +\end{itemize} + +\subsection{Error Token Resynchronization} +\index{Resynchronization} + +When AnaGram builds your parser it checks to see if you have used a +token called \agcode{error} in your grammar or if you have assigned a +token name as the value of the configuration parameter +\index{Error token}\index{token}\index{Configuration parameters} +\agparam{error token}. If so, it includes a call to an error token +resynchronization procedure immediately after the invocation of +\index{SYNTAX{\us}ERROR}\agcode{SYNTAX{\us}ERROR}. The error token +resynchronization procedure works in the following way: It scans the +state stack backwards looking for the most recent state in which +\agcode{error} or the token named by \agparam{error token} was valid +input. 
It then truncates the stack to this level, and jumps to the +state indicated by the error token. It then passes over any input it +sees until it sees valid input for the state in which it finds itself. +At this point, it returns to the parser which continues as though +nothing had happened. Since this is substantially easier than it +sounds, let's look at an example. Suppose we are writing a C +compiler, and we wish to catch errors in ordinary statements. We add +the following production to our grammar: + +\begin{indentingcode}{0.4in} +statement + -> error, ';' +\end{indentingcode} + +Now, if the parser encounters a syntax error anytime while it is +parsing any statement, it will pop back to the most recent state where +it was looking for a statement, jump forward to the state indicated by +the token \agcode{error} in the new production, and then skip input +until it sees a semicolon. At this point it will continue a normal +parse. The effect of continuing at this point is to recognize and +reduce the above production, i.e., the parser will proceed as if it +had found a complete, correct ``statement''. This production could +even have a reduction procedure to do any clean-up that an error might +require. + +If you use error token resynchronization, you must identify an end of +file token to guarantee that the resynchronization procedure can +always terminate. To do this, either name your end of file token +\agcode{eof} or use the +\index{Eof token}\index{Configuration parameters}\index{Token} +\agparam{eof token} configuration parameter to specify it. + +For example, if your parser is reading conventional stream input, the +end of file will be denoted by a $-1$ value. 
You can define the end +of file token thus: + +\begin{indentingcode}{0.4in} +eof = -1 +\end{indentingcode} + +% XXX as ``finally'' means something in Java, let's change this to +% ``at last'' +On the other hand, if you have already defined a token named +\agcode{finally}, you can add the following line to any configuration +segment: + +\begin{indentingcode}{0.4in} +eof token = finally +\end{indentingcode} + +The end of file token, of course, must be a terminal token. +% XXX this is not ``of course'' to a casual observer. + +\subsection{Automatic Resynchronization} +\index{Resynchronization}\index{Automatic resynchronization} + +If you have not specified an \agcode{error} token in your syntax file, +AnaGram checks to see if you have turned on the +\index{Auto resynch}\index{Configuration switches} +\agparam{auto resynch} configuration switch. +If so, it includes a call to an automatic resynchronization procedure +immediately after the call to \agcode{SYNTAX{\us}ERROR}. The automatic +resynchronization procedure uses a heuristic based on your grammar to +get back in step with the input. To use it you need do only two +things: You need to turn on the \index{Auto resynch}\agparam{auto +resynch} switch, and you need to specify an end of file token as for +error token resynchronization, above. + +The primary advantage of the automatic resynchronization is that it is +easy to use. The disadvantage is that it turns off all reduction +procedures, so that your parser is reduced to being a syntax checker +after it encounters an error. If your grammar uses semantically +determined productions, your reduction procedures will not be invoked +so the primary reduction token will be used in all cases. + +% XXX *why* does it do this? + +\subsection{Other Ways to Continue} + +% XXX the example of ``reading input from a keyboard'' should be +% clarified to indicate that this means something like an application +% where you press F10 for the menu, not typing at a command line. 
+% +If you do not wish to use either of the above resynchronization +procedures, you still have a number of options. If your parser is +reading input from a keyboard, for instance, it is probably sufficient +to ignore bad input characters. You can do this by +resetting \index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} to +\agcode{AG{\us}RUNNING{\us}CODE} in your +\index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR} macro. +Your parser will then continue, passing over the bad input as though +it had never occurred. If you do this, you should, of course, notify +your user somehow that you're skipping a character. Issuing a beep on +the computer's speaker from the \agcode{SYNTAX{\us}ERROR} macro is +usually enough. + +If you do not wish to continue the parse, but want your main program +to continue, you need do nothing special. \agcode{PCB.exit{\us}flag} is +set to \agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE} before the \agcode{SYNTAX{\us}ERROR} macro is called. If your +macro does not change \agcode{PCB.exit{\us}flag}, when it relinquishes +control to your parser, your parser will simply return to the calling +program. The calling program can determine that the parse was +unsuccessful by inspecting \agcode{PCB.exit{\us}flag} and take whatever +action you deem appropriate. + + +\section{Advanced Techniques} + +\subsection{Semantically Determined Productions} +\index{Semantically determined production}\index{Production} + +A semantically determined production is one which has more than one +token on the left side. The reduction procedure then determines which +token has in fact been identified, using whatever criteria are +necessary. In some cases where the purpose is simply to provide +multiple syntactic options to be chosen at execution time, the +determination is made simply by interrogating a switch.
Other +situations may require a more complex determination, such as a symbol +table look-up, for instance. + +\index{Production} +The tokens on the left side of the production can be used just like +any other tokens in your grammar. Their semantic values, however, +must all be of the same \index{Data type}\index{Token}data type. + +Depending on how you have defined your grammar, it may be that +whenever any one of the tokens on the left side is syntactically +acceptable input, all the tokens on the left are syntactically +acceptable. That is, the production could reduce to any of the tokens +on the left without causing an immediate error condition. In many +circumstances, however, this is not the case. In a Pascal grammar, +for example, a semantically determined production might be used to +allow a reduction procedure to determine whether a particular +identifier is a constant identifier, a type identifier, a variable +identifier, or so on. In any particular context, only a subset of the +tokens on the left may be syntactically acceptable. + +Before your reduction procedure is called, your parser will set the +reduction token to the first token on the left side which is +syntactically correct. If you need to change this assignment you have +several options. From within your reduction procedure, you may simply +set +\index{reduction{\us}token}\index{PCB}\index{Token}\agcode{PCB.reduction{\us}token} +to the semantically correct value. For this purpose, it is convenient +to use the token name enumeration constants provided in the header +file for your parser. Note that if you select a reduction token that +is not syntactically correct, after your reduction procedure returns, +your parser will encounter a \index{Reduction token +error}\agterm{reduction token error}, described above. + +AnaGram provides several tools to help you set the reduction token +correctly. 
First, it provides a \agterm{change reduction} function +which will set the reduction token to a specified token only if the +specified token is syntactically correct. It will return a flag to +indicate the outcome: non-zero on success, zero on failure. The name +of this function is given by appending \agcode{{\us}change{\us}reduction} to +the name of your parser. Thus, if your parser is named \agcode{ana}, +the name of the function would be \agcode{ana{\us}change{\us}reduction}. In +those cases where the semantically correct reduction token is not +syntactically correct, you will want to provide error diagnostics for +your user. If you wish the parse to continue, so you can check +for further errors, you may simply return from the reduction procedure. Since the +default reduction is syntactically correct, the parse can continue as +though there had been no error. + +To simplify use of the change reduction function, AnaGram provides a macro, +\index{CHANGE{\us}REDUCTION}\index{Macros}\agcode{CHANGE{\us}REDUCTION}. +Simply call the macro with the name of the desired token as the +argument, replacing embedded blanks in the token name with +underscores. + +For example, in writing a grammar for the C language, it is quite +convenient to write the following production: + +\begin{indentingcode}{0.4in} +identifier, typedef name + -> name = check{\us}typedef(); +\end{indentingcode} + +The reduction procedure can then check the symbol table to see +whether the name that has been found is a typedef name. If so, it can +use the \agcode{CHANGE{\us}REDUCTION} macro to change the reduction token +to \agcode{typedef name} and verify that this is acceptable: + +\begin{indentingcode}{0.4in} +if (!CHANGE{\us}REDUCTION(typedef{\us}name)) diagnose{\us}error(); +\end{indentingcode} + +Note that the embedded space in the token name must be replaced with +an underscore character.
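The typedef check might be fleshed out along the following lines. This is a sketch under stated assumptions, not AnaGram's generated code: the control block, the token numbers, the \agcode{typedef{\us}ok} field that simulates syntactic acceptability, the imitation \agcode{ana{\us}change{\us}reduction()} and \agcode{CHANGE{\us}REDUCTION}, the tiny symbol table, the \agcode{name} argument to \agcode{check{\us}typedef()}, and the driver \agcode{reduce{\us}name()} are all invented for illustration. The behavior it imitates is the one described above: the default reduction token is already in place, and the change is made only if the requested token is acceptable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed stand-ins for generated declarations (placeholder numbers). */
enum { ana_identifier_token = 1, ana_typedef_name_token = 2 };

typedef struct {
    int reduction_token;
    int typedef_ok;  /* hypothetical: is "typedef name" acceptable here? */
} ana_pcb_type;

static ana_pcb_type ana_pcb;
#define PCB ana_pcb

/* Imitation of the change reduction function: non-zero on success. */
static int ana_change_reduction(int token) {
    if (token == ana_typedef_name_token && !PCB.typedef_ok)
        return 0;                      /* not syntactically correct */
    PCB.reduction_token = token;
    return 1;
}
#define CHANGE_REDUCTION(x) ana_change_reduction(ana_##x##_token)

/* Hypothetical symbol table of known typedef names. */
static const char *typedef_names[] = { "size_t", "FILE" };

static int is_typedef_name(const char *name) {
    size_t i;
    for (i = 0; i < sizeof typedef_names / sizeof typedef_names[0]; i++)
        if (strcmp(name, typedef_names[i]) == 0) return 1;
    return 0;
}

static int error_count;                /* hypothetical diagnostics hook */
static void diagnose_error(void) { error_count++; }

/* Reduction procedure: switch to "typedef name" when the table says so. */
static void check_typedef(const char *name) {
    assert(name != NULL);
    if (is_typedef_name(name)) {
        if (!CHANGE_REDUCTION(typedef_name)) diagnose_error();
    }
}

/* Driver imitating the parser: install the default token, then reduce. */
static int reduce_name(const char *name, int typedef_ok) {
    PCB.typedef_ok = typedef_ok;
    PCB.reduction_token = ana_identifier_token;  /* default choice */
    check_typedef(name);
    return PCB.reduction_token;
}
```

When the change is refused, the default token is left untouched, so the parse can continue after the diagnostic has been issued.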
+ +Under some circumstances, in your reduction procedure, you might wish +to know precisely which reduction tokens are syntactically correct. +For instance, you might wish, in an error diagnostic, to tell your +user what you expected to see. If you set the +\index{Reduction choices}\index{Configuration switches} +\agparam{reduction choices} switch, +AnaGram will include in your parser file a function which will +identify the acceptable choices for the reduction token in the current +state. The prototype of this function is: + +\begin{indentingcode}{0.4in} +int \${\us}reduction{\us}choices(int *); +\end{indentingcode} + +where ``\agcode{\$}'' represents the name of your parser. You must provide an +integer array at least as long as the maximum number +of reduction choices you might have. The function will fill the array +with the token numbers of the tokens which are acceptable in the current +state and return a count of the number of acceptable choices it found. +You can call this function from any reduction procedure. AnaGram also +provides a macro to invoke this procedure: +\index{REDUCTION{\us}CHOICES}\index{Macros}\agcode{REDUCTION{\us}CHOICES}. +For example, to provide a diagnostic which lists the acceptable +tokens, you might combine the use of the \agparam{reduction choices} +switch with the +\index{Token names}\index{Configuration switches}\agparam{token names} +switch described above: + +\begin{indentingcode}{0.4in} +int ok{\us}tokens[20], n{\us}ok{\us}tokens, i; +n{\us}ok{\us}tokens = REDUCTION{\us}CHOICES(ok{\us}tokens); +printf("Acceptable input comprises: {\bs}n"); +for (i = 0; i $<$ n{\us}ok{\us}tokens; i++) \bra + printf(" \%s{\bs}n", TOKEN{\us}NAMES[ok{\us}tokens[i]]); +\ket +\end{indentingcode} + +A semantically determined production can even be a null production.
+You can use a semantically determined null production to interrogate +the settings of parameters and control parsing accordingly: + +\begin{indentingcode}{0.4in} +condition false, condition true + -> = \bra if (condition) CHANGE{\us}REDUCTION(condition{\us}true); \ket +\end{indentingcode} + +There are numerous examples of the use of semantically determined +productions in the examples provided in the +\index{examples}\agfile{examples} directory of your AnaGram +distribution disk. +% XXX too much anaphora +% XXX s/disk// + +\subsection{Defining Parser Control Blocks} +\index{Parser control block} + +All references to the parser control block in your parser are made +using the macro \index{PCB}\agcode{PCB}. The only intrinsic +requirement on PCB is that it evaluate to an \agterm{lvalue} (see +Kernighan and Ritchie) that identifies a parser control block. The +actual access may be direct, indirect through a pointer, subscripted, +or even more complex, although if the access is too complex, the +performance of your parser could suffer. Simple indirect or +subscripted references are usually enough to enable you to build a +system with multiple parallel parsing processes. If you wish to +define \agcode{PCB} in some way other than a simple, direct access to +a compiled-in control block, you will have to declare the control +block yourself. + +When AnaGram builds a parser, it checks the status of the +\index{Declare pcb}\index{Configuration switches}\agparam{declare pcb} +configuration switch. If it is on, the default setting, AnaGram +declares a parser control block for you. AnaGram creates the name of +the parser control block variable by appending \agcode{{\us}pcb} to the +name of your parser. Thus if the name of your parser is +\agcode{ana}, the parser control block is \agcode{ana{\us}pcb}. + +In the header file AnaGram generates, a typedef statement defines the +structure of the parser control block. 
The typedef name is given by +appending \agcode{{\us}pcb{\us}type} to the name of your parser. Thus if +the name of your parser is \agcode{ana}, the type of the parser +control block is given by \agcode{ana{\us}pcb{\us}type}. Thus, when AnaGram +defines the parser control block for \agcode{ana}, it does so by +including the following two lines of code: + +\begin{indentingcode}{0.4in} +ana{\us}pcb{\us}type ana{\us}pcb; +\#define PCB ana{\us}pcb +\end{indentingcode} + +If you wish to declare the parser control block yourself, you should +turn off the \agparam{declare pcb} switch. To turn \agparam{declare +pcb} off, include the following line in a configuration segment in +your syntax file: + +\begin{indentingcode}{0.4in} +\~{}declare pcb +\end{indentingcode} + +Suppose your program needs to serve up to sixteen ``clients'', each +with its own input stream. You might turn \agparam{declare pcb} off +and declare the parser control block in the following manner: + +\begin{indentingcode}{0.4in} +ana{\us}pcb{\us}type ana{\us}pcb[16]; /* declare control blocks */ +int client; +\#define PCB ana{\us}pcb[client] /* tell parser about it */ +\end{indentingcode} + +Perhaps you need to parse a number of input streams, but you don't +know exactly how many until run time. You might make the following +declarations: + +\begin{indentingcode}{0.4in} +ana{\us}pcb{\us}type *ana{\us}pcb; /* pointer to control block */ +\#define PCB (*ana{\us}pcb) /* tell parser about it */ +\end{indentingcode} + +Note that when you declare \agcode{PCB} as a pointer, you should put +parentheses around the declaration so that your compiler codes the +indirection properly. + +There are many situations where it is convenient for a parser to be +reentrant. A parser used for evaluating formulas in a spreadsheet +program, for instance, needs to be able to call itself recursively if +it is to use natural order recalculation. 
A parser used to implement +macro substitutions may need to be recursive to deal with embedded +macros. + +Here is an example of an interface function which is designed for +recursive calls to a parser, using the definitions above: + +% XXX fix the misuse of assert? +% And use AG_SUCCESS_CODE instead of 1? +\begin{indentingcode}{0.4in} +\#include <assert.h> +\#include <stdlib.h> + +\#define PCB (*ana{\us}pcb) +ana{\us}pcb{\us}type *ana{\us}pcb; + +void do{\us}ana(void) \bra + ana{\us}pcb{\us}type *save{\us}ana = ana{\us}pcb; + ana{\us}pcb = malloc(sizeof(ana{\us}pcb{\us}type)); + if (ana{\us}pcb == NULL) \bra /* out of memory */ + ana{\us}pcb = save{\us}ana; + return; + \ket + ana(); + assert(PCB.exit{\us}flag == 1); + free(ana{\us}pcb); + ana{\us}pcb = save{\us}ana; +\ket +\end{indentingcode} + +Here is another way to accomplish the same end, this time using stack +storage rather than heap storage: + +% XXX ditto +\begin{indentingcode}{0.4in} +\#include <assert.h> + +\#define PCB (*ana{\us}pcb) +ana{\us}pcb{\us}type *ana{\us}pcb; + +void do{\us}ana(void) \bra + ana{\us}pcb{\us}type *save{\us}ana = ana{\us}pcb; + ana{\us}pcb{\us}type local{\us}pcb; + ana{\us}pcb = \&local{\us}pcb; + ana();\\ + assert(PCB.exit{\us}flag == 1); + ana{\us}pcb = save{\us}ana; +\ket +\end{indentingcode} + +% XXX and here we should discuss \agparam{reentrant parser}, too. + + +\subsection{Multi-stage Parsing} +\index{Parsing}\index{Multi-stage parsing} + +Multi-stage parsing consists of chaining together a number of parsers +in series so that each parser provides input to the following one. +Users of \agfile{lex} and \agfile{yacc} are accustomed to using +two-level parsing, since the ``\index{Lexical scanner}lexical +scanner'', or ``lexer'' they write in \agfile{lex} is really a very +simple parser whose output becomes the input to the parser written in +\agfile{yacc}. 
AnaGram has been developed so that you may use as many +levels as are appropriate to your problem, and so that, if you wish, +you may write all of the parsers in AnaGram. + +Many problems that do not lend themselves conveniently to solution +with a simple grammar can be neatly solved by using multi-stage +parsing. In many cases this is because multi-stage parsing can be +used to parse constructs that are not context-free. A first level +parser can use semantic information to decide which tokens to pass on +to the next level. Thus, a first level parser for a C compiler can +use semantic information to distinguish typedef names from variable +names. + +% XXX I believe this is referring to QPL. Nowadays there's Python... +As another example, a proprietary programming language used indents to +control its block structure. A first level parser looked only at +lines and indents, passing the text through to the second level +parser. When it encountered changes in indentation level, it inserted +block start and block end tokens as necessary. + +Using AnaGram it is extremely easy to set up multi-stage parses. +Simply configure the second level parser as an event-driven parser. +The first level parser can then hand over tokens or characters to it +as it develops them. + +The C macro preprocessor example, found in the +\index{examples}\agfile{examples} directory of your AnaGram +distribution disk, illustrates the use of multi-stage parsing. + +\subsection{Context Tracking} +\index{Context tracking} + +When you are writing a reduction procedure for a particular grammar +rule, you often need to know the value one or another of your program +variables had at the time the first token in the rule was encountered. +Examples of such variables are: + +\begin{itemize} +\item Line or column number +\item Index in an input file +\item Index into an array +\item Counters, as of symbols defined, etc. 
+\end{itemize} + +Such variables can be thought of as representing the ``context'' of +the rule you are reducing. Sometimes it is possible to incorporate +the values of such variables into the values of reduction tokens, but +this can become quite cumbersome. AnaGram provides an optional +feature known as ``context tracking'' to deal with this problem. +Here's how it works: + +First, you identify the variables which you want to track. Second, +you write a typedef statement in the \index{C prologue}C prologue of +your parser which defines a data structure with fields to accommodate +values for all of these variables. Third, you tell AnaGram what the +name of the type of your data structure is, using the +\index{Context type}\index{Configuration parameters}\agparam{context type} +configuration parameter. This causes AnaGram to add a field called +\index{PCB}\index{input{\us}context}\agcode{input{\us}context} and a stack, +the \index{Context stack}\index{Stack}\agterm{context stack}, called +\index{PCB}\index{cs}\agcode{cs}, both of the type you have specified, +to your parser control block. Fourth, you write code to gather the +context information for each input character. + +There are several ways to provide the initial context information. +You may write a +\index{GET{\us}CONTEXT}\index{Macros}\agcode{GET{\us}CONTEXT} macro which +sets the context stack variables directly. Using the +\index{CONTEXT}\index{Macros}\agcode{CONTEXT} macro defined below, and +assuming your context type has line, column and pointer fields, you +could define \agcode{GET{\us}CONTEXT} as follows: + +\begin{indentingcode}{0.4in} +\#define GET{\us}CONTEXT CONTEXT.pointer = PCB.pointer,{\bs} + CONTEXT.line = PCB.line,{\bs} + CONTEXT.column = PCB.column +\end{indentingcode} + +If you are using \agparam{pointer input}, you must write a +\agcode{GET{\us}CONTEXT} macro to save context information. 
If you use a +\index{GET{\us}INPUT}\index{Macros}\agcode{GET{\us}INPUT} macro or have an +event-driven parser, you may either store values directly into +\index{input{\us}context}\index{PCB}\agcode{PCB.input{\us}context} when you +develop the input token, or you may write a \agcode{GET{\us}CONTEXT} +macro. The macro will provide a slight increment in performance. +% XXX say why it's faster (I assume because it won't look up context +% for inputs that don't need it?) + +AnaGram provides six macros to enable you to read values in a +convenient manner from the context stack, +\index{cs}\index{PCB}\agcode{PCB.cs}. Three of these macros are +designed to be used from your parser itself, and three are available +to use from other modules. These three macros are designed for use in +your parser: + +\begin{itemize} +\item \agcode{CONTEXT} +\item \agcode{RULE{\us}CONTEXT} +\item \agcode{ERROR{\us}CONTEXT} +\end{itemize} + +These macros are defined at the beginning of your parser file, so they +may be used anywhere within your parser. + +\index{CONTEXT}\index{Macros}\agcode{CONTEXT} +can be used to read or write the current top of the context stack as +indexed by \index{PCB}\agcode{PCB.ssx}. When your parser is executing +a reduction procedure for a particular grammar rule, \agcode{CONTEXT} +will evaluate to the value of the input context as it was just before +the very first token in the rule. The definition of \agcode{CONTEXT} +is: + +\begin{indentingcode}{0.4in} +\#define CONTEXT (PCB.cs[PCB.ssx]) +\end{indentingcode} + +\index{RULE{\us}CONTEXT}\index{Macros}\agcode{RULE{\us}CONTEXT} can be used +within a reduction procedure to get the context for any element within +the rule being reduced. For example, \agcode{RULE{\us}CONTEXT[0]} is the +context of the first element in the rule, \agcode{RULE{\us}CONTEXT[1]} is +the context of the second element in the rule, and so on. +\agcode{RULE{\us}CONTEXT[0]} is exactly the same as \agcode{CONTEXT}. 
+ +% XXX There should be a way to address the context of tokens in a +% rule by the symbolic names we've bound to them. + +The definition of \agcode{RULE{\us}CONTEXT} is: + +\begin{indentingcode}{0.4in} +\#define RULE{\us}CONTEXT (\&(PCB.cs[PCB.ssx])) +\end{indentingcode} + +As an example, let us suppose that we are writing a parser to read a +parameter file for a program. Let us imagine the following statements +make up a part of our syntax file: + +\begin{indentingcode}{0.4in} +\bra + typedef struct \bra int line, column; \ket location; + \#define GET{\us}INPUT {\bs} + PCB.input{\us}code = fgetc(input{\us}file); {\bs} + PCB.input{\us}context.line = PCB.line; {\bs} + PCB.input{\us}context.column = PCB.column; +\ket +{}[ context type = location ]\\ + +parameter assignment + -> parameter name, '=', number +\end{indentingcode} + +Let us suppose that for each parameter we have stored a range of +admissible values. We have to diagnose an attempt to use an incorrect +value. We could write our diagnostic message as follows: + +\begin{indentingcode}{0.4in} +fprintf(stderr, "Bad value at line \%d, column \%d in " + "parameter assignment at line \%d, column \%d", + RULE{\us}CONTEXT[2].line, + RULE{\us}CONTEXT[2].column, + CONTEXT.line, + CONTEXT.column); +\end{indentingcode} + +This diagnostic message would give our user the exact location both of +the bad value and of the beginning of the statement that contained the +bad value. + +\index{ERROR{\us}CONTEXT}\index{Macros}\agcode{ERROR{\us}CONTEXT} can be +used within a +\index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR} macro to +find the context of an error if you have turned on the +\index{Error frame}\index{Configuration switches}\agparam{error frame} +and +\index{Diagnose errors}\index{Configuration switches} +\agparam{diagnose errors} +switches. AnaGram itself tracks context using a structure consisting +of line and column numbers. 
In case of errors such as encountering an +end of file in a comment, it uses the \agcode{ERROR{\us}CONTEXT} macro +to determine the line and column number at which the comment began. +% XXX that sounds like something AG does with your grammar, not +% what AG does reading its own input, which is what it is. rephrase... +The definition of \agcode{ERROR{\us}CONTEXT} is: + +\begin{indentingcode}{0.4in} +\#define ERROR{\us}CONTEXT (PCB.cs[PCB.error{\us}frame{\us}ssx]) +\end{indentingcode} + +Three similar macros are also available for more general use: + +\begin{itemize} +\item \index{PCONTEXT}\index{Macros}\agcode{PCONTEXT(pcb)} +\item \index{PRULE{\us}CONTEXT}\index{Macros}\agcode{PRULE{\us}CONTEXT(pcb)} +\item \index{PERROR{\us}CONTEXT}\index{Macros}\agcode{PERROR{\us}CONTEXT(pcb)} +\end{itemize} + +% XXX repeating ``modules other than'' is bad +These macros are identical in function to the corresponding macros in +the first class. The only difference is that they take the name of a +parser control block, \agcode{pcb}, as an argument so they can be used +in modules other than the parser module. AnaGram includes the +definitions for these macros in the parser header file so that they +can be used in modules other than the parser itself. Since these +macros are not specific to any one parser, the definitions are +conditional so that they will only be defined once in a given module, +even if you include header files corresponding to several parsers. +The definitions of these macros are as follows: + +\begin{indentingcode}{0.4in} +\#define PCONTEXT(pcb) (pcb.cs[pcb.ssx]) +\#define PRULE{\us}CONTEXT(pcb) (\&(pcb.cs[pcb.ssx])) +\#define PERROR{\us}CONTEXT(pcb) (pcb.cs[pcb.error{\us}frame{\us}ssx]) +\end{indentingcode} + +Note that since the context macros only make sense when called from a +reduction procedure or an error procedure, there are not many +occasions to use these macros. 
The most common situation would be +when you have compiled the bulk of the code for your reduction +procedures in a separate module. + +Remember that \agcode{PRULE{\us}CONTEXT}, because it identifies an array +rather than a value, requires a subscript. As an example, let us +rewrite the diagnostic message given above for +\agcode{RULE{\us}CONTEXT} using \agcode{PRULE{\us}CONTEXT}, assuming +that the name of our parser control block is \agcode{ana{\us}pcb}: + +\begin{indentingcode}{0.4in} +fprintf(stderr, "Bad value at line \%d, column \%d in " + "parameter assignment at line \%d, column \%d", + PRULE{\us}CONTEXT(ana{\us}pcb)[2].line, + PRULE{\us}CONTEXT(ana{\us}pcb)[2].column, + PCONTEXT(ana{\us}pcb).line, + PCONTEXT(ana{\us}pcb).column); +\end{indentingcode} + +\subsection{Coverage Analysis} +\index{Coverage analysis} + +AnaGram has simple facilities for helping you determine the adequacy +of your test suites. The +\index{Rule coverage}\index{Configuration switches} +\agparam{rule coverage} configuration switch +controls these facilities. When you set \agparam{rule coverage}, +AnaGram includes code in your parser to count the number of times the +parser identifies each rule in your grammar. AnaGram also provides +procedures you can use to write these counts to a file and accumulate +them over multiple executions of your parser. Finally, it provides a +window where you may inspect the counts to see the extent to which +your tests have covered the options in your grammar. + +To maintain the counts, AnaGram declares, at the beginning of your +parser, an integer array, whose name is created by appending +\agcode{{\us}nrc} to the name of your parser. The array contains one +counter for each rule you have defined in your grammar. There are no +entries for the auxiliary rules that AnaGram creates to deal with set +overlaps or disregard statements. In order to identify positively all +the rules that the parser reduces, AnaGram turns off certain +optimization features in your parser. 
Therefore, a parser that has +the \agparam{rule coverage} switch enabled will run slightly slower +than one with the switch off. + +AnaGram also provides procedures to write the counts to a file and to +initialize the counts from a file. The procedures are named by +appending \agcode{{\us}write{\us}counts} and \agcode{{\us}read{\us}counts} +respectively to the name of your parser. Thus, if your parser is +called \agcode{ana}, the procedures are called +\agcode{ana{\us}write{\us}counts} and \agcode{ana{\us}read{\us}counts}. +Neither takes any arguments nor returns a value. To accumulate counts +correctly, you should include calls to the +\index{read{\us}counts}\agcode{read{\us}counts} and +\index{write{\us}counts}\agcode{write{\us}counts} procedures in your +program. A convenient way to do this is to include statements such as +the following in your main program: + +\begin{indentingcode}{0.4in} +ana{\us}read{\us}counts(); /* before calling parser */ +atexit(ana{\us}write{\us}counts); +\end{indentingcode} + +For your convenience, AnaGram defines two macros, +\index{READ{\us}COUNTS}\index{Macros}\agcode{READ{\us}COUNTS} and +\index{WRITE{\us}COUNTS}\index{Macros}\agcode{WRITE{\us}COUNTS}, in your +parser. They call the \agcode{read{\us}counts} and +\agcode{write{\us}counts} procedures respectively when \agparam{rule +coverage} is set. Otherwise they are null. Thus you may code them +into your main program and they will work whether or not the +\agparam{rule coverage} switch is set. For example: + +\begin{indentingcode}{0.4in} +READ{\us}COUNTS; /* read counts if coverage enabled */ +my{\us}parser(); /* call parser */ +WRITE{\us}COUNTS; /* write updated counts */ +\end{indentingcode} + +The \agcode{write{\us}counts} procedure writes an identifier code and the +counts to a count file. 
The name of the count file is given by the +\index{Coverage file name}\index{Configuration parameters} +\agparam{coverage file name} parameter, which defaults to the same name as your +syntax file but with the extension +\index{File extension}\index{nrc}\agfile{.nrc}. The identifier code +changes each time you modify your syntax file. The +\agcode{read{\us}counts} procedure attempts to read the count file. If +it cannot find it, or the identifier code is out of date, it simply +initializes the counter array to zeroes. Otherwise, it initializes +the counter arrays to the values found in the file. + +When you run AnaGram and analyze your syntax file, if +\agparam{rule coverage} is set, AnaGram will enable the \agmenu{Rule +Coverage} option on the \agmenu{Browse} menu. If you select +\agmenu{Rule Coverage}, AnaGram will prepare a \agwindow{Rule +Coverage} window from the rule count file you select. AnaGram will +warn you if the file you selected is older than the syntax file, since +under those conditions, the coverage file might be invalid. + +The \index{Rule Coverage}\index{Window}\agwindow{Rule Coverage} window +shows the count for each rule, the rule number and the text of the +rule. It is also synched to the syntax file so that you can see the +rule in context. AnaGram also modifies the display of the +\index{Reduction Procedures}\index{Window}\agwindow{Reduction +Procedures} window so that each procedure descriptor is preceded by +the number of times it has been called. You can use this display to +verify that all your reduction procedures have been tried. + +% XXX having this paragraph here seems confusing +The \index{Trace Coverage}\index{Window}\agwindow{Trace Coverage} +window, created when you use the \agwindow{File Trace} or +\agwindow{Grammar Trace} option, provides information similar to that +provided by \agwindow{Rule Coverage}. 
The differences are these: +Optimizations are not turned off for the \agwindow{Trace Coverage}, so +that some rules of length zero or one will not be properly counted. +Also, the \agwindow{Trace Coverage} does not tell you about the +reduction procedures you have tested. + +\agwindow{File Trace} can become quite tedious to use if you have very +many semantically determined productions, so in these cases the +\agparam{rule coverage} approach can give you the information you need +more quickly. + +\subsection{Using Precedence Operators} + +The conventional syntax for arithmetic expressions used in most +programming languages can be parsed simply by reference to +\index{Operator precedence}\index{Precedence operators} +\agterm{operator precedence}. Operator precedence refers to +the rules we use to determine the order in which arithmetic operations +should be carried out. In normal usage, this means that +multiplication and division take precedence over addition and +subtraction, which in turn take precedence over comparison operations. +One can formalize this usage by assigning a numeric \index{Precedence +level}\agterm{precedence level} to each operator, so that the +operations are carried out starting with those of highest precedence +and continuing in order of declining precedence. When operators have +the same precedence level, such as addition and subtraction operators, +one can decide the order of operation to be left to right or right to +left. Operators of equal precedence which are to be evaluated left to +right are called \agterm{left associative}. Those which should be +evaluated right to left are called \agterm{right associative}. If the +nature of the operators is such that the question should never arise, +they are called \agterm{non-associative}. 
+ +AnaGram provides three declarations, +\index{Precedence declarations}\index{Left}\index{Right}\index{Nonassoc} +\agparam{left}, \agparam{right}, and \agparam{nonassoc}, which you can +use to associate precedence levels and associativity with tokens in +your grammar. The syntax of these statements is given in Chapter 8. + +When AnaGram encounters a shift-reduce \index{Conflicts}conflict in +your grammar, it looks to see if the conflict can be resolved by using +precedence and associativity rules. If so, it applies the rules to +the conflict and records the resolution in the \index{Resolved +Conflicts}\index{Window}\agwindow{Resolved Conflicts} table. + +There are two occasions where you should consider using precedence +declarations in your grammar: Where rewriting the grammar to get rid +of a conflict would obscure and complicate the grammar, and where you +wish to try to get a more compact, slightly faster parser by using +precedence rules for parsing arithmetic expressions. + +Here is an example of using precedence declarations to parse simple +arithmetic expressions: + +\begin{indentingcode}{0.4in} +unary minus = '-' +{}[ + left \bra '+', '-' \ket + left \bra '*', '/' \ket + right \bra unary minus \ket +] + +exp + -> number + -> unary minus, exp + -> exp, '+', exp + -> exp, '-', exp + -> exp, '*', exp + -> exp, '/', exp +\end{indentingcode} + +A complete working calculator grammar using this syntax, +\agfile{ffcalcx}, can be found in the +\index{examples}\agfile{examples/ffcalc} directory of your +AnaGram distribution disk. +% XXX s/disk// + +\subsection{Parser Performance} + +The parsers AnaGram generates have been engineered to provide maximum +performance subject to constraints of reliability and robustness. +There are a number of steps you may take, however, to optimize +the performance of your parser. 
+ +\paragraph{Standard Stack Frame.} If your compiler has a switch that +allows you to turn \emph{off} the standard stack frame when you +compile your parser, do so. Your parser uses a large number of very +small functions which run fastest when your compiler does not use the +standard stack frame. + +\paragraph{Error Diagnostic Features.} If your parser does not need +to diagnose errors, turn off the +\index{Diagnose errors}\index{Configuration switches} +\agparam{diagnose errors} switch. +Turn off the +\index{Lines and columns}\index{Configuration switches} +\agparam{lines and columns} switch if you don't need this information. +If your parser doesn't need a diagnostic, and halts on syntax error, +turn off the +\index{Backtrack}\index{Configuration switches}\agparam{backtrack} switch. + +\paragraph{Anti-optimization Switches.} Certain switches de-optimize +your parser for various reasons. These switches, +\index{Traditional engine}\index{Configuration switches} +\agparam{traditional engine} and +\index{Rule coverage}\index{Configuration switches} +\agparam{rule coverage}, +should be turned off once you no longer need their effects. + +\paragraph{Other Switches.} For maximum performance you should use +\index{Pointer input}\index{Configuration switches}\agparam{pointer +input}. If you can guarantee that your input will not contain +out-of-range characters or tokens, you can turn off +\index{Test range}\index{Configuration switches}\index{Range} +\agparam{test range}. +