view doc/manual/dd.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 (2007-12-22) |
\chapter{Programming With AnaGram} Although AnaGram has many options and features which enable you to build a parser that meets your needs precisely, it has well-defined defaults so that you do not generally need to learn about an option until you need the facility it provides. The purpose of this chapter is to show you how to use the options and features effectively. The options and features of AnaGram can be divided roughly into three groups: those that control the general aspects of your parser, those that control input to the parser and those that control error handling. After dealing with these three groups of options and features, this chapter concludes with a discussion of various advanced techniques. Many aspects of your parser are controlled by setting configuration parameters, either in a configuration file or in your syntax file. This chapter presumes you are familiar with setting configuration parameters. The names of configuration parameters, as they occur in the text, are printed in \agparam{bold face type}. Appendix A describes the use of configuration parameters and provides a detailed discussion of each configuration parameter. \section{General Aspects} \subsection{Program Development} The first step in writing a program is to write a grammar in AnaGram notation which describes the input the program expects. The file containing the grammar, called the syntax file, conventionally has the extension \agfile{.syn}. You could also make up a few sample input files at this time, but it is not necessary to write reduction procedures at this stage. Run AnaGram and use the \index{Analyze Grammar}Analyze Grammar command to create parse tables. If there are syntax errors in the grammar at this point, you will have to correct them before proceeding, but you do not necessarily have to eliminate conflicts, if there are any, at this time. There are, however, many aids available to help you with conflicts. 
These aids are described in Chapters 5 through 7, and somewhat more briefly in the Online Help topics. Once syntax errors are corrected, you can try out your grammar on the sample input files using the File Trace facility. With File Trace, you can see interactively just how your grammar operates on your test files. You can also use Grammar Trace to answer ``what if'' questions concerning input to the grammar. The Grammar Trace does not use a test file, but rather allows you to make input choices interactively. At any time, you can write reduction procedures to process your input data as its components are identified in the input stream. Each procedure is associated with a grammar rule. The reduction procedures will be incorporated into your parser when you create it with the \index{Build Parser}Build Parser command. By default, unless you specify an input procedure, parser input will be read from \agcode{stdin}, using the default \agcode{GET{\us}INPUT} macro. You will probably wish to redefine \agcode{GET{\us}INPUT}, or configure your parser to use \agparam{pointer input} or \agparam{event driven} input. \subsection{The Default Parser} \index{Parser} If you apply the Build Parser command to a syntax file which contains only a grammar, with no reduction procedures and no embedded C code, AnaGram will still produce a complete C command line program which you can compile and run. \index{Input procedures}This parser will parse character input from \agcode{stdin}. If the input does not satisfy the rules of your grammar, the parser will issue a syntax error diagnostic to \agcode{stderr} identifying the exact line and column numbers of the error. If the parser should overflow its stack, it will abort with an error message to \agcode{stderr}. If the parse is successful, that is if the parser succeeds in identifying the grammar token without encountering an error, it will simply return to the command line. 
You can extend such a simple parser, often quite effectively, by adding only reduction procedures. If the reduction procedures write output to \agcode{stdout}, you can produce a conventional ``filter'' program without having to pay any attention to input handling, error handling, or any of the other options AnaGram provides.
%CALC, in the EXAMPLES directory, is an example of such a program.

\subsection{The Content of the Parser and Header Files}

AnaGram creates two \index{Output files}\index{File}output files: a parser file and a header file. \index{Parser file}\index{File}The parser file contains the C code you need to compile and link before you can run your parser. It begins with the \index{C prologue}\index{Prologue}C prologue, if any, from your syntax file. The C prologue is an optional block of \index{Embedded C}embedded C or C++ which precedes everything else in your syntax file. Although it can contain anything you wish, normally it is used to place identification information, \index{Copyright notice}copyright notices, etc., at the beginning of your parser file. If your parser uses token types that require definition, the appropriate \agcode{\#include} statements and definitions should be placed in the C prologue. See ``Defining Token Types'', below.

Following the C prologue, AnaGram places a number of definitions of variables and macros that you might need to refer to in your embedded C and in your reduction procedures. Not the least of these definitions is the parser control block, described below. Following these definitions, AnaGram inserts all your embedded C, in the order in which it occurred in your syntax file. Following the embedded C come all your reduction procedures. Finally, AnaGram adds the tables which summarize your grammar and a parsing engine customized to your requirements.

The \index{Header file}\index{File}header file contains definitions needed by your parser.
These include definitions of the \index{Parser value stack}\index{Value stack}\index{Stack}value stack type, the input token type, the \index{Parser control block}parser control block type, and token name enumeration constants. The definitions are placed in a header file so that you can make them available to other modules if necessary. \subsection{Naming Output Files} \index{Output files}\index{File} Unless you specify otherwise, AnaGram names the parser and header files following conventional programming practice. Both \index{File name}\index{File name}files have the same name as your syntax file, with extensions \agfile{.c} and \agfile{.h} respectively. These names, however, are controlled by the configuration parameters \index{Configuration parameters}\index{Name} \index{Parser file name}\agparam{parser file name} and \index{Header file name}\agparam{header file name} respectively, so you can override AnaGram's defaults if you wish. If you normally use C++ rather than C, for example, you might want to include the following statement in your configuration file: \begin{indentingcode}{0.4in} parser file name = "\#.cpp" \end{indentingcode} When AnaGram names the parser file it substitutes the name of your syntax file for the ``\#'' character in the file name template. \subsection{Compiling Your Parser} \index{Parser} Although AnaGram was designed primarily with ANSI C in mind, a good deal of care has been taken to ensure that its output is consistent with older C compilers and with newer C++ compilers. If your compiler does not support ANSI function prototypes, you should set the \index{Old style}\index{Configuration switches}\agparam{old style} switch in your configuration file. If you are intending to compile your parser using a 16-bit compiler, you might want to turn on the \index{Near functions}\index{Configuration switches}\agparam{near functions} switch in your configuration file. 
If you are building a parser for use in an embedded system, you might want to make sure the \index{Const data}\index{Configuration switch}\agparam{const data} configuration switch is set so that all the tables AnaGram generates will be declared \agcode{const}.

\subsection{Naming Your Parser}
\index{Parser}

In the default case, AnaGram creates a main program for you. Generally, however, you will want a parser function which you can call from your own main program, and you will not want AnaGram to define \agcode{main} for you. You can stop AnaGram from defining \agcode{main} in any of several ways: include some embedded C in your syntax file, turn off \index{Main program}the \index{Configuration switches}\agparam{main program} configuration switch, or turn on either the \agparam{event driven} or the \agparam{pointer input} switch. Since you almost always will have some embedded C in your syntax file, you will seldom have to use the \agparam{main program} switch.

Normally, AnaGram simply uses the name of your syntax file to create the name of your parser. Thus if your syntax file is called \agfile{ana.syn} your parser will have the name \agcode{ana}. AnaGram does not check the parser name for compliance with the rules of C. If you use strange characters in your file name, you will get strange characters in the name of your parser, and you will get unpleasant remarks from your C compiler when you try to compile your parser. Thus, for example, if you were to name your syntax file \agfile{!@\#.syn}, AnaGram would call your parser \agcode{!@\#}. Your compiler will doubtless choke.

\index{Parser}
If you wish AnaGram to give your parser a name other than the file name, you may set the \index{Parser name}\index{Name}\index{Configuration parameters} \agparam{parser name} configuration parameter.
Thus, to make sure your parser is called \agcode{periwinkle} you would include the following line in a configuration section in your syntax file: % Note: this is not actually required to be in double quotes. % It'll also accept anything that's syntactically acceptable to it % as a C data type, which also lets you give it things like % ``periwinkle *'' that result in uncompilable code. \begin{indentingcode}{0.4in} parser name = "periwinkle" \end{indentingcode} Besides the parser itself, AnaGram generates a number of other functions, variables and type definitions when it creates your parser. All these entities are named using the parser name as the base. The templates and their usages are as follows: \begin{indenting}{0.4in} \begin{tabular}{ll} \index{Parser}\index{Initializer}\index{Name} \agcode{init{\us}\$}&initializer for parser\\ \index{Grammar token}\index{Value} \agcode{\${\us}value}&returns value of grammar token\\ \index{Parser value stack}\index{Value stack}\index{Stack} \agcode{\${\us}vs{\us}type}&value stack type\\ \agcode{\${\us}it{\us}type}&input token union\\ \agcode{\${\us}token{\us}type}&token name enumeration typedef\\ \agcode{\${\us}\%{\us}token}&token name enumeration constants\\ \agcode{\${\us}pcb{\us}type}&typedef of parser control block\\ \index{Parser control block} \agcode{\${\us}pcb}&parser control block\\ \index{Rule Count} \agcode{\${\us}nrc}&rule count table\\ \agcode{\${\us}nrpc}&reduction procedure count table\\ \\ \end{tabular} \end{indenting} When AnaGram defines these entities it substitutes the parser name for the dollar sign. In the token name enumeration constants it substitutes the token name for the \index{{\us}prc}``\%'' character. Embedded space characters are replaced with underscore characters. \subsection{The Parser Control Block} \index{Parser control block} The complete status of a parse is kept in a structure called a \agterm{parser control block}. 
As a default, AnaGram defines a parser control block for you, and provides a macro, \index{PCB}\agcode{PCB}, which enables you to access it simply. The name AnaGram assigns to the parser control block is % XXX %\agcode{\${\us}pcb}, where as above ``\$'' is replaced with the name of %your parser. \agcode{\textit{$<$parser name$>$}{\us}pcb}. If you need to refer to the parser control block from some module other than the parser module, use an \agcode{\#include} statement to include the header file for your parser and refer to the parser control block by its name as above. The structure of the parser control block is described in Appendix E. In this chapter, particular fields will be discussed as necessary. Since the parser control block contains the complete status of a parse, you may interrupt a parse and continue it later by saving and restoring the control block. If you have multiple input streams, all controlled by the same grammar, you may have a separate control block for each stream. If you wish to call your parser recursively, you may define a fresh control block for each level of recursion. To make best use of these capabilities, you will need to declare the parser control block yourself. This is discussed below under ``Advanced Techniques''. \subsection{Calling Your Parser} % XXX should have an example of actually calling the thing. % XXX should also have ``terminating your parser'' or something like % that. The parser function AnaGram defines is a simple function which takes no arguments and returns no values. All communication with the parser takes place via the parser control block. When your parser returns, \index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} contains an exit code describing the outcome of the parse. Symbols for the exit codes are defined in the header file AnaGram generates. 
\index{Exit codes}\index{Error codes}These symbols, their values, and their meanings are: \index{AG{\us}RUNNING{\us}CODE} \index{AG{\us}SUCCESS{\us}CODE} \index{AG{\us}SYNTAX{\us}ERROR{\us}CODE} \index{AG{\us}REDUCTION{\us}ERROR{\us}CODE} \index{AG{\us}STACK{\us}ERROR{\us}CODE} \index{AG{\us}SEMANTIC{\us}ERROR{\us}CODE} \begin{indenting}{0.4in} \begin{tabular}{lll} \agcode{AG{\us}RUNNING{\us}CODE}&0&Parse is not yet complete\\ \agcode{AG{\us}SUCCESS{\us}CODE}&1&Parse terminated successfully\\ \agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE}&2&Syntax error was encountered\\ \agcode{AG{\us}REDUCTION{\us}ERROR{\us}CODE}&3&Bad reduction token encountered\\ \agcode{AG{\us}STACK{\us}ERROR{\us}CODE}&4&Parser stack overflowed\\ \agcode{AG{\us}SEMANTIC{\us}ERROR{\us}CODE}&5&Semantic error\\ \\ \end{tabular} \end{indenting} Only an event driven parser will return the value \agcode{AG{\us}RUNNING{\us}CODE}, since any other parser continues executing until it terminates successfully or encounters an unrecoverable error. Syntax errors, reduction token errors, and stack errors are discussed below under ``Error Handling''. % XXX: this bit belongs somewhere else \agcode{AG{\us}SEMANTIC{\us}ERROR{\us}CODE} is a special case. It is available for you to use in your reduction procedures to terminate a parse for semantic reasons. % XXX add: AnaGram will never set it itself. If, in a reduction procedure, you determine that parsing should not continue, you need only include the statement: \begin{indentingcode}{0.4in} PCB.exit{\us}flag = AG{\us}SEMANTIC{\us}ERROR{\us}CODE; \end{indentingcode} When your reduction procedure returns, the parse will then terminate and the parser will return control to the calling program. 
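
As an illustration of calling the parser and inspecting the outcome, the following is a minimal sketch in C. It does not use generated code: the control block, the exit code constants and the parser function \agcode{ana} are stand-ins written out by hand so that the fragment is self-contained, and \agcode{mock{\us}outcome}, \agcode{exit{\us}message} and \agcode{run{\us}ana} are hypothetical names. With a real parser, the constants and the control block type would come from the header file AnaGram generates.

```c
#include <stdio.h>

/* Exit codes, with the values listed in the table above.  In a real
   project these come from the generated header file. */
enum {
    AG_RUNNING_CODE         = 0,
    AG_SUCCESS_CODE         = 1,
    AG_SYNTAX_ERROR_CODE    = 2,
    AG_REDUCTION_ERROR_CODE = 3,
    AG_STACK_ERROR_CODE     = 4,
    AG_SEMANTIC_ERROR_CODE  = 5
};

/* Stand-in for the generated parser control block (assumption: only
   the exit_flag field is modeled here). */
static struct { int exit_flag; } ana_pcb;
#define PCB ana_pcb

/* Stand-in for the generated parser function.  A real parser would
   consume input; this one just reports a preset outcome. */
static int mock_outcome = AG_SUCCESS_CODE;
static void ana(void) { PCB.exit_flag = mock_outcome; }

/* Translate an exit code into a message for the user. */
static const char *exit_message(int code)
{
    switch (code) {
    case AG_RUNNING_CODE:         return "parse is not yet complete";
    case AG_SUCCESS_CODE:         return "parse terminated successfully";
    case AG_SYNTAX_ERROR_CODE:    return "syntax error was encountered";
    case AG_REDUCTION_ERROR_CODE: return "bad reduction token encountered";
    case AG_STACK_ERROR_CODE:     return "parser stack overflowed";
    case AG_SEMANTIC_ERROR_CODE:  return "semantic error";
    default:                      return "unknown exit code";
    }
}

/* Call the parser and check PCB.exit_flag when it returns.
   Returns 0 on success, otherwise the exit code. */
static int run_ana(void)
{
    ana();
    if (PCB.exit_flag == AG_SUCCESS_CODE)
        return 0;
    fprintf(stderr, "ana: %s\n", exit_message(PCB.exit_flag));
    return PCB.exit_flag;
}
```

Since the parser function itself returns nothing, a small wrapper of this kind gives the calling program a conventional status value to test.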
\subsection{Parser Return Value} \index{Value} If, in your grammar, there is a value assigned to the grammar token, you may retrieve it, after the parse is complete, by calling the parser value function, the name of which is given by \agcode{\${\us}value} where ``\$'' is the name of your parser. \agcode{\${\us}value} takes no arguments, and returns a value of the type assigned to the grammar token in your syntax file. Although in theoretical discussions of parsing the result of the parse is contained in the value of the grammar token, in practice, more often than not, results are communicated to other procedures by setting the values of global variables. Thus the value of the grammar token is often of little interest. Since the parser per se takes no arguments, it is usually convenient to write a small interface function with a calling sequence appropriate to the problem. The interface function can then take care of appropriate initializations, call the parser, and retrieve results. \subsection{Defining Token Types} When you add reduction procedures to your grammar, you will often find it convenient to add type declarations for the \index{Semantic value}\index{Token}\index{Value}semantic values of some of the tokens in your grammar. As long as the types you use are conventional C data types\index{Data type}\index{Token}, you don't have to do anything special. If, however, you have used types or classes that you have defined yourself, you need to make sure that the appropriate definition statements precede their use in the code AnaGram generates. To do this, you need to have a C prologue in your syntax file. In the C prologue, you should place the definition statements your parser will need, or at least an \agcode{\#include} statement that will cause the types or classes to be defined. 
\subsection{Debugging Your Parser}

Because the ``flow of control'' of your parser is algorithmically derived from your grammar, debugging your parser divides into two separate exercises: debugging your grammar, discussed in Chapter 7, and debugging your reduction procedures.

When debugging, it is usually a good idea to turn off the \index{Macros}\index{Allow macros}\index{Configuration switches} \agparam{allow macros} switch. This switch is normally on and causes simple reduction procedures to be implemented as macros. When you turn it off, you get a proper function definition for each reduction procedure, so you can put a breakpoint in any reduction procedure you choose. If the \index{Line numbers}\index{Configuration switches} \agparam{line numbers} switch is on, each reduction procedure will contain a \index{\#line}\agcode{\#line} directive to show where the reduction procedure is found in your syntax file. Once you have acquired confidence in your reduction procedures you may turn the \agparam{allow macros} switch back on for slightly improved performance.

If your debugger allows you to inspect entire structures, you will find it convenient to look at the parser control block while you are debugging. The contents of the parser control block are described in Appendix E.

A good way to begin debugging a new parser is to simply put a breakpoint in each reduction procedure. Start your parser and step through the reduction procedures one by one, verifying that they perform as expected. After you have stepped through a reduction procedure, turn off its breakpoint. If there are multiple paths, leave breakpoints on the paths not taken. Liberal use of the \agcode{assert} macro helps ensure that your fixes don't break procedures you have already tested.

\section{Providing Input to Your Parser}
\index{Parser}\index{Input}\index{Input procedures}

This section describes three methods for providing input to your parser.
In the first method your program calls the parser which then requests input tokens as it needs them. It returns only when it has completed the parse. The parser requests input tokens by invoking a macro called \agcode{GET{\us}INPUT}, described below. The second method for providing input can be used when the entire sequence of input tokens is available in memory. This method is controlled by the \index{Pointer input}\index{Configuration switches}\agparam{pointer input} configuration switch. It is discussed below. The third method for providing input is especially convenient when using \index{Lexical scanner}lexical scanners or multi-stage parsing. It is controlled by the \index{Event driven}\index{Configuration switches}\agparam{event driven} configuration switch. \subsection{The \agcode{GET{\us}INPUT} Macro} \index{GET{\us}INPUT}\index{Macros} The default parser simply reads characters from \agcode{stdin}. It does this by invoking a macro called \agcode{GET{\us}INPUT} every time it needs an input character. The default definition of \agcode{GET{\us}INPUT} is: \index{PCB}\index{input{\us}code} \begin{indentingcode}{0.4in} \#define GET{\us}INPUT (PCB.input{\us}code = getchar()) \end{indentingcode} \agcode{PCB.input{\us}code} is an integer field in the parser control block which is used to hold the current input \index{Character codes}character code. By including your own definition of \agcode{GET{\us}INPUT} in your embedded C, you override the default definition provided by AnaGram. The only requirement for \agcode{GET{\us}INPUT} is that it store a character in \agcode{PCB.input{\us}code}. Suppose you wish to make a parser that reads characters from a file provided by the calling program. 
You could include the following in your embedded C:
\begin{indentingcode}{0.4in}
extern FILE *file;
\#define GET{\us}INPUT (PCB.input{\us}code = fgetc(file))
\end{indentingcode}
Now your parser, when invoked, will read characters from the specified file instead of reading them from \agcode{stdin}.

Of course, \agcode{GET{\us}INPUT} is not constrained to reading a file or data stream. You may implement \agcode{GET{\us}INPUT} in any manner you choose. You may implement it as a function call, or you may choose to define \agcode{GET{\us}INPUT} so that it expands into inline code for faster execution.

\subsection{Pointer Input}
\index{Pointer input}\index{Input procedures}

It often happens that the data you wish to parse are already in memory when you are ready to call the parser. While you could rewrite \agcode{GET{\us}INPUT} to simply scan the array by incrementing a pointer, AnaGram provides an alternative approach since this is such a common situation. In a configuration section in your syntax file simply turn on the \index{Pointer input}\index{Configuration switches}\agparam{pointer input} switch. Then, before you call your parser, load \index{pointer}\index{PCB}\agcode{PCB.pointer}, the pointer field in the parser control block, with a pointer to your array. Assuming your parser is called \agcode{ana}, and you wish to call an interface function with an argument consisting of a character string, here's what you do:
\begin{indentingcode}{0.4in}
{}[ pointer input ]
\bra
  void ana{\us}shell(char *source{\us}text) \bra
    PCB.pointer = (unsigned char *)source{\us}text;
    ana();
  \ket
\ket
\end{indentingcode}

The type of \agcode{PCB.pointer} defaults to \agcode{unsigned char *} to minimize difficulty with full 256-character sets. If your compiler is fussy, you should use a cast, as above, when you set the value.
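
To see why an \agcode{unsigned char *} is the safer default for a full 256-character set, the following small sketch reads the same byte through both an unsigned and a signed character pointer. It is independent of AnaGram, and the function names are hypothetical. The negative result assumes the usual 8-bit, two's complement representation of \agcode{signed char}.

```c
#include <stdio.h>

/* Character code obtained through an unsigned char pointer, the
   default type the text gives for PCB.pointer. */
static int code_via_unsigned(const char *text)
{
    const unsigned char *p = (const unsigned char *)text;
    return *p;                  /* always in the range 0..255 */
}

/* The same byte read through a signed char pointer. */
static int code_via_signed(const char *text)
{
    const signed char *p = (const signed char *)text;
    return *p;                  /* negative for bytes above 127 */
}

/* Print both readings of the first byte of text. */
static void show_codes(const char *text)
{
    printf("unsigned: %d  signed: %d\n",
           code_via_unsigned(text), code_via_signed(text));
}
```

For the byte \agcode{0xE9}, the unsigned reading is 233 while the signed reading is negative, so a parser that expects character codes from 0 through 255 would never see that code through a signed pointer.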
If your data requires more than 256 \index{Character codes}character codes, you may still use pointer input by using the \index{Pointer type}\index{Configuration parameters}\agparam{pointer type} configuration parameter to change the definition of the field in the parser control block. Normally, the value of \agparam{pointer type} should be a C data type that converts to integer. If \agparam{pointer type} does not convert to integer, you must provide an \index{INPUT{\us}CODE}\index{Macros}\agcode{INPUT{\us}CODE} macro, as described below, to extract a token identifier.

Do not change \agparam{pointer type} to \agcode{signed char} in order to avoid the cast in the above example. That will have the effect of making all character codes above 127 inaccessible to your parser.

Note that if you use pointer input your parser does not need a \agcode{GET{\us}INPUT} macro. Parsers that use pointer input usually run somewhat faster than those that use \agcode{GET{\us}INPUT}. The improvement is most pronounced in the recognition of keywords.

\subsection{Event Driven Parsers}
\index{Event driven parser}\index{Parser}

There are many situations where the input to a parser is developed by an independent process and the linkage required to implement a \agcode{GET{\us}INPUT} macro is unduly cumbersome. In these circumstances, it is convenient to use an \agparam{event driven} parser. With an event driven parser, you do not simply call the parser and wait for it to finish. Instead, you call its \index{Initializer}initializer first, and then call it each time you have a character for it. The parser processes the character and returns as soon as it needs more input, encounters an error or finds the parse complete.
You can interrogate \index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} to determine whether the parser can accept more input.

To create an event driven parser, set the \index{Event driven}\index{Configuration switches}\agparam{event driven} switch in your syntax file. Then, to initialize the parser, call the initialization procedure, or \index{Initializer}initializer, provided by AnaGram. The name of this procedure is \agcode{init{\us}\$} where ``\agcode{\$}'' represents the name of your parser. If your parser is named \agcode{ana}, the \index{Parser}initialization procedure is named \agcode{init{\us}ana}. To process a single character, store the character in \index{input{\us}code}\index{PCB}\agcode{PCB.input{\us}code}, then call \agcode{ana}. When it returns, check \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to see if the parser is still running. When the parse is successful, you may retrieve the value of the grammar token, if you wish, by using the \index{Parser value function}parser value function, in this case \agcode{ana{\us}value}.

As an example, let us imagine we are to write an interface function for our parser which takes a list of string pointers, a count, and a pointer to a location into which we may store an error flag. The input to our parser is to be the concatenation of all the character strings. We will set up a loop which will call the parser for all the characters of the strings in turn.
We will assume that the function returns the value of the grammar token, and that this value is of type \agcode{double}:
\begin{indentingcode}{0.4in}
{}[ event driven ]
\bra
  double parse{\us}strings(char **ptr, int n{\us}strings, int *error) \bra
    init{\us}ana();
    while (PCB.exit{\us}flag == AG{\us}RUNNING{\us}CODE \&\& n{\us}strings--) \bra
      char *p = *ptr++;
      while (PCB.exit{\us}flag == AG{\us}RUNNING{\us}CODE \&\& *p) \bra
        /* store the next character, then let the parser consume it */
        PCB.input{\us}code = *p++;
        ana();
      \ket
    \ket
    assert(error);
    *error = PCB.exit{\us}flag != AG{\us}SUCCESS{\us}CODE;
    return ana{\us}value();
  \ket
\ket
\end{indentingcode}

The purpose of this example is simply to show how to use an event driven parser. Of course it would be possible, as far as this example is concerned, to concatenate the strings and use pointer input instead. A problem sufficiently complex to \emph{require} an event driven parser would be too complex to serve as a simple example.

\subsection{Token Input}
\index{Token input}\index{Input procedures}

Thus far in this chapter, we have assumed that the input to your parser consisted of ordinary characters. There are many situations where it is convenient to have a \index{Preprocessor}\index{Token}preprocessor, or \index{Lexical scanner}lexical scanner, which identifies basic tokens and hands them over to your parser for further processing. Accepting input from such preprocessors is discussed in the remainder of this section.

Sometimes preprocessors simply pass on text characters, acting as filters to remove unwanted characters, such as white space or comments, and to insert other text, such as macro expansions. In such situations, there is no need to treat the preprocessor differently from any other character source. The input methods described above are sufficient to deal with the input provided by the preprocessor.
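
A filtering preprocessor of this kind can be sketched as an ordinary character source. The fragment below is only an illustration under stated assumptions: the names \agcode{src}, \agcode{raw{\us}char} and \agcode{filtered{\us}char} are hypothetical, the input is taken from a string in memory, and a comment is assumed to run from a \agcode{\#} character to the end of the line.

```c
#include <stdio.h>

/* Hypothetical in-memory character source. */
static const char *src = "";

/* Next raw character, or EOF at the end of the string. */
static int raw_char(void)
{
    return *src ? (unsigned char)*src++ : EOF;
}

/* Next character with comments removed.  A comment runs from '#' to
   the end of the line; the newline that ends it is passed through. */
static int filtered_char(void)
{
    int c = raw_char();
    if (c == '#') {
        while (c != '\n' && c != EOF)
            c = raw_char();
    }
    return c;
}

/* In the embedded C of a syntax file one might then define, following
   the form of the default definition shown earlier:

       #define GET_INPUT (PCB.input_code = filtered_char())
*/
```

The parser itself is unaware of the filtering. It simply sees a character stream from which the comments have been removed.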
In what follows, we deal with situations where the preprocessor passes on \index{Token number}\index{Token}\index{Number}\agterm{token numbers} rather than character codes. The preprocessor may also pass on token \emph{values}, which need accommodation of some sort.

There are two principal interfacing problems to deal with. The first has to do with identifying the tokens to your parser. The second has to do with providing the semantic values of the tokens.
%
%If your preprocessor does not provide values with its tokens, your parser
%may use any of the input techniques described above for character input,
%the only difference being that instead of setting PCB.input{\us}code to a
%character value, you set it to the token identifier.
%
%If your preprocessor does provide token values, then you have to use either
%a GET{\us}INPUT macro, or configure your parser to be event driven. If you wish
%to use pointer input, you must provide an INPUT{\us}CODE macro.
%

\subsection{Identifying Tokens using Predefined Token Numbers}
\index{Token}\index{Number}\index{Token number}

If you have a pre-existing \index{Lexical scanner}lexical scanner, written for use with some other parsing system, it probably outputs its own set of token numbers. The most robust way of interfacing such a lexical scanner is to include, in your syntax file, either an \index{Enum statement}\agparam{enum} statement or a set of definition statements for the terminal tokens, equating \index{Terminal token}\index{Token}terminal token names with the numeric values output by the lexical scanner, so that AnaGram treats them as character codes. In this situation, you simply set \index{PCB}\index{input{\us}code}\agcode{PCB.input{\us}code} to the token number determined by the lexical scanner.

Generally, lexical scanners written for other parsing systems expect to be called for each token.
Therefore, you would normally use a \agcode{GET{\us}INPUT} macro to call the lexical scanner and provide input to your parser. % XXX as far as I know, lex expects to call yacc, not vice versa. \subsection{Identifying Tokens using AnaGram's Token Numbers} If you are writing a new preprocessor, you have more freedom. You could simply create a set of codes as above. On the other hand, you can save a level of translation and make your system run faster by providing your parser with internal token numbers directly. Here's what you have to do. First, when you write your syntax file, leave all the terminal tokens undefined. That means, of course, that you have to have a name for each terminal token. You can't use a literal character or a number for the token. AnaGram will generate a unique token number for each token in your grammar. In the header file it generates, AnaGram always provides a set of \index{Enumeration constants}\index{Constants}enumeration constants for all the named tokens in your grammar. The names for these constants are controlled by the \index{Configuration parameters}\index{Enum constant name} \agparam{enum constant name} parameter. (See Appendix A.) These constants normally have the form \agcode{\textit{$<$parser name$>$}{\us}\textit{$<$token name$>$}{\us}token}. Note that embedded space in the token name will be replaced with underscore characters. Assume your parser is called \agcode{ana}, and in your grammar you have a token called \agcode{integer constant}. The enumeration constant identifying the token is then \agcode{ana{\us}integer{\us}constant{\us}token}. 
Now, to hand off an integer constant to your parser you write: \begin{indentingcode}{0.4in} PCB.input{\us}code = ana{\us}integer{\us}constant{\us}token; \end{indentingcode} \subsection{Providing Token Values} If your \index{Preprocessor}preprocessor provides \index{Semantic value}\index{Token}\index{Value}semantic values for input tokens, you must inform AnaGram by setting the \index{Input values}\index{Configuration switches}\index{Value} \agparam{input values} configuration switch in your syntax file. Then, whenever you provide a token, you must also store a value in \index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}. You can do this as part of your \agcode{GET{\us}INPUT} macro, or, if you have an \agparam{event driven} parser, when you set \index{input{\us}code}\index{PCB}\agcode{PCB.input{\us}code} prior to calling the parser function. If you are using \index{Pointer input}\index{Configuration switches}\agparam{pointer input}, the pointer will presumably identify the token value. You must provide an \index{INPUT{\us}CODE}\index{Macros}\agcode{INPUT{\us}CODE} macro to extract the identification code from the token value. For example, if the token value is a structure and the appropriate member field is called \agcode{id}, you would write: \begin{indentingcode}{0.4in} \#define INPUT{\us}CODE(t) (t).id \end{indentingcode} Generally, the simplest way to interface the preprocessor and your parser, when you are passing token values, is to use an event driven parser. In this situation, the preprocessor, when it identifies a token, simply loads the token identifier into \agcode{PCB.input{\us}code}, loads the value into \index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}, and calls the parser. 
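
The hand-off just described can be sketched as follows. This is not generated code. The control block, \agcode{init{\us}ana} and \agcode{ana} are hand-written stand-ins, the token \agcode{plus} and the numeric values of the token constants are hypothetical, and the input value is assumed to be a single \agcode{long} field. The stand-in parser merely adds up the values it receives, so that the effect of each call is visible.

```c
#include <ctype.h>

/* Hypothetical token numbers.  With a real parser these would be the
   enumeration constants in the generated header file. */
enum { ana_integer_constant_token = 1, ana_plus_token = 2 };
enum { AG_RUNNING_CODE = 0 };

/* Stand-in parser control block, modeling only the fields the text
   names (assumption: every input value is a long). */
static struct {
    int  input_code;
    long input_value;
    int  exit_flag;
} ana_pcb;
#define PCB ana_pcb

/* Stand-ins for the generated initializer and parser function. */
static long mock_sum;
static void init_ana(void)
{
    PCB.exit_flag = AG_RUNNING_CODE;
    mock_sum = 0;
}
static void ana(void)
{
    if (PCB.input_code == ana_integer_constant_token)
        mock_sum += PCB.input_value;
}

/* The preprocessor: identify each token, load its identifier and
   value into the control block, and call the parser. */
static void scan_and_parse(const char *text)
{
    init_ana();
    while (*text && PCB.exit_flag == AG_RUNNING_CODE) {
        if (isdigit((unsigned char)*text)) {
            long v = 0;
            while (isdigit((unsigned char)*text))
                v = v * 10 + (*text++ - '0');
            PCB.input_code  = ana_integer_constant_token;
            PCB.input_value = v;
            ana();
        } else if (*text == '+') {
            PCB.input_code  = ana_plus_token;
            PCB.input_value = 0;    /* this token carries no value */
            text++;
            ana();
        } else {
            text++;                 /* skip white space and the like */
        }
    }
}
```

The preprocessor drives the loop and the parser is called once per token, which is the arrangement the \agparam{event driven} switch is meant to support.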
\index{Token} If the values of your input tokens are all of the same type, you must set the \index{Default input type}\index{Configuration parameters} \index{Input type}\agparam{default input type} configuration parameter so that AnaGram can declare \index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value} appropriately. \index{Token type}\agparam{Default input type} will default to \agcode{int} if you do not set it either in your configuration file or in your syntax file. Some \index{Lexical scanner}lexical scanners simply provide a pointer to the text of the token they have identified. In this situation, you would set \agparam{default input type} to \agcode{char *}. When you provide a token to the parser you would set \agcode{PCB.input{\us}value} to point to the text of the token. If different tokens have values of different types, the situation becomes slightly more complex. First, you must tell AnaGram about the types of your input tokens. You do this by including a \index{Declaration}\index{Type declarations}\agterm{type declaration} in your syntax file. A type declaration is a token declaration preceded by a C data type\index{Data type}\index{Token} in parentheses. Assume that your \index{Preprocessor}preprocessor identifies, among others, the following tokens: \agcode{name}, \agcode{string}, \agcode{real constant}, \agcode{integer constant}, and \agcode{unsigned constant}. You might then include the following in your syntax file: \begin{indentingcode}{0.4in} {}[ input values ] (char *) name, string (double) real constant (long) integer constant, unsigned constant \end{indentingcode} AnaGram will then create, in the parser control block, an input value field which can accommodate any of these terminal tokens in your grammar. 
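One way to picture a single field that can hold any of these types is a C union. The sketch below illustrates only that idea. The type and member names are hypothetical, and the declaration AnaGram actually generates may well differ.

```c
#include <string.h>

/* Hypothetical union covering the types declared in the syntax file. */
typedef union {
  char *text;    /* name, string */
  double real;   /* real constant */
  long integer;  /* integer constant, unsigned constant */
} mock_input_value;

/* Hypothetical stand-in for the relevant part of a control block. */
typedef struct {
  int input_code;
  mock_input_value input_value;
} mock_pcb_type;

/* Store a string-valued token, then read it back through the same field. */
size_t store_name(mock_pcb_type *pcb, char *text) {
  pcb->input_value.text = text;
  return strlen(pcb->input_value.text);
}

/* Store an integer-valued token in the same field. */
long store_integer(mock_pcb_type *pcb, long value) {
  pcb->input_value.integer = value;
  return pcb->input_value.integer;
}
```

Because the members share storage, the field is as large as its largest member, and the token identifier tells the parser which member is meaningful.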
To enable you to store data into the input value field of the parser control block, AnaGram provides a convenient macro called \index{INPUT{\us}VALUE}\index{Macros}\agcode{INPUT{\us}VALUE} to serve as the destination of an assignment statement. \agcode{INPUT{\us}VALUE} takes the type of the data as a parameter. Thus one could write: \begin{indentingcode}{0.4in} INPUT{\us}VALUE(char *) = text{\us}pointer; INPUT{\us}VALUE(long) = constant{\us}value; \end{indentingcode} \section{Error Handling} There are two classes of errors your parser needs to be able to deal with. The first consists of \agterm{implementation errors} and the second consists of \agterm{syntax errors}. Syntax errors arise because the input to the parser does not conform to the definition of the language it is designed to parse. Implementation errors arise because the programs we write are never perfect and because the environment in which our programs run is often something less than ideal. \subsection{Implementation Errors} \index{Implementation errors}\index{Errors} % XXX parser stack overflow is not really an ``implementation error'' There are two implementation errors which your parser needs to be able to deal with. The first is \agterm{parser stack overflow}. The second comes from a bad \agterm{reduction token}. \index{Stack} \paragraph{Stack Overflow.} Stack overflow is an error which your parser must be able to deal with. In general, no matter how big you make your parser stack, it is possible for legitimate input to cause it to overflow. The size of the stack for your parser is controlled by the configuration parameter \agparam{parser stack size}. This parameter defaults to a value of 32. This value has been found to be adequate for ordinary usage. If your parser has only left recursive constructs, then there is a maximum depth beyond which the parser stack will never grow. 
If your parser has center recursive or right recursive productions, then no matter how much stack space you allocate, there will always be a syntactically correct input file which causes the stack to overflow. This can be illustrated by the following set of C statements: \begin{indentingcode}{0.4in} x = y; x = (y); x = ((y)); x = (((y))); . . . \end{indentingcode} Each set of parentheses requires another level on the parser stack. When this set of statements was tried with Borland C++, it ran out of stack space at 127 sets of parentheses and diagnosed the problem as ``Expression is too complicated''. AnaGram calculates the actual size of the parser stack by calculating the maximum depth for left recursive constructs and adding half the value of \index{Parser stack size}\index{Configuration parameters}\index{Stack} \index{Parser state stack}\index{State stack} \agparam{parser stack size}. It then uses the larger of the calculated value and \agparam{parser stack size} to allocate stack storage. You may check the value actually used in your parser by inspecting the definition of \index{AG{\us}PARSER{\us}STACK{\us}SIZE}\agcode{AG{\us}PARSER{\us}STACK{\us}SIZE}. If your parser runs out of stack space, it will set \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to \index{AG{\us}STACK{\us}ERROR{\us}CODE}\agcode{AG{\us}STACK{\us}ERROR{\us}CODE}, invoke the \index{Macros}\index{PARSER{\us}STACK{\us}OVERFLOW}\agcode{PARSER{\us}STACK{\us}OVERFLOW} macro and return to the calling program. The default definition of this macro is: \begin{indentingcode}{0.4in} \#define PARSER{\us}STACK{\us}OVERFLOW \bra fprintf(stderr, {\bs} "{\bs}nParser stack overflow{\bs}n"); \ket \end{indentingcode} % XXX ``provide your own definition'', not ``redefine'' If this definition is not consistent with your needs, you may redefine it in any block of embedded C in your syntax file. 
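As an illustration, a program that must not write to \agcode{stderr} might record the failure instead. The following is a minimal sketch, not AnaGram output. The control block, its \agcode{ssx} and \agcode{stack} fields, the function \agcode{push{\us}state}, and the numeric value given to \agcode{AG{\us}STACK{\us}ERROR{\us}CODE} are hypothetical stand-ins that mimic the behaviour described above.

```c
#define MOCK_STACK_SIZE 4       /* tiny on purpose, for the sketch */
#define AG_STACK_ERROR_CODE 4   /* placeholder; the generated header defines the real one */

/* Hypothetical stand-in for a parser control block. */
typedef struct {
  int exit_flag;
  int ssx;                      /* next free stack slot */
  int stack[MOCK_STACK_SIZE];
} mock_pcb_type;

mock_pcb_type mock_pcb;
#define PCB mock_pcb

int overflow_count = 0;         /* application-level record of failures */

/* Application-supplied definition used in place of the default fprintf. */
#define PARSER_STACK_OVERFLOW { overflow_count++; }

/* Stand-in for the parser's push: returns 0 on overflow, 1 otherwise. */
int push_state(int state) {
  if (PCB.ssx >= MOCK_STACK_SIZE) {
    PCB.exit_flag = AG_STACK_ERROR_CODE;
    PARSER_STACK_OVERFLOW;
    return 0;
  }
  PCB.stack[PCB.ssx++] = state;
  return 1;
}

/* Push nested states until the stack gives out; returns the depth reached. */
int nest_until_overflow(void) {
  int depth = 0;
  PCB.ssx = 0;
  PCB.exit_flag = 0;
  while (push_state(depth)) depth++;
  return depth;
}
```

The calling program can then inspect \agcode{PCB.exit{\us}flag} and its own record of the failure after the parser returns.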
\index{Reduction token error} \paragraph{Reduction Token Error.} A properly functioning parser should never encounter a reduction token error. Therefore, reduction token errors should be taken quite seriously. The only way to cause a reduction token error in an otherwise properly functioning parser is to set incorrectly the reduction token for a semantically determined production. % XXX ``to incorrectly set'' Before your parser calls a reduction procedure, it stores the token number of the token to which the production would normally reduce in \index{reduction{\us}token}\index{PCB}\agcode{PCB.reduction{\us}token}. If the production is a semantically determined production, you may, in your reduction procedure, change the value of \agcode{PCB.reduction{\us}token} to one of the alternative tokens on the left side of the production. When your reduction procedure returns, your parser checks to verify that \agcode{PCB.reduction{\us}token} is a valid token number for the current state of the parser. If it is not, it sets \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to \index{AG{\us}REDUCTION{\us}ERROR{\us}CODE}\agcode{AG{\us}REDUCTION{\us}ERROR{\us}CODE} and invokes \index{REDUCTION{\us}TOKEN{\us}ERROR}\index{Macros}\agcode{REDUCTION{\us}TOKEN{\us}ERROR}. The default definition of this macro is: \begin{indentingcode}{0.4in} \#define REDUCTION{\us}TOKEN{\us}ERROR \bra fprintf(stderr,{\bs} "{\bs}nReduction{\us}token error{\bs}n"); \ket \end{indentingcode} \subsection{Syntax Errors} \index{Syntax error}\index{Errors} If the input data to your parser does not conform to the rules you have specified in your grammar, your parser will detect a syntax error. There are two basic aspects of dealing with syntax errors: \index{Error diagnosis}\agterm{diagnosing} the error and \agterm{recovering} from the error, that is, restarting the parse, or ``resynchronizing'' the parser. 
If you use the default settings for syntax error handling, then on encountering a syntax error your parser will call a diagnostic procedure which will create an error message and store a pointer to it in \index{Error messages}\index{error{\us}message}\index{PCB} \agcode{PCB.error{\us}message}. Then, it will set \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to \index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE} and call a macro called \index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR}. The default definition of \agcode{SYNTAX{\us}ERROR} will print the error message on \agcode{stderr}. Finally, in lieu of trying to continue the parse, it will return to the calling program. AnaGram has several options which allow you to tailor diagnostic messages to your requirements or help you to create your own. It also provides several options for continuing the parse. The options available to help you diagnose errors are: \begin{itemize} \item line and column tracking \item creation of a diagnostic message \item identification of the error frame \end{itemize} \index{Numbers}\index{Lines and columns}\index{Configuration switches} \paragraph{Line and Column Tracking.} Your parser will automatically track lines and columns in its input if the \agparam{lines and columns} configuration switch is on. Since this is a common requirement, \agparam{lines and columns} defaults to on. If you don't want your parser to spend time counting lines and columns you should turn the switch off, thus: \begin{indentingcode}{0.4in} \agcode{ \~{}lines and columns } \end{indentingcode} Normally, if you are using a \index{Lexical scanner}lexical scanner, you would turn lines and columns off. % XXX: this should say *why*. The line and column counts are maintained in \index{line}\index{PCB}\agcode{PCB.line} and \index{column}\index{PCB}\agcode{PCB.column} respectively. 
\agcode{PCB.line} and \agcode{PCB.column} are initialized with the values of the \index{FIRST{\us}LINE}\index{Macros}\agcode{FIRST{\us}LINE} and \index{Macros}\index{FIRST{\us}COLUMN}\agcode{FIRST{\us}COLUMN} macros respectively. These macros provide default initial values of 1 for both line and column numbers. To override these definitions, simply include definitions for these macros in your syntax file. If tab characters are encountered, they are expanded in accordance with the \index{Tab spacing}\agparam{tab spacing} parameter. When your parser is executing a reduction procedure, \agcode{PCB.line} and \agcode{PCB.column} refer to the first input character following the rule that is being reduced. When your parser has encountered a syntax error, and is executing your \agcode{SYNTAX{\us}ERROR} macro, \agcode{PCB.line} and \agcode{PCB.column} refer to the erroneous input character. \paragraph{Diagnostic Messages.} If the \index{Diagnose errors}\index{Configuration switches} \agparam{diagnose errors} switch is on, its default setting, AnaGram will include an error diagnostic procedure in your parser. When your parser encounters a syntax error, this procedure will create a simple diagnostic message and store a pointer to it in \index{error{\us}message}\index{PCB}\agcode{PCB.error{\us}message} before your \agcode{SYNTAX{\us}ERROR} macro is executed. The default definition of \agcode{SYNTAX{\us}ERROR} prints this message on \agcode{stderr}. If your parser was in a state where there was a single input character expected or a simple named token expected, it will create a message of the form: \begin{indentingcode}{0.4in} Missing ';' \end{indentingcode} or \begin{indentingcode}{0.4in} Missing semicolon \end{indentingcode} If there was more than one possible input your parser will check to see if it can identify the erroneous input. 
If it can it will create a message of the form: \begin{indentingcode}{0.4in} Unexpected ';' \end{indentingcode} or \begin{indentingcode}{0.4in} Unexpected semicolon \end{indentingcode} Otherwise, the diagnostic message will be simply: \begin{indentingcode}{0.4in} Unexpected input \end{indentingcode} If you do not need a diagnostic message, or choose to create your own, you should turn \agparam{diagnose errors} off. % XXX Somewhere there should be a discussion of what ``creating your % own'' would entail. \index{Error frame} \paragraph{Error Frame.} Often it is desirable to know the ``frame'' of an error, that is, what the parser thought it was doing when it encountered the error. If, for instance, you forget to terminate a comment in a C program, your C compiler sees an unexpected end of file. When you look simply at the alleged error, of course, you can't see any problem. In order to understand the error, you need to know that the parser was trying to find a complete comment. In this case, we can say that the comment is the ``frame'' of the error. AnaGram provides an optional facility in its error diagnostic procedure, controlled by the \index{Error frame}\index{Configuration switches}\agparam{error frame} switch, for identifying the frame of a syntax error. The \agparam{diagnose errors} switch must also be on to enable the diagnostic procedure. If you enable \agparam{error frame} in your syntax file, AnaGram will include a procedure which will scan backwards on the state stack looking for the frame of the error. When it finds what appears to be the error frame, it will store the stack index in \index{error{\us}frame{\us}ssx}\index{PCB}\agcode{PCB.error{\us}frame{\us}ssx} and the token number of the nonterminal token the parser was looking for in \index{error{\us}frame{\us}token}\index{PCB}\agcode{PCB.error{\us}frame{\us}token}. % % XXX. Why is the discussion of ``hidden'' inside the discussion of % ``error frame''? hidden applies to ordinary error diagnosis also. 
% % Furthermore, this discussion of error frame needs an example, or % nobody will ever figure out how to do it. % If, in your grammar, there are nonterminal tokens that are not suitable for diagnostic use, usually because they name an intermediate stage in the parse that means nothing to your user, you can make sure that AnaGram ignores them in doing its analysis by declaring them as \index{Declaration}\index{Hidden declaration}\agparam{hidden}. To declare tokens as hidden, include a \agparam{hidden} declaration in a configuration section. (See Chapter 8.) For instance, consider: \begin{indentingcode}{0.4in} comment -> comment head, "*/" comment head -> "/*" -> comment head, \~{}end of file {}[ hidden \bra comment head \ket ] \end{indentingcode} We mark comment head as hidden, because we only wish to talk about complete comments with our users. In order to use the error frame effectively in your diagnostics, you need to have an ASCII representation of the name of the token as well as its token number. If you turn the \index{Token names}\index{Configuration switches}\agparam{token names} configuration switch on in your syntax file, AnaGram will provide an array of ASCII strings, indexed by token number, which you may use in your diagnostics. The name of the array is created by appending \agcode{{\us}token{\us}names} to the name of your parser. If your parser is called \agcode{ana}, your token name array will have the name \agcode{ana{\us}token{\us}names}. As a convenience, AnaGram also defines a macro, \index{TOKEN{\us}NAMES}\index{Macros}\agcode{TOKEN{\us}NAMES}, which evaluates to the name of the token name array. Note that \agparam{token names} controls the generation of an array of ASCII strings and should not be confused with the \agcode{typedef enum} statement in the parser header file which provides you with a set of enumeration constants. % XXX maybe it means the *strings* should not be confused? 
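As a sketch of how the error frame and the token name array might be combined in a diagnostic, consider the following. Everything here is a hand-written stand-in: the control block, the token numbers and the contents of the name array are hypothetical. In a real parser, AnaGram would supply \agcode{PCB.error{\us}message}, \agcode{PCB.error{\us}frame{\us}token} and \agcode{TOKEN{\us}NAMES}.

```c
#include <stdio.h>

/* Hypothetical stand-in for the relevant part of a control block. */
typedef struct {
  const char *error_message;  /* set by the diagnostic procedure */
  int error_frame_token;      /* nonterminal the parser was looking for */
  int line, column;
} mock_pcb_type;

/* Hypothetical token name array, indexed by token number. */
const char *mock_token_names[] = { "", "statement", "comment", "expression" };
#define TOKEN_NAMES mock_token_names

/* Format a diagnostic that names the error frame; returns its length. */
int format_diagnostic(const mock_pcb_type *pcb, char *buf, size_t size) {
  return snprintf(buf, size, "%s in %s, line %d, column %d",
                  pcb->error_message,
                  TOKEN_NAMES[pcb->error_frame_token],
                  pcb->line, pcb->column);
}
```

With the unterminated comment discussed above, such a function would report the unexpected end of file together with the fact that the parser was inside a comment.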
If you are tracking context, using the techniques described below, you can use the macro \index{ERROR{\us}CONTEXT}\index{Macros}\agcode{ERROR{\us}CONTEXT} or \index{PERROR{\us}CONTEXT}\index{Macros}\agcode{PERROR{\us}CONTEXT} to determine the context of the error frame token. \index{SYNTAX{\us}ERROR}\index{Macros} \paragraph{SYNTAX{\us}ERROR Macro.} When your parser finds a syntax error, it first executes any of the diagnostic procedures described above that you have enabled, sets \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to \index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE}, and then invokes the \agcode{SYNTAX{\us}ERROR} macro. If you have not defined \agcode{SYNTAX{\us}ERROR} it will be defined thus if you have set \index{Lines and columns}\index{Configuration switches} \agparam{lines and columns}: \begin{indentingcode}{0.4in} \#define SYNTAX{\us}ERROR {\bs} fprintf(stderr,"\%s,line \%d,column \%d{\bs}n", {\bs} PCB.error{\us}message, PCB.line, PCB.column) \end{indentingcode} and thus if you have not: \begin{indentingcode}{0.4in} \#define SYNTAX{\us}ERROR {\bs} fprintf(stderr, "\%s{\bs}n", PCB.error{\us}message) \end{indentingcode} In most circumstances, you will probably want to write your own \agcode{SYNTAX{\us}ERROR} macro, since this diagnostic is one your users will see with some frequency. % XXX yes and why exactly? is there something we have in mind better % than just printing PCB.error_message? The default macro simply returns to the parser. Your macro doesn't have to. If you wish, you could call \agcode{abort} or \agcode{exit} directly from the macro. If the \agcode{SYNTAX{\us}ERROR} macro returns control to the parser, subsequent events depend on your choices for error recovery. \section{Error Recovery} \index{Error recovery}\index{Syntax error}\index{Errors} Syntax errors can be caused by any of a number of problems. 
Some come from simple typographic errors: the user skips a character or types the wrong one. Others come from true errors: he types something that might be correct in its place, but in context is totally wrong. Usually, if your parser is reading a file, you will want to continue parsing the input, checking for other syntax errors at the very least. The problem with doing this is getting the parser restarted, or ``resynchronized'', in some reasonable manner. AnaGram provides a number of ways for your parser to recover from a syntax error. The least graceful, of course, is simply to call \agcode{abort} or \agcode{exit} from the \agcode{SYNTAX{\us}ERROR} macro. If you don't do this you have several options: \begin{itemize} \item error token resynchronization \item auto resynchronization \item simple return to calling program \item ignore the error \end{itemize} \subsection{Error Token Resynchronization} \index{Resynchronization} When AnaGram builds your parser it checks to see if you have used a token called \agcode{error} in your grammar or if you have assigned a token name as the value of the configuration parameter \index{Error token}\index{token}\index{Configuration parameters} \agparam{error token}. If so, it includes a call to an error token resynchronization procedure immediately after the invocation of \index{SYNTAX{\us}ERROR}\agcode{SYNTAX{\us}ERROR}. The error token resynchronization procedure works in the following way: It scans the state stack backwards looking for the most recent state in which \agcode{error} or the token named by \agparam{error token} was valid input. It then truncates the stack to this level, and jumps to the state indicated by the error token. It then passes over any input it sees until it sees valid input for the state in which it finds itself. At this point, it returns to the parser which continues as though nothing had happened. Since this is substantially easier than it sounds, let's look at an example. 
Suppose we are writing a C compiler, and we wish to catch errors in ordinary statements. We add the following production to our grammar: \begin{indentingcode}{0.4in} statement -> error, ';' \end{indentingcode} Now, if the parser encounters a syntax error anytime while it is parsing any statement, it will pop back to the most recent state where it was looking for a statement, jump forward to the state indicated by the token \agcode{error} in the new production, and then skip input until it sees a semicolon. At this point it will continue a normal parse. The effect of continuing at this point is to recognize and reduce the above production, i.e., the parser will proceed as if it had found a complete, correct ``statement''. This production could even have a reduction procedure to do any clean-up that an error might require. If you use error token resynchronization, you must identify an end of file token to guarantee that the resynchronization procedure can always terminate. To do this, either name your end of file token \agcode{eof} or use the \index{Eof token}\index{Configuration parameters}\index{Token} \agparam{eof token} configuration parameter to specify it. For example, if your parser is reading conventional stream input, the end of file will be denoted by a $-1$ value. You can define the end of file token thus: \begin{indentingcode}{0.4in} eof = -1 \end{indentingcode} % XXX as ``finally'' means something in Java, let's change this to % ``at last'' On the other hand, if you have already defined a token named \agcode{finally}, you can add the following line to any configuration segment: \begin{indentingcode}{0.4in} eof token = finally \end{indentingcode} The end of file token, of course, must be a terminal token. % XXX this is not ``of course'' to a casual observer. 
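The skipping step can be pictured with a small hand-written loop. This illustrates only the behaviour described above for the production \agcode{statement -> error, ';'}. It is not the procedure AnaGram generates. The function name and the use of a C string as input are assumptions made for the sketch.

```c
#include <stddef.h>

/*
 * Starting at the position of a syntax error, pass over input until the
 * synchronizing character is seen or the input ends. The terminating '\0'
 * plays the role of the end of file token, which guarantees termination.
 * Returns the position just past the synchronizing character, or the
 * position of the end of input if no synchronizing character was found.
 */
size_t skip_to_sync(const char *input, size_t error_pos, char sync) {
  size_t i = error_pos;
  while (input[i] != '\0' && input[i] != sync) i++;
  if (input[i] != '\0') i++;  /* consume the ';' that completes the rule */
  return i;
}
```

Without the end of input check the loop could run past the end of the data, which is the reason an end of file token is required.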
\subsection{Automatic Resynchronization}
\index{Resynchronization}\index{Automatic resynchronization}

If you have not specified an \agcode{error} token in your syntax file, AnaGram checks to see if you have turned on the \index{Auto resynch}\index{Configuration switches} \agparam{auto resynch} configuration switch. If so, it includes a call to an automatic resynchronization procedure immediately after the call to \agcode{SYNTAX{\us}ERROR}. The automatic resynchronization procedure uses a heuristic based on your grammar to get back in step with the input. To use it you need do only two things: You need to turn on the \index{Auto resynch}\agparam{auto resynch} switch, and you need to specify an end of file token as for error token resynchronization, above. The primary advantage of the automatic resynchronization is that it is easy to use. The disadvantage is that it turns off all reduction procedures, so that your parser is reduced to being a syntax checker after it encounters an error. If your grammar uses semantically determined productions, your reduction procedures will not be invoked so the primary reduction token will be used in all cases.
% XXX *why* does it do this?

\subsection{Other Ways to Continue}
% XXX the example of ``reading input from a keyboard'' should be
% clarified to indicate that this means something like an application
% where you press F10 for the menu, not typing at a command line.
%
If you do not wish to use either of the above resynchronization procedures, you still have a number of options. If your parser is reading input from a keyboard, for instance, it is probably sufficient to ignore bad input characters. You can do this by resetting \index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} to \agcode{AG{\us}RUNNING{\us}CODE} (zero) in your \index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR} macro.
Your parser will then continue, passing over the bad input as though it had never occurred. If you do this, you should, of course, notify your user somehow that you're skipping a character. Issuing a beep on the computer's speaker from the \agcode{SYNTAX{\us}ERROR} macro is usually enough. If you do not wish to continue the parse, but want your main program to continue, you need do nothing special. \agcode{PCB.exit{\us}flag} is set to \agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE} before the \agcode{SYNTAX{\us}ERROR} macro is called. If your macro does not change \agcode{PCB.exit{\us}flag}, then when it relinquishes control to your parser, your parser will simply return to the calling program. The calling program can determine that the parse was unsuccessful by inspecting \agcode{PCB.exit{\us}flag} and take whatever action you deem appropriate.

\section{Advanced Techniques}

\subsection{Semantically Determined Productions}
\index{Semantically determined production}\index{Production}

A semantically determined production is one which has more than one token on the left side. The reduction procedure then determines which token has in fact been identified, using whatever criteria are necessary. In some cases where the purpose is simply to provide multiple syntactic options to be chosen at execution time, the determination is made by interrogating a switch. Other situations may require a more complex determination, such as a symbol table look-up. \index{Production} The tokens on the left side of the production can be used just like any other tokens in your grammar. Their semantic values, however, must all be of the same \index{Data type}\index{Token}data type. Depending on how you have defined your grammar, it may be that whenever any one of the tokens on the left side is syntactically acceptable input, all the tokens on the left are syntactically acceptable.
That is, the production could reduce to any of the tokens on the left without causing an immediate error condition. In many circumstances, however, this is not the case. In a Pascal grammar, for example, a semantically determined production might be used to allow a reduction procedure to determine whether a particular identifier is a constant identifier, a type identifier, a variable identifier, or so on. In any particular context, only a subset of the tokens on the left may be syntactically acceptable. Before your reduction procedure is called, your parser will set the reduction token to the first token on the left side which is syntactically correct. If you need to change this assignment you have several options. From within your reduction procedure, you may simply set \index{reduction{\us}token}\index{PCB}\index{Token}\agcode{PCB.reduction{\us}token} to the semantically correct value. For this purpose, it is convenient to use the token name enumeration constants provided in the header file for your parser. Note that if you select a reduction token that is not syntactically correct, after your reduction procedure returns, your parser will encounter a \index{Reduction token error}\agterm{reduction token error}, described above. AnaGram provides several tools to help you set the reduction token correctly. First, it provides a \agterm{change reduction} function which will set the reduction token to a specified token only if the specified token is syntactically correct. It will return a flag to indicate the outcome: non-zero on success, zero on failure. The name of this function is given by appending \agcode{{\us}change{\us}reduction} to the name of your parser. Thus, if your parser is named \agcode{ana}, the name of the function would be \agcode{ana{\us}change{\us}reduction}. In those cases where the semantically correct reduction token is not syntactically correct, you will want to provide error diagnostics for your user. 
If you wish the parse to continue, so you can check for further errors, you may simply return from the reduction procedure. Since the default reduction is syntactically correct, the parse can continue as though there had been no error. To simplify use of the change reduction function, AnaGram provides a macro, \index{CHANGE{\us}REDUCTION}\index{Macros}\agcode{CHANGE{\us}REDUCTION}. Simply call the macro with the name of the desired token as the argument, replacing embedded blanks in the token name with underscores. For example, in writing a grammar for the C language, it is quite convenient to write the following production:
\begin{indentingcode}{0.4in}
identifier, typedef name
  -> name = check{\us}typedef();
\end{indentingcode}
The reduction procedure can then check the symbol table to see whether the name that has been found is a typedef name. If so, it can use the \agcode{CHANGE{\us}REDUCTION} macro to change the reduction token to \agcode{typedef name} and verify that this is acceptable:
\begin{indentingcode}{0.4in}
if (!CHANGE{\us}REDUCTION(typedef{\us}name))
  diagnose{\us}error();
\end{indentingcode}
Note that the embedded space in the token name must be replaced with an underscore character. Under some circumstances, in your reduction procedure, you might wish to know precisely which reduction tokens are syntactically correct. For instance, you might wish, in an error diagnostic, to tell your user what you expected to see. If you set the \index{Reduction choices}\index{Configuration switches} \agparam{reduction choices} switch, AnaGram will include in your parser file a function which will identify the acceptable choices for the reduction token in the current state. The prototype of this function is:
\begin{indentingcode}{0.4in}
int \${\us}reduction{\us}choices(int *);
\end{indentingcode}
where ``\agcode{\$}'' represents the name of your parser. You must provide an integer array whose length is at least the maximum number of reduction choices you might have.
The function will fill the array with the numbers of the tokens that are acceptable in the current state and return a count of the acceptable choices it found. You can call this function from any reduction procedure. AnaGram also provides a macro to invoke this procedure: \index{REDUCTION{\us}CHOICES}\index{Macros}\agcode{REDUCTION{\us}CHOICES}. For example, to provide a diagnostic which lists the acceptable tokens, you might combine the use of the \agparam{reduction choices} switch with the \index{Token names}\index{Configuration switches}\agparam{token names} switch described above:
\begin{indentingcode}{0.4in}
int ok{\us}tokens[20], n{\us}ok{\us}tokens, i;

n{\us}ok{\us}tokens = REDUCTION{\us}CHOICES(ok{\us}tokens);
printf("Acceptable input comprises: {\bs}n");
for (i = 0; i $<$ n{\us}ok{\us}tokens; i++) \bra
  printf(" \%s{\bs}n", TOKEN{\us}NAMES[ok{\us}tokens[i]]);
\ket
\end{indentingcode}
A semantically determined production can even be a null production. You can use a semantically determined null production to interrogate the settings of parameters and control parsing accordingly:
\begin{indentingcode}{0.4in}
condition false, condition true
  -> = \bra
    if (condition) CHANGE{\us}REDUCTION(condition{\us}true);
  \ket
\end{indentingcode}
The \index{examples}\agfile{examples} directory of your AnaGram distribution contains numerous examples of semantically determined productions.

\subsection{Defining Parser Control Blocks}
\index{Parser control block}

All references to the parser control block in your parser are made using the macro \index{PCB}\agcode{PCB}. The only intrinsic requirement on \agcode{PCB} is that it evaluate to an \agterm{lvalue} (see Kernighan and Ritchie) that identifies a parser control block. The actual access may be direct, indirect through a pointer, subscripted, or even more complex, although if the access is too complex, the performance of your parser could suffer.
Simple indirect or subscripted references are usually enough to enable you to build a system with multiple parallel parsing processes. If you wish to define \agcode{PCB} in some way other than a simple, direct access to a compiled-in control block, you will have to declare the control block yourself. When AnaGram builds a parser, it checks the status of the \index{Declare pcb}\index{Configuration switches}\agparam{declare pcb} configuration switch. If it is on, the default setting, AnaGram declares a parser control block for you. AnaGram creates the name of the parser control block variable by appending \agcode{{\us}pcb} to the name of your parser. Thus if the name of your parser is \agcode{ana}, the parser control block is \agcode{ana{\us}pcb}. In the header file AnaGram generates, a typedef statement defines the structure of the parser control block. The typedef name is given by appending \agcode{{\us}pcb{\us}type} to the name of your parser. Thus if the name of your parser is \agcode{ana}, the type of the parser control block is given by \agcode{ana{\us}pcb{\us}type}. Thus, when AnaGram defines the parser control block for \agcode{ana}, it does so by including the following two lines of code: \begin{indentingcode}{0.4in} ana{\us}pcb{\us}type ana{\us}pcb; \#define PCB ana{\us}pcb \end{indentingcode} If you wish to declare the parser control block yourself, you should turn off the \agparam{declare pcb} switch. To turn \agparam{declare pcb} off, include the following line in a configuration segment in your syntax file: \begin{indentingcode}{0.4in} \~{}declare pcb \end{indentingcode} Suppose your program needs to serve up to sixteen ``clients'', each with its own input stream. 
You might turn \agparam{declare pcb} off and declare the parser
control block in the following manner:

\begin{indentingcode}{0.4in}
ana{\us}pcb{\us}type ana{\us}pcb[16];   /* declare control blocks */
int client;
\#define PCB ana{\us}pcb[client]   /* tell parser about it */
\end{indentingcode}

Perhaps you need to parse a number of input streams, but you don't
know exactly how many until run time. You might make the following
declarations:

\begin{indentingcode}{0.4in}
ana{\us}pcb{\us}type *ana{\us}pcb;   /* pointer to control block */
\#define PCB (*ana{\us}pcb)   /* tell parser about it */
\end{indentingcode}

Note that when you define \agcode{PCB} as a dereferenced pointer, you
should put parentheses around the definition, so that a member access
such as \agcode{PCB.exit{\us}flag} applies to the control block rather
than to the pointer.

There are many situations where it is convenient for a parser to be
reentrant. A parser used for evaluating formulas in a spreadsheet
program, for instance, needs to be able to call itself recursively if
it is to use natural order recalculation. A parser used to implement
macro substitutions may need to be recursive to deal with embedded
macros. Here is an example of an interface function which is designed
for recursive calls to a parser, using the definitions above:

% XXX can I please at least remove the nonstandard <alloc.h>?
% And fix the misuse of assert, and check malloc for failure?
% And use AG_SUCCESS_CODE instead of 1?
\begin{indentingcode}{0.4in}
\#include <assert.h>
\#include <stdlib.h>

\#define PCB (*ana{\us}pcb)
ana{\us}pcb{\us}type *ana{\us}pcb;

void do{\us}ana(void) \bra
  ana{\us}pcb{\us}type *save{\us}ana = ana{\us}pcb;

  ana{\us}pcb = malloc(sizeof(ana{\us}pcb{\us}type));
  if (ana{\us}pcb == NULL) abort();   /* out of memory */
  ana();
  /* ana{\us}pcb is a pointer, so go through PCB to reach the field */
  assert(PCB.exit{\us}flag == AG{\us}SUCCESS{\us}CODE);
  free(ana{\us}pcb);
  ana{\us}pcb = save{\us}ana;
\ket
\end{indentingcode}

Here is another way to accomplish the same end, this time using stack
storage rather than heap storage:

% XXX ditto
\begin{indentingcode}{0.4in}
\#include <assert.h>

\#define PCB (*ana{\us}pcb)
ana{\us}pcb{\us}type *ana{\us}pcb;

void do{\us}ana(void) \bra
  ana{\us}pcb{\us}type *save{\us}ana = ana{\us}pcb;
  ana{\us}pcb{\us}type local{\us}pcb;

  ana{\us}pcb = \&local{\us}pcb;
  ana();
  assert(PCB.exit{\us}flag == AG{\us}SUCCESS{\us}CODE);
  ana{\us}pcb = save{\us}ana;
\ket
\end{indentingcode}

% XXX and here we should discuss \agparam{reentrant parser}, too.

\subsection{Multi-stage Parsing}
\index{Parsing}\index{Multi-stage parsing}

Multi-stage parsing consists of chaining together a number of parsers
in series so that each parser provides input to the following one.
Users of \agfile{lex} and \agfile{yacc} are accustomed to using
two-level parsing, since the
``\index{Lexical scanner}lexical scanner'', or ``lexer'' they write in
\agfile{lex} is really a very simple parser whose output becomes the
input to the parser written in \agfile{yacc}. AnaGram has been
developed so that you may use as many levels as are appropriate to
your problem, and so that, if you wish, you may write all of the
parsers in AnaGram.

Many problems that do not lend themselves conveniently to solution
with a simple grammar can be neatly solved by using multi-stage
parsing. In many cases this is because multi-stage parsing can be used
to parse constructs that are not context-free. A first level parser
can use semantic information to decide which tokens to pass on to the
next level.
Thus, a first level parser for a C compiler can use semantic
information to distinguish typedef names from variable names.

% XXX I believe this is referring to QPL. Nowadays there's Python...
As another example, a proprietary programming language used indents to
control its block structure. A first level parser looked only at lines
and indents, passing the text through to the second level parser. When
it encountered changes in indentation level, it inserted block start
and block end tokens as necessary.

Using AnaGram it is extremely easy to set up multi-stage parses.
Simply configure the second level parser as an event-driven parser.
The first level parser can then hand over tokens or characters to it
as it develops them. The C macro preprocessor example, found in the
\index{examples}\agfile{examples} directory of your AnaGram
distribution, illustrates the use of multi-stage parsing.

\subsection{Context Tracking}
\index{Context tracking}

When you are writing a reduction procedure for a particular grammar
rule, you often need to know the value one or another of your program
variables had at the time the first token in the rule was encountered.
Examples of such variables are:

\begin{itemize}
\item Line or column number
\item Index in an input file
\item Index into an array
\item Counters, as of symbols defined, etc.
\end{itemize}

Such variables can be thought of as representing the ``context'' of
the rule you are reducing. Sometimes it is possible to incorporate the
values of such variables into the values of reduction tokens, but this
can become quite cumbersome. AnaGram provides an optional feature
known as ``context tracking'' to deal with this problem. Here's how it
works: First, you identify the variables which you want to track.
Second, you write a typedef statement in the
\index{C prologue}C prologue of your parser which defines a data
structure with fields to accommodate values for all of these
variables.
Third, you tell AnaGram the name of the type of your data structure,
using the
\index{Context type}\index{Configuration parameters}\agparam{context type}
configuration parameter. This causes AnaGram to add a field called
\index{PCB}\index{input{\us}context}\agcode{input{\us}context} and a
stack, the \index{Context stack}\index{Stack}\agterm{context stack},
called \index{PCB}\index{cs}\agcode{cs}, both of the type you have
specified, to your parser control block. Fourth, you write code to
gather the context information for each input character.

There are several ways to provide the initial context information. You
may write a
\index{GET{\us}CONTEXT}\index{Macros}\agcode{GET{\us}CONTEXT} macro
which sets the context stack variables directly. Using the
\index{CONTEXT}\index{Macros}\agcode{CONTEXT} macro defined below, and
assuming your context type has line, column and pointer fields, you
could define \agcode{GET{\us}CONTEXT} as follows:

\begin{indentingcode}{0.4in}
\#define GET{\us}CONTEXT CONTEXT.pointer = PCB.pointer,{\bs}
  CONTEXT.line = PCB.line,{\bs}
  CONTEXT.column = PCB.column
\end{indentingcode}

If you are using \agparam{pointer input}, you must write a
\agcode{GET{\us}CONTEXT} macro to save context information. If you use
a \index{GET{\us}INPUT}\index{Macros}\agcode{GET{\us}INPUT} macro or
have an event-driven parser, you may either store values directly into
\index{input{\us}context}\index{PCB}\agcode{PCB.input{\us}context}
when you develop the input token, or you may write a
\agcode{GET{\us}CONTEXT} macro. The macro will provide a slight
improvement in performance.
% XXX say why it's faster (I assume because it won't look up context
% for inputs that don't need it?)

AnaGram provides six macros to enable you to read values in a
convenient manner from the context stack,
\index{cs}\index{PCB}\agcode{PCB.cs}. Three of these macros are
designed to be used from your parser itself, and three are available
to use from other modules.
These three macros are designed for use in your parser:

\begin{itemize}
\item \agcode{CONTEXT}
\item \agcode{RULE{\us}CONTEXT}
\item \agcode{ERROR{\us}CONTEXT}
\end{itemize}

These macros are defined at the beginning of your parser file, so they
may be used anywhere within your parser.

\index{CONTEXT}\index{Macros}\agcode{CONTEXT} can be used to read or
write the current top of the context stack as indexed by
\index{PCB}\agcode{PCB.ssx}. When your parser is executing a reduction
procedure for a particular grammar rule, \agcode{CONTEXT} will
evaluate to the value of the input context as it was just before the
very first token in the rule. The definition of \agcode{CONTEXT} is:

\begin{indentingcode}{0.4in}
\#define CONTEXT (PCB.cs[PCB.ssx])
\end{indentingcode}

\index{RULE{\us}CONTEXT}\index{Macros}\agcode{RULE{\us}CONTEXT} can be
used within a reduction procedure to get the context for any element
within the rule being reduced. For example,
\agcode{RULE{\us}CONTEXT[0]} is the context of the first element in
the rule, \agcode{RULE{\us}CONTEXT[1]} is the context of the second
element in the rule, and so on. \agcode{RULE{\us}CONTEXT[0]} is
exactly the same as \agcode{CONTEXT}.
% XXX There should be a way to address the context of tokens in a
% rule by the symbolic names we've bound to them.

The definition of \agcode{RULE{\us}CONTEXT} is:

\begin{indentingcode}{0.4in}
\#define RULE{\us}CONTEXT (\&(PCB.cs[PCB.ssx]))
\end{indentingcode}

As an example, let us suppose that we are writing a parser to read a
parameter file for a program.
Let us imagine the following statements make up a part of our syntax
file:

\begin{indentingcode}{0.4in}
\bra
  typedef struct \bra int line, column; \ket location;
  \#define GET{\us}INPUT {\bs}
    PCB.input{\us}code = fgetc(input{\us}file); {\bs}
    PCB.input{\us}context.line = PCB.line; {\bs}
    PCB.input{\us}context.column = PCB.column;
\ket
{}[ context type = location ]\\
parameter assignment
  -> parameter name, '=', number
\end{indentingcode}

Let us suppose that for each parameter we have stored a range of
admissible values. We have to diagnose an attempt to use an incorrect
value. We could write our diagnostic message as follows:

\begin{indentingcode}{0.4in}
fprintf(stderr, "Bad value at line \%d, column \%d in "
        "parameter assignment at line \%d, column \%d",
        RULE{\us}CONTEXT[2].line, RULE{\us}CONTEXT[2].column,
        CONTEXT.line, CONTEXT.column);
\end{indentingcode}

This diagnostic message would give our user the exact location both of
the bad value and of the beginning of the statement that contained the
bad value.

\index{ERROR{\us}CONTEXT}\index{Macros}\agcode{ERROR{\us}CONTEXT} can
be used within a
\index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR} macro
to find the context of an error if you have turned on the
\index{Error frame}\index{Configuration switches}\agparam{error frame}
and \index{Diagnose errors}\index{Configuration switches}
\agparam{diagnose errors} switches. AnaGram's own parser, the one that
reads your syntax file, tracks context using a structure consisting of
line and column numbers. When it encounters an error such as an end of
file inside a comment, it uses the \agcode{ERROR{\us}CONTEXT} macro to
determine the line and column number at which the comment began.
The definition of \agcode{ERROR{\us}CONTEXT} is:

\begin{indentingcode}{0.4in}
\#define ERROR{\us}CONTEXT (PCB.cs[PCB.error{\us}frame{\us}ssx])
\end{indentingcode}

Three similar macros are also available for more general use:

\begin{itemize}
\item \index{PCONTEXT}\index{Macros}\agcode{PCONTEXT(pcb)}
\item \index{PRULE{\us}CONTEXT}\index{Macros}\agcode{PRULE{\us}CONTEXT(pcb)}
\item \index{PERROR{\us}CONTEXT}\index{Macros}\agcode{PERROR{\us}CONTEXT(pcb)}
\end{itemize}

These macros are identical in function to the corresponding macros in
the first group. The only difference is that they take the name of a
parser control block, \agcode{pcb}, as an argument. AnaGram includes
their definitions in the parser header file, so they can be used in
modules other than the parser itself. Since these macros are not
specific to any one parser, the definitions are conditional so that
they will only be defined once in a given module, even if you include
header files corresponding to several parsers. The definitions of
these macros are as follows:

\begin{indentingcode}{0.4in}
\#define PCONTEXT(pcb) (pcb.cs[pcb.ssx])
\#define PRULE{\us}CONTEXT(pcb) (\&(pcb.cs[pcb.ssx]))
\#define PERROR{\us}CONTEXT(pcb) (pcb.cs[pcb.error{\us}frame{\us}ssx])
\end{indentingcode}

Note that since the context macros only make sense when called from a
reduction procedure or an error procedure, there are not many
occasions to use these three macros. The most common situation would
be when you have compiled the bulk of the code for your reduction
procedures in a separate module. Remember that
\agcode{PRULE{\us}CONTEXT}, because it identifies an array rather than
a value, requires a subscript.
For example, let us rewrite the diagnostic message given above for
\agcode{RULE{\us}CONTEXT} using \agcode{PRULE{\us}CONTEXT}, assuming
that the name of our parser control block is \agcode{ana{\us}pcb}:

\begin{indentingcode}{0.4in}
fprintf(stderr, "Bad value at line \%d, column \%d in "
        "parameter assignment at line \%d, column \%d",
        PRULE{\us}CONTEXT(ana{\us}pcb)[2].line,
        PRULE{\us}CONTEXT(ana{\us}pcb)[2].column,
        PCONTEXT(ana{\us}pcb).line,
        PCONTEXT(ana{\us}pcb).column);
\end{indentingcode}

\subsection{Coverage Analysis}
\index{Coverage analysis}

AnaGram has simple facilities for helping you determine the adequacy
of your test suites. The
\index{Rule coverage}\index{Configuration switches}
\agparam{rule coverage} configuration switch controls these
facilities. When you set \agparam{rule coverage}, AnaGram includes
code in your parser to count the number of times the parser identifies
each rule in your grammar. AnaGram also provides procedures you can
use to write these counts to a file and accumulate them over multiple
executions of your parser. Finally, it provides a window where you may
inspect the counts to see the extent to which your tests have covered
the options in your grammar.

To maintain the counts, AnaGram declares, at the beginning of your
parser, an integer array, whose name is created by appending
\agcode{{\us}nrc} to the name of your parser. The array contains one
counter for each rule you have defined in your grammar. There are no
entries for the auxiliary rules that AnaGram creates to deal with set
overlaps or disregard statements. In order to identify positively all
the rules that the parser reduces, AnaGram turns off certain
optimization features in your parser. Therefore, a parser that has the
\agparam{rule coverage} switch enabled will run slightly slower than
one with the switch off.

AnaGram also provides procedures to write the counts to a file and to
initialize the counts from a file.
The procedures are named by appending \agcode{{\us}write{\us}counts}
and \agcode{{\us}read{\us}counts} respectively to the name of your
parser. Thus, if your parser is called \agcode{ana}, the procedures
are called \agcode{ana{\us}write{\us}counts} and
\agcode{ana{\us}read{\us}counts}. Neither takes any arguments nor
returns a value.

To accumulate counts correctly, you should include calls to the
\index{read{\us}counts}\agcode{read{\us}counts} and
\index{write{\us}counts}\agcode{write{\us}counts} procedures in your
program. A convenient way to do this is to include statements such as
the following in your main program:

\begin{indentingcode}{0.4in}
ana{\us}read{\us}counts();   /* before calling parser */
atexit(ana{\us}write{\us}counts);   /* atexit is declared in <stdlib.h> */
\end{indentingcode}

For your convenience, AnaGram defines two macros,
\index{READ{\us}COUNTS}\index{Macros}\agcode{READ{\us}COUNTS} and
\index{WRITE{\us}COUNTS}\index{Macros}\agcode{WRITE{\us}COUNTS}, in
your parser. They call the \agcode{read{\us}counts} and
\agcode{write{\us}counts} procedures respectively when
\agparam{rule coverage} is set. Otherwise they are null. Thus you may
code them into your main program and they will work whether or not the
\agparam{rule coverage} switch is set. For example,

\begin{indentingcode}{0.4in}
READ{\us}COUNTS;   /* read counts if coverage enabled */
my{\us}parser();   /* call parser */
WRITE{\us}COUNTS;   /* write updated counts */
\end{indentingcode}

The \agcode{write{\us}counts} procedure writes an identifier code and
the counts to a count file. The name of the count file is given by the
\index{Coverage file name}\index{Configuration parameters}
\agparam{coverage file name} parameter, which defaults to the same
name as your syntax file but with the extension
\index{File extension}\index{nrc}\agfile{.nrc}. The identifier code
changes each time you modify your syntax file. The
\agcode{read{\us}counts} procedure attempts to read the count file.
If it cannot find the file, or the identifier code is out of date, it
simply initializes the counter array to zeroes. Otherwise, it
initializes the counter array to the values found in the file.

When you run AnaGram and analyze your syntax file, if
\agparam{rule coverage} is set, AnaGram will enable the
\agmenu{Rule Coverage} option on the \agmenu{Browse} menu. If you
select \agmenu{Rule Coverage}, AnaGram will prepare a
\agwindow{Rule Coverage} window from the rule count file you select.
AnaGram will warn you if the file you selected is older than the
syntax file, since under those conditions, the coverage file might be
invalid.

The \index{Rule Coverage}\index{Window}\agwindow{Rule Coverage} window
shows the count for each rule, the rule number and the text of the
rule. It is also synchronized with the syntax file so that you can see
the rule in context.

AnaGram also modifies the display of the
\index{Reduction Procedures}\index{Window}\agwindow{Reduction Procedures}
window so that each procedure descriptor is preceded by the number of
times it has been called. You can use this display to verify that all
your reduction procedures have been tried.

% XXX having this paragraph here seems confusing
The \index{Trace Coverage}\index{Window}\agwindow{Trace Coverage}
window, created when you use the \agwindow{File Trace} or
\agwindow{Grammar Trace} option, provides information similar to that
provided by \agwindow{Rule Coverage}. The differences are these:
Optimizations are not turned off for the \agwindow{Trace Coverage}, so
that some rules of length zero or one will not be properly counted.
Also, the \agwindow{Trace Coverage} does not tell you about the
reduction procedures you have tested. \agwindow{File Trace} can become
quite tedious to use if you have very many semantically determined
productions, so in these cases the \agparam{rule coverage} approach
can give you the information you need more quickly.
\subsection{Using Precedence Operators} The conventional syntax for arithmetic expressions used in most programming languages can be parsed simply by reference to \index{Operator precedence}\index{Precedence operators} \agterm{operator precedence}. Operator precedence refers to the rules we use to determine the order in which arithmetic operations should be carried out. In normal usage, this means that multiplication and division take precedence over addition and subtraction, which in turn take precedence over comparison operations. One can formalize this usage by assigning a numeric \index{Precedence level}\agterm{precedence level} to each operator, so that the operations are carried out starting with those of highest precedence and continuing in order of declining precedence. When operators have the same precedence level, such as addition and subtraction operators, one can decide the order of operation to be left to right or right to left. Operators of equal precedence which are to be evaluated left to right are called \agterm{left associative}. Those which should be evaluated right to left are called \agterm{right associative}. If the nature of the operators is such that the question should never arise, they are called \agterm{non-associative}. AnaGram provides three declarations, \index{Precedence declarations}\index{Left}\index{Right}\index{Nonassoc} \agparam{left}, \agparam{right}, and \agparam{nonassoc}, which you can use to associate precedence levels and associativity with tokens in your grammar. The syntax of these statements is given in Chapter 8. When AnaGram encounters a shift-reduce \index{Conflicts}conflict in your grammar, it looks to see if the conflict can be resolved by using precedence and associativity rules. If so, it applies the rules to the conflict and records the resolution in the \index{Resolved Conflicts}\index{Window}\agwindow{Resolved Conflicts} table. 
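
As an illustration of how such a resolution proceeds, the following C
fragment sketches the conventional rule used by precedence-based
parser generators. It is not taken from AnaGram's own code, which may
differ in detail. The names \agcode{resolve}, \agcode{action} and
\agcode{assoc} are hypothetical, and larger numbers are assumed to
denote higher precedence.

\begin{indentingcode}{0.4in}
typedef enum \bra SHIFT, REDUCE, REJECT \ket action;   /* hypothetical */
typedef enum \bra LEFT, RIGHT, NONASSOC \ket assoc;    /* hypothetical */

/* rule{\us}level:  precedence of the operator in the rule to be reduced */
/* token{\us}level: precedence of the operator waiting to be shifted */
/* a: associativity declared for that level; consulted only on a tie */
action resolve(int rule{\us}level, int token{\us}level, assoc a) \bra
  if (rule{\us}level > token{\us}level) return REDUCE;
  if (token{\us}level > rule{\us}level) return SHIFT;
  switch (a) \bra
  case LEFT:  return REDUCE;   /* group to the left */
  case RIGHT: return SHIFT;    /* group to the right */
  default:    return REJECT;   /* non-associative: not allowed */
  \ket
\ket
\end{indentingcode}

With multiplication ranked above subtraction, the input
\agcode{a - b * c} reaches the conflict after \agcode{a - b} with
\agcode{*} waiting. The token level is higher, so the parser shifts
and \agcode{b * c} is grouped first. For \agcode{a - b - c} the levels
tie, and left associativity selects the reduction, giving
\agcode{(a - b) - c}.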
There are two occasions where you should consider using precedence
declarations in your grammar: where rewriting the grammar to get rid
of a conflict would obscure and complicate the grammar, and where you
wish to try to get a more compact, slightly faster parser by using
precedence rules for parsing arithmetic expressions. Here is an
example of using precedence declarations to parse simple arithmetic
expressions:

\begin{indentingcode}{0.4in}
unary minus = '-'
{}[
  left \bra '+', '-' \ket
  left \bra '*', '/' \ket
  right \bra unary minus \ket
]
exp
  -> number
  -> unary minus, exp
  -> exp, '+', exp
  -> exp, '-', exp
  -> exp, '*', exp
  -> exp, '/', exp
\end{indentingcode}

A complete working calculator grammar using this syntax,
\agfile{ffcalcx}, can be found in the
\index{examples}\agfile{examples/ffcalc} directory of your AnaGram
distribution.

\subsection{Parser Performance}

The parsers AnaGram generates have been engineered to provide maximum
performance subject to constraints of reliability and robustness.
There are a number of steps you may take, however, to optimize the
performance of your parser.

\paragraph{Standard Stack Frame.}
If your compiler has a switch that allows you to turn \emph{off} the
standard stack frame when you compile your parser, do so. Your parser
uses a large number of very small functions which run fastest when
your compiler does not use the standard stack frame.

\paragraph{Error Diagnostic Features.}
If your parser does not need to diagnose errors, turn off the
\index{Diagnose errors}\index{Configuration switches}
\agparam{diagnose errors} switch. Turn off the
\index{Lines and columns}\index{Configuration switches}
\agparam{lines and columns} switch if you don't need this information.
If your parser doesn't need a diagnostic, and halts on syntax error,
turn off the
\index{Backtrack}\index{Configuration switches}\agparam{backtrack}
switch.
\paragraph{Anti-optimization Switches.}
Certain switches de-optimize your parser for various reasons. These
switches,
\index{Traditional engine}\index{Configuration switches}
\agparam{traditional engine} and
\index{Rule coverage}\index{Configuration switches}
\agparam{rule coverage}, should be turned off once you no longer need
their effects.

\paragraph{Other Switches.}
For maximum performance you should use
\index{Pointer input}\index{Configuration switches}\agparam{pointer input}.
If you can guarantee that your input will not contain out-of-range
characters or tokens, you can turn off
\index{Test range}\index{Configuration switches}\index{Range}
\agparam{test range}.
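
Taken together, the suggestions in this section might appear in a
configuration segment such as the following. This is only a sketch. It
assumes a parser that reads its input from memory, needs no error
diagnostics, and never receives out-of-range input. It also assumes
that a switch is turned on by naming it, and turned off with a leading
tilde in the same way as shown earlier for \agparam{declare pcb}.

\begin{indentingcode}{0.4in}
{}[
  \~{}diagnose errors
  \~{}lines and columns
  \~{}backtrack
  \~{}traditional engine
  \~{}rule coverage
  pointer input
  \~{}test range
]
\end{indentingcode}

Any switch whose effects your parser still needs can simply be left
out of the list, so that it keeps its default setting.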