Mercurial > ~dholland > hg > ag > index.cgi
view doc/manual/isdp.tex @ 15:f5acaf0c8a29
Don't cast through "volatile int". Causes a gcc warning nowadays.
XXX: should put something else back here to frighten the optimizer
author | David A. Holland |
---|---|
date | Tue, 31 May 2022 01:00:55 -0400 |
parents | 13d2b8934445 |
children |
line wrap: on
line source
\chapter{Introduction to Syntax Directed Parsing} Every programmer has to deal with input data. The units of data may consist of single characters from a keyboard or file, unit records from a file, mouse events in a windowing system, or even more complex data constructs. When the processing of a unit of data depends only on the value of that data, it is usually straightforward. Usually, however, the processing depends also on what has preceded, and often even on what follows, the input under consideration. Keeping track of these dependencies in order to know how to process the data is called \index{Parsing}\agterm{parsing}. It is often relatively easy to keep track of simple dependencies when first writing a program. But as the program develops, as new features are added, as bugs are fixed, the dependencies often stop being simple. Then input processing becomes a headache, since it is hard to keep track of or even identify all the particular cases. Changes to the program cause unexpected problems. Program maintenance threatens to get out of control. Syntax directed parsing is a technique for dealing with input data streams which keeps you in control of the problem, gives you a high degree of confidence in your program, makes bugs less likely and makes program maintenance and the addition of new features an easy task. To use syntax directed parsing you create a high level description of the structure of your input data, called a \index{Grammar}\agterm{grammar}. A file which contains a grammar is called a \index{Syntax file}\index{File}\agterm{syntax file}. A \index{Parser generator}\agterm{parser generator}, such as AnaGram, can create, from a syntax file, a function (or program) called a \index{Parser}\agterm{parser}, written in C or C++. The parser keeps track of all the dependencies in your input, and calls certain functions, \index{Reduction procedure}\agterm{reduction procedures}, to deal with specific units or sequences of data as they are encountered. Reduction procedures are functions you write to process your data. They are linked, in an obvious way in the grammar, to the structures in your input, so that your parser will call them at precisely the right times with precisely the right data. When you write reduction procedures, you can concentrate entirely on what you have to do with the data. You don't have to encumber your code with switches and tests to determine the structure of your input. This chapter describes in a general way how you write a grammar and how a parser works. In particular, it defines a number of terms which are useful in talking about grammars and syntax directed parsing. It then introduces a number of special features of AnaGram. More detailed treatment of writing grammars is given in Chapter 8. Techniques for building a complete functioning parser are discussed in Chapter 9. \section{Describing an Input Sequence} Writing a grammar consists of describing the acceptable input sequences for your program. The vehicle for describing an input sequence is called a \index{Production}\agterm{production}. Productions show how a logical component of the input can be made up of a sequence of more elementary components. A production that describes a date might be written: \begin{indentingcode}{0.4in} date -> name of month, day, comma, year \end{indentingcode} The components of the input are called \index{Token}\agterm{tokens}. The sequence of tokens on the \index{Right side}right side of the production is called a \index{Grammar rule}\agterm{grammar rule}, or \index{Rule}\agterm{rule} for short. The individual tokens on the right side of the rule are also called \index{Rule elements}\index{Rule}\agterm{rule elements}. The \agterm{rule length} is the number of elements in the rule. The token on the \index{Left side}left side of the production is called the \index{Reduction token}\index{Token}\agterm{reduction token} for the rule. \index{Token}Tokens may have \index{Value}\index{Semantic value}\agterm{semantic values}, as distinguished from \index{Value}\agterm{syntactic values}, which you can use in your reduction procedures. For instance, the value of \agcode{name of month} could be an integer in the range zero to eleven, or it could be a pointer to an ASCII string. The value of day could be an integer in the range one to thirty-one. A grammar consists of a number of such productions, each of which describes some component of the input in terms of other components. It does not take very many productions to describe quite complex input streams. A grammar for the C language, for instance, requires about two hundred productions. Many people find the term \agterm{production} quite confusing. It comes from theoretical linguistics where it is used to describe how one may produce sequences which correspond to a set of grammatical rules. Ironically, the major usage of the idea has been in parsing where the interest is not so much in creating sequences which satisfy the grammatical rules as in decoding and analyzing such sequences. Nonetheless, it is convenient, in the above example, to say that the token \agcode{date} \agterm{produces} a sequence of tokens consisting of \agcode{name of month}, \agcode{day}, \agcode{comma}, and \agcode{year}. We also say that the sequence \agterm{reduces} to \agcode{date}. There may be more than one production to describe a given component, if there is more than one way it may be represented. For instance, \begin{indentingcode}{0.4in} date -> day, name of month, year \end{indentingcode} describes another common way of writing a date. In other words, a reduction token may produce a number of different grammar rules. Tokens which appear on the left side of one or more productions are called \index{Token}\index{Nonterminal token}\agterm{nonterminal tokens}. Those which appear \emph{only} on the right sides of productions are called \index{Token}\index{Terminal token}\agterm{terminal tokens}. Terminal tokens are the units which actually appear physically in the input. Nonterminal tokens are identified when a sequence of tokens that matches the right side of a production is seen in the input. When AnaGram analyzes a grammar, it assigns a unique \index{Token number}\index{Number}\index{Token}\agterm{token number} to each token it finds in the grammar. Nonterminal tokens, such as \agcode{date} in the example above, may appear in any grammar rule just as though they were input tokens. The token on the left side of a production can even appear on the right side as well. Such a production is called a \index{Production}\index{Recursive productions}\agterm{recursive} production. When a nonterminal token appears on the right side of a production, it may be represented in this context by \emph{any} of the grammar rules it produces. Grammars described in this manner are called \index{Context free grammar}\index{Grammar}\agterm{context free grammars} since there is no contextual constraint on which of the rules that a token produces can appear in any given context. Recursive productions may be either left recursive or right recursive. \agterm{Left recursive} productions are those where the recursively defined nonterminal appears as the first element in the recursive rule. \agterm{Right recursive} productions are those where it is the last element. If it appears anywhere between, the production is said to be \agterm{center recursive}. Any nonterminal token which has a recursive production must also have at least one simple, non-recursive production. Otherwise, it is not possible to create a finite sequence of terminal tokens from the nonterminal token. Recursion may also occur implicitly in a grammar when one of the tokens on the right side of a production itself has a production involving the token on the left. Such implicit recursion sometimes may involve numerous levels of productions. Implicit recursion occurs most commonly in describing constructs such as arithmetic expressions or the block structure of programming languages. Clearly, grammars can accommodate multiple levels of structure in the input sequences they describe. There must, at the top, be a single token which encompasses the entire input. This special token is variously called the \index{Grammar token}\index{Configuration parameters}\index{Token} \agterm{grammar token}, the \index{Goal token}\agterm{goal token} or the \index{Start token}\agterm{start token}. In AnaGram grammars, terminal tokens need not be declared or otherwise defined. Any token name that appears only on the right side of a production is taken to be a terminal token, and AnaGram presumes that the input procedure that provides tokens to the parser will be able to identify them to the parser appropriately. The details on how to do this are explained in Chapter 9. On the other hand, AnaGram allows you to specify terminal tokens explicitly as ASCII characters, or even as sets of ASCII characters, right in the grammar. Thus, you may write \agcode{'0-9'} to represent the set of ASCII digits, or \agcode{'A-Z'} to represent the set of upper case letters. The \index{Semantic value}\index{Value}\index{Token}semantic value of such a token is the ASCII \index{Character codes}character code that actually appears in the input stream. The rules for representing \index{Character sets}character sets are given in Chapter 8. If the various sets you use in your grammar overlap, they may not properly represent terminal tokens. In this case, AnaGram automatically extends your grammar appropriately. This ``set partition'' logic is described in Chapter 6. \section{How a Parser Works} The aim of a \index{Parser}parser is to match its input with the full syntactic structure specified by the productions which make up the grammar. The primary component of a parser is an input buffer, sometimes thought of as a stack, into which tokens are \agterm{shifted}, or stored sequentially, as they are encountered in the input. At the same time that a token is shifted into the input buffer, its \index{Semantic value}\index{Token}\index{Value}semantic value is pushed onto the \index{Parser value stack}\index{Value stack}\index{Stack} \agterm{value stack}. A token is not shifted into the buffer unless it ``makes sense'', that is, unless it is consistent with the rules of the grammar and with the input that has preceded it. If a token does not make sense, the parser signals a \index{Syntax error}\index{Errors}\agterm{syntax error}. In order to determine whether a token makes sense, the parser has a sort of decision table which provides a list of acceptable tokens for each of a number of \index{Parser}\index{State}\agterm{states}. The table also specifies what the parser is to do with each acceptable token. When the table indicates that a token is to be shifted into the buffer, it also specifies a new state. The parser stacks the current state on a \index{Parser state stack}\index{State stack}\index{Stack} \agterm{state stack} and jumps to the new state. Thus every time a token is shifted into the input buffer, a state number is pushed onto the state stack. For each state of the parser, excepting only the initial state, there is a unique token which will cause a jump to that state. This token is called the \index{Characteristic token}\index{Token}\agterm{characteristic token} for the state. When the rightmost, or most recent, tokens in the input buffer match the right side of a production precisely, the parser \emph{may} replace the tokens that match the rule with a single token, the token on the left side of the production. This process of replacing a sequence of tokens with a single token is called \index{Reduction}\agterm{reduction}. The token that replaces the sequence of tokens is called the \index{Reduction token}\index{Token}\agterm{reduction token}. The actual mechanism of the reduction is quite important. At the same time that input tokens are removed from the input buffer, state numbers are popped from the state stack, so that when all input tokens matching the rule have been removed, the parser \index{Parser}\index{State}state has been restored to the value it had at the time the first token in the rule was seen. As state numbers are popped from the state stack, token values are popped from the value stack. If the rule has a reduction procedure, temporary variables are loaded with the values popped from the stack and the reduction procedure is called. The reduction token is now shifted into the input buffer just as though it were an input token. If the reduction procedure returned a result it is shifted into the value stack as the value of the reduction token. The parser stacks the current state again and jumps to a new state as determined by the parser tables. If there are no errors in your input, when the last token has been read from the input, shifted into the input buffer, and reductions performed, there will be precisely one token in the input buffer: the \index{Grammar token}\index{Goal token}grammar, or goal, token which describes your entire input. At this point your parser declares itself finished. Reductions do not necessarily occur every time a rule matches the tokens in the input buffer. If the reduction token does not make sense, that is, if it is not consistent with the rules of the grammar and with the input that has preceded it, the parser will not perform the reduction. Suppose there are tokens in the input which match one of the rules given above for \agcode{date}. A reduction will not occur unless \agcode{date} is one of the tokens the parser is actually looking for at that stage in the input. Even when the reduction token would make sense, there are still situations where the reduction would not take place. Suppose a grammar includes the two productions given above for \agcode{date} as well as the following: \begin{indentingcode}{0.4in} date -> name of month, day \end{indentingcode} This production is the same as the first, but with no year specification. We often write dates without specifying the year explicitly. The year is usually understood from context. When a parser directed by this grammar has encountered \agcode{name of month} and \agcode{day}, it can't tell without looking further whether it has a short form or a long form date. In such a circumstance, the parser looks at the next following token, which is called a \index{Lookahead token}\index{Token}\agterm{lookahead token}. If the lookahead token is a comma, then, in the absence of other productions, the input is a long form date. If the lookahead token is not a comma, then the input is certainly not a long form date, and it is proper to reduce the short form production. Suppose the lookahead token were a comma and the grammar were also to contain the following production: \begin{indentingcode}{0.4in} appointment -> date, comma, time \end{indentingcode} Since a comma can follow date, according to this rule, and can also follow day according to the first production, it is impossible to determine, simply by looking at the lookahead token, whether the date was being given in short form or long form. One would have to look beyond the comma to see if what follows the comma matches the rules for time or for year. Although it is possible to build parsers which can do this, it is not generally feasible. This situation is called a \index{Shift-reduce conflict}\index{Conflicts} \agterm{shift-reduce conflict}, because the parser cannot decide simply on the basis of the lookahead token whether to reduce the short form rule or to ignore it and shift the comma into the input buffer. AnaGram finds and warns you about the conflicts in your grammar when it analyzes your grammar. To do this, it checks all states to find all completed rules. A \index{Completed rule}\index{Rule}\agterm{completed rule} is a rule that is completely matched by the rightmost tokens in the input buffer. For each such rule, AnaGram determines all of the terminal tokens which could follow once that rule has been reduced. These tokens are sometimes called the \index{Token}\agterm{reducing tokens} for the given rule in the given state. (Note that ``reducing tokens'' are quite different from ``reduction tokens''.) AnaGram checks each state for shift-reduce conflicts by comparing the set of tokens which can be shifted against all sets of reducing tokens for that state. If there is an overlap, there is a conflict. When AnaGram finds a conflict, it prepares a detailed report on the conflict. This information is then available in the \agwindow{Conflicts} window. A shift-reduce conflict does not prevent you from building a parser. If you build a parser for a grammar which has such a shift-reduce conflict, the parser will resolve the issue by shifting the lookahead token into the input buffer and ignoring the reduction. There is another type of conflict involving reductions. Suppose a given state has more than one \index{Completed rule}\index{Rule}completed rule. AnaGram then checks the sets of reducing tokens for all the completed rules. If there is an overlap, there is no way for AnaGram to determine which rule should be reduced simply on the basis of the \index{Token}\index{Lookahead token}lookahead token. This situation is called a \index{Reduce-reduce conflicts}\index{Conflicts} \agterm{reduce-reduce conflict}. AnaGram will also diagnose reduce-reduce conflicts in the Conflicts window. If you decide to ignore a reduce-reduce conflict and ask AnaGram to build a parser, the parser will choose the rule AnaGram encountered first when it read your grammar. \index{Conflicts}Conflicts are not necessarily errors. Sometimes you can build a more compact, faster parser by deliberately introducing conflicts. You can rely on AnaGram's default resolution of the conflicts, or you may use operator precedence to resolve the conflicts. The use of operator precedence is described in Chapter 9. AnaGram has extensive facilities for identifying and correcting unwanted conflicts. These facilities are described in Chapter 7. Traditionally, parsers are built using only shift and reduce actions as described above. The parsers AnaGram builds normally use a number of more complex actions, each of which is equivalent to a number of shift and reduce actions. These complex actions permit AnaGram to build faster, more compact parsers. If you wish, however, you may restrict AnaGram to using only shift and reduce actions by setting the \index{Traditional engine}\index{Configuration switches} \agparam{traditional engine} configuration switch. See Appendix A, Configuration Parameters. \section{A Note on Notation} \index{Context free grammar}\index{Grammar}Context free grammars have been traditionally represented in the literature using \index{Backus-Naur Form}\agterm{Backus-Naur Form}, or \index{BNF}\agterm{BNF}. In Backus-Naur Form, certain characters, called metacharacters, are used to punctuate productions and all other printable characters are taken to represent themselves literally. Named tokens are denoted by enclosing the name within angle brackets ($< >$). The left side of a production is distinguished from the right side by the % XXX s/characters/symbol/ characters $::=$. If several productions have the same left side, the vertical bar $|$ is used to separate them. The elements of a grammar rule are simply juxtaposed to indicate that one follows another. Blanks are ignored. Thus, in BNF, the first production given for \agcode{date}, above, would be: % ~ is a hard space $$ <date>~~::=~~<name~of~month>~~<day>~~,~~<year> $$ AnaGram uses a notation more consonant with ordinary programming usage. Thus token names need not be bracketed and literal characters must be appropriately quoted. The elements of rules are joined by commas. Using this approach, there is no need for metacharacters and it becomes possible to make a number of useful extensions to the notation. The detailed rules for writing grammars using AnaGram's notation are given in Chapter 8. \section{Reduction Procedures} \index{Reduction procedure} Of course, the reason for parsing an input stream is to interpret the data in the stream and to process it in some useful manner. The primary tool for doing this is the \agterm{reduction procedure}. A reduction procedure is a piece of C code that is executed when a particular grammar rule is reduced. Often, a reduction procedure calculates a value which becomes the \index{Value}\index{Semantic value}semantic value of the \index{Token}\index{Reduction token}reduction token. The input to the reduction procedure consists of the values of the tokens that make up the grammar rule. AnaGram allows you to assign C variable names to the tokens in the grammar rule, so that you can refer to them in the reduction procedure. To assign a C variable name, simply follow the token in the rule with a colon and the C variable name. Simple reduction procedures can be written as C or C++ expressions. At the end of the rule, write an equal sign and an expression, terminated by a semicolon. For example: \begin{indentingcode}{0.4in} (int) hex digit -> '0-9':d = d-'0'; -> 'a-f':d = d-'a'+10; -> 'A-F':d = d-'A'+10; \end{indentingcode} When any one of these rules is matched, the value of the token in the rule is assigned to the temporary variable \agcode{d}. The expression to the right of the equal sign is evaluated and the value of the expression is stored as the value of \agcode{hex digit}, which has been declared to be an \agcode{int}. These productions define hexadecimal digits as ASCII characters, and calculate the binary value of the digit in terms of the ASCII character code, \agcode{d}. The binary value becomes the value of \agcode{hex digit}. Hexadecimal digits can be combined to make hexadecimal numbers by writing the following productions: \begin{indentingcode}{0.4in} (int) hex number -> hex digit -> hex number:n, hex digit:d = 16*n+d; \end{indentingcode} %In this example, note that if you do not specify a reduction procedure for %a grammar rule, the value of the reduction token will be taken to be the %value of the first token in the rule. Thus, it is not necessary to provide %a reduction procedure in the first production for hex number. The second %production recursively defines the value of hexadecimal numbers that have %more than one digit. There are several important points to notice in this example. First, reduction procedures are executed ``from the bottom up''. That is, the reduction procedure for \agcode{hex digit} is executed before any reduction procedure for \agcode{hex number}. Second, if there is no reduction procedure for a production, the value of the first token in the rule is assigned to the reduction token. Thus, it is not necessary to provide a reduction procedure in the first production for \agcode{hex number}. Third, the reduction procedures for recursive productions are always executed \emph{after} the reduction procedure, if any, for the non-recursive production which begins the recursion. Finally, when an input sequence is described using left recursion, as in this example, the elements of the sequence are processed left to right. % % XXX this is a bad example as you never want to read floats this way! % Can we think of something else? % If you wish to process the elements of a sequence right to left, you may use right recursion. For example, it is sometimes convenient to define the fraction part of a decimal number thus: \begin{indentingcode}{0.4in} (double) fraction part -> '0-9':d = (d - '0')/10.0; -> '0-9':d, fraction part:f = (d - '0' + f)/10.0; \end{indentingcode} In this case the leading digits are stored temporarily in the parse stack, and then the fraction part is evaluated right to left only when the last digit has been found. Reduction procedures can be more complex than simple expressions. After the equal sign you may include an arbitrary block of C or C++ code, enclosed in braces, \agcode{\bra \ket}. To return a \index{Semantic value}\index{Token}\index{Value}semantic value for the reduction token simply use a return statement. Of course, reduction procedures have the full resources of C or C++ at their disposal. They may set and interrogate global variables and may call functions freely. Reduction procedures are usually implemented as C functions. Very simple procedures may be implemented as macros, unless you have disabled this option by turning off the \index{Macros}\index{Allow macros}\agparam{allow macros} configuration switch. See Chapter 8 for the rules for writing reduction procedures. When AnaGram builds a parser it copies all of the reduction procedures you have defined to the parser file, and includes code to call them at the right time. Since the reduction procedures you write will probably need some support code, such as \agcode{\#include} statements and declarations, you may incorporate C or C++ code into your syntax file at any point. You need only enclose it in braces, \agcode{\bra \ket}. Such code is called \index{Embedded C}\agterm{embedded C}. All embedded C code is also copied to the \index{File}parser file, and \emph{precedes} all of your reduction procedures. \section{Building a Parser} \index{Building a Parser}\index{Parser} In order to round out a parser into a functioning program it needs input procedures, as well as error diagnosis and recovery capabilities. AnaGram has a number of options available which give you a high degree of flexibility in configuring a parser to suit your particular needs. All of the options are provided with reasonable defaults, so that you can safely disregard any option until you need the features it provides. The following paragraphs summarize the options AnaGram provides. These topics are discussed in greater detail in Chapter 9. \paragraph{Invoking a Parser.} \index{Parser} Normally, AnaGram configures parsers as functions which you can call from elsewhere in your program. In this situation, you call the parser, it processes its input, and returns either when it has finished or cannot proceed because of errors. Alternatively, if you set the \index{Event driven}\agparam{event driven} configuration switch, your parser will be configured so that you have two procedures to call: an \index{Initializer}\agterm{initializer} and a \index{Parser}\agterm{parser}. In the event driven configuration you start the parse by calling the initializer and then you call the parser once for each unit of input. Using the event driven mode makes it quite easy to configure a parser as a filter and to chain several parsers together so that the output from one parser is the input to the next. Such \index{Parsing}\index{Multi-stage parsing}\agterm{multi-stage parsing} is a convenient way to deal with complex input that is not context free. \paragraph{Communicating with a Parser.} The complete status of your parser is contained in a single data structure called a \index{Parser control block}\index{Control block} \agterm{parser control block}. All communications with a parser take place via the parser control block. \index{Input procedures}Input procedures must place input data in the appropriate field in the parser control block. When the parse is complete or encounters an error, the results of the parse may be found in the parser control block. When AnaGram builds a parser it includes, in the \index{File}\index{Header file}header file it generates, a \agcode{typedef} statement which defines the structure of the parser control block. \paragraph{Parser Input.} \index{Input}\index{Parser}\index{Input procedures} The input to your parser may be either characters read directly from an input stream, or tokens accumulated by a pre-processor or lexical scanner. The way you provide input to your parser depends on how your grammar defines input tokens and also on whether or not you have requested an event driven parser. If your parser is event driven, you provide its input by storing the input code and the input value, if any, into the parser control block and calling the parser. If you have set the \index{Pointer input}\agparam{pointer input} \index{Configuration switches}configuration switch in your syntax file, you simply initialize the pointer field in your parser control block before you call your parser. Your parser will then read its input directly from memory by simply incrementing the pointer as necessary. Otherwise, your parser will invoke a macro called \index{\agcode{GET{\us}INPUT}}\index{Macros}\agcode{GET{\us}INPUT} every time it needs more input. You may define \agcode{GET{\us}INPUT} according to your needs. You can define it so that it calls an input function, or you can define it so that it executes in-line code each time it is invoked. Your \agcode{GET{\us}INPUT} macro should store its input code in the \index{\agcode{input{\us}code}}\agcode{input{\us}code} field of the \index{PCB}parser control block. If you do not write a \agcode{GET{\us}INPUT} macro, AnaGram will provide one which will read characters from \agcode{stdin}. If your grammar does not define terminal tokens in terms of ASCII characters or external token numbers, your \agcode{GET{\us}INPUT} will have to determine the appropriate internal \index{Token number}\index{Token}\index{Number}token number for each input token. To assist you in determining these token numbers AnaGram provides a \agcode{typedef enum} statement in the \index{File}\index{Header file}header file. You can then use named constants to specify the internal token numbers for the input tokens. \paragraph{Error Handling.} \index{Error handling}\index{Error diagnosis}\index{Error recovery} Your parser must be prepared to deal with erroneous input. There are two aspects to error handling: diagnosing the error, and recovering from the error. On encountering an error, your parser will invoke a macro called \index{\agcode{SYNTAX{\us}ERROR}}\index{Macros}\agcode{SYNTAX{\us}ERROR}. If you do not provide a definition for \agcode{SYNTAX{\us}ERROR}, AnaGram will provide a simple error diagnostic. AnaGram can also provide automatic error diagnoses which pinpoint the location of the error. \index{Resynchronization} AnaGram provides two options for error recovery: \agterm{error token resynchronization} and \agterm{automatic resynchronization}. These are techniques for getting your parser back in synch with its input so it can proceed after encountering an error. Normally, if you do not select one of these recovery techniques, your parser will terminate when it encounters an error; however, you may override this default if you wish and provide your own recovery technique. \section{Special Features of AnaGram} AnaGram provides a number of special features not generally available in other parsing systems. These features extend the applicability of syntax directed parsing, make it easier for you to write and debug your grammars, and make it easier to interface your parser to other parsers and to your main program. The following paragraphs provide a brief introduction to these features. Their use is described more fully in the appropriate chapters in this manual. \paragraph{Configuration Parameters.} \index{Configuration parameters}\index{Parameters} AnaGram has a number of configuration switches and parameters which you can use to control the way AnaGram works in various circumstances. Since programmers often have different requirements, AnaGram strives to be as flexible as possible. Configuration parameters may be set in \index{Configuration file}\index{File}configuration files, or in your syntax file. The rules for writing configuration parameters are given in Chapter 8. A summary of all configuration parameters is given in Appendix A. The use of particular configuration parameters is described as appropriate throughout this manual. \paragraph{Character Sets}. \index{Character sets} In your grammar, if you write terminal tokens explicitly as characters or as sets of ASCII characters, AnaGram will set up your parser to accept ASCII characters directly, so you will not need to build a \index{Lexical scanner}lexical scanner and interface it to your parser. Character sets may be specified in terms of ranges of characters, such as \agcode{'a-z'}, as unions, intersections or differences of sets, denoted by \agcode{+}, \agcode{\&}, or \agcode{-} respectively, or as the complement of a set, denoted by \agcode{\~{}}. You may also specify a set consisting of a single character either with a literal character or with a numeric specification. The detailed rules for character set expressions are given in Chapter 8. Some useful character sets are: \begin{itemize} \item Letters: \agcode{'a-z' + 'A-Z'} \item Digits: \agcode{'0-9'} \item DOS end of file: \agcode{-1 + \^{}Z} \item Neither end of file nor end of line: \agcode{\~{}(-1 + \^{}Z + '{\bs}n')} \end{itemize} \paragraph{Keywords.} \index{Keywords} AnaGram makes special provision for the use of keywords in the parsers it builds. In your grammar, you may specify a keyword as a quoted string. AnaGram will incorporate special string matching logic in your parser to recognize keywords in the parser input. When a keyword is recognized, it is treated as an individual token, completely independent of the characters which make up the keyword. Some examples of keywords: \agcode{"let"}, \agcode{".EQ."}, \agcode{"<>"}, \agcode{"/*"}. The rules for the use of keyword strings are given in Chapter 8. Keywords are a powerful and useful tool, but they are not completely consistent with the assumptions underlying the use of context-free grammars. Thus under certain circumstances it is possible for a parser built from a grammar that uses keywords to fail % XXX s/parse correctly/correctly parse/ to parse correctly text that appears to be consistent with the grammar. Such a situation is called a \index{Keyword anomalies}\index{Anomaly}keyword anomaly. AnaGram detects and diagnoses keyword anomalies in your grammar, and provides tools to help you find and correct the problem. Many such apparent anomalies, however, are completely innocuous. Keyword anomalies are discussed in greater detail in Chapter 7. \paragraph{Virtual Productions.} \index{Virtual productions}\index{Production} Virtual productions are a powerful, shorthand notation for specifying optional or repetitive input. Using a virtual production can often spare you the trouble of writing out four or five regular productions and naming them. Most of the notation for virtual productions should be familiar since it has been used in programming manuals for years. The basic forms are these: \index{\agcode{[]}} \index{\_opb\_clb} % {}, apparently. XXX? \index{\agcode{...}}\index{Ellipsis} \index{\agcode{?}} \index{\agcode{?...}} \index{\agcode{/...}} % XXX figure out how to not have to encode the line breaks manually. % I think we need one of the fancier extension forms of tabular. \begin{tabular}[t]{ll} \agcode{[]}&enclosing a list of one or more grammar rules separated by \agcode{|} characters denotes an optional\\ \phantom{}&choice of one of the rules.\\ \agcode{\bra \ket}&enclosing a list of one or more grammar rules separated by \agcode{|} characters indicates a required\\ \phantom{}&choice of one of the rules.\\ \agcode{...}&following a token, or following \agcode{[]} or \agcode{\bra \ket}, indicates arbitrary repetition of the previous choice or\\ \phantom{}&token.\\ \agcode{?}&following a token or set expression makes the token optional.\\ \agcode{?...}&following a token or set expression denotes zero or more repetitions of the token.\\ \agcode{/...}&following \agcode{[]} or \agcode{\bra \ket} indicates arbitrary repetition of the choice subject to the constraint that\\ \phantom{}&choices must alternate.\\ \end{tabular} \index{Production}\index{Virtual productions}Virtual productions are simply syntactic devices. When AnaGram encounters a virtual production it translates the virtual production into conventional productions which show up in your grammar tables. Virtual productions are described in Chapter 8. \paragraph{Definition Statements.} \index{Statement}\index{Definition statement} AnaGram provides a mechanism for naming particular character sets, keywords or virtual productions in your syntax file. The name you define can be used freely in place of the thing defined. Definition statements have the form: % XXX why the hell is the first line offset by a small amount? \begin{indentingcode}{0.4in} name = \codemeta{character set} name = \codemeta{virtual production} name = \codemeta{keyword} name = \codemeta{immediate action} name = \codemeta{token name} \end{indentingcode} The name may be any name acceptable to AnaGram. The name can then be used anywhere you might have used the expression on the right side. For example: \begin{indentingcode}{0.4in} upper case letter = 'A-Z' lower case letter = 'a-z' letter = upper case letter + lower case letter statement list = statement?... while = "WHILE" dos eof = \^{}Z \end{indentingcode} \paragraph{Disregard Statements.} \index{Disregard statement}\index{Statement} If you wish your parser generally to skip over certain characters or constructs in its input, such as blanks, newlines, or comments, you may use disregard statements to specify precisely what is to be skipped and under what circumstances. You may use any number of disregard statements and you may cause your parser to disregard any number of tokens, of whatever complexity. Any \agparam{disregard} statements must be placed in a configuration section. They may appear anywhere in your syntax file. The format of the \agparam{disregard} statement is as follows: \begin{indentingcode}{0.4in} disregard \codemeta{token name} \end{indentingcode} The rules for using disregard statements are given in Chapter 8. \paragraph{Lexeme Statements.} \index{Lexeme statement}\index{Statement} The lexeme statement is used to fine tune the disregard statement. Any \agparam{lexeme} statements must be placed in a configuration section. They may appear anywhere in your syntax file. The format of the \agparam{lexeme} statement is as follows: \begin{indentingcode}{0.4in} lexeme \bra \codemeta{token list} \ket \end{indentingcode} where \textit{token list} is a list of nonterminal tokens, separated by commas. The lexeme statement allows you to specify that for the purposes of the disregard statement, the listed tokens are to be treated as % XXX make this: indivisible lexical units lexical units, so that the disregard statement will be inoperative within the tokens specified. Lexeme statements are discussed in Chapter 8. \paragraph{Semantically Determined Productions.} \index{Semantically determined production}\index{Production} Traditionally, syntax directed parsing systems have enforced a rigid distinction between syntactic and semantic information. In these systems, it is not possible to use the knowledge that, for example, a symbol is a function name and not a variable name in the syntax of a programming language. AnaGram provides semantically determined productions as a mechanism to allow semantic information to control syntactic analysis. A semantically determined production has more than one \index{Token}\index{Reduction token}reduction token on the left side. The reduction procedure determines which reduction token the parser should use. For example: \begin{indentingcode}{0.4in} variable name, function name -> symbol:s = identify{\us}symbol(s); \end{indentingcode} In this situation, the parser will set the reduction token to the leftmost token, \agcode{variable name}, before calling the reduction procedure. If the reduction procedure finds that \agcode{s} in fact identifies a function name, it can change the reduction token to \agcode{function name}. Using this capability, you can write your grammar in terms of \agcode{variable name} and \agcode{function name}, just as though they were syntactically distinguishable. Further discussion of semantically determined productions may be found in Chapter 9. \paragraph{Event Driven Parsers.} \index{Parser}\index{Event driven parser} Traditionally, parsers have been constructed as subroutines which are called to analyze a stream of input and return only when they have finished parsing the input or have encountered an unrecoverable error. Such a parser itself calls a subroutine to obtain each unit of input. There are a number of circumstances where this traditional architecture is inconvenient. The most obvious is in situations where programs must be event-driven, as in many popular windowing systems. A somewhat less obvious situation is where one parser, often called a \index{Lexical scanner}\agterm{lexical scanner}, provides input to a second. In these situations it is convenient to invert the role of the calling and the called programs. AnaGram provides the capability of specifying that a parser it builds should be event-driven. In this case, the parser consists of an initialization procedure and an event handler. The main program calls the initialization procedure and then calls the event handler with each input token in turn. The event handler returns every time the parser needs more input. Details on event-driven parsers are given in Chapter 9. \paragraph{Context tracking.} \index{Context tracking} When you are writing a reduction procedure for a particular grammar rule, you often need to know the value one or another of your program variables had at the time the first token in the rule was encountered. Examples of such variables are: \begin{itemize} \item Line or column number \item Index in an input file \item Index into an array \item Counters, as of symbols defined, etc. \end{itemize} Such variables can be thought of as representing the ``context'' of the rule you are reducing. Sometimes it is possible to incorporate the values of such variables into the values of reduction tokens, but this can become quite cumbersome. AnaGram provides a feature known as ``context tracking'' to deal with this problem. Details on context tracking can be found in Chapter 9. \paragraph{Coverage Analysis.} \index{Coverage analysis} AnaGram has facilities to help you determine the adequacy of your test suites. If you set appropriate configuration switches, AnaGram will include code in your parser to count the number of times each rule in your grammar is reduced or the number of times each reduction procedure in your grammar is executed. You can use these counts to determine the completeness of your testing. The use of these facilities is discussed in Chapter 9. \paragraph{Immediate Actions.} \index{Action}\index{Immediate action} AnaGram provides for executing particular functions, called ``immediate actions'', at any point in a grammar rule. Thus, computation does not need to wait until the rule is complete. Immediate actions are most useful when using AnaGram to control an interactive process. Immediate actions are discussed in Chapter 8. \paragraph{Grammar Trace.} \index{Grammar Trace}\index{Window}\index{Trace} The \agwindow{Grammar Trace} facility of AnaGram allows you to examine the workings of your parser in detail. You can set up a representation of the \index{Parser state stack}\index{State stack}\index{Stack}parser state stack and parser state as they might appear in the course of execution of your parser. You can then examine the possible inputs and see how the state and the state stack change in response to any input you choose. You can back up and try other options. You can have several grammar traces active simultaneously so you can compare the results of different sequences of input. For any configuration of the state stack, you can see what grammar rules the parser is in the process of matching. The use of the Grammar Trace is described in Chapter 5. \paragraph{File Trace.} \index{Trace}\index{File Trace}\index{Window} The \agwindow{File Trace} is a facility which allows you to see how your parser will work on real data. File Trace is an interactive, interpretive parser governed by the rules in your grammar. It lets you see precisely how your grammar parses a test file. You can see the contents of the parse stack at any point in the parse. You can watch the parse progress at any level of detail you choose. If you wish, you may back up and try again. When AnaGram runs the \agwindow{File Trace} or \agwindow{Grammar Trace} it also builds a \index{Coverage}\index{Trace Coverage}\index{Window} \agwindow{Trace Coverage} table. Note that AnaGram normally uses a number of short-cut parsing actions. To make the parse look like a textbook parse, set the \index{Traditional engine}\agparam{traditional engine} \index{Configuration parameters}configuration parameter in your syntax file. \paragraph{Aids to Debugging.} AnaGram provides a number of facilities to help you debug your parser. It provides numerous tables that summarize your grammar and a windowing environment to assist you in inspecting and comparing this data. To assist in identifying a conflict in your grammar it can provide a \index{Trace}\index{Conflict Trace}\index{Window}\agwindow{Conflict Trace}, a pre-built grammar trace showing a sequence of input which will trigger the conflict. If your parser encounters a \index{Syntax error}\index{Errors}syntax error, you can arrange to have AnaGram build an \agwindow{Error Trace}, a pre-built grammar trace showing the sequence of input tokens that led to the syntax error. These facilities are described in Chapter 5. \paragraph{Conflict Resolution.} \index{Conflict} In order to help you deal with conflicts in your grammar, AnaGram provides several methods for resolving conflicts. You may include \agterm{precedence declarations} to assist in the construction of simple precedence grammars, you may declare tokens to be \agparam{sticky}, or you may declare tokens to be \agterm{subgrammar} tokens. These techniques are described in Chapter 7.