view doc/manual/sf.tex @ 9:60b08b68c750
Switch to static inline as an expedient build fix.
Should probably set this up with working C99 inline but for the moment
I don't have the energy.
author | David A. Holland |
---|---|
date | Mon, 30 May 2022 23:56:45 -0400 |
parents | 13d2b8934445 |
\chapter{Syntax Files}
\index{Syntax file}\index{File}

Input files to AnaGram are called \agterm{syntax files}. A syntax file comprises a grammar and associated C or C++ code. The grammar consists of a number of productions along with supporting information such as configuration sections and definitions of character sets. The associated code consists of reduction procedures (see \S 8.2.13) and embedded C or C++ code (\S 8.2.17).

This chapter explains the rules for writing syntax files acceptable to AnaGram. The rules for interfacing your parser to the balance of your program are given in Chapter 9.

\section{Lexical Conventions}
\index{Lexical conventions}

\subsection{Statements}
\index{Statements}

For purposes of this manual, AnaGram statements are considered to be productions, definition statements, configuration sections, and blocks of embedded C or C++ code, all discussed individually below. Each statement must begin on a new line. It is a good idea to separate statements visually in your file by using blank lines freely.

There are generally no restrictions on the \index{Statements}\index{Order of statements}order of statements in a syntax file. Good programming practice, however, suggests that definitions and configuration sections should precede the grammar itself.

\subsection{Spaces and Tabs}
\index{Spaces}\index{Tabs}

AnaGram allows spaces and tabs to be used freely to improve the readability of grammars. Spaces and tabs are ignored, except when embedded in a token name, in a character set definition, or in a keyword. Within a token name, any sequence of spaces and tabs counts as a single space.

\subsection{Continuation Lines}
\index{Continuation lines}

AnaGram statements normally end with a newline character or the end of file. If AnaGram encounters the end of a line and the statement it is reading appears to be complete, it will not look for a continuation.
To continue a statement to another line, just make sure that what you have on the first line is clearly incomplete. For example, \begin{indentingcode}{0.4in} prep phrase -> preposition, "the", noun \end{indentingcode} looks complete to AnaGram, whereas \begin{indentingcode}{0.4in} prep phrase -> preposition, "the", noun, \end{indentingcode} looks incomplete because of the dangling comma at the end. \subsection{Comments} \index{Comments} AnaGram accepts comments in accordance with the rules of C and C++, that is, normal C comments bracketed with \agcode{/*} and \agcode{*/}, as well as comments which begin with \agcode{//} and continue to the end of line. AnaGram also observes these conventions when skipping over embedded C code. Since the ANSI standard for C insists that normal C comments do not nest, AnaGram, by default, disallows nested comments. You may, however, set a configuration parameter, \index{Nest comments}\index{Configuration switches}\index{Comments} \agparam{nest comments}, to allow nested comments. See Appendix A. In any case, AnaGram will use the same convention for embedded C as it uses for AnaGram proper. You can change the convention in the middle of the file if necessary. AnaGram treats each comment delimited with \agcode{/*} and \agcode{*/} as though it were a single space. You can even put such comments in the middle of token names if you should want to. A comment that begins with \agcode{//} is treated as though the end of line occurred at the \agcode{//}. \subsection{Blank Lines and Form Feeds} \index{Blank lines} Because blank lines and form feeds are visual separators, AnaGram will not skip either looking for a continuation line. Therefore blank lines and form feeds can occur only between AnaGram statements, not in the middle of a statement. 
It is a good idea to separate groups of productions with a blank line or two, lest an accidental dangling comma make AnaGram think the beginning of the next production is a continuation of the present one.

\section{Elements of Grammars}

\subsection{Names}
\index{Name}\index{Token}

You may use names to represent tokens, character sets, keywords and \index{Virtual productions}\index{Production}virtual productions. Names follow the same general rules as for any programming language, with the notable exception that they may have embedded white space. Names are made up of letters, digits, or underscores. They may not begin with a digit. Any sequence of embedded spaces, tabs or comments counts as a single space. AnaGram distinguishes between upper and lower case\index{Case sensitivity}, so that \agcode{Word} and \agcode{word} are different names. There is no particular limit to the length of a name.

There are no reserved words as such, although \agcode{grammar}, \agcode{eof}, and \agcode{error} will be treated as reserved words unless you take special action by setting appropriate configuration parameters.

The names AnaGram uses for \index{Configuration parameters}configuration parameters follow the same rules as for other names, except that \index{Case sensitivity}case is ignored.

\subsection{Reserved Words}
\index{Reserved words}\index{Words}

% XXX shouldn't that be \index{Grammar token}?
AnaGram treats tokens with the names \index{Grammar}\agcode{grammar}, \index{Eof token}\index{Token}\agcode{eof}, and \index{Error token}\index{Token}\agcode{error} in a special manner unless certain measures are taken. Since you can override AnaGram's use of these names, they are not reserved words in the true sense.
If your grammar has a token named \agcode{grammar}, AnaGram will take that token to be the grammar token for your grammar unless you set the \index{Token}\index{Grammar token}\index{Configuration parameters} \agparam{grammar token} configuration parameter or mark some other token as the grammar token using ``\index{ \_dol}\$''.% See below ???.

If your grammar has a token named \agcode{error} and you take no further steps, AnaGram will assume you wish to use error token resynchronization in case of \index{Syntax error}\index{Errors}syntax error. See Chapter 9. If you wish to use some other token as an error token you may select it using the \index{Configuration parameters}\index{Token}\index{Error token} \agparam{error token} configuration parameter. If you wish to use \agcode{error} as a token name, but do not want error token resynchronization, you may set the \agparam{error token} configuration parameter to any name that is not used in your grammar. You may then use \agcode{error} as a token name without causing AnaGram to include error token resynchronization in your parser.

\index{Resynchronization}
If you select automatic resynchronization or error token resynchronization (see Chapter 9), AnaGram will look for a token called \agcode{eof} to use as an end of file indicator. You may either name your end of file token \agcode{eof} or you may set the \agparam{eof token} configuration parameter with the name of your end of file token.

\subsection{Variable Names}
\index{Name}\index{C variable names}

With AnaGram you can associate C/C++ variable names with the \index{Semantic value}\index{Token}\index{Value}semantic values of tokens for use in your \index{Reduction procedure}reduction procedures. Each name follows the corresponding token in the grammar rule on the right of the production, separated from the token by a colon.

AnaGram allows variable names made up of letters, digits, and underscores. They may not begin with a digit.
Embedded spaces, tabs, or comments are not allowed, of course. AnaGram imposes no restriction on length, but uses your variable names just as you have written them in the code it generates to call reduction procedures. Remember that your compiler may have a limit on the length of variable names. Also, AnaGram itself uses C variable names beginning with \agcode{ag{\us}}. It is therefore wise to avoid using names of this form.

\subsection{Terminal Tokens}
\index{Terminal token}\index{Token}

A \agterm{terminal token} is a token which does not appear on the left side of a production. It represents, therefore, a basic unit of input to your parser. You have several options with respect to terminal tokens. If the input to your parser consists of ASCII characters, you may define terminal tokens explicitly as ASCII characters or as sets of ASCII characters. If you have an input procedure which produces numeric codes, you may define the terminal tokens directly in terms of these numeric codes. On the other hand, you may leave the terminal tokens completely undefined. In this case, you must provide an input procedure which can determine the appropriate \index{Token}\index{Token number}\index{Number}token numbers. It is an all or none situation. If you provide any explicit definitions, you must provide them for all terminal tokens.

Input procedures and token input are discussed in Chapter 9. Examples of non-character input may be found in the Macro Preprocessor example in the \agfile{examples/mpp} directory on your AnaGram distribution disk.% Further examples are given in Chapter ???.
% XXX change ``on ...distribution disk'' to ``in ...distribution''.

\subsection{Character Representations}
\index{Character representations}

In specifying admissible input characters you may use \index{Character constants}character constants following the normal C conventions. Remember that a character constant may specify only a single character.
Although some C compilers will allow constructs such as \agcode{'mv'}, AnaGram doesn't allow this. AnaGram recognizes the same escape sequences as C, including octal and hex sequences, even though this is, strictly speaking, unnecessary. The escape sequences AnaGram recognizes are:

%
% It would be nice to be able to just write this and tell latex to set
% it in three columns. but no... that would be too easy.
%
%
%\begin{tabular}{ll}
%\agcode{{\bs}a}&alert (bell) character\\
%\agcode{{\bs}b}&backspace\\
%\agcode{{\bs}f}&formfeed\\
%\agcode{{\bs}n}&newline\\
%\agcode{{\bs}r}&carriage return\\
%\agcode{{\bs}t}&horizontal tab\\
%\agcode{{\bs}v}&vertical tab\\
%\agcode{{\bs\bs}}&backslash\\
%\agcode{{\bs}?}&question mark\\
%\agcode{{\bs}'}&single quote\\
%\agcode{{\bs}"}&double quote\\
%\agcode{{\bs}ooo}&octal number\\
%\agcode{{\bs}xhh}&hexadecimal number\\
%\end{tabular}

\begin{indenting}{0.4in}
\begin{tabular}{llllll}
\agcode{{\bs}a}&alert (bell) character& \agcode{{\bs}t}&horizontal tab& \agcode{{\bs}'}&single quote\\
\agcode{{\bs}b}&backspace& \agcode{{\bs}v}&vertical tab& \agcode{{\bs}"}&double quote\\
\agcode{{\bs}f}&formfeed& \agcode{{\bs\bs}}&backslash& \agcode{{\bs}\textit{ooo}}&octal number\\
\agcode{{\bs}n}&newline& \agcode{{\bs}?}&question mark& \agcode{{\bs}x\textit{hh}}&hexadecimal number\\
\agcode{{\bs}r}&carriage return\\
\end{tabular}
\end{indenting}
\bigskip

The octal escape sequence allows up to three octal digits, in accordance with ANSI specifications for C. The hexadecimal numbers may contain an arbitrary number of digits; however AnaGram will truncate the result to sixteen bits. A backslash followed by any character other than those listed above will cause a syntax error.

You may also represent characters by writing the numeric code explicitly, in decimal, octal, or hexadecimal representations.
AnaGram follows the C conventions for integer constants: a leading \agcode{0} means the number is octal, a leading \agcode{0x} or \agcode{0X} means it is hexadecimal. The hex digits \agcode{a-f} may be either upper or lower case\index{Case sensitivity}. Numbers may be preceded by an optional minus sign.

If your parser uses a pre-existing \index{Lexical scanner}lexical scanner and you wish to use the code numbers it generates to identify tokens, you may simply treat those code numbers as character numbers. You may use the numbers directly in your productions, or you may use definition statements to name them. You may also use an \agparam{enum} statement within a configuration section to attach names to the code numbers.
% XXX shouldn't this use of enum be indexed?

AnaGram also allows a special notation for control characters. You may represent a control character by using the ``\^{}'' character preceding any printing ASCII character. Thus you can write \agcode{\^{}z} or \agcode{\^{}Z} to represent the DOS end-of-file character. Notice that quotation marks are not necessary.

Examples of character representations:

\begin{indenting}{0.4in}
\begin{tabular}{cccc}
\agcode{'K'}&\agcode{-1}&\agcode{0}&\agcode{'{\bs}t'}\\
\agcode{\^{}J}&\agcode{'{\bs}xff'}&\agcode{077}&\agcode{0XF3}\\
\end{tabular}
\end{indenting}

\subsection{Character Ranges}
\index{Character range}\index{Range}

It is convenient to be able to specify ranges of characters when writing a grammar. AnaGram supports several ways of representing ranges of characters. The first is an extension of the notation for character constants: \agcode{'a-z'} is the set of lower case characters. You can even use escape sequences such as \agcode{'{\bs}n-{\bs}r'} if you like. The order of characters used to specify the range is immaterial: \agcode{'z-a'} is the same as \agcode{'a-z'}. AnaGram will, however, issue a warning just in case the unusual order results from a clerical error.
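
As an illustrative sketch only (this is not AnaGram's own code), the numeric conventions for character codes can be mirrored in C. The helper names \agcode{control{\us}code} and \agcode{numeric{\us}code} are hypothetical. The control character mapping assumes the conventional ASCII rule of keeping the low five bits of a letter, and \agcode{strtol} with base 0 is used because it applies the same octal and hexadecimal prefixes as C integer constants.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: code for the control character written ^c.
 * Assumes the usual ASCII convention of keeping the low five bits
 * of a letter, so that both ^z and ^Z give 26. */
int control_code(char c)
{
    return c & 0x1f;
}

/* Hypothetical helper: read a numeric character code using C's rules
 * for integer constants (leading 0 = octal, leading 0x or 0X = hex,
 * optional minus sign). strtol with base 0 follows these rules. */
long numeric_code(const char *text)
{
    char *end;
    long value = strtol(text, &end, 0);
    assert(end != text && *end == '\0');  /* the whole string must be a number */
    return value;
}
```

Under these assumptions \agcode{control{\us}code('C')} is 3, and \agcode{numeric{\us}code("032")} and \agcode{numeric{\us}code("0x1a")} both give 26, the code of \agcode{\^{}Z}.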
The second way to specify a range is by using two arbitrary character representations, as described above, separated by two dots. For example, \agcode{\^{}C..\^{}Z}, \agcode{3..26}, \agcode{3..032}, \agcode{3..0x1a}, and \agcode{\^{}C..0x1a}, all represent the same range of characters. Similarly, \agcode{'A-F'}, \agcode{'A'..'F'}, \agcode{0101..0106}, \agcode{0x41..0x46}, \agcode{65..70}, and \agcode{65..'F'} all represent the same range of characters. \subsection{Character Sets} \index{Character sets} If you provide explicit definitions for terminal tokens, the basic input unit for your parser will be considered a character set, even if your input procedure provides numeric codes that are not actually characters. As a terminal token, a character set will be matched by any input character that is a member of the set. Character sets may be named in definition statements, but they may also appear on the right sides of productions without being named. A character set may consist of one or more characters. You can specify a character set that consists of a single character by using any of the character representation methods described above. You can specify a set consisting of a range of characters by using any of the representations of character ranges described above. \index{Character sets} To specify more complicated sets, you can write \index{Expressions}\index{Set expressions}expressions using conventional set theoretic operations. In AnaGram input, these operations are specified as follows: \index{Union}\index{Difference}\index{Intersection}\index{Complement} \begin{indenting}{0.4in} \begin{tabular}{cl} \agcode{A + B}&(union)\\ \agcode{A - B}&(difference)\\ \agcode{A \& B}&(intersection)\\ \agcode{\~{}A}&(complement)\\ \end{tabular} \end{indenting} where \agcode{A} and \agcode{B} are arbitrary sets. Union and difference have the same precedence. Intersection has higher precedence and complement has the highest precedence. 
Thus in the expression \begin{indentingcode}{0.4in} A + \~{}B\&C \end{indentingcode} the complement operation is performed first, then the intersection, and finally the union. Watch out! In an AnaGram syntax file \agcode{65 + 97} represents the character set which consists of lower case \agcode{a} and upper case \agcode{A}. It does not represent 162, the sum of 65 and 97. Parentheses may be used to force the order of evaluation: \begin{indentingcode}{0.4in} \~{}(A \& (B+C)) \end{indentingcode} In this example the union of \agcode{B} and \agcode{C} is calculated, then the intersection of this set with \agcode{A} is calculated, and finally the complement is evaluated. The computation of the \index{Complement}complement of a \index{Character sets}set requires a definition of the \index{Universe}universe of set elements. AnaGram will define the universe to be the set of unsigned 8-bit characters, unless one or more characters outside that range have been specified. In that case, the universe will consist of all characters on the interval defined by the lesser of zero and the lowest character code used and the greater of 255 and the highest character code used. The complement of a character set is everything in this universe except the characters in the set. Characters which make up part of the character universe, but are not legitimate input according to your grammar, are lumped together into a special token which will cause an error if it occurs in your input. When your parser reads an input character, it uses that character to index a conversion table in order to determine the appropriate \index{Token number}\index{Token}\index{Number}token number. If the \index{Range}\index{Test range}\index{Configuration switches} \agparam{test range} configuration switch is on, its default setting, your parser will include code to verify that the character is in bounds before it indexes the conversion table. 
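
The table lookup and range test just described might be pictured in C as follows. This is not the code AnaGram generates: the table name, the token numbers, and the functions \agcode{build{\us}table}, \agcode{token{\us}for} and \agcode{token{\us}for{\us}unchecked} are all made up for illustration, and the universe is assumed to be the default 0..255.

```c
#include <assert.h>

#define UNIVERSE_SIZE 256   /* assumed universe: character codes 0..255 */
#define INVALID_TOKEN 0     /* hypothetical token for unacceptable input */

/* Hypothetical conversion table: maps a character code to a token number. */
static int conversion_table[UNIVERSE_SIZE];

/* Fill the table for a toy grammar: digits are token 1, letters are
 * token 2, and every other code maps to INVALID_TOKEN. */
void build_table(void)
{
    int c;
    for (c = 0; c < UNIVERSE_SIZE; c++) conversion_table[c] = INVALID_TOKEN;
    for (c = '0'; c <= '9'; c++) conversion_table[c] = 1;
    for (c = 'a'; c <= 'z'; c++) conversion_table[c] = 2;
    for (c = 'A'; c <= 'Z'; c++) conversion_table[c] = 2;
}

/* Lookup with the range test: out-of-bounds codes never index the table. */
int token_for(int c)
{
    if (c < 0 || c >= UNIVERSE_SIZE) return INVALID_TOKEN;
    return conversion_table[c];
}

/* Lookup without the range test: the caller must guarantee that c is
 * in bounds. The assert only documents that precondition. */
int token_for_unchecked(int c)
{
    assert(c >= 0 && c < UNIVERSE_SIZE);
    return conversion_table[c];
}
```

The unchecked variant saves one comparison per input character, at the price of trusting the input procedure never to deliver a code outside the universe.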
If you are satisfied that checking bounds is unnecessary, you may turn the \agparam{test range} switch off and get a slightly higher level of performance from your parser.

For efficient processing, it is well to keep the number of tokens to a minimum. Therefore if you have a choice between defining a construct as a token, with a production, or a set, with a definition, the set is to be preferred.

Some useful character sets are:

\begin{indenting}{0.4in}
\begin{tabular}{ll}
\agcode{'a-z' + 'A-Z'}&Alphabetic characters\\
\agcode{'a-f' + 'A-F'}&Hex digits\\
\agcode{'0-9'}&Decimal digits\\
\agcode{0..127}&ASCII character set\\
\agcode{32..126}&Printing ASCII characters\\
\agcode{\~{}'{\bs}n'}&Anything but newline\\
\agcode{\^{}Z}&Windows/DOS end of file indicator\\
\agcode{-1}&Stream I/O end of file indicator\\
\agcode{0}&String terminator\\
\agcode{32..126 - 'a-z' - 'A-Z' - '0-9'}&Punctuation\\
\end{tabular}
\end{indenting}
\bigskip
% XXX ``punctuation'' is wrong; it should subtract off space too

Note that \agcode{'a-z'} is a range of characters but \agcode{32..126 - 'a-z'} is a set difference.

When AnaGram encounters a character set in a grammar rule, it assigns a token number to the character set. If it has previously seen the same character set it will assign the same token number; however, it assigns the same token number only if the set expressions are obviously the same. Thus, AnaGram will assign the same token number every time it sees \agcode{A + B}, but will assign a different token number if it sees \agcode{B + A}. Only when AnaGram has finished scanning the entire syntax file can it actually evaluate the character sets. If it finds that several different tokens all refer to the same character set, it will create a single token that represents the true character set and create \index{Shell productions}\index{Production}``shell productions'' for the others.
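
To make the set arithmetic concrete, here is a small C sketch that evaluates the ``Punctuation'' expression from the table of useful character sets over an assumed universe of 0..255. The type \agcode{charset} and all of the function names are invented for this illustration and say nothing about how AnaGram represents sets internally. The sketch also assumes that a chain of differences is evaluated left to right.

```c
#include <assert.h>
#include <string.h>

#define UNIVERSE 256   /* assumed universe: character codes 0..255 */

/* Hypothetical representation: one membership flag per character code. */
typedef struct { unsigned char member[UNIVERSE]; } charset;

/* The range lo..hi, as in 32..126 or 'a-z'. */
charset set_range(int lo, int hi)
{
    charset s;
    int c;
    assert(0 <= lo && lo <= hi && hi < UNIVERSE);
    memset(s.member, 0, sizeof s.member);
    for (c = lo; c <= hi; c++) s.member[c] = 1;
    return s;
}

/* A + B: members of either set. */
charset set_union(charset a, charset b)
{
    int c;
    for (c = 0; c < UNIVERSE; c++) a.member[c] = a.member[c] || b.member[c];
    return a;
}

/* A - B: members of a that are not members of b. */
charset set_difference(charset a, charset b)
{
    int c;
    for (c = 0; c < UNIVERSE; c++) a.member[c] = a.member[c] && !b.member[c];
    return a;
}

/* ~A: everything in the universe that is not in a. */
charset set_complement(charset a)
{
    int c;
    for (c = 0; c < UNIVERSE; c++) a.member[c] = !a.member[c];
    return a;
}

/* Number of members in the set. */
int set_count(charset s)
{
    int c, n = 0;
    for (c = 0; c < UNIVERSE; c++) n += s.member[c];
    return n;
}

/* 32..126 - 'a-z' - 'A-Z' - '0-9', evaluated left to right (assumed). */
charset punctuation_as_written(void)
{
    charset s = set_range(32, 126);
    s = set_difference(s, set_range('a', 'z'));
    s = set_difference(s, set_range('A', 'Z'));
    s = set_difference(s, set_range('0', '9'));
    return s;
}
```

Evaluated this way, the expression yields 33 members and still contains the space character (code 32). Subtracting the space as well would leave the 32 ASCII punctuation characters.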
\index{Character sets}If the character sets you use in your grammar overlap, they do not properly represent \index{Terminal token}\index{Token}terminal tokens. To deal with this situation, AnaGram identifies all overlaps among character sets and extends your grammar by adding a number of extra productions. For instance, suppose your grammar uses the following character sets as though they were terminal tokens:

\begin{indentingcode}{0.4in}
'a-z' + 'A-Z'
'0-9'
'0-7'
'a-f' + 'A-F'
\end{indentingcode}

AnaGram will then modify your grammar by adding the following productions:

\begin{indentingcode}{0.4in}
'a-z' + 'A-Z' -> 'a-f' + 'A-F' | 'g-z' + 'G-Z'
'0-9' -> '0-7' | '8-9'
\end{indentingcode}

Although the tokens \agcode{'a-z' + 'A-Z'} and \agcode{'0-9'} are technically now \index{Nonterminal token}\index{Token}nonterminal tokens, for purposes of determining the \index{Token}\index{Data type}data type of their \index{Semantic value}\index{Token}\index{Value}semantic values, AnaGram continues to regard them as terminal tokens. This \index{Partition}\index{Universe}\index{Character universe} ``partitioning'' of the character universe is described in Chapter 6.

\subsection{Keyword Strings}
\index{Keywords}

In your grammar, AnaGram recognizes character strings within double quotes (e.g., \agcode{"IF"}) as keywords. The strings follow the same syntactic rules as strings in C. The same escape sequences are honored. AnaGram does not, however, allow for the concatenation of adjacent strings. Note that AnaGram strings are used only for the definition of keywords in your grammar, not for messages to be displayed or printed. Keyword strings may not include null characters and must be at least one character long.

You may have any number of keywords. Each is treated as a single terminal token. A keyword may be given a name by means of a definition statement. Keywords may appear in virtual productions.

AnaGram's keyword recognition works in the following way.
First, for each state in your parser, AnaGram prepares a list of all the keywords that are admissible in that state. Your parser will recognize a keyword \emph{only} if it is in an appropriate state; otherwise it will appear to be an anonymous sequence of characters. Your parser, in any state, checks for keywords it expects before it checks for acceptable characters. That is, \emph{keywords take precedence} over simple characters. It does not look for keywords that would not be acceptable input. The parser will do whatever lookahead is necessary in order to pick up the entire keyword. Thus if the character \agcode{I} and the keyword \agcode{IF} are both legitimate input at some point, \agcode{IF} will be recognized, if present, in preference to \agcode{I}. If several admissible keywords match the input, such as \agcode{IF} and \agcode{IFF}, the parser will select the longest match, \agcode{IFF} in this example. AnaGram does not incorporate keywords into its character sets. Keywords stand apart and should not appear in definitions of character sets. In particular, they are not considered as belonging to the complement of a character set. Thus for the production \begin{indentingcode}{0.4in} next char -> \~{}('{\bs}n' + \^{}Z) \end{indentingcode} a keyword would not be considered legitimate input. Note also that a keyword consisting of a single character does not belong to the character universe. Because of this fact, AnaGram's treatment of \agcode{'X'} and \agcode{"X"} is very different. If this seems confusing at first, try using only keywords which are at least two characters long until you have some experience with them. AnaGram's keyword recognition logic normally does not make any assumptions about what precedes or follows a keyword. Thus if \agcode{int} is a keyword, your parser will be capable of plucking it out of a string of characters such as \agcode{disintegrate} if, according to your grammar, it could follow \agcode{dis}. 
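
The matching rules above (only admissible keywords are considered, keywords are tried before single characters, and the longest match wins) can be sketched in C. This is a simplified stand-in, not AnaGram's generated recognizer: the function \agcode{match{\us}keyword}, and the idea of passing the keywords admissible in the current state as an array, are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: among the keywords admissible in the current
 * state, return the length of the longest one that is a prefix of the
 * remaining input, or 0 if none matches. On 0 the parser would fall
 * back to treating the next input as a single character. */
size_t match_keyword(const char *input,
                     const char *const admissible[], size_t count)
{
    size_t best = 0;
    size_t i;
    assert(input != NULL);
    for (i = 0; i < count; i++) {
        size_t len = strlen(admissible[i]);
        /* strncmp stops at the end of a short input, so a keyword
         * longer than the remaining input cannot match. */
        if (len > best && strncmp(input, admissible[i], len) == 0) {
            best = len;
        }
    }
    return best;
}
```

With \agcode{IF} and \agcode{IFF} both admissible, the input \agcode{IFF} gives a match of length three. With \agcode{int} admissible after \agcode{dis}, calling the function on the remainder \agcode{integrate} also reports a three character match, which is the behavior the \agcode{disintegrate} example describes.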
The \agparam{sticky} declaration and the \agparam{distinguish keywords} statement, described below, can prevent such unwanted recognition of keywords. A keyword following a \agparam{sticky} token will not be recognized if the first character of the keyword can be shifted in as part of the \agparam{sticky} token. The \agparam{distinguish keywords} statement prevents recognition of a keyword if it is followed immediately by a character of the sort that makes up the keyword.

\subsection{Type Specifications For Tokens}
\index{Token}\index{Token type}\index{Type declarations}

When you write productions or token declarations (see below), AnaGram allows you to specify the data type\index{Token}\index{Data type} of the \index{Semantic value}\index{Token}\index{Value}semantic value of a token by using a C or C++ data type specification. The restrictions are that AnaGram does not allow specification of array or function types, nor explicit structure types. Types that are defined with typedef statements, structure definitions, or class definitions, including template classes, in your embedded C or C++ are acceptable. Thus the following specifications, for example, are acceptable:

\begin{indentingcode}{0.4in}
void
int
char *
unsigned long *near
static float *far
my{\us}type
double *
struct descriptor
struct widget *
vector <double> *
\end{indentingcode}

On the other hand, the following specifications are \emph{not} valid:

\begin{indentingcode}{0.4in}
int[20]
int *(int, unsigned char)
\bra int x,y; float z; \ket
struct \bra int k; float z; \ket
\end{indentingcode}

Note that AnaGram itself does nothing with the type specifications. It simply passes them on to your compiler as appropriate.

\subsection{Productions}
\index{Production}

Productions are the basic units of a grammar. A production consists of a left side and a right side.
\index{Left side}The left side of a production consists of one or more token names, joined by commas, optionally preceded by a type specification enclosed in parentheses. \index{Right side}The right side begins with an arrow and may either begin on the same line as the left side or on a new line. For example:

\begin{indentingcode}{0.4in}
program -> statement list, eof
expression -> expression, plus, term
(int) variable name, function name -> name:n = look{\us}up(n);
\end{indentingcode}

The part of the right side of a production following the arrow is called a \index{Grammar rule}\index{Rule}\agterm{grammar rule}, discussed below.

A production need not have a right side at all. In this case, it is simply called a \index{Declaration}\index{Token}\agterm{token declaration}. AnaGram assigns \index{Token number}\index{Token}\index{Number}token numbers to the token names on the left side, and, if there is a type specification, records the data type for each of the tokens declared. Declarations of this sort are most useful when using input from a \index{Lexical scanner}lexical scanner. See Chapter 9 for a discussion of techniques for interfacing a lexical scanner to your parser. If you do not intend to use a lexical scanner you will have no need for token declarations.

If you do not explicitly specify the type for the \index{Semantic value}\index{Token}\index{Value}semantic value of a token, it will be determined by the configuration parameter \index{Default token type}\index{Configuration parameters}\index{Token} \agparam{default token type} if it is a \index{Nonterminal token}\index{Token}nonterminal token or by the \index{Configuration parameters}configuration parameter \index{Input token type}\index{Default input type}\agparam{default input type} if it is a \index{Token}terminal token. \agparam{Default token type} defaults to \agcode{void}. \agparam{Default input type} defaults to \agcode{int}.
If a production has more than one token on the left side, as in the third example above, it is called a \index{Semantically determined production}\index{Production} \agterm{semantically determined production}. Semantically determined productions are a useful tool for exerting semantic control over syntactic analysis. A semantically determined production should have a reduction procedure which determines on a case by case basis which of the tokens on the left side should be taken as the reduction token. If there is no reduction procedure, or if the reduction procedure does not make a choice, the reduction token will be the first syntactically correct token on the left side of the production. In the example above, \agcode{variable name} will be the reduction token unless \agcode{look{\us}up} changes it to \agcode{function name}. Semantically determined productions are discussed more fully in Chapter 9.

If several productions have the same left side, it does not need to be repeated. Subsequent right hand sides must each start on a new line. For example:

\begin{indentingcode}{0.4in}
integer -> digit
  -> integer, digit
name -> letter
  -> name, letter
  -> name, digit
\end{indentingcode}

On the other hand, you do not have to group productions with the same left side. You could write the above productions as follows, although it would certainly not be good programming practice:

\begin{indentingcode}{0.4in}
name -> name, digit
integer -> integer, digit
name -> name, letter
integer -> digit
name -> letter
\end{indentingcode}

Nevertheless, there are a few occasions involving complex cross recursions and semantically determined productions where it is not possible to group productions neatly.

The right side of a production can be empty. Such a production is called a \index{Null productions}\index{Production}\agterm{null production}. Null productions are useful to denote an optional element in a grammar, or a list that may be empty.
For example:

\begin{indentingcode}{0.4in}
optional widget ->
  -> widget
optional qualifiers ->
  -> optional qualifiers, qualifier
\end{indentingcode}

A second way to write multiple productions with the same left side uses the \index{Vertical bar}\index{|}vertical bar character, ``$|$'', to separate the grammar rules. The productions given above for \agcode{name}, \agcode{optional widget}, and \agcode{optional qualifiers} can also be written:

\begin{indentingcode}{0.4in}
name -> letter | name, letter | name, digit
optional widget -> | widget
optional qualifiers -> | optional qualifiers, qualifier
\end{indentingcode}

Note that a null production cannot \emph{follow} a vertical bar.

A token that has a null production is called a \index{Zero length token}\index{Token}\agterm{zero length token}, since it can be represented by an empty sequence of input characters, that is to say, by nothing at all. Furthermore, even if a token doesn't have any null productions, if it has at least one rule consisting entirely of zero length tokens it is also a zero length token. In the Token Table window, AnaGram notes which tokens are zero length, because they can be a source of conflicts.

\subsection{Grammar Token}

Every grammar must have a single token which produces the entire grammar. This token is variously called the \index{Token}\index{Grammar token}\agterm{grammar token}, the \index{Goal token}\agterm{goal token} or the \index{Start token}\agterm{start token}. AnaGram provides several methods you may use to specify which token in your grammar is the grammar token. You may simply use the name \agcode{grammar} for the grammar token. If you wish to use some other more descriptive name for your grammar token, you may mark it with a following dollar sign when it appears on the left side of a production. Alternatively, you may set the \index{Grammar token}\index{Configuration parameters}\agparam{grammar token} configuration parameter to specify the grammar token.
Here are examples of the methods:

\begin{indentingcode}{0.4in}
grammar -> [statement | newline]/...

program \$ -> [statement | newline]/...

{}[ grammar token = program ]
program -> [statement | newline]/...
\end{indentingcode}

If you should use more than one of these techniques, AnaGram resolves the issue in the following manner: A marked token or a configuration parameter setting always takes precedence over simply naming a token \agcode{grammar}. If you mark more than one token or set the configuration parameter more than once, the last setting or mark wins.

\subsection{Grammar Rules}
\index{Rule}\index{Grammar rule}

The part of a production to the right of the arrow is more often called a \agterm{grammar rule}, or simply \agterm{rule}. A grammar rule is a sequence of \index{Rule elements}\agterm{rule elements}, joined by commas, as in the examples of productions given above. Rule elements are token names, character set expressions, virtual productions, or immediate actions (see below). Each rule element may be optionally followed by a parameter assignment. The entire rule may be followed by an optional reduction procedure.

A \index{Parameter assignment}parameter assignment is a colon followed by a C variable name. Here are some examples of rule elements with parameter assignments:

\begin{indentingcode}{0.4in}
'0-9':d
integer:n
expression:x
declaration:declaration{\us}descriptor
\end{indentingcode}

The parameters you assign to tokens in your grammar rule become the formal parameters for your \index{Reduction procedure}reduction procedure. The data type\index{Data type}\index{Reduction procedure arguments} of the parameter is determined by the data type for the semantic value of the token to which it is assigned.

If your grammar rule has parameter assignments, but does not have a reduction procedure, AnaGram will give you a warning in case the lack of a reduction procedure is an oversight. If you don't need a reduction procedure you may safely ignore the warning.
On the other hand, AnaGram has no way to determine whether you have
failed to make necessary parameter assignments. You won't find out
until you compile your parser, when your compiler will give you error
messages for undefined symbols.

AnaGram assigns a unique rule number to each rule in your grammar.
Rules are numbered sequentially as they are encountered in the syntax
file. AnaGram constructs rule zero itself. Rule zero normally has a
single element, the grammar token, unless you have a
\agparam{disregard} statement in your grammar. In this case there will
be two elements.

\subsection{Reduction Procedures}
\index{Reduction procedure}

% XXX somewhere in here it ought to say something like
% ``in the parsing literature reduction procedures are often known as
% \agterm{semantic actions}.''
% Note that R. says there's some subtle difference between the usual
% concept of semantic action and AG's concept of reduction procedure.
% I don't know what this difference is and I hope she can recall it.
%
% D. thinks this note ought to be at the end; R. wants it at the top.

A \agterm{reduction procedure} is a piece of C code which optionally
follows a production. The code is executed when your parser identifies
the production in its input. There are two forms for reduction
procedures, a short form and a long form. The short form consists of a
single C expression. The long form consists of an arbitrary block of C
code.

When AnaGram builds a parser, it inspects the grammar rule to which
the procedure is attached and identifies the parameters for the
procedure. It uses these parameters as the formal parameters for the
procedure. If the
\index{Macros}\index{Allow macros}\index{Configuration switches}
\agparam{allow macros} configuration switch has not been turned off,
AnaGram codes the reduction procedure as a macro definition whenever
possible. Otherwise AnaGram codes it as a function definition.
AnaGram builds the name for a reduction procedure by appending its
internal procedure number to the string \agcode{ag{\us}rp{\us}}. Thus
reduction procedures are numbered in the order in which they are
encountered in the syntax file.

Both long and short form reduction procedures are preceded by an equal
sign which follows the production. The short form consists of a C or
C++ expression terminated by a semicolon. When the grammar rule is
reduced, the expression will be evaluated and its value will become
the value of the reduction token. The expression and the terminating
semicolon must be entirely on a single line. Note that, if you really
need to make the expression longer than will fit on one line, you can
embed a newline in a comment. Some examples of short form reduction
procedures are:
% XXX is there anything we can do about the ugly underscores?
\begin{indentingcode}{0.4in}
=0;
=1;
=10*n + d-'0';
= special{\us}processor(first{\us}parameter, second{\us}parameter);
=word{\us}count++;
=widget(constant{\us}1*parameter{\us}1 + constant{\us}2*parameter{\us}2 /*
{}      */ + constant{\us}3*parameter{\us}3);
\end{indentingcode}

A long form reduction procedure consists of an arbitrary block of C or
C++ code, enclosed in braces (\bra \ket). AnaGram will code the
reduction procedure as a function. To return a value for the reduction
token, simply use the \agcode{return} statement. There are effectively
no restrictions on the content or length of a reduction procedure. Of
course, if there are unbalanced braces, unterminated comments or
unterminated string literals, AnaGram will not be able to determine
where the reduction procedure ends. AnaGram treats
\index{Comments}nested comments within a reduction procedure according
to the value of the
\index{Nest comments}\index{Configuration switches}\agparam{nest comments}
configuration switch at the point where it encounters the reduction
procedure.
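
As a rough illustration of how parameter assignments become formal
parameters, consider the familiar rules for accumulating a decimal
integer. This sketch assumes that both \agcode{integer} and the digit
set carry \agcode{int} semantic values:
\begin{indentingcode}{0.4in}
integer
 -> '0-9':d              =d-'0';
 -> integer:n, '0-9':d   =10*n + d-'0';
\end{indentingcode}
If these happened to be the first two reduction procedures in the
syntax file, and AnaGram coded them as functions rather than macros,
the generated code could be expected to resemble the following. The
procedure numbers, the \agcode{static} storage class and the layout
are assumptions made for illustration; the code AnaGram actually
writes may differ.
\begin{indentingcode}{0.4in}
/* hypothetical rendering of the two procedures above */
static int ag{\us}rp{\us}1(int d) \bra return d-'0'; \ket
static int ag{\us}rp{\us}2(int n, int d) \bra return 10*n + d-'0'; \ket
\end{indentingcode}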

From a practical point of view it is not usually good practice to have
a reduction procedure that is more than a few lines long since a long
procedure will hamper your overall view of your grammar. Long
reduction procedures should be written as separate named functions,
and should either be included in the embedded C portion of your syntax
file or should be included in a wholly separate module.

Here is an example of a long form reduction procedure:
\begin{indentingcode}{0.4in}
=\bra
  if (flag) \bra
    total += x;
    return identify(x);
  \ket
  else \bra
    total = 0;
    flag = 1;
    return init{\us}table(x);
  \ket
\ket
\end{indentingcode}

If a rule does not have a reduction procedure, the semantic value of
the reduction token will be set to the
\index{Semantic value}\index{Token}\index{Value}semantic value of the
first token in the rule, unless the rule is a
\index{Null productions}null production. In the latter case, the value
of the reduction token will be set to zero.
% XXX and what if zero isn't a valid value for the type? a compiler
% error will occur.

% XXX add something like
%
% Variables appearing in reduction procedures which do not have a
% parameter assignment in the corresponding grammar rule can be
% declared globally or (file)-statically in your embedded C, or
% alternatively could be added to the parser control block using
% the \agparam{extend pcb} statement (q.v. | See Section ....).
% (Reword this.)
%
% Should also discuss the sequencing of reduction procedure calls
% so that people understand what happens if you use such variables.
%
% also ``A reduction procedure can be used to terminate parsing for
% semantic reasons''.
%

\subsection{Immediate Actions}
\index{Immediate action}\index{Action}

An immediate action is a rule element that consists of executable C or
C++ code embedded within a grammar rule to be executed when it is
encountered. An immediate action is denoted by the use of an
exclamation point, \index{!}``!''.
The content of an immediate action may be written following the rules
for either long form or short form reduction procedures. As with any
other rule element, it must be separated from preceding and following
rule elements by commas. In the grammar for a simple desk calculator,
one might write
\begin{indentingcode}{0.4in}
transaction -> !printf("\#");, expression:x = printf("\%d{\bs}n", x);
\end{indentingcode}

Notice that the only visible difference between an immediate action
and a reduction procedure is that the immediate action is preceded by
``!'' instead of ``=''. The immediate action must be followed by a
comma to separate it from the following rule element.

Immediate actions may also be used in definitions:
\begin{indentingcode}{0.4in}
prompt = !printf("\#");
\end{indentingcode}

AnaGram implements an immediate action by creating a special token for
it. AnaGram then creates a single null production for the token.
Finally, the immediate action is implemented as the reduction
procedure for the null production. For example, you could implement
\agcode{prompt} by writing a null production with a reduction
procedure:
\begin{indentingcode}{0.4in}
prompt -> = printf("\#");
\end{indentingcode}

This production would be equivalent to the definition above. There are
two ways, however, in which immediate actions differ from the
equivalent null production. Immediate actions may access any parameter
assignments which precede them in the rule in which they occur. On the
other hand, there is no way to assign a data type to the semantic
value, if any, returned by the immediate action. Therefore, the type
is determined by your setting of the
\index{Default token type}\index{Configuration parameters}
\agparam{default token type} configuration parameter.
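
A minimal sketch of the first difference: the immediate action below
uses \agcode{n}, which is assigned earlier in the same rule. The
functions \agcode{note{\us}target} and \agcode{store}, and the tokens
\agcode{name} and \agcode{expression}, are hypothetical and assumed to
be declared elsewhere in the syntax file; \agcode{note{\us}target} is
assumed to return a value compatible with the default token type.
\begin{indentingcode}{0.4in}
assignment -> name:n, !note{\us}target(n);, '=', expression:x = store(n, x);
\end{indentingcode}
A separately written null production could not refer to \agcode{n},
since \agcode{n} is not a parameter of that production.
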
\subsection{Virtual Productions} \index{Virtual productions}\index{Production} Virtual productions are a convenient short form notation for common grammatical constructs involving choice and repetition. The notation represents an extension of notation commonly used in programming manuals. A virtual production may be written in a grammar rule at any place where you could write a token name, even within another virtual production. Note that use of virtual productions is never \emph{required}, since the equivalent productions can always be written out explicitly instead. When AnaGram encounters a virtual production, it replaces the virtual production with a new token and writes appropriate productions for the new token. When you look at your syntax tables using AnaGram windows, you will see the productions that AnaGram generates. AnaGram keeps a record of virtual productions, so that generally if you use the same virtual production a second time, you get the same set of tokens and productions that were generated the first time it was used. This is not the case if the virtual productions contain reduction procedures or immediate actions, since AnaGram is not equipped to determine whether two pieces of C code are equivalent. Thus, a virtual production that contains a reduction procedure will be unique and will not be reused. One disadvantage of virtual productions is that there is no way to specify the data type of the \index{Semantic value}\index{Virtual production}semantic value of a virtual production. Therefore, if you have a reduction procedure within a virtual production, its return value must be consistent with the type defined by the \index{Default token type}\index{Configuration parameters}\agparam{default token type} configuration parameter. The simplest virtual production is the \index{Token}\index{Optional token}\agterm{optional token}. 
If \agcode{x} is an arbitrary token name or set expression, you can indicate an optional \agcode{x} by writing \index{?}\agcode{x?}. You may also indicate a repetition of \agcode{x} by using the ellipsis with either \agcode{x} or \agcode{x?}. \index{...}\index{Ellipsis}Thus \agcode{x...} represents one or more instances of \agcode{x} and \index{?...}\agcode{x?...} represents zero or more instances of \agcode{x}. For example: \begin{indentingcode}{0.4in} '+'? \end{indentingcode} can be used to represent an optional plus sign, that is, a choice between a plus sign and nothing at all. Similarly, \begin{indentingcode}{0.4in} '{\bs}n'?... \end{indentingcode} represents an optional sequence of newline characters. \index{Brackets}\index{Braces}\index{\_opb\_clb}\index{[]} The next category of virtual productions uses brackets or braces to indicate a choice among a number of enclosed grammar rules separated by vertical bars. A single rule may also be enclosed. Note that \emph{rules}, with following reduction procedures, are allowed, not simply tokens. Braces are used to indicate that one option must be chosen. Brackets are used to indicate the choice is optional, i.e. may be omitted altogether. The ellipsis following a set of options within brackets or braces indicates the option may be repeated an indefinite number of times. You can use braces to indicate a simple choice among a number of options. A Cobol grammar offers the following choice of equivalent keywords: \begin{indentingcode}{0.4in} \bra "RECORD", "IS"? | "RECORDS", "ARE"? \ket \end{indentingcode} \index{\_opb\_clb...}\index{ []...} You may use the ellipsis with braces to indicate an arbitrary positive number of repetitions of the choice: \begin{indentingcode}{0.4in} {\bra}type specifier | storage class specifier{\ket}... \end{indentingcode} This expression requires at least one type specifier or storage class specifier, but will accept any number. 
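
Since a virtual production can always be written out explicitly, the
expression above could be replaced by ordinary productions along the
following lines. The token names \agcode{specifier} and
\agcode{specifier list} are made up for this sketch, and the
productions AnaGram generates internally need not have exactly this
shape:
\begin{indentingcode}{0.4in}
specifier
 -> type specifier
 -> storage class specifier
specifier list
 -> specifier
 -> specifier list, specifier
\end{indentingcode}
Writing \agcode{specifier list} in a rule would then accept the same
input as the virtual production.
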
\index{[]} To make a choice optional, use brackets instead of braces. An example, again drawn from a Cobol grammar, is: \begin{indentingcode}{0.4in} {}["LIMIT", "IS"? | "LIMITS", "ARE"?] \end{indentingcode} \index{[]...} Ellipses may be used with brackets to indicate an arbitrary number of choices that may be omitted altogether: \begin{indentingcode}{0.4in} {}[argument, [',', argument]...] \end{indentingcode} This expression describes an optional argument list with arguments separated by commas. If you use a null production within braces, it must be the first option: \begin{indentingcode}{0.4in} \bra | '+' | '-' \ket \end{indentingcode} Normally, you would do this only if you wanted to attach a reduction procedure to the null production. Note that if you include a null production within braces, and add an ellipsis after the closing brace for repetition, your grammar will be ambiguous. Just exactly how many times does the null production occur? Use brackets instead, and omit the null production. Null productions are not allowed with brackets, since they would be intrinsically ambiguous. The options within braces or brackets may be grammar rules of any length or complexity and may themselves contain virtual productions of arbitrary complexity. Nevertheless, in practice, clarity suffers as soon as the options get very complex. Virtual productions are most important and useful when used in simple situations. In those situations they will enhance the clarity of your grammar. Here is an example that is moderately complex, even though each rule consists of a single token: \begin{indentingcode}{0.4in} \bra{\bra}"on" | "true"\ket = 1; | {\bra}"off" | "false"\ket = 0; | integer\ket \end{indentingcode} This example can be used to allow as input either an integer or, for special cases, keywords. 
You could write this option out in the following way:
\begin{indentingcode}{0.4in}
p1 -> p2 = 1;
   -> p3 = 0;
   -> integer
p2 -> "on"
   -> "true"
p3 -> "off"
   -> "false"
\end{indentingcode}

The final category of virtual production provides a notation for
\index{Alternating sequence}\agterm{alternating sequences}. An
alternating sequence is a set of choices which may be repeated
arbitrarily subject to the side condition that no choice may follow
itself, in other words, that the choices must alternate. Alternating
sequences are written with either brackets or braces depending on
whether the sequence is optional or not, followed by
\index{/...}``\agcode{/...}''. Note that the choices themselves may
allow sequences. For example:
\begin{indentingcode}{0.4in}
program -> [statement | newline...]/..., eof
\end{indentingcode}
represents a sequence of statements separated by one or more newlines.
Any two statements must be separated by one or more newline
characters, and newlines may also appear at the beginning and the end
of the program.

Null productions are not allowed within alternating sequences, since
they are intrinsically ambiguous in all cases.

\subsection{Definition Statements}
\index{Definitions}\index{Definition statement}\index{Statement}

A definition statement is simply a shorthand way of naming a character
set, a \index{Virtual productions}\index{Production}virtual
production, a keyword string, or an immediate action. It can also be
used for providing an alternate name for a token. Definitions have the
form:
\begin{indentingcode}{0.4in}
name = \codemeta{character set}
name = \codemeta{virtual production}
name = \codemeta{keyword}
name = \codemeta{immediate action}
name = \codemeta{token name}
\end{indentingcode}

The name may be any name acceptable to AnaGram. The name can then be
used anywhere you might have used the expression on the right side.
\index{!}For example:
\begin{indentingcode}{0.4in}
upper case letter = 'A-Z'
lower case letter = 'a-z'
letter = upper case letter + lower case letter
statement list = statement?...
while keyword = "WHILE"
prompt = !printf("Please enter name:");
\end{indentingcode}

It is important to recognize that a definition statement that names a
set does not define a token. A token is defined only when the set is
used in a grammar rule, and then only if the set is used directly, not
in combination with some other set. Furthermore, if you use a
character set directly in a grammar rule, and in some other rule you
use a name that refers to the same set of characters, you will get two
different tokens. For example, if you have defined
\agcode{upper case letter} as in the above example and use both
\agcode{upper case letter} and \agcode{'A-Z'} in grammar rules,
AnaGram will assign different
\index{Token number}\index{Token}\index{Number}token numbers to
accommodate any differences in attributes you may assign to the
tokens.

Renaming tokens is a convenient way to connect two independently
written portions of a grammar.
% See the C grammar in the EXAMPLES directory of your distribution
% disk for an example.

\subsection{Embedded C}
\index{Embedded C}

You may encapsulate C or C++ code in your syntax file by enclosing it
in braces (\bra \ket). Such pieces of code are copied to the parser
file untouched, in the order they are found in the syntax file. There
may be any number of such pieces of embedded C. The only restriction
is that they must not start on the same line as some other AnaGram
statement, and following AnaGram statements must also start on fresh
lines.

Normally, the blocks of embedded C in your syntax file are copied to
the parser file \emph{following} a set of definitions and declarations
AnaGram needs for the code it generates.
However, if the \emph{first} statement in your \index{Syntax file}syntax file is a block of embedded C, it will \emph{precede} AnaGram's definitions and declarations. This block of embedded C is called the \index{Prologue}\index{C prologue}``C prologue''. There are two main reasons for this special treatment. First, you may want to have a title and \index{Copyright notice}copyright notice in your parser. If you include them in an initial block of embedded C they will be right at the beginning of both your syntax file and your parser file. Second, if some of your tokens have data type\index{Token}\index{Data type}s other than those predefined in C or C++, you may include the definitions here, so they will be available to the code AnaGram generates. AnaGram scans embedded C only insofar as is necessary to find the closing right brace. Therefore any braces used within embedded C must balance properly. AnaGram skips braces enclosed in character constants and string literals, as well as braces enclosed in comments. It also recognizes C++ style comments that begin with \agcode{//}. \index{Comments}Treatment of nested versus non-nested comments is controlled by the \index{Nest comments}\index{Configuration switches}\agparam{nest comments} configuration parameter. AnaGram will use the status of this parameter in effect at the beginning of the section of embedded C. AnaGram, of course, can be confused by unterminated strings, unbalanced brackets, and unterminated comments. The most likely outcome, in such a situation, is that AnaGram will encounter the end of file looking for the end of the embedded C. Should this happen, AnaGram will identify the beginning of the piece of embedded C which caused the problem. The code you include as embedded C, of course, has to coexist with the code AnaGram generates. 
In order to keep the potential for conflicts to a minimum, all variables and functions which AnaGram defines begin either with the name of your parser or with the letters \agcode{ag{\us}}. You should avoid variable names which begin with these letters. Reduction procedures are copied to the \index{Parser file}\index{File}parser file in the order in which they are defined \emph{following} all of the embedded C. Thus your reduction procedures may freely use variables and macros defined anywhere in your embedded C. \subsection{Configuration Sections} \index{Configuration section} A configuration section is a special section of your syntax file enclosed in brackets. Within a configuration section you may set the values of configuration parameters or switches, or you may use one or more of several available attribute statements to specify special treatment for certain tokens. There can be as many or as few configuration sections in your syntax file as you wish. Each configuration section must begin on a new line. Any AnaGram statement which follows a configuration section must also begin on a new line. Within a configuration section, each parameter setting and each attribute statement must begin on a new line. The rules for using comments and continuation lines are the same as for the rest of AnaGram. Configuration parameters control the way AnaGram interprets your syntax file and the way it builds your parser. A full discussion of the use of configuration parameters, including a complete discussion of each parameter and its default value, is given in Appendix A. 
\index{Attribute statements}\index{Statement} Attribute statements comprise the \index{Precedence declarations}precedence declarations \agparam{left}, \agparam{right}, and \agparam{nonassoc}; the \agparam{sticky} declaration; the \agparam{distinguish keywords} statement; the \agparam{hidden} declaration; the \agparam{disregard} and \agparam{lexeme} statements; the \agparam{enum} statement; the \index{Reserve keywords}\agparam{reserve keywords} declaration; and the \index{Rename macro}\agparam{rename macro} statement. The precedence declarations and the \index{Sticky declaration}\index{Declaration}\agparam{sticky} declaration may be used to resolve conflicts in your grammar. The \agparam{distinguish keywords} statement may be used to control keyword recognition. The \index{Hidden declaration}\index{Declaration}\agparam{hidden} declaration causes certain token names not to be used when your parser produces \index{Syntax error}\index{Errors}\index{Error messages}syntax error messages. You may use the \agparam{disregard} and \agparam{lexeme} statements to cause your parser to skip automatically over certain tokens in its input. The \agparam{enum} statement is almost identical to the enum statement in C. It can be used to assign names to input codes in grammars which are taking input from a \index{Lexical scanner}lexical scanner or another parser. The \index{Reserve keywords}\agparam{reserve keywords} declaration allows you to specify certain keywords as reserved words. The \index{Rename macro}\agparam{rename macro} statement allows you to override the names AnaGram uses for various macro definitions it creates in the code it generates. Attribute statements are discussed below. Except for \agparam{disregard} and \agparam{rename macro} statements, attribute statements accept lists of operands enclosed in braces (\bra \ket) and separated by commas. A dangling comma following the last item in a list will be ignored. 
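
As a sketch of how several attribute statements might share one
configuration section, the following combines forms shown elsewhere in
this chapter. The token names are illustrative only. Each statement
starts on a new line, and the comma after \agcode{number} shows a
dangling comma that would be ignored:
\begin{indentingcode}{0.4in}
{}[
  left \bra '+', '-' \ket
  sticky \bra name \ket
  hidden \bra comment head \ket
  disregard white space
  lexeme \bra name, number, \ket
]
\end{indentingcode}
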
\subsection{Setting Configuration Parameters}
\index{Configuration parameters}\index{Parameters}

Each configuration parameter has a name that follows the AnaGram
conventions for symbol names, except that AnaGram ignores
case\index{Case sensitivity} when looking up configuration parameter
names. There are a number of varieties of configuration parameters.
The simplest,
\index{Configuration switches}\index{Switches}configuration switches,
simply turn some feature of AnaGram on or off. These parameters need
simply be stated to turn the feature on, or negated with the tilde
(\agcode{\~{}}) to turn the feature off:
\begin{indentingcode}{0.4in}
nest comments
\end{indentingcode}
causes AnaGram to allow nested comments, and
\begin{indentingcode}{0.4in}
\~{}nest comments
\end{indentingcode}
causes AnaGram to disallow nested comments. You may also set or reset
configuration switches with explicit on or off values:
\begin{indentingcode}{0.4in}
nest comments = on
nest comments = off
\end{indentingcode}

The remaining configuration parameters are assigned values using a
simple assignment statement. Depending on the parameter, the value it
takes may be the name of a token, a C variable name, a C or C++ data
type, a string constant or an integer. String constants are written
using the same rules as keyword strings, described above.
\begin{indentingcode}{0.4in}
grammar token = program
parser name = widget
default token type = void *
header file name = "widget.h"
parser stack size = 50
\end{indentingcode}

A number of string-valued
\index{Configuration parameters}configuration parameters are used to
determine file names and variable names. In these parameters, the
\index{\#}``\#'', \index{\_dol}``\$'', and ``\index{ \_prc}\%''
characters are used as wild cards. In file name specifications and the
specification of the name of your parser, ``\#'' will be replaced by
the name of your syntax file.
In other function and variable names AnaGram creates while building
your parser, ``\$'' will be replaced by the name of your parser. When
building enumeration constants for the names of the tokens in your
grammar, ``\%'' will be replaced by the name of the token.

Note that when entering a Windows/DOS path name as a value for a file
name parameter you must escape each backslash in the path name by
doubling it. For example,
\begin{indentingcode}{0.4in}
coverage file name = "f:{\bs\bs}sna{\bs\bs}foo.nrc"
\end{indentingcode}

\subsection{Precedence Declarations}
\index{Precedence declarations}

AnaGram allows you to resolve shift-reduce conflicts by assigning
precedence levels to operators. There are three precedence
declarations available, beginning with the keywords
\index{Left}\agparam{left}, \index{Right}\agparam{right}, and
\index{Nonassoc}\agparam{nonassoc} respectively. Each such declaration
consists of the appropriate keyword and a list of tokens enclosed in
braces (\bra \ket). All the tokens in the list have the same
precedence, higher than tokens in any previous declaration and lower
than in any subsequent declaration. If the keyword is \agparam{left},
the tokens will group to the left. If it is \agparam{right}, they will
group to the right. If it is \agparam{nonassoc} (for non-associative)
no grouping will be assumed. Precedence declarations must be included
in a configuration section.

Here are precedence declarations appropriate to a simple desk
calculator program:
\begin{indentingcode}{0.4in}
{}[
  left \bra '+', '-' \ket
  left \bra star, '/', '\%' \ket
  right \bra unary minus \ket
]
unary minus = '-'
\end{indentingcode}

Note that \agcode{unary minus} and \agcode{'-'} can have different
precedence.

Precedence declarations are one of the few instances in AnaGram where
the \index{Statements}\index{Order of statements}order of statements
is significant. The use of precedence declarations is discussed in
Chapter 9.
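
To see what these declarations accomplish, here is a sketch of an
expression grammar of the kind they are meant for. It is not taken
from a tested syntax file: the token \agcode{number}, the \agcode{int}
semantic values and the reading of \agcode{star} as the multiplication
operator are assumptions. The modulus operator is omitted for brevity.
\begin{indentingcode}{0.4in}
expression
 -> number
 -> expression:x, '+', expression:y    =x + y;
 -> expression:x, '-', expression:y    =x - y;
 -> expression:x, star, expression:y   =x * y;
 -> expression:x, '/', expression:y    =x / y;
 -> unary minus, expression:x          =-x;
\end{indentingcode}
Taken alone, these rules leave shift-reduce conflicts, because nothing
says how \agcode{2 - 3 - 4} or \agcode{2 + 3 * 4} should group. With
the declarations above, \agcode{'-'} groups to the left, giving
\agcode{(2 - 3) - 4}; \agcode{star} has higher precedence than
\agcode{'+'}, giving \agcode{2 + (3 * 4)}; and \agcode{unary minus},
declared last, binds most tightly, so that \agcode{-2 * 3} is treated
as \agcode{(-2) * 3}.
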
\subsection{``Sticky'' Declarations}
\index{Sticky declaration}\index{Declaration}

AnaGram provides another means for resolving shift-reduce conflicts.
You may characterize any token as ``sticky''. Then, in the case of a
\index{Shift-reduce conflict}\index{Conflicts}shift-reduce conflict
where a ``sticky'' token is the last token in the input buffer, the
conflict will be resolved by selecting the shift operation.
Intuitively, you may think of this as though the ``sticky'' token
adheres to and draws in any subsequent input that it can.

``Sticky'' declarations are included in configuration sections. They
begin with the keyword \agcode{sticky} followed by a list of tokens,
separated by commas inside braces (\bra \ket).

Suppose, for instance, you wished to pick up a line of text, skipping
any leading space or tab characters. You might write the following
syntax:
\begin{indentingcode}{0.4in}
white space = ' ' + '{\bs}t'
text char -> \~{}'{\bs}n':c = do{\us}something(c);
line -> leading white space, text char?..., '{\bs}n'
leading white space ->
  -> leading white space, white space
\end{indentingcode}

Unfortunately, this syntax is ambiguous, since space and tab are
legitimate instances of both leading white space and text char. What
you really want to do is to skip white space until you find a
non-blank character and then you want to accept all characters to the
end of the line. There are two ways to address the problem. The first
is to define a special token for the first non-blank character and,
using it, to write an unambiguous grammar. This approach, while
laudable, is tedious and prolix. Instead, use \agparam{sticky} to
resolve the problem:
\begin{indentingcode}{0.4in}
{}[ sticky \bra leading white space \ket ]
\end{indentingcode}

Now when AnaGram analyzes your grammar, and encounters the ambiguity,
it will understand that a blank or tab that could be treated either as
leading white space or as the first text character should be treated
as white space.
Since \agcode{leading white space} is ``sticky'', any subsequent white
space adheres to it.

As with conflicts resolved with precedence levels, AnaGram lists all
conflicts that it resolves using \agcode{sticky} in the
\index{Resolved Conflicts}\index{Window}\agwindow{Resolved Conflicts Table},
so you can verify that the conflicts have been correctly resolved.

An important use of sticky tokens is to inhibit the recognition of
following \index{Keywords}keywords. Following a sticky token, a
keyword, which, according to your grammar, would otherwise be
legitimate input, will not be recognized if a shift action is possible
for the first character of the keyword. For example, imagine that
\agcode{name} has been defined in the conventional way, and there
exists a production with \agcode{name} followed immediately by the
keyword \agcode{int}. Then if, in your input, the word \agcode{print}
were to occur, your grammar would parse it as a name, \agcode{pr},
followed by the keyword \agcode{int}. If you make \agcode{name}
sticky, however, the first letter of \agcode{int} will be seen to be
an acceptable character for \agcode{name} and the keyword will not be
recognized. Your parser will then recognize the \agcode{name} as
\agcode{print}.

\subsection{Distinguish Keywords Statement}
\index{Distinguish keywords}\index{Keywords}

Distinguish keywords statements are occasionally needed to prevent
keyword recognition. You may, for example, wish to prevent the
recognition of the keyword \agcode{int} when it occurs embedded in a
word such as \agcode{interval}. Of course, you need to do this only if
the keyword and the other word are both legitimate input at the same
point in your grammar. A distinguish keywords statement can prevent
recognition of a keyword which is embedded in another word provided at
least one character of the other word follows the keyword.
The distinguish keywords statement has the form: \begin{indentingcode}{0.4in} distinguish keywords \bra \codemeta{list of character sets} \ket \end{indentingcode} AnaGram compares all the characters in each keyword to the characters included in each character set in turn. If it finds that all the characters in a keyword are members of a particular set, it tells the keyword recognition logic to try to match the keyword only against the longest sequence of characters drawn from the specified set. In other words, in order for a keyword to be recognized, the keyword \emph{must} be followed by a character \emph{not} in the set. The set associated with a keyword is the first one in the list which contains all the characters found in the keyword. If you have more than one \agparam{distinguish keywords} statement in your grammar, the lists are tried in the order in which they appear in the grammar. The purpose of the \agparam{distinguish keywords} statement is to enable your parser to distinguish a keyword from the same sequence of characters embedded within another sequence. Thus suppose that \agcode{int} is a keyword, and, according to your grammar, could appear in the same place as the word \agcode{integral}. If you don't want it to be recognized as a keyword in these circumstances, you would write the following distinguish statement: \begin{indentingcode}{0.4in} distinguish keywords \bra 'a-z'+'A-Z' \ket \end{indentingcode} To also inhibit recognition of \agcode{int} within \agcode{print}, you would combine the use of the distinguish keywords statement with the \agparam{sticky} declaration. \subsection{``Hidden'' Declarations} \index{Hidden declaration}\index{Declaration} AnaGram provides an optional \index{Error diagnosis}error diagnosis feature for your parser (see Chapter 9). The \agparam{hidden} declaration allows you to identify tokens that you do not wish to be used in making up \index{Diagnostic messages}diagnostic messages. 
These are tokens whose names would mean nothing to your users.

The format of a ``hidden'' declaration is the same as that of
precedence and ``sticky'' declarations. Within a configuration
section, the keyword ``hidden'' is followed by a list of tokens. For
example:
\begin{indentingcode}{0.4in}
{}[ hidden \bra comment head \ket ]
comment -> comment head, "*/"
comment head -> "/*"
  -> comment head, \~{}eof
\end{indentingcode}

This is an AnaGram representation of ANSI standard C comments
(non-nested). In this example the token \agcode{comment head} exists
only for convenience in writing the grammar and has no particular
meaning to an end user. The user does, on the other hand, know what
the word \agcode{comment} refers to. The ``hidden'' attribute will
cause AnaGram's diagnostic builder to back up the stack until it finds
a non-hidden token, so that it reports \agcode{comment} rather than
\agcode{comment head}.

\subsection{Disregard Statement}

The purpose of the
\index{Disregard statement}\index{Statement}\agparam{disregard}
statement is to skip over uninteresting \index{White space}white space
and comments in your input files. The disregard statement allows you
to specify a token that should be passed over in the input to your
parser. The statement takes the form:
\begin{indentingcode}{0.4in}
disregard ws
\end{indentingcode}
where \agcode{ws} is a token name or character set. Disregard
statements may be placed in any configuration section.

You may have more than one disregard statement in your grammar. If you
do, AnaGram will create a shell production.
For example, suppose you write:
\begin{indentingcode}{0.4in}
{}[
  disregard alpha
  disregard beta
]
\end{indentingcode}
AnaGram will proceed as though you had written:
\begin{indentingcode}{0.4in}
gamma -> alpha | beta
{}[ disregard gamma ]
\end{indentingcode}

It frequently happens that you wish your parser to disregard blanks or comments, except that white space within names, numbers, strings, and other elementary constructs is subject to special rules and thus should not be disregarded blindly. In this case, you can use the \agparam{lexeme} statement to declare these constructs off limits for the disregard statement. Within these constructs, the disregard statement will be inoperative and the admissibility of white space will be determined solely by the productions which define these constructs.

Outside those productions which define lexemes, you should not generally use a token which is supposed to be disregarded. If you do, your grammar will have conflicts, since the token could satisfy both the explicit usage and the implicit rules set up by the disregard statement. Such conflicts, however, are resolved automatically in favor of your explicit use of the token. The conflicts will appear in the \agwindow{Resolved Conflicts} window.
% XXX I'm not sure that's still true.

In order to implement the disregard statement AnaGram will redefine some tokens in your grammar. For example, \agcode{+} may be redefined to consist of a simple plus sign followed by optional white space:
\begin{indentingcode}{0.4in}
'+' -> '+'\%, white space?...
\end{indentingcode}
The percent sign is used to indicate the original, simple plus sign without the optional white space attached. You will probably notice the percent sign appearing in some windows and traces. In earlier versions of AnaGram, the degree sign, ``\agcode{\degrees}'', was used rather than ``\agcode{\%}''.
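
As a sketch of how these pieces might fit together, the following fragment disregards both white space and C-style comments between tokens. The token names \agcode{white space} and \agcode{eof}, and the choice of blank, tab, and newline as the white space characters, are assumptions made for illustration; the \agcode{comment} productions are those used in the discussion of ``hidden'' declarations.
\begin{indentingcode}{0.4in}
{}[ disregard white space ]
eof = 0
white space
  -> ' ' + '{\bs}t' + '{\bs}n'
  -> comment
comment
  -> comment head, "*/"
comment head
  -> "/*"
  -> comment head, \~{}eof
\end{indentingcode}
In this sketch \agcode{comment} is reached only through the disregarded token and is not used explicitly elsewhere in the grammar, so it should not give rise to the conflicts described above.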
\subsection{Lexeme Statement} The ``lexeme'' \index{Statement}\index{Lexeme statement}statement is used to fine-tune the disregard statement. The lexeme statement takes the form: \begin{indentingcode}{0.4in} {}[ lexeme \bra \codemeta{nonterminal token list} \ket ] \end{indentingcode} where \textit{nonterminal token list} is a list of nonterminal tokens separated by commas. Lexeme statements may be placed in any configuration section, and there may be any number of them. When you specify that a token is to be disregarded, AnaGram rewrites your grammar so that the token will be passed over whenever it occurs at the beginning of a file or following a lexical unit, or \agterm{lexeme}. If you have no \agparam{lexeme} statement, then the lexemes in your grammar are just the terminal tokens. The \agparam{lexeme} statement allows you to specify that certain nonterminal tokens are also to be treated as lexemes. This means that the disregard token will be skipped following the lexeme, but not between the characters that constitute the lexeme. Lexemes correspond to the tokens that a lexical scanner, if you were using one, would commonly identify and pass to a parser as single tokens. You don't usually wish to disregard white space within these tokens. For example, in a grammar for a conventional programming language where blank characters are to be disregarded, you might include: \begin{indentingcode}{0.4in} {}[ lexeme \bra string, character constant, name, number \ket ] \end{indentingcode} since blank characters must not be overlooked within strings and character constants and should not be permitted within names or numbers. Normally, AnaGram considers the disregard token to be optional; however there are circumstances where treating the disregard token as optional would lead to conflicts: two successive names, or two successive numbers, for example. In this case, you would like to require that the lexemes be separated by instances of the disregard token. 
To do this, set the \index{Distinguish lexemes}\index{Configuration switches} \agparam{distinguish lexemes} configuration switch. When this switch is set, AnaGram will ensure that disregard tokens will be required in those situations where making them optional would lead to conflicts.

White space may be used explicitly within definitions of lexeme tokens in your grammar if desired, without causing conflicts. Thus, if you wish to allow embedded space in variable names, you might write:
\begin{indentingcode}{0.4in}
{}[
  disregard space
  lexeme \bra variable name \ket
]
space = ' ' + '{\bs}t'
letter = 'a-z' + 'A-Z'
digit = '0-9'
variable name
  -> letter
  -> variable name, letter + digit
  -> variable name, space..., letter + digit
\end{indentingcode}

\subsection{Enum Statement}
\index{Enum statement}\index{Enumeration}\index{Token}
The \agparam{enum} statement follows rules nearly identical to those for C and C++. This makes it possible to copy an enum statement from your syntax file to a program file written in either C or C++, without any need for editing. The only differences are that AnaGram makes no provision for blank lines within the enumeration list and does not accept a type name.

The \agparam{enum} statement is equivalent to a corresponding set of definition statements. It is especially useful when a parser is accepting token input from another program, a \index{Lexical scanner}lexical scanner, for example. Using the enum statement you may conveniently define all the identification codes for the input tokens.

Each entry in an enum statement may be either a name, or a name followed by an ``='' sign and a character representation. If there is a character representation, the name is assigned the value of the specified character. Otherwise it is assigned a value one more than that assigned to the previous name. If the first name in the list is not given an explicit value, it will be given the value zero.
For example:
\begin{indentingcode}{0.4in}
{}[
  enum \bra
    eof,
    a, b, c,
    blank = '\ ',
    x, y
  \ket
]
\end{indentingcode}
is equivalent to the following definition statements:
\begin{indentingcode}{0.4in}
eof = 0
a = 1
b = 2
c = 3
blank = '\ '
x = 33
y = 34
\end{indentingcode}

\subsection{Subgrammar Declarations}
\index{Subgrammar declaration}\index{Declaration}
A \agparam{subgrammar} declaration can be a useful way to deal with conflicts in certain situations. It tells AnaGram to treat each token listed in the declaration as though it were a grammar token specifying a complete subgrammar in itself, and, in determining shift and reduction actions, to ignore the usage of the tokens in the larger grammar.

In some cases it is perfectly reasonable to ignore usage. The most common example occurs when building a lexical scanner for a language such as C, as in the example in Section 7.4.4. In this case, you can write a complete grammar for a C token with no difficulty. But if you try to extend it to a sequence of tokens, you get scores of conflicts. This situation arises because you specify that any C token can follow another, when in actual practice, an identifier, for example, cannot follow another identifier without some intervening space or punctuation.

It is theoretically possible, but in practice quite awkward, to write a grammar for a sequence of tokens so that there are no conflicts. The subgrammar declaration provides a way around this problem by telling AnaGram that when it is looking for reducing tokens for any rule produced directly or indirectly by a subgrammar token, it should disregard the usage of the token and only consider usage internal to the definition of the subgrammar token, as though the subgrammar token were the start token of the grammar.
The subgrammar declaration is made in a configuration section and consists of the keyword \agcode{subgrammar} followed by a list of one or more nonterminal token names, separated by commas and enclosed in braces (\bra \ket). For example: \begin{indentingcode}{0.4in} {}[ subgrammar \bra C token, word \ket ] \end{indentingcode} Since the subgrammar statement changes the way AnaGram determines reducing tokens, it should be used with caution. You should be sure that the conflicts you are eliminating are really inconsequential. \subsection{Reserve Keywords Declaration} \index{Reserve keywords}\index{Keywords}\index{Keyword anomalies} The \agparam{reserve keywords} declaration can be used to specify a list of keywords that are reserved and cannot be used except as explicitly specified in the grammar. This enables AnaGram to avoid issuing meaningless keyword anomaly diagnostics (see \S 7.5). AnaGram does not automatically presume that keywords are also reserved words, since in many grammars there is no need to specify reserved words. The reserve keywords declaration is made in a configuration section and consists of the words \agcode{reserve keywords} followed by a list of one or more keyword strings, separated by commas and enclosed in braces (\bra \ket). For example: \begin{indentingcode}{0.4in} {}[ reserve keywords \bra "int", "char", "float", "double" \ket ] \end{indentingcode} \subsection{Rename Macro Statement} \index{Rename macro}\index{Macros} AnaGram uses a number of macros in its generated code. It is possible, therefore, to run into naming collisions with other components of your program. The \agparam{rename macro} statement allows you to change the name AnaGram uses for a particular macro to avoid these problems. For example, the Windows NT operating system uses \agcode{CONTEXT} structures to perform various internal operations. If you use the context tracking option (see \S 9.5.4) your parser will have a macro called \agcode{CONTEXT}. 
To avoid the name collision, add the following statement to any configuration section in your grammar: \begin{indentingcode}{0.4in} rename macro CONTEXT AG{\us}CONTEXT \end{indentingcode} Then, simply use \agcode{AG{\us}CONTEXT} where you would otherwise have used \agcode{CONTEXT}.
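
Since the statement may go in any configuration section, it can be grouped with other declarations. The following sketch simply combines two statements shown earlier in this chapter; the grouping is illustrative and is not required.
\begin{indentingcode}{0.4in}
{}[
  reserve keywords \bra "int", "char", "float", "double" \ket
  rename macro CONTEXT AG{\us}CONTEXT
]
\end{indentingcode}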