diff doc/manual/sf.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/manual/sf.tex Sat Dec 22 17:52:45 2007 -0500 @@ -0,0 +1,1827 @@ +\chapter{Syntax Files} +\index{Syntax file}\index{File} + +Input files to AnaGram are called \agterm{syntax files}. A syntax +file comprises a grammar and associated C or C++ code. The grammar +consists of a number of productions along with supporting information +such as configuration sections and definitions of character sets. The +associated code consists of reduction procedures (see \S 8.2.13) and +embedded C or C++ code (\S 8.2.17). This chapter explains the rules +for writing syntax files acceptable to AnaGram. The rules for +interfacing your parser to the balance of your program are given in +Chapter 9. + + +\section{Lexical Conventions} +\index{Lexical conventions} + +\subsection{Statements} +\index{Statements} + +For purposes of this manual, AnaGram statements are considered to be +productions, definition statements, configuration sections, and blocks +of embedded C or C++ code, all discussed individually below. Each +statement must begin on a new line. It is a good idea to separate +statements visually in your file by using blank lines freely. +There are generally no restrictions on the +\index{Statements}\index{Order of statements}order of statements +in a syntax file. Good programming practice, however, suggests that +definitions and configuration sections should precede the grammar +itself. + +\subsection{Spaces and Tabs} +\index{Spaces}\index{Tabs} + +AnaGram allows spaces and tabs to be used freely to improve the +readability of grammars. Spaces and tabs are ignored, except when +embedded in a token name, in a character set definition, or in a +keyword. Within a token name, any sequence of spaces and tabs counts +as a single space. + +\subsection{Continuation Lines} +\index{Continuation lines} + +AnaGram statements normally end with a newline character or the end of +file. 
If AnaGram encounters the end of a line and the statement it is +reading appears to be complete, it will not look for a continuation. +To continue a statement to another line, just make sure that what you +have on the first line is clearly incomplete. For example, + +\begin{indentingcode}{0.4in} +prep phrase -> preposition, "the", noun +\end{indentingcode} + +looks complete to AnaGram, whereas + +\begin{indentingcode}{0.4in} +prep phrase -> preposition, "the", noun, +\end{indentingcode} + +looks incomplete because of the dangling comma at the end. + +\subsection{Comments} +\index{Comments} + +AnaGram accepts comments in accordance with the rules of C and C++, +that is, normal C comments bracketed with \agcode{/*} and \agcode{*/}, +as well as comments which begin with \agcode{//} and continue to the +end of line. AnaGram also observes these conventions when skipping +over embedded C code. + +Since the ANSI standard for C insists that normal C comments do not +nest, AnaGram, by default, disallows nested comments. You may, +however, set a configuration parameter, +\index{Nest comments}\index{Configuration switches}\index{Comments} +\agparam{nest comments}, +to allow nested comments. See Appendix A. In any case, AnaGram will +use the same convention for embedded C as it uses for AnaGram proper. +You can change the convention in the middle of the file if necessary. + +AnaGram treats each comment delimited with \agcode{/*} and \agcode{*/} +as though it were a single space. You can even put such comments in +the middle of token names if you should want to. A comment that +begins with \agcode{//} is treated as though the end of line occurred +at the \agcode{//}. + +\subsection{Blank Lines and Form Feeds} +\index{Blank lines} + +Because blank lines and form feeds are visual separators, AnaGram will +not skip either looking for a continuation line. Therefore blank lines +and form feeds can occur only between AnaGram statements, not in the +middle of a statement. 
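The comment conventions given under Comments above can be made concrete with a short C sketch. It replaces each \agcode{/* ... */} comment with a single space and cuts each line at \agcode{//}, with an optional flag standing in for the \agparam{nest comments} parameter. The function name \agcode{strip{\us}comments} and its interface are assumptions made for illustration only; this is not AnaGram's own scanner, and for brevity it does not recognize string or character literals.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative sketch only (hypothetical helper, not AnaGram code).
 * Copies src to dst, turning every slash-star comment into one space
 * and dropping everything from "//" up to (not including) the newline.
 * If nest is nonzero, slash-star comments may nest.
 * dst must have room for at least as many bytes as src, including
 * the terminating null. Returns the length of the result.
 * Limitation: quoted strings and character constants are not treated
 * specially, so comment markers inside them are not protected.
 */
size_t strip_comments(const char *src, char *dst, int nest)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    while (*src) {
        if (src[0] == '/' && src[1] == '*') {
            int depth = 1;
            src += 2;
            while (*src && depth > 0) {
                if (nest && src[0] == '/' && src[1] == '*') {
                    depth++;            /* nested comment opens */
                    src += 2;
                } else if (src[0] == '*' && src[1] == '/') {
                    depth--;            /* innermost comment closes */
                    src += 2;
                } else {
                    src++;
                }
            }
            dst[out++] = ' ';           /* whole comment counts as one space */
        } else if (src[0] == '/' && src[1] == '/') {
            while (*src && *src != '\n')
                src++;                  /* line effectively ends at the "//" */
        } else {
            dst[out++] = *src++;
        }
    }
    dst[out] = '\0';
    return out;
}
```

With nesting off, the first \agcode{*/} closes the comment, so any text after an inner comment is treated as ordinary input again; with nesting on, the whole outer comment collapses to one space.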
+ +It is a good idea to separate groups of productions with a blank line +or two, lest an accidental dangling comma make AnaGram think the +beginning of the next production is a continuation of the present one. + + +\section{Elements of Grammars} + +\subsection{Names} +\index{Name}\index{Token} + +You may use names to represent tokens, character sets, keywords and +\index{Virtual productions}\index{Production}virtual productions. +Names follow the same general rules as for any programming language, +with the notable exception that they may have embedded white space. +Names are made up of letters, digits, or underscores. They may not +begin with a digit. Any sequence of embedded spaces, tabs or comments +counts as a single space. AnaGram distinguishes between upper and +lower case\index{Case sensitivity}, so that \agcode{Word} and +\agcode{word} are different names. There is no particular limit to the +length of a name. There are no reserved words as such, although +\agcode{grammar}, \agcode{eof}, and \agcode{error} will be treated as +reserved words unless you take special action by setting appropriate +configuration parameters. The names AnaGram uses for +\index{Configuration parameters}configuration parameters +follow the same rules as for other names, except that +\index{Case sensitivity}case +is ignored. + +\subsection{Reserved Words} +\index{Reserved words}\index{Words} + +% XXX shouldn't that be \index{Grammar token}? +AnaGram treats tokens with the names \index{Grammar}\agcode{grammar}, +\index{Eof token}\index{Token}\agcode{eof}, and \index{Error +token}\index{Token}\agcode{error} in a special manner unless certain +measures are taken. Since you can override AnaGram's use of these +names, they are not reserved words in the true sense. 
+ +If your grammar has a token named \agcode{grammar}, AnaGram will take +that token to be the grammar token for your grammar unless you set the +\index{Token}\index{Grammar token}\index{Configuration parameters} +\agparam{grammar token} +configuration parameter or mark some other token as the grammar token +using ``\index{ \_dol}\$''.% See below ???. + +If your grammar has a token named \agcode{error} and you take no +further steps, AnaGram will assume you wish to use error token +resynchronization in case of +\index{Syntax error}\index{Errors}syntax error. See Chapter 9. +If you wish to use some other token as an error token you +may select it using the +\index{Configuration parameters}\index{Token}\index{Error token} +\agparam{error token} configuration parameter. +If you wish to use \agcode{error} as a token name, but do not want +error token resynchronization, you may set the \agparam{error token} +configuration parameter to any name that is not used in your grammar. +You may then use \agcode{error} as a token name without causing +AnaGram to include error token resynchronization in your parser. + +\index{Resynchronization} +If you select automatic resynchronization or error token +resynchronization (see Chapter 9), AnaGram will look for a token +called \agcode{eof} to use as an end of file indicator. You may +either name your end of file token \agcode{eof} or you may set the +\agparam{eof token} configuration parameter with the name of your end +of file token. + +\subsection{Variable Names} +\index{Name}\index{C variable names} + +With AnaGram you can associate C/C++ variable names with the +\index{Semantic value}\index{Token}\index{Value}semantic values of +tokens for use in your \index{Reduction procedure}reduction +procedures. Each name follows the corresponding token in the grammar +rule on the right of the production, separated from the token by a +colon. AnaGram allows variable names made up of letters, digits, and +underscores. 
They may not begin with a digit. Embedded spaces, tabs, +or comments are not allowed, of course. AnaGram imposes no +restriction on length, but uses your variable names just as you have +written them in the code it generates to call reduction procedures. +Remember that your compiler may have a limit on the length of variable +names. Also, AnaGram itself uses C variable names beginning with +\agcode{ag{\us}}. It is therefore wise to avoid using names of this form. + +\subsection{Terminal Tokens} +\index{Terminal token}\index{Token} + +A \agterm{terminal token} is a token which does not appear on the left +side of a production. It represents, therefore, a basic unit of input +to your parser. You have several options with respect to terminal +tokens. If the input to your parser consists of ASCII characters, you +may define terminal tokens explicitly as ASCII characters or as sets +of ASCII characters. If you have an input procedure which produces +numeric codes, you may define the terminal tokens directly in terms of +these numeric codes. On the other hand, you may leave the terminal +tokens completely undefined. In this case, you must provide an input +procedure which can determine the appropriate +\index{Token}\index{Token number}\index{Number}token numbers. +It is an all-or-none situation. If you provide any explicit +definitions, you must provide them for all terminal tokens. Input +procedures and token input are discussed in Chapter 9. Examples of +non-character input may be found in the Macro Preprocessor example in +the \agfile{examples/mpp} directory in your AnaGram +distribution.% Further examples are given in Chapter ???. + +\subsection{Character Representations} +\index{Character representations} + +In specifying admissible input characters you may use \index{Character +constants}character constants following the normal C conventions. 
+Remember that a character constant may specify only a single +character. Although some C compilers will allow constructs such as +\agcode{'mv'}, AnaGram doesn't allow this. AnaGram recognizes the +same escape sequences as C, including octal and hex sequences, even +though this is, strictly speaking, unnecessary. The escape sequences +AnaGram recognizes are: + +% +% It would be nice to be able to just write this and tell latex to set +% it in three columns. but no... that would be too easy. +% +% +%\begin{tabular}{ll} +%\agcode{{\bs}a}&alert (bell) character\\ +%\agcode{{\bs}b}&backspace\\ +%\agcode{{\bs}f}&formfeed\\ +%\agcode{{\bs}n}&newline\\ +%\agcode{{\bs}r}&carriage return\\ +%\agcode{{\bs}t}&horizontal tab\\ +%\agcode{{\bs}v}&vertical tab\\ +%\agcode{{\bs\bs}}&backslash\\ +%\agcode{{\bs}?}&question mark\\ +%\agcode{{\bs}'}&single quote\\ +%\agcode{{\bs}"}&double quote\\ +%\agcode{{\bs}ooo}&octal number\\ +%\agcode{{\bs}xhh}&hexadecimal number\\ +%\end{tabular} + +\begin{indenting}{0.4in} +\begin{tabular}{llllll} +\agcode{{\bs}a}&alert (bell) character& +\agcode{{\bs}t}&horizontal tab& +\agcode{{\bs}'}&single quote\\ +\agcode{{\bs}b}&backspace& +\agcode{{\bs}v}&vertical tab& +\agcode{{\bs}"}&double quote\\ +\agcode{{\bs}f}&formfeed& +\agcode{{\bs\bs}}&backslash& +\agcode{{\bs}\textit{ooo}}&octal number\\ +\agcode{{\bs}n}&newline& +\agcode{{\bs}?}&question mark& +\agcode{{\bs}x\textit{hh}}&hexadecimal number\\ +\agcode{{\bs}r}&carriage return\\ +\end{tabular} +\end{indenting} +\bigskip + +The octal escape sequence allows up to three octal digits, in +accordance with ANSI specifications for C. The hexadecimal numbers +may contain an arbitrary number of digits; however AnaGram will +truncate the result to sixteen bits. + +A backslash followed by any character other than those listed above +will cause a syntax error. + +You may also represent characters by writing the numeric code +explicitly, in decimal, octal, or hexadecimal representations. 
AnaGram follows the C conventions for integer constants: a leading +\agcode{0} means the number is octal, a leading \agcode{0x} or +\agcode{0X} means it is hexadecimal. The hex digits \agcode{a-f} may +be either upper or lower case\index{Case sensitivity}. Numbers may be +preceded by an optional minus sign. + +If your parser uses a pre-existing \index{Lexical scanner}lexical +scanner and you wish to use the code numbers it generates to identify +tokens, you may simply treat those code numbers as character numbers. +You may use the numbers directly in your productions, or you may use +definition statements to name them. You may also use an +\agparam{enum} statement within a configuration section to attach +names to the code numbers. +% XXX shouldn't this use of enum be indexed? + +AnaGram also allows a special notation for control characters. You +may represent a control character by writing the ``\^{}'' character +before any printing ASCII character. Thus you can write +\agcode{\^{}z} or \agcode{\^{}Z} to represent the DOS end-of-file +character. Notice that quotation marks are not necessary. + +Examples of character representations: + +\begin{indenting}{0.4in} +\begin{tabular}{cccc} +\agcode{'K'}&\agcode{-1}&\agcode{0}&\agcode{'{\bs}t'}\\ +\agcode{\^{}J}&\agcode{'{\bs}xff'}&\agcode{077}&\agcode{0XF3}\\ +\end{tabular} +\end{indenting} + +\subsection{Character Ranges} +\index{Character range}\index{Range} + +It is convenient to be able to specify ranges of characters when +writing a grammar. AnaGram supports several ways of representing +ranges of characters. The first is an extension of the notation for +character constants: \agcode{'a-z'} is the set of lower case +characters. You can even use escape sequences such as +\agcode{'{\bs}n-{\bs}r'} if you like. The order of +characters used to specify the range is immaterial: \agcode{'z-a'} is +the same as \agcode{'a-z'}. 
AnaGram will, however, issue a warning +just in case the unusual order results from a clerical error. + +The second way to specify a range is by using two arbitrary character +representations, as described above, separated by two dots. For +example, \agcode{\^{}C..\^{}Z}, \agcode{3..26}, \agcode{3..032}, +\agcode{3..0x1a}, and \agcode{\^{}C..0x1a}, all represent the same +range of characters. Similarly, \agcode{'A-F'}, \agcode{'A'..'F'}, +\agcode{0101..0106}, \agcode{0x41..0x46}, \agcode{65..70}, and +\agcode{65..'F'} all represent the same range of characters. + +\subsection{Character Sets} +\index{Character sets} + +If you provide explicit definitions for terminal tokens, the basic +input unit for your parser will be considered a character set, even if +your input procedure provides numeric codes that are not actually +characters. As a terminal token, a character set will be matched by +any input character that is a member of the set. Character sets may +be named in definition statements, but they may also appear on the +right sides of productions without being named. + +A character set may consist of one or more characters. You can +specify a character set that consists of a single character by using +any of the character representation methods described above. You can +specify a set consisting of a range of characters by using any of the +representations of character ranges described above. +\index{Character sets} +To specify more complicated sets, you can write +\index{Expressions}\index{Set expressions}expressions +using conventional set theoretic operations. +In AnaGram input, these operations are specified as follows: + +\index{Union}\index{Difference}\index{Intersection}\index{Complement} +\begin{indenting}{0.4in} +\begin{tabular}{cl} +\agcode{A + B}&(union)\\ +\agcode{A - B}&(difference)\\ +\agcode{A \& B}&(intersection)\\ +\agcode{\~{}A}&(complement)\\ +\end{tabular} +\end{indenting} + +where \agcode{A} and \agcode{B} are arbitrary sets. 
Union and +difference have the same precedence. Intersection has higher +precedence and complement has the highest precedence. Thus in the +expression + +\begin{indentingcode}{0.4in} +A + \~{}B\&C +\end{indentingcode} + +the complement operation is performed first, then the intersection, +and finally the union. + +Watch out! In an AnaGram syntax file \agcode{65 + 97} represents the +character set which consists of lower case \agcode{a} and upper case +\agcode{A}. It does not represent 162, the sum of 65 and 97. + +Parentheses may be used to force the order of evaluation: + +\begin{indentingcode}{0.4in} +\~{}(A \& (B+C)) +\end{indentingcode} + +In this example the union of \agcode{B} and \agcode{C} is calculated, +then the intersection of this set with \agcode{A} is calculated, and +finally the complement is evaluated. + +The computation of the \index{Complement}complement of a +\index{Character sets}set requires a definition of the +\index{Universe}universe of set elements. AnaGram will define the +universe to be the set of unsigned 8-bit characters, unless one or +more characters outside that range have been specified. In that case, +the universe will consist of all characters on the interval defined by +the lesser of zero and the lowest character code used and the greater +of 255 and the highest character code used. The complement of a +character set is everything in this universe except the characters in +the set. + +Characters which make up part of the character universe, but are not +legitimate input according to your grammar, are lumped together into a +special token which will cause an error if it occurs in your input. + +When your parser reads an input character, it uses that character to +index a conversion table in order to determine the appropriate +\index{Token number}\index{Token}\index{Number}token number. 
If the +\index{Range}\index{Test range}\index{Configuration switches} +\agparam{test range} configuration switch +is on, its default setting, your parser will include code to verify +that the character is in bounds before it indexes the conversion +table. If you are satisfied that checking bounds is unnecessary, you +may turn the \agparam{test range} switch off and get a slightly higher +level of performance from your parser. + +For efficient processing, it is well to keep the number of tokens to a +minimum. Therefore if you have a choice between defining a construct +as a token, with a production, or a set, with a definition, the set is +to be preferred. + +Some useful character sets are: + +\begin{indenting}{0.4in} +\begin{tabular}{ll} +\agcode{'a-z' + 'A-Z'}&Alphabetic characters\\ +\agcode{'0-9' + 'a-f' + 'A-F'}&Hex digits\\ +\agcode{'0-9'}&Decimal digits\\ +\agcode{0..127}&ASCII character set\\ +\agcode{32..126}&Printing ASCII characters\\ +\agcode{\~{}'{\bs}n'}&Anything but newline\\ +\agcode{\^{}Z}&Windows/DOS end of file indicator\\ +\agcode{-1}&Stream I/O end of file indicator\\ +\agcode{0}&String terminator\\ +\agcode{32..126 - 'a-z' - 'A-Z' - '0-9' - ' '}&Punctuation\\ +\end{tabular} +\end{indenting} +\bigskip + +Note that \agcode{'a-z'} is a range of characters but +\agcode{32..126 - 'a-z'} is a set difference. + +When AnaGram encounters a character set in a grammar rule, it assigns +a token number to the character set. If it has previously seen the +same character set it will assign the same token number; however, it +assigns the same token number only if the set expressions are +obviously the same. Thus, AnaGram will assign the same token number +every time it sees \agcode{A + B}, but will assign a different token +number if it sees \agcode{B + A}. Only when AnaGram has finished +scanning the entire syntax file can it actually evaluate the character +sets. 
If it finds that several different tokens all refer to the same +character set, it will create a single token that represents the true +character set and create +\index{Shell productions}\index{Production}``shell productions'' for +the others. + +\index{Character sets}If the character sets you use in your grammar +overlap, they do not properly represent +\index{Terminal token}\index{Token}terminal tokens. +To deal with this situation, AnaGram identifies all overlaps among +character sets and extends your grammar by adding a number of extra +productions. For instance, suppose your grammar uses the following +character sets as though they were terminal tokens: + +\begin{indentingcode}{0.4in} +'a-z' + 'A-Z' +'0-9' +'0-7' +'a-f' + 'A-F' +\end{indentingcode} + +AnaGram will then modify your grammar by adding the following productions: + +\begin{indentingcode}{0.4in} +'a-z' + 'A-Z' + -> 'a-f' + 'A-F' | 'g-z' + 'G-Z' + +'0-9' + -> '0-7' | '8-9' +\end{indentingcode} + +Although the tokens \agcode{'a-z' + 'A-Z'} and \agcode{'0-9'} are +technically now +\index{Nonterminal token}\index{Token}nonterminal tokens, +for purposes of determining the +\index{Token}\index{Data type}data type of their +\index{Semantic value}\index{Token}\index{Value}semantic values, +AnaGram continues to regard them as terminal tokens. + +This \index{Partition}\index{Universe}\index{Character universe} +``partitioning'' of the character universe is described in Chapter 6. + +\subsection{Keyword Strings} +\index{Keywords} + +In your grammar, AnaGram recognizes character strings within double +quotes (e.g., \agcode{"IF"}) as keywords. The strings follow the same +syntactic rules as strings in C. The same escape sequences are +honored. AnaGram does not, however, allow for the concatenation of +adjacent strings. Note that AnaGram strings are used only for the +definition of keywords in your grammar, not for messages to be +displayed or printed. 
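Looking back at the partitioning of overlapping character sets described at the end of the Character Sets subsection, the idea can be illustrated with a small C sketch. Each character code gets a bit mask recording which of the user's sets contain it; codes with the same nonzero mask form one disjoint cell, and each cell can then serve as a true terminal token. The names \agcode{add{\us}range} and \agcode{partition}, the fixed 256-code universe, and the mask technique itself are assumptions chosen for illustration; the manual does not say how AnaGram computes its partition (see Chapter 6).

```c
#include <assert.h>

#define UNIVERSE 256   /* assumed universe: unsigned 8-bit character codes */

/* Hypothetical helper: mark every code in lo..hi (inclusive) as a member. */
void add_range(unsigned char set[UNIVERSE], int lo, int hi)
{
    assert(0 <= lo && lo <= hi && hi < UNIVERSE);
    for (int c = lo; c <= hi; c++)
        set[c] = 1;
}

/*
 * Illustrative sketch only, not AnaGram's algorithm.
 * For each character code c, store in sig[c] a bit mask with bit s set
 * when sets[s] contains c. Codes sharing the same nonzero mask belong
 * to the same disjoint cell. Returns the number of distinct cells.
 * At most 32 sets are supported, one bit of an unsigned long each.
 */
int partition(unsigned char sets[][UNIVERSE], int nsets,
              unsigned long sig[UNIVERSE])
{
    int cells = 0;

    assert(nsets >= 0 && nsets <= 32);
    for (int c = 0; c < UNIVERSE; c++) {
        sig[c] = 0;
        for (int s = 0; s < nsets; s++)
            if (sets[s][c])
                sig[c] |= 1UL << s;
    }
    /* Count each nonzero mask once, at its first occurrence. */
    for (int c = 0; c < UNIVERSE; c++) {
        int seen = 0;
        if (sig[c] == 0)
            continue;               /* code is in none of the sets */
        for (int p = 0; p < c; p++) {
            if (sig[p] == sig[c]) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            cells++;
    }
    return cells;
}
```

Applied to the four sets of the example above (letters, \agcode{'0-9'}, \agcode{'0-7'}, and the hex letters), this yields four cells corresponding to \agcode{'a-f' + 'A-F'}, \agcode{'g-z' + 'G-Z'}, \agcode{'0-7'}, and \agcode{'8-9'}, which are the sets appearing on the right sides of the added productions.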
+ +Keyword strings may not include null characters and must be at least +one character long. You may have any number of keywords. Each is +treated as a single terminal token. A keyword may be given a name by +means of a definition statement. Keywords may appear in virtual +productions. + +AnaGram's keyword recognition works in the following way. First, for +each state in your parser, AnaGram prepares a list of all the keywords +that are admissible in that state. Your parser will recognize a +keyword \emph{only} if it is in an appropriate state; otherwise it +will appear to be an anonymous sequence of characters. Your parser, +in any state, checks for keywords it expects before it checks for +acceptable characters. That is, \emph{keywords take precedence} over +simple characters. It does not look for keywords that would not be +acceptable input. The parser will do whatever lookahead is necessary +in order to pick up the entire keyword. Thus if the character +\agcode{I} and the keyword \agcode{IF} are both legitimate input at +some point, \agcode{IF} will be recognized, if present, in preference +to \agcode{I}. If several admissible keywords match the input, such +as \agcode{IF} and \agcode{IFF}, the parser will select the longest +match, \agcode{IFF} in this example. + +AnaGram does not incorporate keywords into its character sets. +Keywords stand apart and should not appear in definitions of character +sets. In particular, they are not considered as belonging to the +complement of a character set. Thus for the production + +\begin{indentingcode}{0.4in} +next char -> \~{}('{\bs}n' + \^{}Z) +\end{indentingcode} +a keyword would not be considered legitimate input. + +Note also that a keyword consisting of a single character does not +belong to the character universe. Because of this fact, AnaGram's +treatment of \agcode{'X'} and \agcode{"X"} is very different. 
If this +seems confusing at first, try using only keywords which are at least +two characters long until you have some experience with them. + +AnaGram's keyword recognition logic normally does not make any +assumptions about what precedes or follows a keyword. Thus if +\agcode{int} is a keyword, your parser will be capable of plucking it +out of a string of characters such as \agcode{disintegrate} if, +according to your grammar, it could follow \agcode{dis}. The +\agparam{sticky} declaration and the \agparam{distinguish keywords} +statement, described below, can prevent such unwanted recognition of +keywords. A keyword following a \agparam{sticky} token will not be +recognized if the first character of the keyword can be shifted in as +part of the \agparam{sticky} token. The \agparam{distinguish +keywords} statement prevents recognition of a keyword if it is +followed immediately by a character of the sort that makes up the +keyword. + +\subsection{Type Specifications For Tokens} +\index{Token}\index{Token type}\index{Type declarations} + +When you write productions or token declarations (see below), AnaGram +allows you to specify the data type\index{Token}\index{Data type} of +the \index{Semantic value}\index{Token}\index{Value}semantic value of +a token by using a C or C++ data type specification. The restrictions +are that AnaGram does not allow specification of array or function +types, nor explicit structure types. Types that are defined with +typedef statements, structure definitions, or class definitions, +including template classes, in your embedded C or C++ are acceptable. 
+Thus the following specifications, for example, are acceptable: + +\begin{indentingcode}{0.4in} +void +int +char * +unsigned long *near +static float *far +my{\us}type +double * +struct descriptor +struct widget * +vector <double> * +\end{indentingcode} + +On the other hand, the following specifications are \emph{not} valid: + +\begin{indentingcode}{0.4in} +int[20] +int *(int, unsigned char) +\bra int x,y; float z; \ket +struct \bra int k; float z; \ket +\end{indentingcode} + +Note that AnaGram itself does nothing with the type specifications. It +simply passes them on to your compiler as appropriate. + +\subsection{Productions} +\index{Production} + +Productions are the basic units of a grammar. A production consists +of a left side and a right side. \index{Left side}The left side of a +production consists of one or more token names, joined by commas, +optionally preceded by a type specification enclosed in parentheses. +\index{Right side}The right side begins with an arrow and may either +begin on the same line as the left side or on a new line. For +example: + +\begin{indentingcode}{0.4in} +program -> statement list, eof +expression + -> expression, plus, term + +(int) variable name, function name + -> name:n = look{\us}up(n); +\end{indentingcode} + +The part of the right side of a production following the arrow is +called a \index{Grammar rule}\index{Rule}\agterm{grammar rule}, +discussed below. A production need not have a right side at all. In +this case, it is simply called a +\index{Declaration}\index{Token}\agterm{token declaration}. +AnaGram assigns +\index{Token number}\index{Token}\index{Number}token numbers +to the token names on the left side, and, if there is a type +specification, records the data type for each of the tokens declared. +Declarations of this sort are most useful when using input from a +\index{Lexical scanner}lexical scanner. See Chapter 9 for a discussion +of techniques for interfacing a lexical scanner to your parser. 
If +you do not intend to use a lexical scanner you will have no need for +token declarations. + +If you do not explicitly specify the type for the +\index{Semantic value}\index{Token}\index{Value}semantic value +of a token, it will be determined by the configuration parameter +\index{Default token type}\index{Configuration parameters}\index{Token} +\agparam{default token type} +if it is a \index{Nonterminal token}\index{Token}nonterminal token or +by the \index{Configuration parameters}configuration parameter +\index{Input token type}\index{Default input type}\agparam{default input type} +if it is a \index{Token}terminal token. +\agparam{Default token type} defaults to \agcode{void}. +\agparam{Default input type} defaults to \agcode{int}. + +If a production has more than one token on the left side, as in the +third example above, it is called a +\index{Semantically determined production}\index{Production} +\agterm{semantically determined production}. Semantically determined +productions are a useful tool for exerting semantic control over +syntactic analysis. A semantically determined production should have +a reduction procedure which determines on a case by case basis which +of the tokens on the left side should be taken as the reduction token. +If there is no reduction procedure, or if the reduction procedure does +not make a choice, the reduction token will be the first syntactically +correct token on the left side of the production. In the example +above, \agcode{variable name} will be the reduction token unless +\agcode{look{\us}up} changes it to \agcode{function name}. Semantically +determined productions are discussed more fully in Chapter 9. + +If several productions have the same left side, it does not need to be +repeated. Subsequent right hand sides must each start on a new +line. 
For example: + +\begin{indentingcode}{0.4in} +integer + -> digit + -> integer, digit + +name + -> letter + -> name, letter + -> name, digit +\end{indentingcode} + +On the other hand, you do not have to group productions with the same +left side. You could write the above productions as follows, although +it would certainly not be good programming practice: + +\begin{indentingcode}{0.4in} +name -> name, digit +integer -> integer, digit +name -> name, letter +integer -> digit +name -> letter +\end{indentingcode} + +Nevertheless, there are a few occasions involving complex cross +recursions and semantically determined productions where it is not +possible to group productions neatly. + +The right side of a production can be empty. Such a production is +called a +\index{Null productions}\index{Production}\agterm{null production}. +Null productions are useful to denote an optional element in a +grammar, or a list that may be empty. For example: + +\begin{indentingcode}{0.4in} +optional widget + -> + -> widget + +optional qualifiers + -> + -> optional qualifiers, qualifier +\end{indentingcode} + +A second way to write multiple productions with the same left side +uses the \index{Vertical bar}\index{|}vertical bar character, ``$|$'', +to separate the grammar rules. The productions given above for +\agcode{name}, \agcode{optional widget}, and \agcode{optional +qualifiers} can also be written: + +\begin{indentingcode}{0.4in} +name -> letter | name, letter | name, digit +optional widget + -> | widget + +optional qualifiers + -> | optional qualifiers, qualifier +\end{indentingcode} + +Note that a null production cannot \emph{follow} a vertical bar. + +A token that has a null production is called a +\index{Zero length token}\index{Token}\agterm{zero length token}, +since it can be represented by an empty sequence of input characters, +that is to say, by nothing at all. 
Furthermore, even if a token +doesn't have any null productions, if it has at least one rule +consisting entirely of zero length tokens it is also a zero length +token. In the Token Table window, AnaGram notes which tokens are zero +length, because they can be a source of conflicts. + +\subsection{Grammar Token} + +Every grammar must have a single token which produces the entire +grammar. This token is variously called the +\index{Token}\index{Grammar token}\agterm{grammar token}, the +\index{Goal token}\agterm{goal token} or the +\index{Start token}\agterm{start token}. +AnaGram provides several methods you may use to specify which token in +your grammar is the grammar token. + +You may simply use the name \agcode{grammar} for the grammar token. +If you wish to use some other more descriptive name for your grammar +token, you may mark it with a following dollar sign when it appears on +the left side of a production. Alternatively, you may set the +\index{Grammar token}\index{Configuration parameters}\agparam{grammar token} +configuration parameter to specify the grammar token. Here are +examples of the methods: + +\begin{indentingcode}{0.4in} +grammar + -> [statement | newline]/... + +program \$ + -> [statement | newline]/... + +{}[ grammar token = program ] +program + -> [statement | newline]/... +\end{indentingcode} + +If you should use more than one of these techniques, AnaGram resolves +the issue in the following manner: A marked token or a configuration +parameter setting always takes precedence over simply naming a token +\agcode{grammar}. If you mark more than one token or set the +configuration parameter more than once, the last setting or mark wins. + +\subsection{Grammar Rules} +\index{Rule}\index{Grammar rule} + +The part of a production to the right of the arrow is more often +called a \agterm{grammar rule}, or simply \agterm{rule}. 
A grammar +rule is a sequence of \index{Rule elements}\agterm{rule elements}, +joined by commas, as in the examples of productions given above. Rule +elements are token names, character set expressions, virtual +productions, or immediate actions (see below). Each rule element may +be optionally followed by a parameter assignment. The entire rule may +be followed by an optional reduction procedure. A \index{Parameter +assignment}parameter assignment is a colon followed by a C variable +name. Here are some examples of rule elements with parameter +assignments: + +\begin{indentingcode}{0.4in} +'0-9':d +integer:n +expression:x +declaration:declaration{\us}descriptor +\end{indentingcode} + +The parameters you assign to tokens in your grammar rule become the +formal parameters for your \index{Reduction procedure}reduction +procedure. The data type\index{Data type}\index{Reduction procedure +arguments} of the parameter is determined by the data type for the +semantic value of the token to which it is assigned. If your grammar +rule has parameter assignments, but does not have a reduction +procedure, AnaGram will give you a warning in case the lack of a +reduction procedure is an oversight. If you don't need a reduction +procedure you may safely ignore the warning. On the other hand, +AnaGram has no way to determine whether you have failed to make +necessary parameter assignments. You won't find out until you compile +your parser, when your compiler will give you error messages for +undefined symbols. + +AnaGram assigns a unique rule number to each rule in your grammar. +Rules are numbered sequentially as they are encountered in the syntax +file. AnaGram constructs rule zero itself. Rule zero normally has a +single element, the grammar token, unless you have a +\agparam{disregard} statement in your grammar. In this case there +will be two elements. 
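To make the link between parameter assignments and formal parameters concrete, here is a minimal C sketch of what the reduction procedures for \agcode{integer -> digit:d} and \agcode{integer -> integer:n, digit:d} amount to. The names \agcode{reduce{\us}integer} and \agcode{parse{\us}integer} are hypothetical and chosen only for illustration; the code AnaGram actually generates will differ. The sketch assumes that both tokens carry semantic values of type \agcode{int}.

```c
#include <stddef.h>

/* Hypothetical stand-in for the reduction procedure attached to
 *     integer -> integer:n, digit:d   = 10*n + d-'0';
 * The parameter assignments n and d become the formal parameters. */
static int reduce_integer(int n, int d) {
    return 10*n + d - '0';
}

/* Hypothetical driver (not AnaGram output): applies one reduction per
 * digit, in the order a parser would reduce the left-recursive rule. */
int parse_integer(const char *text) {
    int value = 0;  /* stays 0 if text does not start with a digit */
    size_t i;
    for (i = 0; text[i] >= '0' && text[i] <= '9'; i++) {
        if (i == 0)
            value = text[i] - '0';                    /* integer -> digit:d = d-'0'; */
        else
            value = reduce_integer(value, text[i]);   /* integer -> integer:n, digit:d */
    }
    return value;
}
```

If the parameter assignment \agcode{:d} were left off the rule, the expression \agcode{10*n + d-'0'} would refer to an undeclared \agcode{d}, which is the compile-time error described above.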
+ +\subsection{Reduction Procedures} +\index{Reduction procedure} + +% XXX somewhere in here it ought to say something like +% ``in the parsing literature reduction procedures are often known as +% \agterm{semantic actions}.'' +% Note that R. says there's some subtle difference between the usual +% concept of semantic action and AG's concept of reduction procedure. +% I don't know what this difference is and I hope she can recall it. +% +% D. thinks this note ought to be at the end; R. wants it at the top. + +A \agterm{reduction procedure} is a piece of C code which optionally +follows a production. The code is executed when your parser +identifies the production in its input. There are two forms for +reduction procedures, a short form and a long form. The short form +consists of a single C expression. The long form consists of an +arbitrary block of C code. When AnaGram builds a parser, it inspects +the grammar rule to which the procedure is attached and identifies the +parameters for the procedure. It uses these parameters as the formal +parameters for the procedure. +If the +\index{Macros}\index{Allow macros}\index{Configuration switches} +\agparam{allow macros} +configuration switch has not been turned off, AnaGram codes the +reduction procedure as a macro definition whenever possible. +Otherwise AnaGram codes it as a function definition. AnaGram builds +the name for a reduction procedure by appending its internal procedure +number to the string \agcode{ag{\us}rp{\us}}. Thus reduction procedures are +numbered in the order in which they are encountered in the syntax +file. + +Both long and short form reduction procedures are preceded by an equal +sign which follows the production. The short form consists of a C or +C++ expression terminated by a semicolon. When the grammar rule is +reduced, the expression will be evaluated and its value will become +the value of the reduction token. The expression and the terminating +semicolon must be entirely on a single line. 
Note that, if you really +need to make the expression longer than will fit on one line, you can +embed a newline in a comment. Some examples of short form reduction +procedures are: + +% XXX is there anything we can do about the ugly underscores? +\begin{indentingcode}{0.4in} +=0; + +=1; + +=10*n + d-'0'; + += +special{\us}processor(first{\us}parameter, second{\us}parameter); + +=word{\us}count++; + +=widget(constant{\us}1*parameter{\us}1 + constant{\us}2*parameter{\us}2 /* +{} */ + constant{\us}3*parameter{\us}3); +\end{indentingcode} + +A long form reduction procedure consists of an arbitrary block of C or +C++ code, enclosed in braces (\bra \ket). AnaGram will code the reduction +procedure as a function. To return a value for the reduction token, +simply use the \agcode{return} statement. There are effectively no +restrictions on the content or length of a reduction procedure. Of +course, if there are unbalanced braces, unterminated comments or +unterminated string literals, AnaGram will not be able to determine +where the reduction procedure ends. AnaGram treats +\index{Comments}nested comments within a reduction procedure according +to the value of the \index{Nest comments}\index{Configuration +switches}\agparam{nest comments} configuration switch at the point +where it encounters the reduction procedure. + +From a practical point of view it is not usually good practice to have +a reduction procedure that is more than a few lines long since a long +procedure will hamper your overall view of your grammar. Long +reduction procedures should be written as separate named functions, +and should either be included in the embedded C portion of your syntax +file or should be included in a wholly separate module. 
Here is an +example of a long form reduction procedure: + +\begin{indentingcode}{0.4in} +=\bra + if (flag) \bra + total += x; + return identify(x); + \ket + else \bra + total = 0; + flag = 1; + return init{\us}table(x); + \ket + \ket +\end{indentingcode} + +If a rule does not have a reduction procedure, the semantic value of +the reduction token will be set to the \index{Semantic +value}\index{Token}\index{Value}semantic value of the first token in +the rule, unless the rule is a \index{Null productions}null +production. In the latter case, the value of the reduction token will +be set to zero. +% XXX and what if zero isn't a valid value for the type? a compiler +% error will occur. + +% XXX add something like +% +% Variables appearing in reduction procedures which do not have a +% parameter assignment in the corresponding grammar rule can be +% declared globally or (file)-statically in your embedded C, or +% alternatively could be added to the parser control block using +% the \agparam{extend pcb} statement (q.v. | See Section ....). +% (Reword this.) +% +% Should also discuss the sequencing of reduction procedure calls +% so that people understand what happens if you use such variables. +% +% also ``A reduction procedure can be used to terminate parsing for +% semantic reasons''. +% + +\subsection{Immediate Actions} +\index{Immediate action}\index{Action} + +An immediate action is a rule element that consists of executable C or +C++ code embedded within a grammar rule to be executed when it is +encountered. An immediate action is denoted by the use of an +exclamation point, \index{!}``!''. The content of an immediate action +may be written following the rules for either long form or short form +reduction procedures. As with any other rule element, it must be +separated from preceding and following rule elements by commas. 
In +the grammar for a simple desk calculator, one might write + +\begin{indentingcode}{0.4in} +transaction + -> !printf("\#");, expression:x = printf("\%d{\bs}n", x); +\end{indentingcode} + +% XXX s/apparent/visible/ +Notice that the only apparent difference between an immediate action +and a reduction procedure is that the immediate action is preceded by +``!'' instead of ``=''. The immediate action must be followed by a +comma to separate it from the following rule element. + +Immediate actions may also be used in definitions: + +\begin{indentingcode}{0.4in} +prompt = !printf("\#"); +\end{indentingcode} + +AnaGram implements an immediate action by creating a special token for +it. AnaGram then creates a single null production for the +token. Finally, the immediate action is implemented as the reduction +procedure for the null production. + +For example, you could implement \agcode{prompt} by writing a null production +with a reduction procedure: + +\begin{indentingcode}{0.4in} +prompt + -> = printf("\#"); +\end{indentingcode} + +This production would be equivalent to the definition above. + +There are two ways, however, in which immediate actions differ from +the equivalent null production. Immediate actions may access any +parameter assignments which precede them in the rule in which they +occur. On the other hand, there is no way to assign a data type to +the semantic value, if any, returned by the immediate action. +Therefore, the type is determined by your setting of the +\index{Default token type}\index{Configuration parameters} +\agparam{default token type} configuration parameter. + +\subsection{Virtual Productions} +\index{Virtual productions}\index{Production} + +Virtual productions are a convenient short form notation for common +grammatical constructs involving choice and repetition. The notation +represents an extension of notation commonly used in programming +manuals. 
A virtual production may be written in a grammar rule at any +place where you could write a token name, even within another virtual +production. Note that use of virtual productions is never +\emph{required}, since the equivalent productions can always be +written out explicitly instead. + +When AnaGram encounters a virtual production, it replaces the virtual +production with a new token and writes appropriate productions for the +new token. When you look at your syntax tables using AnaGram windows, +you will see the productions that AnaGram generates. AnaGram keeps a +record of virtual productions, so that generally if you use the same +virtual production a second time, you get the same set of tokens and +productions that were generated the first time it was used. This is +not the case if the virtual productions contain reduction procedures +or immediate actions, since AnaGram is not equipped to determine +whether two pieces of C code are equivalent. Thus, a virtual +production that contains a reduction procedure will be unique and will +not be reused. + +One disadvantage of virtual productions is that there is no way to +specify the data type of the \index{Semantic value}\index{Virtual +production}semantic value of a virtual production. Therefore, if you +have a reduction procedure within a virtual production, its return +value must be consistent with the type defined by the \index{Default +token type}\index{Configuration parameters}\agparam{default token type} +configuration parameter. + +The simplest virtual production is the \index{Token}\index{Optional +token}\agterm{optional token}. If \agcode{x} is an arbitrary token +name or set expression, you can indicate an optional \agcode{x} by +writing \index{?}\agcode{x?}. You may also indicate a repetition of +\agcode{x} by using the ellipsis with either \agcode{x} or \agcode{x?}. 
+\index{...}\index{Ellipsis}Thus \agcode{x...} represents +one or more instances of \agcode{x} and \index{?...}\agcode{x?...} +represents zero or more instances of \agcode{x}. For example: + +\begin{indentingcode}{0.4in} +'+'? +\end{indentingcode} + +can be used to represent an optional plus sign, that is, a choice +between a plus sign and nothing at all. Similarly, + +\begin{indentingcode}{0.4in} +'{\bs}n'?... +\end{indentingcode} + +represents an optional sequence of newline characters. + +\index{Brackets}\index{Braces}\index{\_opb\_clb}\index{[]} +The next category of virtual productions uses brackets or braces to +indicate a choice among a number of enclosed grammar rules separated +by vertical bars. A single rule may also be enclosed. Note that +\emph{rules}, with following reduction procedures, are allowed, not +simply tokens. + +Braces are used to indicate that one option must be chosen. Brackets +are used to indicate the choice is optional, i.e. may be omitted +altogether. The ellipsis following a set of options within brackets +or braces indicates the option may be repeated an indefinite number of +times. + +You can use braces to indicate a simple choice among a number of +options. A Cobol grammar offers the following choice of equivalent +keywords: + +\begin{indentingcode}{0.4in} +\bra "RECORD", "IS"? | "RECORDS", "ARE"? \ket +\end{indentingcode} + +\index{\_opb\_clb...}\index{ []...} +You may use the ellipsis with braces to indicate an arbitrary positive +number of repetitions of the choice: + +\begin{indentingcode}{0.4in} +{\bra}type specifier | storage class specifier{\ket}... +\end{indentingcode} + +This expression requires at least one type specifier or storage class +specifier, but will accept any number. + +\index{[]} +To make a choice optional, use brackets instead of braces. An +example, again drawn from a Cobol grammar, is: + +\begin{indentingcode}{0.4in} +{}["LIMIT", "IS"? | "LIMITS", "ARE"?] 
+\end{indentingcode} + +\index{[]...} +Ellipses may be used with brackets to indicate an arbitrary number of +choices that may be omitted altogether: + +\begin{indentingcode}{0.4in} +{}[argument, [',', argument]...] +\end{indentingcode} + +This expression describes an optional argument list with arguments +separated by commas. + +If you use a null production within braces, it must be the first option: + +\begin{indentingcode}{0.4in} +\bra | '+' | '-' \ket +\end{indentingcode} + +Normally, you would do this only if you wanted to attach a reduction +procedure to the null production. Note that if you include a null +production within braces, and add an ellipsis after the closing brace +for repetition, your grammar will be ambiguous. Just exactly how many +times does the null production occur? Use brackets instead, and omit +the null production. + +Null productions are not allowed with brackets, since they would be +intrinsically ambiguous. + +The options within braces or brackets may be grammar rules of any +length or complexity and may themselves contain virtual productions of +arbitrary complexity. Nevertheless, in practice, clarity suffers as +soon as the options get very complex. Virtual productions are most +important and useful when used in simple situations. In those +situations they will enhance the clarity of your grammar. + +Here is an example that is moderately complex, even though each rule +consists of a single token: + +\begin{indentingcode}{0.4in} +\bra{\bra}"on" | "true"\ket = 1; | {\bra}"off" | "false"\ket = 0; | integer\ket +\end{indentingcode} + +This example can be used to allow as input either an integer or, for +special cases, keywords. 
You could write this option out in the +following way: + +\begin{indentingcode}{0.4in} +p1 + -> p2 = 1; + -> p3 = 0; + -> integer + +p2 + -> "on" + -> "true" + +p3 + -> "off" + -> "false" +\end{indentingcode} + +The final category of virtual production provides a notation for +\index{Alternating sequence}\agterm{alternating sequences}. An +alternating sequence is a set of choices which may be repeated +arbitrarily subject to the side condition that no choice may follow +itself, in other words, that the choices must alternate. Alternating +sequences are written with either brackets or braces depending on +whether the sequence is optional or not, followed by +\index{/...}``\agcode{/...}''. Note that the choices themselves may +allow sequences. For example: + +\begin{indentingcode}{0.4in} +program + -> [statement | newline...]/..., eof +\end{indentingcode} + +represents a sequence of statements separated by one or more newlines. +Any two statements must be separated by one or more newline +characters, and newlines may also appear at the beginning and the end +of the program. + +Null productions are not allowed within alternating sequences, since +they are intrinsically ambiguous in all cases. + +\subsection{Definition Statements} +\index{Definitions}\index{Definition statement}\index{Statement} + +A definition statement is simply a shorthand way of naming a character +set, a \index{Virtual productions}\index{Production}virtual +production, a keyword string, or an immediate action. It can also be +used for providing an alternate name for a token. Definitions have the +form: + +\begin{indentingcode}{0.4in} +name = \codemeta{character set} +name = \codemeta{virtual production} +name = \codemeta{keyword} +name = \codemeta{immediate action} +name = \codemeta{token name} +\end{indentingcode} + +The name may be any name acceptable to AnaGram. The name can then be +used anywhere you might have used the expression on the right +side. 
\index{!}For example: + +\begin{indentingcode}{0.4in} +upper case letter = 'A-Z' +lower case letter = 'a-z' +letter = upper case letter + lower case letter +statement list = statement?... +while keyword = "WHILE" +prompt = !printf("Please enter name:"); +\end{indentingcode} + +It is important to recognize that a definition statement that names a +set does not define a token. A token is defined only when the set is +used in a grammar rule, and then only if the set is used directly, not +in combination with some other set. Furthermore, if you use a +character set directly in a grammar rule, and in some other rule you +use a name that refers to the same set of characters, you will get two +different tokens. For example, if you have defined \agcode{upper case +letter} as in the above example and use both \agcode{upper case +letter} and \agcode{'A-Z'} in grammar rules, AnaGram will assign +different \index{Token number}\index{Token}\index{Number}token numbers +to accommodate any differences in attributes you may assign to the +tokens. + +Renaming tokens is a convenient way to connect two independently +written portions of a grammar. +% See the C grammar in the EXAMPLES directory of your distribution +% disk for an example. + +\subsection{Embedded C} +\index{Embedded C} + +You may encapsulate C or C++ code in your syntax file by enclosing it +in braces (\bra \ket). Such pieces of code are copied to the parser file +untouched, in the order they are found in the syntax file. There may +be any number of such pieces of embedded C. The only restriction is +that they must not start on the same line as some other AnaGram +statement, and following AnaGram statements must also start on fresh +lines. + +Normally, the blocks of embedded C in your syntax file are copied to +the parser file \emph{following} a set of definitions and declarations +AnaGram needs for the code it generates. 
However, if the \emph{first} +statement in your \index{Syntax file}syntax file is a block of +embedded C, it will \emph{precede} AnaGram's definitions and +declarations. This block of embedded C is called the +\index{Prologue}\index{C prologue}``C prologue''. There are two main +reasons for this special treatment. First, you may want to have a +title and \index{Copyright notice}copyright notice in your parser. If +you include them in an initial block of embedded C they will be right +at the beginning of both your syntax file and your parser file. +Second, if some of your tokens have data types\index{Token}\index{Data +type} other than those predefined in C or C++, you may include the +definitions here, so they will be available to the code AnaGram +generates. + +AnaGram scans embedded C only insofar as is necessary to find the +closing right brace. Therefore any braces used within embedded C must +balance properly. AnaGram skips braces enclosed in character +constants and string literals, as well as braces enclosed in +comments. It also recognizes C++ style comments that begin with +\agcode{//}. \index{Comments}Treatment of nested versus non-nested comments +is controlled by the +\index{Nest comments}\index{Configuration switches}\agparam{nest comments} +configuration switch. AnaGram will use the setting of this +switch in effect at the beginning of the section of embedded C. + +AnaGram, of course, can be confused by unterminated strings, +unbalanced braces, and unterminated comments. The most likely +outcome, in such a situation, is that AnaGram will encounter the end +of file looking for the end of the embedded C. Should this happen, +AnaGram will identify the beginning of the piece of embedded C which +caused the problem. + +The code you include as embedded C, of course, has to coexist with the +code AnaGram generates. 
In order to keep the potential for conflicts +to a minimum, all variables and functions which AnaGram defines begin +either with the name of your parser or with the letters +\agcode{ag{\us}}. You should avoid variable names which begin with these +letters. + +Reduction procedures are copied to the \index{Parser +file}\index{File}parser file in the order in which they are defined +\emph{following} all of the embedded C. Thus your reduction +procedures may freely use variables and macros defined anywhere in +your embedded C. + +\subsection{Configuration Sections} +\index{Configuration section} + +A configuration section is a special section of your syntax file +enclosed in brackets. Within a configuration section you may set the +values of configuration parameters or switches, or you may use one or +more of several available attribute statements to specify special +treatment for certain tokens. There can be as many or as few +configuration sections in your syntax file as you wish. Each +configuration section must begin on a new line. Any AnaGram statement +which follows a configuration section must also begin on a new line. + +Within a configuration section, each parameter setting and each +attribute statement must begin on a new line. The rules for using +comments and continuation lines are the same as for the rest of +AnaGram. + +Configuration parameters control the way AnaGram interprets your +syntax file and the way it builds your parser. A full discussion of +the use of configuration parameters, including a complete discussion +of each parameter and its default value, is given in Appendix A. 
+ +\index{Attribute statements}\index{Statement} +Attribute statements comprise the +\index{Precedence declarations}precedence declarations \agparam{left}, +\agparam{right}, and \agparam{nonassoc}; the \agparam{sticky} +declaration; the \agparam{distinguish keywords} statement; the +\agparam{hidden} declaration; the \agparam{disregard} and +\agparam{lexeme} statements; the \agparam{enum} statement; the +\index{Reserve keywords}\agparam{reserve keywords} declaration; and +the \index{Rename macro}\agparam{rename macro} statement. + +The precedence declarations and the +\index{Sticky declaration}\index{Declaration}\agparam{sticky} +declaration may be used to resolve conflicts in your grammar. The +\agparam{distinguish keywords} statement may be used to control +keyword recognition. The +\index{Hidden declaration}\index{Declaration}\agparam{hidden} +declaration causes certain token names not to be used when your parser +produces +\index{Syntax error}\index{Errors}\index{Error messages}syntax error +messages. You may use the \agparam{disregard} and \agparam{lexeme} +statements to cause your parser to skip automatically over certain +tokens in its input. The \agparam{enum} statement is almost identical +to the enum statement in C. It can be used to assign names to input +codes in grammars which are taking input from a \index{Lexical +scanner}lexical scanner or another parser. The +\index{Reserve keywords}\agparam{reserve keywords} declaration allows +you to specify certain keywords as reserved words. The +\index{Rename macro}\agparam{rename macro} statement allows you to +override the names AnaGram uses for various macro definitions it +creates in the code it generates. + +Attribute statements are discussed below. Except for +\agparam{disregard} and \agparam{rename macro} statements, attribute +statements accept lists of operands enclosed in braces (\bra \ket) +and separated by commas. A dangling comma following the last item in +a list will be ignored. 
+ +\subsection{Setting Configuration Parameters} +\index{Configuration parameters}\index{Parameters} + +Each configuration parameter has a name that follows the AnaGram +conventions for symbol names, except that AnaGram ignores +case\index{Case sensitivity} when looking up configuration parameter +names. + +There are a number of varieties of configuration parameters. The +simplest, +\index{Configuration switches}\index{Switches}configuration switches, +simply turn some feature of AnaGram on or off. These parameters need +simply be stated to turn the feature on, or negated with the tilde +(\agcode{\~{}}) to turn the feature off: + +\begin{indentingcode}{0.4in} +nest comments +\end{indentingcode} + +causes AnaGram to allow nested comments, and + +\begin{indentingcode}{0.4in} +\~{}nest comments +\end{indentingcode} + +causes AnaGram to disallow nested comments. + +You may also set or reset configuration switches with explicit on or +off values: + +\begin{indentingcode}{0.4in} +nest comments = on +nest comments = off +\end{indentingcode} + +The remaining configuration parameters are assigned values using a +simple assignment statement. Depending on the parameter, the value it +takes may be the name of a token, a C variable name, a C or C++ data +type, a string constant or an integer. String constants are written +using the same rules as keyword strings, described above. + +\begin{indentingcode}{0.4in} +grammar token = program +parser name = widget +default token type = void * +header file name = "widget.h" +parser stack size = 50 +\end{indentingcode} + +A number of string-valued \index{Configuration +parameters}configuration parameters are used to determine file +names and variable names. In these parameters, the \index{\#}``\#'', +\index{\_dol}``\$'', and ``\index{ \_prc}\%'' characters +are used as wild cards. In file name specifications and the +specification of the name of your parser, ``\#'' will be replaced by +the name of your syntax file. 
In other function and variable names +AnaGram creates while building your parser, ``\$'' will be replaced by +the name of your parser. When building enumeration constants for the +names of the tokens in your grammar, ``\%'' will be replaced by the +name of the token. + +Note that when entering a Windows/DOS path name as a +value for a file name parameter you must quote any backslashes in the +path name. For example, + +\begin{indentingcode}{0.4in} +coverage file name = "f:{\bs\bs}sna{\bs\bs}foo.nrc" +\end{indentingcode} + +\subsection{Precedence Declarations} +\index{Precedence declarations} + +AnaGram allows you to resolve shift-reduce conflicts by assigning +precedence levels to operators. There are three precedence +declarations available, beginning with the keywords +\index{Left}\agparam{left}, \index{Right}\agparam{right}, and +\index{Nonassoc}\agparam{nonassoc} respectively. Each such +declaration consists of the appropriate keyword and a list of tokens +enclosed in braces (\bra \ket). All the tokens in the list have the same +precedence, higher than tokens in any previous declaration and lower +than in any subsequent declaration. If the keyword is \agparam{left}, +the tokens will group to the left. If it is \agparam{right}, they +will group to the right. If it is \agparam{nonassoc} (for +non-associative) no grouping will be assumed. Precedence declarations +must be included in a configuration section. Here are precedence +declarations appropriate to a simple desk calculator program: + +\begin{indentingcode}{0.4in} +{}[ + left \bra '+', '-' \ket + left \bra star, '/', '\%' \ket + right \bra unary minus \ket +] +unary minus = '-' +\end{indentingcode} + +Note that \agcode{unary minus} and \agcode{'-'} can have different +precedence. + +Precedence declarations are one of the few instances in AnaGram where +the \index{Statements}\index{Order of statements}order of statements +is significant. + +The use of precedence declarations is discussed in Chapter 9. 
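The difference between \agparam{left} and \agparam{right} grouping can be seen by evaluating the same sequence of operands both ways. The following C sketch is purely illustrative: it is not how AnaGram resolves conflicts internally, and the function names \agcode{subtract{\us}left} and \agcode{subtract{\us}right} are hypothetical. Both functions assume \agcode{count} is at least 1.

```c
/* Illustrative only: evaluates v[0] - v[1] - v[2] ... under the two
 * groupings that a precedence declaration can select. */

/* left grouping: ((v[0] - v[1]) - v[2]) - ... */
int subtract_left(const int *v, int count) {
    int result = v[0];
    int i;
    for (i = 1; i < count; i++)
        result = result - v[i];
    return result;
}

/* right grouping: v[0] - (v[1] - (v[2] - ...)) */
int subtract_right(const int *v, int count) {
    int result = v[count - 1];
    int i;
    for (i = count - 2; i >= 0; i--)
        result = v[i] - result;
    return result;
}
```

For the operands 8, 3, 2, left grouping computes (8-3)-2 and right grouping computes 8-(3-2). Declaring \agcode{'-'} with \agparam{left}, as in the desk calculator example, selects the conventional arithmetic reading.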
+ +\subsection{``Sticky'' Declarations} +\index{Sticky declaration}\index{Declaration} + +AnaGram provides another means for resolving shift-reduce conflicts. +You may characterize any token as ``sticky''. Then, in the case of a +\index{Shift-reduce conflict}\index{Conflicts}shift-reduce conflict +where a ``sticky'' token is the last token in the input buffer, the +conflict will be resolved by selecting the shift operation. +Intuitively, you may think of this as though the ``sticky'' token +adheres to and draws in any subsequent input that it can. ``Sticky'' +declarations are included in configuration sections. They begin with +the keyword \agcode{sticky} followed by a list of tokens, separated by +commas inside braces (\bra \ket). Suppose, for instance, you wished to +pick up a line of text, skipping any leading space or tab +characters. You might write the following syntax: + +\begin{indentingcode}{0.4in} +white space = ' ' + '{\bs}t' + +text char + -> \~{}'{\bs}n':c = do{\us}something(c); + +line + -> leading white space, text char?..., '{\bs}n' + +leading white space + -> + -> leading white space, white space +\end{indentingcode} + +Unfortunately, this syntax is ambiguous, since space and tab are +legitimate instances of both leading white space and text char. What +you really want to do is to skip white space until you find a +non-blank character and then you want to accept all characters to the +end of the line. There are two ways to address the problem. The +first is to define a special token for the first non-blank character +and, using it, to write an unambiguous grammar. This approach, while +laudable, is tedious and prolix. 
Instead, use \agparam{sticky} to +resolve the problem: + +\begin{indentingcode}{0.4in} +{}[ sticky \bra leading white space \ket ] +\end{indentingcode} + +Now when AnaGram analyzes your grammar and encounters the ambiguity, +it will understand that a blank or tab that could be treated either as +leading white space or as the first text character should be +treated as white space. Since \agcode{leading white space} is +``sticky'', any subsequent white space adheres to it. + +As with conflicts resolved with precedence levels, AnaGram lists all +conflicts that it resolves using \agcode{sticky} in the +\index{Resolved Conflicts}\index{Window}\agwindow{Resolved Conflicts +Table}, so you can verify that the conflicts have been correctly +resolved. + +An important use of sticky tokens is to inhibit the recognition of +following \index{Keywords}keywords. Following a sticky token, a +keyword, which, according to your grammar, would otherwise be +legitimate input, will not be recognized if a shift action is possible +for the first character of the keyword. For example, imagine that +\agcode{name} has been defined in the conventional way, and there +exists a production with name followed immediately by the keyword +\agcode{int}. Then if, in your input, the word \agcode{print} were to +occur, your grammar would parse it as a name, \agcode{pr}, followed by +the keyword \agcode{int}. If you make \agcode{name} sticky, however, +the first letter of \agcode{int} will be seen to be an acceptable +character for \agcode{name} and the keyword will not be +recognized. Your parser will then recognize the \agcode{name} as +\agcode{print}. + +\subsection{Distinguish Keywords Statement} +\index{Distinguish keywords}\index{Keywords} + +Distinguish keywords statements are occasionally needed to prevent +keyword recognition. You may, for example, wish to prevent the +recognition of the keyword \agcode{int} when it occurs embedded in a +word such as \agcode{interval}. 
Of course, you need to do this only +if the keyword and the other word are both legitimate input at +the same point in your grammar. + +A distinguish keywords statement can prevent recognition of a keyword +which is embedded in another word provided at least one character of +the other word follows the keyword. + +The distinguish keywords statement has the form: + +\begin{indentingcode}{0.4in} +distinguish keywords \bra \codemeta{list of character sets} \ket +\end{indentingcode} + +AnaGram compares all the characters in each keyword to the characters +included in each character set in turn. If it finds that all the +characters in a keyword are members of a particular set, it tells the +keyword recognition logic to try to match the keyword only against the +longest sequence of characters drawn from the specified set. In other +words, in order for a keyword to be recognized, the keyword +\emph{must} be followed by a character \emph{not} in the set. The set +associated with a keyword is the first one in the list which contains +all the characters found in the keyword. If you have more than one +\agparam{distinguish keywords} statement in your grammar, the lists +are tried in the order in which they appear in the grammar. + +The purpose of the \agparam{distinguish keywords} statement is to +enable your parser to distinguish a keyword from the same sequence of +characters embedded within another sequence. Thus suppose that +\agcode{int} is a keyword, and, according to your grammar, could +appear in the same place as the word \agcode{integral}. If you don't +want it to be recognized as a keyword in these circumstances, you +would write the following distinguish statement: + +\begin{indentingcode}{0.4in} +distinguish keywords \bra 'a-z'+'A-Z' \ket +\end{indentingcode} + +To also inhibit recognition of \agcode{int} within \agcode{print}, you +would combine the use of the distinguish keywords statement with the +\agparam{sticky} declaration. 
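The matching rule described above can be sketched in a few lines of C. This is only an illustration of the rule for the set \agcode{'a-z'+'A-Z'}, not AnaGram's keyword recognition logic; the names \agcode{in{\us}letter{\us}set} and \agcode{keyword{\us}recognized} are hypothetical, and treating the end of the input as ``not in the set'' is an assumption of the sketch.

```c
#include <string.h>

/* Hypothetical helper: nonzero if c belongs to the set 'a-z'+'A-Z'. */
static int in_letter_set(int c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Nonzero if keyword is recognized at the start of input: the input
 * must begin with the keyword, and the character that follows must not
 * belong to the distinguishing set. The terminating NUL (end of input)
 * counts as "not in the set" in this sketch. */
int keyword_recognized(const char *input, const char *keyword) {
    size_t length = strlen(keyword);
    if (strncmp(input, keyword, length) != 0)
        return 0;
    return !in_letter_set((unsigned char)input[length]);
}
```

Under this rule \agcode{int} is rejected at the start of \agcode{integral}, because the following \agcode{e} is in the set. It is still accepted at the third character of \agcode{print}, since nothing from the set follows it there, which is why the \agparam{sticky} declaration is needed as well.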
+ +\subsection{``Hidden'' Declarations} +\index{Hidden declaration}\index{Declaration} + +AnaGram provides an optional \index{Error diagnosis}error diagnosis +feature for your parser (see Chapter 9). The \agparam{hidden} +declaration allows you to identify tokens that you do not wish to be +used in making up \index{Diagnostic messages}diagnostic messages. +These are tokens whose names would mean nothing to your +users. The format of a ``hidden'' declaration is the same as that of +precedence and ``sticky'' declarations. Within a configuration +section, the keyword ``hidden'' is followed by a list of tokens. For +example: + +\begin{indentingcode}{0.4in} +{}[ hidden \bra comment head \ket ] +comment + -> comment head, "*/" + +comment head + -> "/*" + -> comment head, \~{}eof +\end{indentingcode} + +This is an AnaGram representation of ANSI standard C comments +(non-nested). In this example the token \agcode{comment head} exists +only for convenience in writing the grammar and has no particular +meaning to an end user. The user does, however, know what the word +\agcode{comment} refers to. The ``hidden'' attribute will cause AnaGram's +diagnostic builder, by backing up the stack until it finds a +non-hidden token, to use \agcode{comment} rather than +\agcode{comment head}. + +\subsection{Disregard Statement} + +The purpose of the +\index{Disregard statement}\index{Statement}\agparam{disregard} +statement is to skip over uninteresting \index{White space}white space +and comments in your input files. The disregard statement allows you +to specify a token that should be passed over in the input to your +parser. The statement takes the form: + +\begin{indentingcode}{0.4in} +disregard ws +\end{indentingcode} + +where \agcode{ws} is a token name or character set. Disregard +statements may be placed in any configuration section. + +You may have more than one disregard statement in your grammar. 
If +you do, AnaGram will create a shell production. For example, suppose +you write: + +\begin{indentingcode}{0.4in} +{}[ + disregard alpha + disregard beta +] +\end{indentingcode} + +AnaGram will proceed as though you had written: + +\begin{indentingcode}{0.4in} +gamma + -> alpha | beta +{}[ disregard gamma ] +\end{indentingcode} + +It frequently happens that you wish your parser to disregard blanks or +comments, except that white space within names, numbers, strings, and +other elementary constructs is subject to special rules and thus +should not be disregarded blindly. In this case, you can use the +\agparam{lexeme} statement to declare these constructs off limits +for the disregard statement. Within these constructs, the disregard +statement will be inoperative and the admissibility of white space +will be determined solely by the productions which define these +constructs. + +Outside those productions which define lexemes, you should not +generally use a token which is supposed to be disregarded. If you do, +your grammar will have conflicts, since the token could satisfy both +the explicit usage and the implicit rules set up by the disregard +statement. Such conflicts, however, are resolved automatically in +favor of your explicit use of the token. The conflicts will appear in +the \agwindow{Resolved Conflicts} window. +% XXX I'm not sure that's still true. + +In order to implement the disregard statement AnaGram will redefine +some tokens in your grammar. For example, \agcode{+} may be redefined +to consist of a simple plus sign followed by optional white space: + +\begin{indentingcode}{0.4in} +'+' + -> '+'\%, white space?... +\end{indentingcode} + +The percent sign is used to indicate the original, simple plus sign +without the optional white space attached. You will probably notice +the percent sign appearing in some windows and traces. In earlier +versions of AnaGram, the degree sign, ``\agcode{\degrees}'', was used rather +than ``\agcode{\%}''. 
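+
+To make the discussion above concrete, here is a minimal sketch of a
+disregard token that covers both blanks and comments. The token names
+\agcode{white space} and \agcode{blank} are illustrative assumptions,
+and \agcode{comment} is assumed to be defined elsewhere in the
+grammar, for instance as in the ``hidden'' declaration example:
+
+\begin{indentingcode}{0.4in}
+{}[ disregard white space ]
+blank = ' ' + '{\bs}t'
+
+white space
+ -> blank
+ -> comment
+\end{indentingcode}
+
+In a grammar of this kind, names, numbers, and strings would
+typically also be declared as lexemes, so that white space inside
+them is governed by their own productions and is not skipped.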
+ +\subsection{Lexeme Statement} + +The ``lexeme'' \index{Statement}\index{Lexeme statement}statement is +used to fine-tune the disregard statement. +The lexeme statement takes the form: + +\begin{indentingcode}{0.4in} +{}[ lexeme \bra \codemeta{nonterminal token list} \ket ] +\end{indentingcode} + +where \textit{nonterminal token list} is a list of nonterminal tokens +separated by commas. +Lexeme statements may be placed in any configuration section, and +there may be any number of them. + +When you specify that a token is to be disregarded, AnaGram rewrites +your grammar so that the token will be passed over whenever it occurs +at the beginning of a file or following a lexical unit, or +\agterm{lexeme}. If you have no \agparam{lexeme} statement, then the +lexemes in your grammar are just the terminal tokens. + +The \agparam{lexeme} statement allows you to specify that certain +nonterminal tokens are also to be treated as lexemes. This means that +the disregard token will be skipped following the lexeme, but not +between the characters that constitute the lexeme. + +Lexemes correspond to the tokens that a lexical scanner, if you were +using one, would commonly identify and pass to a parser as single +tokens. You don't usually wish to disregard white space within these +tokens. For example, in a grammar for a conventional programming +language where blank characters are to be disregarded, you might +include: + +\begin{indentingcode}{0.4in} +{}[ lexeme \bra string, character constant, name, number \ket ] +\end{indentingcode} + +since blank characters must not be overlooked within strings and +character constants and should not be permitted within names or +numbers. + +Normally, AnaGram considers the disregard token to be optional; +however there are circumstances where treating the disregard token as +optional would lead to conflicts: two successive names, or two +successive numbers, for example. 
In this case, you would like to +require that the lexemes be separated by instances of the disregard +token. To do this, simply set the +\index{Distinguish lexemes}\index{Configuration switches} +\agparam{distinguish lexemes} +configuration switch. +When this switch is set, AnaGram will ensure that disregard tokens +will be required in those situations where making them optional would +lead to conflicts. + +White space may be used explicitly within definitions of lexeme tokens +in your grammar if desired, without causing conflicts. Thus, if you +wish to allow embedded space in variable names, you might write: + +\begin{indentingcode}{0.4in} +{}[ + disregard space + lexeme \bra variable name \ket +] +space = ' ' + '{\bs}t' +letter = 'a-z' + 'A-Z' +digit = '0-9' + +variable name + -> letter + -> variable name, letter + digit + -> variable name, space..., letter + digit +\end{indentingcode} + +\subsection{Enum Statement} +\index{Enum statement}\index{Enumeration}\index{Token} + +The \agparam{enum} statement follows rules nearly identical to those +for C and C++. This makes it possible to copy an enum statement from +your syntax file to a program file written in either C or C++, without +any need for editing. The only differences are that AnaGram makes no +provision for blank lines within the enumeration list, nor does it +accept a type name. The \agparam{enum} statement is equivalent to a +corresponding set of definition statements. It is especially useful +when a parser is accepting token input from another program, a +\index{Lexical scanner}lexical scanner, for example. Using +the enum statement you may conveniently define all the identification +codes for the input tokens. + +Each entry in an enum statement may be either a name, or a name +followed by an ``='' sign and a character representation. If there is +a character representation the name is assigned the value of the +specified character. 
Otherwise it is assigned a value one more than +that assigned to the previous name. If the first name in the list is +not given an explicit value, it will be given the value zero. For +example: + +\begin{indentingcode}{0.4in} +{}[ + enum \bra + eof, a,b,c, + blank = '\ ', x, y + \ket +] +\end{indentingcode} + +is equivalent to the following definition statements: + +\begin{indentingcode}{0.4in} +eof = 0 +a = 1 +b = 2 +c = 3 +blank = '\ ' +x = 33 +y = 34 +\end{indentingcode} + +\subsection{Subgrammar Declarations} +\index{Subgrammar declaration}\index{Declaration} + +A \agparam{subgrammar} declaration can be a useful way to deal with +conflicts in certain situations. It tells AnaGram to treat the tokens +listed in the declaration as though each were a grammar token, +specifying a complete subgrammar in itself, and, in determining +shift and reduction actions, to ignore the usage of the tokens in the +larger grammar. + +In some cases it is perfectly reasonable to ignore usage. The most +common example occurs when building a lexical scanner for a language +such as C, as in the example in Section 7.4.4. In this case, you can +write a complete grammar for a C token with no difficulty. But if you +try to extend it to a sequence of tokens, you get scores of conflicts. +This situation arises because you specify that any C token can follow +another, when in actual practice, an identifier, for example, cannot +follow another identifier without some intervening space or +punctuation. + +It is theoretically possible, but in practice quite awkward, to write +a grammar for a sequence of tokens so that there are no conflicts. 
+The subgrammar declaration provides a way around this problem by +telling AnaGram that when it is looking for reducing tokens for any +rule produced directly or indirectly by a subgrammar token, it should +disregard the usage of the token and only consider usage internal to +the definition of the subgrammar token, as though the subgrammar token +were the start token of the grammar. + +The subgrammar declaration is made in a configuration section and +consists of the keyword \agcode{subgrammar} followed by a list of one +or more nonterminal token names, separated by commas and enclosed in +braces (\bra \ket). For example: + +\begin{indentingcode}{0.4in} +{}[ subgrammar \bra C token, word \ket ] +\end{indentingcode} + +Since the subgrammar statement changes the way AnaGram determines +reducing tokens, it should be used with caution. You should be sure +that the conflicts you are eliminating are really inconsequential. + +\subsection{Reserve Keywords Declaration} +\index{Reserve keywords}\index{Keywords}\index{Keyword anomalies} + +The \agparam{reserve keywords} declaration can be used to specify a +list of keywords that are reserved and cannot be used except as +explicitly specified in the grammar. This enables AnaGram to avoid +issuing meaningless keyword anomaly diagnostics (see \S 7.5). AnaGram +does not automatically presume that keywords are also reserved words, +since in many grammars there is no need to specify reserved words. + +The reserve keywords declaration is made in a configuration section +and consists of the words \agcode{reserve keywords} followed by a list +of one or more keyword strings, separated by commas and enclosed in +braces (\bra \ket). For example: + +\begin{indentingcode}{0.4in} +{}[ reserve keywords \bra "int", "char", "float", "double" \ket ] +\end{indentingcode} + +\subsection{Rename Macro Statement} +\index{Rename macro}\index{Macros} + +AnaGram uses a number of macros in its generated code. 
It is +possible, therefore, to run into naming collisions with other +components of your program. The \agparam{rename macro} statement +allows you to change the name AnaGram uses for a particular macro to +avoid these problems. For example, the Windows NT operating system +uses \agcode{CONTEXT} structures to perform various internal +operations. If you use the context tracking option (see \S 9.5.4) +your parser will have a macro called \agcode{CONTEXT}. To avoid the +name collision, add the following statement to any configuration +section in your grammar: + +\begin{indentingcode}{0.4in} +rename macro CONTEXT AG{\us}CONTEXT +\end{indentingcode} + +Then, simply use \agcode{AG{\us}CONTEXT} where you would otherwise have +used \agcode{CONTEXT}.