comparison doc/manual/sf.tex @ 0:13d2b8934445

Import AnaGram (near-)release tree into Mercurial.
author David A. Holland
date Sat, 22 Dec 2007 17:52:45 -0500
-1:000000000000 0:13d2b8934445
1 \chapter{Syntax Files}
2 \index{Syntax file}\index{File}
3
4 Input files to AnaGram are called \agterm{syntax files}. A syntax
5 file comprises a grammar and associated C or C++ code. The grammar
6 consists of a number of productions along with supporting information
7 such as configuration sections and definitions of character sets. The
8 associated code consists of reduction procedures (see \S 8.2.13) and
9 embedded C or C++ code (\S 8.2.17). This chapter explains the rules
10 for writing syntax files acceptable to AnaGram. The rules for
11 interfacing your parser to the balance of your program are given in
12 Chapter 9.
13
14
15 \section{Lexical Conventions}
16 \index{Lexical conventions}
17
18 \subsection{Statements}
19 \index{Statements}
20
21 For purposes of this manual, AnaGram statements are considered to be
22 productions, definition statements, configuration sections, and blocks
23 of embedded C or C++ code, all discussed individually below. Each
24 statement must begin on a new line. It is a good idea to separate
25 statements visually in your file by using blank lines freely.
26 There are generally no restrictions on the
27 \index{Statements}\index{Order of statements}order of statements
28 in a syntax file. Good programming practice, however, suggests that
29 definitions and configuration sections should precede the grammar
30 itself.
31
32 \subsection{Spaces and Tabs}
33 \index{Spaces}\index{Tabs}
34
35 AnaGram allows spaces and tabs to be used freely to improve the
36 readability of grammars. Spaces and tabs are ignored, except when
37 embedded in a token name, in a character set definition, or in a
38 keyword. Within a token name, any sequence of spaces and tabs counts
39 as a single space.
40
41 \subsection{Continuation Lines}
42 \index{Continuation lines}
43
44 AnaGram statements normally end with a newline character or the end of
45 file. If AnaGram encounters the end of a line and the statement it is
46 reading appears to be complete, it will not look for a continuation.
47 To continue a statement to another line, just make sure that what you
48 have on the first line is clearly incomplete. For example,
49
50 \begin{indentingcode}{0.4in}
51 prep phrase -> preposition, "the", noun
52 \end{indentingcode}
53
54 looks complete to AnaGram, whereas
55
56 \begin{indentingcode}{0.4in}
57 prep phrase -> preposition, "the", noun,
58 \end{indentingcode}
59
60 looks incomplete because of the dangling comma at the end.
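
As a sketch of the same idea, the production above can be split across
two lines by ending the first line with the dangling comma, so that
AnaGram should read the second line as a continuation:

\begin{indentingcode}{0.4in}
prep phrase -> preposition, "the",
               noun
\end{indentingcode}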
61
62 \subsection{Comments}
63 \index{Comments}
64
65 AnaGram accepts comments in accordance with the rules of C and C++,
66 that is, normal C comments bracketed with \agcode{/*} and \agcode{*/},
67 as well as comments which begin with \agcode{//} and continue to the
68 end of line. AnaGram also observes these conventions when skipping
69 over embedded C code.
70
71 Since the ANSI standard for C insists that normal C comments do not
72 nest, AnaGram, by default, disallows nested comments. You may,
73 however, set a configuration parameter,
74 \index{Nest comments}\index{Configuration switches}\index{Comments}
75 \agparam{nest comments},
76 to allow nested comments. See Appendix A. In any case, AnaGram will
77 use the same convention for embedded C as it uses for AnaGram proper.
78 You can change the convention in the middle of the file if necessary.
79
80 AnaGram treats each comment delimited with \agcode{/*} and \agcode{*/}
81 as though it were a single space. You can even put such comments in
82 the middle of token names if you should want to. A comment that
83 begins with \agcode{//} is treated as though the end of line occurred
84 at the \agcode{//}.
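
For instance, assuming the token \agcode{prep phrase} from the earlier
example, the following line should be read as a production for that
same token: the bracketed comment counts as a single space inside the
name, and everything from the \agcode{//} onward is ignored.

\begin{indentingcode}{0.4in}
prep /* note */ phrase -> preposition, "the", noun  // rest of line ignored
\end{indentingcode}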
85
86 \subsection{Blank Lines and Form Feeds}
87 \index{Blank lines}
88
89 Because blank lines and form feeds are visual separators, AnaGram will
90 not skip over either when looking for a continuation line. Therefore, blank lines
91 and form feeds can occur only between AnaGram statements, not in the
92 middle of a statement.
93
94 It is a good idea to separate groups of productions with a blank line
95 or two, lest an accidental dangling comma make AnaGram think the
96 beginning of the next production is a continuation of the present one.
97
98
99 \section{Elements of Grammars}
100
101 \subsection{Names}
102 \index{Name}\index{Token}
103
104 You may use names to represent tokens, character sets, keywords and
105 \index{Virtual productions}\index{Production}virtual productions.
106 Names follow the same general rules as for any programming language,
107 with the notable exception that they may have embedded white space.
108 Names are made up of letters, digits, or underscores. They may not
109 begin with a digit. Any sequence of embedded spaces, tabs or comments
110 counts as a single space. AnaGram distinguishes between upper and
111 lower case\index{Case sensitivity}, so that \agcode{Word} and
112 \agcode{word} are different names. There is no particular limit to the
113 length of a name. There are no reserved words as such, although
114 \agcode{grammar}, \agcode{eof}, and \agcode{error} will be treated as
115 reserved words unless you take special action by setting appropriate
116 configuration parameters. The names AnaGram uses for
117 \index{Configuration parameters}configuration parameters
118 follow the same rules as for other names, except that
119 \index{Case sensitivity}case
120 is ignored.
121
122 \subsection{Reserved Words}
123 \index{Reserved words}\index{Words}
124
125 % XXX shouldn't that be \index{Grammar token}?
126 AnaGram treats tokens with the names \index{Grammar}\agcode{grammar},
127 \index{Eof token}\index{Token}\agcode{eof}, and \index{Error
128 token}\index{Token}\agcode{error} in a special manner unless certain
129 measures are taken. Since you can override AnaGram's use of these
130 names, they are not reserved words in the true sense.
131
132 If your grammar has a token named \agcode{grammar}, AnaGram will take
133 that token to be the grammar token for your grammar unless you set the
134 \index{Token}\index{Grammar token}\index{Configuration parameters}
135 \agparam{grammar token}
136 configuration parameter or mark some other token as the grammar token
137 using ``\index{ \_dol}\$''.% See below ???.
138
139 If your grammar has a token named \agcode{error} and you take no
140 further steps, AnaGram will assume you wish to use error token
141 resynchronization in case of
142 \index{Syntax error}\index{Errors}syntax error. See Chapter 9.
143 If you wish to use some other token as an error token you
144 may select it using the
145 \index{Configuration parameters}\index{Token}\index{Error token}
146 \agparam{error token} configuration parameter.
147 If you wish to use \agcode{error} as a token name, but do not want
148 error token resynchronization, you may set the \agparam{error token}
149 configuration parameter to any name that is not used in your grammar.
150 You may then use \agcode{error} as a token name without causing
151 AnaGram to include error token resynchronization in your parser.
152
153 \index{Resynchronization}
154 If you select automatic resynchronization or error token
155 resynchronization (see Chapter 9), AnaGram will look for a token
156 called \agcode{eof} to use as an end of file indicator. You may
157 either name your end of file token \agcode{eof} or you may set the
158 \agparam{eof token} configuration parameter with the name of your end
159 of file token.
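
As an illustration, a configuration section along the following lines
would select tokens other than \agcode{error} and \agcode{eof}. The
token names \agcode{bad statement} and \agcode{end of input} are
hypothetical, and the multi-line layout of the section is an
assumption; Appendix A remains the authority on the exact form.

\begin{indentingcode}{0.4in}
{}[
  error token = bad statement
  eof token = end of input
]
\end{indentingcode}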
160
161 \subsection{Variable Names}
162 \index{Name}\index{C variable names}
163
164 With AnaGram you can associate C/C++ variable names with the
165 \index{Semantic value}\index{Token}\index{Value}semantic values of
166 tokens for use in your \index{Reduction procedure}reduction
167 procedures. Each name follows the corresponding token in the grammar
168 rule on the right of the production, separated from the token by a
169 colon. AnaGram allows variable names made up of letters, digits, and
170 underscores. They may not begin with a digit. Embedded spaces, tabs,
171 or comments are not allowed, of course. AnaGram imposes no
172 restriction on length, but uses your variable names just as you have
173 written them in the code it generates to call reduction procedures.
174 Remember that your compiler may have a limit on the length of variable
175 names. Also, AnaGram itself uses C variable names beginning with
176 \agcode{ag{\us}}. It is therefore wise to avoid using names of this form.
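
A minimal sketch, using hypothetical token names and assuming that
\agcode{digit} is a character set token whose semantic value is the
input character: \agcode{t} and \agcode{d} name the semantic values of
\agcode{total} and \agcode{digit}, and the reduction procedures use
them as ordinary C variables.

\begin{indentingcode}{0.4in}
(int) total
  -> digit:d            = d - '0';
  -> total:t, digit:d   = 10*t + d - '0';
\end{indentingcode}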
177
178 \subsection{Terminal Tokens}
179 \index{Terminal token}\index{Token}
180
181 A \agterm{terminal token} is a token which does not appear on the left
182 side of a production. It represents, therefore, a basic unit of input
183 to your parser. You have several options with respect to terminal
184 tokens. If the input to your parser consists of ASCII characters, you
185 may define terminal tokens explicitly as ASCII characters or as sets
186 of ASCII characters. If you have an input procedure which produces
187 numeric codes, you may define the terminal tokens directly in terms of
188 these numeric codes. On the other hand, you may leave the terminal
189 tokens completely undefined. In this case, you must provide an input
190 procedure which can determine the appropriate
191 \index{Token}\index{Token number}\index{Number}token numbers.
192 It is an all or none situation. If you provide any explicit
193 definitions, you must provide them for all terminal tokens. Input
194 procedures and token input are discussed in Chapter 9. Examples of
195 non-character input may be found in the Macro Preprocessor example in
196 the \agfile{examples/mpp} directory on your AnaGram distribution
197 disk.% Further examples are given in Chapter ???.
198 % XXX change ``on ...distribution disk'' to ``in ...distribution''.
199
200 \subsection{Character Representations}
201 \index{Character representations}
202
203 In specifying admissible input characters you may use \index{Character
204 constants}character constants following the normal C conventions.
205 Remember that a character constant may specify only a single
206 character. Although some C compilers will allow constructs such as
207 \agcode{'mv'}, AnaGram doesn't allow this. AnaGram recognizes the
208 same escape sequences as C, including octal and hex sequences, even
209 though this is, strictly speaking, unnecessary. The escape sequences
210 AnaGram recognizes are:
211
212 %
213 % It would be nice to be able to just write this and tell latex to set
214 % it in three columns. but no... that would be too easy.
215 %
216 %
217 %\begin{tabular}{ll}
218 %\agcode{{\bs}a}&alert (bell) character\\
219 %\agcode{{\bs}b}&backspace\\
220 %\agcode{{\bs}f}&formfeed\\
221 %\agcode{{\bs}n}&newline\\
222 %\agcode{{\bs}r}&carriage return\\
223 %\agcode{{\bs}t}&horizontal tab\\
224 %\agcode{{\bs}v}&vertical tab\\
225 %\agcode{{\bs\bs}}&backslash\\
226 %\agcode{{\bs}?}&question mark\\
227 %\agcode{{\bs}'}&single quote\\
228 %\agcode{{\bs}"}&double quote\\
229 %\agcode{{\bs}ooo}&octal number\\
230 %\agcode{{\bs}xhh}&hexadecimal number\\
231 %\end{tabular}
232
233 \begin{indenting}{0.4in}
234 \begin{tabular}{llllll}
235 \agcode{{\bs}a}&alert (bell) character&
236 \agcode{{\bs}t}&horizontal tab&
237 \agcode{{\bs}'}&single quote\\
238 \agcode{{\bs}b}&backspace&
239 \agcode{{\bs}v}&vertical tab&
240 \agcode{{\bs}"}&double quote\\
241 \agcode{{\bs}f}&formfeed&
242 \agcode{{\bs\bs}}&backslash&
243 \agcode{{\bs}\textit{ooo}}&octal number\\
244 \agcode{{\bs}n}&newline&
245 \agcode{{\bs}?}&question mark&
246 \agcode{{\bs}x\textit{hh}}&hexadecimal number\\
247 \agcode{{\bs}r}&carriage return\\
248 \end{tabular}
249 \end{indenting}
250 \bigskip
251
252 The octal escape sequence allows up to three octal digits, in
253 accordance with ANSI specifications for C. The hexadecimal numbers
254 may contain an arbitrary number of digits; however, AnaGram will
255 truncate the result to sixteen bits.
256
257 A backslash followed by any character other than those listed above
258 will cause a syntax error.
259
260 You may also represent characters by writing the numeric code
261 explicitly, in decimal, octal, or hexadecimal representations.
262 AnaGram follows the C conventions for integer constants: a leading
263 \agcode{0} means the number is octal, a leading \agcode{0x} or
264 \agcode{0X} means it is hexadecimal. The hex digits \agcode{a-f} may
265 be either upper or lower case\index{Case sensitivity}. Numbers may be
266 preceded by an optional minus sign.
267
268 If your parser uses a pre-existing \index{Lexical scanner}lexical
269 scanner and you wish to use the code numbers it generates to identify
270 tokens, you may simply treat those code numbers as character numbers.
271 You may use the numbers directly in your productions, or you may use
272 definition statements to name them. You may also use an
273 \agparam{enum} statement within a configuration section to attach
274 names to the code numbers.
275 % XXX shouldn't this use of enum be indexed?
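
For example, assuming a scanner that returns the hypothetical codes
257 and 258 for names and numbers, definition statements of roughly
this form would let the grammar refer to those codes by name. The
\agcode{name = value} form of the definition statement is assumed
here; definition statements are described in their own section.

\begin{indentingcode}{0.4in}
NAME = 257
NUMBER = 258

assignment -> NAME, '=', NUMBER
\end{indentingcode}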
276
277 AnaGram also allows a special notation for control characters. You
278 may represent a control character by using the ``\^{}'' character
279 preceding any printing ASCII character. Thus you can write
280 \agcode{\^{}z} or \agcode{\^{}Z} to represent the DOS end-of-file
281 character. Notice that quotation marks are not necessary.
282
283 Examples of character representations:
284
285 \begin{indenting}{0.4in}
286 \begin{tabular}{cccc}
287 \agcode{'K'}&\agcode{-1}&\agcode{0}&\agcode{'{\bs}t'}\\
288 \agcode{\^{}J}&\agcode{'{\bs}xff'}&\agcode{077}&\agcode{0XF3}\\
289 \end{tabular}
290 \end{indenting}
291
292 \subsection{Character Ranges}
293 \index{Character range}\index{Range}
294
295 It is convenient to be able to specify ranges of characters when
296 writing a grammar. AnaGram supports several ways of representing
297 ranges of characters. The first is an extension of the notation for
298 character constants: \agcode{'a-z'} is the set of lower case
299 characters. You can even use escape sequences such as
300 \agcode{'{\bs}n-{\bs}r'} if you like. The order of
301 characters used to specify the range is immaterial: \agcode{'z-a'} is
302 the same as \agcode{'a-z'}. AnaGram will, however, issue a warning
303 just in case the unusual order results from a clerical error.
304
305 The second way to specify a range is by using two arbitrary character
306 representations, as described above, separated by two dots. For
307 example, \agcode{\^{}C..\^{}Z}, \agcode{3..26}, \agcode{3..032},
308 \agcode{3..0x1a}, and \agcode{\^{}C..0x1a}, all represent the same
309 range of characters. Similarly, \agcode{'A-F'}, \agcode{'A'..'F'},
310 \agcode{0101..0106}, \agcode{0x41..0x46}, \agcode{65..70}, and
311 \agcode{65..'F'} all represent the same range of characters.
312
313 \subsection{Character Sets}
314 \index{Character sets}
315
316 If you provide explicit definitions for terminal tokens, the basic
317 input unit for your parser will be considered a character set, even if
318 your input procedure provides numeric codes that are not actually
319 characters. As a terminal token, a character set will be matched by
320 any input character that is a member of the set. Character sets may
321 be named in definition statements, but they may also appear on the
322 right sides of productions without being named.
323
324 A character set may consist of one or more characters. You can
325 specify a character set that consists of a single character by using
326 any of the character representation methods described above. You can
327 specify a set consisting of a range of characters by using any of the
328 representations of character ranges described above.
329 \index{Character sets}
330 To specify more complicated sets, you can write
331 \index{Expressions}\index{Set expressions}expressions
332 using conventional set theoretic operations.
333 In AnaGram input, these operations are specified as follows:
334
335 \index{Union}\index{Difference}\index{Intersection}\index{Complement}
336 \begin{indenting}{0.4in}
337 \begin{tabular}{cl}
338 \agcode{A + B}&(union)\\
339 \agcode{A - B}&(difference)\\
340 \agcode{A \& B}&(intersection)\\
341 \agcode{\~{}A}&(complement)\\
342 \end{tabular}
343 \end{indenting}
344
345 where \agcode{A} and \agcode{B} are arbitrary sets. Union and
346 difference have the same precedence. Intersection has higher
347 precedence and complement has the highest precedence. Thus in the
348 expression
349
350 \begin{indentingcode}{0.4in}
351 A + \~{}B\&C
352 \end{indentingcode}
353
354 the complement operation is performed first, then the intersection,
355 and finally the union.
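
In conventional set notation, writing $\overline{B}$ for the
complement of \agcode{B}, this grouping is:

\[
\mbox{\agcode{A + \~{}B\&C}} \;=\; A \cup \bigl(\overline{B} \cap C\bigr)
\]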
356
357 Watch out! In an AnaGram syntax file \agcode{65 + 97} represents the
358 character set which consists of lower case \agcode{a} and upper case
359 \agcode{A}. It does not represent 162, the sum of 65 and 97.
360
361 Parentheses may be used to force the order of evaluation:
362
363 \begin{indentingcode}{0.4in}
364 \~{}(A \& (B+C))
365 \end{indentingcode}
366
367 In this example the union of \agcode{B} and \agcode{C} is calculated,
368 then the intersection of this set with \agcode{A} is calculated, and
369 finally the complement is evaluated.
370
371 The computation of the \index{Complement}complement of a
372 \index{Character sets}set requires a definition of the
373 \index{Universe}universe of set elements. AnaGram will define the
374 universe to be the set of unsigned 8-bit characters, unless one or
375 more characters outside that range have been specified. In that case,
376 the universe will consist of all characters on the interval defined by
377 the lesser of zero and the lowest character code used and the greater
378 of 255 and the highest character code used. The complement of a
379 character set is everything in this universe except the characters in
380 the set.
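
As a worked case, write $c_{\min}$ and $c_{\max}$ for the lowest and
highest character codes that appear in the grammar (symbols introduced
here only for illustration). The universe is then

\[
U = \{\, c : \min(0, c_{\min}) \le c \le \max(255, c_{\max}) \,\}
\]

so a grammar whose only out-of-range code is \agcode{-1} has the
universe \agcode{-1..255}, and \agcode{\~{}'{\bs}n'} in that grammar
includes \agcode{-1} as well as every 8-bit character other than
newline.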
381
382 Characters which make up part of the character universe, but are not
383 legitimate input according to your grammar, are lumped together into a
384 special token which will cause an error if it occurs in your input.
385
386 When your parser reads an input character, it uses that character to
387 index a conversion table in order to determine the appropriate
388 \index{Token number}\index{Token}\index{Number}token number. If the
389 \index{Range}\index{Test range}\index{Configuration switches}
390 \agparam{test range} configuration switch
391 is on, its default setting, your parser will include code to verify
392 that the character is in bounds before it indexes the conversion
393 table. If you are satisfied that checking bounds is unnecessary, you
394 may turn the \agparam{test range} switch off and get a slightly higher
395 level of performance from your parser.
396
397 For efficient processing, it is well to keep the number of tokens to a
398 minimum. Therefore if you have a choice between defining a construct
399 as a token, with a production, or a set, with a definition, the set is
400 to be preferred.
401
402 Some useful character sets are:
403
404 \begin{indenting}{0.4in}
405 \begin{tabular}{ll}
406 \agcode{'a-z' + 'A-Z'}&Alphabetic characters\\
407 \agcode{'a-f' + 'A-F'}&Hex digits\\
408 \agcode{'0-9'}&Decimal digits\\
409 \agcode{0..127}&ASCII character set\\
410 \agcode{32..126}&Printing ASCII characters\\
411 \agcode{\~{}'{\bs}n'}&Anything but newline\\
412 \agcode{\^{}Z}&Windows/DOS end of file indicator\\
413 \agcode{-1}&Stream I/O end of file indicator\\
414 \agcode{0}&String terminator\\
415 \agcode{33..126 - 'a-z' - 'A-Z' - '0-9'}&Punctuation\\
416 \end{tabular}
417 \end{indenting}
418 \bigskip
420
421 Note that \agcode{'a-z'} is a range of characters but
422 \agcode{32..126 - 'a-z'} is a set difference.
423
424 When AnaGram encounters a character set in a grammar rule, it assigns
425 a token number to the character set. If it has previously seen the
426 same character set it will assign the same token number; however, it
427 assigns the same token number only if the set expressions are
428 obviously the same. Thus, AnaGram will assign the same token number
429 every time it sees \agcode{A + B}, but will assign a different token
430 number if it sees \agcode{B + A}. Only when AnaGram has finished
431 scanning the entire syntax file can it actually evaluate the character
432 sets. If it finds that several different tokens all refer to the same
433 character set, it will create a single token that represents the true
434 character set and create
435 \index{Shell productions}\index{Production}``shell productions'' for
436 the others.
437
438 \index{Character sets}If the character sets you use in your grammar
439 overlap, they do not properly represent
440 \index{Terminal token}\index{Token}terminal tokens.
441 To deal with this situation, AnaGram identifies all overlaps among
442 character sets and extends your grammar by adding a number of extra
443 productions. For instance, suppose your grammar uses the following
444 character sets as though they were terminal tokens:
445
446 \begin{indentingcode}{0.4in}
447 'a-z' + 'A-Z'
448 '0-9'
449 '0-7'
450 'a-f' + 'A-F'
451 \end{indentingcode}
452
453 AnaGram will then modify your grammar by adding the following productions:
454
455 \begin{indentingcode}{0.4in}
456 'a-z' + 'A-Z'
457 -> 'a-f' + 'A-F' | 'g-z' + 'G-Z'
458
459 '0-9'
460 -> '0-7' | '8-9'
461 \end{indentingcode}
462
463 Although the tokens \agcode{'a-z' + 'A-Z'} and \agcode{'0-9'} are
464 technically now
465 \index{Nonterminal token}\index{Token}nonterminal tokens,
466 for purposes of determining the
467 \index{Token}\index{Data type}data type of their
468 \index{Semantic value}\index{Token}\index{Value}semantic values,
469 AnaGram continues to regard them as terminal tokens.
470
471 This \index{Partition}\index{Universe}\index{Character universe}
472 ``partitioning'' of the character universe is described in Chapter 6.
473
474 \subsection{Keyword Strings}
475 \index{Keywords}
476
477 In your grammar, AnaGram recognizes character strings within double
478 quotes (e.g., \agcode{"IF"}) as keywords. The strings follow the same
479 syntactic rules as strings in C. The same escape sequences are
480 honored. AnaGram does not, however, allow for the concatenation of
481 adjacent strings. Note that AnaGram strings are used only for the
482 definition of keywords in your grammar, not for messages to be
483 displayed or printed.
484
485 Keyword strings may not include null characters and must be at least
486 one character long. You may have any number of keywords. Each is
487 treated as a single terminal token. A keyword may be given a name by
488 means of a definition statement. Keywords may appear in virtual
489 productions.
490
491 AnaGram's keyword recognition works in the following way. First, for
492 each state in your parser, AnaGram prepares a list of all the keywords
493 that are admissible in that state. Your parser will recognize a
494 keyword \emph{only} if it is in an appropriate state; otherwise it
495 will appear to be an anonymous sequence of characters. Your parser,
496 in any state, checks for keywords it expects before it checks for
497 acceptable characters. That is, \emph{keywords take precedence} over
498 simple characters. It does not look for keywords that would not be
499 acceptable input. The parser will do whatever lookahead is necessary
500 in order to pick up the entire keyword. Thus if the character
501 \agcode{I} and the keyword \agcode{IF} are both legitimate input at
502 some point, \agcode{IF} will be recognized, if present, in preference
503 to \agcode{I}. If several admissible keywords match the input, such
504 as \agcode{IF} and \agcode{IFF}, the parser will select the longest
505 match, \agcode{IFF} in this example.
506
507 AnaGram does not incorporate keywords into its character sets.
508 Keywords stand apart and should not appear in definitions of character
509 sets. In particular, they are not considered as belonging to the
510 complement of a character set. Thus for the production
511
512 \begin{indentingcode}{0.4in}
513 next char -> \~{}('{\bs}n' + \^{}Z)
514 \end{indentingcode}
515 a keyword would not be considered legitimate input.
516
517 Note also that a keyword consisting of a single character does not
518 belong to the character universe. Because of this fact, AnaGram's
519 treatment of \agcode{'X'} and \agcode{"X"} is very different. If this
520 seems confusing at first, try using only keywords which are at least
521 two characters long until you have some experience with them.
522
523 AnaGram's keyword recognition logic normally does not make any
524 assumptions about what precedes or follows a keyword. Thus if
525 \agcode{int} is a keyword, your parser will be capable of plucking it
526 out of a string of characters such as \agcode{disintegrate} if,
527 according to your grammar, it could follow \agcode{dis}. The
528 \agparam{sticky} declaration and the \agparam{distinguish keywords}
529 statement, described below, can prevent such unwanted recognition of
530 keywords. A keyword following a \agparam{sticky} token will not be
531 recognized if the first character of the keyword can be shifted in as
532 part of the \agparam{sticky} token. The \agparam{distinguish
533 keywords} statement prevents recognition of a keyword if it is
534 followed immediately by a character of the sort that makes up the
535 keyword.
536
537 \subsection{Type Specifications For Tokens}
538 \index{Token}\index{Token type}\index{Type declarations}
539
540 When you write productions or token declarations (see below), AnaGram
541 allows you to specify the data type\index{Token}\index{Data type} of
542 the \index{Semantic value}\index{Token}\index{Value}semantic value of
543 a token by using a C or C++ data type specification. The restrictions
544 are that AnaGram does not allow specification of array or function
545 types, nor explicit structure types. Types that are defined with
546 typedef statements, structure definitions, or class definitions,
547 including template classes, in your embedded C or C++ are acceptable.
548 Thus the following specifications, for example, are acceptable:
549
550 \begin{indentingcode}{0.4in}
551 void
552 int
553 char *
554 unsigned long *near
555 static float *far
556 my{\us}type
557 double *
558 struct descriptor
559 struct widget *
560 vector <double> *
561 \end{indentingcode}
562
563 On the other hand, the following specifications are \emph{not} valid:
564
565 \begin{indentingcode}{0.4in}
566 int[20]
567 int *(int, unsigned char)
568 \bra int x,y; float z; \ket
569 struct \bra int k; float z; \ket
570 \end{indentingcode}
571
572 Note that AnaGram itself does nothing with the type specifications. It
573 simply passes them on to your compiler as appropriate.
574
575 \subsection{Productions}
576 \index{Production}
577
578 Productions are the basic units of a grammar. A production consists
579 of a left side and a right side. \index{Left side}The left side of a
580 production consists of one or more token names, joined by commas,
581 optionally preceded by a type specification enclosed in parentheses.
582 \index{Right side}The right side begins with an arrow and may either
583 begin on the same line as the left side or on a new line. For
584 example:
585
586 \begin{indentingcode}{0.4in}
587 program -> statement list, eof
588 expression
589 -> expression, plus, term
590
591 (int) variable name, function name
592 -> name:n = look{\us}up(n);
593 \end{indentingcode}
594
595 The part of the right side of a production following the arrow is
596 called a \index{Grammar rule}\index{Rule}\agterm{grammar rule},
597 discussed below. A production need not have a right side at all. In
598 this case, it is simply called a
599 \index{Declaration}\index{Token}\agterm{token declaration}.
600 AnaGram assigns
601 \index{Token number}\index{Token}\index{Number}token numbers
602 to the token names on the left side, and, if there is a type
603 specification, records the data type for each of the tokens declared.
604 Declarations of this sort are most useful when using input from a
605 \index{Lexical scanner}lexical scanner. See Chapter 9 for a discussion
606 of techniques for interfacing a lexical scanner to your parser. If
607 you do not intend to use a lexical scanner you will have no need for
608 token declarations.
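
A sketch of such declarations, with hypothetical token names standing
for the codes a lexical scanner would supply:

\begin{indentingcode}{0.4in}
(int) NUMBER
(char *) IDENTIFIER, STRING
\end{indentingcode}

Neither line has an arrow or a grammar rule, so each one only assigns
token numbers and records the data type for the tokens it names.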
609
610 If you do not explicitly specify the type for the
611 \index{Semantic value}\index{Token}\index{Value}semantic value
612 of a token, it will be determined by the configuration parameter
613 \index{Default token type}\index{Configuration parameters}\index{Token}
614 \agparam{default token type}
615 if it is a \index{Nonterminal token}\index{Token}nonterminal token or
616 by the \index{Configuration parameters}configuration parameter
617 \index{Input token type}\index{Default input type}\agparam{default input type}
618 if it is a \index{Token}terminal token.
619 \agparam{Default token type} defaults to \agcode{void}.
620 \agparam{Default input type} defaults to \agcode{int}.
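
For example, a configuration section of roughly the following form
would make \agcode{double} the type of otherwise untyped nonterminal
tokens and \agcode{char} the type of untyped terminal tokens. The
choice of types is illustrative only, and the multi-line layout of the
section is an assumption.

\begin{indentingcode}{0.4in}
{}[
  default token type = double
  default input type = char
]
\end{indentingcode}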
621
622 If a production has more than one token on the left side, as in the
623 third example above, it is called a
624 \index{Semantically determined production}\index{Production}
625 \agterm{semantically determined production}. Semantically determined
626 productions are a useful tool for exerting semantic control over
627 syntactic analysis. A semantically determined production should have
628 a reduction procedure which determines on a case by case basis which
629 of the tokens on the left side should be taken as the reduction token.
630 If there is no reduction procedure, or if the reduction procedure does
631 not make a choice, the reduction token will be the first syntactically
632 correct token on the left side of the production. In the example
633 above, \agcode{variable name} will be the reduction token unless
634 \agcode{look{\us}up} changes it to \agcode{function name}. Semantically
635 determined productions are discussed more fully in Chapter 9.
636
637 If several productions have the same left side, it does not need to be
638 repeated. Subsequent right hand sides must each start on a new
639 line. For example:
640
641 \begin{indentingcode}{0.4in}
642 integer
643 -> digit
644 -> integer, digit
645
646 name
647 -> letter
648 -> name, letter
649 -> name, digit
650 \end{indentingcode}
651
652 On the other hand, you do not have to group productions with the same
653 left side. You could write the above productions as follows, although
654 it would certainly not be good programming practice:
655
656 \begin{indentingcode}{0.4in}
657 name -> name, digit
658 integer -> integer, digit
659 name -> name, letter
660 integer -> digit
661 name -> letter
662 \end{indentingcode}
663
664 Nevertheless, there are a few occasions involving complex cross
665 recursions and semantically determined productions where it is not
666 possible to group productions neatly.
667
668 The right side of a production can be empty. Such a production is
669 called a
670 \index{Null productions}\index{Production}\agterm{null production}.
671 Null productions are useful to denote an optional element in a
672 grammar, or a list that may be empty. For example:
673
674 \begin{indentingcode}{0.4in}
675 optional widget
676 ->
677 -> widget
678
679 optional qualifiers
680 ->
681 -> optional qualifiers, qualifier
682 \end{indentingcode}
683
684 A second way to write multiple productions with the same left side
685 uses the \index{Vertical bar}\index{|}vertical bar character, ``$|$'',
686 to separate the grammar rules. The productions given above for
687 \agcode{name}, \agcode{optional widget}, and \agcode{optional
688 qualifiers} can also be written:
689
690 \begin{indentingcode}{0.4in}
691 name -> letter | name, letter | name, digit
692 optional widget
693 -> | widget
694
695 optional qualifiers
696 -> | optional qualifiers, qualifier
697 \end{indentingcode}
698
699 Note that a null production cannot \emph{follow} a vertical bar.
700
701 A token that has a null production is called a
702 \index{Zero length token}\index{Token}\agterm{zero length token},
703 since it can be represented by an empty sequence of input characters,
704 that is to say, by nothing at all. Furthermore, even if a token
705 doesn't have any null productions, if it has at least one rule
706 consisting entirely of zero length tokens it is also a zero length
707 token. In the Token Table window, AnaGram notes which tokens are zero
708 length, because they can be a source of conflicts.
709
710 \subsection{Grammar Token}
711
712 Every grammar must have a single token which produces the entire
713 grammar. This token is variously called the
714 \index{Token}\index{Grammar token}\agterm{grammar token}, the
715 \index{Goal token}\agterm{goal token} or the
716 \index{Start token}\agterm{start token}.
717 AnaGram provides several methods you may use to specify which token in
718 your grammar is the grammar token.
719
720 You may simply use the name \agcode{grammar} for the grammar token.
721 If you wish to use some other more descriptive name for your grammar
722 token, you may mark it with a following dollar sign when it appears on
723 the left side of a production. Alternatively, you may set the
724 \index{Grammar token}\index{Configuration parameters}\agparam{grammar token}
725 configuration parameter to specify the grammar token. Here are
726 examples of the methods:
727
728 \begin{indentingcode}{0.4in}
729 grammar
730 -> [statement | newline]/...
731
732 program \$
733 -> [statement | newline]/...
734
735 {}[ grammar token = program ]
736 program
737 -> [statement | newline]/...
738 \end{indentingcode}
739
740 If you should use more than one of these techniques, AnaGram resolves
741 the issue in the following manner: A marked token or a configuration
742 parameter setting always takes precedence over simply naming a token
743 \agcode{grammar}. If you mark more than one token or set the
744 configuration parameter more than once, the last setting or mark wins.
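
As a hedged illustration of these precedence rules, consider the
following sketch, in which the token names are hypothetical. Because
\agcode{program} is marked with a dollar sign, it should be taken as
the grammar token even though another token is named \agcode{grammar}:

\begin{indentingcode}{0.4in}
grammar
  -> [statement | newline]/...

program \$
  -> grammar, eof
\end{indentingcode}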
745
746 \subsection{Grammar Rules}
747 \index{Rule}\index{Grammar rule}
748
749 The part of a production to the right of the arrow is more often
750 called a \agterm{grammar rule}, or simply \agterm{rule}. A grammar
751 rule is a sequence of \index{Rule elements}\agterm{rule elements},
752 joined by commas, as in the examples of productions given above. Rule
753 elements are token names, character set expressions, virtual
754 productions, or immediate actions (see below). Each rule element may
755 be optionally followed by a parameter assignment. The entire rule may
756 be followed by an optional reduction procedure. A \index{Parameter
757 assignment}parameter assignment is a colon followed by a C variable
758 name. Here are some examples of rule elements with parameter
759 assignments:
760
761 \begin{indentingcode}{0.4in}
762 '0-9':d
763 integer:n
764 expression:x
765 declaration:declaration{\us}descriptor
766 \end{indentingcode}
767
768 The parameters you assign to tokens in your grammar rule become the
769 formal parameters for your \index{Reduction procedure}reduction
770 procedure. The data type\index{Data type}\index{Reduction procedure
771 arguments} of the parameter is determined by the data type for the
772 semantic value of the token to which it is assigned. If your grammar
773 rule has parameter assignments, but does not have a reduction
774 procedure, AnaGram will give you a warning in case the lack of a
775 reduction procedure is an oversight. If you don't need a reduction
776 procedure you may safely ignore the warning. On the other hand,
777 AnaGram has no way to determine whether you have failed to make
778 necessary parameter assignments. You won't find out until you compile
779 your parser, when your compiler will give you error messages for
780 undefined symbols.
781
782 AnaGram assigns a unique rule number to each rule in your grammar.
783 Rules are numbered sequentially as they are encountered in the syntax
784 file. AnaGram constructs rule zero itself. Rule zero normally has a
785 single element, the grammar token, unless you have a
786 \agparam{disregard} statement in your grammar. In this case there
787 will be two elements.
788
789 \subsection{Reduction Procedures}
790 \index{Reduction procedure}
791
792 % XXX somewhere in here it ought to say something like
793 % ``in the parsing literature reduction procedures are often known as
794 % \agterm{semantic actions}.''
795 % Note that R. says there's some subtle difference between the usual
796 % concept of semantic action and AG's concept of reduction procedure.
797 % I don't know what this difference is and I hope she can recall it.
798 %
799 % D. thinks this note ought to be at the end; R. wants it at the top.
800
801 A \agterm{reduction procedure} is a piece of C code which optionally
802 follows a production. The code is executed when your parser
803 identifies the production in its input. There are two forms for
804 reduction procedures, a short form and a long form. The short form
805 consists of a single C expression. The long form consists of an
806 arbitrary block of C code. When AnaGram builds a parser, it inspects
807 the grammar rule to which the procedure is attached and identifies the
808 parameters for the procedure. It uses these parameters as the formal
809 parameters for the procedure.
810 If the
811 \index{Macros}\index{Allow macros}\index{Configuration switches}
812 \agparam{allow macros}
813 configuration switch has not been turned off, AnaGram codes the
814 reduction procedure as a macro definition whenever possible.
815 Otherwise AnaGram codes it as a function definition. AnaGram builds
816 the name for a reduction procedure by appending its internal procedure
817 number to the string \agcode{ag{\us}rp{\us}}. Reduction procedures are
818 numbered in the order in which they are encountered in the syntax
819 file.
820
821 Both long and short form reduction procedures are preceded by an equal
822 sign which follows the production. The short form consists of a C or
823 C++ expression terminated by a semicolon. When the grammar rule is
824 reduced, the expression will be evaluated and its value will become
825 the value of the reduction token. The expression and the terminating
826 semicolon must be entirely on a single line. Note that, if you really
827 need to make the expression longer than will fit on one line, you can
828 embed a newline in a comment. Some examples of short form reduction
829 procedures are:
830
831 % XXX is there anything we can do about the ugly underscores?
832 \begin{indentingcode}{0.4in}
833 =0;
834
835 =1;
836
837 =10*n + d-'0';
838
839 =
840 special{\us}processor(first{\us}parameter, second{\us}parameter);
841
842 =word{\us}count++;
843
844 =widget(constant{\us}1*parameter{\us}1 + constant{\us}2*parameter{\us}2 /*
845 {} */ + constant{\us}3*parameter{\us}3);
846 \end{indentingcode}
847
848 A long form reduction procedure consists of an arbitrary block of C or
849 C++ code, enclosed in braces (\bra \ket). AnaGram will code the reduction
850 procedure as a function. To return a value for the reduction token,
851 simply use the \agcode{return} statement. There are effectively no
852 restrictions on the content or length of a reduction procedure. Of
853 course, if there are unbalanced braces, unterminated comments or
854 unterminated string literals, AnaGram will not be able to determine
855 where the reduction procedure ends. AnaGram treats
856 \index{Comments}nested comments within a reduction procedure according
857 to the value of the \index{Nest comments}\index{Configuration
858 switches}\agparam{nest comments} configuration switch at the point
859 where it encounters the reduction procedure.
860
861 In practice, it is rarely a good idea to write a reduction procedure
862 that is more than a few lines long, since a long procedure will
863 obscure the overall structure of your grammar. Long
864 reduction procedures should be written as separate named functions,
865 and should either be included in the embedded C portion of your syntax
866 file or should be included in a wholly separate module. Here is an
867 example of a long form reduction procedure:
868
869 \begin{indentingcode}{0.4in}
870 =\bra
871 if (flag) \bra
872 total += x;
873 return identify(x);
874 \ket
875 else \bra
876 total = 0;
877 flag = 1;
878 return init{\us}table(x);
879 \ket
880 \ket
881 \end{indentingcode}
882
883 If a rule does not have a reduction procedure, the semantic value of
884 the reduction token will be set to the \index{Semantic
885 value}\index{Token}\index{Value}semantic value of the first token in
886 the rule, unless the rule is a \index{Null productions}null
887 production. In the latter case, the value of the reduction token will
888 be set to zero.
889 % XXX and what if zero isn't a valid value for the type? a compiler
890 % error will occur.
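
The following sketch illustrates the default; the token names are
chosen only for illustration. The first rule for \agcode{signed integer}
has no reduction procedure, so the value of \agcode{signed integer}
would simply be the value of \agcode{integer}. The second rule begins
with \agcode{'-'}, so it needs an explicit reduction procedure to
avoid taking on the value of the minus sign:

\begin{indentingcode}{0.4in}
integer
  -> '0-9':d                 = d - '0';
  -> integer:n, '0-9':d      = 10*n + d - '0';

signed integer
  -> integer
  -> '-', integer:n          = -n;
\end{indentingcode}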
891
892 % XXX add something like
893 %
894 % Variables appearing in reduction procedures which do not have a
895 % parameter assignment in the corresponding grammar rule can be
896 % declared globally or (file)-statically in your embedded C, or
897 % alternatively could be added to the parser control block using
898 % the \agparam{extend pcb} statement (q.v. | See Section ....).
899 % (Reword this.)
900 %
901 % Should also discuss the sequencing of reduction procedure calls
902 % so that people understand what happens if you use such variables.
903 %
904 % also ``A reduction procedure can be used to terminate parsing for
905 % semantic reasons''.
906 %
907
908 \subsection{Immediate Actions}
909 \index{Immediate action}\index{Action}
910
911 An immediate action is a rule element that consists of executable C or
912 C++ code embedded within a grammar rule to be executed when it is
913 encountered. An immediate action is denoted by the use of an
914 exclamation point, \index{!}``!''. The content of an immediate action
915 may be written following the rules for either long form or short form
916 reduction procedures. As with any other rule element, it must be
917 separated from preceding and following rule elements by commas. In
918 the grammar for a simple desk calculator, one might write
919
920 \begin{indentingcode}{0.4in}
921 transaction
922 -> !printf("\#");, expression:x = printf("\%d{\bs}n", x);
923 \end{indentingcode}
924
926 Notice that the only visible difference between an immediate action
927 and a reduction procedure is that the immediate action is preceded by
928 ``!'' instead of ``=''. The immediate action must be followed by a
929 comma to separate it from the following rule element.
930
931 Immediate actions may also be used in definitions:
932
933 \begin{indentingcode}{0.4in}
934 prompt = !printf("\#");
935 \end{indentingcode}
936
937 AnaGram implements an immediate action by creating a special token for
938 it. AnaGram then creates a single null production for the
939 token. Finally, the immediate action is implemented as the reduction
940 procedure for the null production.
941
942 For example, you could implement \agcode{prompt} by writing a null production
943 with a reduction procedure:
944
945 \begin{indentingcode}{0.4in}
946 prompt
947 -> = printf("\#");
948 \end{indentingcode}
949
950 This production would be equivalent to the definition above.
951
952 There are two ways, however, in which immediate actions differ from
953 the equivalent null production. Immediate actions may access any
954 parameter assignments which precede them in the rule in which they
955 occur. On the other hand, there is no way to assign a data type to
956 the semantic value, if any, returned by the immediate action.
957 Therefore, the type is determined by your setting of the
958 \index{Default token type}\index{Configuration parameters}
959 \agparam{default token type} configuration parameter.
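
As a sketch of the first difference, the immediate action below uses
the parameter \agcode{n} assigned earlier in the same rule. The
functions \agcode{note{\us}target} and \agcode{store} are hypothetical
and would have to be supplied in your embedded C:

\begin{indentingcode}{0.4in}
assignment
  -> name:n, '=', !note{\us}target(n);, expression:x = store(n, x);
\end{indentingcode}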
960
961 \subsection{Virtual Productions}
962 \index{Virtual productions}\index{Production}
963
964 Virtual productions are a convenient short form notation for common
965 grammatical constructs involving choice and repetition. The notation
966 represents an extension of notation commonly used in programming
967 manuals. A virtual production may be written in a grammar rule at any
968 place where you could write a token name, even within another virtual
969 production. Note that use of virtual productions is never
970 \emph{required}, since the equivalent productions can always be
971 written out explicitly instead.
972
973 When AnaGram encounters a virtual production, it replaces the virtual
974 production with a new token and writes appropriate productions for the
975 new token. When you look at your syntax tables using AnaGram windows,
976 you will see the productions that AnaGram generates. AnaGram keeps a
977 record of virtual productions, so that generally if you use the same
978 virtual production a second time, you get the same set of tokens and
979 productions that were generated the first time it was used. This is
980 not the case if the virtual productions contain reduction procedures
981 or immediate actions, since AnaGram is not equipped to determine
982 whether two pieces of C code are equivalent. Thus, a virtual
983 production that contains a reduction procedure will be unique and will
984 not be reused.
985
986 One disadvantage of virtual productions is that there is no way to
987 specify the data type of the \index{Semantic value}\index{Virtual
988 production}semantic value of a virtual production. Therefore, if you
989 have a reduction procedure within a virtual production, its return
990 value must be consistent with the type defined by the \index{Default
991 token type}\index{Configuration parameters}\agparam{default token type}
992 configuration parameter.
993
994 The simplest virtual production is the \index{Token}\index{Optional
995 token}\agterm{optional token}. If \agcode{x} is an arbitrary token
996 name or set expression, you can indicate an optional \agcode{x} by
997 writing \index{?}\agcode{x?}. You may also indicate a repetition of
998 \agcode{x} by using the ellipsis with either \agcode{x} or \agcode{x?}.
999 \index{...}\index{Ellipsis}Thus \agcode{x...} represents
1000 one or more instances of \agcode{x} and \index{?...}\agcode{x?...}
1001 represents zero or more instances of \agcode{x}. For example:
1002
1003 \begin{indentingcode}{0.4in}
1004 '+'?
1005 \end{indentingcode}
1006
1007 can be used to represent an optional plus sign, that is, a choice
1008 between a plus sign and nothing at all. Similarly,
1009
1010 \begin{indentingcode}{0.4in}
1011 '{\bs}n'?...
1012 \end{indentingcode}
1013
1014 represents an optional sequence of newline characters.
1015
1016 \index{Brackets}\index{Braces}\index{\_opb\_clb}\index{[]}
1017 The next category of virtual productions uses brackets or braces to
1018 indicate a choice among a number of enclosed grammar rules separated
1019 by vertical bars. A single rule may also be enclosed. Note that
1020 \emph{rules}, with following reduction procedures, are allowed, not
1021 simply tokens.
1022
1023 Braces are used to indicate that one option must be chosen. Brackets
1024 are used to indicate the choice is optional, i.e. may be omitted
1025 altogether. The ellipsis following a set of options within brackets
1026 or braces indicates the option may be repeated an indefinite number of
1027 times.
1028
1029 You can use braces to indicate a simple choice among a number of
1030 options. A Cobol grammar offers the following choice of equivalent
1031 keywords:
1032
1033 \begin{indentingcode}{0.4in}
1034 \bra "RECORD", "IS"? | "RECORDS", "ARE"? \ket
1035 \end{indentingcode}
1036
1037 \index{\_opb\_clb...}\index{ []...}
1038 You may use the ellipsis with braces to indicate an arbitrary positive
1039 number of repetitions of the choice:
1040
1041 \begin{indentingcode}{0.4in}
1042 {\bra}type specifier | storage class specifier{\ket}...
1043 \end{indentingcode}
1044
1045 This expression requires at least one type specifier or storage class
1046 specifier, but will accept any number.
1047
1048 \index{[]}
1049 To make a choice optional, use brackets instead of braces. An
1050 example, again drawn from a Cobol grammar, is:
1051
1052 \begin{indentingcode}{0.4in}
1053 {}["LIMIT", "IS"? | "LIMITS", "ARE"?]
1054 \end{indentingcode}
1055
1056 \index{[]...}
1057 Ellipses may be used with brackets to indicate an arbitrary number of
1058 choices that may be omitted altogether:
1059
1060 \begin{indentingcode}{0.4in}
1061 {}[argument, [',', argument]...]
1062 \end{indentingcode}
1063
1064 This expression describes an optional argument list with arguments
1065 separated by commas.
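
Since a virtual production can always be written out explicitly, the
optional argument list could also be expressed with ordinary
productions. The following is one possible hand-written equivalent,
using hypothetical token names; the productions AnaGram actually
generates for the virtual production may be organized differently:

\begin{indentingcode}{0.4in}
optional argument list
  ->
  -> argument, argument tail

argument tail
  ->
  -> argument tail, ',', argument
\end{indentingcode}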
1066
1067 If you use a null production within braces, it must be the first option:
1068
1069 \begin{indentingcode}{0.4in}
1070 \bra | '+' | '-' \ket
1071 \end{indentingcode}
1072
1073 Normally, you would do this only if you wanted to attach a reduction
1074 procedure to the null production. Note that if you include a null
1075 production within braces, and add an ellipsis after the closing brace
1076 for repetition, your grammar will be ambiguous. Just exactly how many
1077 times does the null production occur? Use brackets instead, and omit
1078 the null production.
1079
1080 Null productions are not allowed with brackets, since they would be
1081 intrinsically ambiguous.
1082
1083 The options within braces or brackets may be grammar rules of any
1084 length or complexity and may themselves contain virtual productions of
1085 arbitrary complexity. Nevertheless, in practice, clarity suffers as
1086 soon as the options get very complex. Virtual productions are most
1087 important and useful when used in simple situations. In those
1088 situations they will enhance the clarity of your grammar.
1089
1090 Here is an example that is moderately complex, even though each rule
1091 consists of a single token:
1092
1093 \begin{indentingcode}{0.4in}
1094 \bra{\bra}"on" | "true"\ket = 1; | {\bra}"off" | "false"\ket = 0; | integer\ket
1095 \end{indentingcode}
1096
1097 This example can be used to allow as input either an integer or, for
1098 special cases, keywords. You could write this option out in the
1099 following way:
1100
1101 \begin{indentingcode}{0.4in}
1102 p1
1103 -> p2 = 1;
1104 -> p3 = 0;
1105 -> integer
1106
1107 p2
1108 -> "on"
1109 -> "true"
1110
1111 p3
1112 -> "off"
1113 -> "false"
1114 \end{indentingcode}
1115
1116 The final category of virtual production provides a notation for
1117 \index{Alternating sequence}\agterm{alternating sequences}. An
1118 alternating sequence is a set of choices which may be repeated
1119 arbitrarily subject to the side condition that no choice may follow
1120 itself, in other words, that the choices must alternate. Alternating
1121 sequences are written with either brackets or braces depending on
1122 whether the sequence is optional or not, followed by
1123 \index{/...}``\agcode{/...}''. Note that the choices themselves may
1124 allow sequences. For example:
1125
1126 \begin{indentingcode}{0.4in}
1127 program
1128 -> [statement | newline...]/..., eof
1129 \end{indentingcode}
1130
1131 represents a sequence of statements separated by one or more newlines.
1132 Any two statements must be separated by one or more newline
1133 characters, and newlines may also appear at the beginning and the end
1134 of the program.
1135
1136 Null productions are not allowed within alternating sequences, since
1137 they are intrinsically ambiguous in all cases.
1138
1139 \subsection{Definition Statements}
1140 \index{Definitions}\index{Definition statement}\index{Statement}
1141
1142 A definition statement is simply a shorthand way of naming a character
1143 set, a \index{Virtual productions}\index{Production}virtual
1144 production, a keyword string, or an immediate action. It can also be
1145 used for providing an alternate name for a token. Definitions have the
1146 form:
1147
1148 \begin{indentingcode}{0.4in}
1149 name = \codemeta{character set}
1150 name = \codemeta{virtual production}
1151 name = \codemeta{keyword}
1152 name = \codemeta{immediate action}
1153 name = \codemeta{token name}
1154 \end{indentingcode}
1155
1156 The name may be any name acceptable to AnaGram. The name can then be
1157 used anywhere you might have used the expression on the right
1158 side. \index{!}For example:
1159
1160 \begin{indentingcode}{0.4in}
1161 upper case letter = 'A-Z'
1162 lower case letter = 'a-z'
1163 letter = upper case letter + lower case letter
1164 statement list = statement?...
1165 while keyword = "WHILE"
1166 prompt = !printf("Please enter name:");
1167 \end{indentingcode}
1168
1169 It is important to recognize that a definition statement that names a
1170 set does not define a token. A token is defined only when the set is
1171 used in a grammar rule, and then only if the set is used directly, not
1172 in combination with some other set. Furthermore, if you use a
1173 character set directly in a grammar rule, and in some other rule you
1174 use a name that refers to the same set of characters, you will get two
1175 different tokens. For example, if you have defined \agcode{upper case
1176 letter} as in the above example and use both \agcode{upper case
1177 letter} and \agcode{'A-Z'} in grammar rules, AnaGram will assign
1178 different \index{Token number}\index{Token}\index{Number}token numbers
1179 to accommodate any differences in attributes you may assign to the
1180 tokens.
1181
1182 Renaming tokens is a convenient way to connect two independently
1183 written portions of a grammar.
1184 % See the C grammar in the EXAMPLES directory of your distribution
1185 % disk for an example.
1186
1187 \subsection{Embedded C}
1188 \index{Embedded C}
1189
1190 You may encapsulate C or C++ code in your syntax file by enclosing it
1191 in braces (\bra \ket). Such pieces of code are copied to the parser file
1192 untouched, in the order they are found in the syntax file. There may
1193 be any number of such pieces of embedded C. The only restriction is
1194 that they must not start on the same line as some other AnaGram
1195 statement, and following AnaGram statements must also start on fresh
1196 lines.
1197
1198 Normally, the blocks of embedded C in your syntax file are copied to
1199 the parser file \emph{following} a set of definitions and declarations
1200 AnaGram needs for the code it generates. However, if the \emph{first}
1201 statement in your \index{Syntax file}syntax file is a block of
1202 embedded C, it will \emph{precede} AnaGram's definitions and
1203 declarations. This block of embedded C is called the
1204 \index{Prologue}\index{C prologue}``C prologue''. There are two main
1205 reasons for this special treatment. First, you may want to have a
1206 title and \index{Copyright notice}copyright notice in your parser. If
1207 you include them in an initial block of embedded C they will be right
1208 at the beginning of both your syntax file and your parser file.
1209 Second, if some of your tokens have data type\index{Token}\index{Data
1210 type}s other than those predefined in C or C++, you may include the
1211 definitions here, so they will be available to the code AnaGram
1212 generates.
1213
1214 AnaGram scans embedded C only insofar as is necessary to find the
1215 closing right brace. Therefore any braces used within embedded C must
1216 balance properly. AnaGram skips braces enclosed in character
1217 constants and string literals, as well as braces enclosed in
1218 comments. It also recognizes C++ style comments that begin with
1219 \agcode{//}. \index{Comments}Treatment of nested versus non-nested comments
1220 is controlled by the
1221 \index{Nest comments}\index{Configuration switches}\agparam{nest comments}
1222 configuration switch. AnaGram will use the status of this
1223 switch in effect at the beginning of the section of embedded C.
1224
1225 AnaGram, of course, can be confused by unterminated strings,
1226 unbalanced braces, and unterminated comments. The most likely
1227 outcome, in such a situation, is that AnaGram will encounter the end
1228 of file looking for the end of the embedded C. Should this happen,
1229 AnaGram will identify the beginning of the piece of embedded C which
1230 caused the problem.
1231
1232 The code you include as embedded C, of course, has to coexist with the
1233 code AnaGram generates. In order to keep the potential for conflicts
1234 to a minimum, all variables and functions which AnaGram defines begin
1235 either with the name of your parser or with the letters
1236 \agcode{ag{\us}}. You should avoid variable names which begin with these
1237 letters.
1238
1239 Reduction procedures are copied to the \index{Parser
1240 file}\index{File}parser file in the order in which they are defined
1241 \emph{following} all of the embedded C. Thus your reduction
1242 procedures may freely use variables and macros defined anywhere in
1243 your embedded C.
1244
1245 \subsection{Configuration Sections}
1246 \index{Configuration section}
1247
1248 A configuration section is a special section of your syntax file
1249 enclosed in brackets. Within a configuration section you may set the
1250 values of configuration parameters or switches, or you may use one or
1251 more of several available attribute statements to specify special
1252 treatment for certain tokens. There can be as many or as few
1253 configuration sections in your syntax file as you wish. Each
1254 configuration section must begin on a new line. Any AnaGram statement
1255 which follows a configuration section must also begin on a new line.
1256
1257 Within a configuration section, each parameter setting and each
1258 attribute statement must begin on a new line. The rules for using
1259 comments and continuation lines are the same as for the rest of
1260 AnaGram.
1261
1262 Configuration parameters control the way AnaGram interprets your
1263 syntax file and the way it builds your parser. A full discussion of
1264 the use of configuration parameters, including a complete discussion
1265 of each parameter and its default value, is given in Appendix A.
1266
1267 \index{Attribute statements}\index{Statement}
1268 Attribute statements comprise the
1269 \index{Precedence declarations}precedence declarations \agparam{left},
1270 \agparam{right}, and \agparam{nonassoc}; the \agparam{sticky}
1271 declaration; the \agparam{distinguish keywords} statement; the
1272 \agparam{hidden} declaration; the \agparam{disregard} and
1273 \agparam{lexeme} statements; the \agparam{enum} statement; the
1274 \index{Reserve keywords}\agparam{reserve keywords} declaration; and
1275 the \index{Rename macro}\agparam{rename macro} statement.
1276
1277 The precedence declarations and the
1278 \index{Sticky declaration}\index{Declaration}\agparam{sticky}
1279 declaration may be used to resolve conflicts in your grammar. The
1280 \agparam{distinguish keywords} statement may be used to control
1281 keyword recognition. The
1282 \index{Hidden declaration}\index{Declaration}\agparam{hidden}
1283 declaration causes certain token names not to be used when your parser
1284 produces
1285 \index{Syntax error}\index{Errors}\index{Error messages}syntax error
1286 messages. You may use the \agparam{disregard} and \agparam{lexeme}
1287 statements to cause your parser to skip automatically over certain
1288 tokens in its input. The \agparam{enum} statement is almost identical
1289 to the enum statement in C. It can be used to assign names to input
1290 codes in grammars which are taking input from a \index{Lexical
1291 scanner}lexical scanner or another parser. The
1292 \index{Reserve keywords}\agparam{reserve keywords} declaration allows
1293 you to specify certain keywords as reserved words. The
1294 \index{Rename macro}\agparam{rename macro} statement allows you to
1295 override the names AnaGram uses for various macro definitions it
1296 creates in the code it generates.
1297
1298 Attribute statements are discussed below. Except for
1299 \agparam{disregard} and \agparam{rename macro} statements, attribute
1300 statements accept lists of operands enclosed in braces (\bra \ket)
1301 and separated by commas. A dangling comma following the last item in
1302 a list will be ignored.
1303
1304 \subsection{Setting Configuration Parameters}
1305 \index{Configuration parameters}\index{Parameters}
1306
1307 Each configuration parameter has a name that follows the AnaGram
1308 conventions for symbol names, except that AnaGram ignores
1309 case\index{Case sensitivity} when looking up configuration parameter
1310 names.
1311
1312 There are a number of varieties of configuration parameters. The
1313 simplest,
1314 \index{Configuration switches}\index{Switches}configuration switches,
1315 simply turn some feature of AnaGram on or off. These parameters need
1316 simply be stated to turn the feature on, or negated with the tilde
1317 (\agcode{\~{}}) to turn the feature off:
1318
1319 \begin{indentingcode}{0.4in}
1320 nest comments
1321 \end{indentingcode}
1322
1323 causes AnaGram to allow nested comments, and
1324
1325 \begin{indentingcode}{0.4in}
1326 \~{}nest comments
1327 \end{indentingcode}
1328
1329 causes AnaGram to disallow nested comments.
1330
1331 You may also set or reset configuration switches with explicit on or
1332 off values:
1333
1334 \begin{indentingcode}{0.4in}
1335 nest comments = on
1336 nest comments = off
1337 \end{indentingcode}
1338
1339 The remaining configuration parameters are assigned values using a
1340 simple assignment statement. Depending on the parameter, the value it
1341 takes may be the name of a token, a C variable name, a C or C++ data
1342 type, a string constant or an integer. String constants are written
1343 using the same rules as keyword strings, described above.
1344
1345 \begin{indentingcode}{0.4in}
1346 grammar token = program
1347 parser name = widget
1348 default token type = void *
1349 header file name = "widget.h"
1350 parser stack size = 50
1351 \end{indentingcode}
1352
1353 A number of string-valued \index{Configuration
1354 parameters}configuration parameters are used to determine file
1355 names and variable names. In these parameters, the \index{\#}``\#'',
1356 \index{\_dol}``\$'', and ``\index{ \_prc}\%'' characters
1357 are used as wild cards. In file name specifications and the
1358 specification of the name of your parser, ``\#'' will be replaced by
1359 the name of your syntax file. In other function and variable names
1360 AnaGram creates while building your parser, ``\$'' will be replaced by
1361 the name of your parser. When building enumeration constants for the
1362 names of the tokens in your grammar, ``\%'' will be replaced by the
1363 name of the token.
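
For example, assuming a syntax file named \agcode{widget.syn} and
assuming that the substitution uses the file name without its
extension, the following setting would be expected to produce a
header file named \agcode{widget.h}:

\begin{indentingcode}{0.4in}
header file name = "\#.h"
\end{indentingcode}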
1364
1365 Note that when entering a Windows/DOS path name as a
1366 value for a file name parameter you must escape each backslash in the
1367 path name by doubling it. For example,
1368
1369 \begin{indentingcode}{0.4in}
1370 coverage file name = "f:{\bs\bs}sna{\bs\bs}foo.nrc"
1371 \end{indentingcode}
1372
1373 \subsection{Precedence Declarations}
1374 \index{Precedence declarations}
1375
1376 AnaGram allows you to resolve shift-reduce conflicts by assigning
1377 precedence levels to operators. There are three precedence
1378 declarations available, beginning with the keywords
1379 \index{Left}\agparam{left}, \index{Right}\agparam{right}, and
1380 \index{Nonassoc}\agparam{nonassoc} respectively. Each such
1381 declaration consists of the appropriate keyword and a list of tokens
1382 enclosed in braces (\bra \ket). All the tokens in the list have the same
1383 precedence, higher than tokens in any previous declaration and lower
1384 than in any subsequent declaration. If the keyword is \agparam{left},
1385 the tokens will group to the left. If it is \agparam{right}, they
1386 will group to the right. If it is \agparam{nonassoc} (for
1387 non-associative) no grouping will be assumed. Precedence declarations
1388 must be included in a configuration section. Here are precedence
1389 declarations appropriate to a simple desk calculator program:
1390
1391 \begin{indentingcode}{0.4in}
1392 {}[
1393 left \bra '+', '-' \ket
1394 left \bra star, '/', '\%' \ket
1395 right \bra unary minus \ket
1396 ]
1397 unary minus = '-'
1398 \end{indentingcode}
1399
1400 Note that \agcode{unary minus} and \agcode{'-'} can have different
1401 precedence.
1402
1403 Precedence declarations are one of the few instances in AnaGram where
1404 the \index{Statements}\index{Order of statements}order of statements
1405 is significant.
1406
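As a hedged sketch of how the order of declarations sets the levels,
the following configuration section adds a non-associative level for
comparison operators below the arithmetic operators of the desk
calculator example. The choice of \agcode{'<'} and \agcode{'>'} is an
assumption made for illustration:

\begin{indentingcode}{0.4in}
{}[
  nonassoc \bra '<', '>' \ket
  left \bra '+', '-' \ket
  left \bra star, '/', '\%' \ket
  right \bra unary minus \ket
]
\end{indentingcode}

Because the \agparam{nonassoc} declaration comes first, the comparison
operators have the lowest precedence, so \agcode{a + b < c} would be
expected to group as \agcode{(a + b) < c}, while no grouping is
assumed for \agcode{a < b < c}.
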
1407 The use of precedence declarations is discussed in Chapter 9.
1408
1409 \subsection{``Sticky'' Declarations}
1410 \index{Sticky declaration}\index{Declaration}
1411
1412 AnaGram provides another means for resolving shift-reduce conflicts.
1413 You may characterize any token as ``sticky''. Then, in the case of a
1414 \index{Shift-reduce conflict}\index{Conflicts}shift-reduce conflict
1415 where a ``sticky'' token is the last token in the input buffer, the
1416 conflict will be resolved by selecting the shift operation.
1417 Intuitively, you may think of this as though the ``sticky'' token
1418 adheres to and draws in any subsequent input that it can. ``Sticky''
1419 declarations are included in configuration sections. They begin with
1420 the keyword \agcode{sticky} followed by a list of tokens, separated by
1421 commas and enclosed in braces (\bra \ket). Suppose, for instance, you wished to
1422 pick up a line of text, skipping any leading space or tab
1423 characters. You might write the following syntax:
1424
1425 \begin{indentingcode}{0.4in}
1426 white space = ' ' + '{\bs}t'
1427
1428 text char
1429 -> \~{}'{\bs}n':c = do{\us}something(c);
1430
1431 line
1432 -> leading white space, text char?..., '{\bs}n'
1433
1434 leading white space
1435 ->
1436 -> leading white space, white space
1437 \end{indentingcode}
1438
1439 Unfortunately, this syntax is ambiguous, since space and tab are
1440 legitimate instances of both leading white space and text char. What
1441 you really want to do is to skip white space until you find a
1442 non-blank character and then you want to accept all characters to the
1443 end of the line. There are two ways to address the problem. The
1444 first is to define a special token for the first non-blank character
1445 and, using it, to write an unambiguous grammar. This approach, while
1446 laudable, is tedious and prolix. Instead, use \agparam{sticky} to
1447 resolve the problem:
1448
1449 \begin{indentingcode}{0.4in}
1450 {}[ sticky \bra leading white space \ket ]
1451 \end{indentingcode}
1452
1453 Now when AnaGram analyzes your grammar, and encounters the ambiguity,
1454 it will understand that a blank or tab that could be treated either as
1455 leading white space or as the first text character should be
1456 treated as white space. Since \agcode{leading white space} is
1457 ``sticky'', any subsequent white space adheres to it.
1458
1459 As with conflicts resolved with precedence levels, AnaGram lists all
1460 conflicts that it resolves using \agcode{sticky} in the
1461 \index{Resolved Conflicts}\index{Window}\agwindow{Resolved Conflicts
1462 Table}, so you can verify that the conflicts have been correctly
1463 resolved.
1464
1465 An important use of sticky tokens is to inhibit the recognition of
1466 following \index{Keywords}keywords. Following a sticky token, a
1467 keyword, which, according to your grammar, would otherwise be
1468 legitimate input, will not be recognized if a shift action is possible
1469 for the first character of the keyword. For example, imagine that
1470 \agcode{name} has been defined in the conventional way, and there
1471 exists a production with name followed immediately by the keyword
1472 \agcode{int}. Then if, in your input, the word \agcode{print} were to
1473 occur, your grammar would parse it as a name, \agcode{pr}, followed by
1474 the keyword \agcode{int}. If you make \agcode{name} sticky, however,
1475 the first letter of \agcode{int} will be seen to be an acceptable
1476 character for \agcode{name} and the keyword will not be
1477 recognized. Your parser will then recognize the \agcode{name} as
1478 \agcode{print}.
1479
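A minimal sketch of the situation just described, assuming a
conventional definition of \agcode{name} as a sequence of letters.
The definition of \agcode{letter} and the production for the
hypothetical token \agcode{declaration} are illustrative only:

\begin{indentingcode}{0.4in}
{}[ sticky \bra name \ket ]
letter = 'a-z' + 'A-Z'

name
 -> letter
 -> name, letter

declaration
 -> name, "int"
\end{indentingcode}

With the \agparam{sticky} declaration in place, the input
\agcode{print} is taken as a single \agcode{name}. Without it, the
parser could end the \agcode{name} at \agcode{pr} and recognize the
keyword.
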
1480 \subsection{Distinguish Keywords Statement}
1481 \index{Distinguish keywords}\index{Keywords}
1482
1483 Distinguish keywords statements are occasionally needed to prevent
1484 keyword recognition. You may, for example, wish to prevent the
1485 recognition of the keyword \agcode{int} when it occurs embedded in a
1486 word such as \agcode{interval}. Of course, you need to do this only
1487 if the keyword and the other word are both legitimate input at
1488 the same point in your grammar.
1489
1490 A distinguish keywords statement can prevent recognition of a keyword
1491 which is embedded in another word provided at least one character of
1492 the other word follows the keyword.
1493
1494 The distinguish keywords statement has the form:
1495
1496 \begin{indentingcode}{0.4in}
1497 distinguish keywords \bra \codemeta{list of character sets} \ket
1498 \end{indentingcode}
1499
1500 AnaGram compares all the characters in each keyword to the characters
1501 included in each character set in turn. If it finds that all the
1502 characters in a keyword are members of a particular set, it tells the
1503 keyword recognition logic to try to match the keyword only against the
1504 longest sequence of characters drawn from the specified set. In other
1505 words, in order for a keyword to be recognized, the keyword
1506 \emph{must} be followed by a character \emph{not} in the set. The set
1507 associated with a keyword is the first one in the list which contains
1508 all the characters found in the keyword. If you have more than one
1509 \agparam{distinguish keywords} statement in your grammar, the lists
1510 are tried in the order in which they appear in the grammar.
1511
1512 The purpose of the \agparam{distinguish keywords} statement is to
1513 enable your parser to distinguish a keyword from the same sequence of
1514 characters embedded within another sequence. Thus suppose that
1515 \agcode{int} is a keyword, and, according to your grammar, could
1516 appear in the same place as the word \agcode{integral}. If you don't
1517 want it to be recognized as a keyword in these circumstances, you
1518 would write the following distinguish statement:
1519
1520 \begin{indentingcode}{0.4in}
1521 distinguish keywords \bra 'a-z'+'A-Z' \ket
1522 \end{indentingcode}
1523
1524 To also inhibit recognition of \agcode{int} within \agcode{print}, you
1525 would combine the use of the distinguish keywords statement with the
1526 \agparam{sticky} declaration.
1527
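The combination might be sketched as follows, assuming that
\agcode{name} is the token that accepts ordinary words and that the
\agparam{distinguish keywords} statement is placed in a configuration
section alongside the \agparam{sticky} declaration:

\begin{indentingcode}{0.4in}
{}[
  distinguish keywords \bra 'a-z' + 'A-Z' \ket
  sticky \bra name \ket
]
\end{indentingcode}

Here the \agparam{distinguish keywords} statement covers words such as
\agcode{integral}, where letters follow the keyword, and the
\agparam{sticky} declaration covers words such as \agcode{print},
where letters precede it.
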
1528 \subsection{``Hidden'' Declarations}
1529 \index{Hidden declaration}\index{Declaration}
1530
1531 AnaGram provides an optional \index{Error diagnosis}error diagnosis
1532 feature for your parser (see Chapter 9). The \agparam{hidden}
1533 declaration allows you to identify tokens that you do not wish to be
1534 used in making up \index{Diagnostic messages}diagnostic messages.
1535 These tokens are tokens whose names would not mean anything to your
1536 users. The format of a ``hidden'' declaration is the same as that of
1537 precedence and ``sticky'' declarations. Within a configuration
1538 section, the keyword ``hidden'' is followed by a list of tokens. For
1539 example:
1540
1541 \begin{indentingcode}{0.4in}
1542 {}[ hidden \bra comment head \ket ]
1543 comment
1544 -> comment head, "*/"
1545
1546 comment head
1547 -> "/*"
1548 -> comment head, \~{}eof
1549 \end{indentingcode}
1550
1551 This is an AnaGram representation of ANSI standard C comments
1552 (non-nested). In this example the token \agcode{comment head} exists
1553 only for convenience in writing the grammar and has no particular
1554 meaning to an end user. On the other hand, the user knows what the word
1555 \agcode{comment} refers to. The ``hidden'' attribute will cause AnaGram's
1556 diagnostic builder, by backing up the stack until it finds a
1557 non-hidden token, to eschew \agcode{comment head} in favor of
1558 \agcode{comment}.
1559 % XXX eschew obfuscation. how about ``avoid''?
1560
1561 \subsection{Disregard Statement}
1562
1563 The purpose of the
1564 \index{Disregard statement}\index{Statement}\agparam{disregard}
1565 statement is to skip over uninteresting \index{White space}white space
1566 and comments in your input files. The disregard statement allows you
1567 to specify a token that should be passed over in the input to your
1568 parser. The statement takes the form:
1569
1570 \begin{indentingcode}{0.4in}
1571 disregard ws
1572 \end{indentingcode}
1573
1574 where \agcode{ws} is a token name or character set. Disregard
1575 statements may be placed in any configuration section.
1576
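For instance, here is a sketch in which \agcode{ws} is defined as a
character set of blanks, tabs, and newlines. The particular choice of
characters is an assumption made for illustration:

\begin{indentingcode}{0.4in}
{}[ disregard ws ]
ws = ' ' + '{\bs}t' + '{\bs}n'
\end{indentingcode}
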
1577 You may have more than one disregard statement in your grammar. If
1578 you do, AnaGram will create a shell production. For example, suppose
1579 you write:
1580
1581 \begin{indentingcode}{0.4in}
1582 {}[
1583 disregard alpha
1584 disregard beta
1585 ]
1586 \end{indentingcode}
1587
1588 AnaGram will proceed as though you had written:
1589
1590 \begin{indentingcode}{0.4in}
1591 gamma
1592 -> alpha | beta
1593 {}[ disregard gamma ]
1594 \end{indentingcode}
1595
1596 It frequently happens that you wish your parser to disregard blanks or
1597 comments, except that white space within names, numbers, strings, and
1598 other elementary constructs is subject to special rules and thus
1599 should not be disregarded blindly. In this case, you can use the
1600 \agparam{lexeme} statement to declare these constructs off limits
1601 for the disregard statement. Within these constructs, the disregard
1602 statement will be inoperative and the admissibility of white space
1603 will be determined solely by the productions which define these
1604 constructs.
1605
1606 Outside those productions which define lexemes, you should not
1607 generally use a token which is supposed to be disregarded. If you do,
1608 your grammar will have conflicts, since the token could satisfy both
1609 the explicit usage and the implicit rules set up by the disregard
1610 statement. Such conflicts, however, are resolved automatically in
1611 favor of your explicit use of the token. The conflicts will appear in
1612 the \agwindow{Resolved Conflicts} window.
1613 % XXX I'm not sure that's still true.
1614
1615 In order to implement the disregard statement AnaGram will redefine
1616 some tokens in your grammar. For example, \agcode{'+'} may be redefined
1617 to consist of a simple plus sign followed by optional white space:
1618
1619 \begin{indentingcode}{0.4in}
1620 '+'
1621 -> '+'\%, white space?...
1622 \end{indentingcode}
1623
1624 The percent sign is used to indicate the original, simple plus sign
1625 without the optional white space attached. You will probably notice
1626 the percent sign appearing in some windows and traces. In earlier
1627 versions of AnaGram, the degree sign, ``\agcode{\degrees}'', was used rather
1628 than ``\agcode{\%}''.
1629
1630 \subsection{Lexeme Statement}
1631
1632 The ``lexeme'' \index{Statement}\index{Lexeme statement}statement is
1633 used to fine-tune the disregard statement.
1634 The lexeme statement takes the form:
1635
1636 \begin{indentingcode}{0.4in}
1637 {}[ lexeme \bra \codemeta{nonterminal token list} \ket ]
1638 \end{indentingcode}
1639
1640 where \textit{nonterminal token list} is a list of nonterminal tokens
1641 separated by commas.
1642 Lexeme statements may be placed in any configuration section, and
1643 there may be any number of them.
1644
1645 When you specify that a token is to be disregarded, AnaGram rewrites
1646 your grammar so that the token will be passed over whenever it occurs
1647 at the beginning of a file or following a lexical unit, or
1648 \agterm{lexeme}. If you have no \agparam{lexeme} statement, then the
1649 lexemes in your grammar are just the terminal tokens.
1650
1651 The \agparam{lexeme} statement allows you to specify that certain
1652 nonterminal tokens are also to be treated as lexemes. This means that
1653 the disregard token will be skipped following the lexeme, but not
1654 between the characters that constitute the lexeme.
1655
1656 Lexemes correspond to the tokens that a lexical scanner, if you were
1657 using one, would commonly identify and pass to a parser as single
1658 tokens. You don't usually wish to disregard white space within these
1659 tokens. For example, in a grammar for a conventional programming
1660 language where blank characters are to be disregarded, you might
1661 include:
1662
1663 \begin{indentingcode}{0.4in}
1664 {}[ lexeme \bra string, character constant, name, number \ket ]
1665 \end{indentingcode}
1666
1667 since blank characters must not be overlooked within strings and
1668 character constants and should not be permitted within names or
1669 numbers.
1670
1671 Normally, AnaGram considers the disregard token to be optional;
1672 however, there are circumstances where treating the disregard token as
1673 optional would lead to conflicts: two successive names, or two
1674 successive numbers, for example. In this case, you would like to
1675 require that the lexemes be separated by instances of the disregard
1676 token. To do this, simply set the
1677 \index{Distinguish lexemes}\index{Configuration switches}
1678 \agparam{distinguish lexemes}
1679 configuration switch.
1680 When this switch is set, AnaGram will ensure that disregard tokens
1681 are required in those situations where making them optional would
1682 lead to conflicts.
1683
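As a sketch, the switch might be set in the same configuration section
as the related statements. The token names are illustrative, and the
form shown for setting the switch (naming it on a line by itself) is
an assumption that this section does not confirm:

\begin{indentingcode}{0.4in}
{}[
  disregard space
  lexeme \bra name, number \ket
  distinguish lexemes
]
\end{indentingcode}
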
1684 White space may be used explicitly within definitions of lexeme tokens
1685 in your grammar if desired, without causing conflicts. Thus, if you
1686 wish to allow embedded space in variable names, you might write:
1687
1688 \begin{indentingcode}{0.4in}
1689 {}[
1690 disregard space
1691 lexeme \bra variable name \ket
1692 ]
1693 space = ' ' + '{\bs}t'
1694 letter = 'a-z' + 'A-Z'
1695 digit = '0-9'
1696
1697 variable name
1698 -> letter
1699 -> variable name, letter + digit
1700 -> variable name, space..., letter + digit
1701 \end{indentingcode}
1702
1703 \subsection{Enum Statement}
1704 \index{Enum statement}\index{Enumeration}\index{Token}
1705
1706 The \agparam{enum} statement follows rules nearly identical to those
1707 for C and C++. This makes it possible to copy an enum statement from
1708 your syntax file to a program file written in either C or C++, without
1709 any need for editing. The only differences are that AnaGram makes no
1710 provision for blank lines within the enumeration list, nor does it
1711 accept a type name. The \agparam{enum} statement is equivalent to a
1712 corresponding set of definition statements. It is especially useful
1713 when a parser is accepting token input from another program, a
1714 \index{Lexical scanner}lexical scanner, for example. Using
1715 the enum statement you may conveniently define all the identification
1716 codes for the input tokens.
1717
1718 Each entry in an enum statement may be either a name, or a name
1719 followed by an ``='' sign and a character representation. If there is
1720 a character representation the name is assigned the value of the
1721 specified character. Otherwise it is assigned a value one more than
1722 that assigned to the previous name. If the first name in the list is
1723 not given an explicit value, it will be given the value zero. For
1724 example:
1725
1726 \begin{indentingcode}{0.4in}
1727 {}[
1728 enum \bra
1729 eof, a,b,c,
1730 blank = '\ ', x, y
1731 \ket
1732 ]
1733 \end{indentingcode}
1734
1735 is equivalent to the following definition statements
1736
1737 \begin{indentingcode}{0.4in}
1738 eof = 0
1739 a = 1
1740 b = 2
1741 c = 3
1742 blank = '\ '
1743 x = 33
1744 y = 34
1745 \end{indentingcode}
1746
1747 \subsection{Subgrammar Declarations}
1748 \index{Subgrammar declaration}\index{Declaration}
1749
1750 A \agparam{subgrammar} declaration can be a useful way to deal with
1751 conflicts in certain situations. It tells AnaGram to treat the tokens
1752 listed in the declaration as though each were a grammar token,
1753 each specifying a complete subgrammar in itself, and, in determining
1754 shift and reduction actions, to ignore the usage of the tokens in the
1755 larger grammar.
1756
1757 In some cases it is perfectly reasonable to ignore usage. The most
1758 common example occurs when building a lexical scanner for a language
1759 such as C as in the example in Section 7.4.4. In this case, you can
1760 write a complete grammar for a C token with no difficulty. But if you
1761 try to extend it to a sequence of tokens, you get scores of conflicts.
1762 This situation arises because you specify that any C token can follow
1763 another, when in actual practice, an identifier, for example, cannot
1764 follow another identifier without some intervening space or
1765 punctuation.
1766
1767 It is theoretically possible, but in practice quite awkward, to write
1768 a grammar for a sequence of tokens so that there are no conflicts.
1769 The subgrammar declaration provides a way around this problem by
1770 telling AnaGram that when it is looking for reducing tokens for any
1771 rule produced directly or indirectly by a subgrammar token, it should
1772 disregard the usage of the token and only consider usage internal to
1773 the definition of the subgrammar token, as though the subgrammar token
1774 were the start token of the grammar.
1775
1776 The subgrammar declaration is made in a configuration section and
1777 consists of the keyword \agcode{subgrammar} followed by a list of one
1778 or more nonterminal token names, separated by commas and enclosed in
1779 braces (\bra \ket). For example:
1780
1781 \begin{indentingcode}{0.4in}
1782 {}[ subgrammar \bra C token, word \ket ]
1783 \end{indentingcode}
1784
1785 Since the subgrammar statement changes the way AnaGram determines
1786 reducing tokens, it should be used with caution. You should be sure
1787 that the conflicts you are eliminating are really inconsequential.
1788
1789 \subsection{Reserve Keywords Declaration}
1790 \index{Reserve keywords}\index{Keywords}\index{Keyword anomalies}
1791
1792 The \agparam{reserve keywords} declaration can be used to specify a
1793 list of keywords that are reserved and cannot be used except as
1794 explicitly specified in the grammar. This enables AnaGram to avoid
1795 issuing meaningless keyword anomaly diagnostics (see \S 7.5). AnaGram
1796 does not automatically presume that keywords are also reserved words,
1797 since in many grammars there is no need to specify reserved words.
1798
1799 The reserve keywords declaration is made in a configuration section
1800 and consists of the words \agcode{reserve keywords} followed by a list
1801 of one or more keyword strings, separated by commas and enclosed in
1802 braces (\bra \ket). For example:
1803
1804 \begin{indentingcode}{0.4in}
1805 {}[ reserve keywords \bra "int", "char", "float", "double" \ket ]
1806 \end{indentingcode}
1807
1808 \subsection{Rename Macro Statement}
1809 \index{Rename macro}\index{Macros}
1810
1811 AnaGram uses a number of macros in its generated code. It is
1812 possible, therefore, to run into naming collisions with other
1813 components of your program. The \agparam{rename macro} statement
1814 allows you to change the name AnaGram uses for a particular macro to
1815 avoid these problems. For example, the Windows NT operating system
1816 uses \agcode{CONTEXT} structures to perform various internal
1817 operations. If you use the context tracking option (see \S 9.5.4)
1818 your parser will have a macro called \agcode{CONTEXT}. To avoid the
1819 name collision, add the following statement to any configuration
1820 section in your grammar:
1821
1822 \begin{indentingcode}{0.4in}
1823 rename macro CONTEXT AG{\us}CONTEXT
1824 \end{indentingcode}
1825
1826 Then, simply use \agcode{AG{\us}CONTEXT} where you would otherwise have
1827 used \agcode{CONTEXT}.