comparison doc/manual/sf.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
-1:000000000000 | 0:13d2b8934445 |
---|---|
1 \chapter{Syntax Files} | |
2 \index{Syntax file}\index{File} | |
3 | |
4 Input files to AnaGram are called \agterm{syntax files}. A syntax | |
5 file comprises a grammar and associated C or C++ code. The grammar | |
6 consists of a number of productions along with supporting information | |
7 such as configuration sections and definitions of character sets. The | |
8 associated code consists of reduction procedures (see \S 8.2.13) and | |
9 embedded C or C++ code (\S 8.2.17). This chapter explains the rules | |
10 for writing syntax files acceptable to AnaGram. The rules for | |
11 interfacing your parser to the balance of your program are given in | |
12 Chapter 9. | |
13 | |
14 | |
15 \section{Lexical Conventions} | |
16 \index{Lexical conventions} | |
17 | |
18 \subsection{Statements} | |
19 \index{Statements} | |
20 | |
21 For purposes of this manual, AnaGram statements are considered to be | |
22 productions, definition statements, configuration sections, and blocks | |
23 of embedded C or C++ code, all discussed individually below. Each | |
24 statement must begin on a new line. It is a good idea to separate | |
25 statements visually in your file by using blank lines freely. | |
26 There are generally no restrictions on the | |
27 \index{Statements}\index{Order of statements}order of statements | |
28 in a syntax file. Good programming practice, however, suggests that | |
29 definitions and configuration sections should precede the grammar | |
30 itself. | |
31 | |
32 \subsection{Spaces and Tabs} | |
33 \index{Spaces}\index{Tabs} | |
34 | |
35 AnaGram allows spaces and tabs to be used freely to improve the | |
36 readability of grammars. Spaces and tabs are ignored, except when | |
37 embedded in a token name, in a character set definition, or in a | |
38 keyword. Within a token name, any sequence of spaces and tabs counts | |
39 as a single space. | |
40 | |
41 \subsection{Continuation Lines} | |
42 \index{Continuation lines} | |
43 | |
44 AnaGram statements normally end with a newline character or the end of | |
45 file. If AnaGram encounters the end of a line and the statement it is | |
46 reading appears to be complete, it will not look for a continuation. | |
47 To continue a statement to another line, just make sure that what you | |
48 have on the first line is clearly incomplete. For example, | |
49 | |
50 \begin{indentingcode}{0.4in} | |
51 prep phrase -> preposition, "the", noun | |
52 \end{indentingcode} | |
53 | |
54 looks complete to AnaGram, whereas | |
55 | |
56 \begin{indentingcode}{0.4in} | |
57 prep phrase -> preposition, "the", noun, | |
58 \end{indentingcode} | |
59 | |
60 looks incomplete because of the dangling comma at the end. | |
61 | |
62 \subsection{Comments} | |
63 \index{Comments} | |
64 | |
65 AnaGram accepts comments in accordance with the rules of C and C++, | |
66 that is, normal C comments bracketed with \agcode{/*} and \agcode{*/}, | |
67 as well as comments which begin with \agcode{//} and continue to the | |
68 end of line. AnaGram also observes these conventions when skipping | |
69 over embedded C code. | |
70 | |
71 Since the ANSI standard for C insists that normal C comments do not | |
72 nest, AnaGram, by default, disallows nested comments. You may, | |
73 however, set a configuration parameter, | |
74 \index{Nest comments}\index{Configuration switches}\index{Comments} | |
75 \agparam{nest comments}, | |
76 to allow nested comments. See Appendix A. In any case, AnaGram will | |
77 use the same convention for embedded C as it uses for AnaGram proper. | |
78 You can change the convention in the middle of the file if necessary. | |
79 | |
80 AnaGram treats each comment delimited with \agcode{/*} and \agcode{*/} | |
81 as though it were a single space. You can even put such comments in | |
82 the middle of token names if you should want to. A comment that | |
83 begins with \agcode{//} is treated as though the end of line occurred | |
84 at the \agcode{//}. | |
85 | |
86 \subsection{Blank Lines and Form Feeds} | |
87 \index{Blank lines} | |
88 | |
89 Because blank lines and form feeds are visual separators, AnaGram will | |
90 not skip over either when looking for a continuation line. Therefore, blank lines | |
91 and form feeds can occur only between AnaGram statements, not in the | |
92 middle of a statement. | |
93 | |
94 It is a good idea to separate groups of productions with a blank line | |
95 or two, lest an accidental dangling comma make AnaGram think the | |
96 beginning of the next production is a continuation of the present one. | |
97 | |
98 | |
99 \section{Elements of Grammars} | |
100 | |
101 \subsection{Names} | |
102 \index{Name}\index{Token} | |
103 | |
104 You may use names to represent tokens, character sets, keywords and | |
105 \index{Virtual productions}\index{Production}virtual productions. | |
106 Names follow the same general rules as for any programming language, | |
107 with the notable exception that they may have embedded white space. | |
108 Names are made up of letters, digits, or underscores. They may not | |
109 begin with a digit. Any sequence of embedded spaces, tabs or comments | |
110 counts as a single space. AnaGram distinguishes between upper and | |
111 lower case\index{Case sensitivity}, so that \agcode{Word} and | |
112 \agcode{word} are different names. There is no particular limit to the | |
113 length of a name. There are no reserved words as such, although | |
114 \agcode{grammar}, \agcode{eof}, and \agcode{error} will be treated as | |
115 reserved words unless you take special action by setting appropriate | |
116 configuration parameters. The names AnaGram uses for | |
117 \index{Configuration parameters}configuration parameters | |
118 follow the same rules as for other names, except that | |
119 \index{Case sensitivity}case | |
120 is ignored. | |
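The whitespace rule for names can be illustrated with a short C sketch. It is not taken from AnaGram; the helpers \agcode{normalize{\us}name} and \agcode{same{\us}name} are hypothetical, the sketch handles only spaces and tabs (not embedded comments), and the 255-character buffer is a limit of the sketch, not of AnaGram.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not AnaGram code: copy src to dst, turning each
   run of spaces and tabs into one space. Leading and trailing blanks
   are dropped. dst needs room for strlen(src) + 1 bytes. */
static void normalize_name(const char *src, char *dst)
{
    char *out = dst;
    int pending_space = 0;
    for (; *src != '\0'; ++src) {
        if (*src == ' ' || *src == '\t') {
            pending_space = (out != dst);   /* ignore leading blanks */
        } else {
            if (pending_space) {
                *out++ = ' ';
                pending_space = 0;
            }
            *out++ = *src;
        }
    }
    *out = '\0';
}

/* Nonzero when two spellings denote the same name. Case is significant. */
int same_name(const char *a, const char *b)
{
    char na[256], nb[256];
    assert(strlen(a) < sizeof na && strlen(b) < sizeof nb);
    normalize_name(a, na);
    normalize_name(b, nb);
    return strcmp(na, nb) == 0;
}
```

Under these assumptions, two spellings of a name that differ only in the amount of embedded white space compare equal, while \agcode{Word} and \agcode{word} do not.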
121 | |
122 \subsection{Reserved Words} | |
123 \index{Reserved words}\index{Words} | |
124 | |
125 % XXX shouldn't that be \index{Grammar token}? | |
126 AnaGram treats tokens with the names \index{Grammar}\agcode{grammar}, | |
127 \index{Eof token}\index{Token}\agcode{eof}, and \index{Error | |
128 token}\index{Token}\agcode{error} in a special manner unless certain | |
129 measures are taken. Since you can override AnaGram's use of these | |
130 names, they are not reserved words in the true sense. | |
131 | |
132 If your grammar has a token named \agcode{grammar}, AnaGram will take | |
133 that token to be the grammar token for your grammar unless you set the | |
134 \index{Token}\index{Grammar token}\index{Configuration parameters} | |
135 \agparam{grammar token} | |
136 configuration parameter or mark some other token as the grammar token | |
137 using ``\index{ \_dol}\$''.% See below ???. | |
138 | |
139 If your grammar has a token named \agcode{error} and you take no | |
140 further steps, AnaGram will assume you wish to use error token | |
141 resynchronization in case of | |
142 \index{Syntax error}\index{Errors}syntax error. See Chapter 9. | |
143 If you wish to use some other token as an error token you | |
144 may select it using the | |
145 \index{Configuration parameters}\index{Token}\index{Error token} | |
146 \agparam{error token} configuration parameter. | |
147 If you wish to use \agcode{error} as a token name, but do not want | |
148 error token resynchronization, you may set the \agparam{error token} | |
149 configuration parameter to any name that is not used in your grammar. | |
150 You may then use \agcode{error} as a token name without causing | |
151 AnaGram to include error token resynchronization in your parser. | |
152 | |
153 \index{Resynchronization} | |
154 If you select automatic resynchronization or error token | |
155 resynchronization (see Chapter 9), AnaGram will look for a token | |
156 called \agcode{eof} to use as an end of file indicator. You may | |
157 either name your end of file token \agcode{eof} or you may set the | |
158 \agparam{eof token} configuration parameter with the name of your end | |
159 of file token. | |
160 | |
161 \subsection{Variable Names} | |
162 \index{Name}\index{C variable names} | |
163 | |
164 With AnaGram you can associate C/C++ variable names with the | |
165 \index{Semantic value}\index{Token}\index{Value}semantic values of | |
166 tokens for use in your \index{Reduction procedure}reduction | |
167 procedures. Each name follows the corresponding token in the grammar | |
168 rule on the right of the production, separated from the token by a | |
169 colon. AnaGram allows variable names made up of letters, digits, and | |
170 underscores. They may not begin with a digit. Embedded spaces, tabs | |
171 or comments are not allowed, of course. AnaGram imposes no | |
172 restriction on length, but uses your variable names just as you have | |
173 written them in the code it generates to call reduction procedures. | |
174 Remember that your compiler may have a limit on the length of variable | |
175 names. Also, AnaGram itself uses C variable names beginning with | |
176 \agcode{ag{\us}}. It is therefore wise to avoid using names of this form. | |
177 | |
178 \subsection{Terminal Tokens} | |
179 \index{Terminal token}\index{Token} | |
180 | |
181 A \agterm{terminal token} is a token which does not appear on the left | |
182 side of a production. It represents, therefore, a basic unit of input | |
183 to your parser. You have several options with respect to terminal | |
184 tokens. If the input to your parser consists of ASCII characters, you | |
185 may define terminal tokens explicitly as ASCII characters or as sets | |
186 of ASCII characters. If you have an input procedure which produces | |
187 numeric codes, you may define the terminal tokens directly in terms of | |
188 these numeric codes. On the other hand, you may leave the terminal | |
189 tokens completely undefined. In this case, you must provide an input | |
190 procedure which can determine the appropriate | |
191 \index{Token}\index{Token number}\index{Number}token numbers. | |
192 It is an all or none situation. If you provide any explicit | |
193 definitions, you must provide them for all terminal tokens. Input | |
194 procedures and token input are discussed in Chapter 9. Examples of | |
195 non-character input may be found in the Macro Preprocessor example in | |
196 the \agfile{examples/mpp} directory on your AnaGram distribution | |
197 disk.% Further examples are given in Chapter ???. | |
198 % XXX change ``on ...distribution disk'' to ``in ...distribution''. | |
199 | |
200 \subsection{Character Representations} | |
201 \index{Character representations} | |
202 | |
203 In specifying admissible input characters you may use \index{Character | |
204 constants}character constants following the normal C conventions. | |
205 Remember that a character constant may specify only a single | |
206 character. Although some C compilers will allow constructs such as | |
207 \agcode{'mv'}, AnaGram doesn't allow this. AnaGram recognizes the | |
208 same escape sequences as C, including octal and hex sequences, even | |
209 though this is, strictly speaking, unnecessary. The escape sequences | |
210 AnaGram recognizes are: | |
211 | |
212 % | |
213 % It would be nice to be able to just write this and tell latex to set | |
214 % it in three columns. but no... that would be too easy. | |
215 % | |
216 % | |
217 %\begin{tabular}{ll} | |
218 %\agcode{{\bs}a}&alert (bell) character\\ | |
219 %\agcode{{\bs}b}&backspace\\ | |
220 %\agcode{{\bs}f}&formfeed\\ | |
221 %\agcode{{\bs}n}&newline\\ | |
222 %\agcode{{\bs}r}&carriage return\\ | |
223 %\agcode{{\bs}t}&horizontal tab\\ | |
224 %\agcode{{\bs}v}&vertical tab\\ | |
225 %\agcode{{\bs\bs}}&backslash\\ | |
226 %\agcode{{\bs}?}&question mark\\ | |
227 %\agcode{{\bs}'}&single quote\\ | |
228 %\agcode{{\bs}"}&double quote\\ | |
229 %\agcode{{\bs}ooo}&octal number\\ | |
230 %\agcode{{\bs}xhh}&hexadecimal number\\ | |
231 %\end{tabular} | |
232 | |
233 \begin{indenting}{0.4in} | |
234 \begin{tabular}{llllll} | |
235 \agcode{{\bs}a}&alert (bell) character& | |
236 \agcode{{\bs}t}&horizontal tab& | |
237 \agcode{{\bs}'}&single quote\\ | |
238 \agcode{{\bs}b}&backspace& | |
239 \agcode{{\bs}v}&vertical tab& | |
240 \agcode{{\bs}"}&double quote\\ | |
241 \agcode{{\bs}f}&formfeed& | |
242 \agcode{{\bs\bs}}&backslash& | |
243 \agcode{{\bs}\textit{ooo}}&octal number\\ | |
244 \agcode{{\bs}n}&newline& | |
245 \agcode{{\bs}?}&question mark& | |
246 \agcode{{\bs}x\textit{hh}}&hexadecimal number\\ | |
247 \agcode{{\bs}r}&carriage return\\ | |
248 \end{tabular} | |
249 \end{indenting} | |
250 \bigskip | |
251 | |
252 The octal escape sequence allows up to three octal digits, in | |
253 accordance with ANSI specifications for C. The hexadecimal numbers | |
254 may contain an arbitrary number of digits; however, AnaGram will | |
255 truncate the result to sixteen bits. | |
256 | |
257 A backslash followed by any character other than those listed above | |
258 will cause a syntax error. | |
259 | |
260 You may also represent characters by writing the numeric code | |
261 explicitly, in decimal, octal, or hexadecimal representations. | |
262 AnaGram follows the C conventions for integer constants: a leading | |
263 \agcode{0} means the number is octal, a leading \agcode{0x} or | |
264 \agcode{0X} means it is hexadecimal. The hex digits \agcode{a-f} may | |
265 be either upper or lower case\index{Case sensitivity}. Numbers may be | |
266 preceded by an optional minus sign. | |
267 | |
268 If your parser uses a pre-existing \index{Lexical scanner}lexical | |
269 scanner and you wish to use the code numbers it generates to identify | |
270 tokens, you may simply treat those code numbers as character numbers. | |
271 You may use the numbers directly in your productions, or you may use | |
272 definition statements to name them. You may also use an | |
273 \agparam{enum} statement within a configuration section to attach | |
274 names to the code numbers. | |
275 % XXX shouldn't this use of enum be indexed? | |
276 | |
277 AnaGram also allows a special notation for control characters. You | |
278 may represent a control character by using the ``\^{}'' character | |
279 preceding any printing ASCII character. Thus you can write | |
280 \agcode{\^{}z} or \agcode{\^{}Z} to represent the DOS end-of-file | |
281 character. Notice that quotation marks are not necessary. | |
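The numeric value of a control character written with a letter can be sketched in C as follows. The sketch uses the conventional ASCII mapping, in which control-A is 1 and control-Z is 26, and covers letters only; how AnaGram maps other printing characters is not described here, so the helper \agcode{control{\us}code} is an illustration, not AnaGram's implementation.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper: code of the control character written as
   '^' followed by a letter. Upper and lower case give the same code. */
int control_code(int letter)
{
    assert(letter >= 0 && letter <= 127 && isalpha(letter));
    return toupper(letter) - 'A' + 1;
}
```

With this mapping, both spellings of control-Z evaluate to 26 and control-J evaluates to 10, the newline code.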
282 | |
283 Examples of character representations: | |
284 | |
285 \begin{indenting}{0.4in} | |
286 \begin{tabular}{cccc} | |
287 \agcode{'K'}&\agcode{-1}&\agcode{0}&\agcode{'{\bs}t'}\\ | |
288 \agcode{\^{}J}&\agcode{'{\bs}xff'}&\agcode{077}&\agcode{0XF3}\\ | |
289 \end{tabular} | |
290 \end{indenting} | |
291 | |
292 \subsection{Character Ranges} | |
293 \index{Character range}\index{Range} | |
294 | |
295 It is convenient to be able to specify ranges of characters when | |
296 writing a grammar. AnaGram supports several ways of representing | |
297 ranges of characters. The first is an extension of the notation for | |
298 character constants: \agcode{'a-z'} is the set of lower case | |
299 characters. You can even use escape sequences such as | |
300 \agcode{'{\bs}n-{\bs}r'} if you like. The order of | |
301 characters used to specify the range is immaterial: \agcode{'z-a'} is | |
302 the same as \agcode{'a-z'}. AnaGram will, however, issue a warning | |
303 just in case the unusual order results from a clerical error. | |
304 | |
305 The second way to specify a range is by using two arbitrary character | |
306 representations, as described above, separated by two dots. For | |
307 example, \agcode{\^{}C..\^{}Z}, \agcode{3..26}, \agcode{3..032}, | |
308 \agcode{3..0x1a}, and \agcode{\^{}C..0x1a}, all represent the same | |
309 range of characters. Similarly, \agcode{'A-F'}, \agcode{'A'..'F'}, | |
310 \agcode{0101..0106}, \agcode{0x41..0x46}, \agcode{65..70}, and | |
311 \agcode{65..'F'} all represent the same range of characters. | |
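The equivalences above follow from ordinary C integer and character constants, as the following sketch shows. It assumes an ASCII execution character set; the type \agcode{char{\us}range} and the functions \agcode{make{\us}range} and \agcode{same{\us}range} are hypothetical names introduced for illustration. The \agcode{reversed} flag stands in for the warning AnaGram issues when the end points are given in the unusual order.

```c
#include <assert.h>

/* Hypothetical representation of a character range. */
typedef struct {
    int lo, hi;      /* inclusive end points, lo <= hi */
    int reversed;    /* nonzero if the end points were given as hi, lo */
} char_range;

/* Build a range from two character codes given in either order. */
char_range make_range(int a, int b)
{
    char_range r;
    r.reversed = (a > b);
    r.lo = r.reversed ? b : a;
    r.hi = r.reversed ? a : b;
    assert(r.lo <= r.hi);
    return r;
}

/* Two ranges are the same if their end points agree. */
int same_range(char_range x, char_range y)
{
    return x.lo == y.lo && x.hi == y.hi;
}
```

Because the end points are plain integers, it makes no difference whether they are written in decimal, octal, hexadecimal, or as character constants.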
312 | |
313 \subsection{Character Sets} | |
314 \index{Character sets} | |
315 | |
316 If you provide explicit definitions for terminal tokens, the basic | |
317 input unit for your parser will be considered a character set, even if | |
318 your input procedure provides numeric codes that are not actually | |
319 characters. As a terminal token, a character set will be matched by | |
320 any input character that is a member of the set. Character sets may | |
321 be named in definition statements, but they may also appear on the | |
322 right sides of productions without being named. | |
323 | |
324 A character set may consist of one or more characters. You can | |
325 specify a character set that consists of a single character by using | |
326 any of the character representation methods described above. You can | |
327 specify a set consisting of a range of characters by using any of the | |
328 representations of character ranges described above. | |
329 \index{Character sets} | |
330 To specify more complicated sets, you can write | |
331 \index{Expressions}\index{Set expressions}expressions | |
332 using conventional set theoretic operations. | |
333 In AnaGram input, these operations are specified as follows: | |
334 | |
335 \index{Union}\index{Difference}\index{Intersection}\index{Complement} | |
336 \begin{indenting}{0.4in} | |
337 \begin{tabular}{cl} | |
338 \agcode{A + B}&(union)\\ | |
339 \agcode{A - B}&(difference)\\ | |
340 \agcode{A \& B}&(intersection)\\ | |
341 \agcode{\~{}A}&(complement)\\ | |
342 \end{tabular} | |
343 \end{indenting} | |
344 | |
345 where \agcode{A} and \agcode{B} are arbitrary sets. Union and | |
346 difference have the same precedence. Intersection has higher | |
347 precedence and complement has the highest precedence. Thus in the | |
348 expression | |
349 | |
350 \begin{indentingcode}{0.4in} | |
351 A + \~{}B\&C | |
352 \end{indentingcode} | |
353 | |
354 the complement operation is performed first, then the intersection, | |
355 and finally the union. | |
356 | |
357 Watch out! In an AnaGram syntax file \agcode{65 + 97} represents the | |
358 character set which consists of lower case \agcode{a} and upper case | |
359 \agcode{A}. It does not represent 162, the sum of 65 and 97. | |
360 | |
361 Parentheses may be used to force the order of evaluation: | |
362 | |
363 \begin{indentingcode}{0.4in} | |
364 \~{}(A \& (B+C)) | |
365 \end{indentingcode} | |
366 | |
367 In this example the union of \agcode{B} and \agcode{C} is calculated, | |
368 then the intersection of this set with \agcode{A} is calculated, and | |
369 finally the complement is evaluated. | |
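The set operations can be modeled in C with one membership flag per character code. This is only a sketch under stated assumptions: the universe is taken to be the 256 unsigned 8-bit codes, and the type \agcode{charset} and the \agcode{cs{\us}} functions are hypothetical names, not part of AnaGram.

```c
#include <assert.h>
#include <string.h>

#define UNIVERSE 256   /* assumed universe: unsigned 8-bit characters */

typedef struct { unsigned char has[UNIVERSE]; } charset;

/* The set of all codes from lo to hi, inclusive. */
charset cs_range(int lo, int hi)
{
    charset s;
    assert(0 <= lo && lo <= hi && hi < UNIVERSE);
    memset(s.has, 0, sizeof s.has);
    for (int c = lo; c <= hi; ++c) s.has[c] = 1;
    return s;
}

charset cs_union(charset a, charset b)          /* A + B */
{
    for (int c = 0; c < UNIVERSE; ++c) a.has[c] |= b.has[c];
    return a;
}

charset cs_difference(charset a, charset b)     /* A - B */
{
    for (int c = 0; c < UNIVERSE; ++c) a.has[c] &= !b.has[c];
    return a;
}

charset cs_intersection(charset a, charset b)   /* A & B */
{
    for (int c = 0; c < UNIVERSE; ++c) a.has[c] &= b.has[c];
    return a;
}

charset cs_complement(charset a)                /* ~A */
{
    for (int c = 0; c < UNIVERSE; ++c) a.has[c] = !a.has[c];
    return a;
}

/* Number of characters in the set. */
int cs_count(charset a)
{
    int n = 0;
    for (int c = 0; c < UNIVERSE; ++c) n += a.has[c];
    return n;
}
```

In this model the union of the single-character sets 65 and 97 has exactly two members, and it does not contain 162. An expression that combines a union with the intersection of a complement is evaluated by nesting the calls in precedence order: complement innermost, then intersection, then union.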
370 | |
371 The computation of the \index{Complement}complement of a | |
372 \index{Character sets}set requires a definition of the | |
373 \index{Universe}universe of set elements. AnaGram will define the | |
374 universe to be the set of unsigned 8-bit characters, unless one or | |
375 more characters outside that range have been specified. In that case, | |
376 the universe will consist of all characters on the interval defined by | |
377 the lesser of zero and the lowest character code used and the greater | |
378 of 255 and the highest character code used. The complement of a | |
379 character set is everything in this universe except the characters in | |
380 the set. | |
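The rule for the bounds of the universe can be written out as a small C function. The names \agcode{universe} and \agcode{character{\us}universe} are hypothetical; the sketch simply restates the rule above and makes no claim about how AnaGram computes it internally.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int lo, hi; } universe;   /* inclusive bounds */

/* Given every character code used in a grammar, return the bounds of
   the character universe: from the lesser of 0 and the lowest code
   to the greater of 255 and the highest code. */
universe character_universe(const int *codes, size_t n)
{
    universe u = { 0, 255 };
    assert(codes != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (codes[i] < u.lo) u.lo = codes[i];
        if (codes[i] > u.hi) u.hi = codes[i];
    }
    return u;
}
```

For example, a grammar that uses \agcode{-1} as an end of file indicator has a universe running from -1 to 255, so the complement of a set in that grammar includes -1 unless the set itself contains it.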
381 | |
382 Characters which make up part of the character universe, but are not | |
383 legitimate input according to your grammar, are lumped together into a | |
384 special token which will cause an error if it occurs in your input. | |
385 | |
386 When your parser reads an input character, it uses that character to | |
387 index a conversion table in order to determine the appropriate | |
388 \index{Token number}\index{Token}\index{Number}token number. If the | |
389 \index{Range}\index{Test range}\index{Configuration switches} | |
390 \agparam{test range} configuration switch | |
391 is on, its default setting, your parser will include code to verify | |
392 that the character is in bounds before it indexes the conversion | |
393 table. If you are satisfied that checking bounds is unnecessary, you | |
394 may turn the \agparam{test range} switch off and get a slightly higher | |
395 level of performance from your parser. | |
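The kind of bounds test involved might look like the following C sketch. The layout of the conversion table in a generated parser is not shown in this chapter, so the structure \agcode{conversion{\us}table}, the function \agcode{lookup{\us}token}, and the \agcode{invalid{\us}token} argument are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical conversion table: table[0] holds the token number for
   the character code first_code, table[1] for first_code + 1, etc. */
typedef struct {
    int first_code;
    size_t size;
    const unsigned char *table;
} conversion_table;

/* Map an input code to a token number. The range test corresponds to
   the check that the parser performs when test range is on. */
int lookup_token(const conversion_table *ct, int code, int invalid_token)
{
    long index;
    assert(ct != NULL && ct->table != NULL);
    index = (long)code - ct->first_code;
    if (index < 0 || (size_t)index >= ct->size)
        return invalid_token;          /* code is out of bounds */
    return ct->table[index];
}
```

Without the range test, an out-of-bounds code would index memory outside the table, which is why the switch should be turned off only when the input procedure is known to deliver codes within the character universe.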
396 | |
397 For efficient processing, it is well to keep the number of tokens to a | |
398 minimum. Therefore if you have a choice between defining a construct | |
399 as a token, with a production, or a set, with a definition, the set is | |
400 to be preferred. | |
401 | |
402 Some useful character sets are: | |
403 | |
404 \begin{indenting}{0.4in} | |
405 \begin{tabular}{ll} | |
406 \agcode{'a-z' + 'A-Z'}&Alphabetic characters\\ | |
407 \agcode{'0-9' + 'a-f' + 'A-F'}&Hex digits\\ | |
408 \agcode{'0-9'}&Decimal digits\\ | |
409 \agcode{0..127}&ASCII character set\\ | |
410 \agcode{32..126}&Printing ASCII characters\\ | |
411 \agcode{\~{}'{\bs}n'}&Anything but newline\\ | |
412 \agcode{\^{}Z}&Windows/DOS end of file indicator\\ | |
413 \agcode{-1}&Stream I/O end of file indicator\\ | |
414 \agcode{0}&String terminator\\ | |
415 \agcode{33..126 - 'a-z' - 'A-Z' - '0-9'}&Punctuation\\ | |
416 \end{tabular} | |
417 \end{indenting} | |
418 \bigskip | |
419 % The punctuation set starts at 33 so that it excludes the space character (32). | |
420 | |
421 Note that \agcode{'a-z'} is a range of characters but | |
422 \agcode{32..126 - 'a-z'} is a set difference. | |
423 | |
424 When AnaGram encounters a character set in a grammar rule, it assigns | |
425 a token number to the character set. If it has previously seen the | |
426 same character set it will assign the same token number; however, it | |
427 assigns the same token number only if the set expressions are | |
428 obviously the same. Thus, AnaGram will assign the same token number | |
429 every time it sees \agcode{A + B}, but will assign a different token | |
430 number if it sees \agcode{B + A}. Only when AnaGram has finished | |
431 scanning the entire syntax file can it actually evaluate the character | |
432 sets. If it finds that several different tokens all refer to the same | |
433 character set, it will create a single token that represents the true | |
434 character set and create | |
435 \index{Shell productions}\index{Production}``shell productions'' for | |
436 the others. | |
437 | |
438 \index{Character sets}If the character sets you use in your grammar | |
439 overlap, they do not properly represent | |
440 \index{Terminal token}\index{Token}terminal tokens. | |
441 To deal with this situation, AnaGram identifies all overlaps among | |
442 character sets and extends your grammar by adding a number of extra | |
443 productions. For instance, suppose your grammar uses the following | |
444 character sets as though they were terminal tokens: | |
445 | |
446 \begin{indentingcode}{0.4in} | |
447 'a-z' + 'A-Z' | |
448 '0-9' | |
449 '0-7' | |
450 'a-f' + 'A-F' | |
451 \end{indentingcode} | |
452 | |
453 AnaGram will then modify your grammar by adding the following productions: | |
454 | |
455 \begin{indentingcode}{0.4in} | |
456 'a-z' + 'A-Z' | |
457 -> 'a-f' + 'A-F' | 'g-z' + 'G-Z' | |
458 | |
459 '0-9' | |
460 -> '0-7' | '8-9' | |
461 \end{indentingcode} | |
462 | |
463 Although the tokens \agcode{'a-z' + 'A-Z'} and \agcode{'0-9'} are | |
464 technically now | |
465 \index{Nonterminal token}\index{Token}nonterminal tokens, | |
466 for purposes of determining the | |
467 \index{Token}\index{Data type}data type of their | |
468 \index{Semantic value}\index{Token}\index{Value}semantic values, | |
469 AnaGram continues to regard them as terminal tokens. | |
470 | |
471 This \index{Partition}\index{Universe}\index{Character universe} | |
472 ``partitioning'' of the character universe is described in Chapter 6. | |
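One standard way to find the disjoint pieces of a collection of overlapping sets is to label each character with the combination of sets that contain it; characters with the same label belong to the same piece. The C sketch below applies this to the four sets of the example. It assumes an ASCII character set and an 8-bit universe, and it is not necessarily the algorithm AnaGram uses; all names in it are hypothetical.

```c
#include <assert.h>

#define UNIVERSE 256
#define MAX_SETS 8

typedef int (*set_pred)(int c);   /* membership test for one set */

/* The four character sets of the example. */
int is_letter(int c)     { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
int is_digit(int c)      { return c >= '0' && c <= '9'; }
int is_octal(int c)      { return c >= '0' && c <= '7'; }
int is_hex_letter(int c) { return (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'); }

/* Store in sig[c] a bit mask of the sets that contain c (bit i for
   sets[i]). Return the number of distinct nonzero masks, that is,
   the number of disjoint pieces the sets break into. */
int partition(const set_pred *sets, int nsets, unsigned char sig[UNIVERSE])
{
    int seen[1 << MAX_SETS] = { 0 };
    int pieces = 0;
    assert(nsets > 0 && nsets <= MAX_SETS);
    for (int c = 0; c < UNIVERSE; ++c) {
        unsigned s = 0;
        for (int i = 0; i < nsets; ++i)
            if (sets[i](c)) s |= 1u << i;
        sig[c] = (unsigned char)s;
        if (s != 0 && !seen[s]) {
            seen[s] = 1;
            ++pieces;
        }
    }
    return pieces;
}
```

For the four example sets this yields four pieces: the letters a through f in either case, the remaining letters, the digits 0 through 7, and the digits 8 and 9. Each original set is a union of one or more of these pieces.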
473 | |
474 \subsection{Keyword Strings} | |
475 \index{Keywords} | |
476 | |
477 In your grammar, AnaGram recognizes character strings within double | |
478 quotes (e.g., \agcode{"IF"}) as keywords. The strings follow the same | |
479 syntactic rules as strings in C. The same escape sequences are | |
480 honored. AnaGram does not, however, allow for the concatenation of | |
481 adjacent strings. Note that AnaGram strings are used only for the | |
482 definition of keywords in your grammar, not for messages to be | |
483 displayed or printed. | |
484 | |
485 Keyword strings may not include null characters and must be at least | |
486 one character long. You may have any number of keywords. Each is | |
487 treated as a single terminal token. A keyword may be given a name by | |
488 means of a definition statement. Keywords may appear in virtual | |
489 productions. | |
490 | |
491 AnaGram's keyword recognition works in the following way. First, for | |
492 each state in your parser, AnaGram prepares a list of all the keywords | |
493 that are admissible in that state. Your parser will recognize a | |
494 keyword \emph{only} if it is in an appropriate state; otherwise it | |
495 will appear to be an anonymous sequence of characters. Your parser, | |
496 in any state, checks for keywords it expects before it checks for | |
497 acceptable characters. That is, \emph{keywords take precedence} over | |
498 simple characters. The parser does not look for keywords that would not be | |
499 acceptable input. The parser will do whatever lookahead is necessary | |
500 in order to pick up the entire keyword. Thus if the character | |
501 \agcode{I} and the keyword \agcode{IF} are both legitimate input at | |
502 some point, \agcode{IF} will be recognized, if present, in preference | |
503 to \agcode{I}. If several admissible keywords match the input, such | |
504 as \agcode{IF} and \agcode{IFF}, the parser will select the longest | |
505 match, \agcode{IFF} in this example. | |
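The longest-match rule can be sketched in C as a search over the keywords admissible in the current state. This is an illustration only: a generated parser presumably uses precomputed tables and not a linear search, and the function \agcode{match{\us}keyword} and its arguments are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the longest keyword in admissible[0..n-1] that matches the
   beginning of input, or NULL if none matches. Only the keywords
   admissible in the current parser state should be passed in. */
const char *match_keyword(const char *input,
                          const char *const *admissible, size_t n)
{
    const char *best = NULL;
    size_t best_len = 0;
    assert(input != NULL && (admissible != NULL || n == 0));
    for (size_t i = 0; i < n; ++i) {
        size_t len = strlen(admissible[i]);
        if (len > best_len && strncmp(input, admissible[i], len) == 0) {
            best = admissible[i];
            best_len = len;
        }
    }
    return best;
}
```

When no keyword matches, the function returns \agcode{NULL} and the parser would fall back to treating the next input character as a simple character.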
506 | |
507 AnaGram does not incorporate keywords into its character sets. | |
508 Keywords stand apart and should not appear in definitions of character | |
509 sets. In particular, they are not considered as belonging to the | |
510 complement of a character set. Thus for the production | |
511 | |
512 \begin{indentingcode}{0.4in} | |
513 next char -> \~{}('{\bs}n' + \^{}Z) | |
514 \end{indentingcode} | |
515 a keyword would not be considered legitimate input. | |
516 | |
517 Note also that a keyword consisting of a single character does not | |
518 belong to the character universe. Because of this fact, AnaGram's | |
519 treatment of \agcode{'X'} and \agcode{"X"} is very different. If this | |
520 seems confusing at first, try using only keywords which are at least | |
521 two characters long until you have some experience with them. | |
522 | |
523 AnaGram's keyword recognition logic normally does not make any | |
524 assumptions about what precedes or follows a keyword. Thus if | |
525 \agcode{int} is a keyword, your parser will be capable of plucking it | |
526 out of a string of characters such as \agcode{disintegrate} if, | |
527 according to your grammar, it could follow \agcode{dis}. The | |
528 \agparam{sticky} declaration and the \agparam{distinguish keywords} | |
529 statement, described below, can prevent such unwanted recognition of | |
530 keywords. A keyword following a \agparam{sticky} token will not be | |
531 recognized if the first character of the keyword can be shifted in as | |
532 part of the \agparam{sticky} token. The \agparam{distinguish | |
533 keywords} statement prevents recognition of a keyword if it is | |
534 followed immediately by a character of the sort that makes up the | |
535 keyword. | |
536 | |
537 \subsection{Type Specifications For Tokens} | |
538 \index{Token}\index{Token type}\index{Type declarations} | |
539 | |
540 When you write productions or token declarations (see below), AnaGram | |
541 allows you to specify the data type\index{Token}\index{Data type} of | |
542 the \index{Semantic value}\index{Token}\index{Value}semantic value of | |
543 a token by using a C or C++ data type specification. The restrictions | |
544 are that AnaGram does not allow specification of array or function | |
545 types, nor explicit structure types. Types that are defined with | |
546 typedef statements, structure definitions, or class definitions, | |
547 including template classes, in your embedded C or C++ are acceptable. | |
548 Thus the following specifications, for example, are acceptable: | |
549 | |
550 \begin{indentingcode}{0.4in} | |
551 void | |
552 int | |
553 char * | |
554 unsigned long *near | |
555 static float *far | |
556 my{\us}type | |
557 double * | |
558 struct descriptor | |
559 struct widget * | |
560 vector <double> * | |
561 \end{indentingcode} | |
562 | |
563 On the other hand, the following specifications are \emph{not} valid: | |
564 | |
565 \begin{indentingcode}{0.4in} | |
566 int[20] | |
567 int *(int, unsigned char) | |
568 \bra int x,y; float z; \ket | |
569 struct \bra int k; float z; \ket | |
570 \end{indentingcode} | |
571 | |
572 Note that AnaGram itself does nothing with the type specifications. It | |
573 simply passes them on to your compiler as appropriate. | |
574 | |
575 \subsection{Productions} | |
576 \index{Production} | |
577 | |
578 Productions are the basic units of a grammar. A production consists | |
579 of a left side and a right side. \index{Left side}The left side of a | |
580 production consists of one or more token names, joined by commas, | |
581 optionally preceded by a type specification enclosed in parentheses. | |
582 \index{Right side}The right side begins with an arrow and may begin | |
583 either on the same line as the left side or on a new line. For | |
584 example: | |
585 | |
586 \begin{indentingcode}{0.4in} | |
587 program -> statement list, eof | |
588 expression | |
589 -> expression, plus, term | |
590 | |
591 (int) variable name, function name | |
592 -> name:n = look{\us}up(n); | |
593 \end{indentingcode} | |
594 | |
595 The part of the right side of a production following the arrow is | |
596 called a \index{Grammar rule}\index{Rule}\agterm{grammar rule}, | |
597 discussed below. A production need not have a right side at all. In | |
598 this case, it is simply called a | |
599 \index{Declaration}\index{Token}\agterm{token declaration}. | |
600 AnaGram assigns | |
601 \index{Token number}\index{Token}\index{Number}token numbers | |
602 to the token names on the left side, and, if there is a type | |
603 specification, records the data type for each of the tokens declared. | |
604 Declarations of this sort are most useful when using input from a | |
605 \index{Lexical scanner}lexical scanner. See Chapter 9 for a discussion | |
606 of techniques for interfacing a lexical scanner to your parser. If | |
607 you do not intend to use a lexical scanner you will have no need for | |
608 token declarations. | |
609 | |
610 If you do not explicitly specify the type for the | |
611 \index{Semantic value}\index{Token}\index{Value}semantic value | |
612 of a token, it will be determined by the configuration parameter | |
613 \index{Default token type}\index{Configuration parameters}\index{Token} | |
614 \agparam{default token type} | |
615 if it is a \index{Nonterminal token}\index{Token}nonterminal token or | |
616 by the \index{Configuration parameters}configuration parameter | |
617 \index{Input token type}\index{Default input type}\agparam{default input type} | |
618 if it is a \index{Token}terminal token. | |
619 \agparam{Default token type} defaults to \agcode{void}. | |
620 \agparam{Default input type} defaults to \agcode{int}. | |
621 | |
622 If a production has more than one token on the left side, as in the | |
623 third example above, it is called a | |
624 \index{Semantically determined production}\index{Production} | |
625 \agterm{semantically determined production}. Semantically determined | |
626 productions are a useful tool for exerting semantic control over | |
627 syntactic analysis. A semantically determined production should have | |
628 a reduction procedure which determines on a case-by-case basis which | |
629 of the tokens on the left side should be taken as the reduction token. | |
630 If there is no reduction procedure, or if the reduction procedure does | |
631 not make a choice, the reduction token will be the first syntactically | |
632 correct token on the left side of the production. In the example | |
633 above, \agcode{variable name} will be the reduction token unless | |
634 \agcode{look{\us}up} changes it to \agcode{function name}. Semantically | |
635 determined productions are discussed more fully in Chapter 9. | |
636 | |
637 If several productions have the same left side, it does not need to be | |
638 repeated. Subsequent right hand sides must each start on a new | |
639 line. For example: | |
640 | |
641 \begin{indentingcode}{0.4in} | |
642 integer | |
643 -> digit | |
644 -> integer, digit | |
645 | |
646 name | |
647 -> letter | |
648 -> name, letter | |
649 -> name, digit | |
650 \end{indentingcode} | |
651 | |
652 On the other hand, you do not have to group productions with the same | |
653 left side. You could write the above productions as follows, although | |
654 it would certainly not be good programming practice: | |
655 | |
656 \begin{indentingcode}{0.4in} | |
657 name -> name, digit | |
658 integer -> integer, digit | |
659 name -> name, letter | |
660 integer -> digit | |
661 name -> letter | |
662 \end{indentingcode} | |
663 | |
664 Nevertheless, there are a few occasions involving complex cross | |
665 recursions and semantically determined productions where it is not | |
666 possible to group productions neatly. | |
667 | |
668 The right side of a production can be empty. Such a production is | |
669 called a | |
670 \index{Null productions}\index{Production}\agterm{null production}. | |
671 Null productions are useful to denote an optional element in a | |
672 grammar, or a list that may be empty. For example: | |
673 | |
674 \begin{indentingcode}{0.4in} | |
675 optional widget | |
676 -> | |
677 -> widget | |
678 | |
679 optional qualifiers | |
680 -> | |
681 -> optional qualifiers, qualifier | |
682 \end{indentingcode} | |
683 | |
684 A second way to write multiple productions with the same left side | |
685 uses the \index{Vertical bar}\index{|}vertical bar character, ``$|$'', | |
686 to separate the grammar rules. The productions given above for | |
687 \agcode{name}, \agcode{optional widget}, and \agcode{optional | |
688 qualifiers} can also be written: | |
689 | |
690 \begin{indentingcode}{0.4in} | |
691 name -> letter | name, letter | name, digit | |
692 optional widget | |
693 -> | widget | |
694 | |
695 optional qualifiers | |
696 -> | optional qualifiers, qualifier | |
697 \end{indentingcode} | |
698 | |
699 Note that a null production cannot \emph{follow} a vertical bar. | |
700 | |
701 A token that has a null production is called a | |
702 \index{Zero length token}\index{Token}\agterm{zero length token}, | |
703 since it can be represented by an empty sequence of input characters, | |
704 that is to say, by nothing at all. Furthermore, even if a token | |
705 doesn't have any null productions, if it has at least one rule | |
706 consisting entirely of zero length tokens it is also a zero length | |
707 token. In the Token Table window, AnaGram notes which tokens are zero | |
708 length, because they can be a source of conflicts. | |
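
The rule just stated is recursive, so one way to picture it is as a
marking pass that repeats until nothing new is marked. The following C
sketch is only an illustration of that idea, not AnaGram's own
implementation. The \agcode{production} structure, the token numbers,
and the \agcode{header} token are assumptions made up for the example.

```c
#include <string.h>

#define MAX_RULE_LEN 4

/* Hypothetical token numbers; HEADER is invented for this sketch. */
enum { WIDGET, QUALIFIER, OPT_WIDGET, OPT_QUALIFIERS, HEADER, NTOKENS };

/* One production: a left side token and its rule elements.
   len == 0 marks a null production. */
struct production {
    int lhs;
    int len;
    int rhs[MAX_RULE_LEN];
};

static const struct production example[] = {
    { HEADER,         2, { OPT_WIDGET, OPT_QUALIFIERS } },
    { OPT_WIDGET,     0, { 0 } },
    { OPT_WIDGET,     1, { WIDGET } },
    { OPT_QUALIFIERS, 0, { 0 } },
    { OPT_QUALIFIERS, 2, { OPT_QUALIFIERS, QUALIFIER } },
};

/* Set zero_length[t] to 1 if token t has a null production or a rule
   made up entirely of zero length tokens, and to 0 otherwise. */
static void find_zero_length(const struct production *p, int nprod,
                             int *zero_length, int ntokens)
{
    int changed = 1;
    memset(zero_length, 0, (size_t)ntokens * sizeof *zero_length);
    while (changed) {                 /* repeat until nothing new is marked */
        changed = 0;
        for (int i = 0; i < nprod; i++) {
            int all_zero = 1;
            if (zero_length[p[i].lhs])
                continue;             /* already marked */
            for (int j = 0; j < p[i].len; j++) {
                if (!zero_length[p[i].rhs[j]]) {
                    all_zero = 0;
                    break;
                }
            }
            if (all_zero) {
                zero_length[p[i].lhs] = 1;
                changed = 1;
            }
        }
    }
}
```

In this sketch \agcode{header} has no null production of its own, yet
it ends up marked because its only rule consists of two zero length
tokens. It is listed first so that a second pass is needed to mark it.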
709 | |
710 \subsection{Grammar Token} | |
711 | |
712 Every grammar must have a single token which produces the entire | |
713 grammar. This token is variously called the | |
714 \index{Token}\index{Grammar token}\agterm{grammar token}, the | |
715 \index{Goal token}\agterm{goal token} or the | |
716 \index{Start token}\agterm{start token}. | |
717 AnaGram provides several methods you may use to specify which token in | |
718 your grammar is the grammar token. | |
719 | |
720 You may simply use the name \agcode{grammar} for the grammar token. | |
721 If you wish to use some other more descriptive name for your grammar | |
722 token, you may mark it with a following dollar sign when it appears on | |
723 the left side of a production. Alternatively, you may set the | |
724 \index{Grammar token}\index{Configuration parameters}\agparam{grammar token} | |
725 configuration parameter to specify the grammar token. Here are | |
726 examples of the methods: | |
727 | |
728 \begin{indentingcode}{0.4in} | |
729 grammar | |
730 -> [statement | newline]/... | |
731 | |
732 program \$ | |
733 -> [statement | newline]/... | |
734 | |
735 {}[ grammar token = program ] | |
736 program | |
737 -> [statement | newline]/... | |
738 \end{indentingcode} | |
739 | |
740 If you should use more than one of these techniques, AnaGram resolves | |
741 the issue in the following manner: A marked token or a configuration | |
742 parameter setting always takes precedence over simply naming a token | |
743 \agcode{grammar}. If you mark more than one token or set the | |
744 configuration parameter more than once, the last setting or mark wins. | |
745 | |
746 \subsection{Grammar Rules} | |
747 \index{Rule}\index{Grammar rule} | |
748 | |
749 The part of a production to the right of the arrow is more often | |
750 called a \agterm{grammar rule}, or simply \agterm{rule}. A grammar | |
751 rule is a sequence of \index{Rule elements}\agterm{rule elements}, | |
752 joined by commas, as in the examples of productions given above. Rule | |
753 elements are token names, character set expressions, virtual | |
754 productions, or immediate actions (see below). Each rule element may | |
755 be optionally followed by a parameter assignment. The entire rule may | |
756 be followed by an optional reduction procedure. A \index{Parameter | |
757 assignment}parameter assignment is a colon followed by a C variable | |
758 name. Here are some examples of rule elements with parameter | |
759 assignments: | |
760 | |
761 \begin{indentingcode}{0.4in} | |
762 '0-9':d | |
763 integer:n | |
764 expression:x | |
765 declaration:declaration{\us}descriptor | |
766 \end{indentingcode} | |
767 | |
768 The parameters you assign to tokens in your grammar rule become the | |
769 formal parameters for your \index{Reduction procedure}reduction | |
770 procedure. The data type\index{Data type}\index{Reduction procedure | |
771 arguments} of the parameter is determined by the data type for the | |
772 semantic value of the token to which it is assigned. If your grammar | |
773 rule has parameter assignments, but does not have a reduction | |
774 procedure, AnaGram will give you a warning in case the lack of a | |
775 reduction procedure is an oversight. If you don't need a reduction | |
776 procedure you may safely ignore the warning. On the other hand, | |
777 AnaGram has no way to determine whether you have failed to make | |
778 necessary parameter assignments. You won't find out until you compile | |
779 your parser, when your compiler will give you error messages for | |
780 undefined symbols. | |
781 | |
782 AnaGram assigns a unique rule number to each rule in your grammar. | |
783 Rules are numbered sequentially as they are encountered in the syntax | |
784 file. AnaGram constructs rule zero itself. Rule zero normally has a | |
785 single element, the grammar token, unless you have a | |
786 \agparam{disregard} statement in your grammar. In this case there | |
787 will be two elements. | |
788 | |
789 \subsection{Reduction Procedures} | |
790 \index{Reduction procedure} | |
791 | |
792 % XXX somewhere in here it ought to say something like | |
793 % ``in the parsing literature reduction procedures are often known as | |
794 % \agterm{semantic actions}.'' | |
795 % Note that R. says there's some subtle difference between the usual | |
796 % concept of semantic action and AG's concept of reduction procedure. | |
797 % I don't know what this difference is and I hope she can recall it. | |
798 % | |
799 % D. thinks this note ought to be at the end; R. wants it at the top. | |
800 | |
801 A \agterm{reduction procedure} is a piece of C code which optionally | |
802 follows a production. The code is executed when your parser | |
803 identifies the production in its input. There are two forms for | |
804 reduction procedures, a short form and a long form. The short form | |
805 consists of a single C expression. The long form consists of an | |
806 arbitrary block of C code. When AnaGram builds a parser, it inspects | |
807 the grammar rule to which the procedure is attached and identifies the | |
808 parameters for the procedure. It uses these parameters as the formal | |
809 parameters for the procedure. | |
810 If the | |
811 \index{Macros}\index{Allow macros}\index{Configuration switches} | |
812 \agparam{allow macros} | |
813 configuration switch has not been turned off, AnaGram codes the | |
814 reduction procedure as a macro definition whenever possible. | |
815 Otherwise AnaGram codes it as a function definition. AnaGram builds | |
816 the name for a reduction procedure by appending its internal procedure | |
817 number to the string \agcode{ag{\us}rp{\us}}. Thus reduction procedures are | |
818 numbered in the order in which they are encountered in the syntax | |
819 file. | |
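
To make the difference between the two codings concrete, here is a
rough C sketch of a short form procedure written once as a macro and
once as a function. It is not a copy of what AnaGram generates. The
names \agcode{rp{\us}as{\us}macro}, \agcode{rp{\us}as{\us}function}, and
\agcode{fold{\us}digits} are hypothetical, the parameter types are
assumed to be \agcode{int}, and pairing the expression with a rule of
the form \agcode{integer:n, digit:d} is also an assumption.

```c
/* The short form procedure  =10*n + d-'0';  coded as a macro.
   The arguments are parenthesized here for safety; generated code
   may well look different. */
#define rp_as_macro(n, d) (10*(n) + (d)-'0')

/* The same procedure coded as a function. The formal parameters
   correspond to the parameter assignments in the grammar rule. */
static int rp_as_function(int n, int d)
{
    return 10*n + d-'0';
}

/* Apply the procedure once per digit, starting from zero, to show
   how repeated reductions build up the value of an integer. */
static int fold_digits(const char *s)
{
    int value = 0;
    for (; *s >= '0' && *s <= '9'; s++)
        value = rp_as_function(value, *s);
    return value;
}
```

Either coding yields the same value for the reduction token. The
macro form simply avoids the overhead of a function call.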
820 | |
821 Both long and short form reduction procedures are preceded by an equal | |
822 sign which follows the production. The short form consists of a C or | |
823 C++ expression terminated by a semicolon. When the grammar rule is | |
824 reduced, the expression will be evaluated and its value will become | |
825 the value of the reduction token. The expression and the terminating | |
826 semicolon must be entirely on a single line. Note that, if you really | |
827 need to make the expression longer than will fit on one line, you can | |
828 embed a newline in a comment. Some examples of short form reduction | |
829 procedures are: | |
830 | |
831 % XXX is there anything we can do about the ugly underscores? | |
832 \begin{indentingcode}{0.4in} | |
833 =0; | |
834 | |
835 =1; | |
836 | |
837 =10*n + d-'0'; | |
838 | |
839 = | |
840 special{\us}processor(first{\us}parameter, second{\us}parameter); | |
841 | |
842 =word{\us}count++; | |
843 | |
844 =widget(constant{\us}1*parameter{\us}1 + constant{\us}2*parameter{\us}2 /* | |
845 {} */ + constant{\us}3*parameter{\us}3); | |
846 \end{indentingcode} | |
847 | |
848 A long form reduction procedure consists of an arbitrary block of C or | |
849 C++ code, enclosed in braces (\bra \ket). AnaGram will code the reduction | |
850 procedure as a function. To return a value for the reduction token, | |
851 simply use the \agcode{return} statement. There are effectively no | |
852 restrictions on the content or length of a reduction procedure. Of | |
853 course, if there are unbalanced braces, unterminated comments or | |
854 unterminated string literals, AnaGram will not be able to determine | |
855 where the reduction procedure ends. AnaGram treats | |
856 \index{Comments}nested comments within a reduction procedure according | |
857 to the value of the \index{Nest comments}\index{Configuration | |
858 switches}\agparam{nest comments} configuration switch at the point | |
859 where it encounters the reduction procedure. | |
860 | |
861 From a practical point of view it is not usually good practice to have | |
862 a reduction procedure that is more than a few lines long since a long | |
863 procedure will hamper your overall view of your grammar. Long | |
864 reduction procedures should be written as separate named functions, | |
865 and should either be included in the embedded C portion of your syntax | |
866 file or should be included in a wholly separate module. Here is an | |
867 example of a long form reduction procedure: | |
868 | |
869 \begin{indentingcode}{0.4in} | |
870 =\bra | |
871 if (flag) \bra | |
872 total += x; | |
873 return identify(x); | |
874 \ket | |
875 else \bra | |
876 total = 0; | |
877 flag = 1; | |
878 return init{\us}table(x); | |
879 \ket | |
880 \ket | |
881 \end{indentingcode} | |
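
As a rough picture of what becomes of such a procedure, the sketch
below writes the same body as an ordinary C function. Several things
here are assumptions: that \agcode{x} comes from a parameter
assignment such as \agcode{expression:x} with type \agcode{int}, that
\agcode{flag} and \agcode{total} are variables declared in embedded C,
and that \agcode{identify} and \agcode{init{\us}table} are user
functions (the stand-ins below are placeholders). The function name
\agcode{example{\us}rp} is hypothetical. AnaGram would choose a name
beginning with \agcode{ag{\us}rp{\us}}.

```c
/* Assumed to be declared in embedded C: state shared between calls. */
static int flag = 0;
static int total = 0;

/* Placeholder helpers standing in for the user's own functions. */
static int identify(int x)   { return x + 1; }
static int init_table(int x) { return -x; }

/* The long form procedure, coded as a function. The value given to
   return becomes the semantic value of the reduction token. */
static int example_rp(int x)
{
    if (flag) {
        total += x;
        return identify(x);
    }
    else {
        total = 0;
        flag = 1;
        return init_table(x);
    }
}
```

Because \agcode{flag} and \agcode{total} live outside the function,
their values persist from one reduction to the next.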
882 | |
883 If a rule does not have a reduction procedure, the semantic value of | |
884 the reduction token will be set to the \index{Semantic | |
885 value}\index{Token}\index{Value}semantic value of the first token in | |
886 the rule, unless the rule is a \index{Null productions}null | |
887 production. In the latter case, the value of the reduction token will | |
888 be set to zero. | |
889 % XXX and what if zero isn't a valid value for the type? a compiler | |
890 % error will occur. | |
891 | |
892 % XXX add something like | |
893 % | |
894 % Variables appearing in reduction procedures which do not have a | |
895 % parameter assignment in the corresponding grammar rule can be | |
896 % declared globally or (file)-statically in your embedded C, or | |
897 % alternatively could be added to the parser control block using | |
898 % the \agparam{extend pcb} statement (q.v. | See Section ....). | |
899 % (Reword this.) | |
900 % | |
901 % Should also discuss the sequencing of reduction procedure calls | |
902 % so that people understand what happens if you use such variables. | |
903 % | |
904 % also ``A reduction procedure can be used to terminate parsing for | |
905 % semantic reasons''. | |
906 % | |
907 | |
908 \subsection{Immediate Actions} | |
909 \index{Immediate action}\index{Action} | |
910 | |
911 An immediate action is a rule element that consists of executable C or | |
912 C++ code embedded within a grammar rule to be executed when it is | |
913 encountered. An immediate action is denoted by the use of an | |
914 exclamation point, \index{!}``!''. The content of an immediate action | |
915 may be written following the rules for either long form or short form | |
916 reduction procedures. As with any other rule element, it must be | |
917 separated from preceding and following rule elements by commas. In | |
918 the grammar for a simple desk calculator, one might write | |
919 | |
920 \begin{indentingcode}{0.4in} | |
921 transaction | |
922 -> !printf("\#");, expression:x = printf("\%d{\bs}n", x); | |
923 \end{indentingcode} | |
924 | |
926 Notice that the only visible difference between an immediate action | |
927 and a reduction procedure is that the immediate action is preceded by | |
928 ``!'' instead of ``=''. The immediate action must be followed by a | |
929 comma to separate it from the following rule element. | |
930 | |
931 Immediate actions may also be used in definitions: | |
932 | |
933 \begin{indentingcode}{0.4in} | |
934 prompt = !printf("\#"); | |
935 \end{indentingcode} | |
936 | |
937 AnaGram implements an immediate action by creating a special token for | |
938 it. AnaGram then creates a single null production for the | |
939 token. Finally, the immediate action is implemented as the reduction | |
940 procedure for the null production. | |
941 | |
942 For example, you could implement \agcode{prompt} by writing a null production | |
943 with a reduction procedure: | |
944 | |
945 \begin{indentingcode}{0.4in} | |
946 prompt | |
947 -> = printf("\#"); | |
948 \end{indentingcode} | |
949 | |
950 This production would be equivalent to the definition above. | |
951 | |
952 There are two ways, however, in which immediate actions differ from | |
953 the equivalent null production. Immediate actions may access any | |
954 parameter assignments which precede them in the rule in which they | |
955 occur. On the other hand, there is no way to assign a data type to | |
956 the semantic value, if any, returned by the immediate action. | |
957 Therefore, the type is determined by your setting of the | |
958 \index{Default token type}\index{Configuration parameters} | |
959 \agparam{default token type} configuration parameter. | |
960 | |
961 \subsection{Virtual Productions} | |
962 \index{Virtual productions}\index{Production} | |
963 | |
964 Virtual productions are a convenient short form notation for common | |
965 grammatical constructs involving choice and repetition. The notation | |
966 represents an extension of notation commonly used in programming | |
967 manuals. A virtual production may be written in a grammar rule at any | |
968 place where you could write a token name, even within another virtual | |
969 production. Note that use of virtual productions is never | |
970 \emph{required}, since the equivalent productions can always be | |
971 written out explicitly instead. | |
972 | |
973 When AnaGram encounters a virtual production, it replaces the virtual | |
974 production with a new token and writes appropriate productions for the | |
975 new token. When you look at your syntax tables using AnaGram windows, | |
976 you will see the productions that AnaGram generates. AnaGram keeps a | |
977 record of virtual productions, so that generally if you use the same | |
978 virtual production a second time, you get the same set of tokens and | |
979 productions that were generated the first time it was used. This is | |
980 not the case if the virtual productions contain reduction procedures | |
981 or immediate actions, since AnaGram is not equipped to determine | |
982 whether two pieces of C code are equivalent. Thus, a virtual | |
983 production that contains a reduction procedure will be unique and will | |
984 not be reused. | |
985 | |
986 One disadvantage of virtual productions is that there is no way to | |
987 specify the data type of the \index{Semantic value}\index{Virtual | |
988 production}semantic value of a virtual production. Therefore, if you | |
989 have a reduction procedure within a virtual production, its return | |
990 value must be consistent with the type defined by the \index{Default | |
991 token type}\index{Configuration parameters}\agparam{default token type} | |
992 configuration parameter. | |
993 | |
994 The simplest virtual production is the \index{Token}\index{Optional | |
995 token}\agterm{optional token}. If \agcode{x} is an arbitrary token | |
996 name or set expression, you can indicate an optional \agcode{x} by | |
997 writing \index{?}\agcode{x?}. You may also indicate a repetition of | |
998 \agcode{x} by using the ellipsis with either \agcode{x} or \agcode{x?}. | |
999 \index{...}\index{Ellipsis}Thus \agcode{x...} represents | |
1000 one or more instances of \agcode{x} and \index{?...}\agcode{x?...} | |
1001 represents zero or more instances of \agcode{x}. For example: | |
1002 | |
1003 \begin{indentingcode}{0.4in} | |
1004 '+'? | |
1005 \end{indentingcode} | |
1006 | |
1007 can be used to represent an optional plus sign, that is, a choice | |
1008 between a plus sign and nothing at all. Similarly, | |
1009 | |
1010 \begin{indentingcode}{0.4in} | |
1011 '{\bs}n'?... | |
1012 \end{indentingcode} | |
1013 | |
1014 represents an optional sequence of newline characters. | |
1015 | |
1016 \index{Brackets}\index{Braces}\index{\_opb\_clb}\index{[]} | |
1017 The next category of virtual productions uses brackets or braces to | |
1018 indicate a choice among a number of enclosed grammar rules separated | |
1019 by vertical bars. A single rule may also be enclosed. Note that | |
1020 \emph{rules}, with following reduction procedures, are allowed, not | |
1021 simply tokens. | |
1022 | |
1023 Braces are used to indicate that one option must be chosen. Brackets | |
1024 are used to indicate the choice is optional, i.e. may be omitted | |
1025 altogether. The ellipsis following a set of options within brackets | |
1026 or braces indicates the option may be repeated an indefinite number of | |
1027 times. | |
1028 | |
1029 You can use braces to indicate a simple choice among a number of | |
1030 options. A Cobol grammar offers the following choice of equivalent | |
1031 keywords: | |
1032 | |
1033 \begin{indentingcode}{0.4in} | |
1034 \bra "RECORD", "IS"? | "RECORDS", "ARE"? \ket | |
1035 \end{indentingcode} | |
1036 | |
1037 \index{\_opb\_clb...}\index{ []...} | |
1038 You may use the ellipsis with braces to indicate an arbitrary positive | |
1039 number of repetitions of the choice: | |
1040 | |
1041 \begin{indentingcode}{0.4in} | |
1042 {\bra}type specifier | storage class specifier{\ket}... | |
1043 \end{indentingcode} | |
1044 | |
1045 This expression requires at least one type specifier or storage class | |
1046 specifier, but will accept any number. | |
1047 | |
1048 \index{[]} | |
1049 To make a choice optional, use brackets instead of braces. An | |
1050 example, again drawn from a Cobol grammar, is: | |
1051 | |
1052 \begin{indentingcode}{0.4in} | |
1053 {}["LIMIT", "IS"? | "LIMITS", "ARE"?] | |
1054 \end{indentingcode} | |
1055 | |
1056 \index{[]...} | |
1057 Ellipses may be used with brackets to indicate an arbitrary number of | |
1058 choices that may be omitted altogether: | |
1059 | |
1060 \begin{indentingcode}{0.4in} | |
1061 {}[argument, [',', argument]...] | |
1062 \end{indentingcode} | |
1063 | |
1064 This expression describes an optional argument list with arguments | |
1065 separated by commas. | |
1066 | |
1067 If you use a null production within braces, it must be the first option: | |
1068 | |
1069 \begin{indentingcode}{0.4in} | |
1070 \bra | '+' | '-' \ket | |
1071 \end{indentingcode} | |
1072 | |
1073 Normally, you would do this only if you wanted to attach a reduction | |
1074 procedure to the null production. Note that if you include a null | |
1075 production within braces, and add an ellipsis after the closing brace | |
1076 for repetition, your grammar will be ambiguous. Just exactly how many | |
1077 times does the null production occur? Use brackets instead, and omit | |
1078 the null production. | |
1079 | |
1080 Null productions are not allowed with brackets, since they would be | |
1081 intrinsically ambiguous. | |
1082 | |
1083 The options within braces or brackets may be grammar rules of any | |
1084 length or complexity and may themselves contain virtual productions of | |
1085 arbitrary complexity. Nevertheless, in practice, clarity suffers as | |
1086 soon as the options get very complex. Virtual productions are most | |
1087 important and useful when used in simple situations. In those | |
1088 situations they will enhance the clarity of your grammar. | |
1089 | |
1090 Here is an example that is moderately complex, even though each rule | |
1091 consists of a single token: | |
1092 | |
1093 \begin{indentingcode}{0.4in} | |
1094 \bra{\bra}"on" | "true"\ket = 1; | {\bra}"off" | "false"\ket = 0; | integer\ket | |
1095 \end{indentingcode} | |
1096 | |
1097 This example can be used to allow as input either an integer or, for | |
1098 special cases, keywords. You could write this option out in the | |
1099 following way: | |
1100 | |
1101 \begin{indentingcode}{0.4in} | |
1102 p1 | |
1103 -> p2 = 1; | |
1104 -> p3 = 0; | |
1105 -> integer | |
1106 | |
1107 p2 | |
1108 -> "on" | |
1109 -> "true" | |
1110 | |
1111 p3 | |
1112 -> "off" | |
1113 -> "false" | |
1114 \end{indentingcode} | |
1115 | |
1116 The final category of virtual production provides a notation for | |
1117 \index{Alternating sequence}\agterm{alternating sequences}. An | |
1118 alternating sequence is a set of choices which may be repeated | |
1119 arbitrarily subject to the side condition that no choice may follow | |
1120 itself, in other words, that the choices must alternate. Alternating | |
1121 sequences are written with either brackets or braces depending on | |
1122 whether the sequence is optional or not, followed by | |
1123 \index{/...}``\agcode{/...}''. Note that the choices themselves may | |
1124 allow sequences. For example: | |
1125 | |
1126 \begin{indentingcode}{0.4in} | |
1127 program | |
1128 -> [statement | newline...]/..., eof | |
1129 \end{indentingcode} | |
1130 | |
1131 represents a sequence of statements separated by one or more newlines. | |
1132 Any two statements must be separated by one or more newline | |
1133 characters, and newlines may also appear at the beginning and the end | |
1134 of the program. | |
1135 | |
1136 Null productions are not allowed within alternating sequences, since | |
1137 they are intrinsically ambiguous in all cases. | |
1138 | |
1139 \subsection{Definition Statements} | |
1140 \index{Definitions}\index{Definition statement}\index{Statement} | |
1141 | |
1142 A definition statement is simply a shorthand way of naming a character | |
1143 set, a \index{Virtual productions}\index{Production}virtual | |
1144 production, a keyword string, or an immediate action. It can also be | |
1145 used for providing an alternate name for a token. Definitions have the | |
1146 form: | |
1147 | |
1148 \begin{indentingcode}{0.4in} | |
1149 name = \codemeta{character set} | |
1150 name = \codemeta{virtual production} | |
1151 name = \codemeta{keyword} | |
1152 name = \codemeta{immediate action} | |
1153 name = \codemeta{token name} | |
1154 \end{indentingcode} | |
1155 | |
1156 The name may be any name acceptable to AnaGram. The name can then be | |
1157 used anywhere you might have used the expression on the right | |
1158 side. \index{!}For example: | |
1159 | |
1160 \begin{indentingcode}{0.4in} | |
1161 upper case letter = 'A-Z' | |
1162 lower case letter = 'a-z' | |
1163 letter = upper case letter + lower case letter | |
1164 statement list = statement?... | |
1165 while keyword = "WHILE" | |
1166 prompt = !printf("Please enter name:"); | |
1167 \end{indentingcode} | |
1168 | |
1169 It is important to recognize that a definition statement that names a | |
1170 set does not define a token. A token is defined only when the set is | |
1171 used in a grammar rule, and then only if the set is used directly, not | |
1172 in combination with some other set. Furthermore, if you use a | |
1173 character set directly in a grammar rule, and in some other rule you | |
1174 use a name that refers to the same set of characters, you will get two | |
1175 different tokens. For example, if you have defined \agcode{upper case | |
1176 letter} as in the above example and use both \agcode{upper case | |
1177 letter} and \agcode{'A-Z'} in grammar rules, AnaGram will assign | |
1178 different \index{Token number}\index{Token}\index{Number}token numbers | |
1179 to accommodate any differences in attributes you may assign to the | |
1180 tokens. | |
1181 | |
1182 Renaming tokens is a convenient way to connect two independently | |
1183 written portions of a grammar. | |
1184 % See the C grammar in the EXAMPLES directory of your distribution | |
1185 % disk for an example. | |
1186 | |
1187 \subsection{Embedded C} | |
1188 \index{Embedded C} | |
1189 | |
1190 You may encapsulate C or C++ code in your syntax file by enclosing it | |
1191 in braces (\bra \ket). Such pieces of code are copied to the parser file | |
1192 untouched, in the order they are found in the syntax file. There may | |
1193 be any number of such pieces of embedded C. The only restriction is | |
1194 that they must not start on the same line as some other AnaGram | |
1195 statement, and following AnaGram statements must also start on fresh | |
1196 lines. | |
1197 | |
1198 Normally, the blocks of embedded C in your syntax file are copied to | |
1199 the parser file \emph{following} a set of definitions and declarations | |
1200 AnaGram needs for the code it generates. However, if the \emph{first} | |
1201 statement in your \index{Syntax file}syntax file is a block of | |
1202 embedded C, it will \emph{precede} AnaGram's definitions and | |
1203 declarations. This block of embedded C is called the | |
1204 \index{Prologue}\index{C prologue}``C prologue''. There are two main | |
1205 reasons for this special treatment. First, you may want to have a | |
1206 title and \index{Copyright notice}copyright notice in your parser. If | |
1207 you include them in an initial block of embedded C they will be right | |
1208 at the beginning of both your syntax file and your parser file. | |
1209 Second, if some of your tokens have data type\index{Token}\index{Data | |
1210 type}s other than those predefined in C or C++, you may include the | |
1211 definitions here, so they will be available to the code AnaGram | |
1212 generates. | |
1213 | |
1214 AnaGram scans embedded C only insofar as is necessary to find the | |
1215 closing right brace. Therefore any braces used within embedded C must | |
1216 balance properly. AnaGram skips braces enclosed in character | |
1217 constants and string literals, as well as braces enclosed in | |
1218 comments. It also recognizes C++ style comments that begin with | |
1219 \agcode{//}. \index{Comments}Treatment of nested versus non-nested comments | |
1220 is controlled by the | |
1221 \index{Nest comments}\index{Configuration switches}\agparam{nest comments} | |
1222 configuration switch. AnaGram will use the status of this | |
1223 switch in effect at the beginning of the section of embedded C. | |
1224 | |
1225 AnaGram, of course, can be confused by unterminated strings, | |
1226 unbalanced braces, and unterminated comments. The most likely | |
1227 outcome, in such a situation, is that AnaGram will encounter the end | |
1228 of file looking for the end of the embedded C. Should this happen, | |
1229 AnaGram will identify the beginning of the piece of embedded C which | |
1230 caused the problem. | |
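
The kind of scan described above can be sketched in a few lines of C.
This is an illustration only, not AnaGram's code. The function name
\agcode{find{\us}closing{\us}brace} is hypothetical, and the sketch
assumes that comments do not nest (that is, \agparam{nest comments}
is off).

```c
#include <stddef.h>

/* Return the index of the brace that closes the block opened at s[0],
   or -1 if the end of the text is reached first. Braces inside string
   literals, character constants, and comments are not counted.
   Assumption: comments do not nest. */
static long find_closing_brace(const char *s)
{
    long depth = 0;
    size_t i = 0;
    while (s[i] != '\0') {
        char c = s[i];
        if (c == '"' || c == '\'') {              /* string or char constant */
            char quote = c;
            i++;
            while (s[i] != '\0' && s[i] != quote) {
                if (s[i] == '\\' && s[i + 1] != '\0')
                    i++;                          /* skip escaped character */
                i++;
            }
            if (s[i] == '\0')
                return -1;                        /* unterminated literal */
            i++;
        } else if (c == '/' && s[i + 1] == '*') { /* C style comment */
            i += 2;
            while (s[i] != '\0' && !(s[i] == '*' && s[i + 1] == '/'))
                i++;
            if (s[i] == '\0')
                return -1;                        /* unterminated comment */
            i += 2;
        } else if (c == '/' && s[i + 1] == '/') { /* C++ style comment */
            while (s[i] != '\0' && s[i] != '\n')
                i++;
        } else {
            if (c == '{')
                depth++;
            else if (c == '}' && --depth == 0)
                return (long)i;                   /* matching brace found */
            i++;
        }
    }
    return -1;                                    /* ran off the end of the text */
}
```

A result of $-1$ corresponds to the situation described above, in
which the end of file is reached while looking for the end of the
embedded C.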
1231 | |
1232 The code you include as embedded C, of course, has to coexist with the | |
1233 code AnaGram generates. In order to keep the potential for conflicts | |
1234 to a minimum, all variables and functions which AnaGram defines begin | |
1235 either with the name of your parser or with the letters | |
1236 \agcode{ag{\us}}. You should avoid variable names which begin with these | |
1237 letters. | |
1238 | |
1239 Reduction procedures are copied to the \index{Parser | |
1240 file}\index{File}parser file in the order in which they are defined | |
1241 \emph{following} all of the embedded C. Thus your reduction | |
1242 procedures may freely use variables and macros defined anywhere in | |
1243 your embedded C. | |
1244 | |
1245 \subsection{Configuration Sections} | |
1246 \index{Configuration section} | |
1247 | |
1248 A configuration section is a special section of your syntax file | |
1249 enclosed in brackets. Within a configuration section you may set the | |
1250 values of configuration parameters or switches, or you may use one or | |
1251 more of several available attribute statements to specify special | |
1252 treatment for certain tokens. There can be as many or as few | |
1253 configuration sections in your syntax file as you wish. Each | |
1254 configuration section must begin on a new line. Any AnaGram statement | |
1255 which follows a configuration section must also begin on a new line. | |
1256 | |
1257 Within a configuration section, each parameter setting and each | |
1258 attribute statement must begin on a new line. The rules for using | |
1259 comments and continuation lines are the same as for the rest of | |
1260 AnaGram. | |
1261 | |
1262 Configuration parameters control the way AnaGram interprets your | |
1263 syntax file and the way it builds your parser. A full discussion of | |
1264 the use of configuration parameters, including a complete discussion | |
1265 of each parameter and its default value, is given in Appendix A. | |
1266 | |
1267 \index{Attribute statements}\index{Statement} | |
1268 Attribute statements comprise the | |
1269 \index{Precedence declarations}precedence declarations \agparam{left}, | |
1270 \agparam{right}, and \agparam{nonassoc}; the \agparam{sticky} | |
1271 declaration; the \agparam{distinguish keywords} statement; the | |
1272 \agparam{hidden} declaration; the \agparam{disregard} and | |
1273 \agparam{lexeme} statements; the \agparam{enum} statement; the | |
1274 \index{Reserve keywords}\agparam{reserve keywords} declaration; and | |
1275 the \index{Rename macro}\agparam{rename macro} statement. | |
1276 | |
1277 The precedence declarations and the | |
1278 \index{Sticky declaration}\index{Declaration}\agparam{sticky} | |
1279 declaration may be used to resolve conflicts in your grammar. The | |
1280 \agparam{distinguish keywords} statement may be used to control | |
1281 keyword recognition. The | |
1282 \index{Hidden declaration}\index{Declaration}\agparam{hidden} | |
1283 declaration causes certain token names not to be used when your parser | |
1284 produces | |
1285 \index{Syntax error}\index{Errors}\index{Error messages}syntax error | |
1286 messages. You may use the \agparam{disregard} and \agparam{lexeme} | |
1287 statements to cause your parser to skip automatically over certain | |
1288 tokens in its input. The \agparam{enum} statement is almost identical | |
1289 to the enum statement in C. It can be used to assign names to input | |
1290 codes in grammars which are taking input from a \index{Lexical | |
1291 scanner}lexical scanner or another parser. The | |
1292 \index{Reserve keywords}\agparam{reserve keywords} declaration allows | |
1293 you to specify certain keywords as reserved words. The | |
1294 \index{Rename macro}\agparam{rename macro} statement allows you to | |
1295 override the names AnaGram uses for various macro definitions it | |
1296 creates in the code it generates. | |
1297 | |
1298 Attribute statements are discussed below. Except for | |
1299 \agparam{disregard} and \agparam{rename macro} statements, attribute | |
1300 statements accept lists of operands enclosed in braces (\bra \ket) | |
1301 and separated by commas. A dangling comma following the last item in | |
1302 a list will be ignored. | |
1303 | |
1304 \subsection{Setting Configuration Parameters} | |
1305 \index{Configuration parameters}\index{Parameters} | |
1306 | |
1307 Each configuration parameter has a name that follows the AnaGram | |
1308 conventions for symbol names, except that AnaGram ignores | |
1309 case\index{Case sensitivity} when looking up configuration parameter | |
1310 names. | |
1311 | |
1312 There are a number of varieties of configuration parameters. The | |
1313 simplest, | |
1314 \index{Configuration switches}\index{Switches}configuration switches, | |
1315 turn some feature of AnaGram on or off. Such a parameter need only | |
1316 be stated to turn the feature on, or negated with the tilde | |
1317 (\agcode{\~{}}) to turn the feature off: | |
1318 | |
1319 \begin{indentingcode}{0.4in} | |
1320 nest comments | |
1321 \end{indentingcode} | |
1322 | |
1323 causes AnaGram to allow nested comments, and | |
1324 | |
1325 \begin{indentingcode}{0.4in} | |
1326 \~{}nest comments | |
1327 \end{indentingcode} | |
1328 | |
1329 causes AnaGram to disallow nested comments. | |
1330 | |
1331 You may also set or reset configuration switches with explicit on or | |
1332 off values: | |
1333 | |
1334 \begin{indentingcode}{0.4in} | |
1335 nest comments = on | |
1336 nest comments = off | |
1337 \end{indentingcode} | |
1338 | |
1339 The remaining configuration parameters are assigned values using a | |
1340 simple assignment statement. Depending on the parameter, the value it | |
1341 takes may be the name of a token, a C variable name, a C or C++ data | |
1342 type, a string constant or an integer. String constants are written | |
1343 using the same rules as keyword strings, described above. | |
1344 | |
1345 \begin{indentingcode}{0.4in} | |
1346 grammar token = program | |
1347 parser name = widget | |
1348 default token type = void * | |
1349 header file name = "widget.h" | |
1350 parser stack size = 50 | |
1351 \end{indentingcode} | |
1352 | |
1353 A number of string-valued \index{Configuration | |
1354 parameters}configuration parameters are used to determine file | |
1355 names and variable names. In these parameters, the \index{\#}``\#'', | |
1356 \index{\_dol}``\$'', and ``\index{ \_prc}\%'' characters | |
1357 are used as wild cards. In file name specifications and the | |
1358 specification of the name of your parser, ``\#'' will be replaced by | |
1359 the name of your syntax file. In other function and variable names | |
1360 AnaGram creates while building your parser, ``\$'' will be replaced by | |
1361 the name of your parser. When building enumeration constants for the | |
1362 names of the tokens in your grammar, ``\%'' will be replaced by the | |
1363 name of the token. | |
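
As a hedged illustration of the ``\#'' wild card, suppose the syntax
file is named \agcode{widget.syn} (a hypothetical name chosen for this
example). A setting such as

\begin{indentingcode}{0.4in}
header file name = "\#.h"
\end{indentingcode}

would then be expected to yield a header file named
\agcode{widget.h}, on the assumption that the name of the syntax file
is substituted without its extension.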
1364 | |
1365 Note that when entering a Windows/DOS path name as the | |
1366 value of a file name parameter, you must escape each backslash in the | |
1367 path name by doubling it. For example, | |
1368 | |
1369 \begin{indentingcode}{0.4in} | |
1370 coverage file name = "f:{\bs\bs}sna{\bs\bs}foo.nrc" | |
1371 \end{indentingcode} | |
1372 | |
1373 \subsection{Precedence Declarations} | |
1374 \index{Precedence declarations} | |
1375 | |
1376 AnaGram allows you to resolve shift-reduce conflicts by assigning | |
1377 precedence levels to operators. There are three precedence | |
1378 declarations available, beginning with the keywords | |
1379 \index{Left}\agparam{left}, \index{Right}\agparam{right}, and | |
1380 \index{Nonassoc}\agparam{nonassoc} respectively. Each such | |
1381 declaration consists of the appropriate keyword and a list of tokens | |
1382 enclosed in braces (\bra \ket). All the tokens in the list have the same | |
1383 precedence, higher than tokens in any previous declaration and lower | |
1384 than in any subsequent declaration. If the keyword is \agparam{left}, | |
1385 the tokens will group to the left. If it is \agparam{right}, they | |
1386 will group to the right. If it is \agparam{nonassoc} (for | |
1387 non-associative) no grouping will be assumed. Precedence declarations | |
1388 must be included in a configuration section. Here are precedence | |
1389 declarations appropriate to a simple desk calculator program: | |
1390 | |
1391 \begin{indentingcode}{0.4in} | |
1392 {}[ | |
1393 left \bra '+', '-' \ket | |
1394 left \bra star, '/', '\%' \ket | |
1395 right \bra unary minus \ket | |
1396 ] | |
1397 unary minus = '-' | |
1398 \end{indentingcode} | |
1399 | |
1400 Note that \agcode{unary minus} and \agcode{'-'} can have different | |
1401 precedence. | |
1402 | |
1403 Precedence declarations are one of the few instances in AnaGram where | |
1404 the \index{Statements}\index{Order of statements}order of statements | |
1405 is significant. | |
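
As a sketch of how the ordering matters, consider the following
declarations. The choice of operators is illustrative only and is not
taken from any particular grammar:

\begin{indentingcode}{0.4in}
{}[
  nonassoc \bra '<', '>' \ket
  left \bra '+', '-' \ket
  left \bra '/', '\%' \ket
]
\end{indentingcode}

Because the \agparam{nonassoc} declaration comes first, the comparison
operators have the lowest precedence, and the tokens in the last
declaration have the highest. Under these declarations an expression
such as \agcode{a < b + c / d} would be expected to group as
\agcode{a < (b + (c / d))}. Interchanging the two \agparam{left}
declarations would instead make \agcode{'+'} and \agcode{'-'} bind
more tightly than \agcode{'/'} and \agcode{'\%'}.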
1406 | |
1407 The use of precedence declarations is discussed in Chapter 9. | |
1408 | |
1409 \subsection{``Sticky'' Declarations} | |
1410 \index{Sticky declaration}\index{Declaration} | |
1411 | |
1412 AnaGram provides another means for resolving shift-reduce conflicts. | |
1413 You may characterize any token as ``sticky''. Then, in the case of a | |
1414 \index{Shift-reduce conflict}\index{Conflicts}shift-reduce conflict | |
1415 where a ``sticky'' token is the last token in the input buffer, the | |
1416 conflict will be resolved by selecting the shift operation. | |
1417 Intuitively, you may think of this as though the ``sticky'' token | |
1418 adheres to and draws in any subsequent input that it can. ``Sticky'' | |
1419 declarations are included in configuration sections. They begin with | |
1420 the keyword \agcode{sticky} followed by a list of tokens, separated by | |
1421 commas inside braces (\bra \ket). Suppose, for instance, you wished to | |
1422 pick up a line of text, skipping any leading space or tab | |
1423 characters. You might write the following syntax: | |
1424 | |
1425 \begin{indentingcode}{0.4in} | |
1426 white space = ' ' + '{\bs}t' | |
1427 | |
1428 text char | |
1429 -> \~{}'{\bs}n':c = do{\us}something(c); | |
1430 | |
1431 line | |
1432 -> leading white space, text char?..., '{\bs}n' | |
1433 | |
1434 leading white space | |
1435 -> | |
1436 -> leading white space, white space | |
1437 \end{indentingcode} | |
1438 | |
1439 Unfortunately, this syntax is ambiguous, since space and tab are | |
1440 legitimate instances of both leading white space and text char. What | |
1441 you really want to do is to skip white space until you find a | |
1442 non-blank character and then you want to accept all characters to the | |
1443 end of the line. There are two ways to address the problem. The | |
1444 first is to define a special token for the first non-blank character | |
1445 and, using it, to write an unambiguous grammar. This approach, while | |
1446 laudable, is tedious and prolix. Instead, use \agparam{sticky} to | |
1447 resolve the problem: | |
1448 | |
1449 \begin{indentingcode}{0.4in} | |
1450 {}[ sticky \bra leading white space \ket ] | |
1451 \end{indentingcode} | |
1452 | |
1453 Now when AnaGram analyzes your grammar, and encounters the ambiguity, | |
1454 it will understand that a blank or tab that could be treated either as | |
1455 leading white space or as the first text character should be | |
1456 treated as white space. Since \agcode{leading white space} is | |
1457 ``sticky'', any subsequent white space adheres to it. | |
1458 | |
1459 As with conflicts resolved with precedence levels, AnaGram lists all | |
1460 conflicts that it resolves using \agcode{sticky} in the | |
1461 \index{Resolved Conflicts}\index{Window}\agwindow{Resolved Conflicts | |
1462 Table}, so you can verify that the conflicts have been correctly | |
1463 resolved. | |
1464 | |
1465 An important use of sticky tokens is to inhibit the recognition of | |
1466 following \index{Keywords}keywords. Following a sticky token, a | |
1467 keyword, which, according to your grammar, would otherwise be | |
1468 legitimate input, will not be recognized if a shift action is possible | |
1469 for the first character of the keyword. For example, imagine that | |
1470 \agcode{name} has been defined in the conventional way, and there | |
1471 exists a production with name followed immediately by the keyword | |
1472 \agcode{int}. Then if, in your input, the word \agcode{print} were to | |
1473 occur, your grammar would parse it as a name, \agcode{pr}, followed by | |
1474 the keyword \agcode{int}. If you make \agcode{name} sticky, however, | |
1475 the first letter of \agcode{int} will be seen to be an acceptable | |
1476 character for \agcode{name} and the keyword will not be | |
1477 recognized. Your parser will then recognize the \agcode{name} as | |
1478 \agcode{print}. | |
1479 | |
1480 \subsection{Distinguish Keywords Statement} | |
1481 \index{Distinguish keywords}\index{Keywords} | |
1482 | |
1483 Distinguish keywords statements are occasionally needed to prevent | |
1484 keyword recognition. You may, for example, wish to prevent the | |
1485 recognition of the keyword \agcode{int} when it occurs embedded in a | |
1486 word such as \agcode{interval}. Of course, you need to do this only | |
1487 if the keyword and the other word are both legitimate input at | |
1488 the same point in your grammar. | |
1489 | |
1490 A distinguish keywords statement can prevent recognition of a keyword | |
1491 which is embedded in another word provided at least one character of | |
1492 the other word follows the keyword. | |
1493 | |
1494 The distinguish keywords statement has the form: | |
1495 | |
1496 \begin{indentingcode}{0.4in} | |
1497 distinguish keywords \bra \codemeta{list of character sets} \ket | |
1498 \end{indentingcode} | |
1499 | |
1500 AnaGram compares all the characters in each keyword to the characters | |
1501 included in each character set in turn. If it finds that all the | |
1502 characters in a keyword are members of a particular set, it tells the | |
1503 keyword recognition logic to try to match the keyword only against the | |
1504 longest sequence of characters drawn from the specified set. In other | |
1505 words, in order for a keyword to be recognized, the keyword | |
1506 \emph{must} be followed by a character \emph{not} in the set. The set | |
1507 associated with a keyword is the first one in the list which contains | |
1508 all the characters found in the keyword. If you have more than one | |
1509 \agparam{distinguish keywords} statement in your grammar, the lists | |
1510 are tried in the order in which they appear in the grammar. | |
1511 | |
1512 The purpose of the \agparam{distinguish keywords} statement is to | |
1513 enable your parser to distinguish a keyword from the same sequence of | |
1514 characters embedded within another sequence. Thus suppose that | |
1515 \agcode{int} is a keyword, and, according to your grammar, could | |
1516 appear in the same place as the word \agcode{integral}. If you don't | |
1517 want it to be recognized as a keyword in these circumstances, you | |
1518 would write the following distinguish keywords statement: | |
1519 | |
1520 \begin{indentingcode}{0.4in} | |
1521 distinguish keywords \bra 'a-z'+'A-Z' \ket | |
1522 \end{indentingcode} | |
1523 | |
1524 To also inhibit recognition of \agcode{int} within \agcode{print}, you | |
1525 would combine the use of the distinguish keywords statement with the | |
1526 \agparam{sticky} declaration. | |
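
A minimal sketch of that combination, assuming that \agcode{name} is
the token that should absorb the embedded keyword, as in the
\agcode{print} example given earlier:

\begin{indentingcode}{0.4in}
{}[
  distinguish keywords \bra 'a-z' + 'A-Z' \ket
  sticky \bra name \ket
]
\end{indentingcode}

The \agparam{distinguish keywords} statement handles a keyword followed
by further letters, as in \agcode{integral}, while the
\agparam{sticky} declaration handles a keyword preceded by letters
that can continue a \agcode{name}, as in \agcode{print}.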
1527 | |
1528 \subsection{``Hidden'' Declarations} | |
1529 \index{Hidden declaration}\index{Declaration} | |
1530 | |
1531 AnaGram provides an optional \index{Error diagnosis}error diagnosis | |
1532 feature for your parser (see Chapter 9). The \agparam{hidden} | |
1533 declaration allows you to identify tokens that you do not wish to be | |
1534 used in making up \index{Diagnostic messages}diagnostic messages. | |
1535 Typically, these are tokens whose names would not mean anything to your | |
1536 users. The format of a ``hidden'' declaration is the same as that of | |
1537 precedence and ``sticky'' declarations. Within a configuration | |
1538 section, the keyword ``hidden'' is followed by a list of tokens. For | |
1539 example: | |
1540 | |
1541 \begin{indentingcode}{0.4in} | |
1542 {}[ hidden \bra comment head \ket ] | |
1543 comment | |
1544 -> comment head, "*/" | |
1545 | |
1546 comment head | |
1547 -> "/*" | |
1548 -> comment head, \~{}eof | |
1549 \end{indentingcode} | |
1550 | |
1551 This is an AnaGram representation of ANSI standard C comments | |
1552 (non-nested). In this example the token \agcode{comment head} exists | |
1553 only for convenience in writing the grammar and has no particular | |
1554 meaning to an end user. The user does, however, know what the word | |
1555 \agcode{comment} refers to. The ``hidden'' attribute causes AnaGram's | |
1556 diagnostic builder to back up the stack until it finds a | |
1557 non-hidden token, so that it avoids \agcode{comment head} in favor of | |
1558 \agcode{comment}. | |
1560 | |
1561 \subsection{Disregard Statement} | |
1562 | |
1563 The purpose of the | |
1564 \index{Disregard statement}\index{Statement}\agparam{disregard} | |
1565 statement is to skip over uninteresting \index{White space}white space | |
1566 and comments in your input files. The disregard statement allows you | |
1567 to specify a token that should be passed over in the input to your | |
1568 parser. The statement takes the form: | |
1569 | |
1570 \begin{indentingcode}{0.4in} | |
1571 disregard ws | |
1572 \end{indentingcode} | |
1573 | |
1574 where \agcode{ws} is a token name or character set. Disregard | |
1575 statements may be placed in any configuration section. | |
1576 | |
1577 You may have more than one disregard statement in your grammar. If | |
1578 you do, AnaGram will create a shell production. For example, suppose | |
1579 you write: | |
1580 | |
1581 \begin{indentingcode}{0.4in} | |
1582 {}[ | |
1583 disregard alpha | |
1584 disregard beta | |
1585 ] | |
1586 \end{indentingcode} | |
1587 | |
1588 AnaGram will proceed as though you had written: | |
1589 | |
1590 \begin{indentingcode}{0.4in} | |
1591 gamma | |
1592 -> alpha | beta | |
1593 {}[ disregard gamma ] | |
1594 \end{indentingcode} | |
1595 | |
1596 It frequently happens that you wish your parser to disregard blanks or | |
1597 comments, except that white space within names, numbers, strings, and | |
1598 other elementary constructs is subject to special rules and thus | |
1599 should not be disregarded blindly. In this case, you can use the | |
1600 \agparam{lexeme} statement to declare these constructs off limits | |
1601 for the disregard statement. Within these constructs, the disregard | |
1602 statement will be inoperative and the admissibility of white space | |
1603 will be determined solely by the productions which define these | |
1604 constructs. | |
1605 | |
1606 Outside those productions which define lexemes, you should not | |
1607 generally use a token which is supposed to be disregarded. If you do, | |
1608 your grammar will have conflicts, since the token could satisfy both | |
1609 the explicit usage and the implicit rules set up by the disregard | |
1610 statement. Such conflicts, however, are resolved automatically in | |
1611 favor of your explicit use of the token. The conflicts will appear in | |
1612 the \agwindow{Resolved Conflicts} window. | |
1613 % XXX I'm not sure that's still true. | |
1614 | |
1615 In order to implement the disregard statement AnaGram will redefine | |
1616 some tokens in your grammar. For example, \agcode{+} may be redefined | |
1617 to consist of a simple plus sign followed by optional white space: | |
1618 | |
1619 \begin{indentingcode}{0.4in} | |
1620 '+' | |
1621 -> '+'\%, white space?... | |
1622 \end{indentingcode} | |
1623 | |
1624 The percent sign is used to indicate the original, simple plus sign | |
1625 without the optional white space attached. You will probably notice | |
1626 the percent sign appearing in some windows and traces. In earlier | |
1627 versions of AnaGram, the degree sign, ``\agcode{\degrees}'', was used rather | |
1628 than ``\agcode{\%}''. | |
1629 | |
1630 \subsection{Lexeme Statement} | |
1631 | |
1632 The ``lexeme'' \index{Statement}\index{Lexeme statement}statement is | |
1633 used to fine-tune the disregard statement. | |
1634 The lexeme statement takes the form: | |
1635 | |
1636 \begin{indentingcode}{0.4in} | |
1637 {}[ lexeme \bra \codemeta{nonterminal token list} \ket ] | |
1638 \end{indentingcode} | |
1639 | |
1640 where \textit{nonterminal token list} is a list of nonterminal tokens | |
1641 separated by commas. | |
1642 Lexeme statements may be placed in any configuration section, and | |
1643 there may be any number of them. | |
1644 | |
1645 When you specify that a token is to be disregarded, AnaGram rewrites | |
1646 your grammar so that the token will be passed over whenever it occurs | |
1647 at the beginning of a file or following a lexical unit, or | |
1648 \agterm{lexeme}. If you have no \agparam{lexeme} statement, then the | |
1649 lexemes in your grammar are just the terminal tokens. | |
1650 | |
1651 The \agparam{lexeme} statement allows you to specify that certain | |
1652 nonterminal tokens are also to be treated as lexemes. This means that | |
1653 the disregard token will be skipped following the lexeme, but not | |
1654 between the characters that constitute the lexeme. | |
1655 | |
1656 Lexemes correspond to the tokens that a lexical scanner, if you were | |
1657 using one, would commonly identify and pass to a parser as single | |
1658 tokens. You don't usually wish to disregard white space within these | |
1659 tokens. For example, in a grammar for a conventional programming | |
1660 language where blank characters are to be disregarded, you might | |
1661 include: | |
1662 | |
1663 \begin{indentingcode}{0.4in} | |
1664 {}[ lexeme \bra string, character constant, name, number \ket ] | |
1665 \end{indentingcode} | |
1666 | |
1667 since blank characters must not be overlooked within strings and | |
1668 character constants and should not be permitted within names or | |
1669 numbers. | |
1670 | |
1671 Normally, AnaGram considers the disregard token to be optional; | |
1672 however, there are circumstances where treating the disregard token as | |
1673 optional would lead to conflicts: two successive names, or two | |
1674 successive numbers, for example. In this case, you would like to | |
1675 require that the lexemes be separated by instances of the disregard | |
1676 token. To do this, simply set the | |
1677 \index{Distinguish lexemes}\index{Configuration switches} | |
1678 \agparam{distinguish lexemes} | |
1679 configuration switch. | |
1680 When this switch is set, AnaGram requires disregard tokens | |
1681 in those situations where making them optional would | |
1682 lead to conflicts. | |
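
For instance, a configuration section along the following lines would
set the switch together with the statements it modifies. This is only
a sketch: the token names \agcode{white space}, \agcode{name}, and
\agcode{number} are assumed to be defined elsewhere in the grammar.

\begin{indentingcode}{0.4in}
{}[
  disregard white space
  lexeme \bra name, number \ket
  distinguish lexemes
]
\end{indentingcode}

With these settings, two successive names would have to be separated
by at least one instance of \agcode{white space}.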
1683 | |
1684 White space may be used explicitly within definitions of lexeme tokens | |
1685 in your grammar if desired, without causing conflicts. Thus, if you | |
1686 wish to allow embedded space in variable names, you might write: | |
1687 | |
1688 \begin{indentingcode}{0.4in} | |
1689 {}[ | |
1690 disregard space | |
1691 lexeme \bra variable name \ket | |
1692 ] | |
1693 space = ' ' + '{\bs}t' | |
1694 letter = 'a-z' + 'A-Z' | |
1695 digit = '0-9' | |
1696 | |
1697 variable name | |
1698 -> letter | |
1699 -> variable name, letter + digit | |
1700 -> variable name, space..., letter + digit | |
1701 \end{indentingcode} | |
1702 | |
1703 \subsection{Enum Statement} | |
1704 \index{Enum statement}\index{Enumeration}\index{Token} | |
1705 | |
1706 The \agparam{enum} statement follows rules nearly identical to those | |
1707 for C and C++. This makes it possible to copy an enum statement from | |
1708 your syntax file to a program file written in either C or C++, without | |
1709 any need for editing. The only differences are that AnaGram makes no | |
1710 provision for blank lines within the enumeration list, nor does it | |
1711 accept a type name. The \agparam{enum} statement is equivalent to a | |
1712 corresponding set of definition statements. It is especially useful | |
1713 when a parser is accepting token input from another program, a | |
1714 \index{Lexical scanner}lexical scanner, for example. Using | |
1715 the enum statement you may conveniently define all the identification | |
1716 codes for the input tokens. | |
1717 | |
1718 Each entry in an enum statement may be either a name, or a name | |
1719 followed by an ``='' sign and a character representation. If there is | |
1720 a character representation the name is assigned the value of the | |
1721 specified character. Otherwise it is assigned a value one more than | |
1722 that assigned to the previous name. If the first name in the list is | |
1723 not given an explicit value, it will be given the value zero. For | |
1724 example: | |
1725 | |
1726 \begin{indentingcode}{0.4in} | |
1727 {}[ | |
1728 enum \bra | |
1729 eof, a,b,c, | |
1730 blank = '\ ', x, y | |
1731 \ket | |
1732 ] | |
1733 \end{indentingcode} | |
1734 | |
1735 is equivalent to the following definition statements: | |
1736 | |
1737 \begin{indentingcode}{0.4in} | |
1738 eof = 0 | |
1739 a = 1 | |
1740 b = 2 | |
1741 c = 3 | |
1742 blank = '\ ' | |
1743 x = 33 | |
1744 y = 34 | |
1745 \end{indentingcode} | |
1746 | |
1747 \subsection{Subgrammar Declarations} | |
1748 \index{Subgrammar declaration}\index{Declaration} | |
1749 | |
1750 A \agparam{subgrammar} declaration can be a useful way to deal with | |
1751 conflicts in certain situations. It tells AnaGram to treat the tokens | |
1752 listed in the declaration as though each were a grammar token, | |
1753 each specifying a complete subgrammar in itself, and, in determining | |
1754 shift and reduction actions, to ignore the usage of the tokens in the | |
1755 larger grammar. | |
1756 | |
1757 In some cases it is perfectly reasonable to ignore usage. The most | |
1758 common example occurs when building a lexical scanner for a language | |
1759 such as C, as in the example in Section 7.4.4. In this case, you can | |
1760 write a complete grammar for a C token with no difficulty. But if you | |
1761 try to extend it to a sequence of tokens, you get scores of conflicts. | |
1762 This situation arises because you specify that any C token can follow | |
1763 another, when in actual practice, an identifier, for example, cannot | |
1764 follow another identifier without some intervening space or | |
1765 punctuation. | |
1766 | |
1767 It is theoretically possible, but in practice quite awkward, to write | |
1768 a grammar for a sequence of tokens so that there are no conflicts. | |
1769 The subgrammar declaration provides a way around this problem by | |
1770 telling AnaGram that when it is looking for reducing tokens for any | |
1771 rule produced directly or indirectly by a subgrammar token, it should | |
1772 disregard the usage of the token and only consider usage internal to | |
1773 the definition of the subgrammar token, as though the subgrammar token | |
1774 were the start token of the grammar. | |
1775 | |
1776 The subgrammar declaration is made in a configuration section and | |
1777 consists of the keyword \agcode{subgrammar} followed by a list of one | |
1778 or more nonterminal token names, separated by commas and enclosed in | |
1779 braces (\bra \ket). For example: | |
1780 | |
1781 \begin{indentingcode}{0.4in} | |
1782 {}[ subgrammar \bra C token, word \ket ] | |
1783 \end{indentingcode} | |
1784 | |
1785 Since the subgrammar statement changes the way AnaGram determines | |
1786 reducing tokens, it should be used with caution. You should be sure | |
1787 that the conflicts you are eliminating are really inconsequential. | |
1788 | |
1789 \subsection{Reserve Keywords Declaration} | |
1790 \index{Reserve keywords}\index{Keywords}\index{Keyword anomalies} | |
1791 | |
1792 The \agparam{reserve keywords} declaration can be used to specify a | |
1793 list of keywords that are reserved and cannot be used except as | |
1794 explicitly specified in the grammar. This enables AnaGram to avoid | |
1795 issuing meaningless keyword anomaly diagnostics (see \S 7.5). AnaGram | |
1796 does not automatically presume that keywords are also reserved words, | |
1797 since in many grammars there is no need to specify reserved words. | |
1798 | |
1799 The reserve keywords declaration is made in a configuration section | |
1800 and consists of the words \agcode{reserve keywords} followed by a list | |
1801 of one or more keyword strings, separated by commas and enclosed in | |
1802 braces (\bra \ket). For example: | |
1803 | |
1804 \begin{indentingcode}{0.4in} | |
1805 {}[ reserve keywords \bra "int", "char", "float", "double" \ket ] | |
1806 \end{indentingcode} | |
1807 | |
1808 \subsection{Rename Macro Statement} | |
1809 \index{Rename macro}\index{Macros} | |
1810 | |
1811 AnaGram uses a number of macros in its generated code. It is | |
1812 possible, therefore, to run into naming collisions with other | |
1813 components of your program. The \agparam{rename macro} statement | |
1814 allows you to change the name AnaGram uses for a particular macro to | |
1815 avoid these problems. For example, the Windows NT operating system | |
1816 uses \agcode{CONTEXT} structures to perform various internal | |
1817 operations. If you use the context tracking option (see \S 9.5.4) | |
1818 your parser will have a macro called \agcode{CONTEXT}. To avoid the | |
1819 name collision, add the following statement to any configuration | |
1820 section in your grammar: | |
1821 | |
1822 \begin{indentingcode}{0.4in} | |
1823 rename macro CONTEXT AG{\us}CONTEXT | |
1824 \end{indentingcode} | |
1825 | |
1826 Then, simply use \agcode{AG{\us}CONTEXT} where you would otherwise have | |
1827 used \agcode{CONTEXT}. |