Mercurial > ~dholland > hg > ag > index.cgi
comparison doc/manual/sbb.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
-1:000000000000 | 0:13d2b8934445 |
---|---|
1 \chapter{Syntactic Building Blocks} | |
2 | |
3 This appendix contains examples of the productions necessary to parse | |
4 the more common syntactic elements in your grammar. Each is provided | |
5 with sample reduction procedures, which are adequate in the vast | |
6 majority of cases. You can use these productions like building blocks | |
7 to quickly assemble quite powerful grammars. These productions can be | |
8 also be found in the file \agfile{sbb-doc.txt} in the | |
9 \agfile{examples} directory of your AnaGram distribution. An | |
10 HTML version may be found in the \agfile{html} directory. | |
11 % XXX: s/disk// | |
12 | |
13 \index{End of file} | |
14 \paragraph{End of File.} | |
15 If your parser is going to use conventional stream I/O to read its | |
16 input, you can define end of file as minus one: | |
17 \begin{indentingcode}{0.4in} | |
18 eof = $-$1 | |
19 \end{indentingcode} | |
20 | |
21 If your parser is going to run under Windows/DOS and will use low level I/O | |
22 instead of stream I/O you might define eof as Control Z: | |
23 | |
24 \begin{indentingcode}{0.4in} | |
25 eof = \^{}Z | |
26 \end{indentingcode} | |
27 | |
28 % XXX the following is crap | |
29 If your parser is going to run under UNIX and will use low level I/O instead of | |
30 stream I/O you might define eof as Control D: | |
31 \begin{indentingcode}{0.4in} | |
32 eof = \^{}D | |
33 \end{indentingcode} | |
34 | |
35 If your parser is simply going to parse a string in memory, the | |
36 definition of end of file should be a null character: | |
37 | |
38 \begin{indentingcode}{0.4in} | |
39 eof = 0 | |
40 \end{indentingcode} | |
41 | |
42 It is often convenient, however, to simply define end of file so it | |
43 will work in all of these contexts: | |
44 | |
45 \begin{indentingcode}{0.4in} | |
46 eof = $-$1 + 0 + \^{}D + \^{}Z | |
47 \end{indentingcode} | |
48 | |
49 \index{White space} | |
50 \paragraph{White Space.} | |
51 It is convenient to have a representation for white space in your | |
52 grammar. Usually you do not wish to distinguish between space | |
53 characters and tab characters, so you can write: | |
54 | |
55 \begin{indentingcode}{0.4in} | |
56 blank = ' ' + '{\bs}t' | |
57 \end{indentingcode} | |
58 | |
59 Using this definition, you can represent required white space of | |
60 indeterminate length with \agcode{blank...} and optional white space | |
61 with \agcode{blank?...} | |
62 | |
63 It is common, however, to include comments (see below) as white space. | |
64 In this case, you can define the following productions: | |
65 | |
66 \begin{indentingcode}{0.4in} | |
67 ws | |
68 -> blank | |
69 -> comment | |
70 \end{indentingcode} | |
71 | |
72 \index{End of line} | |
73 \paragraph{End of Line.} | |
74 | |
75 Because different systems use different representations for end of | |
76 line, it is wise to use an abstract end of line token in your grammar, | |
77 and define the token separately. If your parser is going to use files | |
78 that are known to use carriage return alone or line feed alone as the | |
79 end of line delimiter you can use one of the following definitions: | |
80 | |
81 \begin{indentingcode}{0.4in} | |
82 eol = '{\bs}r' // carriage return only | |
83 eol = '{\bs}n' // line feed only | |
84 \end{indentingcode} | |
85 | |
86 If your input files use the newline character as a line terminator, | |
87 but you wish to allow for optional carriage returns, you might write: | |
88 | |
89 \begin{indentingcode}{0.4in} | |
90 eol | |
91 -> '{\bs}r'?, '{\bs}n' | |
92 \end{indentingcode} | |
93 | |
94 or even | |
95 | |
96 \begin{indentingcode}{0.4in} | |
97 eol | |
98 -> '{\bs}r'?..., '{\bs}n' | |
99 \end{indentingcode} | |
100 | |
101 If you wish to make a grammar that can deal with a file of more or | |
102 less arbitrary provenance, some caution is in order. If you allow | |
103 carriage return and line feed both to function as end of line | |
104 characters and you allow blank lines, you may wind up with a very | |
105 ambiguous grammar. For example, suppose you use \agcode{eol...} | |
106 somewhere in your grammar and you have defined eol thus: | |
107 | |
108 \begin{indentingcode}{0.4in} | |
109 eol | |
110 -> '{\bs}r' | |
111 -> '{\bs}n' | |
112 -> '{\bs}r', '{\bs}n' // trouble! | |
113 \end{indentingcode} | |
114 | |
115 Your grammar is ambiguous since a carriage return, line feed pair can | |
116 be considered two line endings, according to the first two | |
117 productions, or just one, according to the third. | |
118 | |
119 If you really need to allow isolated carriage returns and linefeeds to | |
120 mark ends of line, but need to deal with the pair as well, you can | |
121 make good use of the idiosyncracies of AnaGram's keyword logic by | |
122 writing the following: | |
123 | |
124 \begin{indentingcode}{0.4in} | |
125 eol | |
126 -> '{\bs}r' | |
127 -> '{\bs}n' | |
128 -> "{\bs}r{\bs}n" | |
129 \end{indentingcode} | |
130 | |
131 This will treat a carriage return followed by a line feed as a single | |
132 end of line. | |
133 | |
134 \index{Comments}\index{Parsing} | |
135 \paragraph{Comments.} | |
136 There are two different categories of comments in common use: those | |
137 that run to the end of line, and those that run to a specific | |
138 terminator. The first can be conveniently incorporated into your end | |
139 of line token as a virtual production: | |
140 | |
141 \begin{indentingcode}{0.4in} | |
142 eol | |
143 -> ["//", \~{}(eof + '{\bs}n')?...], '{\bs}n' | |
144 \end{indentingcode} | |
145 | |
146 Thus, \agcode{eol} allows for an optional comment at the end of every | |
147 line. Note that you never want to allow end of file characters in | |
148 your comments. | |
149 | |
150 % XXX: simply/simple | |
151 Conventional C comments are slightly more complicated, depending on | |
152 your treatment of nesting. If you are not interested in nesting | |
153 comments, you can simply use the following simple syntax: | |
154 | |
155 \begin{indentingcode}{0.4in} | |
156 comment | |
157 -> "/*", \~{}eof?..., "*/" | |
158 \end{indentingcode} | |
159 | |
160 Note that this production works because keywords take precedence over | |
161 ordinary character input. | |
162 % XXX add: Otherwise, it would probably conflict with other uses of | |
163 % ``*'' and ``/''. | |
164 | |
165 A somewhat more complex way to write this is useful: | |
166 | |
167 \begin{indentingcode}{0.4in} | |
168 comment | |
169 -> comment head, "*/" | |
170 -> comment head, comment | |
171 comment head | |
172 -> "/*" | |
173 -> comment head, \~{}eof | |
174 \end{indentingcode} | |
175 | |
176 Note that where the previous grammar simply ignored the beginning of | |
177 any nested comment, this grammar recognizes a nested comment, but then | |
178 deliberately chooses to ignore nesting. | |
179 % XXX add: | |
180 | |
181 If you want your comments to nest, you can easily rewrite this grammar | |
182 to allow nesting: | |
183 | |
184 \begin{indentingcode}{0.4in} | |
185 comment | |
186 -> comment head, "*/" | |
187 comment head | |
188 -> "/*" | |
189 -> comment head, \~{}eof | |
190 -> comment head, comment | |
191 \end{indentingcode} | |
192 | |
193 Note that the only difference between nested and non-nested comments | |
194 is whether the grammar rule \agcode{comment head, comment} reduces to | |
195 \agcode{comment head} or to \agcode{comment}. | |
196 | |
197 If you wish to defer the question of nesting to run-time, you can | |
198 use a semantically determined production to make the decision: | |
199 | |
200 \begin{indentingcode}{0.4in} | |
201 comment | |
202 -> comment head, "*/" | |
203 comment head | |
204 -> "/*" | |
205 -> comment head, \~{}eof | |
206 comment, comment head | |
207 -> comment head, comment = check{\us}nesting(); | |
208 \end{indentingcode} | |
209 | |
210 Although the final production in this grammar has a somewhat | |
211 inscrutable air about it, a little study will show that if | |
212 \agcode{check{\us}nesting} sets the reduction token to \agcode{comment}, | |
213 the effective grammar is the same as the grammar above for non-nested | |
214 comments. On the other hand, if \agcode{check{\us}nesting} sets the | |
215 reduction token to comment head, the effective grammar is the same as | |
216 the grammar for nested comments. | |
217 | |
218 Assuming you have a switch called \agcode{nest{\us}comments}, you could | |
219 write \agcode{check{\us}nesting} as follows: | |
220 | |
221 \begin{indentingcode}{0.4in} | |
222 check{\us}nesting() \bra | |
223 if (nest{\us}comments) | |
224 CHANGE{\us}REDUCTION(comment{\us}head); | |
225 else | |
226 CHANGE{\us}REDUCTION(comment); | |
227 \ket | |
228 \end{indentingcode} | |
229 % XXX add some void to that | |
230 | |
231 The \agcode{else} clause in this procedure is technically unnecessary | |
232 since \agcode{comment} is the default value of the reduction token. | |
233 | |
234 If you wish to skip white space in your input, and wish to consider | |
235 comments as simple white space, you might want to add the following | |
236 production to your grammar: | |
237 | |
238 \begin{indentingcode}{0.4in} | |
239 ws | |
240 -> blank | |
241 -> comment | |
242 \end{indentingcode} | |
243 | |
244 You can then use the \agparam{disregard} statement to ignore | |
245 \agcode{ws} appropriately. | |
246 | |
247 \index{Integers}\index{Parsing} | |
248 \paragraph{Integers.} | |
249 It is quite easy to convert ordinary integers to binary form. For | |
250 instance: | |
251 | |
252 \begin{indentingcode}{0.4in} | |
253 (int) decimal integer | |
254 -> '0-9':d = d-'0'; | |
255 -> decimal integer:n, '0-9':d = 10*n + d-'0'; | |
256 \end{indentingcode} | |
257 | |
258 If necessary, you can specify that the value of decimal integer be | |
259 maintained as a long: | |
260 | |
261 \begin{indentingcode}{0.4in} | |
262 (long) decimal integer | |
263 -> '0-9':d = d-'0'; | |
264 -> decimal integer:n, '0-9':d = 10*n + d-'0'; | |
265 \end{indentingcode} | |
266 | |
267 Other representations are equally simple: | |
268 | |
269 \begin{indentingcode}{0.4in} | |
270 (long) octal integer | |
271 -> '0' = 0; | |
272 -> octal integer:n, '0-7':d = 8*n + d-'0'; | |
273 | |
274 (long) hex digit | |
275 -> '0-9':d = d-'0'; | |
276 -> 'A-F' + 'a-f':d = 9 + (d \& 7); | |
277 | |
278 (long) hex integer | |
279 -> {\bra}"0x" + "0X"{\ket}, hex digit:d = d; | |
280 -> hex integer:n, hex digit:d = 16*n + d; | |
281 \end{indentingcode} | |
282 | |
283 % XXX: s/if/if as shown/ | |
284 Note that if you use the C convention for octal integers, you have to | |
285 redefine decimal integer to avoid ambiguity: | |
286 | |
287 \begin{indentingcode}{0.4in} | |
288 (long) decimal integer | |
289 -> '1-9':d = d-'0'; | |
290 -> decimal integer:n, '0-9':d = 10*n + d-'0'; | |
291 \end{indentingcode} | |
292 | |
293 % XXX add: This is because 0 standing alone could be either octal or | |
294 % decimal. | |
295 | |
296 You can then represent the general case as follows: | |
297 | |
298 \begin{indentingcode}{0.4in} | |
299 (long) integer | |
300 -> decimal integer | octal integer | hex integer | |
301 \end{indentingcode} | |
302 | |
303 You can allow for a signed integer with the following productions: | |
304 | |
305 \begin{indentingcode}{0.4in} | |
306 (long) signed integer | |
307 -> '+'?, integer:n = n; | |
308 -> '-', integer:n = -n; | |
309 \end{indentingcode} | |
310 | |
311 If you have included a disregard statement in your grammar to skip | |
312 over irrelevant white space in your input, you might add the following | |
313 to avoid skipping white space inside a number: | |
314 | |
315 \begin{indentingcode}{0.4in} | |
316 {}[ lexeme \bra integer \ket ] | |
317 \end{indentingcode} | |
318 | |
319 Note that if you were to declare signed integer as a lexeme, your | |
320 parser would not allow space between a leading sign and the integer. | |
321 | |
322 % XXX. This should really at least mention dealing with integer | |
323 % overflow. | |
324 | |
325 \index{Floating point numbers} | |
326 \index{Parsing} | |
327 \paragraph{Floating Point Numbers.} | |
328 Floating point numbers are somewhat more complex than integers, but | |
329 not significantly so: | |
330 | |
331 \begin{indentingcode}{0.4in} | |
332 (double) real | |
333 -> simple real | |
334 -> integer part:r, exponent field:x = r*pow10(x); | |
335 -> simple real:r, exponent field:x = r*pow10(x); | |
336 | |
337 (double) simple real | |
338 -> integer part:i, '.', fraction part:f = i+f; | |
339 -> integer part, '.' | |
340 -> '.', fraction part:f = f; | |
341 | |
342 (double) integer part | |
343 -> '0-9':d = d-'0'; | |
344 -> integer part:n, '0-9':d = 10*n + d-'0'; | |
345 | |
346 (double) fraction part | |
347 -> '0-9':d = (d-'0')/10.; | |
348 -> '0-9':d, fraction part:f = (d-'0'+f)/10.; | |
349 | |
350 exponent field | |
351 -> 'e' + 'E', signed exponent:x = x; | |
352 | |
353 signed exponent | |
354 -> '+'?, exponent:n = n; | |
355 -> '-', exponent:n = -n; | |
356 | |
357 exponent | |
358 -> '0-9':d = d-'0'; | |
359 -> exponent:n, '0-9':d = 10*n + d-'0'; | |
360 \end{indentingcode} | |
361 | |
362 % XXX this is all wet - you really need to collect it as a string and | |
363 % then use strtod on it, or you lose precision. | |
364 | |
365 Note that fraction part uses right recursion rather than left | |
366 recursion. \agcode{exponent} is effectively the same as | |
367 \agcode{decimal integer}, above, but allows for initial zeroes. | |
368 | |
369 The situation becomes somewhat more complex if you wish to allow both | |
370 integer and floating point forms, particularly if you wish to follow | |
371 the C convention for octal integers. First, you cannot have distinct | |
372 productions for integer part and decimal integer, since there is no | |
373 way to distinguish them until a decimal point or an exponent field is | |
374 encountered. Second, since 0129.3 looks for all the world like an | |
375 octal number until the ``9'' is encountered, one either has to | |
376 postpone all conversion calculations until the issue is resolved or | |
377 resort to trickery. Here is a way to resolve the problem by | |
378 redefining integer part: | |
379 | |
380 \begin{indentingcode}{0.4in} | |
381 (double) integer part | |
382 -> confusion | |
383 -> octal integer:n = octal2decimal(n); | |
384 -> decimal integer:n = n; | |
385 | |
386 (double) confusion | |
387 -> octal integer:n, '8-9':d = 10*octal2decimal(n) + d-'0'; | |
388 -> confusion:x, '0-9':d = 10*x + d-'0'; | |
389 \end{indentingcode} | |
390 | |
391 where octal2decimal is defined thus: | |
392 | |
393 \begin{indentingcode}{0.4in} | |
394 double octal2decimal(int n) \bra | |
395 if (n) return 10*octal2decimal(n/8) + n\%8; | |
396 else return 0; | |
397 \ket | |
398 \end{indentingcode} | |
399 | |
400 Here confusion represents a number that starts off looking like an | |
401 octal integer, but then turns into a decimal number, because an eight, | |
402 a nine, a decimal pointer or an exponent field is encountered. When | |
403 this occurs, the function octal2decimal undoes the octal conversion | |
404 that had already been done, and redoes the conversion as decimal | |
405 conversion. | |
406 | |
407 If you have included a disregard statement in your grammar to skip | |
408 over irrelevant white space in your input, you might add the following | |
409 to avoid skipping white space inside a real: | |
410 | |
411 \begin{indentingcode}{0.4in} | |
412 {}[ lexeme \bra real \ket ] | |
413 \end{indentingcode} | |
414 | |
415 \index{Names}\index{Parsing} | |
416 \paragraph{Names.} | |
417 In almost all grammars, it is necessary to identify names. To | |
418 accumulate the characters that make up the name it is convenient to | |
419 use a stack. The reduction procedures in this example use the | |
420 functions \agcode{ipn} and \agcode{pcn} to accumulate the characters. | |
421 \agcode{ipn} initializes storage for the name and stores the first | |
422 character. \agcode{pcn} adds a single character to the name. This | |
423 grammar presumes the existence of a function called | |
424 \agcode{identify{\us}name} which would look up the name in a symbol table | |
425 and return an identifying index. | |
426 | |
427 \begin{indentingcode}{0.4in} | |
428 letter = 'a-z' + 'A-Z' + '{\us}' | |
429 digit = '0-9' | |
430 | |
431 (int) name | |
432 -> name string = identify{\us}name(); | |
433 | |
434 name string | |
435 -> letter:c = ipn(c); | |
436 -> name string, letter+digit:c = pcn(c); | |
437 | |
438 \bra // Embedded C to accumulate name | |
439 char name[82]; | |
440 int name{\us}length; | |
441 | |
442 /* Init and Put char to Name */ | |
443 void ipn(int c) \bra | |
444 name[0] = c; | |
445 name{\us}length = 1; | |
446 \ket | |
447 | |
448 /* Put Char to Name */ | |
449 void pcn(int c) \bra | |
450 assert(name{\us}length < 82); | |
451 name[name{\us}length++] = c; | |
452 \ket | |
453 \ket // End of embedded C | |
454 \end{indentingcode} | |
455 | |
456 % XXX: can I please change the names of ipn/pcn, make the global data | |
457 % static, and so on? | |
458 | |
459 If you have included a disregard statement in your grammar to skip | |
460 over irrelevant white space in your input, you might add the following | |
461 to avoid skipping white space inside a name: | |
462 | |
463 \begin{indentingcode}{0.4in} | |
464 {}[ lexeme \bra name \ket ] | |
465 \end{indentingcode} | |
466 | |
467 \index{Names with embedded white space}\index{Parsing} | |
468 \paragraph{Names with Embedded White Space.} | |
469 It is sometimes convenient to allow symbol names to contain embedded | |
470 white space. This is easily done, although it requires a bit of a | |
471 trick: | |
472 | |
473 \begin{indentingcode}{0.4in} | |
474 name | |
475 -> name string, ws?... = identify{\us}name(); | |
476 | |
477 name string | |
478 -> letter:c = ipn(c); | |
479 -> name string, letter+digit:c = pcn(c); | |
480 -> name string, ws..., letter+digit:c = pcn(' '), pcn(c); | |
481 \end{indentingcode} | |
482 | |
483 Note that the last production reduces all contiguous white space | |
484 within a name to a single blank. | |
485 | |
486 Allowing optional blanks following \agcode{name string} is important. | |
487 If you don't allow them there, any \agcode{ws} following a name will | |
488 be presumed to be embedded \agcode{ws}, requiring another | |
489 \agcode{letter} or \agcode{digit}, which is not what you intend. | |
490 There are two ways to accomplish this. The first, shown above, | |
491 explicitly allows for optional white space following \agcode{name | |
492 string}. The second method is to use the disregard and lexeme | |
493 statements: | |
494 | |
495 \begin{indentingcode}{0.4in} | |
496 {}[ | |
497 disregard ws | |
498 lexeme \bra name \ket | |
499 ] | |
500 \end{indentingcode} | |
501 | |
502 If you use the \agparam{disregard} statement you should not include a | |
503 provision for white space in the production for \agcode{name}. Just | |
504 leave it as it was in the previous example. | |
505 % XXX: s/provision for/provision for trailing/ | |
506 | |
507 \index{Character Strings}\index{Parsing} | |
508 \paragraph{Character Strings.} | |
509 Character strings are often required. The simplest way to implement | |
510 character strings is as follows: | |
511 | |
512 \begin{indentingcode}{0.4in} | |
513 character string | |
514 -> '"', \~{}(eof + '"')?..., '"' | |
515 \end{indentingcode} | |
516 | |
517 This approach has the disadvantage that it makes no provision for | |
518 nonprinting characters. | |
519 | |
520 There are numerous ways to provide for nonprinting characters in your | |
521 character strings. However, you can avoid tedious documentation by | |
522 using the same rules for nonprinting characters that C uses. | |
523 Unfortunately, the C rules for octal and hexadecimal escape sequences | |
524 complicate the writing of the grammar quite substantially. For | |
525 example, if you wish to write a string that consists of ASCII code 11 | |
526 followed by the ascii digit ``2'', you must pad with a leading zero, | |
527 writing \agcode{{\bs}0132}, since \agcode{{\bs}132} | |
528 according to the rules is a single three digit octal escape sequence | |
529 designating ASCII code 90. The problem is that the rules allow for | |
530 one, two or three digit octal escape sequences, but sequences shorter | |
531 than three digits have to be followed by the end of the string or a | |
532 character that is not an ASCII digit. There is a similar, if not | |
533 worse, problem with hex escape sequences. There is no limit on the | |
534 length of a hex escape sequence, so there is no possible way to make | |
535 the character ``2'' follow a hex escape sequence without using another | |
536 escape sequence. | |
537 | |
538 A straightforward approach to writing a grammar for strings consistent | |
539 with the C conventions yields a number of conflicts which correspond | |
540 exactly to the problems discussed above. While it is certainly | |
541 possible to write a grammar for strings that has no conflicts, it is | |
542 easier to proceed in a straightforward manner and use a sticky | |
543 declaration to resolve the ambiguities. | |
544 | |
545 Here is the complete grammar for a character string in accordance with | |
546 the C rules. In order to accumulate the contents of the string, this | |
547 grammar uses the functions \agcode{ics} and \agcode{acs}, to | |
548 initialize storage for a character string and to append a character to | |
549 that string respectively. | |
550 % XXX can I please change the names of ics and acs? | |
551 | |
552 \begin{indentingcode}{0.4in} | |
553 character string | |
554 -> string chars, '"' | |
555 | |
556 string chars | |
557 -> '"' = ics(); | |
558 -> string chars, string char:c = acs(c); | |
559 | |
560 string char | |
561 -> simple string char | |
562 -> escape sequence | |
563 | |
564 simple string char = \~{}eof - ('"' + '{\bs\bs}' + '{\bs}n') | |
565 | |
566 (int) escape sequence | |
567 -> "{\bs\bs}a" ='{\bs}a'; | |
568 -> "{\bs\bs}b" ='{\bs}b'; | |
569 -> "{\bs\bs}f" ='{\bs}f'; | |
570 -> "{\bs\bs}n" ='{\bs}n'; | |
571 -> "{\bs\bs}r" ='{\bs}r'; | |
572 -> "{\bs\bs}t" ='{\bs}t'; | |
573 -> "{\bs\bs}v" ='{\bs}v'; | |
574 -> "{\bs\bs\bs\bs}" ='{\bs\bs}'; | |
575 -> "{\bs\bs}?" = '{\bs}?'; | |
576 -> "{\bs\bs}'" ='{\bs}''; | |
577 -> "{\bs\bs\bs}"" = '"'; | |
578 -> octal escape | |
579 -> hex escape | |
580 | |
581 (int) octal escape | |
582 -> one octal | two octal | three octal | |
583 | |
584 (int) one octal | |
585 -> '{\bs\bs}', '0-7':d = d-'0'; | |
586 | |
587 (int) two octal | |
588 -> one octal:n, '0-7':d = 8*n + d-'0'; | |
589 | |
590 (int) three octal | |
591 -> two octal:n, '0-7':d = 8*n + d-'0'; | |
592 | |
593 (int) hex escape | |
594 -> "{\bs\bs}x", hex number | |
595 | |
596 (int) hex number | |
597 -> hex digit | |
598 -> hex number:n, hex digit:d = 16*n + d; | |
599 | |
600 {}[ | |
601 sticky one octal, two octal, hex number | |
602 ] | |
603 \bra // Embedded C to define ics and acs | |
604 char string{\us}store[82]; | |
605 int length; | |
606 | |
607 void ics(void) \bra | |
608 length = 0; | |
609 \ket | |
610 void acs(int c) \bra | |
611 assert(length < 82); | |
612 string{\us}store[length++] = c; | |
613 \ket | |
614 \ket // End of embedded C | |
615 \end{indentingcode} | |
616 | |
617 | |
618 If you have included a disregard statement in your grammar to skip | |
619 over irrelevant white space in your input, you might add the following | |
620 to avoid skipping white space inside a name: | |
621 % XXX inside a what? | |
622 | |
623 \begin{indentingcode}{0.4in} | |
624 {}[ lexeme \bra character string \ket ] | |
625 \end{indentingcode} | |
626 | |
627 \index{Character constants}\index{Parsing} | |
628 \paragraph{Character Constants.} | |
629 It is almost trivial to extend the above syntax for character strings | |
630 to allow simple character constants: | |
631 | |
632 \begin{indentingcode}{0.4in} | |
633 (int) character constant | |
634 -> '{\bs}'', simple char:c, '{\bs}'' = c; | |
635 | |
636 (int) simple char | |
637 -> \~{}eof - ('{\bs}'' + '{\bs\bs}' + '{\bs}n') | |
638 -> escape sequence | |
639 \end{indentingcode} | |
640 | |
641 The value of the character constant token is the character code found | |
642 in the input stream. | |
643 | |
644 If you have included a disregard statement in your grammar to skip | |
645 over irrelevant white space in your input, you might add the following | |
646 to avoid skipping white space inside a character constant: | |
647 | |
648 \begin{indentingcode}{0.4in} | |
649 {}[ lexeme \bra character constant \ket ] | |
650 \end{indentingcode} | |
651 | |
652 \index{Expressions}\index{Parsing} | |
653 \paragraph{Simple Expressions.} | |
654 It is often convenient to allow for simple expressions in your input. | |
655 The syntax for expression logic is written in a conventional way, | |
656 using a separate token for each precedence level desired. The | |
657 following grammar will accept simple addition, subtraction, | |
658 multiplication and division of floating point numbers: | |
659 | |
660 \begin{indentingcode}{0.4in} | |
661 (double) expression | |
662 -> term | |
663 -> expression:x, '+', term:t = x+t; | |
664 -> expression:x, '-', term:t = x-t; | |
665 | |
666 (double) term | |
667 -> factor | |
668 -> term:t, '*', factor:f = t*f; | |
669 -> term:t, '/', factor:f = t/f; | |
670 | |
671 (double) factor | |
672 -> real | |
673 -> '(', expression:x, ')' = x; | |
674 -> '-', factor:f = -f; | |
675 \end{indentingcode} | |
676 | |
677 An alternative way to write expression syntax is to write an ambiguous | |
678 grammar and use precedence declarations to resolve the conflicts: | |
679 | |
680 \begin{indentingcode}{0.4in} | |
681 (double) expression | |
682 -> expression:x, '+', expression:y = x+y; | |
683 -> expression:x, '-', expression:y = x-y; | |
684 -> expression:x, '*', expression:y = x*y; | |
685 -> expression:x, '/', expression:y = x/y; | |
686 -> unary minus, expression:x = -x; | |
687 -> '(', expression:x, ')' = x; | |
688 -> real | |
689 | |
690 unary minus = '-' | |
691 {}[ | |
692 left '+', '-' | |
693 left '*', '/' | |
694 right unary minus | |
695 ] | |
696 \end{indentingcode} | |
697 | |
698 Note that we deal with the different precedence levels of unary and | |
699 binary minus by defining \agcode{unary minus}, which AnaGram treats as | |
700 distinct from the simple \agcode{'-'}, and giving it different | |
701 precedence. |