comparison doc/manual/sbb.tex @ 0:13d2b8934445

Import AnaGram (near-)release tree into Mercurial.
author David A. Holland
date Sat, 22 Dec 2007 17:52:45 -0500
parents
children
comparison
equal deleted inserted replaced
-1:000000000000 0:13d2b8934445
1 \chapter{Syntactic Building Blocks}
2
3 This appendix contains examples of the productions necessary to parse
4 the more common syntactic elements in your grammar. Each is provided
5 with sample reduction procedures, which are adequate in the vast
6 majority of cases. You can use these productions like building blocks
7 to quickly assemble quite powerful grammars. These productions can be
8 also be found in the file \agfile{sbb-doc.txt} in the
9 \agfile{examples} directory of your AnaGram distribution. An
10 HTML version may be found in the \agfile{html} directory.
11 % XXX: s/disk//
12
13 \index{End of file}
14 \paragraph{End of File.}
15 If your parser is going to use conventional stream I/O to read its
16 input, you can define end of file as minus one:
17 \begin{indentingcode}{0.4in}
18 eof = $-$1
19 \end{indentingcode}
20
21 If your parser is going to run under Windows/DOS and will use low level I/O
22 instead of stream I/O you might define eof as Control Z:
23
24 \begin{indentingcode}{0.4in}
25 eof = \^{}Z
26 \end{indentingcode}
27
28 % XXX the following is crap
29 If your parser is going to run under UNIX and will use low level I/O instead of
30 stream I/O you might define eof as Control D:
31 \begin{indentingcode}{0.4in}
32 eof = \^{}D
33 \end{indentingcode}
34
35 If your parser is simply going to parse a string in memory, the
36 definition of end of file should be a null character:
37
38 \begin{indentingcode}{0.4in}
39 eof = 0
40 \end{indentingcode}
41
42 It is often convenient, however, to simply define end of file so it
43 will work in all of these contexts:
44
45 \begin{indentingcode}{0.4in}
46 eof = $-$1 + 0 + \^{}D + \^{}Z
47 \end{indentingcode}
48
49 \index{White space}
50 \paragraph{White Space.}
51 It is convenient to have a representation for white space in your
52 grammar. Usually you do not wish to distinguish between space
53 characters and tab characters, so you can write:
54
55 \begin{indentingcode}{0.4in}
56 blank = ' ' + '{\bs}t'
57 \end{indentingcode}
58
59 Using this definition, you can represent required white space of
60 indeterminate length with \agcode{blank...} and optional white space
61 with \agcode{blank?...}
62
63 It is common, however, to include comments (see below) as white space.
64 In this case, you can define the following productions:
65
66 \begin{indentingcode}{0.4in}
67 ws
68 -> blank
69 -> comment
70 \end{indentingcode}
71
72 \index{End of line}
73 \paragraph{End of Line.}
74
75 Because different systems use different representations for end of
76 line, it is wise to use an abstract end of line token in your grammar,
77 and define the token separately. If your parser is going to use files
78 that are known to use carriage return alone or line feed alone as the
79 end of line delimiter you can use one of the following definitions:
80
81 \begin{indentingcode}{0.4in}
82 eol = '{\bs}r' // carriage return only
83 eol = '{\bs}n' // line feed only
84 \end{indentingcode}
85
86 If your input files use the newline character as a line terminator,
87 but you wish to allow for optional carriage returns, you might write:
88
89 \begin{indentingcode}{0.4in}
90 eol
91 -> '{\bs}r'?, '{\bs}n'
92 \end{indentingcode}
93
94 or even
95
96 \begin{indentingcode}{0.4in}
97 eol
98 -> '{\bs}r'?..., '{\bs}n'
99 \end{indentingcode}
100
101 If you wish to make a grammar that can deal with a file of more or
102 less arbitrary provenance, some caution is in order. If you allow
103 carriage return and line feed both to function as end of line
104 characters and you allow blank lines, you may wind up with a very
105 ambiguous grammar. For example, suppose you use \agcode{eol...}
106 somewhere in your grammar and you have defined eol thus:
107
108 \begin{indentingcode}{0.4in}
109 eol
110 -> '{\bs}r'
111 -> '{\bs}n'
112 -> '{\bs}r', '{\bs}n' // trouble!
113 \end{indentingcode}
114
115 Your grammar is ambiguous since a carriage return, line feed pair can
116 be considered two line endings, according to the first two
117 productions, or just one, according to the third.
118
119 If you really need to allow isolated carriage returns and linefeeds to
120 mark ends of line, but need to deal with the pair as well, you can
121 make good use of the idiosyncracies of AnaGram's keyword logic by
122 writing the following:
123
124 \begin{indentingcode}{0.4in}
125 eol
126 -> '{\bs}r'
127 -> '{\bs}n'
128 -> "{\bs}r{\bs}n"
129 \end{indentingcode}
130
131 This will treat a carriage return followed by a line feed as a single
132 end of line.
133
134 \index{Comments}\index{Parsing}
135 \paragraph{Comments.}
136 There are two different categories of comments in common use: those
137 that run to the end of line, and those that run to a specific
138 terminator. The first can be conveniently incorporated into your end
139 of line token as a virtual production:
140
141 \begin{indentingcode}{0.4in}
142 eol
143 -> ["//", \~{}(eof + '{\bs}n')?...], '{\bs}n'
144 \end{indentingcode}
145
146 Thus, \agcode{eol} allows for an optional comment at the end of every
147 line. Note that you never want to allow end of file characters in
148 your comments.
149
150 % XXX: simply/simple
151 Conventional C comments are slightly more complicated, depending on
152 your treatment of nesting. If you are not interested in nesting
153 comments, you can simply use the following simple syntax:
154
155 \begin{indentingcode}{0.4in}
156 comment
157 -> "/*", \~{}eof?..., "*/"
158 \end{indentingcode}
159
160 Note that this production works because keywords take precedence over
161 ordinary character input.
162 % XXX add: Otherwise, it would probably conflict with other uses of
163 % ``*'' and ``/''.
164
165 A somewhat more complex way to write this is useful:
166
167 \begin{indentingcode}{0.4in}
168 comment
169 -> comment head, "*/"
170 -> comment head, comment
171 comment head
172 -> "/*"
173 -> comment head, \~{}eof
174 \end{indentingcode}
175
176 Note that where the previous grammar simply ignored the beginning of
177 any nested comment, this grammar recognizes a nested comment, but then
178 deliberately chooses to ignore nesting.
179 % XXX add:
180
181 If you want your comments to nest, you can easily rewrite this grammar
182 to allow nesting:
183
184 \begin{indentingcode}{0.4in}
185 comment
186 -> comment head, "*/"
187 comment head
188 -> "/*"
189 -> comment head, \~{}eof
190 -> comment head, comment
191 \end{indentingcode}
192
193 Note that the only difference between nested and non-nested comments
194 is whether the grammar rule \agcode{comment head, comment} reduces to
195 \agcode{comment head} or to \agcode{comment}.
196
197 If you wish to defer the question of nesting to run-time, you can
198 use a semantically determined production to make the decision:
199
200 \begin{indentingcode}{0.4in}
201 comment
202 -> comment head, "*/"
203 comment head
204 -> "/*"
205 -> comment head, \~{}eof
206 comment, comment head
207 -> comment head, comment = check{\us}nesting();
208 \end{indentingcode}
209
210 Although the final production in this grammar has a somewhat
211 inscrutable air about it, a little study will show that if
212 \agcode{check{\us}nesting} sets the reduction token to \agcode{comment},
213 the effective grammar is the same as the grammar above for non-nested
214 comments. On the other hand, if \agcode{check{\us}nesting} sets the
215 reduction token to comment head, the effective grammar is the same as
216 the grammar for nested comments.
217
218 Assuming you have a switch called \agcode{nest{\us}comments}, you could
219 write \agcode{check{\us}nesting} as follows:
220
221 \begin{indentingcode}{0.4in}
222 check{\us}nesting() \bra
223 if (nest{\us}comments)
224 CHANGE{\us}REDUCTION(comment{\us}head);
225 else
226 CHANGE{\us}REDUCTION(comment);
227 \ket
228 \end{indentingcode}
229 % XXX add some void to that
230
231 The \agcode{else} clause in this procedure is technically unnecessary
232 since \agcode{comment} is the default value of the reduction token.
233
234 If you wish to skip white space in your input, and wish to consider
235 comments as simple white space, you might want to add the following
236 production to your grammar:
237
238 \begin{indentingcode}{0.4in}
239 ws
240 -> blank
241 -> comment
242 \end{indentingcode}
243
244 You can then use the \agparam{disregard} statement to ignore
245 \agcode{ws} appropriately.
246
247 \index{Integers}\index{Parsing}
248 \paragraph{Integers.}
249 It is quite easy to convert ordinary integers to binary form. For
250 instance:
251
252 \begin{indentingcode}{0.4in}
253 (int) decimal integer
254 -> '0-9':d = d-'0';
255 -> decimal integer:n, '0-9':d = 10*n + d-'0';
256 \end{indentingcode}
257
258 If necessary, you can specify that the value of decimal integer be
259 maintained as a long:
260
261 \begin{indentingcode}{0.4in}
262 (long) decimal integer
263 -> '0-9':d = d-'0';
264 -> decimal integer:n, '0-9':d = 10*n + d-'0';
265 \end{indentingcode}
266
267 Other representations are equally simple:
268
269 \begin{indentingcode}{0.4in}
270 (long) octal integer
271 -> '0' = 0;
272 -> octal integer:n, '0-7':d = 8*n + d-'0';
273
274 (long) hex digit
275 -> '0-9':d = d-'0';
276 -> 'A-F' + 'a-f':d = 9 + (d \& 7);
277
278 (long) hex integer
279 -> {\bra}"0x" + "0X"{\ket}, hex digit:d = d;
280 -> hex integer:n, hex digit:d = 16*n + d;
281 \end{indentingcode}
282
283 % XXX: s/if/if as shown/
284 Note that if you use the C convention for octal integers, you have to
285 redefine decimal integer to avoid ambiguity:
286
287 \begin{indentingcode}{0.4in}
288 (long) decimal integer
289 -> '1-9':d = d-'0';
290 -> decimal integer:n, '0-9':d = 10*n + d-'0';
291 \end{indentingcode}
292
293 % XXX add: This is because 0 standing alone could be either octal or
294 % decimal.
295
296 You can then represent the general case as follows:
297
298 \begin{indentingcode}{0.4in}
299 (long) integer
300 -> decimal integer | octal integer | hex integer
301 \end{indentingcode}
302
303 You can allow for a signed integer with the following productions:
304
305 \begin{indentingcode}{0.4in}
306 (long) signed integer
307 -> '+'?, integer:n = n;
308 -> '-', integer:n = -n;
309 \end{indentingcode}
310
311 If you have included a disregard statement in your grammar to skip
312 over irrelevant white space in your input, you might add the following
313 to avoid skipping white space inside a number:
314
315 \begin{indentingcode}{0.4in}
316 {}[ lexeme \bra integer \ket ]
317 \end{indentingcode}
318
319 Note that if you were to declare signed integer as a lexeme, your
320 parser would not allow space between a leading sign and the integer.
321
322 % XXX. This should really at least mention dealing with integer
323 % overflow.
324
325 \index{Floating point numbers}
326 \index{Parsing}
327 \paragraph{Floating Point Numbers.}
328 Floating point numbers are somewhat more complex than integers, but
329 not significantly so:
330
331 \begin{indentingcode}{0.4in}
332 (double) real
333 -> simple real
334 -> integer part:r, exponent field:x = r*pow10(x);
335 -> simple real:r, exponent field:x = r*pow10(x);
336
337 (double) simple real
338 -> integer part:i, '.', fraction part:f = i+f;
339 -> integer part, '.'
340 -> '.', fraction part:f = f;
341
342 (double) integer part
343 -> '0-9':d = d-'0';
344 -> integer part:n, '0-9':d = 10*n + d-'0';
345
346 (double) fraction part
347 -> '0-9':d = (d-'0')/10.;
348 -> '0-9':d, fraction part:f = (d-'0'+f)/10.;
349
350 exponent field
351 -> 'e' + 'E', signed exponent:x = x;
352
353 signed exponent
354 -> '+'?, exponent:n = n;
355 -> '-', exponent:n = -n;
356
357 exponent
358 -> '0-9':d = d-'0';
359 -> exponent:n, '0-9':d = 10*n + d-'0';
360 \end{indentingcode}
361
362 % XXX this is all wet - you really need to collect it as a string and
363 % then use strtod on it, or you lose precision.
364
365 Note that fraction part uses right recursion rather than left
366 recursion. \agcode{exponent} is effectively the same as
367 \agcode{decimal integer}, above, but allows for initial zeroes.
368
369 The situation becomes somewhat more complex if you wish to allow both
370 integer and floating point forms, particularly if you wish to follow
371 the C convention for octal integers. First, you cannot have distinct
372 productions for integer part and decimal integer, since there is no
373 way to distinguish them until a decimal point or an exponent field is
374 encountered. Second, since 0129.3 looks for all the world like an
375 octal number until the ``9'' is encountered, one either has to
376 postpone all conversion calculations until the issue is resolved or
377 resort to trickery. Here is a way to resolve the problem by
378 redefining integer part:
379
380 \begin{indentingcode}{0.4in}
381 (double) integer part
382 -> confusion
383 -> octal integer:n = octal2decimal(n);
384 -> decimal integer:n = n;
385
386 (double) confusion
387 -> octal integer:n, '8-9':d = 10*octal2decimal(n) + d-'0';
388 -> confusion:x, '0-9':d = 10*x + d-'0';
389 \end{indentingcode}
390
391 where octal2decimal is defined thus:
392
393 \begin{indentingcode}{0.4in}
394 double octal2decimal(int n) \bra
395 if (n) return 10*octal2decimal(n/8) + n\%8;
396 else return 0;
397 \ket
398 \end{indentingcode}
399
400 Here confusion represents a number that starts off looking like an
401 octal integer, but then turns into a decimal number, because an eight,
402 a nine, a decimal pointer or an exponent field is encountered. When
403 this occurs, the function octal2decimal undoes the octal conversion
404 that had already been done, and redoes the conversion as decimal
405 conversion.
406
407 If you have included a disregard statement in your grammar to skip
408 over irrelevant white space in your input, you might add the following
409 to avoid skipping white space inside a real:
410
411 \begin{indentingcode}{0.4in}
412 {}[ lexeme \bra real \ket ]
413 \end{indentingcode}
414
415 \index{Names}\index{Parsing}
416 \paragraph{Names.}
417 In almost all grammars, it is necessary to identify names. To
418 accumulate the characters that make up the name it is convenient to
419 use a stack. The reduction procedures in this example use the
420 functions \agcode{ipn} and \agcode{pcn} to accumulate the characters.
421 \agcode{ipn} initializes storage for the name and stores the first
422 character. \agcode{pcn} adds a single character to the name. This
423 grammar presumes the existence of a function called
424 \agcode{identify{\us}name} which would look up the name in a symbol table
425 and return an identifying index.
426
427 \begin{indentingcode}{0.4in}
428 letter = 'a-z' + 'A-Z' + '{\us}'
429 digit = '0-9'
430
431 (int) name
432 -> name string = identify{\us}name();
433
434 name string
435 -> letter:c = ipn(c);
436 -> name string, letter+digit:c = pcn(c);
437
438 \bra // Embedded C to accumulate name
439 char name[82];
440 int name{\us}length;
441
442 /* Init and Put char to Name */
443 void ipn(int c) \bra
444 name[0] = c;
445 name{\us}length = 1;
446 \ket
447
448 /* Put Char to Name */
449 void pcn(int c) \bra
450 assert(name{\us}length < 82);
451 name[name{\us}length++] = c;
452 \ket
453 \ket // End of embedded C
454 \end{indentingcode}
455
456 % XXX: can I please change the names of ipn/pcn, make the global data
457 % static, and so on?
458
459 If you have included a disregard statement in your grammar to skip
460 over irrelevant white space in your input, you might add the following
461 to avoid skipping white space inside a name:
462
463 \begin{indentingcode}{0.4in}
464 {}[ lexeme \bra name \ket ]
465 \end{indentingcode}
466
467 \index{Names with embedded white space}\index{Parsing}
468 \paragraph{Names with Embedded White Space.}
469 It is sometimes convenient to allow symbol names to contain embedded
470 white space. This is easily done, although it requires a bit of a
471 trick:
472
473 \begin{indentingcode}{0.4in}
474 name
475 -> name string, ws?... = identify{\us}name();
476
477 name string
478 -> letter:c = ipn(c);
479 -> name string, letter+digit:c = pcn(c);
480 -> name string, ws..., letter+digit:c = pcn(' '), pcn(c);
481 \end{indentingcode}
482
483 Note that the last production reduces all contiguous white space
484 within a name to a single blank.
485
486 Allowing optional blanks following \agcode{name string} is important.
487 If you don't allow them there, any \agcode{ws} following a name will
488 be presumed to be embedded \agcode{ws}, requiring another
489 \agcode{letter} or \agcode{digit}, which is not what you intend.
490 There are two ways to accomplish this. The first, shown above,
491 explicitly allows for optional white space following \agcode{name
492 string}. The second method is to use the disregard and lexeme
493 statements:
494
495 \begin{indentingcode}{0.4in}
496 {}[
497 disregard ws
498 lexeme \bra name \ket
499 ]
500 \end{indentingcode}
501
502 If you use the \agparam{disregard} statement you should not include a
503 provision for white space in the production for \agcode{name}. Just
504 leave it as it was in the previous example.
505 % XXX: s/provision for/provision for trailing/
506
507 \index{Character Strings}\index{Parsing}
508 \paragraph{Character Strings.}
509 Character strings are often required. The simplest way to implement
510 character strings is as follows:
511
512 \begin{indentingcode}{0.4in}
513 character string
514 -> '"', \~{}(eof + '"')?..., '"'
515 \end{indentingcode}
516
517 This approach has the disadvantage that it makes no provision for
518 nonprinting characters.
519
520 There are numerous ways to provide for nonprinting characters in your
521 character strings. However, you can avoid tedious documentation by
522 using the same rules for nonprinting characters that C uses.
523 Unfortunately, the C rules for octal and hexadecimal escape sequences
524 complicate the writing of the grammar quite substantially. For
525 example, if you wish to write a string that consists of ASCII code 11
526 followed by the ascii digit ``2'', you must pad with a leading zero,
527 writing \agcode{{\bs}0132}, since \agcode{{\bs}132}
528 according to the rules is a single three digit octal escape sequence
529 designating ASCII code 90. The problem is that the rules allow for
530 one, two or three digit octal escape sequences, but sequences shorter
531 than three digits have to be followed by the end of the string or a
532 character that is not an ASCII digit. There is a similar, if not
533 worse, problem with hex escape sequences. There is no limit on the
534 length of a hex escape sequence, so there is no possible way to make
535 the character ``2'' follow a hex escape sequence without using another
536 escape sequence.
537
538 A straightforward approach to writing a grammar for strings consistent
539 with the C conventions yields a number of conflicts which correspond
540 exactly to the problems discussed above. While it is certainly
541 possible to write a grammar for strings that has no conflicts, it is
542 easier to proceed in a straightforward manner and use a sticky
543 declaration to resolve the ambiguities.
544
545 Here is the complete grammar for a character string in accordance with
546 the C rules. In order to accumulate the contents of the string, this
547 grammar uses the functions \agcode{ics} and \agcode{acs}, to
548 initialize storage for a character string and to append a character to
549 that string respectively.
550 % XXX can I please change the names of ics and acs?
551
552 \begin{indentingcode}{0.4in}
553 character string
554 -> string chars, '"'
555
556 string chars
557 -> '"' = ics();
558 -> string chars, string char:c = acs(c);
559
560 string char
561 -> simple string char
562 -> escape sequence
563
564 simple string char = \~{}eof - ('"' + '{\bs\bs}' + '{\bs}n')
565
566 (int) escape sequence
567 -> "{\bs\bs}a" ='{\bs}a';
568 -> "{\bs\bs}b" ='{\bs}b';
569 -> "{\bs\bs}f" ='{\bs}f';
570 -> "{\bs\bs}n" ='{\bs}n';
571 -> "{\bs\bs}r" ='{\bs}r';
572 -> "{\bs\bs}t" ='{\bs}t';
573 -> "{\bs\bs}v" ='{\bs}v';
574 -> "{\bs\bs\bs\bs}" ='{\bs\bs}';
575 -> "{\bs\bs}?" = '{\bs}?';
576 -> "{\bs\bs}'" ='{\bs}'';
577 -> "{\bs\bs\bs}"" = '"';
578 -> octal escape
579 -> hex escape
580
581 (int) octal escape
582 -> one octal | two octal | three octal
583
584 (int) one octal
585 -> '{\bs\bs}', '0-7':d = d-'0';
586
587 (int) two octal
588 -> one octal:n, '0-7':d = 8*n + d-'0';
589
590 (int) three octal
591 -> two octal:n, '0-7':d = 8*n + d-'0';
592
593 (int) hex escape
594 -> "{\bs\bs}x", hex number
595
596 (int) hex number
597 -> hex digit
598 -> hex number:n, hex digit:d = 16*n + d;
599
600 {}[
601 sticky one octal, two octal, hex number
602 ]
603 \bra // Embedded C to define ics and acs
604 char string{\us}store[82];
605 int length;
606
607 void ics(void) \bra
608 length = 0;
609 \ket
610 void acs(int c) \bra
611 assert(length < 82);
612 string{\us}store[length++] = c;
613 \ket
614 \ket // End of embedded C
615 \end{indentingcode}
616
617
618 If you have included a disregard statement in your grammar to skip
619 over irrelevant white space in your input, you might add the following
620 to avoid skipping white space inside a name:
621 % XXX inside a what?
622
623 \begin{indentingcode}{0.4in}
624 {}[ lexeme \bra character string \ket ]
625 \end{indentingcode}
626
627 \index{Character constants}\index{Parsing}
628 \paragraph{Character Constants.}
629 It is almost trivial to extend the above syntax for character strings
630 to allow simple character constants:
631
632 \begin{indentingcode}{0.4in}
633 (int) character constant
634 -> '{\bs}'', simple char:c, '{\bs}'' = c;
635
636 (int) simple char
637 -> \~{}eof - ('{\bs}'' + '{\bs\bs}' + '{\bs}n')
638 -> escape sequence
639 \end{indentingcode}
640
641 The value of the character constant token is the character code found
642 in the input stream.
643
644 If you have included a disregard statement in your grammar to skip
645 over irrelevant white space in your input, you might add the following
646 to avoid skipping white space inside a character constant:
647
648 \begin{indentingcode}{0.4in}
649 {}[ lexeme \bra character constant \ket ]
650 \end{indentingcode}
651
652 \index{Expressions}\index{Parsing}
653 \paragraph{Simple Expressions.}
654 It is often convenient to allow for simple expressions in your input.
655 The syntax for expression logic is written in a conventional way,
656 using a separate token for each precedence level desired. The
657 following grammar will accept simple addition, subtraction,
658 multiplication and division of floating point numbers:
659
660 \begin{indentingcode}{0.4in}
661 (double) expression
662 -> term
663 -> expression:x, '+', term:t = x+t;
664 -> expression:x, '-', term:t = x-t;
665
666 (double) term
667 -> factor
668 -> term:t, '*', factor:f = t*f;
669 -> term:t, '/', factor:f = t/f;
670
671 (double) factor
672 -> real
673 -> '(', expression:x, ')' = x;
674 -> '-', factor:f = -f;
675 \end{indentingcode}
676
677 An alternative way to write expression syntax is to write an ambiguous
678 grammar and use precedence declarations to resolve the conflicts:
679
680 \begin{indentingcode}{0.4in}
681 (double) expression
682 -> expression:x, '+', expression:y = x+y;
683 -> expression:x, '-', expression:y = x-y;
684 -> expression:x, '*', expression:y = x*y;
685 -> expression:x, '/', expression:y = x/y;
686 -> unary minus, expression:x = -x;
687 -> '(', expression:x, ')' = x;
688 -> real
689
690 unary minus = '-'
691 {}[
692 left '+', '-'
693 left '*', '/'
694 right unary minus
695 ]
696 \end{indentingcode}
697
698 Note that we deal with the different precedence levels of unary and
699 binary minus by defining \agcode{unary minus}, which AnaGram treats as
700 distinct from the simple \agcode{'-'}, and giving it different
701 precedence.