comparison examples/sbb-doc.txt @ 0:13d2b8934445

Import AnaGram (near-)release tree into Mercurial.
author David A. Holland
date Sat, 22 Dec 2007 17:52:45 -0500
parents
children
comparison
equal deleted inserted replaced
-1:000000000000 0:13d2b8934445
1 ------------------------------------------------------
2 Syntactic Building Blocks
3 Copyright 1993-1998, Parsifal Software.
4 All Rights Reserved.
5
6 This software is provided 'as-is', without any express or
7 implied warranty. In no event will the authors be held
8 liable for any damages arising from the use of this software.
9
10 Permission is granted to anyone to use this software for any
11 purpose, including commercial applications, and to alter it
12 and redistribute it freely, subject to the restriction that
13 this notice may not be removed or altered from any source
14 distribution.
15 ------------------------------------------------------
16
17
18 Introduction
19 ---------------
20
21 This file contains examples of the productions necessary to
22 parse the more common syntactic elements in your grammar. It
23 should be of interest once you have begun to write your own
24 grammars. Each example is provided with sample reduction
25 procedures, which are adequate in the vast majority of cases.
26 You can use these productions like building blocks to quickly
27 assemble quite powerful grammars. The text of this file is
28 also to be found in Appendix D of the AnaGram User's Guide.
29
30
31 End of File
32 --------------
33 If your parser is going to run under Unix or DOS and use
34 stream I/O to read its input, you can define end of file as
35 minus one:
36 eof = -1
37
38 If your parser is going to run under DOS and will use low
39 level I/O instead of stream I/O you might define eof as
40 Control Z:
41 eof = ^Z
42
43 If your parser is going to run under UNIX and will use low
44 level I/O instead of stream I/O you might define eof as
45 Control D:
46 eof = ^D
47
48 If your parser is simply going to parse a string in memory,
49 the definition of end of file should be a null character:
50 eof = 0
51
52 It is often convenient, however, to simply define end of
53 file so it will work in all of these contexts:
54 eof = -1 + 0 + ^D + ^Z
55
56
57 White Space
58 ----------------
59 It is convenient to have a representation for white space in
60 your grammar. Usually you do not wish to distinguish between
61 space characters and tab characters, so you can write:
62 blank = ' ' + '\t'
63
64 Using this definition, you can represent required white
65 space of indeterminate length with blank... and optional
66 white space with blank?...
67
68 It is common, however, to include comments (see below) as
69 white space. In this case, you can define the following
70 productions:
71 ws
72 -> blank
73 -> comment
74
75
76 End of Line
77 --------------
78 Because different systems use different representations for
79 end of line, it is wise to use an abstract end of line token
80 in your grammar, and define the token separately. If your
81 parser is going to use files that are known to use carriage
82 return alone or line feed alone as the end of line delimiter
83 you can use one of the following definitions:
84
85 eol = '\r' //carriage return only
86 eol = '\n' //line feed only
87
88 If your input files use the newline character as a line
89 terminator, but you wish to allow for optional carriage
90 returns, you might write:
91
92 eol
93 -> '\r'?, '\n'
94 or even
95 eol
96 -> '\r'?..., '\n'
97
98 If you wish to make a grammar that can deal with a file of
99 more or less arbitrary provenance, some caution is in order.
100 If you allow carriage return and line feed both to function
101 as end of line characters and you allow blank lines, you may
102 wind up with a very ambiguous grammar. For example, suppose
103 you use eol... somewhere in your grammar and you have
104 defined eol thus:
105 eol
106 -> '\r'
107 -> '\n'
108 -> '\r', '\n' // trouble!
109
110 Your grammar is ambiguous since a carriage return, line feed
111 pair can be considered two line endings, according to the
112 first two production, or just one, according to the third.
113
114 If you really need to allow isolated carriage returns and
115 linefeeds to mark ends of line, but need to deal with the
116 pair as well, you can make good use of the idiosyncracies of
117 AnaGram's keyword logic by writing the following:
118 eol
119 -> '\r'
120 -> '\n'
121 -> "\r\n"
122
123 This will treat a carriage return followed by a line feed as
124 a single end of line.
125
126
127 Comments
128 -------------
129
130 There are two different categories of comments in common
131 use: Those that run to the end of line, and those that run
132 to a specific terminator. The first can be conveniently
133 incorporated into your end of line token as a virtual
134 production:
135 eol
136 -> ["//", ~(eof + '\n')?...], '\n'
137
138 Thus, eol allows for an optional comment at the end of every
139 line. Note that you never want to allow end of file
140 characters in your comments.
141
142 Conventional C comments are slightly more complicated,
143 depending on your treatment of nesting. If you are not
144 interested in nesting comments, you can simply use the
145 following simple syntax:
146 comment
147 -> "/*", ~eof?..., "*/"
148
149 Note that this production works because keywords take
150 precedence over ordinary character input.
151
152 A somewhat more complex way to write this is useful:
153 comment
154 -> comment head, "*/"
155 -> comment head, comment
156 comment head
157 -> "/*"
158 -> comment head, ~eof
159
160 Note that where the previous grammar simply ignored the
161 beginning of any nested comment, this grammar recognizes a
162 nested comment, but then deliberately chooses to ignore
163 nesting.
164
165 If you want your comments to nest, you can easily rewrite
166 this grammar to allow nesting:
167
168 comment
169 -> comment head, "*/"
170 comment head
171 -> "/*"
172 -> comment head, ~eof
173 -> comment head, comment
174
175 Note that the only difference between nested and non-nested
176 comments is whether the grammar rule comment head, comment
177 reduces to comment head or to comment.
178
179 If you wish to defer the question of nesting to run-time,
180 you can use a semantically determined production to make the
181 decision:
182 comment
183 -> comment head, "*/"
184 comment head
185 -> "/*"
186 -> comment head, ~eof
187 comment, comment head
188 -> comment head, comment =check_nesting();
189
190 Although the final production in this grammar has a somewhat
191 inscrutable air about it, a little study will show that if
192 check_nesting sets the reduction token to comment, the
193 effective grammar is the same as the grammar above for
194 non-nested comments. On the other hand, if check_nesting
195 sets the reduction token to comment head, the effective
196 grammar is the same as the grammar for nested comments.
197
198 Assuming you have a switch called nest_comments, you could
199 write check_nesting as follows:
200
201 check_nesting() {
202 if (nest_comments)
203 CHANGE_REDUCTION(comment_head);
204 else
205 CHANGE_REDUCTION(comment);
206 }
207
208 The else clause in this procedure is technically unnecessary
209 since comment is the default value of the reduction token.
210
211 If you wish to skip white space in your input, and wish to
212 consider comments as simple white space, you might want to
213 add the following production to your grammar:
214 ws
215 -> blank
216 -> comment
217 You can then use the disregard statement to ignore ws
218 appropriately.
219
220
221 Integers
222 ----------
223
224 It is quite easy to convert ordinary integers to binary
225 form. For instance:
226 (int) decimal integer
227 -> '0-9':d =d-'0';
228 -> decimal integer:n, '0-9':d =10*n + d - '0';
229
230 If necessary, you can specify that the value of decimal
231 integer be maintained as a long:
232 (long) decimal integer
233 -> '0-9':d =d-'0';
234 -> decimal integer:n, '0-9':d =10*n + d - '0';
235
236 Other representations are equally simple:
237 (long) octal integer
238 -> '0' =0;
239 -> octal integer:n, '0-7':d =8*n + d - '0';
240
241 (long) hex digit
242 -> '0-9':d =d-'0';
243 -> 'A-F' + 'a-f':d =9 + (d & 7);
244
245 (long) hex integer
246 -> {"0x" + "0X"}, hex digit:d =d;
247 -> hex integer:n, hex digit:d =16*n + d;
248
249 Note that if you use the C convention for octal integers,
250 you have to redefine decimal integer to avoid ambiguity:
251
252 (long) decimal integer
253 -> '1-9':d =d-'0';
254 -> decimal integer:n, '0-9':d =10*n + d - '0';
255
256 You can then represent the general case as follows:
257
258 (long) integer
259 -> decimal integer | octal integer | hex integer
260
261 You can allow for a signed integer with the following
262 productions:
263
264 (long) signed integer
265 -> '+'?, integer:n =n;
266 -> '-', integer:n =-n;
267
268 If you have included a disregard statement in your grammar
269 to skip over irrelevant white space in your input, you might
270 add the following to avoid skipping white space inside a
271 number:
272 [ lexeme {integer}]
273
274 Note that if you were to declare signed integer as a lexeme,
275 your parser would not allow space between a leading sign and
276 the integer.
277
278
279 Floating Point Numbers
280 ----------------------------
281
282 Floating point numbers are somewhat more complex than
283 integers, but not significantly so:
284
285 (double) real
286 -> simple real
287 -> integer part:r, exponent field:x =r*pow10(x);
288 -> simple real:r, exponent field:x =r*pow10(x);
289
290
291 (double) simple real
292 -> integer part:i, '.', fraction part:f =i+f;
293 -> integer part, '.'
294 -> '.', fraction part:f =f;
295
296 (double) integer part
297 -> '0-9':d =d-'0';
298 -> integer part:n, '0-9':d =10*n + d-'0';
299
300 (double) fraction part
301 -> '0-9':d =(d-'0')/10.;
302 -> '0-9':d, fraction part:f =(d-'0'+f)/10.;
303
304 exponent field
305 -> 'e' + 'E', signed exponent:x =x;
306
307 signed exponent
308 -> '+'?, exponent:n =n;
309 -> '-', exponent:n =-n;
310
311 exponent
312 -> '0-9':d =d-'0';
313 -> exponent:n, '0-9':d =10*n + d - '0';
314
315 Note that fraction part uses right recursion rather than
316 left recursion. exponent is effectively the same as decimal
317 integer, above, but allows for initial zeroes.
318
319 The situation becomes somewhat more complex if you wish to
320 allow both integer and floating point forms, particularly if
321 you wish to follow the C convention for octal integers.
322 First, you cannot have distinct productions for integer part
323 and decimal integer, since there is no way to distinguish
324 them until a decimal point or an exponent field is
325 encountered. Second, since 0129.3 looks for all the world
326 like an octal number until the '9' is encountered, one
327 either has to postpone all conversion calculations until the
328 issue is resolved or resort to trickery. Here is a way to
329 resolve the problem by redefining integer part:
330 (double) integer part
331 -> confusion
332 -> octal integer:n =octal2decimal(n);
333 -> decimal integer:n =n;
334
335 (double) confusion
336 -> octal integer:n, '8-9':d =10*octal2decimal(n)+d-'0';
337 -> confusion:x, '0-9':d =10*x + d-'0';
338
339 where octal2decimal is defined thus:
340 double octal2decimal(int n) {
341 if (n) return 10*octal2decimal(n/8) + n%8;
342 else return 0;
343 }
344
345 Here confusion represents a number that starts off looking
346 like an octal integer, but then turns into a decimal number,
347 because an eight, a nine, a decimal pointer or an exponent
348 field is encountered. When this occurs, the function octal2-
349 decimal undoes the octal conversion that had already been
350 done, and redoes the conversion as decimal conversion.
351
352 If you have included a disregard statement in your grammar
353 to skip over irrelevant white space in your input, you might
354 add the following to avoid skipping white space inside a
355 real:
356 [ lexeme {real}]
357
358
359 Names
360 ---------
361
362 In almost all grammars, it is necessary to identify names.
363 To accumulate the characters that make up the name it is
364 convenient to use a stack. The reduction procedures in this
365 example use the functions ipn and pcn to accumulate the
366 characters. ipn initializes storage for the name and stores
367 the first character. pcn adds a single character to the
368 name. This grammar presumes the existence of a function
369 called identify_name which would look up the name in a
370 symbol table and return an identifying index.
371
372 letter = 'a-z' + 'A-Z' + '_'
373 digit = '0-9'
374 (int) name
375 -> name string =identify_name();
376
377 name string
378 -> letter:c =ins(c);
379 -> name string, letter+digit:c =pcn(c);
380
381 {/* embedded C to accumulate name */
382 char name[82];
383 int name_length;
384 void ipn(int c) { /* Init and Put char to Name */
385 name[0] = c;
386 name_length = 1;
387 }
388 void pcn(int c) { /* Put Char to Name */
389 assert(name_length < 82);
390 name[name_length++] = c;
391 }
392 } // End of embedded C
393
394 If you have included a disregard statement in your grammar
395 to skip over irrelevant white space in your input, you might
396 add the following to avoid skipping white space inside a
397 name:
398 [ lexeme {name}]
399
400
401 Names with Embedded White Space
402 ---------------------------------------------
403
404 It is sometimes convenient to allow symbol names to contain
405 embedded white space. This is easily done, although it
406 requires a bit of a trick:
407
408 name
409 -> name string, ws?... =identify_name();
410
411 name string
412 -> letter:c =ipn(c);
413 -> name string, letter+digit:c =pcn(c);
414 -> name string, ws..., letter+digit:c=pcn(' '), pcn(c);
415
416 Note that the last production reduces all contiguous white
417 space within a name to a single blank.
418
419 Allowing optional blanks following name string is important.
420 If you don't allow them there, any ws following a name will
421 be presumed to be embedded ws, requiring another letter or
422 digit, which is not what you intend. There are two ways to
423 accomplish this. The first, shown above, explicitly allows
424 for optional white space following name string. The second
425 method is to use the disregard and lexeme statements:
426
427 [
428 disregard ws
429 lexeme {name}
430 ]
431 If you use the disregard statement you should not include a
432 provision for white space in the production for name. Just
433 leave it as it was in the previous example.
434
435
436 Character Strings
437 ----------------------
438
439 Character strings are often required. The simplest way to
440 implement character strings is as follows:
441
442 character string
443 -> '"', ~(eof + '"')?..., '"'
444
445 This approach has the disadvantage that it makes no
446 provision for nonprinting characters.
447
448 There are numerous ways to provide for nonprinting
449 characters in your character strings. However, you can avoid
450 tedious documentation by using the same rules for
451 nonprinting characters that C uses. Unfortunately, the C
452 rules for octal and hexadecimal escape sequences complicate
453 the writing of the grammar quite substantially. For example,
454 if you wish to write a string that consists of ascii code 11
455 followed by the ascii digit '2', you must pad with a leading
456 zero, writing "\0132", since "\132" according to the rules
457 is a single three digit octal escape sequence designating
458 ascii code 90. The problem is that the rules allow for one,
459 two or three digit octal escape sequences, but sequences
460 shorter than three digits have to be followed by the end of
461 the string or a character that is not an ascii digit. There
462 is a similar, if not worse, problem with hex escape
463 sequences. There is no limit on the length of a hex escape
464 sequence, so there is no possible way to make the character
465 '2' follow a hex escape sequence without using another
466 escape sequence.
467
468 A straightforward approach to writing a grammar for strings
469 consistent with the C conventions yields a number of
470 conflicts which correspond exactly to the problems discussed
471 above. While it is certainly possible to write a grammar for
472 strings that has no conflicts, it is easier to proceed in a
473 straightforward manner and use a sticky declaration to
474 resolve the ambiguities.
475
476 Here is the complete grammar for a character string in
477 accordance with the C rules. In order to accumulate the
478 contents of the string, this grammar uses the functions ics
479 and acs, to initialize storage for a character string and to
480 append a character to that string respectively.
481
482 character string
483 -> string chars, '"'
484
485 string chars
486 -> '"' =ics();
487 -> string chars, string char:c =acs(c);
488
489 string char
490 -> simple string char
491 -> escape sequence
492
493 simple string char = ~eof - ('"' + '\\' + '\n')
494
495 (int) escape sequence
496 -> "\\a" ='\a';
497 -> "\\b" ='\b';
498 -> "\\f" ='\f';
499 -> "\\n" ='\n';
500 -> "\\r" ='\r';
501 -> "\\t" ='\t';
502 -> "\\v" ='\v';
503 -> "\\\\" ='\\';
504 -> "\\?" = '\?';
505 -> "\\'" ='\'';
506 -> "\\\"" ='"';
507 -> octal escape
508 -> hex escape
509
510 (int) octal escape
511 -> one octal | two octal | three octal
512
513 (int) one octal
514 -> '\\', '0-7':d =d-'0';
515
516 (int) two octal
517 -> one octal:n, '0-7':d =8*n + d-'0';
518
519 (int) three octal
520 -> two octal:n, '0-7':d =8*n + d-'0';
521
522 (int) hex escape
523 -> "\\x", hex number
524
525 (int) hex number
526 -> hex digit
527 -> hex number:n, hex digit:d =16*n + d;
528
529 [
530 sticky one octal, two octal, hex number
531 ]
532 {/* embedded C to define ics and acs */
533 char string_store[82];
534 int length;
535 void ics(void) {
536 length = 0;
537 }
538 void acs(int c) {
539 assert(length < 82);
540 string_store[length++] = c;
541 }
542
543 If you have included a disregard statement in your grammar
544 to skip over irrelevant white space in your input, you might
545 add the following to avoid skipping white space inside a
546 name:
547
548 [ lexeme {character string}]
549
550
551 Character Constants
552 -------------------------
553
554 It is almost trivial to extend the above syntax for
555 character strings to allow simple character constants:
556
557 (int) character constant
558 -> '\'', simple char:c, '\'' =c;
559
560 (int) simple char
561 -> ~eof - ('\'' + '\\' + '\n')
562 -> escape sequence
563
564 The value of the character constant token is the character
565 code found in the input stream.
566
567 If you have included a disregard statement in your grammar
568 to skip over irrelevant white space in your input, you might
569 add the following to avoid skipping white space inside a
570 character constant:
571 [ lexeme {character constant}]
572
573
574 Simple Expressions
575 ------------------------
576
577 It is often convenient to allow for simple expressions in
578 your input. The syntax for expression logic is written in a
579 conventional way, using a separate token for each precedence
580 level desired. The following grammar will accept simple
581 addition, subtraction, multiplication and division of
582 floating point numbers:
583
584 (double) expression
585 -> term
586 -> expression:x, '+', term:t =x+t;
587 -> expression:x, '-', term:t =x-t;
588
589 (double) term
590 -> factor
591 -> term:t, '*', factor:f =t*f;
592 -> term:t, '/', factor:f =t/f;
593
594 (double) factor
595 -> real
596 -> '(', expression:x, ')' =x;
597 -> '-', factor:f =-f;
598
599 An alternative way to write expression syntax is to write an
600 ambiguous grammar and use precedence declarations to resolve
601 the conflicts. This results in a slightly more compact and
602 faster parser:
603
604 (double) expression
605 -> expression:x, '+', expression:y =x+y;
606 -> expression:x, '-', expression:y =x-y;
607 -> expression:x, '*', expression:y =x*y;
608 -> expression:x, '/', expression:y =x/y;
609 -> unary minus, expression:x =-x;
610 -> '(', expression:x, ')' =x;
611 -> real
612
613 unary minus = '-'
614
615 [
616 left '+', '-'
617 left '*', '/'
618 right unary minus
619 ]
620
621 Note that we deal with the different precedence levels of
622 unary and binary minus by defining unary minus, which
623 AnaGram treats as distinct from the simple '-', and giving
624 it different precedence.
625