Mercurial > ~dholland > hg > ag > index.cgi
comparison examples/sbb-doc.txt @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
-1:000000000000 | 0:13d2b8934445 |
---|---|
1 ------------------------------------------------------ | |
2 Syntactic Building Blocks | |
3 Copyright 1993-1998, Parsifal Software. | |
4 All Rights Reserved. | |
5 | |
6 This software is provided 'as-is', without any express or | |
7 implied warranty. In no event will the authors be held | |
8 liable for any damages arising from the use of this software. | |
9 | |
10 Permission is granted to anyone to use this software for any | |
11 purpose, including commercial applications, and to alter it | |
12 and redistribute it freely, subject to the restriction that | |
13 this notice may not be removed or altered from any source | |
14 distribution. | |
15 ------------------------------------------------------ | |
16 | |
17 | |
18 Introduction | |
19 --------------- | |
20 | |
21 This file contains examples of the productions necessary to | |
22 parse the more common syntactic elements in your grammar. It | |
23 should be of interest once you have begun to write your own | |
24 grammars. Each example is provided with sample reduction | |
25 procedures, which are adequate in the vast majority of cases. | |
26 You can use these productions like building blocks to quickly | |
27 assemble quite powerful grammars. The text of this file is | |
28 also to be found in Appendix D of the AnaGram User's Guide. | |
29 | |
30 | |
31 End of File | |
32 -------------- | |
33 If your parser is going to run under Unix or DOS and use | |
34 stream I/O to read its input, you can define end of file as | |
35 minus one: | |
36 eof = -1 | |
37 | |
38 If your parser is going to run under DOS and will use low | |
39 level I/O instead of stream I/O you might define eof as | |
40 Control Z: | |
41 eof = ^Z | |
42 | |
43 If your parser is going to run under UNIX and will use low | |
44 level I/O instead of stream I/O you might define eof as | |
45 Control D: | |
46 eof = ^D | |
47 | |
48 If your parser is simply going to parse a string in memory, | |
49 the definition of end of file should be a null character: | |
50 eof = 0 | |
51 | |
52 It is often convenient, however, to simply define end of | |
53 file so it will work in all of these contexts: | |
54 eof = -1 + 0 + ^D + ^Z | |
55 | |
56 | |
57 White Space | |
58 ---------------- | |
59 It is convenient to have a representation for white space in | |
60 your grammar. Usually you do not wish to distinguish between | |
61 space characters and tab characters, so you can write: | |
62 blank = ' ' + '\t' | |
63 | |
64 Using this definition, you can represent required white | |
65 space of indeterminate length with blank... and optional | |
66 white space with blank?... | |
67 | |
68 It is common, however, to include comments (see below) as | |
69 white space. In this case, you can define the following | |
70 productions: | |
71 ws | |
72 -> blank | |
73 -> comment | |
74 | |
75 | |
76 End of Line | |
77 -------------- | |
78 Because different systems use different representations for | |
79 end of line, it is wise to use an abstract end of line token | |
80 in your grammar, and define the token separately. If your | |
81 parser is going to use files that are known to use carriage | |
82 return alone or line feed alone as the end of line delimiter | |
83 you can use one of the following definitions: | |
84 | |
85 eol = '\r' //carriage return only | |
86 eol = '\n' //line feed only | |
87 | |
88 If your input files use the newline character as a line | |
89 terminator, but you wish to allow for optional carriage | |
90 returns, you might write: | |
91 | |
92 eol | |
93 -> '\r'?, '\n' | |
94 or even | |
95 eol | |
96 -> '\r'?..., '\n' | |
97 | |
98 If you wish to make a grammar that can deal with a file of | |
99 more or less arbitrary provenance, some caution is in order. | |
100 If you allow carriage return and line feed both to function | |
101 as end of line characters and you allow blank lines, you may | |
102 wind up with a very ambiguous grammar. For example, suppose | |
103 you use eol... somewhere in your grammar and you have | |
104 defined eol thus: | |
105 eol | |
106 -> '\r' | |
107 -> '\n' | |
108 -> '\r', '\n' // trouble! | |
109 | |
110 Your grammar is ambiguous since a carriage return, line feed | |
111 pair can be considered two line endings, according to the | |
112 first two production, or just one, according to the third. | |
113 | |
114 If you really need to allow isolated carriage returns and | |
115 linefeeds to mark ends of line, but need to deal with the | |
116 pair as well, you can make good use of the idiosyncracies of | |
117 AnaGram's keyword logic by writing the following: | |
118 eol | |
119 -> '\r' | |
120 -> '\n' | |
121 -> "\r\n" | |
122 | |
123 This will treat a carriage return followed by a line feed as | |
124 a single end of line. | |
125 | |
126 | |
127 Comments | |
128 ------------- | |
129 | |
130 There are two different categories of comments in common | |
131 use: Those that run to the end of line, and those that run | |
132 to a specific terminator. The first can be conveniently | |
133 incorporated into your end of line token as a virtual | |
134 production: | |
135 eol | |
136 -> ["//", ~(eof + '\n')?...], '\n' | |
137 | |
138 Thus, eol allows for an optional comment at the end of every | |
139 line. Note that you never want to allow end of file | |
140 characters in your comments. | |
141 | |
142 Conventional C comments are slightly more complicated, | |
143 depending on your treatment of nesting. If you are not | |
144 interested in nesting comments, you can simply use the | |
145 following simple syntax: | |
146 comment | |
147 -> "/*", ~eof?..., "*/" | |
148 | |
149 Note that this production works because keywords take | |
150 precedence over ordinary character input. | |
151 | |
152 A somewhat more complex way to write this is useful: | |
153 comment | |
154 -> comment head, "*/" | |
155 -> comment head, comment | |
156 comment head | |
157 -> "/*" | |
158 -> comment head, ~eof | |
159 | |
160 Note that where the previous grammar simply ignored the | |
161 beginning of any nested comment, this grammar recognizes a | |
162 nested comment, but then deliberately chooses to ignore | |
163 nesting. | |
164 | |
165 If you want your comments to nest, you can easily rewrite | |
166 this grammar to allow nesting: | |
167 | |
168 comment | |
169 -> comment head, "*/" | |
170 comment head | |
171 -> "/*" | |
172 -> comment head, ~eof | |
173 -> comment head, comment | |
174 | |
175 Note that the only difference between nested and non-nested | |
176 comments is whether the grammar rule comment head, comment | |
177 reduces to comment head or to comment. | |
178 | |
179 If you wish to defer the question of nesting to run-time, | |
180 you can use a semantically determined production to make the | |
181 decision: | |
182 comment | |
183 -> comment head, "*/" | |
184 comment head | |
185 -> "/*" | |
186 -> comment head, ~eof | |
187 comment, comment head | |
188 -> comment head, comment =check_nesting(); | |
189 | |
190 Although the final production in this grammar has a somewhat | |
191 inscrutable air about it, a little study will show that if | |
192 check_nesting sets the reduction token to comment, the | |
193 effective grammar is the same as the grammar above for | |
194 non-nested comments. On the other hand, if check_nesting | |
195 sets the reduction token to comment head, the effective | |
196 grammar is the same as the grammar for nested comments. | |
197 | |
198 Assuming you have a switch called nest_comments, you could | |
199 write check_nesting as follows: | |
200 | |
201 check_nesting() { | |
202 if (nest_comments) | |
203 CHANGE_REDUCTION(comment_head); | |
204 else | |
205 CHANGE_REDUCTION(comment); | |
206 } | |
207 | |
208 The else clause in this procedure is technically unnecessary | |
209 since comment is the default value of the reduction token. | |
210 | |
211 If you wish to skip white space in your input, and wish to | |
212 consider comments as simple white space, you might want to | |
213 add the following production to your grammar: | |
214 ws | |
215 -> blank | |
216 -> comment | |
217 You can then use the disregard statement to ignore ws | |
218 appropriately. | |
219 | |
220 | |
221 Integers | |
222 ---------- | |
223 | |
224 It is quite easy to convert ordinary integers to binary | |
225 form. For instance: | |
226 (int) decimal integer | |
227 -> '0-9':d =d-'0'; | |
228 -> decimal integer:n, '0-9':d =10*n + d - '0'; | |
229 | |
230 If necessary, you can specify that the value of decimal | |
231 integer be maintained as a long: | |
232 (long) decimal integer | |
233 -> '0-9':d =d-'0'; | |
234 -> decimal integer:n, '0-9':d =10*n + d - '0'; | |
235 | |
236 Other representations are equally simple: | |
237 (long) octal integer | |
238 -> '0' =0; | |
239 -> octal integer:n, '0-7':d =8*n + d - '0'; | |
240 | |
241 (long) hex digit | |
242 -> '0-9':d =d-'0'; | |
243 -> 'A-F' + 'a-f':d =9 + (d & 7); | |
244 | |
245 (long) hex integer | |
246 -> {"0x" + "0X"}, hex digit:d =d; | |
247 -> hex integer:n, hex digit:d =16*n + d; | |
248 | |
249 Note that if you use the C convention for octal integers, | |
250 you have to redefine decimal integer to avoid ambiguity: | |
251 | |
252 (long) decimal integer | |
253 -> '1-9':d =d-'0'; | |
254 -> decimal integer:n, '0-9':d =10*n + d - '0'; | |
255 | |
256 You can then represent the general case as follows: | |
257 | |
258 (long) integer | |
259 -> decimal integer | octal integer | hex integer | |
260 | |
261 You can allow for a signed integer with the following | |
262 productions: | |
263 | |
264 (long) signed integer | |
265 -> '+'?, integer:n =n; | |
266 -> '-', integer:n =-n; | |
267 | |
268 If you have included a disregard statement in your grammar | |
269 to skip over irrelevant white space in your input, you might | |
270 add the following to avoid skipping white space inside a | |
271 number: | |
272 [ lexeme {integer}] | |
273 | |
274 Note that if you were to declare signed integer as a lexeme, | |
275 your parser would not allow space between a leading sign and | |
276 the integer. | |
277 | |
278 | |
279 Floating Point Numbers | |
280 ---------------------------- | |
281 | |
282 Floating point numbers are somewhat more complex than | |
283 integers, but not significantly so: | |
284 | |
285 (double) real | |
286 -> simple real | |
287 -> integer part:r, exponent field:x =r*pow10(x); | |
288 -> simple real:r, exponent field:x =r*pow10(x); | |
289 | |
290 | |
291 (double) simple real | |
292 -> integer part:i, '.', fraction part:f =i+f; | |
293 -> integer part, '.' | |
294 -> '.', fraction part:f =f; | |
295 | |
296 (double) integer part | |
297 -> '0-9':d =d-'0'; | |
298 -> integer part:n, '0-9':d =10*n + d-'0'; | |
299 | |
300 (double) fraction part | |
301 -> '0-9':d =(d-'0')/10.; | |
302 -> '0-9':d, fraction part:f =(d-'0'+f)/10.; | |
303 | |
304 exponent field | |
305 -> 'e' + 'E', signed exponent:x =x; | |
306 | |
307 signed exponent | |
308 -> '+'?, exponent:n =n; | |
309 -> '-', exponent:n =-n; | |
310 | |
311 exponent | |
312 -> '0-9':d =d-'0'; | |
313 -> exponent:n, '0-9':d =10*n + d - '0'; | |
314 | |
315 Note that fraction part uses right recursion rather than | |
316 left recursion. exponent is effectively the same as decimal | |
317 integer, above, but allows for initial zeroes. | |
318 | |
319 The situation becomes somewhat more complex if you wish to | |
320 allow both integer and floating point forms, particularly if | |
321 you wish to follow the C convention for octal integers. | |
322 First, you cannot have distinct productions for integer part | |
323 and decimal integer, since there is no way to distinguish | |
324 them until a decimal point or an exponent field is | |
325 encountered. Second, since 0129.3 looks for all the world | |
326 like an octal number until the '9' is encountered, one | |
327 either has to postpone all conversion calculations until the | |
328 issue is resolved or resort to trickery. Here is a way to | |
329 resolve the problem by redefining integer part: | |
330 (double) integer part | |
331 -> confusion | |
332 -> octal integer:n =octal2decimal(n); | |
333 -> decimal integer:n =n; | |
334 | |
335 (double) confusion | |
336 -> octal integer:n, '8-9':d =10*octal2decimal(n)+d-'0'; | |
337 -> confusion:x, '0-9':d =10*x + d-'0'; | |
338 | |
339 where octal2decimal is defined thus: | |
340 double octal2decimal(int n) { | |
341 if (n) return 10*octal2decimal(n/8) + n%8; | |
342 else return 0; | |
343 } | |
344 | |
345 Here confusion represents a number that starts off looking | |
346 like an octal integer, but then turns into a decimal number, | |
347 because an eight, a nine, a decimal pointer or an exponent | |
348 field is encountered. When this occurs, the function octal2- | |
349 decimal undoes the octal conversion that had already been | |
350 done, and redoes the conversion as decimal conversion. | |
351 | |
352 If you have included a disregard statement in your grammar | |
353 to skip over irrelevant white space in your input, you might | |
354 add the following to avoid skipping white space inside a | |
355 real: | |
356 [ lexeme {real}] | |
357 | |
358 | |
359 Names | |
360 --------- | |
361 | |
362 In almost all grammars, it is necessary to identify names. | |
363 To accumulate the characters that make up the name it is | |
364 convenient to use a stack. The reduction procedures in this | |
365 example use the functions ipn and pcn to accumulate the | |
366 characters. ipn initializes storage for the name and stores | |
367 the first character. pcn adds a single character to the | |
368 name. This grammar presumes the existence of a function | |
369 called identify_name which would look up the name in a | |
370 symbol table and return an identifying index. | |
371 | |
372 letter = 'a-z' + 'A-Z' + '_' | |
373 digit = '0-9' | |
374 (int) name | |
375 -> name string =identify_name(); | |
376 | |
377 name string | |
378 -> letter:c =ins(c); | |
379 -> name string, letter+digit:c =pcn(c); | |
380 | |
381 {/* embedded C to accumulate name */ | |
382 char name[82]; | |
383 int name_length; | |
384 void ipn(int c) { /* Init and Put char to Name */ | |
385 name[0] = c; | |
386 name_length = 1; | |
387 } | |
388 void pcn(int c) { /* Put Char to Name */ | |
389 assert(name_length < 82); | |
390 name[name_length++] = c; | |
391 } | |
392 } // End of embedded C | |
393 | |
394 If you have included a disregard statement in your grammar | |
395 to skip over irrelevant white space in your input, you might | |
396 add the following to avoid skipping white space inside a | |
397 name: | |
398 [ lexeme {name}] | |
399 | |
400 | |
401 Names with Embedded White Space | |
402 --------------------------------------------- | |
403 | |
404 It is sometimes convenient to allow symbol names to contain | |
405 embedded white space. This is easily done, although it | |
406 requires a bit of a trick: | |
407 | |
408 name | |
409 -> name string, ws?... =identify_name(); | |
410 | |
411 name string | |
412 -> letter:c =ipn(c); | |
413 -> name string, letter+digit:c =pcn(c); | |
414 -> name string, ws..., letter+digit:c=pcn(' '), pcn(c); | |
415 | |
416 Note that the last production reduces all contiguous white | |
417 space within a name to a single blank. | |
418 | |
419 Allowing optional blanks following name string is important. | |
420 If you don't allow them there, any ws following a name will | |
421 be presumed to be embedded ws, requiring another letter or | |
422 digit, which is not what you intend. There are two ways to | |
423 accomplish this. The first, shown above, explicitly allows | |
424 for optional white space following name string. The second | |
425 method is to use the disregard and lexeme statements: | |
426 | |
427 [ | |
428 disregard ws | |
429 lexeme {name} | |
430 ] | |
431 If you use the disregard statement you should not include a | |
432 provision for white space in the production for name. Just | |
433 leave it as it was in the previous example. | |
434 | |
435 | |
436 Character Strings | |
437 ---------------------- | |
438 | |
439 Character strings are often required. The simplest way to | |
440 implement character strings is as follows: | |
441 | |
442 character string | |
443 -> '"', ~(eof + '"')?..., '"' | |
444 | |
445 This approach has the disadvantage that it makes no | |
446 provision for nonprinting characters. | |
447 | |
448 There are numerous ways to provide for nonprinting | |
449 characters in your character strings. However, you can avoid | |
450 tedious documentation by using the same rules for | |
451 nonprinting characters that C uses. Unfortunately, the C | |
452 rules for octal and hexadecimal escape sequences complicate | |
453 the writing of the grammar quite substantially. For example, | |
454 if you wish to write a string that consists of ascii code 11 | |
455 followed by the ascii digit '2', you must pad with a leading | |
456 zero, writing "\0132", since "\132" according to the rules | |
457 is a single three digit octal escape sequence designating | |
458 ascii code 90. The problem is that the rules allow for one, | |
459 two or three digit octal escape sequences, but sequences | |
460 shorter than three digits have to be followed by the end of | |
461 the string or a character that is not an ascii digit. There | |
462 is a similar, if not worse, problem with hex escape | |
463 sequences. There is no limit on the length of a hex escape | |
464 sequence, so there is no possible way to make the character | |
465 '2' follow a hex escape sequence without using another | |
466 escape sequence. | |
467 | |
468 A straightforward approach to writing a grammar for strings | |
469 consistent with the C conventions yields a number of | |
470 conflicts which correspond exactly to the problems discussed | |
471 above. While it is certainly possible to write a grammar for | |
472 strings that has no conflicts, it is easier to proceed in a | |
473 straightforward manner and use a sticky declaration to | |
474 resolve the ambiguities. | |
475 | |
476 Here is the complete grammar for a character string in | |
477 accordance with the C rules. In order to accumulate the | |
478 contents of the string, this grammar uses the functions ics | |
479 and acs, to initialize storage for a character string and to | |
480 append a character to that string respectively. | |
481 | |
482 character string | |
483 -> string chars, '"' | |
484 | |
485 string chars | |
486 -> '"' =ics(); | |
487 -> string chars, string char:c =acs(c); | |
488 | |
489 string char | |
490 -> simple string char | |
491 -> escape sequence | |
492 | |
493 simple string char = ~eof - ('"' + '\\' + '\n') | |
494 | |
495 (int) escape sequence | |
496 -> "\\a" ='\a'; | |
497 -> "\\b" ='\b'; | |
498 -> "\\f" ='\f'; | |
499 -> "\\n" ='\n'; | |
500 -> "\\r" ='\r'; | |
501 -> "\\t" ='\t'; | |
502 -> "\\v" ='\v'; | |
503 -> "\\\\" ='\\'; | |
504 -> "\\?" = '\?'; | |
505 -> "\\'" ='\''; | |
506 -> "\\\"" ='"'; | |
507 -> octal escape | |
508 -> hex escape | |
509 | |
510 (int) octal escape | |
511 -> one octal | two octal | three octal | |
512 | |
513 (int) one octal | |
514 -> '\\', '0-7':d =d-'0'; | |
515 | |
516 (int) two octal | |
517 -> one octal:n, '0-7':d =8*n + d-'0'; | |
518 | |
519 (int) three octal | |
520 -> two octal:n, '0-7':d =8*n + d-'0'; | |
521 | |
522 (int) hex escape | |
523 -> "\\x", hex number | |
524 | |
525 (int) hex number | |
526 -> hex digit | |
527 -> hex number:n, hex digit:d =16*n + d; | |
528 | |
529 [ | |
530 sticky one octal, two octal, hex number | |
531 ] | |
532 {/* embedded C to define ics and acs */ | |
533 char string_store[82]; | |
534 int length; | |
535 void ics(void) { | |
536 length = 0; | |
537 } | |
538 void acs(int c) { | |
539 assert(length < 82); | |
540 string_store[length++] = c; | |
541 } | |
542 | |
543 If you have included a disregard statement in your grammar | |
544 to skip over irrelevant white space in your input, you might | |
545 add the following to avoid skipping white space inside a | |
546 name: | |
547 | |
548 [ lexeme {character string}] | |
549 | |
550 | |
551 Character Constants | |
552 ------------------------- | |
553 | |
554 It is almost trivial to extend the above syntax for | |
555 character strings to allow simple character constants: | |
556 | |
557 (int) character constant | |
558 -> '\'', simple char:c, '\'' =c; | |
559 | |
560 (int) simple char | |
561 -> ~eof - ('\'' + '\\' + '\n') | |
562 -> escape sequence | |
563 | |
564 The value of the character constant token is the character | |
565 code found in the input stream. | |
566 | |
567 If you have included a disregard statement in your grammar | |
568 to skip over irrelevant white space in your input, you might | |
569 add the following to avoid skipping white space inside a | |
570 character constant: | |
571 [ lexeme {character constant}] | |
572 | |
573 | |
574 Simple Expressions | |
575 ------------------------ | |
576 | |
577 It is often convenient to allow for simple expressions in | |
578 your input. The syntax for expression logic is written in a | |
579 conventional way, using a separate token for each precedence | |
580 level desired. The following grammar will accept simple | |
581 addition, subtraction, multiplication and division of | |
582 floating point numbers: | |
583 | |
584 (double) expression | |
585 -> term | |
586 -> expression:x, '+', term:t =x+t; | |
587 -> expression:x, '-', term:t =x-t; | |
588 | |
589 (double) term | |
590 -> factor | |
591 -> term:t, '*', factor:f =t*f; | |
592 -> term:t, '/', factor:f =t/f; | |
593 | |
594 (double) factor | |
595 -> real | |
596 -> '(', expression:x, ')' =x; | |
597 -> '-', factor:f =-f; | |
598 | |
599 An alternative way to write expression syntax is to write an | |
600 ambiguous grammar and use precedence declarations to resolve | |
601 the conflicts. This results in a slightly more compact and | |
602 faster parser: | |
603 | |
604 (double) expression | |
605 -> expression:x, '+', expression:y =x+y; | |
606 -> expression:x, '-', expression:y =x-y; | |
607 -> expression:x, '*', expression:y =x*y; | |
608 -> expression:x, '/', expression:y =x/y; | |
609 -> unary minus, expression:x =-x; | |
610 -> '(', expression:x, ')' =x; | |
611 -> real | |
612 | |
613 unary minus = '-' | |
614 | |
615 [ | |
616 left '+', '-' | |
617 left '*', '/' | |
618 right unary minus | |
619 ] | |
620 | |
621 Note that we deal with the different precedence levels of | |
622 unary and binary minus by defining unary minus, which | |
623 AnaGram treats as distinct from the simple '-', and giving | |
624 it different precedence. | |
625 |