Mercurial > ~dholland > hg > ag > index.cgi
view examples/sbb-doc.txt @ 16:f9e4689b837d
Some minor updates for 15 years later.
author | David A. Holland |
---|---|
date | Tue, 31 May 2022 01:45:26 -0400 |
parents | 13d2b8934445 |
children |
line wrap: on
line source
------------------------------------------------------ Syntactic Building Blocks Copyright 1993-1998, Parsifal Software. All Rights Reserved. This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the restriction that this notice may not be removed or altered from any source distribution. ------------------------------------------------------ Introduction --------------- This file contains examples of the productions necessary to parse the more common syntactic elements in your grammar. It should be of interest once you have begun to write your own grammars. Each example is provided with sample reduction procedures, which are adequate in the vast majority of cases. You can use these productions like building blocks to quickly assemble quite powerful grammars. The text of this file is also to be found in Appendix D of the AnaGram User's Guide. End of File -------------- If your parser is going to run under Unix or DOS and use stream I/O to read its input, you can define end of file as minus one: eof = -1 If your parser is going to run under DOS and will use low level I/O instead of stream I/O you might define eof as Control Z: eof = ^Z If your parser is going to run under UNIX and will use low level I/O instead of stream I/O you might define eof as Control D: eof = ^D If your parser is simply going to parse a string in memory, the definition of end of file should be a null character: eof = 0 It is often convenient, however, to simply define end of file so it will work in all of these contexts: eof = -1 + 0 + ^D + ^Z White Space ---------------- It is convenient to have a representation for white space in your grammar. Usually you do not wish to distinguish between space characters and tab characters, so you can write: blank = ' ' + '\t' Using this definition, you can represent required white space of indeterminate length with blank... and optional white space with blank?... It is common, however, to include comments (see below) as white space. In this case, you can define the following productions: ws -> blank -> comment End of Line -------------- Because different systems use different representations for end of line, it is wise to use an abstract end of line token in your grammar, and define the token separately. If your parser is going to use files that are known to use carriage return alone or line feed alone as the end of line delimiter you can use one of the following definitions: eol = '\r' //carriage return only eol = '\n' //line feed only If your input files use the newline character as a line terminator, but you wish to allow for optional carriage returns, you might write: eol -> '\r'?, '\n' or even eol -> '\r'?..., '\n' If you wish to make a grammar that can deal with a file of more or less arbitrary provenance, some caution is in order. If you allow carriage return and line feed both to function as end of line characters and you allow blank lines, you may wind up with a very ambiguous grammar. For example, suppose you use eol... somewhere in your grammar and you have defined eol thus: eol -> '\r' -> '\n' -> '\r', '\n' // trouble! Your grammar is ambiguous since a carriage return, line feed pair can be considered two line endings, according to the first two production, or just one, according to the third. If you really need to allow isolated carriage returns and linefeeds to mark ends of line, but need to deal with the pair as well, you can make good use of the idiosyncracies of AnaGram's keyword logic by writing the following: eol -> '\r' -> '\n' -> "\r\n" This will treat a carriage return followed by a line feed as a single end of line. Comments ------------- There are two different categories of comments in common use: Those that run to the end of line, and those that run to a specific terminator. The first can be conveniently incorporated into your end of line token as a virtual production: eol -> ["//", ~(eof + '\n')?...], '\n' Thus, eol allows for an optional comment at the end of every line. Note that you never want to allow end of file characters in your comments. Conventional C comments are slightly more complicated, depending on your treatment of nesting. If you are not interested in nesting comments, you can simply use the following simple syntax: comment -> "/*", ~eof?..., "*/" Note that this production works because keywords take precedence over ordinary character input. A somewhat more complex way to write this is useful: comment -> comment head, "*/" -> comment head, comment comment head -> "/*" -> comment head, ~eof Note that where the previous grammar simply ignored the beginning of any nested comment, this grammar recognizes a nested comment, but then deliberately chooses to ignore nesting. If you want your comments to nest, you can easily rewrite this grammar to allow nesting: comment -> comment head, "*/" comment head -> "/*" -> comment head, ~eof -> comment head, comment Note that the only difference between nested and non-nested comments is whether the grammar rule comment head, comment reduces to comment head or to comment. If you wish to defer the question of nesting to run-time, you can use a semantically determined production to make the decision: comment -> comment head, "*/" comment head -> "/*" -> comment head, ~eof comment, comment head -> comment head, comment =check_nesting(); Although the final production in this grammar has a somewhat inscrutable air about it, a little study will show that if check_nesting sets the reduction token to comment, the effective grammar is the same as the grammar above for non-nested comments. On the other hand, if check_nesting sets the reduction token to comment head, the effective grammar is the same as the grammar for nested comments. Assuming you have a switch called nest_comments, you could write check_nesting as follows: check_nesting() { if (nest_comments) CHANGE_REDUCTION(comment_head); else CHANGE_REDUCTION(comment); } The else clause in this procedure is technically unnecessary since comment is the default value of the reduction token. If you wish to skip white space in your input, and wish to consider comments as simple white space, you might want to add the following production to your grammar: ws -> blank -> comment You can then use the disregard statement to ignore ws appropriately. Integers ---------- It is quite easy to convert ordinary integers to binary form. For instance: (int) decimal integer -> '0-9':d =d-'0'; -> decimal integer:n, '0-9':d =10*n + d - '0'; If necessary, you can specify that the value of decimal integer be maintained as a long: (long) decimal integer -> '0-9':d =d-'0'; -> decimal integer:n, '0-9':d =10*n + d - '0'; Other representations are equally simple: (long) octal integer -> '0' =0; -> octal integer:n, '0-7':d =8*n + d - '0'; (long) hex digit -> '0-9':d =d-'0'; -> 'A-F' + 'a-f':d =9 + (d & 7); (long) hex integer -> {"0x" + "0X"}, hex digit:d =d; -> hex integer:n, hex digit:d =16*n + d; Note that if you use the C convention for octal integers, you have to redefine decimal integer to avoid ambiguity: (long) decimal integer -> '1-9':d =d-'0'; -> decimal integer:n, '0-9':d =10*n + d - '0'; You can then represent the general case as follows: (long) integer -> decimal integer | octal integer | hex integer You can allow for a signed integer with the following productions: (long) signed integer -> '+'?, integer:n =n; -> '-', integer:n =-n; If you have included a disregard statement in your grammar to skip over irrelevant white space in your input, you might add the following to avoid skipping white space inside a number: [ lexeme {integer}] Note that if you were to declare signed integer as a lexeme, your parser would not allow space between a leading sign and the integer. Floating Point Numbers ---------------------------- Floating point numbers are somewhat more complex than integers, but not significantly so: (double) real -> simple real -> integer part:r, exponent field:x =r*pow10(x); -> simple real:r, exponent field:x =r*pow10(x); (double) simple real -> integer part:i, '.', fraction part:f =i+f; -> integer part, '.' -> '.', fraction part:f =f; (double) integer part -> '0-9':d =d-'0'; -> integer part:n, '0-9':d =10*n + d-'0'; (double) fraction part -> '0-9':d =(d-'0')/10.; -> '0-9':d, fraction part:f =(d-'0'+f)/10.; exponent field -> 'e' + 'E', signed exponent:x =x; signed exponent -> '+'?, exponent:n =n; -> '-', exponent:n =-n; exponent -> '0-9':d =d-'0'; -> exponent:n, '0-9':d =10*n + d - '0'; Note that fraction part uses right recursion rather than left recursion. exponent is effectively the same as decimal integer, above, but allows for initial zeroes. The situation becomes somewhat more complex if you wish to allow both integer and floating point forms, particularly if you wish to follow the C convention for octal integers. First, you cannot have distinct productions for integer part and decimal integer, since there is no way to distinguish them until a decimal point or an exponent field is encountered. Second, since 0129.3 looks for all the world like an octal number until the '9' is encountered, one either has to postpone all conversion calculations until the issue is resolved or resort to trickery. Here is a way to resolve the problem by redefining integer part: (double) integer part -> confusion -> octal integer:n =octal2decimal(n); -> decimal integer:n =n; (double) confusion -> octal integer:n, '8-9':d =10*octal2decimal(n)+d-'0'; -> confusion:x, '0-9':d =10*x + d-'0'; where octal2decimal is defined thus: double octal2decimal(int n) { if (n) return 10*octal2decimal(n/8) + n%8; else return 0; } Here confusion represents a number that starts off looking like an octal integer, but then turns into a decimal number, because an eight, a nine, a decimal pointer or an exponent field is encountered. When this occurs, the function octal2- decimal undoes the octal conversion that had already been done, and redoes the conversion as decimal conversion. If you have included a disregard statement in your grammar to skip over irrelevant white space in your input, you might add the following to avoid skipping white space inside a real: [ lexeme {real}] Names --------- In almost all grammars, it is necessary to identify names. To accumulate the characters that make up the name it is convenient to use a stack. The reduction procedures in this example use the functions ipn and pcn to accumulate the characters. ipn initializes storage for the name and stores the first character. pcn adds a single character to the name. This grammar presumes the existence of a function called identify_name which would look up the name in a symbol table and return an identifying index. letter = 'a-z' + 'A-Z' + '_' digit = '0-9' (int) name -> name string =identify_name(); name string -> letter:c =ins(c); -> name string, letter+digit:c =pcn(c); {/* embedded C to accumulate name */ char name[82]; int name_length; void ipn(int c) { /* Init and Put char to Name */ name[0] = c; name_length = 1; } void pcn(int c) { /* Put Char to Name */ assert(name_length < 82); name[name_length++] = c; } } // End of embedded C If you have included a disregard statement in your grammar to skip over irrelevant white space in your input, you might add the following to avoid skipping white space inside a name: [ lexeme {name}] Names with Embedded White Space --------------------------------------------- It is sometimes convenient to allow symbol names to contain embedded white space. This is easily done, although it requires a bit of a trick: name -> name string, ws?... =identify_name(); name string -> letter:c =ipn(c); -> name string, letter+digit:c =pcn(c); -> name string, ws..., letter+digit:c=pcn(' '), pcn(c); Note that the last production reduces all contiguous white space within a name to a single blank. Allowing optional blanks following name string is important. If you don't allow them there, any ws following a name will be presumed to be embedded ws, requiring another letter or digit, which is not what you intend. There are two ways to accomplish this. The first, shown above, explicitly allows for optional white space following name string. The second method is to use the disregard and lexeme statements: [ disregard ws lexeme {name} ] If you use the disregard statement you should not include a provision for white space in the production for name. Just leave it as it was in the previous example. Character Strings ---------------------- Character strings are often required. The simplest way to implement character strings is as follows: character string -> '"', ~(eof + '"')?..., '"' This approach has the disadvantage that it makes no provision for nonprinting characters. There are numerous ways to provide for nonprinting characters in your character strings. However, you can avoid tedious documentation by using the same rules for nonprinting characters that C uses. Unfortunately, the C rules for octal and hexadecimal escape sequences complicate the writing of the grammar quite substantially. For example, if you wish to write a string that consists of ascii code 11 followed by the ascii digit '2', you must pad with a leading zero, writing "\0132", since "\132" according to the rules is a single three digit octal escape sequence designating ascii code 90. The problem is that the rules allow for one, two or three digit octal escape sequences, but sequences shorter than three digits have to be followed by the end of the string or a character that is not an ascii digit. There is a similar, if not worse, problem with hex escape sequences. There is no limit on the length of a hex escape sequence, so there is no possible way to make the character '2' follow a hex escape sequence without using another escape sequence. A straightforward approach to writing a grammar for strings consistent with the C conventions yields a number of conflicts which correspond exactly to the problems discussed above. While it is certainly possible to write a grammar for strings that has no conflicts, it is easier to proceed in a straightforward manner and use a sticky declaration to resolve the ambiguities. Here is the complete grammar for a character string in accordance with the C rules. In order to accumulate the contents of the string, this grammar uses the functions ics and acs, to initialize storage for a character string and to append a character to that string respectively. character string -> string chars, '"' string chars -> '"' =ics(); -> string chars, string char:c =acs(c); string char -> simple string char -> escape sequence simple string char = ~eof - ('"' + '\\' + '\n') (int) escape sequence -> "\\a" ='\a'; -> "\\b" ='\b'; -> "\\f" ='\f'; -> "\\n" ='\n'; -> "\\r" ='\r'; -> "\\t" ='\t'; -> "\\v" ='\v'; -> "\\\\" ='\\'; -> "\\?" = '\?'; -> "\\'" ='\''; -> "\\\"" ='"'; -> octal escape -> hex escape (int) octal escape -> one octal | two octal | three octal (int) one octal -> '\\', '0-7':d =d-'0'; (int) two octal -> one octal:n, '0-7':d =8*n + d-'0'; (int) three octal -> two octal:n, '0-7':d =8*n + d-'0'; (int) hex escape -> "\\x", hex number (int) hex number -> hex digit -> hex number:n, hex digit:d =16*n + d; [ sticky one octal, two octal, hex number ] {/* embedded C to define ics and acs */ char string_store[82]; int length; void ics(void) { length = 0; } void acs(int c) { assert(length < 82); string_store[length++] = c; } If you have included a disregard statement in your grammar to skip over irrelevant white space in your input, you might add the following to avoid skipping white space inside a name: [ lexeme {character string}] Character Constants ------------------------- It is almost trivial to extend the above syntax for character strings to allow simple character constants: (int) character constant -> '\'', simple char:c, '\'' =c; (int) simple char -> ~eof - ('\'' + '\\' + '\n') -> escape sequence The value of the character constant token is the character code found in the input stream. If you have included a disregard statement in your grammar to skip over irrelevant white space in your input, you might add the following to avoid skipping white space inside a character constant: [ lexeme {character constant}] Simple Expressions ------------------------ It is often convenient to allow for simple expressions in your input. The syntax for expression logic is written in a conventional way, using a separate token for each precedence level desired. The following grammar will accept simple addition, subtraction, multiplication and division of floating point numbers: (double) expression -> term -> expression:x, '+', term:t =x+t; -> expression:x, '-', term:t =x-t; (double) term -> factor -> term:t, '*', factor:f =t*f; -> term:t, '/', factor:f =t/f; (double) factor -> real -> '(', expression:x, ')' =x; -> '-', factor:f =-f; An alternative way to write expression syntax is to write an ambiguous grammar and use precedence declarations to resolve the conflicts. This results in a slightly more compact and faster parser: (double) expression -> expression:x, '+', expression:y =x+y; -> expression:x, '-', expression:y =x-y; -> expression:x, '*', expression:y =x*y; -> expression:x, '/', expression:y =x/y; -> unary minus, expression:x =-x; -> '(', expression:x, ')' =x; -> real unary minus = '-' [ left '+', '-' left '*', '/' right unary minus ] Note that we deal with the different precedence levels of unary and binary minus by defining unary minus, which AnaGram treats as distinct from the simple '-', and giving it different precedence.