Mercurial > ~dholland > hg > ag > index.cgi
view doc/manual/sbb.tex @ 10:5b21f127e957
silence random gcc warning
author | David A. Holland |
---|---|
date | Mon, 30 May 2022 23:58:14 -0400 |
parents | 13d2b8934445 |
children |
line wrap: on
line source
\chapter{Syntactic Building Blocks} This appendix contains examples of the productions necessary to parse the more common syntactic elements in your grammar. Each is provided with sample reduction procedures, which are adequate in the vast majority of cases. You can use these productions like building blocks to quickly assemble quite powerful grammars. These productions can be also be found in the file \agfile{sbb-doc.txt} in the \agfile{examples} directory of your AnaGram distribution. An HTML version may be found in the \agfile{html} directory. % XXX: s/disk// \index{End of file} \paragraph{End of File.} If your parser is going to use conventional stream I/O to read its input, you can define end of file as minus one: \begin{indentingcode}{0.4in} eof = $-$1 \end{indentingcode} If your parser is going to run under Windows/DOS and will use low level I/O instead of stream I/O you might define eof as Control Z: \begin{indentingcode}{0.4in} eof = \^{}Z \end{indentingcode} % XXX the following is crap If your parser is going to run under UNIX and will use low level I/O instead of stream I/O you might define eof as Control D: \begin{indentingcode}{0.4in} eof = \^{}D \end{indentingcode} If your parser is simply going to parse a string in memory, the definition of end of file should be a null character: \begin{indentingcode}{0.4in} eof = 0 \end{indentingcode} It is often convenient, however, to simply define end of file so it will work in all of these contexts: \begin{indentingcode}{0.4in} eof = $-$1 + 0 + \^{}D + \^{}Z \end{indentingcode} \index{White space} \paragraph{White Space.} It is convenient to have a representation for white space in your grammar. Usually you do not wish to distinguish between space characters and tab characters, so you can write: \begin{indentingcode}{0.4in} blank = ' ' + '{\bs}t' \end{indentingcode} Using this definition, you can represent required white space of indeterminate length with \agcode{blank...} and optional white space with \agcode{blank?...} It is common, however, to include comments (see below) as white space. In this case, you can define the following productions: \begin{indentingcode}{0.4in} ws -> blank -> comment \end{indentingcode} \index{End of line} \paragraph{End of Line.} Because different systems use different representations for end of line, it is wise to use an abstract end of line token in your grammar, and define the token separately. If your parser is going to use files that are known to use carriage return alone or line feed alone as the end of line delimiter you can use one of the following definitions: \begin{indentingcode}{0.4in} eol = '{\bs}r' // carriage return only eol = '{\bs}n' // line feed only \end{indentingcode} If your input files use the newline character as a line terminator, but you wish to allow for optional carriage returns, you might write: \begin{indentingcode}{0.4in} eol -> '{\bs}r'?, '{\bs}n' \end{indentingcode} or even \begin{indentingcode}{0.4in} eol -> '{\bs}r'?..., '{\bs}n' \end{indentingcode} If you wish to make a grammar that can deal with a file of more or less arbitrary provenance, some caution is in order. If you allow carriage return and line feed both to function as end of line characters and you allow blank lines, you may wind up with a very ambiguous grammar. For example, suppose you use \agcode{eol...} somewhere in your grammar and you have defined eol thus: \begin{indentingcode}{0.4in} eol -> '{\bs}r' -> '{\bs}n' -> '{\bs}r', '{\bs}n' // trouble! \end{indentingcode} Your grammar is ambiguous since a carriage return, line feed pair can be considered two line endings, according to the first two productions, or just one, according to the third. If you really need to allow isolated carriage returns and linefeeds to mark ends of line, but need to deal with the pair as well, you can make good use of the idiosyncracies of AnaGram's keyword logic by writing the following: \begin{indentingcode}{0.4in} eol -> '{\bs}r' -> '{\bs}n' -> "{\bs}r{\bs}n" \end{indentingcode} This will treat a carriage return followed by a line feed as a single end of line. \index{Comments}\index{Parsing} \paragraph{Comments.} There are two different categories of comments in common use: those that run to the end of line, and those that run to a specific terminator. The first can be conveniently incorporated into your end of line token as a virtual production: \begin{indentingcode}{0.4in} eol -> ["//", \~{}(eof + '{\bs}n')?...], '{\bs}n' \end{indentingcode} Thus, \agcode{eol} allows for an optional comment at the end of every line. Note that you never want to allow end of file characters in your comments. % XXX: simply/simple Conventional C comments are slightly more complicated, depending on your treatment of nesting. If you are not interested in nesting comments, you can simply use the following simple syntax: \begin{indentingcode}{0.4in} comment -> "/*", \~{}eof?..., "*/" \end{indentingcode} Note that this production works because keywords take precedence over ordinary character input. % XXX add: Otherwise, it would probably conflict with other uses of % ``*'' and ``/''. A somewhat more complex way to write this is useful: \begin{indentingcode}{0.4in} comment -> comment head, "*/" -> comment head, comment comment head -> "/*" -> comment head, \~{}eof \end{indentingcode} Note that where the previous grammar simply ignored the beginning of any nested comment, this grammar recognizes a nested comment, but then deliberately chooses to ignore nesting. % XXX add: If you want your comments to nest, you can easily rewrite this grammar to allow nesting: \begin{indentingcode}{0.4in} comment -> comment head, "*/" comment head -> "/*" -> comment head, \~{}eof -> comment head, comment \end{indentingcode} Note that the only difference between nested and non-nested comments is whether the grammar rule \agcode{comment head, comment} reduces to \agcode{comment head} or to \agcode{comment}. If you wish to defer the question of nesting to run-time, you can use a semantically determined production to make the decision: \begin{indentingcode}{0.4in} comment -> comment head, "*/" comment head -> "/*" -> comment head, \~{}eof comment, comment head -> comment head, comment = check{\us}nesting(); \end{indentingcode} Although the final production in this grammar has a somewhat inscrutable air about it, a little study will show that if \agcode{check{\us}nesting} sets the reduction token to \agcode{comment}, the effective grammar is the same as the grammar above for non-nested comments. On the other hand, if \agcode{check{\us}nesting} sets the reduction token to comment head, the effective grammar is the same as the grammar for nested comments. Assuming you have a switch called \agcode{nest{\us}comments}, you could write \agcode{check{\us}nesting} as follows: \begin{indentingcode}{0.4in} check{\us}nesting() \bra if (nest{\us}comments) CHANGE{\us}REDUCTION(comment{\us}head); else CHANGE{\us}REDUCTION(comment); \ket \end{indentingcode} % XXX add some void to that The \agcode{else} clause in this procedure is technically unnecessary since \agcode{comment} is the default value of the reduction token. If you wish to skip white space in your input, and wish to consider comments as simple white space, you might want to add the following production to your grammar: \begin{indentingcode}{0.4in} ws -> blank -> comment \end{indentingcode} You can then use the \agparam{disregard} statement to ignore \agcode{ws} appropriately. \index{Integers}\index{Parsing} \paragraph{Integers.} It is quite easy to convert ordinary integers to binary form. For instance: \begin{indentingcode}{0.4in} (int) decimal integer -> '0-9':d = d-'0'; -> decimal integer:n, '0-9':d = 10*n + d-'0'; \end{indentingcode} If necessary, you can specify that the value of decimal integer be maintained as a long: \begin{indentingcode}{0.4in} (long) decimal integer -> '0-9':d = d-'0'; -> decimal integer:n, '0-9':d = 10*n + d-'0'; \end{indentingcode} Other representations are equally simple: \begin{indentingcode}{0.4in} (long) octal integer -> '0' = 0; -> octal integer:n, '0-7':d = 8*n + d-'0'; (long) hex digit -> '0-9':d = d-'0'; -> 'A-F' + 'a-f':d = 9 + (d \& 7); (long) hex integer -> {\bra}"0x" + "0X"{\ket}, hex digit:d = d; -> hex integer:n, hex digit:d = 16*n + d; \end{indentingcode} % XXX: s/if/if as shown/ Note that if you use the C convention for octal integers, you have to redefine decimal integer to avoid ambiguity: \begin{indentingcode}{0.4in} (long) decimal integer -> '1-9':d = d-'0'; -> decimal integer:n, '0-9':d = 10*n + d-'0'; \end{indentingcode} % XXX add: This is because 0 standing alone could be either octal or % decimal. You can then represent the general case as follows: \begin{indentingcode}{0.4in} (long) integer -> decimal integer | octal integer | hex integer \end{indentingcode} You can allow for a signed integer with the following productions: \begin{indentingcode}{0.4in} (long) signed integer -> '+'?, integer:n = n; -> '-', integer:n = -n; \end{indentingcode} If you have included a disregard statement in your grammar to skip over irrelevant white space in your input, you might add the following to avoid skipping white space inside a number: \begin{indentingcode}{0.4in} {}[ lexeme \bra integer \ket ] \end{indentingcode} Note that if you were to declare signed integer as a lexeme, your parser would not allow space between a leading sign and the integer. % XXX. This should really at least mention dealing with integer % overflow. \index{Floating point numbers} \index{Parsing} \paragraph{Floating Point Numbers.} Floating point numbers are somewhat more complex than integers, but not significantly so: \begin{indentingcode}{0.4in} (double) real -> simple real -> integer part:r, exponent field:x = r*pow10(x); -> simple real:r, exponent field:x = r*pow10(x); (double) simple real -> integer part:i, '.', fraction part:f = i+f; -> integer part, '.' -> '.', fraction part:f = f; (double) integer part -> '0-9':d = d-'0'; -> integer part:n, '0-9':d = 10*n + d-'0'; (double) fraction part -> '0-9':d = (d-'0')/10.; -> '0-9':d, fraction part:f = (d-'0'+f)/10.; exponent field -> 'e' + 'E', signed exponent:x = x; signed exponent -> '+'?, exponent:n = n; -> '-', exponent:n = -n; exponent -> '0-9':d = d-'0'; -> exponent:n, '0-9':d = 10*n + d-'0'; \end{indentingcode} % XXX this is all wet - you really need to collect it as a string and % then use strtod on it, or you lose precision. Note that fraction part uses right recursion rather than left recursion. \agcode{exponent} is effectively the same as \agcode{decimal integer}, above, but allows for initial zeroes. The situation becomes somewhat more complex if you wish to allow both integer and floating point forms, particularly if you wish to follow the C convention for octal integers. First, you cannot have distinct productions for integer part and decimal integer, since there is no way to distinguish them until a decimal point or an exponent field is encountered. Second, since 0129.3 looks for all the world like an octal number until the ``9'' is encountered, one either has to postpone all conversion calculations until the issue is resolved or resort to trickery. Here is a way to resolve the problem by redefining integer part: \begin{indentingcode}{0.4in} (double) integer part -> confusion -> octal integer:n = octal2decimal(n); -> decimal integer:n = n; (double) confusion -> octal integer:n, '8-9':d = 10*octal2decimal(n) + d-'0'; -> confusion:x, '0-9':d = 10*x + d-'0'; \end{indentingcode} where octal2decimal is defined thus: \begin{indentingcode}{0.4in} double octal2decimal(int n) \bra if (n) return 10*octal2decimal(n/8) + n\%8; else return 0; \ket \end{indentingcode} Here confusion represents a number that starts off looking like an octal integer, but then turns into a decimal number, because an eight, a nine, a decimal pointer or an exponent field is encountered. When this occurs, the function octal2decimal undoes the octal conversion that had already been done, and redoes the conversion as decimal conversion. If you have included a disregard statement in your grammar to skip over irrelevant white space in your input, you might add the following to avoid skipping white space inside a real: \begin{indentingcode}{0.4in} {}[ lexeme \bra real \ket ] \end{indentingcode} \index{Names}\index{Parsing} \paragraph{Names.} In almost all grammars, it is necessary to identify names. To accumulate the characters that make up the name it is convenient to use a stack. The reduction procedures in this example use the functions \agcode{ipn} and \agcode{pcn} to accumulate the characters. \agcode{ipn} initializes storage for the name and stores the first character. \agcode{pcn} adds a single character to the name. This grammar presumes the existence of a function called \agcode{identify{\us}name} which would look up the name in a symbol table and return an identifying index. \begin{indentingcode}{0.4in} letter = 'a-z' + 'A-Z' + '{\us}' digit = '0-9' (int) name -> name string = identify{\us}name(); name string -> letter:c = ipn(c); -> name string, letter+digit:c = pcn(c); \bra // Embedded C to accumulate name char name[82]; int name{\us}length; /* Init and Put char to Name */ void ipn(int c) \bra name[0] = c; name{\us}length = 1; \ket /* Put Char to Name */ void pcn(int c) \bra assert(name{\us}length < 82); name[name{\us}length++] = c; \ket \ket // End of embedded C \end{indentingcode} % XXX: can I please change the names of ipn/pcn, make the global data % static, and so on? If you have included a disregard statement in your grammar to skip over irrelevant white space in your input, you might add the following to avoid skipping white space inside a name: \begin{indentingcode}{0.4in} {}[ lexeme \bra name \ket ] \end{indentingcode} \index{Names with embedded white space}\index{Parsing} \paragraph{Names with Embedded White Space.} It is sometimes convenient to allow symbol names to contain embedded white space. This is easily done, although it requires a bit of a trick: \begin{indentingcode}{0.4in} name -> name string, ws?... = identify{\us}name(); name string -> letter:c = ipn(c); -> name string, letter+digit:c = pcn(c); -> name string, ws..., letter+digit:c = pcn(' '), pcn(c); \end{indentingcode} Note that the last production reduces all contiguous white space within a name to a single blank. Allowing optional blanks following \agcode{name string} is important. If you don't allow them there, any \agcode{ws} following a name will be presumed to be embedded \agcode{ws}, requiring another \agcode{letter} or \agcode{digit}, which is not what you intend. There are two ways to accomplish this. The first, shown above, explicitly allows for optional white space following \agcode{name string}. The second method is to use the disregard and lexeme statements: \begin{indentingcode}{0.4in} {}[ disregard ws lexeme \bra name \ket ] \end{indentingcode} If you use the \agparam{disregard} statement you should not include a provision for white space in the production for \agcode{name}. Just leave it as it was in the previous example. % XXX: s/provision for/provision for trailing/ \index{Character Strings}\index{Parsing} \paragraph{Character Strings.} Character strings are often required. The simplest way to implement character strings is as follows: \begin{indentingcode}{0.4in} character string -> '"', \~{}(eof + '"')?..., '"' \end{indentingcode} This approach has the disadvantage that it makes no provision for nonprinting characters. There are numerous ways to provide for nonprinting characters in your character strings. However, you can avoid tedious documentation by using the same rules for nonprinting characters that C uses. Unfortunately, the C rules for octal and hexadecimal escape sequences complicate the writing of the grammar quite substantially. For example, if you wish to write a string that consists of ASCII code 11 followed by the ascii digit ``2'', you must pad with a leading zero, writing \agcode{{\bs}0132}, since \agcode{{\bs}132} according to the rules is a single three digit octal escape sequence designating ASCII code 90. The problem is that the rules allow for one, two or three digit octal escape sequences, but sequences shorter than three digits have to be followed by the end of the string or a character that is not an ASCII digit. There is a similar, if not worse, problem with hex escape sequences. There is no limit on the length of a hex escape sequence, so there is no possible way to make the character ``2'' follow a hex escape sequence without using another escape sequence. A straightforward approach to writing a grammar for strings consistent with the C conventions yields a number of conflicts which correspond exactly to the problems discussed above. While it is certainly possible to write a grammar for strings that has no conflicts, it is easier to proceed in a straightforward manner and use a sticky declaration to resolve the ambiguities. Here is the complete grammar for a character string in accordance with the C rules. In order to accumulate the contents of the string, this grammar uses the functions \agcode{ics} and \agcode{acs}, to initialize storage for a character string and to append a character to that string respectively. % XXX can I please change the names of ics and acs? \begin{indentingcode}{0.4in} character string -> string chars, '"' string chars -> '"' = ics(); -> string chars, string char:c = acs(c); string char -> simple string char -> escape sequence simple string char = \~{}eof - ('"' + '{\bs\bs}' + '{\bs}n') (int) escape sequence -> "{\bs\bs}a" ='{\bs}a'; -> "{\bs\bs}b" ='{\bs}b'; -> "{\bs\bs}f" ='{\bs}f'; -> "{\bs\bs}n" ='{\bs}n'; -> "{\bs\bs}r" ='{\bs}r'; -> "{\bs\bs}t" ='{\bs}t'; -> "{\bs\bs}v" ='{\bs}v'; -> "{\bs\bs\bs\bs}" ='{\bs\bs}'; -> "{\bs\bs}?" = '{\bs}?'; -> "{\bs\bs}'" ='{\bs}''; -> "{\bs\bs\bs}"" = '"'; -> octal escape -> hex escape (int) octal escape -> one octal | two octal | three octal (int) one octal -> '{\bs\bs}', '0-7':d = d-'0'; (int) two octal -> one octal:n, '0-7':d = 8*n + d-'0'; (int) three octal -> two octal:n, '0-7':d = 8*n + d-'0'; (int) hex escape -> "{\bs\bs}x", hex number (int) hex number -> hex digit -> hex number:n, hex digit:d = 16*n + d; {}[ sticky one octal, two octal, hex number ] \bra // Embedded C to define ics and acs char string{\us}store[82]; int length; void ics(void) \bra length = 0; \ket void acs(int c) \bra assert(length < 82); string{\us}store[length++] = c; \ket \ket // End of embedded C \end{indentingcode} If you have included a disregard statement in your grammar to skip over irrelevant white space in your input, you might add the following to avoid skipping white space inside a name: % XXX inside a what? \begin{indentingcode}{0.4in} {}[ lexeme \bra character string \ket ] \end{indentingcode} \index{Character constants}\index{Parsing} \paragraph{Character Constants.} It is almost trivial to extend the above syntax for character strings to allow simple character constants: \begin{indentingcode}{0.4in} (int) character constant -> '{\bs}'', simple char:c, '{\bs}'' = c; (int) simple char -> \~{}eof - ('{\bs}'' + '{\bs\bs}' + '{\bs}n') -> escape sequence \end{indentingcode} The value of the character constant token is the character code found in the input stream. If you have included a disregard statement in your grammar to skip over irrelevant white space in your input, you might add the following to avoid skipping white space inside a character constant: \begin{indentingcode}{0.4in} {}[ lexeme \bra character constant \ket ] \end{indentingcode} \index{Expressions}\index{Parsing} \paragraph{Simple Expressions.} It is often convenient to allow for simple expressions in your input. The syntax for expression logic is written in a conventional way, using a separate token for each precedence level desired. The following grammar will accept simple addition, subtraction, multiplication and division of floating point numbers: \begin{indentingcode}{0.4in} (double) expression -> term -> expression:x, '+', term:t = x+t; -> expression:x, '-', term:t = x-t; (double) term -> factor -> term:t, '*', factor:f = t*f; -> term:t, '/', factor:f = t/f; (double) factor -> real -> '(', expression:x, ')' = x; -> '-', factor:f = -f; \end{indentingcode} An alternative way to write expression syntax is to write an ambiguous grammar and use precedence declarations to resolve the conflicts: \begin{indentingcode}{0.4in} (double) expression -> expression:x, '+', expression:y = x+y; -> expression:x, '-', expression:y = x-y; -> expression:x, '*', expression:y = x*y; -> expression:x, '/', expression:y = x/y; -> unary minus, expression:x = -x; -> '(', expression:x, ')' = x; -> real unary minus = '-' {}[ left '+', '-' left '*', '/' right unary minus ] \end{indentingcode} Note that we deal with the different precedence levels of unary and binary minus by defining \agcode{unary minus}, which AnaGram treats as distinct from the simple \agcode{'-'}, and giving it different precedence.