Mercurial > ~dholland > hg > ag > index.cgi
diff doc/manual/sbb.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/manual/sbb.tex Sat Dec 22 17:52:45 2007 -0500 @@ -0,0 +1,701 @@ +\chapter{Syntactic Building Blocks} + +This appendix contains examples of the productions necessary to parse +the more common syntactic elements in your grammar. Each is provided +with sample reduction procedures, which are adequate in the vast +majority of cases. You can use these productions like building blocks +to quickly assemble quite powerful grammars. These productions can be +also be found in the file \agfile{sbb-doc.txt} in the +\agfile{examples} directory of your AnaGram distribution. An +HTML version may be found in the \agfile{html} directory. +% XXX: s/disk// + +\index{End of file} +\paragraph{End of File.} +If your parser is going to use conventional stream I/O to read its +input, you can define end of file as minus one: +\begin{indentingcode}{0.4in} + eof = $-$1 +\end{indentingcode} + +If your parser is going to run under Windows/DOS and will use low level I/O +instead of stream I/O you might define eof as Control Z: + +\begin{indentingcode}{0.4in} +eof = \^{}Z +\end{indentingcode} + +% XXX the following is crap +If your parser is going to run under UNIX and will use low level I/O instead of +stream I/O you might define eof as Control D: +\begin{indentingcode}{0.4in} +eof = \^{}D +\end{indentingcode} + +If your parser is simply going to parse a string in memory, the +definition of end of file should be a null character: + +\begin{indentingcode}{0.4in} +eof = 0 +\end{indentingcode} + +It is often convenient, however, to simply define end of file so it +will work in all of these contexts: + +\begin{indentingcode}{0.4in} +eof = $-$1 + 0 + \^{}D + \^{}Z +\end{indentingcode} + +\index{White space} +\paragraph{White Space.} +It is convenient to have a representation for white space in your +grammar. Usually you do not wish to distinguish between space +characters and tab characters, so you can write: + +\begin{indentingcode}{0.4in} +blank = ' ' + '{\bs}t' +\end{indentingcode} + +Using this definition, you can represent required white space of +indeterminate length with \agcode{blank...} and optional white space +with \agcode{blank?...} + +It is common, however, to include comments (see below) as white space. +In this case, you can define the following productions: + +\begin{indentingcode}{0.4in} +ws + -> blank + -> comment +\end{indentingcode} + +\index{End of line} +\paragraph{End of Line.} + +Because different systems use different representations for end of +line, it is wise to use an abstract end of line token in your grammar, +and define the token separately. If your parser is going to use files +that are known to use carriage return alone or line feed alone as the +end of line delimiter you can use one of the following definitions: + +\begin{indentingcode}{0.4in} +eol = '{\bs}r' // carriage return only +eol = '{\bs}n' // line feed only +\end{indentingcode} + +If your input files use the newline character as a line terminator, +but you wish to allow for optional carriage returns, you might write: + +\begin{indentingcode}{0.4in} +eol + -> '{\bs}r'?, '{\bs}n' +\end{indentingcode} + +or even + +\begin{indentingcode}{0.4in} +eol + -> '{\bs}r'?..., '{\bs}n' +\end{indentingcode} + +If you wish to make a grammar that can deal with a file of more or +less arbitrary provenance, some caution is in order. If you allow +carriage return and line feed both to function as end of line +characters and you allow blank lines, you may wind up with a very +ambiguous grammar. For example, suppose you use \agcode{eol...} +somewhere in your grammar and you have defined eol thus: + +\begin{indentingcode}{0.4in} +eol + -> '{\bs}r' + -> '{\bs}n' + -> '{\bs}r', '{\bs}n' // trouble! +\end{indentingcode} + +Your grammar is ambiguous since a carriage return, line feed pair can +be considered two line endings, according to the first two +productions, or just one, according to the third. + +If you really need to allow isolated carriage returns and linefeeds to +mark ends of line, but need to deal with the pair as well, you can +make good use of the idiosyncracies of AnaGram's keyword logic by +writing the following: + +\begin{indentingcode}{0.4in} +eol + -> '{\bs}r' + -> '{\bs}n' + -> "{\bs}r{\bs}n" +\end{indentingcode} + +This will treat a carriage return followed by a line feed as a single +end of line. + +\index{Comments}\index{Parsing} +\paragraph{Comments.} +There are two different categories of comments in common use: those +that run to the end of line, and those that run to a specific +terminator. The first can be conveniently incorporated into your end +of line token as a virtual production: + +\begin{indentingcode}{0.4in} +eol + -> ["//", \~{}(eof + '{\bs}n')?...], '{\bs}n' +\end{indentingcode} + +Thus, \agcode{eol} allows for an optional comment at the end of every +line. Note that you never want to allow end of file characters in +your comments. + +% XXX: simply/simple +Conventional C comments are slightly more complicated, depending on +your treatment of nesting. If you are not interested in nesting +comments, you can simply use the following simple syntax: + +\begin{indentingcode}{0.4in} +comment + -> "/*", \~{}eof?..., "*/" +\end{indentingcode} + +Note that this production works because keywords take precedence over +ordinary character input. +% XXX add: Otherwise, it would probably conflict with other uses of +% ``*'' and ``/''. + +A somewhat more complex way to write this is useful: + +\begin{indentingcode}{0.4in} +comment + -> comment head, "*/" + -> comment head, comment +comment head + -> "/*" + -> comment head, \~{}eof +\end{indentingcode} + +Note that where the previous grammar simply ignored the beginning of +any nested comment, this grammar recognizes a nested comment, but then +deliberately chooses to ignore nesting. +% XXX add: + +If you want your comments to nest, you can easily rewrite this grammar +to allow nesting: + +\begin{indentingcode}{0.4in} +comment + -> comment head, "*/" +comment head + -> "/*" + -> comment head, \~{}eof + -> comment head, comment +\end{indentingcode} + +Note that the only difference between nested and non-nested comments +is whether the grammar rule \agcode{comment head, comment} reduces to +\agcode{comment head} or to \agcode{comment}. + +If you wish to defer the question of nesting to run-time, you can +use a semantically determined production to make the decision: + +\begin{indentingcode}{0.4in} +comment + -> comment head, "*/" +comment head + -> "/*" + -> comment head, \~{}eof +comment, comment head + -> comment head, comment = check{\us}nesting(); +\end{indentingcode} + +Although the final production in this grammar has a somewhat +inscrutable air about it, a little study will show that if +\agcode{check{\us}nesting} sets the reduction token to \agcode{comment}, +the effective grammar is the same as the grammar above for non-nested +comments. On the other hand, if \agcode{check{\us}nesting} sets the +reduction token to comment head, the effective grammar is the same as +the grammar for nested comments. + +Assuming you have a switch called \agcode{nest{\us}comments}, you could +write \agcode{check{\us}nesting} as follows: + +\begin{indentingcode}{0.4in} +check{\us}nesting() \bra + if (nest{\us}comments) + CHANGE{\us}REDUCTION(comment{\us}head); + else + CHANGE{\us}REDUCTION(comment); +\ket +\end{indentingcode} +% XXX add some void to that + +The \agcode{else} clause in this procedure is technically unnecessary +since \agcode{comment} is the default value of the reduction token. + +If you wish to skip white space in your input, and wish to consider +comments as simple white space, you might want to add the following +production to your grammar: + +\begin{indentingcode}{0.4in} +ws + -> blank + -> comment +\end{indentingcode} + +You can then use the \agparam{disregard} statement to ignore +\agcode{ws} appropriately. + +\index{Integers}\index{Parsing} +\paragraph{Integers.} +It is quite easy to convert ordinary integers to binary form. For +instance: + +\begin{indentingcode}{0.4in} +(int) decimal integer + -> '0-9':d = d-'0'; + -> decimal integer:n, '0-9':d = 10*n + d-'0'; +\end{indentingcode} + +If necessary, you can specify that the value of decimal integer be +maintained as a long: + +\begin{indentingcode}{0.4in} +(long) decimal integer + -> '0-9':d = d-'0'; + -> decimal integer:n, '0-9':d = 10*n + d-'0'; +\end{indentingcode} + +Other representations are equally simple: + +\begin{indentingcode}{0.4in} +(long) octal integer + -> '0' = 0; + -> octal integer:n, '0-7':d = 8*n + d-'0'; + +(long) hex digit + -> '0-9':d = d-'0'; + -> 'A-F' + 'a-f':d = 9 + (d \& 7); + +(long) hex integer + -> {\bra}"0x" + "0X"{\ket}, hex digit:d = d; + -> hex integer:n, hex digit:d = 16*n + d; +\end{indentingcode} + +% XXX: s/if/if as shown/ +Note that if you use the C convention for octal integers, you have to +redefine decimal integer to avoid ambiguity: + +\begin{indentingcode}{0.4in} +(long) decimal integer + -> '1-9':d = d-'0'; + -> decimal integer:n, '0-9':d = 10*n + d-'0'; +\end{indentingcode} + +% XXX add: This is because 0 standing alone could be either octal or +% decimal. + +You can then represent the general case as follows: + +\begin{indentingcode}{0.4in} +(long) integer + -> decimal integer | octal integer | hex integer +\end{indentingcode} + +You can allow for a signed integer with the following productions: + +\begin{indentingcode}{0.4in} +(long) signed integer + -> '+'?, integer:n = n; + -> '-', integer:n = -n; +\end{indentingcode} + +If you have included a disregard statement in your grammar to skip +over irrelevant white space in your input, you might add the following +to avoid skipping white space inside a number: + +\begin{indentingcode}{0.4in} +{}[ lexeme \bra integer \ket ] +\end{indentingcode} + +Note that if you were to declare signed integer as a lexeme, your +parser would not allow space between a leading sign and the integer. + +% XXX. This should really at least mention dealing with integer +% overflow. + +\index{Floating point numbers} +\index{Parsing} +\paragraph{Floating Point Numbers.} +Floating point numbers are somewhat more complex than integers, but +not significantly so: + +\begin{indentingcode}{0.4in} +(double) real + -> simple real + -> integer part:r, exponent field:x = r*pow10(x); + -> simple real:r, exponent field:x = r*pow10(x); + +(double) simple real + -> integer part:i, '.', fraction part:f = i+f; + -> integer part, '.' + -> '.', fraction part:f = f; + +(double) integer part + -> '0-9':d = d-'0'; + -> integer part:n, '0-9':d = 10*n + d-'0'; + +(double) fraction part + -> '0-9':d = (d-'0')/10.; + -> '0-9':d, fraction part:f = (d-'0'+f)/10.; + +exponent field + -> 'e' + 'E', signed exponent:x = x; + +signed exponent + -> '+'?, exponent:n = n; + -> '-', exponent:n = -n; + +exponent + -> '0-9':d = d-'0'; + -> exponent:n, '0-9':d = 10*n + d-'0'; +\end{indentingcode} + +% XXX this is all wet - you really need to collect it as a string and +% then use strtod on it, or you lose precision. + +Note that fraction part uses right recursion rather than left +recursion. \agcode{exponent} is effectively the same as +\agcode{decimal integer}, above, but allows for initial zeroes. + +The situation becomes somewhat more complex if you wish to allow both +integer and floating point forms, particularly if you wish to follow +the C convention for octal integers. First, you cannot have distinct +productions for integer part and decimal integer, since there is no +way to distinguish them until a decimal point or an exponent field is +encountered. Second, since 0129.3 looks for all the world like an +octal number until the ``9'' is encountered, one either has to +postpone all conversion calculations until the issue is resolved or +resort to trickery. Here is a way to resolve the problem by +redefining integer part: + +\begin{indentingcode}{0.4in} +(double) integer part + -> confusion + -> octal integer:n = octal2decimal(n); + -> decimal integer:n = n; + +(double) confusion + -> octal integer:n, '8-9':d = 10*octal2decimal(n) + d-'0'; + -> confusion:x, '0-9':d = 10*x + d-'0'; +\end{indentingcode} + +where octal2decimal is defined thus: + +\begin{indentingcode}{0.4in} + double octal2decimal(int n) \bra + if (n) return 10*octal2decimal(n/8) + n\%8; + else return 0; + \ket +\end{indentingcode} + +Here confusion represents a number that starts off looking like an +octal integer, but then turns into a decimal number, because an eight, +a nine, a decimal pointer or an exponent field is encountered. When +this occurs, the function octal2decimal undoes the octal conversion +that had already been done, and redoes the conversion as decimal +conversion. + +If you have included a disregard statement in your grammar to skip +over irrelevant white space in your input, you might add the following +to avoid skipping white space inside a real: + +\begin{indentingcode}{0.4in} +{}[ lexeme \bra real \ket ] +\end{indentingcode} + +\index{Names}\index{Parsing} +\paragraph{Names.} +In almost all grammars, it is necessary to identify names. To +accumulate the characters that make up the name it is convenient to +use a stack. The reduction procedures in this example use the +functions \agcode{ipn} and \agcode{pcn} to accumulate the characters. +\agcode{ipn} initializes storage for the name and stores the first +character. \agcode{pcn} adds a single character to the name. This +grammar presumes the existence of a function called +\agcode{identify{\us}name} which would look up the name in a symbol table +and return an identifying index. + +\begin{indentingcode}{0.4in} +letter = 'a-z' + 'A-Z' + '{\us}' +digit = '0-9' + +(int) name + -> name string = identify{\us}name(); + +name string + -> letter:c = ipn(c); + -> name string, letter+digit:c = pcn(c); + +\bra // Embedded C to accumulate name + char name[82]; + int name{\us}length; + + /* Init and Put char to Name */ + void ipn(int c) \bra + name[0] = c; + name{\us}length = 1; + \ket + + /* Put Char to Name */ + void pcn(int c) \bra + assert(name{\us}length < 82); + name[name{\us}length++] = c; + \ket +\ket // End of embedded C +\end{indentingcode} + +% XXX: can I please change the names of ipn/pcn, make the global data +% static, and so on? + +If you have included a disregard statement in your grammar to skip +over irrelevant white space in your input, you might add the following +to avoid skipping white space inside a name: + +\begin{indentingcode}{0.4in} +{}[ lexeme \bra name \ket ] +\end{indentingcode} + +\index{Names with embedded white space}\index{Parsing} +\paragraph{Names with Embedded White Space.} +It is sometimes convenient to allow symbol names to contain embedded +white space. This is easily done, although it requires a bit of a +trick: + +\begin{indentingcode}{0.4in} +name + -> name string, ws?... = identify{\us}name(); + +name string + -> letter:c = ipn(c); + -> name string, letter+digit:c = pcn(c); + -> name string, ws..., letter+digit:c = pcn(' '), pcn(c); +\end{indentingcode} + +Note that the last production reduces all contiguous white space +within a name to a single blank. + +Allowing optional blanks following \agcode{name string} is important. +If you don't allow them there, any \agcode{ws} following a name will +be presumed to be embedded \agcode{ws}, requiring another +\agcode{letter} or \agcode{digit}, which is not what you intend. +There are two ways to accomplish this. The first, shown above, +explicitly allows for optional white space following \agcode{name +string}. The second method is to use the disregard and lexeme +statements: + +\begin{indentingcode}{0.4in} +{}[ + disregard ws + lexeme \bra name \ket +] +\end{indentingcode} + +If you use the \agparam{disregard} statement you should not include a +provision for white space in the production for \agcode{name}. Just +leave it as it was in the previous example. +% XXX: s/provision for/provision for trailing/ + +\index{Character Strings}\index{Parsing} +\paragraph{Character Strings.} +Character strings are often required. The simplest way to implement +character strings is as follows: + +\begin{indentingcode}{0.4in} +character string + -> '"', \~{}(eof + '"')?..., '"' +\end{indentingcode} + +This approach has the disadvantage that it makes no provision for +nonprinting characters. + +There are numerous ways to provide for nonprinting characters in your +character strings. However, you can avoid tedious documentation by +using the same rules for nonprinting characters that C uses. +Unfortunately, the C rules for octal and hexadecimal escape sequences +complicate the writing of the grammar quite substantially. For +example, if you wish to write a string that consists of ASCII code 11 +followed by the ascii digit ``2'', you must pad with a leading zero, +writing \agcode{{\bs}0132}, since \agcode{{\bs}132} +according to the rules is a single three digit octal escape sequence +designating ASCII code 90. The problem is that the rules allow for +one, two or three digit octal escape sequences, but sequences shorter +than three digits have to be followed by the end of the string or a +character that is not an ASCII digit. There is a similar, if not +worse, problem with hex escape sequences. There is no limit on the +length of a hex escape sequence, so there is no possible way to make +the character ``2'' follow a hex escape sequence without using another +escape sequence. + +A straightforward approach to writing a grammar for strings consistent +with the C conventions yields a number of conflicts which correspond +exactly to the problems discussed above. While it is certainly +possible to write a grammar for strings that has no conflicts, it is +easier to proceed in a straightforward manner and use a sticky +declaration to resolve the ambiguities. + +Here is the complete grammar for a character string in accordance with +the C rules. In order to accumulate the contents of the string, this +grammar uses the functions \agcode{ics} and \agcode{acs}, to +initialize storage for a character string and to append a character to +that string respectively. +% XXX can I please change the names of ics and acs? + +\begin{indentingcode}{0.4in} +character string + -> string chars, '"' + +string chars + -> '"' = ics(); + -> string chars, string char:c = acs(c); + +string char + -> simple string char + -> escape sequence + +simple string char = \~{}eof - ('"' + '{\bs\bs}' + '{\bs}n') + +(int) escape sequence + -> "{\bs\bs}a" ='{\bs}a'; + -> "{\bs\bs}b" ='{\bs}b'; + -> "{\bs\bs}f" ='{\bs}f'; + -> "{\bs\bs}n" ='{\bs}n'; + -> "{\bs\bs}r" ='{\bs}r'; + -> "{\bs\bs}t" ='{\bs}t'; + -> "{\bs\bs}v" ='{\bs}v'; + -> "{\bs\bs\bs\bs}" ='{\bs\bs}'; + -> "{\bs\bs}?" = '{\bs}?'; + -> "{\bs\bs}'" ='{\bs}''; + -> "{\bs\bs\bs}"" = '"'; + -> octal escape + -> hex escape + +(int) octal escape + -> one octal | two octal | three octal + +(int) one octal + -> '{\bs\bs}', '0-7':d = d-'0'; + +(int) two octal + -> one octal:n, '0-7':d = 8*n + d-'0'; + +(int) three octal + -> two octal:n, '0-7':d = 8*n + d-'0'; + +(int) hex escape + -> "{\bs\bs}x", hex number + +(int) hex number + -> hex digit + -> hex number:n, hex digit:d = 16*n + d; + +{}[ + sticky one octal, two octal, hex number +] +\bra // Embedded C to define ics and acs + char string{\us}store[82]; + int length; + + void ics(void) \bra + length = 0; + \ket + void acs(int c) \bra + assert(length < 82); + string{\us}store[length++] = c; + \ket +\ket // End of embedded C +\end{indentingcode} + + +If you have included a disregard statement in your grammar to skip +over irrelevant white space in your input, you might add the following +to avoid skipping white space inside a name: +% XXX inside a what? + +\begin{indentingcode}{0.4in} +{}[ lexeme \bra character string \ket ] +\end{indentingcode} + +\index{Character constants}\index{Parsing} +\paragraph{Character Constants.} +It is almost trivial to extend the above syntax for character strings +to allow simple character constants: + +\begin{indentingcode}{0.4in} +(int) character constant + -> '{\bs}'', simple char:c, '{\bs}'' = c; + +(int) simple char + -> \~{}eof - ('{\bs}'' + '{\bs\bs}' + '{\bs}n') + -> escape sequence +\end{indentingcode} + +The value of the character constant token is the character code found +in the input stream. + +If you have included a disregard statement in your grammar to skip +over irrelevant white space in your input, you might add the following +to avoid skipping white space inside a character constant: + +\begin{indentingcode}{0.4in} +{}[ lexeme \bra character constant \ket ] +\end{indentingcode} + +\index{Expressions}\index{Parsing} +\paragraph{Simple Expressions.} +It is often convenient to allow for simple expressions in your input. +The syntax for expression logic is written in a conventional way, +using a separate token for each precedence level desired. The +following grammar will accept simple addition, subtraction, +multiplication and division of floating point numbers: + +\begin{indentingcode}{0.4in} +(double) expression + -> term + -> expression:x, '+', term:t = x+t; + -> expression:x, '-', term:t = x-t; + +(double) term + -> factor + -> term:t, '*', factor:f = t*f; + -> term:t, '/', factor:f = t/f; + +(double) factor + -> real + -> '(', expression:x, ')' = x; + -> '-', factor:f = -f; +\end{indentingcode} + +An alternative way to write expression syntax is to write an ambiguous +grammar and use precedence declarations to resolve the conflicts: + +\begin{indentingcode}{0.4in} +(double) expression + -> expression:x, '+', expression:y = x+y; + -> expression:x, '-', expression:y = x-y; + -> expression:x, '*', expression:y = x*y; + -> expression:x, '/', expression:y = x/y; + -> unary minus, expression:x = -x; + -> '(', expression:x, ')' = x; + -> real + +unary minus = '-' +{}[ + left '+', '-' + left '*', '/' + right unary minus +] +\end{indentingcode} + +Note that we deal with the different precedence levels of unary and +binary minus by defining \agcode{unary minus}, which AnaGram treats as +distinct from the simple \agcode{'-'}, and giving it different +precedence.