Mercurial > ~dholland > hg > ag > index.cgi
diff doc/misc/html/examples/sbb-doc.html @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/misc/html/examples/sbb-doc.html Sat Dec 22 17:52:45 2007 -0500 @@ -0,0 +1,727 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> +<HTML> +<HEAD> +<TITLE>Syntactic Building Blocks</TITLE> +</HEAD> + + +<BODY BGCOLOR="#ffffff" BACKGROUND="tilbl6h.gif" + TEXT="#000000" LINK="#0033CC" + VLINK="#CC0033" ALINK="#CC0099"> + +<P> + +<IMG ALIGN="right" SRC="../images/agrsl6c.gif" ALT="AnaGram" + WIDTH=124 HEIGHT=30 > +<BR CLEAR="all"> +Back to <A HREF="../index.html">Index</A> +<P> +<IMG ALIGN="bottom" SRC="../images/rbline6j.gif" ALT="----------------------" + WIDTH=1010 HEIGHT=2 > + + +<BR CLEAR="all"> + +<H1>Syntactic Building Blocks</H1> + +<IMG ALIGN="bottom" SRC="../images/rbline6j.gif" ALT="----------------------" + WIDTH=1010 HEIGHT=2 > +<P> +<BR> + +<H2>Introduction</H2> + + <tt>examples/sbb-doc.txt</tt> contains examples of the + productions necessary to + parse the more common syntactic elements in your grammar. It + should be of interest once you have begun to write your own + grammars. Each example is provided with sample reduction + procedures, which are adequate in the vast majority of cases. + You can use these productions like building blocks to quickly + assemble quite powerful grammars. +<P> + You are reading an HTML version of <tt>sbb-doc.txt</tt>. The + actual text file + should be easier to copy sections from. The text of this file is + also to be found in Appendix D of the AnaGram User's Guide. +<P> +<BR> + +<H2>End of File</H2> + + If your parser is going to use + stream I/O to read its input, you can define end of file as + minus one: + <PRE> eof = -1 </PRE> + + If your parser is going to run under Windows/DOS and will use low + level I/O instead of stream I/O you might + define <STRONG>eof</STRONG> as + Control Z: + <PRE> eof = ^Z </PRE> +<P> + If your parser is going to run under UNIX and will use low + level I/O instead of stream I/O you might define + <STRONG>eof</STRONG> as + Control D: + <PRE> eof = ^D </PRE> +<P> + If your parser is simply going to parse a string in memory, + the definition of end of file should be a null character: + <PRE> eof = 0 </PRE> +<P> + It is often convenient, however, to simply define end of + file so it will work in all of these contexts: + <PRE> eof = -1 + 0 + ^D + ^Z </PRE> +<P> +<BR> + + <H2>White Space</H2> + + It is convenient to have a representation for white space in + your grammar. Usually you do not wish to distinguish between + space characters and tab characters, so you can write: + <PRE> blank = ' ' + '\t' </PRE> +<P> + Using this definition, you can represent required white + space of indeterminate length with +<PRE> blank... </PRE> and optional + white space with +<PRE> blank?... </PRE> +<P> + It is common, however, to include comments (see below) as + white space. In this case, you can define the following + productions: +<PRE> ws + ->: blank + -> comment +</PRE> +<P> +<BR> + + <H2>End of Line</H2> + + Because different systems use different representations for + end of line, it is wise to use an abstract end of line token + in your grammar, and define the token separately. If your + parser is going to use files that are known to use carriage + return alone or line feed alone as the end of line delimiter + you can use one of the following definitions: +<PRE> eol = '\r' //carriage return only + eol = '\n' //line feed only +</PRE> + If your input files use the newline character as a line + terminator, but you wish to allow for optional carriage + returns, you might write: +<PRE> + eol + -> '\r'?, '\n' +</PRE> + or even + <PRE> + eol + -> '\r'?..., '\n' +</PRE> +<P> + If you wish to make a grammar that can deal with a file of + more or less arbitrary provenance, some caution is in order. + If you allow carriage return and line feed both to function + as end of line characters and you allow blank lines, you may + wind up with a very ambiguous grammar. For example, suppose + you use <STRONG>eol...</STRONG> somewhere in your grammar and you have + defined <STRONG>eol</STRONG> thus: +<PRE> + eol + -> '\r' + -> '\n' + -> '\r', '\n' // trouble! +</PRE> + Your grammar is ambiguous since a carriage return, line feed + pair can be considered two line endings, according to the + first two production, or just one, according to the third. +<P> + If you really need to allow isolated carriage returns and + linefeeds to mark ends of line, but need to deal with the + pair as well, you can make good use of the idiosyncracies of + AnaGram's keyword logic by writing the following: +<PRE> + eol + -> '\r' + -> '\n' + -> "\r\n" +</PRE> + This will treat a carriage return followed by a line feed as + a single end of line. +<P> +<BR> + +<H2> Comments</H2> + + There are two different categories of comments in common + use: Those that run to the end of line, and those that run + to a specific terminator. The first can be conveniently + incorporated into your end of line token as a virtual + production: +<PRE> + eol + -> ["//", ~(eof + '\n')?...], '\n' +</PRE> + Thus, <STRONG>eol</STRONG> allows for an optional comment at the end of every + line. Note that you never want to allow end of file + characters in your comments. +<P> + Conventional C comments are slightly more complicated, + depending on your treatment of nesting. If you are not + interested in nesting comments, you can simply use the + following simple syntax: +<PRE> + comment + -> "/*", ~eof?..., "*/" +</PRE> + Note that this production works because keywords take + precedence over ordinary character input. +<P> + A somewhat more complex way to write this is useful: +<PRE> + comment + -> comment head, "*/" + -> comment head, comment + comment head + -> "/*" + -> comment head, ~eof +</PRE> + Note that where the previous grammar simply ignored the + beginning of any nested comment, this grammar recognizes a + nested comment, but then deliberately chooses to ignore + nesting. +<P> + If you want your comments to nest, you can easily rewrite + this grammar to allow nesting: +<PRE> + comment + -> comment head, "*/" + comment head + -> "/*" + -> comment head, ~eof + -> comment head, comment +</PRE> + Note that the only difference between nested and non-nested + comments is whether the grammar rule + <STRONG>comment head, comment</STRONG> + reduces to <STRONG>comment head</STRONG> or + to <STRONG>comment</STRONG>. +<P> + If you wish to defer the question of nesting to run-time, + you can use a semantically determined production to make the + decision: +<PRE> + comment + -> comment head, "*/" + comment head + -> "/*" + -> comment head, ~eof + comment, comment head + -> comment head, comment =check_nesting(); +</PRE> +<P> + Although the final production in this grammar has a somewhat + inscrutable air about it, a little study will show that if + <STRONG>check_nesting</STRONG> sets the reduction + token to <STRONG>comment</STRONG>, the + effective grammar is the same as the grammar above for + non-nested comments. On the other hand, + if <STRONG>check_nesting</STRONG> + sets the reduction token + to <STRONG>comment head</STRONG>, the effective + grammar is the same as the grammar for nested comments. +<P> + Assuming you have a switch called + <STRONG>nest_comments</STRONG>, you could + write <STRONG>check_nesting</STRONG> as follows: +<PRE> + check_nesting() { + if (nest_comments) + CHANGE_REDUCTION(comment_head); + else + CHANGE_REDUCTION(comment); + } +</PRE> + The else clause in this procedure is technically unnecessary + since comment is the default value of the reduction token. +<P> + If you wish to skip white space in your input, and wish to + consider comments as simple white space, you might want to + add the following production to your grammar: +<PRE> + ws + -> blank + -> comment +</PRE> + You can then use the <STRONG>disregard</STRONG> statement + to ignore <STRONG>ws</STRONG> + appropriately. +<P> +<BR> + +<H2>Integers</H2> + + It is quite easy to convert ordinary integers to binary + form. For instance: +<PRE> + (int) decimal integer + -> '0-9':d =d-'0'; + -> decimal integer:n, '0-9':d =10*n + d - '0'; +</PRE> + If necessary, you can specify that the value of <STRONG>decimal + integer</STRONG> be maintained as a long: +<PRE> + (long) decimal integer + -> '0-9':d =d-'0'; + -> decimal integer:n, '0-9':d =10*n + d - '0'; +</PRE> + Other representations are equally simple: +<PRE> + (long) octal integer + -> '0' =0; + -> octal integer:n, '0-7':d =8*n + d - '0'; + + (long) hex digit + -> '0-9':d =d-'0'; + -> 'A-F' + 'a-f':d =9 + (d & 7); + + (long) hex integer + -> {"0x" | "0X"}, hex digit:d =d; + -> hex integer:n, hex digit:d =16*n + d; +</PRE> + Note that if you use the C convention for octal integers, + you have to redefine <STRONG>decimal integer</STRONG> + to avoid ambiguity: +<PRE> + (long) decimal integer + -> '1-9':d =d-'0'; + -> decimal integer:n, '0-9':d =10*n + d - '0'; +</PRE> + You can then represent the general case as follows: +<PRE> + (long) integer + -> decimal integer | octal integer | hex integer +</PRE> + You can allow for a signed integer with the following + productions: +<PRE> + (long) signed integer + -> '+'?, integer:n =n; + -> '-', integer:n =-n; +</PRE> + If you have included a <STRONG>disregard</STRONG> + statement in your grammar + to skip over irrelevant white space in your input, you might + add the following to avoid skipping white space inside a + number: +<PRE> + [ lexeme {integer}] +</PRE> + Note that if you were to declare <STRONG>signed + integer</STRONG> as a lexeme, + your parser would not allow space between a leading sign and + the integer. +<P> +<BR> + + <H2>Floating Point Numbers</H2> + + Floating point numbers are somewhat more complex than + integers, but not significantly so: +<PRE> + (double) real + -> simple real + -> integer part:r, exponent field:x =r*pow10(x); + -> simple real:r, exponent field:x =r*pow10(x); + + + (double) simple real + -> integer part:i, '.', fraction part:f =i+f; + -> integer part, '.' + -> '.', fraction part:f =f; + + (double) integer part + -> '0-9':d =d-'0'; + -> integer part:n, '0-9':d =10*n + d-'0'; + + (double) fraction part + -> '0-9':d =(d-'0')/10.; + -> '0-9':d, fraction part:f =(d-'0'+f)/10.; + + exponent field + -> 'e' + 'E', signed exponent:x =x; + + signed exponent + -> '+'?, exponent:n =n; + -> '-', exponent:n =-n; + + exponent + -> '0-9':d =d-'0'; + -> exponent:n, '0-9':d =10*n + d - '0'; +</PRE> + Note that <STRONG>fraction part</STRONG> uses + right recursion rather than + left recursion. <STRONG>exponent</STRONG> is + effectively the same as <STRONG>decimal + integer</STRONG>, above, but allows for initial zeroes. +<P> + The situation becomes somewhat more complex if you wish to + allow both integer and floating point forms, particularly if + you wish to follow the C convention for octal integers. + First, you cannot have distinct productions + for <STRONG>integer part</STRONG> + and <STRONG>decimal integer</STRONG>, since + there is no way to distinguish + them until a decimal point or an exponent field is + encountered. Second, since 0129.3 looks for all the world + like an octal number until the '9' is encountered, one + either has to postpone all conversion calculations until the + issue is resolved or resort to trickery. Here is a way to + resolve the problem by redefining <STRONG>integer part</STRONG>: +<PRE> + (double) integer part + -> confusion + -> octal integer:n =octal2decimal(n); + -> decimal integer:n =n; + + (double) confusion + -> octal integer:n, '8-9':d =10*octal2decimal(n)+d-'0'; + -> confusion:x, '0-9':d =10*x + d-'0'; +</PRE> + where <STRONG>octal2decimal</STRONG> is defined thus: +<PRE> + double octal2decimal(int n) { + if (n) return 10*octal2decimal(n/8) + n%8; + else return 0; + } +</PRE> + Here <STRONG>confusion</STRONG> represents + a number that starts off looking + like an octal integer, but then turns into a decimal number, + because an eight, a nine, a decimal pointer or an exponent + field is encountered. When this occurs, the function + <STRONG>octal2decimal</STRONG> undoes the octal + conversion that had already been + done, and redoes the conversion as decimal conversion. +<P> + If you have included a <STRONG>disregard</STRONG> + statement in your grammar + to skip over irrelevant white space in your input, you might + add the following to avoid skipping white space inside a + real: +<PRE> + [ lexeme {real}] +</PRE> +<BR> + + <H2>Names</H2> + + In almost all grammars, it is necessary to identify names. + To accumulate the characters that make up the name it is + convenient to use a stack. The reduction procedures in this + example use the functions <STRONG>ipn</STRONG> and + <STRONG>pcn</STRONG> to accumulate the + characters. <STRONG>ipn</STRONG> initializes storage + for the name and stores + the first character. <STRONG>pcn</STRONG> adds a single + character to the + name. This grammar presumes the existence of a function + called <STRONG>identify_name</STRONG> which would + look up the name in a + symbol table and return an identifying index. +<PRE> + letter = 'a-z' + 'A-Z' + '_' + digit = '0-9' + (int) name + -> name string =identify_name(); + + name string + -> letter:c =ins(c); + -> name string, letter+digit:c =pcn(c); + + {/* embedded C to accumulate name */ + char name[82]; + int name_length; + void ipn(int c) { /* Init and Put char to Name */ + name[0] = c; + name_length = 1; + } + void pcn(int c) { /* Put Char to Name */ + assert(name_length < 82); + name[name_length++] = c; + } + } // End of embedded C +</PRE> +<P> + If you have included a + <STRONG>disregard</STRONG> statement in your grammar + to skip over irrelevant white space in your input, you might + add the following to avoid skipping white space inside a + name: +<PRE> + [ lexeme {name}] +</PRE> + +<BR> + + <H2>Names with Embedded White Space</H2> + + It is sometimes convenient to allow symbol names to contain + embedded white space. This is easily done, although it + requires a bit of a trick: +<PRE> + name + -> name string, ws?... =identify_name(); + + name string + -> letter:c =ipn(c); + -> name string, letter+digit:c =pcn(c); + -> name string, ws..., letter+digit:c =pcn(' '), pcn(c); +</PRE> + Note that the last production reduces all contiguous white + space within a name to a single blank. +<P> + Allowing optional blanks following name string is important. + If you don't allow them there, any <STRONG>ws</STRONG> + following a name will + be presumed to be embedded <STRONG>ws</STRONG>, + requiring another letter or + digit, which is not what you intend. There are two ways to + accomplish this. The first, shown above, explicitly allows + for optional white space following name string. The second + method is to use the <STRONG>disregard</STRONG> + and <STRONG>lexeme</STRONG> statements: +<PRE> + [ + disregard ws + lexeme {name} + ] +</PRE> + If you use the <STRONG>disregard</STRONG> statement you should not include a + provision for white space in the production for name. Just + leave it as it was in the previous example. +<P> +<BR> + + <H2>Character Strings</H2> + + Character strings are often required. The simplest way to + implement character strings is as follows: +<PRE> + character string + -> '"', ~(eof + '"')?..., '"' +</PRE> + This approach has the disadvantage that it makes no + provision for nonprinting characters. +<P> + There are numerous ways to provide for nonprinting + characters in your character strings. However, you can avoid + tedious documentation by using the same rules for + nonprinting characters that C uses. Unfortunately, the C + rules for octal and hexadecimal escape sequences complicate + the writing of the grammar quite substantially. For example, + if you wish to write a string that consists of ascii code 11 + followed by the ascii digit '2', you must pad with a leading + zero, writing "\0132", since "\132" according to the rules + is a single three digit octal escape sequence designating + ascii code 90. The problem is that the rules allow for one, + two or three digit octal escape sequences, but sequences + shorter than three digits have to be followed by the end of + the string or a character that is not an ascii digit. There + is a similar, if not worse, problem with hex escape + sequences. There is no limit on the length of a hex escape + sequence, so there is no possible way to make the character + '2' follow a hex escape sequence without using another + escape sequence. +<P> + A straightforward approach to writing a grammar for strings + consistent with the C conventions yields a number of + conflicts which correspond exactly to the problems discussed + above. While it is certainly possible to write a grammar for + strings that has no conflicts, it is easier to proceed in a + straightforward manner and use a <STRONG>sticky</STRONG> + declaration to + resolve the ambiguities. +<P> + Here is the complete grammar for a character string in + accordance with the C rules. In order to accumulate the + contents of the string, this grammar uses the functions + <STRONG>ics</STRONG> and <STRONG>acs</STRONG>, + to initialize storage for a character string and to + append a character to that string respectively. +<PRE> + character string + -> string chars, '"' + + string chars + -> '"' =ics(); + -> string chars, string char:c =acs(c); + + string char + -> simple string char + -> escape sequence + + simple string char = ~eof - ('"' + '\\' + '\n') + + (int) escape sequence + -> "\\a" ='\a'; + -> "\\b" ='\b'; + -> "\\f" ='\f'; + -> "\\n" ='\n'; + -> "\\r" ='\r'; + -> "\\t" ='\t'; + -> "\\v" ='\v'; + -> "\\\\" ='\\'; + -> "\\?" = '\?'; + -> "\\'" ='\''; + -> "\\\"" ='"'; + -> octal escape + -> hex escape + + (int) octal escape + -> one octal | two octal | three octal + + (int) one octal + -> '\\', '0-7':d =d-'0'; + + (int) two octal + -> one octal:n, '0-7':d =8*n + d-'0'; + + (int) three octal + -> two octal:n, '0-7':d =8*n + d-'0'; + + (int) hex escape + -> "\\x", hex number + + (int) hex number + -> hex digit + -> hex number:n, hex digit:d =16*n + d; + + [ + sticky one octal, two octal, hex number + ] + {/* embedded C to define ics and acs */ + char string_store[82]; + int length; + void ics(void) { + length = 0; + } + void acs(int c) { + assert(length < 82); + string_store[length++] = c; + } +</PRE> +<P> + If you have included a <STRONG>disregard</STRONG> + statement in your grammar + to skip over irrelevant white space in your input, you might + add the following to avoid skipping white space inside a + name: +<PRE> + [ lexeme {character string}] +</PRE> + +<BR> + + <H2>Character Constants</H2> + + It is almost trivial to extend the above syntax for + character strings to allow simple character constants: +<PRE> + (int) character constant + -> '\'', simple char:c, '\'' =c; + + (int) simple char + -> ~eof - ('\'' + '\\' + '\n') + -> escape sequence +</PRE> + The value of the character constant token is the character + code found in the input stream. +<P> + If you have included a <STRONG>disregard</STRONG> + statement in your grammar + to skip over irrelevant white space in your input, you might + add the following to avoid skipping white space inside a + character constant: +<PRE> + [ lexeme {character constant}] +</PRE> + +<BR> + + <H2>Simple Expressions</H2> + + It is often convenient to allow for simple expressions in + your input. The syntax for expression logic is written in a + conventional way, using a separate token for each precedence + level desired. The following grammar will accept simple + addition, subtraction, multiplication and division of + floating point numbers: +<PRE> + (double) expression + -> term + -> expression:x, '+', term:t =x+t; + -> expression:x, '-', term:t =x-t; + + (double) term + -> factor + -> term:t, '*', factor:f =t*f; + -> term:t, '/', factor:f =t/f; + + (double) factor + -> real + -> '(', expression:x, ')' =x; + -> '-', factor:f =-f; +</PRE> +<P> + An alternative way to write expression syntax is to write an + ambiguous grammar and use precedence declarations to resolve + the conflicts. This results in a slightly more compact and + faster parser: +<PRE> + (double) expression + -> expression:x, '+', expression:y =x+y; + -> expression:x, '-', expression:y =x-y; + -> expression:x, '*', expression:y =x*y; + -> expression:x, '/', expression:y =x/y; + -> unary minus, expression:x =-x; + -> '(', expression:x, ')' =x; + -> real + + unary minus = '-' + + [ + left '+', '-' + left '*', '/' + right unary minus + ] +</PRE> + Note that we deal with the different precedence levels of + unary and binary minus by defining unary minus, which + AnaGram treats as distinct from the simple '-', and giving + it different precedence. + +<P> + +<IMG ALIGN="bottom" SRC="../images/rbline6j.gif" ALT="----------------------" + WIDTH=1010 HEIGHT=2 > +<P> +<IMG ALIGN="right" SRC="../images/pslrb6d.gif" ALT="Parsifal Software" + WIDTH=181 HEIGHT=25> +<BR CLEAR="right"> +<P> +Back to <A HREF="../index.html">Index</A> +<P> +<ADDRESS><FONT SIZE="-1"> + AnaGram parser generator - examples<BR> + Syntactic Building Blocks<BR> + Copyright © 1993-1999, Parsifal Software. <BR> + All Rights Reserved.<BR> +</FONT></ADDRESS> + +</BODY> +</HTML>