AnaGram interim repo (temporary): examples/sbb-doc.txt comparison

comparison examples/sbb-doc.txt @ 0:13d2b8934445

Import AnaGram (near-)release tree into Mercurial.

author	David A. Holland
date	Sat, 22 Dec 2007 17:52:45 -0500
parents
children

comparison

equal deleted inserted replaced

--1:000000000000
+:13d2b8934445
+------------------------------------------------------
+Syntactic Building Blocks
+Copyright 1993-1998, Parsifal Software.
+All Rights Reserved.
+This software is provided 'as-is', without any express or
+implied warranty.  In no event will the authors be held
+liable for any damages arising from the use of this software.
+Permission is granted to anyone to use this software for any
+purpose, including commercial applications, and to alter it
+and redistribute it freely, subject to the restriction that
+this notice may not be removed or altered from any source
+distribution.
+------------------------------------------------------
+Introduction
+---------------
+This file contains examples of the productions necessary to
+parse the more common syntactic elements in your grammar. It
+should be of interest once you have begun to write your own
+grammars. Each example is provided with sample reduction
+procedures, which are adequate in the vast majority of cases.
+You can use these productions like building blocks to quickly
+assemble quite powerful grammars.  The text of this file is
+also to be found in Appendix D of the AnaGram User's Guide.
+End of File
+--------------
+If your parser is going to run under Unix or DOS and use
+stream I/O to read its input, you can define end of file as
+minus one:
+eof = -1
+If your parser is going to run under DOS and will use low
+level I/O instead of stream I/O you might define eof as
+Control Z:
+eof = ^Z
+If your parser is going to run under UNIX and will use low
+level I/O instead of stream I/O you might define eof as
+Control D:
+eof = ^D
+If your parser is simply going to parse a string in memory,
+the definition of end of file should be a null character:
+eof = 0
+It is often convenient, however, to simply define end of
+file so it will work in all of these contexts:
+eof = -1 + 0 + ^D + ^Z
+White Space
+----------------
+It is convenient to have a representation for white space in
+your grammar. Usually you do not wish to distinguish between
+space characters and tab characters, so you can write:
+blank = ' ' + '\t'
+Using this definition, you can represent required white
+space of indeterminate length with blank... and optional
+white space with blank?...
+It is common, however, to include comments (see below) as
+white space. In this case, you can define the following
+productions:
+ws
+-> blank
+-> comment
+End of Line
+--------------
+Because different systems use different representations for
+end of line, it is wise to use an abstract end of line token
+in your grammar, and define the token separately. If your
+parser is going to use files that are known to use carriage
+return alone or line feed alone as the end of line delimiter
+you can use one of the following definitions:
+eol = '\r'                          //carriage return only
+eol = '\n'                          //line feed only
+If your input files use the newline character as a line
+terminator, but you wish to allow for optional carriage
+returns, you might write:
+eol
+-> '\r'?, '\n'
+or even
+eol
+-> '\r'?..., '\n'
+If you wish to make a grammar that can deal with a file of
+more or less arbitrary provenance, some caution is in order.
+If you allow carriage return and line feed both to function
+as end of line characters and you allow blank lines, you may
+wind up with a very ambiguous grammar. For example, suppose
+you use eol...  somewhere in your grammar and you have
+defined eol thus:
+eol
+-> '\r'
+-> '\n'
+-> '\r', '\n'                        // trouble!
+Your grammar is ambiguous since a carriage return, line feed
+pair can be considered two line endings, according to the
+first two production, or just one, according to the third.
+If you really need to allow isolated carriage returns and
+linefeeds to mark ends of line, but need to deal with the
+pair as well, you can make good use of the idiosyncracies of
+AnaGram's keyword logic by writing the following:
+eol
+-> '\r'
+-> '\n'
+-> "\r\n"
+This will treat a carriage return followed by a line feed as
+a single end of line.
+Comments
+-------------
+There are two different categories of comments in common
+use: Those that run to the end of line, and those that run
+to a specific terminator.  The first can be conveniently
+incorporated into your end of line token as a virtual
+production:
+eol
+-> ["//", ~(eof + '\n')?...], '\n'
+Thus, eol allows for an optional comment at the end of every
+line. Note that you never want to allow end of file
+characters in your comments.
+Conventional C comments are slightly more complicated,
+depending on your treatment of nesting. If you are not
+interested in nesting comments, you can simply use the
+following simple syntax:
+comment
+-> "/*", ~eof?..., "*/"
+Note that this production works because keywords take
+precedence over ordinary character input.
+A somewhat more complex way to write this is useful:
+comment
+-> comment head, "*/"
+-> comment head, comment
+comment head
+-> "/*"
+-> comment head, ~eof
+Note that where the previous grammar simply ignored the
+beginning of any nested comment, this grammar recognizes a
+nested comment, but then deliberately chooses to ignore
+nesting.
+If you want your comments to nest, you can easily rewrite
+this grammar to allow nesting:
+comment
+-> comment head, "*/"
+comment head
+-> "/*"
+-> comment head, ~eof
+-> comment head, comment
+Note that the only difference between nested and non-nested
+comments is whether the grammar rule comment head, comment
+reduces to comment head or to comment.
+If you wish to defer the question of nesting to run-time,
+you can use a semantically determined production to make the
+decision:
+comment
+-> comment head, "*/"
+comment head
+-> "/*"
+-> comment head, ~eof
+comment, comment head
+-> comment head, comment              =check_nesting();
+Although the final production in this grammar has a somewhat
+inscrutable air about it, a little study will show that if
+check_nesting sets the reduction token to comment, the
+effective grammar is the same as the grammar above for
+non-nested comments. On the other hand, if check_nesting
+sets the reduction token to comment head, the effective
+grammar is the same as the grammar for nested comments.
+Assuming you have a switch called nest_comments, you could
+write check_nesting as follows:
+check_nesting() {
+if (nest_comments)
+CHANGE_REDUCTION(comment_head);
+else
+CHANGE_REDUCTION(comment);
+}
+The else clause in this procedure is technically unnecessary
+since comment is the default value of the reduction token.
+If you wish to skip white space in your input, and wish to
+consider comments as simple white space, you might want to
+add the following production to your grammar:
+ws
+-> blank
+-> comment
+You can then use the disregard statement to ignore ws
+appropriately.
+Integers
+----------
+It is quite easy to convert ordinary integers to binary
+form. For instance:
+(int) decimal integer
+-> '0-9':d                                      =d-'0';
+-> decimal integer:n, '0-9':d          =10*n + d - '0';
+If necessary, you can specify that the value of decimal
+integer be maintained as a long:
+(long) decimal integer
+-> '0-9':d                                      =d-'0';
+-> decimal integer:n, '0-9':d          =10*n + d - '0';
+Other representations are equally simple:
+(long) octal integer
+-> '0'                                              =0;
+-> octal integer:n, '0-7':d             =8*n + d - '0';
+(long) hex digit
+-> '0-9':d                                      =d-'0';
+-> 'A-F' + 'a-f':d                        =9 + (d & 7);
+(long) hex integer
+-> {"0x" + "0X"}, hex digit:d                       =d;
+-> hex integer:n, hex digit:d                =16*n + d;
+Note that if you use the C convention for octal integers,
+you have to redefine decimal integer to avoid ambiguity:
+(long) decimal integer
+-> '1-9':d                                      =d-'0';
+-> decimal integer:n, '0-9':d          =10*n + d - '0';
+You can then represent the general case as follows:
+(long) integer
+-> decimal integer | octal integer | hex integer
+You can allow for a signed integer with the following
+productions:
+(long) signed integer
+-> '+'?, integer:n                                  =n;
+-> '-', integer:n                                  =-n;
+If you have included a disregard statement in your grammar
+to skip over irrelevant white space in your input, you might
+add the following to avoid skipping white space inside a
+number:
+[ lexeme {integer}]
+Note that if you were to declare signed integer as a lexeme,
+your parser would not allow space between a leading sign and
+the integer.
+Floating Point Numbers
+----------------------------
+Floating point numbers are somewhat more complex than
+integers, but not significantly so:
+(double) real
+-> simple real
+-> integer part:r, exponent field:x        =r*pow10(x);
+-> simple real:r, exponent field:x         =r*pow10(x);
+(double) simple real
+-> integer part:i, '.', fraction part:f           =i+f;
+-> integer part, '.'
+-> '.', fraction part:f                             =f;
+(double) integer part
+-> '0-9':d                                      =d-'0';
+-> integer part:n, '0-9':d               =10*n + d-'0';
+(double) fraction part
+-> '0-9':d                                =(d-'0')/10.;
+-> '0-9':d, fraction part:f             =(d-'0'+f)/10.;
+exponent field
+-> 'e' + 'E', signed exponent:x                     =x;
+signed exponent
+-> '+'?, exponent:n                                 =n;
+-> '-', exponent:n                                 =-n;
+exponent
+-> '0-9':d                                      =d-'0';
+-> exponent:n, '0-9':d                 =10*n + d - '0';
+Note that fraction part uses right recursion rather than
+left recursion.  exponent is effectively the same as decimal
+integer, above, but allows for initial zeroes.
+The situation becomes somewhat more complex if you wish to
+allow both integer and floating point forms, particularly if
+you wish to follow the C convention for octal integers.
+First, you cannot have distinct productions for integer part
+and decimal integer, since there is no way to distinguish
+them until a decimal point or an exponent field is
+encountered. Second, since 0129.3 looks for all the world
+like an octal number until the '9' is encountered, one
+either has to postpone all conversion calculations until the
+issue is resolved or resort to trickery.  Here is a way to
+resolve the problem by redefining integer part:
+(double) integer part
+-> confusion
+-> octal integer:n                    =octal2decimal(n);
+-> decimal integer:n                                 =n;
+(double) confusion
+-> octal integer:n, '8-9':d  =10*octal2decimal(n)+d-'0';
+-> confusion:x, '0-9':d                   =10*x + d-'0';
+where octal2decimal is defined thus:
+double octal2decimal(int n) {
+if (n) return 10*octal2decimal(n/8) + n%8;
+else return 0;
+}
+Here confusion represents a number that starts off looking
+like an octal integer, but then turns into a decimal number,
+because an eight, a nine, a decimal pointer or an exponent
+field is encountered. When this occurs, the function octal2-
+decimal undoes the octal conversion that had already been
+done, and redoes the conversion as decimal conversion.
+If you have included a disregard statement in your grammar
+to skip over irrelevant white space in your input, you might
+add the following to avoid skipping white space inside a
+real:
+[ lexeme {real}]
+Names
+---------
+In almost all grammars, it is necessary to identify names.
+To accumulate the characters that make up the name it is
+convenient to use a stack. The reduction procedures in this
+example use the functions ipn and pcn to accumulate the
+characters. ipn initializes storage for the name and stores
+the first character. pcn adds a single character to the
+name. This grammar presumes the existence of a function
+called identify_name which would look up the name in a
+symbol table and return an identifying index.
+letter = 'a-z' + 'A-Z' + '_'
+digit  = '0-9'
+(int) name
+-> name string                        =identify_name();
+name string
+-> letter:c                                    =ins(c);
+-> name string, letter+digit:c                 =pcn(c);
+{/* embedded C to accumulate name */
+char name[82];
+int name_length;
+void ipn(int c) {       /* Init and Put char to Name */
+name[0] = c;
+name_length = 1;
+}
+void pcn(int c) {                /* Put Char to Name */
+assert(name_length < 82);
+name[name_length++] = c;
+}
+} // End of embedded C
+If you have included a disregard statement in your grammar
+to skip over irrelevant white space in your input, you might
+add the following to avoid skipping white space inside a
+name:
+[ lexeme {name}]
+Names with Embedded White Space
+---------------------------------------------
+It is sometimes convenient to allow symbol names to contain
+embedded white space. This is easily done, although it
+requires a bit of a trick:
+name
+-> name string, ws?...                =identify_name();
+name string
+-> letter:c                                    =ipn(c);
+-> name string, letter+digit:c                 =pcn(c);
+-> name string, ws..., letter+digit:c=pcn(' '), pcn(c);
+Note that the last production reduces all contiguous white
+space within a name to a single blank.
+Allowing optional blanks following name string is important.
+If you don't allow them there, any ws following a name will
+be presumed to be embedded ws, requiring another letter or
+digit, which is not what you intend. There are two ways to
+accomplish this. The first, shown above, explicitly allows
+for optional white space following name string. The second
+method is to use the disregard and lexeme statements:
+[
+disregard ws
+lexeme {name}
+]
+If you use the disregard statement you should not include a
+provision for white space in the production for name. Just
+leave it as it was in the previous example.
+Character Strings
+----------------------
+Character strings are often required. The simplest way to
+implement character strings is as follows:
+character string
+-> '"', ~(eof + '"')?..., '"'
+This approach has the disadvantage that it makes no
+provision for nonprinting characters.
+There are numerous ways to provide for nonprinting
+characters in your character strings. However, you can avoid
+tedious documentation by using the same rules for
+nonprinting characters that C uses. Unfortunately, the C
+rules for octal and hexadecimal escape sequences complicate
+the writing of the grammar quite substantially. For example,
+if you wish to write a string that consists of ascii code 11
+followed by the ascii digit '2', you must pad with a leading
+zero, writing "\0132", since "\132" according to the rules
+is a single three digit octal escape sequence designating
+ascii code 90. The problem is that the rules allow for one,
+two or three digit octal escape sequences, but sequences
+shorter than three digits have to be followed by the end of
+the string or a character that is not an ascii digit. There
+is a similar, if not worse, problem with hex escape
+sequences. There is no limit on the length of a hex escape
+sequence, so there is no possible way to make the character
+'2' follow a hex escape sequence without using another
+escape sequence.
+A straightforward approach to writing a grammar for strings
+consistent with the C conventions yields a number of
+conflicts which correspond exactly to the problems discussed
+above. While it is certainly possible to write a grammar for
+strings that has no conflicts, it is easier to proceed in a
+straightforward manner and use a sticky declaration to
+resolve the ambiguities.
+Here is the complete grammar for a character string in
+accordance with the C rules. In order to accumulate the
+contents of the string, this grammar uses the functions ics
+and acs, to initialize storage for a character string and to
+append a character to that string respectively.
+character string
+-> string chars, '"'
+string chars
+-> '"'                                          =ics();
+-> string chars, string char:c                 =acs(c);
+string char
+-> simple string char
+-> escape sequence
+simple string char = ~eof - ('"' + '\\' + '\n')
+(int) escape sequence
+-> "\\a"   ='\a';
+-> "\\b"   ='\b';
+-> "\\f"   ='\f';
+-> "\\n"   ='\n';
+-> "\\r"   ='\r';
+-> "\\t"   ='\t';
+-> "\\v"   ='\v';
+-> "\\\\"  ='\\';
+-> "\\?"   = '\?';
+-> "\\'"   ='\'';
+-> "\\\""  ='"';
+-> octal escape
+-> hex escape
+(int) octal escape
+-> one octal | two octal | three octal
+(int) one octal
+-> '\\', '0-7':d                                 =d-'0';
+(int) two octal
+-> one octal:n, '0-7':d                    =8*n + d-'0';
+(int) three octal
+-> two octal:n, '0-7':d                    =8*n + d-'0';
+(int) hex escape
+-> "\\x", hex number
+(int) hex number
+-> hex digit
+-> hex number:n, hex digit:d                  =16*n + d;
+[
+sticky one octal, two octal, hex number
+]
+{/* embedded C to define ics and acs */
+char string_store[82];
+int length;
+void ics(void) {
+length = 0;
+}
+void acs(int c) {
+assert(length < 82);
+string_store[length++] = c;
+}
+If you have included a disregard statement in your grammar
+to skip over irrelevant white space in your input, you might
+add the following to avoid skipping white space inside a
+name:
+[ lexeme {character string}]
+Character Constants
+-------------------------
+It is almost trivial to extend the above syntax for
+character strings to allow simple character constants:
+(int) character constant
+-> '\'', simple char:c, '\''                         =c;
+(int) simple char
+-> ~eof - ('\'' + '\\' + '\n')
+-> escape sequence
+The value of the character constant token is the character
+code found in the input stream.
+If you have included a disregard statement in your grammar
+to skip over irrelevant white space in your input, you might
+add the following to avoid skipping white space inside a
+character constant:
+[ lexeme {character constant}]
+Simple Expressions
+------------------------
+It is often convenient to allow for simple expressions in
+your input. The syntax for expression logic is written in a
+conventional way, using a separate token for each precedence
+level desired. The following grammar will accept simple
+addition, subtraction, multiplication and division of
+floating point numbers:
+(double) expression
+-> term
+-> expression:x, '+', term:t                       =x+t;
+-> expression:x, '-', term:t                       =x-t;
+(double) term
+-> factor
+-> term:t, '*', factor:f                           =t*f;
+-> term:t, '/', factor:f                           =t/f;
+(double) factor
+-> real
+-> '(', expression:x, ')'                            =x;
+-> '-', factor:f                                    =-f;
+An alternative way to write expression syntax is to write an
+ambiguous grammar and use precedence declarations to resolve
+the conflicts. This results in a slightly more compact and
+faster parser:
+(double) expression
+-> expression:x, '+', expression:y                 =x+y;
+-> expression:x, '-', expression:y                 =x-y;
+-> expression:x, '*', expression:y                 =x*y;
+-> expression:x, '/', expression:y                 =x/y;
+-> unary minus, expression:x                        =-x;
+-> '(', expression:x, ')'                            =x;
+-> real
+unary minus = '-'
+[
+left '+', '-'
+left '*', '/'
+right unary minus
+]
+Note that we deal with the different precedence levels of
+unary and binary minus by defining unary minus, which
+AnaGram treats as distinct from the simple '-', and giving
+it different precedence.

Mercurial > ~dholland > hg > ag > index.cgi

comparison examples/sbb-doc.txt @ 0:13d2b8934445