Mercurial > ~dholland > hg > ag > index.cgi
diff doc/misc/html/isdp.html @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/misc/html/isdp.html Sat Dec 22 17:52:45 2007 -0500 @@ -0,0 +1,479 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> +<HTML> +<HEAD> +<TITLE>Introduction to Syntax Directed Parsing</TITLE> +</HEAD> + +<BODY BGCOLOR="#ffffff" BACKGROUND="tilbl6h.gif" + TEXT="#000000" LINK="#0033CC" + VLINK="#CC0033" ALINK="#CC0099"> + +<P> + +<IMG ALIGN="right" SRC="images/agrsl6c.gif" ALT="AnaGram" + WIDTH=124 HEIGHT=30> +<BR CLEAR="all"> +Back to <A HREF="index.html">Index</A> +</P> +<IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" + WIDTH=1010 HEIGHT=2 > + +<BR CLEAR="all"> + + +<H1 ALIGN="LEFT">Introduction to Syntax Directed Parsing</H1> +<IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" + WIDTH=1010 HEIGHT=2 > +</P> + +<P> +(Adapted from Chapter 4, AnaGram User's Guide)</P> +</P> +<UL> + <LI SRC="#What"><A HREF="#What">What is Syntax Directed Parsing?</A></LI> + <LI><A HREF="#Describing">Describing an Input Sequence</A></LI> + <LI><A HREF="#How">How a Parser Works</A></LI> + <LI><A HREF="#Note">A Note on Notation</A></LI> + <LI><A HREF="#Reduction">Reduction Procedures</A></LI> + <LI><A HREF="#Building">Building a Parser</A> + <UL> + <LI SRC="#Invoking"><A HREF="#Invoking">Invoking a Parser</A></LI> + <LI><A HREF="#Communicating">Communicating with a Parser</A></LI> + <LI><A HREF="#Parser">Parser Input</A></LI> + <LI><A HREF="#Error">Error Handling</A></LI> + </UL> + </LI> +</UL> +<BR> + +<H2><A NAME="What">What is Syntax Directed Parsing?</A></H2> +<P> +Every programmer has to deal with input data. Usually processing input data +depends on what has preceded, and often even on what follows, the input under +consideration. Keeping track of these dependencies in order to know how to +process the data is called parsing. It is often relatively easy to keep track of +simple dependencies when first writing a program. As the program develops, as +new features are added, as bugs are fixed, the dependencies often stop being +simple. Input processing becomes a headache, since it is hard to keep track of +or even identify all the particular cases. Changes to the program cause +unexpected problems and program maintenance threatens to get out of control.</P> +<P> +Syntax directed parsing is a technique for addressing these difficulties. In +syntax directed parsing, the input part of a program is constructed +automatically, by means of a standard algorithm, from a high level description +of the input data structure. Code to perform any required processing of the data +is attached to the description in a convenient way.</P> +<P> +The description, being non-procedural, is generally easier to write and to +modify than the equivalent program code, and much less likely to harbor bugs. It +is easier to read, and easier to maintain. It is easy to re-use in other +programs requiring the same input, thus encouraging uniform interfaces. The +technique also simplifies the overall program by decoupling the input and +processing components and providing a natural, modular structure.</P> +<P> +To use syntax directed parsing you first write the input data description, +called a <A HREF="gloss.html#Grammar">grammar</A>. A file which contains a grammar is called a syntax +file. A parser generator, such as AnaGram, then can create, from the syntax +file, a function (or program) called a <A HREF="gloss.html#Parser">parser</A>, written in C or C++. The parser +keeps track of all the dependencies in your input, and calls certain functions, +<A HREF="gloss.html#ReductionProcedure">reduction procedures</A>, to deal with specific units or sequences of data as they +are encountered.</P> +<P> +Reduction procedures are the code you write to process your data. In your +grammar they are linked to the structures in your input, so that your parser +will call them at precisely the right times with precisely the right data. Note +that with this technique you need only provide a non-procedural <I>description</I> +of the input. The details of flow of control are handled entirely by the parser. +When you write reduction procedures, you can concentrate entirely on what you +have to do with the data. You don't have to encumber your code with switches and +tests to determine the structure of your input.</P> +<P> +The parsers you build using a parser generator such as AnaGram may be complete +stand-alone programs, or they may serve as input routines for a more extensive +program. Some programs may even use more than one parser.</P> +<BR> + +<H2><A NAME="Describing">Describing an Input Sequence</A></H2> +<P> +Writing a grammar consists of describing the acceptable input sequences for your +program. The vehicle for describing an input sequence is called a <A HREF="gloss.html#Production">production</A>. +Productions show how a logical component of the input can be made up of a +sequence of more elementary components. A production that describes a date might +be written:</P> +<PRE> + date + -> name of month, day, comma, year </PRE> +<P> + The components of the input are called <A HREF="gloss.html#Token">tokens</A>. The sequence of tokens on the +side of the production is called a <A HREF="gloss.html#GrammarRule">grammar rule</A>, or rule for short. The +individual tokens on the right side of the rule are also called <A HREF="gloss.html#RuleElement">rule elements</A>. +</P> +<P> +The token on the left side of the production is called the <A HREF="gloss.html#ReductionToken">reduction token</A> for +the rule. Tokens may have semantic values, as distinguished from syntactic +values, which you can use in your <A HREF="gloss.html#ReductionProcedure">reduction procedures</A>. For instance, the value +of name of month could be an integer in the range zero to eleven, or it could be +a pointer to an ascii string. The value of day could be an integer in the range +one to thirty-one.</P> +<P> +A <A HREF="gloss.html#Grammar">grammar</A> consists of a number of such <A HREF="gloss.html#Production">productions</A>, each of which describes some +component of the input in terms of other components. It does not take very many +productions to describe quite complex input streams. A grammar for the C +language, for instance, requires about two hundred productions.</P> +<P> +Many people find the term "production" quite confusing. It comes from theoretical +linguistics where it is used to describe how one may produce sequences which +correspond to a set of grammatical rules. Ironically, the major usage of the +idea has been in parsing where the interest is not so much in creating sequences +which satisfy the grammatical rules as in decoding and analyzing such sequences. +Nonetheless, it is convenient, in the above example, to say that the token date +produces a sequence of tokens consisting of name of month, day, comma, and year. +We also say that the sequence reduces to date.</P> +<P> +There may be more than one production to describe a given component, if there is +more than one way it may be represented. For instance,</P> +<PRE> + date + -> day, name of month, year </PRE> +<P> + describes another common way of writing a date. In other words, a <A HREF="gloss.html#ReductionToken">reduction +token</A> may produce a number of different <A HREF="gloss.html#GrammarRule">grammar rules</A>.</P> +<P> +Tokens which appear on the left side of one or more productions are called +<A HREF="gloss.html#Nonterminal">nonterminal tokens</A>. Those which appear <I>only</I> on the right sides of +productions are called <A HREF="gloss.html#Terminal">terminal tokens</A>. Terminal tokens are the units which +actually appear physically in the input. Nonterminal tokens are identified when +a sequence of tokens that matches the right side of a production is seen in the +input. When AnaGram analyzes a <A HREF="gloss.html#Grammar">grammar</A>, it assigns a unique token number to each +token it finds in the grammar.</P> +<P> +Nonterminal tokens, such as date in the example above, may appear in any grammar +rule just as though they were input tokens. The token on the left side of a +production can even appear on the right side as well. Such a production is +called a recursive production. When a nonterminal token appears on the right +side of a production, it may be represented in this context by <I>any</I> of +the <A HREF="gloss.html#GrammarRule">grammar rules</A> it produces. Grammars described in this manner are called +<A HREF="gloss.html#ContextFreeGrammar">context free grammars</A> since there is no contextual constraint on which of the +rules that a token produces can appear in any given context.</P> +<P> +Recursive productions may be either left recursive or right recursive. Left +recursive productions are those where the recursively defined <A HREF="gloss.html#Nonterminal">nonterminal</A> +appears as the first <A HREF="gloss.html#RuleElement">element</A> in the recursive rule. Right recursive productions +are those where it is the last element. If it appears anywhere between, the +production is said to be center recursive.</P> +<P> +Any nonterminal token which has a recursive production must also have at least +one simple, non-recursive production. Otherwise, it is not possible to create a +finite sequence of terminal tokens from the nonterminal token.</P> +<P> +Recursion may also occur implicitly in a <A HREF="gloss.html#Grammar">grammar</A> when one of the tokens on the +right side of a production itself has a production involving the token on the +left. Such implicit recursion sometimes may involve numerous levels of +productions. Implicit recursion occurs most commonly in describing constructs +such as arithmetic expressions or the block structure of programming languages.</P> +<P> +Clearly, grammars can accommodate multiple levels of structure in the input +sequences they describe. There must, at the top, be a single token which +encompasses the entire input. This special token is variously called the <A HREF="gloss.html#GrammarToken">grammar +token</A>, the goal token or the start token.</P> +<P> +AnaGram allows you to specify <A HREF="gloss.html#Terminal">terminal tokens</A> explicitly as ascii characters, or +even as <A HREF="gloss.html#CharacterSets">sets of ascii characters</A>, right in the grammar. Thus, you may write +'0-9' to represent the set of ascii digits, or 'A-Z' to represent the set of +upper case letters. The semantic value of such a <A HREF="gloss.html#Token">token</A> is the ascii character +code that actually appears in the input stream. If the various sets you use in +your grammar overlap, they may not properly represent terminal tokens. In this +case, AnaGram automatically extends your <A HREF="gloss.html#Grammar">grammar</A> appropriately. </P> +<BR> + +<H2><A NAME="How">How a Parser Works</A></H2> +<P> +The aim of a <A HREF="gloss.html#Parser">parser</A> is to match its input with the full syntactic structure +specified by the <A HREF="gloss.html#Production">productions</A> which make up the grammar. The primary component of +a parser is an input buffer, sometimes thought of as a stack, into which tokens +are shifted, or stored sequentially, as they are encountered in the input. At +the same time that a token is shifted into the input buffer, its semantic value +is pushed onto the value stack. A token is not shifted into the buffer unless it +"makes sense", that is, unless it is consistent with the rules of the +<A HREF="gloss.html#Grammar">grammar</A> and with the input that has preceded it. If a token does not make sense, +the parser signals a syntax error. In order to determine whether a token makes +sense, the parser has a sort of decision table which provides a list of +acceptable tokens for each of a number of states. The table also specifies what +the parser is to do with each acceptable token. When the table indicates that a +token is to be shifted into the buffer, it also specifies a new state. The +parser stacks the current state on a state stack and jumps to the new state. +Thus every time a token is shifted into the input buffer, a state number is +pushed onto the state stack. For each state of the parser, excepting only the +initial state, there is a unique token which will cause a jump to that state. +This token is called the <A HREF="gloss.html#CharacteristicToken">characteristic token</A> for the state.</P> +<P> +When the rightmost, or most recent, tokens in the input buffer match the right +side of a production precisely, the parser <I>may</I> replace the tokens that +match the rule with a single token, the token on the left side of the +production. This process of replacing a sequence of tokens with a single token +is called reduction. The token that replaces the sequence of tokens is called +the <A HREF="gloss.html#ReductionToken">reduction token</A>.</P> +<P> +The actual mechanism of the reduction is quite important. At the same time that +input tokens are removed from the input buffer, state numbers are popped from +the state stack, so that when all input tokens matching the rule have been +removed, the parser state has been restored to the value it had at the time the +first token in the rule was seen. As state numbers are popped from the state +stack, token values are popped from the value stack. If the rule has a <A HREF="gloss.html#ReductionProcedure">reduction +procedure</A>, temporary variables are loaded with the values popped from the stack +and the reduction procedure is called. The reduction token is now shifted into +the input buffer just as though it were an input token. If the reduction +procedure returned a result it is shifted into the value stack as the value of +the reduction token. The parser stacks the current state again and jumps to a +new state as determined by the parser tables.</P> +<P> +If there are no errors in your input, when the last token has been read from the +input, shifted into the input buffer, and reductions performed, there will be +precisely one token in the input buffer: the <A HREF="gloss.html#GrammarToken">grammar, or goal, token</A> which +describes your entire input. At this point your parser declares itself finished.</P> +<P> +Reductions do not necessarily occur every time a rule matches the tokens in the +input buffer. If the reduction token does not make sense, that is, if it is not +consistent with the rules of the <A HREF="gloss.html#GrammarToken">grammar</A> and with the input that has preceded +it, the parser will not perform the reduction. Suppose there are tokens in the +input which match one of the rules given above for date. A reduction will not +occur unless date is one of the tokens the parser is actually looking for at +that stage in the input.</P> +<P> +Even when the reduction token would make sense, there are still situations where +the reduction would not take place. Suppose a grammar includes the two +<A HREF="gloss.html#Production">productions</A> given above for date as well as the following:</P> +<PRE> + + date + -> name of month, day </PRE> +<P> +This production is the same as the first, but with no year specification. When a +parser directed by this grammar has encountered name of month and day, it can't +tell without looking further whether it has a short form or a long form date. In +such a circumstance, the parser looks at the next following token, which is +called a <A HREF="gloss.html#Lookahead">lookahead token</A>. If the lookahead token is a comma, then, in the +absence of other productions, the input is a long form date. If the lookahead +token is not a comma, then the input is certainly not a long form date, and it +is proper to reduce the short form production.</P> +<P> +Suppose the lookahead token were a comma and the grammar were also to contain +the following production:</P> +<PRE> + + appointment + -> date, comma, time </PRE> +<P> + Since a comma can follow date, according to this rule, and can also follow day +according to the first production, it is impossible to determine, simply by +looking at the <A HREF="gloss.html#Lookahead">lookahead token</A>, whether the date was being given in short form +or long form. One would have to look beyond the comma to see if what follows the +comma matches the rules for time or for year. Although it is possible to build +parsers which can do this, it is not generally feasible. Situations of this sort +are called <A HREF="gloss.html#Conflict">conflicts</A>. AnaGram warns you about the conflicts in your <A HREF="gloss.html#Grammar">grammar</A> when +it analyzes your grammar, and provides numerous facilities to help you correct +them.</P> +<BR> + +<H2><A NAME="Note">A Note on Notation</A></H2> +<P> +<A HREF="gloss.html#ContextFreeGrammar">Context free grammars</A> have been traditionally represented in the literature +using <A HREF="gloss.html#BNF">Backus-Naur Form</A>, or BNF. In Backus-Naur Form, certain characters, called +metacharacters, are used to punctuate <A HREF="gloss.html#Production">productions</A> and all other printable +characters are taken to represent themselves literally. Named tokens are denoted +by enclosing the name within angle brackets <CODE>< ></CODE> . The left side of a +production is distinguished from the right side by the +characters <CODE> ::= </CODE>. +If several productions have the same left side, the vertical +bar <CODE> | </CODE> is used to separate them. The <A HREF="gloss.html#RuleElement">elements</A> of a <A HREF="gloss +.html#GrammarRule">grammar rule</A> are simply juxtaposed +to indicate that one follows another. Blanks are ignored. Thus, in BNF, the +first production given for date, above, would be:</P> +<PRE> + <date> ::= <name of month> <day>, <year> +</PRE> +<P> +AnaGram uses a notation more consonant with ordinary programming usage. Thus +token names need not be bracketed and literal characters must be appropriately +quoted. The elements of rules are joined by commas. Using this approach, there +is no need for metacharacters and it becomes possible to make a number of useful +extensions to the notation.</P> +<BR> + +<H2><A NAME="Reduction">Reduction Procedures</A></H2> +<P> +Of course, the reason for parsing an input stream is to interpret the data in +the stream and to process it in some useful manner. The primary tool for doing +this is the <A HREF="gloss.html#ReductionProcedure">reduction procedure</A>. A reduction procedure is a piece of C code that +is executed when a particular <A HREF="gloss.html#GrammarRule">grammar rule</A> is reduced. Often, a reduction +procedure calculates a value which becomes the semantic value of the reduction +token. The input to the reduction procedure consists of the values of the tokens +that make up the grammar rule.</P> +<P> +AnaGram allows you to assign C variable names to the tokens in a grammar rule, +so that you can refer to them in the reduction procedure. To assign a C variable +name, simply follow the token in the rule with a colon and the C variable name. +Simple reduction procedures can be written as C or C++ expressions. At the end +of the rule, write an equal sign and an expression, terminated by a semicolon. +For example:</P> +<PRE> + (int) hex digit + -> '0-9':d =d-'0'; + -> 'a-f':d =d-'a'+10; + -> 'A-F':d =d-'A'+10; </PRE> +<P> +When any one of these rules is matched, the value of the token in the rule is +assigned to the temporary variable. The expression to the right of the equal +sign is evaluated and the value of the expression is stored as the value of hex +digit, which has been declared to be an int. These <A HREF="gloss.html#Production">productions</A> define +hexadecimal digits as ascii characters, and calculate the binary value of the +digit in terms of the ascii character code, d. The binary value becomes the +value of hex digit. Hexadecimal digits can be combined to make hexadecimal +numbers by writing the following productions:</P> +<PRE> + (int) hex number + -> hex digit + -> hex number:n, hex digit:d =16*n+d; </PRE> +<P> +There are several important points to notice in this example. First, <A HREF="gloss.html#ReductionProcedure">reduction +procedures</A> are executed "from the bottom up". That is, the reduction +procedure for hex digit is executed before any reduction procedure for hex +number. Second, if there is no reduction procedure for a production, the value +of the first token in the rule is assigned to the reduction token. Thus, it is +not necessary to provide a reduction procedure in the first production for hex +number. Third, the reduction procedures for recursive productions are always +executed <I>after</I> the reduction procedure, if any, for the non-recursive +production which begins the recursion. Finally, when an input sequence is +described using left recursion, as in this example, the <A HREF="gloss.html#RuleElement">elements</A> of the sequence +are processed left to right.</P> +<P> +If you wish to process the elements of a sequence right to left, you may use +right recursion. For example, it is sometimes convenient to define the fraction +part of a decimal number thus:</P> +<PRE> + (double) fraction part + -> '0-9':d =(d-'0')/10.; + -> '0-9':d,fraction part:f =(d-'0'+f)/10.;</PRE> +<P> +In this case the leading digits are stored temporarily in the parse stack, and +then the fraction part is evaluated right to left only when the last digit has +been found.</P> +<P> +Reduction procedures can be more complex than simple expressions. After the +equal sign you may include an arbitrary block of C or C++ code, enclosed in +braces, { }. To return a semantic value for the reduction token simply use a +return statement. Of course, reduction procedures have the full resources of C +or C++ at their disposal. They may set and interrogate global variables and may +call functions freely.</P> +<P> +Since the reduction procedures you write will probably need some support code, +such as #include statements and declarations, you may incorporate C or C++ code +into your syntax file at any point. You need only enclose it in braces ({}). +Such code is called embedded C. All embedded C code is also copied to the parser +file, and <I>precedes</I> all of your reduction procedures.</P> +<BR> + +<H2><A NAME="Building">Building a Parser</A></H2> +<P> +In order to round out a <A HREF="gloss.html#Parser">parser</A> into a functioning program it needs input +procedures, as well as error diagnosis and recovery capabilities. AnaGram has a +number of options available which give you a high degree of flexibility in +configuring a parser to suit your particular needs. All of the options are +provided with reasonable defaults, so that you can safely disregard any option +until you need the features it provides.</P> +<BR> + +<H4><A NAME="Invoking">Invoking a Parser</A></H4> +<P> + Normally, AnaGram configures <A HREF="gloss.html#Parser">parsers</A> as functions which you can call from +elsewhere in your program. In this situation, you call the parser, it processes +its input, and returns either when it has finished or cannot proceed because of +errors. Alternatively, if you set the event driven configuration switch, your +parser will be configured so that you have two procedures to call: an +initializer and a parser. In the event driven configuration you start the parse +by calling the initializer and then you call the parser once for each unit of +input. Using the event driven mode makes it quite easy to configure a parser as +a filter and to chain several parsers together so that the output from one +parser is the input to the next. Such multi-stage parsing is a convenient way to +deal with complex input that is not context free.</P> +<BR> + +<H4><A NAME="Communicating">Communicating with a Parser</A></H4> +<P>The complete status of your parser is contained in a single data structure +called a parser control block. All communications with a parser take place via +the parser control block. Input procedures must place input data in the +appropriate field in the parser control block. When the parse is complete or +encounters an error, the results of the parse may be found in the parser control +block. When AnaGram builds a parser it includes, in the header file it +generates, a typedef statement which defines the structure of the parser control +block.</P> +<BR> + +<H4><A NAME="Parser">Parser Input</A></H4> +<P>The input to your <A HREF="gloss.html#Parser">parser</A> may be either characters read directly from an +input stream, or tokens accumulated by a pre-processor or <A HREF="gloss.html#LexicalScanner">lexical scanner</A>. The +way you provide input to your parser depends on how your <A HREF="gloss.html#Grammar">grammar</A> defines input +tokens and also on whether or not you have requested an event driven parser.</P> +<P> +If your parser is event driven, you provide its input by storing the input code +and the input value, if any, into the parser control block and calling the +parser.</P> +<P> +If you have set the pointer input configuration switch in your syntax file, you +simply initialize the pointer field in your parser control block before you call +your parser. Your parser will then read its input directly from memory by simply +incrementing the pointer as necessary.</P> +<P> +Otherwise, your parser will invoke a macro called GET_INPUT every time it needs +more input. You may define GET_INPUT according to your needs. You can define it +so that it calls an input function, or you can define it so that it executes +in-line code each time it is invoked. Your GET_INPUT macro should store its +input code in the input_code field of the parser control block. If you do not +write a GET_ INPUT macro, AnaGram will provide one which will read characters +from stdin.</P> +<P> +If your <A HREF="gloss.html#Grammar">grammar</A> does not define <A HREF="gloss.html#Terminal">terminal tokens</A> in terms of ascii characters or +external token numbers, your GET_INPUT will have to determine the appropriate +internal token number for each input token. To assist you in determining these +token numbers AnaGram provides a typedef enum statement in the header file. You +can then use named constants to specify the internal token numbers for the input +tokens.</P> +<BR> + +<H4><A NAME="Error">Error Handling</A></H4> +<P>Your <A HREF="gloss.html#Parser">parser</A> must be prepared to deal with erroneous input. There are two +aspects to error handling: diagnosing the error, and recovering from the error. +On encountering an error, your parser will invoke a macro called SYNTAX_ ERROR. +If you do not provide a definition for SYNTAX_ERROR, AnaGram will provide a +simple error diagnostic. AnaGram can also provide automatic error diagnoses +which pinpoint the location of the error.</P> +<P> +AnaGram provides two options for error recovery: error +token <A HREF="gloss.html#Resynchronization">resynchronization</A> +and automatic resynchronization. These are techniques for getting your parser +back in synch with its input so it can proceed after encountering an error. +Normally, if you do not select one of these recovery techniques, your parser +will terminate when it encounters an error; however, you may override this +default if you wish and provide your own recovery technique.</P> + + +<IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" + WIDTH=1010 HEIGHT=2 > +<P> +<IMG ALIGN="right" SRC="images/pslrb6d.gif" ALT="Parsifal Software" + WIDTH=181 HEIGHT=25> +<BR CLEAR="right"> + +Back to <A HREF="index.html">Index</A> +<P> +<ADDRESS><FONT SIZE="-1"> + AnaGram parser generator - documentation<BR> + Introduction to Syntax Directed Parsing<BR> + Copyright © 1993-1999, Parsifal Software. <BR> + All Rights Reserved.<BR> +</FONT></ADDRESS> + +</BODY> +</HTML> + +