view doc/misc/html/isdp.html @ 24:a4899cdfc2d6 default tip

Obfuscate the regexps to strip off the IBM compiler's copyright banners. I don't want bots scanning github to think they're real copyright notices because that could cause real problems.
author David A. Holland
date Mon, 13 Jun 2022 00:40:23 -0400
parents 13d2b8934445
children
line wrap: on
line source

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<TITLE>Introduction to Syntax Directed Parsing</TITLE>
</HEAD>

<BODY BGCOLOR="#ffffff" BACKGROUND="tilbl6h.gif"
 TEXT="#000000" LINK="#0033CC"
 VLINK="#CC0033" ALINK="#CC0099">

<P>

<IMG ALIGN="right" SRC="images/agrsl6c.gif" ALT="AnaGram"
         WIDTH=124 HEIGHT=30>
<BR CLEAR="all">
Back to <A HREF="index.html">Index</A>
</P>
<IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------"
        WIDTH=1010 HEIGHT=2  >

<BR CLEAR="all">


<H1 ALIGN="LEFT">Introduction to Syntax Directed Parsing</H1>
<IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------"
      WIDTH=1010 HEIGHT=2 >
</P>

<P>
(Adapted from Chapter 4, AnaGram User's Guide)</P>
</P>
<UL>
  <LI SRC="#What"><A HREF="#What">What is Syntax Directed Parsing?</A></LI>
  <LI><A HREF="#Describing">Describing an Input Sequence</A></LI>
  <LI><A HREF="#How">How a Parser Works</A></LI>
  <LI><A HREF="#Note">A Note on Notation</A></LI>
  <LI><A HREF="#Reduction">Reduction Procedures</A></LI>
  <LI><A HREF="#Building">Building a Parser</A>
    <UL>
      <LI SRC="#Invoking"><A HREF="#Invoking">Invoking a Parser</A></LI>
      <LI><A HREF="#Communicating">Communicating with a Parser</A></LI>
      <LI><A HREF="#Parser">Parser Input</A></LI>
      <LI><A HREF="#Error">Error Handling</A></LI>
    </UL>
  </LI>
</UL>
<BR>

<H2><A NAME="What">What is Syntax Directed Parsing?</A></H2>
<P>
Every programmer has to deal with input data. Usually processing input data
depends on what has preceded, and often even on what follows, the input under
consideration. Keeping track of these dependencies in order to know how to
process the data is called parsing. It is often relatively easy to keep track of
simple dependencies when first writing a program. As the program develops, as
new features are added, as bugs are fixed, the dependencies often stop being
simple. Input processing becomes a headache, since it is hard to keep track of
or even identify all the particular cases. Changes to the program cause
unexpected problems and program maintenance threatens to get out of control.</P>
<P>
Syntax directed parsing is a technique for addressing these difficulties. In
syntax directed parsing, the input part of a program is constructed
automatically, by means of a standard algorithm, from a high level description
of the input data structure. Code to perform any required processing of the data
is attached to the description in a convenient way.</P>
<P>
The description, being non-procedural, is generally easier to write and to
modify than the equivalent program code, and much less likely to harbor bugs. It
is easier to read, and easier to maintain. It is easy to re-use in other
programs requiring the same input, thus encouraging uniform interfaces. The
technique also simplifies the overall program by decoupling the input and
processing components and providing a natural, modular structure.</P>
<P>
To use syntax directed parsing you first write the input data description,
called a <A HREF="gloss.html#Grammar">grammar</A>. A file which contains a grammar is called a syntax
file. A parser generator, such as AnaGram, then can create, from the syntax
file, a function (or program) called a <A HREF="gloss.html#Parser">parser</A>, written in C or C++. The parser
keeps track of all the dependencies in your input, and calls certain functions,
<A HREF="gloss.html#ReductionProcedure">reduction procedures</A>, to deal with specific units or sequences of data as they
are encountered.</P>
<P>
Reduction procedures are the code you write to process your data. In your
grammar they are linked to the structures in your input, so that your parser
will call them at precisely the right times with precisely the right data. Note
that with this technique you need only provide a non-procedural <I>description</I>
of the input. The details of flow of control are handled entirely by the parser.
When you write reduction procedures, you can concentrate entirely on what you
have to do with the data. You don't have to encumber your code with switches and
tests to determine the structure of your input.</P>
<P>
The parsers you build using a parser generator such as AnaGram may be complete
stand-alone programs, or they may serve as input routines for a more extensive
program. Some programs may even use more than one parser.</P>
<BR>

<H2><A NAME="Describing">Describing an Input Sequence</A></H2>
<P>
Writing a grammar consists of describing the acceptable input sequences for your
program. The vehicle for describing an input sequence is called a <A HREF="gloss.html#Production">production</A>.
Productions show how a logical component of the input can be made up of a
sequence of more elementary components. A production that describes a date might
be written:</P>
<PRE>
  date
    -&gt; name of month, day, comma, year </PRE>
<P>
 The components of the input are called <A HREF="gloss.html#Token">tokens</A>. The sequence of tokens on the
side of the production is called a <A HREF="gloss.html#GrammarRule">grammar rule</A>, or rule for short. The
individual tokens on the right side of the rule are also called <A HREF="gloss.html#RuleElement">rule elements</A>.
</P>
<P>
The token on the left side of the production is called the <A HREF="gloss.html#ReductionToken">reduction token</A> for
the rule. Tokens may have semantic values, as distinguished from syntactic
values, which you can use in your <A HREF="gloss.html#ReductionProcedure">reduction procedures</A>. For instance, the value
of name of month could be an integer in the range zero to eleven, or it could be
a pointer to an ascii string. The value of day could be an integer in the range
one to thirty-one.</P>
<P>
A <A HREF="gloss.html#Grammar">grammar</A> consists of a number of such <A HREF="gloss.html#Production">productions</A>, each of which describes some
component of the input in terms of other components. It does not take very many
productions to describe quite complex input streams. A grammar for the C
language, for instance, requires about two hundred productions.</P>
<P>
Many people find the term "production" quite confusing. It comes from theoretical
linguistics where it is used to describe how one may produce sequences which
correspond to a set of grammatical rules. Ironically, the major usage of the
idea has been in parsing where the interest is not so much in creating sequences
which satisfy the grammatical rules as in decoding and analyzing such sequences.
Nonetheless, it is convenient, in the above example, to say that the token date
produces a sequence of tokens consisting of name of month, day, comma, and year.
We also say that the sequence reduces to date.</P>
<P>
There may be more than one production to describe a given component, if there is
more than one way it may be represented. For instance,</P>
<PRE>
  date
    -&gt; day, name of month, year </PRE>
<P>
 describes another common way of writing a date. In other words, a <A HREF="gloss.html#ReductionToken">reduction
token</A> may produce a number of different <A HREF="gloss.html#GrammarRule">grammar rules</A>.</P>
<P>
Tokens which appear on the left side of one or more productions are called
<A HREF="gloss.html#Nonterminal">nonterminal tokens</A>. Those which appear <I>only</I> on the right sides of
productions are called <A HREF="gloss.html#Terminal">terminal tokens</A>. Terminal tokens are the units which
actually appear physically in the input. Nonterminal tokens are identified when
a sequence of tokens that matches the right side of a production is seen in the
input. When AnaGram analyzes a <A HREF="gloss.html#Grammar">grammar</A>, it assigns a unique token number to each
token it finds in the grammar.</P>
<P>
Nonterminal tokens, such as date in the example above, may appear in any grammar
rule just as though they were input tokens. The token on the left side of a
production can even appear on the right side as well. Such a production is
called a recursive production. When a nonterminal token appears on the right
side of a production, it may be represented in this context by <I>any</I> of
the <A HREF="gloss.html#GrammarRule">grammar rules</A> it produces. Grammars described in this manner are called
<A HREF="gloss.html#ContextFreeGrammar">context free grammars</A> since there is no contextual constraint on which of the
rules that a token produces can appear in any given context.</P>
<P>
Recursive productions may be either left recursive or right recursive. Left
recursive productions are those where the recursively defined <A HREF="gloss.html#Nonterminal">nonterminal</A>
appears as the first <A HREF="gloss.html#RuleElement">element</A> in the recursive rule. Right recursive productions
are those where it is the last element. If it appears anywhere between, the
production is said to be center recursive.</P>
<P>
Any nonterminal token which has a recursive production must also have at least
one simple, non-recursive production. Otherwise, it is not possible to create a
finite sequence of terminal tokens from the nonterminal token.</P>
<P>
Recursion may also occur implicitly in a <A HREF="gloss.html#Grammar">grammar</A> when one of the tokens on the
right side of a production itself has a production involving the token on the
left. Such implicit recursion sometimes may involve numerous levels of
productions. Implicit recursion occurs most commonly in describing constructs
such as arithmetic expressions or the block structure of programming languages.</P>
<P>
Clearly, grammars can accommodate multiple levels of structure in the input
sequences they describe. There must, at the top, be a single token which
encompasses the entire input. This special token is variously called the <A HREF="gloss.html#GrammarToken">grammar
token</A>, the goal token or the start token.</P>
<P>
AnaGram allows you to specify <A HREF="gloss.html#Terminal">terminal tokens</A> explicitly as ascii characters, or
even as <A HREF="gloss.html#CharacterSets">sets of ascii characters</A>, right in the grammar. Thus, you may write
'0-9' to represent the set of ascii digits, or 'A-Z' to represent the set of
upper case letters. The semantic value of such a <A HREF="gloss.html#Token">token</A> is the ascii character
code that actually appears in the input stream. If the various sets you use in
your grammar overlap, they may not properly represent terminal tokens. In this
case, AnaGram automatically extends your <A HREF="gloss.html#Grammar">grammar</A> appropriately. </P>
<BR>

<H2><A NAME="How">How a Parser Works</A></H2>
<P>
The aim of a <A HREF="gloss.html#Parser">parser</A> is to match its input with the full syntactic structure
specified by the <A HREF="gloss.html#Production">productions</A> which make up the grammar. The primary component of
a parser is an input buffer, sometimes thought of as a stack, into which tokens
are shifted, or stored sequentially, as they are encountered in the input. At
the same time that a token is shifted into the input buffer, its semantic value
is pushed onto the value stack. A token is not shifted into the buffer unless it
"makes sense", that is, unless it is consistent with the rules of the
<A HREF="gloss.html#Grammar">grammar</A> and with the input that has preceded it. If a token does not make sense,
the parser signals a syntax error. In order to determine whether a token makes
sense, the parser has a sort of decision table which provides a list of
acceptable tokens for each of a number of states. The table also specifies what
the parser is to do with each acceptable token. When the table indicates that a
token is to be shifted into the buffer, it also specifies a new state. The
parser stacks the current state on a state stack and jumps to the new state.
Thus every time a token is shifted into the input buffer, a state number is
pushed onto the state stack. For each state of the parser, excepting only the
initial state, there is a unique token which will cause a jump to that state.
This token is called the <A HREF="gloss.html#CharacteristicToken">characteristic token</A> for the state.</P>
<P>
When the rightmost, or most recent, tokens in the input buffer match the right
side of a production precisely, the parser <I>may</I> replace the tokens that
match the rule with a single token, the token on the left side of the
production. This process of replacing a sequence of tokens with a single token
is called reduction. The token that replaces the sequence of tokens is called
the <A HREF="gloss.html#ReductionToken">reduction token</A>.</P>
<P>
The actual mechanism of the reduction is quite important. At the same time that
input tokens are removed from the input buffer, state numbers are popped from
the state stack, so that when all input tokens matching the rule have been
removed, the parser state has been restored to the value it had at the time the
first token in the rule was seen. As state numbers are popped from the state
stack, token values are popped from the value stack. If the rule has a <A HREF="gloss.html#ReductionProcedure">reduction
procedure</A>, temporary variables are loaded with the values popped from the stack
and the reduction procedure is called. The reduction token is now shifted into
the input buffer just as though it were an input token. If the reduction
procedure returned a result it is shifted into the value stack as the value of
the reduction token. The parser stacks the current state again and jumps to a
new state as determined by the parser tables.</P>
<P>
If there are no errors in your input, when the last token has been read from the
input, shifted into the input buffer, and reductions performed, there will be
precisely one token in the input buffer: the <A HREF="gloss.html#GrammarToken">grammar, or goal, token</A> which
describes your entire input. At this point your parser declares itself finished.</P>
<P>
Reductions do not necessarily occur every time a rule matches the tokens in the
input buffer. If the reduction token does not make sense, that is, if it is not
consistent with the rules of the <A HREF="gloss.html#GrammarToken">grammar</A> and with the input that has preceded
it, the parser will not perform the reduction. Suppose there are tokens in the
input which match one of the rules given above for date. A reduction will not
occur unless date is one of the tokens the parser is actually looking for at
that stage in the input.</P>
<P>
Even when the reduction token would make sense, there are still situations where
the reduction would not take place. Suppose a grammar includes the two
<A HREF="gloss.html#Production">productions</A> given above for date as well as the following:</P>
<PRE>

  date
    -&gt; name of month, day </PRE>
<P>
This production is the same as the first, but with no year specification. When a
parser directed by this grammar has encountered name of month and day, it can't
tell without looking further whether it has a short form or a long form date. In
such a circumstance, the parser looks at the next following token, which is
called a <A HREF="gloss.html#Lookahead">lookahead token</A>. If the lookahead token is a comma, then, in the
absence of other productions, the input is a long form date. If the lookahead
token is not a comma, then the input is certainly not a long form date, and it
is proper to reduce the short form production.</P>
<P>
Suppose the lookahead token were a comma and the grammar were also to contain
the following production:</P>
<PRE>

  appointment
    -&gt; date, comma, time </PRE>
<P>
 Since a comma can follow date, according to this rule, and can also follow day
according to the first production, it is impossible to determine, simply by
looking at the <A HREF="gloss.html#Lookahead">lookahead token</A>, whether the date was being given in short form
or long form. One would have to look beyond the comma to see if what follows the
comma matches the rules for time or for year. Although it is possible to build
parsers which can do this, it is not generally feasible. Situations of this sort
are called <A HREF="gloss.html#Conflict">conflicts</A>. AnaGram warns you about the conflicts in your <A HREF="gloss.html#Grammar">grammar</A> when
it analyzes your grammar, and provides numerous facilities to help you correct
them.</P>
<BR>

<H2><A NAME="Note">A Note on Notation</A></H2>
<P>
<A HREF="gloss.html#ContextFreeGrammar">Context free grammars</A> have been traditionally represented in the literature
using <A HREF="gloss.html#BNF">Backus-Naur Form</A>, or BNF. In Backus-Naur Form, certain characters, called
metacharacters, are used to punctuate <A HREF="gloss.html#Production">productions</A> and all other printable
characters are taken to represent themselves literally. Named tokens are denoted
by enclosing the name within angle brackets <CODE>&lt; &gt;</CODE> . The left side of a
production is distinguished from the right side by the
characters <CODE> ::= </CODE>.
If several productions have the same left side, the vertical
bar <CODE> | </CODE> is used to separate them. The <A HREF="gloss.html#RuleElement">elements</A> of a <A HREF="gloss
.html#GrammarRule">grammar rule</A> are simply juxtaposed
to indicate that one follows another. Blanks are ignored. Thus, in BNF, the
first production given for date, above, would be:</P>
<PRE>
  &lt;date&gt; ::= &lt;name of month&gt; &lt;day&gt;, &lt;year&gt;
</PRE>
<P>
AnaGram uses a notation more consonant with ordinary programming usage. Thus
token names need not be bracketed and literal characters must be appropriately
quoted. The elements of rules are joined by commas. Using this approach, there
is no need for metacharacters and it becomes possible to make a number of useful
extensions to the notation.</P>
<BR>

<H2><A NAME="Reduction">Reduction Procedures</A></H2>
<P>
Of course, the reason for parsing an input stream is to interpret the data in
the stream and to process it in some useful manner. The primary tool for doing
this is the <A HREF="gloss.html#ReductionProcedure">reduction procedure</A>. A reduction procedure is a piece of C code that
is executed when a particular <A HREF="gloss.html#GrammarRule">grammar rule</A> is reduced. Often, a reduction
procedure calculates a value which becomes the semantic value of the reduction
token. The input to the reduction procedure consists of the values of the tokens
that make up the grammar rule.</P>
<P>
AnaGram allows you to assign C variable names to the tokens in a grammar rule,
so that you can refer to them in the reduction procedure. To assign a C variable
name, simply follow the token in the rule with a colon and the C variable name.
Simple reduction procedures can be written as C or C++ expressions. At the end
of the rule, write an equal sign and an expression, terminated by a semicolon.
For example:</P>
<PRE>
  (int) hex digit
     -&gt; '0-9':d         =d-'0';
     -&gt; 'a-f':d         =d-'a'+10;
     -&gt; 'A-F':d         =d-'A'+10;   </PRE>
<P>
When any one of these rules is matched, the value of the token in the rule is
assigned to the temporary variable. The expression to the right of the equal
sign is evaluated and the value of the expression is stored as the value of hex
digit, which has been declared to be an int. These <A HREF="gloss.html#Production">productions</A> define
hexadecimal digits as ascii characters, and calculate the binary value of the
digit in terms of the ascii character code, d. The binary value becomes the
value of hex digit. Hexadecimal digits can be combined to make hexadecimal
numbers by writing the following productions:</P>
<PRE>
  (int) hex number
    -&gt; hex digit
    -&gt; hex number:n, hex digit:d     =16*n+d;  </PRE>
<P>
There are several important points to notice in this example. First, <A HREF="gloss.html#ReductionProcedure">reduction
procedures</A> are executed "from the bottom up". That is, the reduction
procedure for hex digit is executed before any reduction procedure for hex
number. Second, if there is no reduction procedure for a production, the value
of the first token in the rule is assigned to the reduction token. Thus, it is
not necessary to provide a reduction procedure in the first production for hex
number. Third, the reduction procedures for recursive productions are always
executed <I>after</I> the reduction procedure, if any, for the non-recursive
production which begins the recursion. Finally, when an input sequence is
described using left recursion, as in this example, the <A HREF="gloss.html#RuleElement">elements</A> of the sequence
are processed left to right.</P>
<P>
If you wish to process the elements of a sequence right to left, you may use
right recursion. For example, it is sometimes convenient to define the fraction
part of a decimal number thus:</P>
<PRE>
  (double) fraction part
    -&gt; '0-9':d                   =(d-'0')/10.;
    -&gt; '0-9':d,fraction part:f   =(d-'0'+f)/10.;</PRE>
<P>
In this case the leading digits are stored temporarily in the parse stack, and
then the fraction part is evaluated right to left only when the last digit has
been found.</P>
<P>
Reduction procedures can be more complex than simple expressions. After the
equal sign you may include an arbitrary block of C or C++ code, enclosed in
braces, { }. To return a semantic value for the reduction token simply use a
return statement. Of course, reduction procedures have the full resources of C
or C++ at their disposal. They may set and interrogate global variables and may
call functions freely.</P>
<P>
Since the reduction procedures you write will probably need some support code,
such as #include statements and declarations, you may incorporate C or C++ code
into your syntax file at any point. You need only enclose it in braces ({}).
Such code is called embedded C. All embedded C code is also copied to the parser
file, and <I>precedes</I> all of your reduction procedures.</P>
<BR>

<H2><A NAME="Building">Building a Parser</A></H2>
<P>
In order to round out a <A HREF="gloss.html#Parser">parser</A> into a functioning program it needs input
procedures, as well as error diagnosis and recovery capabilities. AnaGram has a
number of options available which give you a high degree of flexibility in
configuring a parser to suit your particular needs. All of the options are
provided with reasonable defaults, so that you can safely disregard any option
until you need the features it provides.</P>
<BR>

<H4><A NAME="Invoking">Invoking a Parser</A></H4>
<P>
 Normally, AnaGram configures <A HREF="gloss.html#Parser">parsers</A> as functions which you can call from
elsewhere in your program. In this situation, you call the parser, it processes
its input, and returns either when it has finished or cannot proceed because of
errors. Alternatively, if you set the event driven configuration switch, your
parser will be configured so that you have two procedures to call: an
initializer and a parser. In the event driven configuration you start the parse
by calling the initializer and then you call the parser once for each unit of
input. Using the event driven mode makes it quite easy to configure a parser as
a filter and to chain several parsers together so that the output from one
parser is the input to the next. Such multi-stage parsing is a convenient way to
deal with complex input that is not context free.</P>
<BR>

<H4><A NAME="Communicating">Communicating with a Parser</A></H4>
<P>The complete status of your parser is contained in a single data structure
called a parser control block. All communications with a parser take place via
the parser control block. Input procedures must place input data in the
appropriate field in the parser control block. When the parse is complete or
encounters an error, the results of the parse may be found in the parser control
block. When AnaGram builds a parser it includes, in the header file it
generates, a typedef statement which defines the structure of the parser control
block.</P>
<BR>

<H4><A NAME="Parser">Parser Input</A></H4>
<P>The input to your <A HREF="gloss.html#Parser">parser</A> may be either characters read directly from an
input stream, or tokens accumulated by a pre-processor or <A HREF="gloss.html#LexicalScanner">lexical scanner</A>. The
way you provide input to your parser depends on how your <A HREF="gloss.html#Grammar">grammar</A> defines input
tokens and also on whether or not you have requested an event driven parser.</P>
<P>
If your parser is event driven, you provide its input by storing the input code
and the input value, if any, into the parser control block and calling the
parser.</P>
<P>
If you have set the pointer input configuration switch in your syntax file, you
simply initialize the pointer field in your parser control block before you call
your parser. Your parser will then read its input directly from memory by simply
incrementing the pointer as necessary.</P>
<P>
Otherwise, your parser will invoke a macro called GET_INPUT every time it needs
more input. You may define GET_INPUT according to your needs. You can define it
so that it calls an input function, or you can define it so that it executes
in-line code each time it is invoked. Your GET_INPUT macro should store its
input code in the input_code field of the parser control block. If you do not
write a GET_ INPUT macro, AnaGram will provide one which will read characters
from stdin.</P>
<P>
If your <A HREF="gloss.html#Grammar">grammar</A> does not define <A HREF="gloss.html#Terminal">terminal tokens</A> in terms of ascii characters or
external token numbers, your GET_INPUT will have to determine the appropriate
internal token number for each input token. To assist you in determining these
token numbers AnaGram provides a typedef enum statement in the header file. You
can then use named constants to specify the internal token numbers for the input
tokens.</P>
<BR>

<H4><A NAME="Error">Error Handling</A></H4>
<P>Your <A HREF="gloss.html#Parser">parser</A> must be prepared to deal with erroneous input. There are two
aspects to error handling: diagnosing the error, and recovering from the error.
On encountering an error, your parser will invoke a macro called SYNTAX_ ERROR.
If you do not provide a definition for SYNTAX_ERROR, AnaGram will provide a
simple error diagnostic. AnaGram can also provide automatic error diagnoses
which pinpoint the location of the error.</P>
<P>
AnaGram provides two options for error recovery: error
token <A HREF="gloss.html#Resynchronization">resynchronization</A>
and automatic resynchronization. These are techniques for getting your parser
back in synch with its input so it can proceed after encountering an error.
Normally, if you do not select one of these recovery techniques, your parser
will terminate when it encounters an error; however, you may override this
default if you wish and provide your own recovery technique.</P>


<IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------"
      WIDTH=1010 HEIGHT=2 >
<P>
<IMG ALIGN="right" SRC="images/pslrb6d.gif" ALT="Parsifal Software"
                WIDTH=181 HEIGHT=25>
<BR CLEAR="right">

Back to <A HREF="index.html">Index</A>
<P>
<ADDRESS><FONT SIZE="-1">
                  AnaGram parser generator - documentation<BR>
                  Introduction to Syntax Directed Parsing<BR>
                  Copyright &copy; 1993-1999, Parsifal Software. <BR>
                  All Rights Reserved.<BR>
</FONT></ADDRESS>

</BODY>
</HTML>