<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
   <TITLE> Macro preprocessor and C Parser </TITLE>
</HEAD>


<BODY BGCOLOR="#ffffff" BACKGROUND="tilbl6h.gif"
 TEXT="#000000" LINK="#0033CC"
 VLINK="#CC0033" ALINK="#CC0099">

<P>
<IMG ALIGN="right" SRC="../../images/agrsl6c.gif" ALT="AnaGram"
         WIDTH=124 HEIGHT=30 >
<BR CLEAR="all">
Back to <A HREF="../../index.html">Index</A>
<P>
<IMG ALIGN="bottom" SRC="../../images/rbline6j.gif" ALT="----------------------"
        WIDTH=1010 HEIGHT=2  >
<P>

<H1>Macro preprocessor and C Parser</H1>

<IMG ALIGN="bottom" SRC="../../images/rbline6j.gif" ALT="----------------------"
        WIDTH=1010 HEIGHT=2  >

<H2>Introduction</H2>

This document provides an overview of the entire macro preprocessor example.
Since the example consists of a number of modules, there is also a separate
document file for each module. These document files provide an overview
of the module and detailed descriptions of the variables, data structures
and syntactic elements associated with the module.

<P>This implementation of a C macro preprocessor demonstrates:
<UL>
<LI>
the use of AnaGram in a real-world problem of considerable complexity.</LI>

<LI>
the use of AnaGram in a C++ environment.</LI>
</UL>
It was felt that only a fairly complex problem would adequately demonstrate
the power of AnaGram. This example, therefore, may not be particularly
easy to grasp or to understand in its entirety.

<P>However, it is not necessary to understand all facets of this example
to make good use of it. If you skim over it, you will see examples of many
common syntactic constructs. You will find that in many cases you can copy
these constructs verbatim and incorporate them directly into your own programs.

<P>A number of AnaGram's features and options are well illustrated. This
example makes use of four separate syntaxes to deal with the preprocessing
so that the complete program, with one or another of the C parsers linked
in, contains five separate parsers. There are, therefore, numerous examples
of interfacing a parser to the rest of a program. In particular, several
of the parsers are configured as C++ classes.

<P>Other AnaGram features, such as semantically determined productions
and context tracking, are used to good effect, particularly in the token
scanner, which also illustrates the use of AnaGram to write lexical scanners.

<P>In addition to the macro preprocessor, this example provides a choice
of two C parsers which have been interfaced to the preprocessor. These
parsers are simply syntax checkers. They have essentially no reduction
procedures beyond the few needed to give them a rudimentary (and not fully
correct) ability to cope with typedef types. You may, of course, add your own
reduction procedures to adapt them to your needs.
<P>
Note that this macro preprocessor is not particularly standards
compliant; if you feed it difficult or pedantic test cases, it will
probably give you wrong output.
<P>
<BR>

<H2>
Components of the Macro Preprocessor</H2>
<TABLE WIDTH="100%">

<TR>
<TD COLSPAN=4>
The macro preprocessor example comprises the following modules:
<BR><BR>
</TD>
</TR>

<TR>
<td rowspan=10 width="4%">&nbsp;</td>
<TD><tt><A HREF=mpp.html>mpp.cpp</A></tt></TD>
<td rowspan=10 width="4%">&nbsp;</td>

<TD>Data declarations and main program</TD>
</TR>

<TR>
<TD><tt><A HREF=mpp.html>mpp.h</A></tt></TD>

<TD>Structure definitions, data and function declarations</TD>
</TR>

<TR>
<TD><tt><A HREF=token.html>token.cpp</A></tt></TD>

<TD>Token class function definitions</TD>
</TR>

<TR>
<TD><tt><A HREF=token.html>token.h</A></tt></TD>

<TD>Token class definitions</TD>
</TR>

<TR>
<TD><tt><A HREF=ts.html>ts.syn</A></tt></TD>

<TD>Token scanner</TD>
</TR>

<TR>
<TD><tt><A HREF=mas.html>mas.syn</A></tt></TD>

<TD>Macro and argument substitution module</TD>
</TR>

<TR>
<TD><tt><A HREF=ex.html>ex.syn</A></tt></TD>

<TD>Constant expression evaluator</TD>
</TR>

<TR>
<TD><tt><A HREF=ct.html>ct.syn</A></tt></TD>

<TD>Token classifier</TD>
</TR>

<TR>
<TD><tt><A HREF=parsers.html>jrc.syn</A></tt></TD>

<TD>C parser, based on C grammar by James A. Roskind</TD>
</TR>

<TR>
<TD><tt><A HREF=parsers.html>krc.syn</A></tt></TD>

<TD>C parser, based on C grammar in K &amp; R, section A13</TD>
</TR>

<!--
<P>
Here are links to the corresponding
document files:
<CENTER><A HREF="mpp.html">MPP&nbsp;</A> | <A HREF="token.html">TOKEN</A>
| <A HREF="ts.html">TS</A> | <A HREF="mas.html">MAS</A> | <A HREF="ex.html">EX</A>
| <A HREF="ct.html">CT</A> | <A HREF="parsers.html">PARSERS</A></CENTER>
-->

<TR>
<TD COLSPAN=4>
<BR>
In addition, the following modules found in the <tt>oldclasslib</tt>
directory provide supporting functions:
<BR><BR>
</TD>
</TR>

<TR>
<td rowspan=6 width="4%">&nbsp;</td>
<TD><tt><A HREF=../../oldclasslib/charsink.html>charsink.cpp</A></tt></TD>
<td rowspan=6 width="4%">&nbsp;</td>

<TD>Character sink support</TD>
</TR>

<TR>
<TD><tt><A HREF=../../oldclasslib/charsink.html>charsink.h</A></tt></TD>

<TD>Character sink class definitions</TD>
</TR>

<TR>
<TD><tt><A HREF=../../oldclasslib/strdict.html>strdict.cpp</A></tt></TD>

<TD>String dictionary support</TD>
</TR>

<TR>
<TD><tt><A HREF=../../oldclasslib/strdict.html>strdict.h</A></tt></TD>

<TD>String dictionary class definition</TD>
</TR>

<TR>
<TD><tt><A HREF=../../oldclasslib/array.html>array.h</A></tt></TD>

<TD>Array class definition</TD>
</TR>

<TR>
<TD><tt><A HREF=../../oldclasslib/stack.html>stack.h</A></tt></TD>

<TD>Stack class definition</TD>
</TR>

</TABLE>

<!--
Here are links to the corresponding
document files:
<CENTER><A HREF="../../oldclasslib/charsink.html">CHARSINK</A> | <A HREF="../../oldclasslib/array.html">ARRAY</A>
| <A HREF="../../oldclasslib/stack.html">STACK</A> | <A HREF="../../oldclasslib/strdict.html">STRDICT</A></CENTER>
-->

<P>
<BR>
<H2>
Data Flow in the Macro Preprocessor</H2>
Of the four parsers that make up the macro preprocessor itself, three are
simply operators which transform their input:
<UL>
<LI>
MAS transforms a token string (e.g., the body of a macro) into another
token string (e.g., the expansion of the macro). MAS is called only from
TS, and, recursively, from itself.</LI>

<LI>
EX transforms a token string (e.g., the text of a conditional expression)
into a long integer (e.g., the value of the expression). EX is called only
from TS.</LI>

<LI>
CT transforms a character string (ostensibly a C token) into a type identification
code (e.g., STRINGliteral, identifier, etc.). CT is called only from MAS.</LI>
</UL>
The fourth is the token scanner, TS, which controls the entire process.
The relationships are illustrated in the diagrams below, which show the
type and direction of data flow among the modules.
</P>
<BR>
<H3>
Relationship between Token Scanner, Macro/Argument Substitution and Token Classifier
modules:</H3>

<CENTER><IMG SRC="reltmt24.gif" ALT="TS, MAS, and CT diagram" ></CENTER>
<P>
<BR>


<H3>
Relationship between Token Scanner and Expression Evaluator:</H3>

<CENTER><IMG SRC="relte24.gif" ALT="TS and EX diagram" ></CENTER>
<P>
<BR>

<H3>
Relationship between Token Scanner, token translator and output file:</H3>

<CENTER><IMG SRC="reltto24.gif" ALT="TS, translator, and output diagram" ></CENTER>
<P>
<BR>
<H3>
Relationship between Token Scanner and C Parser:</H3>

<CENTER><IMG SRC="reltc24.gif" ALT="TS and C parser diagram" ></CENTER>
<P>

<BR>
<H2>
Building and Running the Macro Preprocessor</H2>
To make a working version of the macro preprocessor you need to take the
following steps:
<OL>
<LI>
Run AnaGram and build parsers for TS, MAS, CT, and EX.</LI>

<LI>
Choose which C grammar you would like to use (JRC or KRC), run AnaGram, and
build a parser for your choice.</LI>

<LI>
If you are using JRC, edit the <tt>#include</tt> near the top of
<tt>mpp.h</tt> to load <tt>jrc.h</tt> instead of <tt>krc.h</tt>.</LI>

<LI>
Make sure your compiler can find include files from
<tt>oldclasslib/include</tt>.</LI>

<LI>
Then, compile and link the following modules:
<BR><tt>mpp.cpp</tt>
<BR><tt>token.cpp</tt>
<BR><tt>ts.cpp</tt>
<BR><tt>mas.cpp</tt>
<BR><tt>ct.cpp</tt>
<BR><tt>ex.cpp</tt>
<BR><tt>krc.cpp</tt> or <tt>jrc.cpp</tt>
<BR><tt>oldclasslib/source/charsink.cpp</tt>
<BR><tt>oldclasslib/source/strdict.cpp</tt></LI>
</OL>
Now you can run the macro preprocessor.

<P>The command line syntax is as follows:
<PRE>
    mpp [-c] [-n] &lt;input file name&gt; [&lt;output file name&gt;]
</PRE>
The -c switch causes output of the preprocessor to be directed to the C
parser you have included, rather than to an output file.

<P>The -n switch allows the recognition of nested comments.

<P>If you do not set the -c switch and do not specify an output file name,
output will be directed to stdout.
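<P>The command line syntax above can be illustrated with a minimal sketch.
This is not the code in <tt>mpp.cpp</tt>; the names <tt>mpp_options_sketch</tt>
and <tt>parse_args_sketch</tt> are hypothetical, and the sketch assumes that
switches always precede the input file name, as the syntax line suggests.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical record of the command line choices; not taken from mpp.cpp.
struct mpp_options_sketch {
    bool to_c_parser;       // -c: send output to the linked C parser
    bool nested_comments;   // -n: recognize nested comments
    const char *input;      // required input file name
    const char *output;     // optional output file name; null means stdout
    bool ok;                // false if the command line was malformed
};

// Parses: mpp [-c] [-n] <input file name> [<output file name>]
// Assumption: switches come before the file names.
inline mpp_options_sketch parse_args_sketch(int argc, const char *const argv[]) {
    mpp_options_sketch o = { false, false, 0, 0, false };
    int i = 1;
    for (; i < argc && argv[i][0] == '-'; ++i) {
        if (std::strcmp(argv[i], "-c") == 0) o.to_c_parser = true;
        else if (std::strcmp(argv[i], "-n") == 0) o.nested_comments = true;
        else return o;                    // unknown switch: ok stays false
    }
    if (i >= argc) return o;              // missing input file name
    o.input = argv[i++];
    if (i < argc) o.output = argv[i++];   // optional output file name
    o.ok = (i == argc);                   // reject any extra arguments
    return o;
}
```

With <tt>mpp -n in.c out.i</tt>, this sketch sets <tt>nested_comments</tt>
and both file names; with <tt>mpp -c in.c</tt>, <tt>output</tt> stays null.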

<P>
<BR>
<H2>
Theory of Operation</H2>
This implementation of a macro preprocessor is based on the description
of preprocessing given in Section A12, Appendix A, of "The C Programming
Language", Second Edition, by Kernighan and Ritchie, Prentice-Hall, 1988.

<P>The preprocessor itself comprises four modules: a token scanner,
<tt>ts.syn</tt>;
a macro/argument substitution module, <tt>mas.syn</tt>; a token
classifier, <tt>ct.syn</tt>;
and an expression evaluator, <tt>ex.syn</tt>. These modules, working
together, deal
with conditional compilation, include files, macro definition, and macro
expansion. The output of the preprocessor may be directed to stdout, to
a file, or to either of two C parsers, depending on which you choose to
link into your version of the program.

<P>Two of the modules, <tt>ts.syn</tt> and <tt>mas.syn</tt>, do most of
the work. <tt>ts.syn</tt> breaks
the input into a sequence of "tokens" as defined by section A2.1 in Kernighan
and Ritchie. It also determines the syntactic type of each such token.
Descriptors, consisting of a type identifier and a storage handle, are
then used as the units for further processing. <tt>ts.syn</tt> also handles the
conditional compilation logic and fields macro definitions. When it encounters
a macro call, it enlists <tt>mas.syn</tt> to expand the macro.

<P><tt>ex.syn</tt> exists only to evaluate the conditional expressions
in <TT>#if</TT> and <TT>#elif </TT>control statements. <tt>ct.syn</tt>
is used only when a
new token has been created during macro expansion. The "<TT>##</TT>" operator
requires that two tokens be pasted together to make a single token.
<tt>ct.syn</tt>
is then used to determine what manner of beast has been created.
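<P>The need for a token classifier can be seen from the standard behavior of
"<TT>##</TT>" itself. The short sketch below uses any standard C++ compiler's
own preprocessor, not this example's; the macro name <tt>PASTE</tt> and the two
helper functions are chosen for illustration only. The same macro produces an
identifier in one case and an integer constant in the other, so the kind of
the pasted token cannot be known until it has been re-examined.

```cpp
#include <cassert>

// PASTE joins its two arguments into a single token during preprocessing.
#define PASTE(a, b) a##b

// PASTE(count, er) becomes the single identifier `counter`.
inline int paste_identifier() {
    int counter = 41;
    return PASTE(count, er) + 1;
}

// PASTE(12, 34) becomes the single integer constant 1234.
inline int paste_number() {
    return PASTE(12, 34);
}
```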

<P>
<BR>
<H2>
Supporting Class Libraries</H2>
The macro preprocessor uses a number of simple data structures implemented
as C++ classes to record and analyze the data generated by the parsers.
Some of these structures are of general utility and are found in
the <A HREF="../../oldclasslib/index.html">oldclasslib</A> directory.
The others are specific to the preprocessor and are to be found in the
files <tt>token.h</tt> and <tt>token.cpp</tt> with the rest of the
preprocessor files.

<P>
<BR>
<H2>
General Purpose Data Structures</H2>
The general purpose data structures are the following:
<UL>
<LI><tt>character_sink</tt></LI>
<LI><tt>string_accumulator</tt></LI>
<LI><tt>output_file</tt></LI>
<LI><tt>array&lt;class T&gt;</tt></LI>
<LI><tt>stack&lt;class T&gt;</tt></LI>
<LI><tt>string_dictionary</tt></LI>
</UL>
A <tt>character_sink</tt> is an abstract class. It represents simply a
general purpose
character output device which can be plugged in to any character generator
to accept its output.

<P>A <tt>string_accumulator</tt> is a species of
<tt>character_sink</tt>, which can store
up characters as they arrive. It has multiple levels, so it can be used
in recursive contexts without any confusion.

<P>An <tt>output_file</tt> is another species of
<tt>character_sink</tt>. It is a
very simple implementation of stream output, set up so that it can be used
interchangeably with other kinds of <tt>character_sink</tt>.
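<P>The idea of interchangeable character sinks can be sketched as follows.
This is an illustration only, not the code in <tt>charsink.h</tt>: the class
names <tt>char_sink_sketch</tt>, <tt>collecting_sink</tt> and
<tt>counting_sink</tt>, and the function <tt>emit_greeting</tt>, are
hypothetical, and the real classes may differ in interface and detail.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an abstract character sink.
class char_sink_sketch {
public:
    virtual ~char_sink_sketch() {}
    virtual char_sink_sketch &operator<<(int c) = 0;   // accept one character
    char_sink_sketch &operator<<(const char *s) {      // accept a C string
        while (*s) *this << *s++;
        return *this;
    }
};

// A sink that stores what it receives, loosely like a string accumulator.
class collecting_sink : public char_sink_sketch {
    std::string text_;
public:
    using char_sink_sketch::operator<<;   // keep the C string overload visible
    char_sink_sketch &operator<<(int c) override {
        text_ += static_cast<char>(c);
        return *this;
    }
    const std::string &text() const { return text_; }
};

// A sink that only counts characters, standing in for some other device.
class counting_sink : public char_sink_sketch {
    int count_;
public:
    counting_sink() : count_(0) {}
    using char_sink_sketch::operator<<;
    char_sink_sketch &operator<<(int c) override {
        (void)c;          // the character itself is discarded
        ++count_;
        return *this;
    }
    int count() const { return count_; }
};

// A character generator that works with any sink plugged in to it.
inline void emit_greeting(char_sink_sketch &out) {
    out << "hi" << '!';
}
```

The generator never needs to know which kind of sink it is feeding; that is
the property the preprocessor relies on.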

<P><tt>array</tt> is a template class that simplifies the allocation
and freeing of local storage for arrays of arbitrary type.

<P>A <tt>stack</tt> is a template class that provides for
multi-leveled push-down stacks of arbitrary types of data.

<P>A <tt>string_dictionary</tt> is a device for associating a unique
integer handle
with a string so that the integer handle may be used as an alias for the
string.

<P>All of these classes use operator overloading in a consistent manner:

<P><TT>&lt;&lt; </TT>is used to add data to an entity, for example, to
push data onto a stack, to add a string to a string dictionary, to add
data to a string accumulator, to send data to an output file, or to transmit
data to a parser. In all cases, <TT>&lt;&lt; </TT>may be chained:
<PRE>        ta &lt;&lt; s1 &lt;&lt; s2;</PRE>
<TT>&gt;&gt; </TT>is used to remove data from an entity, in particular, to pop
something from a stack, or to remove a character from a string accumulator.
Like <TT>&lt;&lt;</TT>, <TT>&gt;&gt;</TT> may be chained:
<PRE>        ta &gt;&gt; s1 &gt;&gt; s2;</PRE>
<TT>++ </TT>is used with string accumulators and with stacks to increment
the level number. It is defined only as a pre-increment operator.

<P><TT>-- </TT>is used with string accumulators and with stacks to decrement
the level number. It is defined only as a pre-decrement operator.
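<P>A minimal sketch of how levels make an accumulator safe for recursive use
is given below. It is not the <tt>string_accumulator</tt> from
<tt>charsink.h</tt>; the class <tt>leveled_accumulator</tt>, its member
functions, and the helper <tt>nested_use</tt> are hypothetical, and the sketch
assumes that decrementing the level discards the contents of that level.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical multi-level accumulator: one string per level.
class leveled_accumulator {
    std::vector<std::string> levels_;
public:
    leveled_accumulator() : levels_(1) {}           // start at level 0
    leveled_accumulator &operator++() {             // pre-increment: new level
        levels_.push_back(std::string());
        return *this;
    }
    leveled_accumulator &operator--() {             // pre-decrement: drop level
        if (levels_.size() > 1) levels_.pop_back();
        return *this;
    }
    leveled_accumulator &operator<<(const char *s) {  // append to current level
        levels_.back() += s;
        return *this;
    }
    const std::string &top() const { return levels_.back(); }
    std::size_t level() const { return levels_.size() - 1; }
};

// Works at a fresh level, as a recursive routine might, without
// disturbing what the caller has already accumulated.
inline std::string nested_use(leveled_accumulator &sa) {
    sa << "outer";
    ++sa;                               // "outer" is left untouched below
    sa << "inner";
    std::string inner = sa.top();
    --sa;                               // back to the caller's level
    return inner + "/" + sa.top();
}
```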

<P><TT>[] </TT>is used to access a particular item. In the case of the
string dictionary, <TT>[] </TT>with a string argument returns the handle,
or zero, if the string is not in the dictionary. <TT>[] </TT>with a handle
returns a pointer to the string. In the case of the "array" class, <TT>[]
</TT>provides access to a single element and checks for out of bounds references.
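<P>The two uses of <TT>[]</TT> on a string dictionary can be sketched as
follows. This is a hedged illustration built on the standard library, not the
implementation in <tt>strdict.cpp</tt>; the class name
<tt>dictionary_sketch</tt> is hypothetical, and the choice of numbering
handles from 1 is an assumption made so that zero can mean "not found".

```cpp
#include <cassert>
#include <cstring>
#include <deque>
#include <map>
#include <string>

// Hypothetical string dictionary: each distinct string gets a unique handle.
class dictionary_sketch {
    std::deque<std::string> strings_;            // index is the handle; slot 0 unused
    std::map<std::string, unsigned> handles_;    // string -> handle
public:
    dictionary_sketch() : strings_(1) {}

    // Add a string if it is not already present; may be chained.
    dictionary_sketch &operator<<(const char *s) {
        if (handles_.find(s) == handles_.end()) {
            unsigned handle = static_cast<unsigned>(strings_.size());
            strings_.push_back(s);
            handles_[s] = handle;
        }
        return *this;
    }

    // Lookup by string: returns the handle, or zero if absent.
    unsigned operator[](const char *s) const {
        std::map<std::string, unsigned>::const_iterator it = handles_.find(s);
        return it == handles_.end() ? 0 : it->second;
    }

    // Lookup by handle: returns a pointer to the stored string,
    // or a null pointer for an invalid handle.
    const char *operator[](unsigned h) const {
        return (h != 0 && h < strings_.size()) ? strings_[h].c_str() : nullptr;
    }
};
```

Adding the same string twice yields the same handle, which is what lets the
integer handle serve as an alias for the string.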

<P>Cast operators are also overloaded to provide simple access to the data
stored in an instance of a class.

<P>Several overloaded functions are defined consistently where they are
defined at all:
<TABLE WIDTH="100%">

<TR>
<TD ROWSPAN=3 WIDTH="4%">&nbsp;</TD>
<TD><tt>reset(</tt><i>object</i><tt>)</tt></TD>
<TD>restores initial state&nbsp;</TD>
</TR>

<TR>
<TD><tt>size(</tt><i>object</i><tt>)</tt></TD>
<TD>returns size&nbsp;</TD>
</TR>

<TR>
<TD><tt>error(</tt><i>object</i><tt>)</tt>&nbsp;</TD>
<TD>returns error flag&nbsp;</TD>
</TR>

</TABLE>
The macro preprocessor uses instances of the above classes for global data
storage and manipulation:
<PRE>
    extern stack&lt;char *&gt;     paths;
    extern string_accumulator   sa;
    extern string_dictionary    td;
</PRE>
<TT>paths </TT>is used to hold a list of search paths to look for include
files whose names are enclosed in angle brackets.

<P><TT>sa </TT>is used in the token scanner, to accumulate the strings
that constitute C tokens. Once complete, each string is added to the
<tt>string_dictionary</tt>
<TT>td </TT>to get a handle which identifies the string uniquely. <TT>td
</TT>is generally referred to as the "token dictionary".

<P>In the main program, an output file is defined in terms of these classes:
<PRE>   output_file file;</PRE>

<P>
<BR>
<H2>
Token Classes</H2>
A number of class and structure definitions specific to the macro preprocessor
are given in <tt>token.h</tt>. Member functions are defined in
<tt>token.cpp</tt>.

<P>The definitions in <tt>token.h</tt> are geared toward the
transmission and sharing
of data among the modules that make up the macro preprocessor. An enumeration
statement defines enumeration constants for all the different kinds of
terminal tokens a C parser can expect to see. These enumeration constants
are defined to be of type <tt>token_id</tt>.

<P>A structure definition defines a token as a pair: a <tt>token_id</tt>
and an unsigned integer. The integer is the token dictionary handle of the
character string that constitutes the actual token as defined in K&amp;R.

<P>Then, to facilitate working with these tokens, a set of classes is
defined using the <tt>character_sink</tt> class and its derived
classes <!-- more or less --> as a model:

<UL>
<LI><tt>token_sink</tt></LI>
<LI><tt>token_accumulator</tt></LI>
<LI><tt>token_translator</tt></LI>
<LI><tt>expression_evaluator</tt></LI>
<LI><tt>c_parser</tt></LI>
</UL>

Like <tt>character_sink</tt>, <tt>token_sink</tt> is an abstract class
that serves
as a general purpose output device for processes which create a stream
of tokens.

<P>A <tt>token_accumulator</tt> is a species of <tt>token_sink</tt>.
It is a repository for
sequences of tokens. It has multiple levels, like a
<tt>string_accumulator</tt>,
so it can be used safely in recursive procedures.

<P>A <tt>token_translator</tt> is a species of <tt>token_sink</tt>
which converts a stream
of tokens to a stream of characters. The constructor for a
<tt>token_translator</tt>
takes a pointer to a <tt>character_sink</tt>, so that tokens handed to
a <tt>token_translator</tt>
are converted to strings and passed on to the specified character sink.

<P>The <tt>expression_evaluator</tt> class is a class structure wrapped about
the expression evaluation module, <tt>ex.syn</tt>. It is a species of
<tt>token_sink</tt>,
so that tokens may be passed to the <tt>expression_evaluator</tt> just
as they are to a <tt>token_accumulator</tt> or a <tt>token_translator</tt>.

<P>The <tt>c_parser</tt> class is a class structure wrapped about a C
parser module.
Implementations of this class are found in both <tt>jrc.syn</tt> and
<tt>krc.syn</tt>. The
<tt>c_parser</tt> class is also a <tt>token_sink</tt>.
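<P>The relationship among these classes can be sketched in a few lines. The
code below is an illustration only, not the contents of <tt>token.h</tt>:
every name in it is hypothetical, a fixed table of spellings stands in for the
token dictionary, and a <tt>std::string</tt> stands in for the character sink
that a real <tt>token_translator</tt> would be given.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum token_id_sketch { NAME_TOKEN, OPERATOR_TOKEN };   // hypothetical subset

struct token_sketch {
    token_id_sketch id;   // syntactic type of the token
    unsigned handle;      // handle of the token's spelling
};

// Stand-in for the token dictionary: handle -> spelling (handle 0 unused).
static const char *const spellings_sketch[] = { "", "x", "+", "y" };

// Abstract destination for a stream of tokens.
class token_sink_sketch {
public:
    virtual ~token_sink_sketch() {}
    virtual token_sink_sketch &operator<<(token_sketch t) = 0;
};

// Stores tokens as they arrive, loosely like a token accumulator.
class collecting_token_sink : public token_sink_sketch {
    std::vector<token_sketch> tokens_;
public:
    token_sink_sketch &operator<<(token_sketch t) override {
        tokens_.push_back(t);
        return *this;
    }
    std::size_t count() const { return tokens_.size(); }
};

// Converts tokens back to text, loosely like a token translator.
// No bounds check on the handle: this sketch trusts its caller.
class translating_token_sink : public token_sink_sketch {
    std::string *out_;
public:
    explicit translating_token_sink(std::string *out) : out_(out) {}
    token_sink_sketch &operator<<(token_sketch t) override {
        *out_ += spellings_sketch[t.handle];
        return *this;
    }
};

// A token producer that does not care which sink it feeds.
inline void emit_expression(token_sink_sketch &sink) {
    token_sketch x = { NAME_TOKEN, 1 };
    token_sketch plus = { OPERATOR_TOKEN, 2 };
    token_sketch y = { NAME_TOKEN, 3 };
    sink << x << plus << y;
}
```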

<P>The macro preprocessor uses several global variables based on the
token-based classes defined above:
<PRE>
    extern token_sink            *scanner_sink;
    extern token_accumulator     ta;
    extern expression_evaluator  condition;
</PRE>
<tt>scanner_sink</tt> is the generic output device for the token
scanner. As the
token scanner develops tokens it sends them to the <tt>token_sink</tt> pointed
to by <tt>scanner_sink</tt>.

<P><tt>condition</tt> is used to evaluate constant expressions in
<TT>#if</TT> and
<TT>#elif</TT> statements. The token scanner diverts its output
to the expression evaluator with the statement:
<PRE>
      scanner_sink = &amp;condition;
</PRE>
Until the <tt>scanner_sink</tt> is restored to its previous value, all
output from
the token scanner flows to the <tt>expression_evaluator</tt>, <tt>condition</tt>.
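<P>The divert-and-restore pattern can be sketched as follows. To keep the
sketch short, a single concrete class <tt>sink_sketch</tt> stands in for the
abstract <tt>token_sink</tt> and its derived classes, and plain integers stand
in for tokens; all names ending in <tt>_sketch</tt>, and the functions
<tt>scan</tt> and <tt>scan_with_diversion</tt>, are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Simplified sink: records the tokens (here, plain integers) it receives.
struct sink_sketch {
    std::vector<int> received;
    sink_sketch &operator<<(int token) {
        received.push_back(token);
        return *this;
    }
};

static sink_sketch normal_output_sketch;   // the regular destination
static sink_sketch condition_sketch;       // stands in for `condition`
static sink_sketch *scanner_sink_sketch = &normal_output_sketch;

// The scanner always writes through the pointer, never to a fixed sink.
inline void scan(int token) {
    *scanner_sink_sketch << token;
}

inline void scan_with_diversion() {
    scan(1);                                    // goes to the normal output
    sink_sketch *saved = scanner_sink_sketch;   // remember previous value
    scanner_sink_sketch = &condition_sketch;    // divert
    scan(2);
    scan(3);                                    // both go to condition_sketch
    scanner_sink_sketch = saved;                // restore
    scan(4);                                    // normal output again
}
```

Because the scanner writes only through the pointer, changing the destination
costs one assignment and requires no change to the scanner itself.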

<P><TT>ta</TT> is a <tt>token_accumulator</tt>, used in the token scanner and in
<tt>mas.syn</tt> to accumulate sequences of tokens. As with the
<tt>expression_evaluator</tt>,
output from the token scanner can be diverted to <TT>ta</TT> by means of
one simple statement:
<PRE>
    scanner_sink = &amp;ta;
</PRE>
This diversion simplifies the gathering of the tokens that make up the
body of a macro or an argument to a macro call.

<P>In the main program, two local variables are defined in terms of these
token-based structures:
<PRE>
    c_parser            cp;
    token_translator    tt(&amp;file);
</PRE>
Thus either <tt>cp</tt> or <tt>tt</tt> can serve as an output
destination for the token scanner.
The main program sets <tt>scanner_sink</tt> to point to one or the
other depending
on a command line switch.
</P>

<BR>

<IMG ALIGN="bottom" SRC="../../images/rbline6j.gif" ALT="----------------------"
      WIDTH=1010 HEIGHT=2 >
<P>
<IMG ALIGN="right" SRC="../../images/pslrb6d.gif" ALT="Parsifal Software"
                WIDTH=181 HEIGHT=25>
<BR CLEAR="right">
<P>
Back to <A HREF="../../index.html">Index</A>
<P>
<ADDRESS>
<FONT SIZE=-1>AnaGram parser generator - examples</FONT>
<BR><FONT SIZE=-1>Macro preprocessor and C Parser</FONT>
<BR><FONT SIZE=-1>Copyright &copy; 1993-1999, Parsifal Software.</FONT>
<BR><FONT SIZE=-1>All Rights Reserved.</FONT>
<BR>
</ADDRESS>
</BODY>
</HTML>
author	David A. Holland
date	Sat, 22 Dec 2007 17:52:45 -0500