view doc/misc/html/examples/mpp/index.html @ 16:f9e4689b837d
Some minor updates for 15 years later.
author | David A. Holland |
---|---|
date | Tue, 31 May 2022 01:45:26 -0400 |
parents | 13d2b8934445 |
children |
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> <HTML> <HEAD> <TITLE> Macro preprocessor and C Parser </TITLE> </HEAD> <BODY BGCOLOR="#ffffff" BACKGROUND="tilbl6h.gif" TEXT="#000000" LINK="#0033CC" VLINK="#CC0033" ALINK="#CC0099"> <P> <IMG ALIGN="right" SRC="../../images/agrsl6c.gif" ALT="AnaGram" WIDTH=124 HEIGHT=30 > <BR CLEAR="all"> Back to <A HREF="../../index.html">Index</A> <P> <IMG ALIGN="bottom" SRC="../../images/rbline6j.gif" ALT="----------------------" WIDTH=1010 HEIGHT=2 > <P> <H1>Macro preprocessor and C Parser</H1> <IMG ALIGN="bottom" SRC="../../images/rbline6j.gif" ALT="----------------------" WIDTH=1010 HEIGHT=2 > <H2>Introduction</H2> This document provides an overview of the entire macro preprocessor example. Since the example consists of a number of modules, there is also a separate document file for each module. These document files provide an overview of the module and detailed descriptions of the variables, data structures and syntactic elements associated with the module. <P>This implementation of a C macro preprocessor demonstrates: <UL> <LI> the use of AnaGram in a real-world problem of considerable complexity.</LI> <LI> the use of AnaGram in a C++ environment.</LI> </UL> It was felt that only a fairly complex problem would adequately demonstrate the power of AnaGram. This example, therefore, may not be particularly easy to grasp or to understand in its entirety. <P>However, it is not necessary to understand all facets of this example to make good use of it. If you skim over it, you will see examples of many common syntactic constructs. You will find that in many cases you can copy these constructs verbatim and incorporate them directly into your own programs. <P>A number of AnaGram's features and options are well illustrated. This example makes use of four separate syntaxes to deal with the preprocessing so that the complete program, with one or another of the C parsers linked in, contains five separate parsers. 
There are, therefore, numerous examples of interfacing a parser to the rest of a program. In particular, several of the parsers are configured as C++ classes. <P>Other AnaGram features, such as semantically determined productions and context tracking, are used to good effect, particularly in the token scanner, which also illustrates the use of AnaGram to write lexical scanners. <P>In addition to the macro preprocessor, this example provides a choice of two C parsers which have been interfaced to the preprocessor. These parsers are simply syntax checkers. They have essentially no reduction procedures except for enough to give them rudimentary (and not fully correct) capabilities for coping with typedef types. You may, of course, add your own reduction procedures to adapt them to your needs. <P> Note that this macro preprocessor is not particularly standards-compliant; if you feed it difficult or pedantic test cases, it will probably give you wrong output. <P> <BR> <H2> Components of the Macro Preprocessor</H2> <TABLE WIDTH="100%"> <TR> <TD COLSPAN=4> The macro preprocessor example comprises the following modules: <BR><BR> </TD> </TR> <TR> <td rowspan=10 width="4%"> </td> <TD><tt><A HREF=mpp.html>mpp.cpp</A></tt></TD> <td rowspan=10 width="4%"> </td> <TD>Data declarations and main program</TD> </TR> <TR> <TD><tt><A HREF=mpp.html>mpp.h</A></tt></TD> <TD>Structure definitions, data and function declarations</TD> </TR> <TR> <TD><tt><A HREF=token.html>token.cpp</A></tt></TD> <TD>Token class function definitions</TD> </TR> <TR> <TD><tt><A HREF=token.html>token.h</A></tt></TD> <TD>Token class definitions</TD> </TR> <TR> <TD><tt><A HREF=ts.html>ts.syn</A></tt></TD> <TD>Token scanner</TD> </TR> <TR> <TD><tt><A HREF=mas.html>mas.syn</A></tt></TD> <TD>Macro and argument substitution module</TD> </TR> <TR> <TD><tt><A HREF=ex.html>ex.syn</A></tt></TD> <TD>Constant expression evaluator</TD> </TR> <TR> <TD><tt><A HREF=ct.html>ct.syn</A></tt></TD> <TD>Token classifier</TD> </TR> 
<TR> <TD><tt><A HREF=parsers.html>jrc.syn</A></tt></TD> <TD>C parser, based on C grammar by James A. Roskind</TD> </TR> <TR> <TD><tt><A HREF=parsers.html>krc.syn</A></tt></TD> <TD>C parser, based on C grammar in K & R, section A13</TD> </TR> <!-- <P> Here are links to the corresponding document files: <CENTER><A HREF="mpp.html">MPP </A> | <A HREF="token.html">TOKEN</A> | <A HREF="ts.html">TS</A> | <A HREF="mas.html">MAS</A> | <A HREF="ex.html">EX</A> | <A HREF="ct.html">CT</A> | <A HREF="parsers.html">PARSERS</A></CENTER> --> <TR> <TD COLSPAN=4> <BR> In addition, the following modules found in the <tt>oldclasslib</tt> directory provide supporting functions: <BR><BR> </TD> </TR> <TR> <td rowspan=6 width="4%"> </td> <TD><tt><A HREF=../../oldclasslib/charsink.html>charsink.cpp</A></tt></TD> <td rowspan=6 width="4%"> </td> <TD>Character sink support</TD> </TR> <TR> <TD><tt><A HREF=../../oldclasslib/charsink.html>charsink.h</A></tt></TD> <TD>Character sink class definitions</TD> </TR> <TR> <TD><tt><A HREF=../../oldclasslib/strdict.html>strdict.cpp</A></tt></TD> <TD>String dictionary support</TD> </TR> <TR> <TD><tt><A HREF=../../oldclasslib/strdict.html>strdict.h</A></tt></TD> <TD>String dictionary class definition</TD> </TR> <TR> <TD><tt><A HREF=../../oldclasslib/array.html>array.h</A></tt></TD> <TD>Array class definition</TD> </TR> <TR> <TD><tt><A HREF=../../oldclasslib/stack.html>stack.h</A></tt></TD> <TD>Stack class definition</TD> </TR> </TABLE> <!-- Here are links to the corresponding document files: <CENTER><A HREF="../../oldclasslib/charsink.html">CHARSINK</A> | <A HREF="../../oldclasslib/array.html">ARRAY</A> | <A HREF="../../oldclasslib/stack.html">STACK</A> | <A HREF="../../oldclasslib/strdict.html">STRDICT</A></CENTER> --> <P> <BR> <H2> Data Flow in the Macro Preprocessor</H2> Of the four parsers that make up the macro preprocessor itself, three are simply operators which transform their input: <UL> <LI> MAS transforms a token string (e.g., the body of a 
macro) into another token string (e.g., the expansion of the macro). MAS is called only from TS, and, recursively, from itself.</LI> <LI> EX transforms a token string (e.g., the text of a conditional expression) into a long integer (e.g., the value of the expression). EX is called only from TS.</LI> <LI> CT transforms a character string (ostensibly a C token) into a type identification code (e.g., STRINGliteral or identifier). CT is called only from MAS.</LI> </UL> The fourth is the token scanner, TS, which controls the entire process. The relationships are illustrated in the diagrams below, which show the type and direction of data flow among the modules. </P> <BR> <H3> Relationship between Token Scanner, Macro/Argument Scanner and Token Classifier modules:</H3> <CENTER><IMG SRC="reltmt24.gif" ALT="TS, translator, and output diagram" ></CENTER> <P> <BR> <H3> Relationship between Token Scanner and Expression Evaluator:</H3> <CENTER><IMG SRC="relte24.gif" ALT="TS, translator, and output diagram" ></CENTER> <P> <BR> <H3> Relationship between Token Scanner, token translator and output file:</H3> <CENTER><IMG SRC="reltto24.gif" ALT="TS, translator, and output diagram" ></CENTER> <P> <BR> <H3> Relationship between Token Scanner and C Parser:</H3> <CENTER><IMG SRC="reltc24.gif" ALT="TS, translator, and output diagram" ></CENTER> <P> <BR> <H2> Building and Running the Macro Preprocessor</H2> To make a working version of the macro preprocessor you need to take the following steps: <OL> <LI> Run AnaGram and build parsers for TS, MAS, CT, and EX.</LI> <LI> Choose which C grammar you would like to use (JRC or KRC), run AnaGram, and build a parser for your choice.</LI> <LI> If you are using JRC, edit the <tt>#include</tt> near the top of <tt>mpp.h</tt> to load <tt>jrc.h</tt> instead of <tt>krc.h</tt>.</LI> 
<LI> Make sure your compiler can find include files from <tt>oldclasslib/include</tt>.</LI> <LI> Then, compile and link the following modules: <BR><tt>mpp.cpp</tt> <BR><tt>token.cpp</tt> <BR><tt>ts.cpp</tt> <BR><tt>mas.cpp</tt> <BR><tt>ct.cpp</tt> <BR><tt>ex.cpp</tt> <BR><tt>krc.cpp</tt> or <tt>jrc.cpp</tt> <BR><tt>oldclasslib/source/charsink.cpp</tt> <BR><tt>oldclasslib/source/strdict.cpp</tt></LI> </OL> Now you can run the macro preprocessor. <P>The command line syntax is as follows: <PRE> mpp [-c] [-n] <input file name> [<output file name>] </PRE> The <tt>-c</tt> switch causes output of the preprocessor to be directed to the C parser you have included, rather than to an output file. <P>The <tt>-n</tt> switch allows the recognition of nested comments. <P>If you do not set the <tt>-c</tt> switch and do not specify an output file name, output will be directed to stdout. <P> <BR> <H2> Theory of Operation</H2> This implementation of a macro preprocessor is based on the description of preprocessing given in Section A12, Appendix A, of "The C Programming Language", Second Edition, by Kernighan and Ritchie, Prentice-Hall, 1988. <P>The preprocessor itself comprises four modules: a token scanner, <tt>ts.syn</tt>; a macro/argument substitution module, <tt>mas.syn</tt>; a token classifier, <tt>ct.syn</tt>; and an expression evaluator, <tt>ex.syn</tt>. These modules, working together, deal with conditional compilation, include files, macro definition, and macro expansion. The output of the preprocessor may be directed to stdout, to a file, or to either of two C parsers, depending on which you choose to link into your version of the program. <P>Two of the modules, <tt>ts.syn</tt> and <tt>mas.syn</tt>, do most of the work. <tt>ts.syn</tt> breaks the input into a sequence of "tokens" as defined by section A2.1 in Kernighan and Ritchie. It also determines the syntactic type of each such token. 
Descriptors, consisting of a type identifier and a storage handle, are then used as the units for further processing. <tt>ts.syn</tt> also handles the conditional compilation logic and fields macro definitions. When it encounters a macro call, it enlists <tt>mas.syn</tt> to expand the macro. <P><tt>ex.syn</tt> exists only to evaluate the conditional expressions in <TT>#if</TT> and <TT>#elif </TT>control statements. <tt>ct.syn</tt> is used only when a new token has been created during macro expansion. The "<TT>##</TT>" operator requires that two tokens be pasted together to make a single token. <tt>ct.syn</tt> is then used to determine what manner of beast has been created. <P> <BR> <H2> Supporting Class Libraries</H2> The macro preprocessor uses a number of simple data structures implemented as C++ classes to record and analyze the data generated by the parsers. Some of these structures are of general utility and are found in the <A HREF="../../oldclasslib/index.html">oldclasslib</A> directory. The others are specific to the preprocessor and are to be found in the files <tt>token.h</tt> and <tt>token.cpp</tt> with the rest of the preprocessor files. <P> <BR> <H2> General Purpose Data Structures</H2> The general purpose data structures are the following: <UL> <LI><tt>character_sink</tt></LI> <LI><tt>string_accumulator</tt></LI> <LI><tt>output_file</tt></LI> <LI><tt>array<class T></tt></LI> <LI><tt>stack<class T></tt></LI> <LI><tt>string_dictionary</tt></LI> </UL> A <tt>character_sink</tt> is an abstract class. It represents simply a general purpose character output device which can be plugged in to any character generator to accept its output. <P>A <tt>string_accumulator</tt> is a species of <tt>character_sink</tt>, which can store up characters as they arrive. It has multiple levels, so it can be used in recursive contexts without any confusion. <P>An <tt>output_file</tt> is another species of <tt>character_sink</tt>. 
It is a very simple implementation of stream output, set up so that it can be used interchangeably with other kinds of <tt>character_sink</tt>. <P><tt>array</tt> is a template class that simplifies the allocation and freeing of local storage for arrays of arbitrary type. <P>A <tt>stack</tt> is a template class that provides for multi-leveled push-down stacks of arbitrary types of data. <P>A <tt>string_dictionary</tt> is a device for associating a unique integer handle with a string so that the integer handle may be used as an alias for the string. <P>All of these classes use operator overloading in a consistent manner: <P><TT><< </TT>is used to add data to an entity, for example, to push data onto a stack, to add a string to a string dictionary, to add data to a string accumulator, to send data to an output file, or to transmit data to a parser. In all cases, <TT><< </TT>may be chained: <PRE> ta << s1 << s2;</PRE> <TT>>> </TT>is used to remove data from an entity, in particular, to pop something from a stack, or to remove a character from a string accumulator. Like <TT><<</TT>, <TT>>></TT> may be chained: <PRE> ta >> s1 >> s2;</PRE> <TT>++ </TT>is used with string accumulators and with stacks to increment the level number. It is defined only as a pre-increment operator. <P><TT>-- </TT>is used with string accumulators and with stacks to decrement the level number. It is defined only as a pre-decrement operator. <P><TT>[] </TT>is used to access a particular item. In the case of the string dictionary, <TT>[] </TT>with a string argument returns the handle, or zero if the string is not in the dictionary. <TT>[] </TT>with a handle returns a pointer to the string. In the case of the <tt>array</tt> class, <TT>[] </TT>provides access to a single element and checks for out-of-bounds references. <P>Cast operators are also overloaded to provide simple access to the data stored in an instance of a class. 
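<P>The operator conventions above can be sketched with a small stand-in class. This is a minimal illustration only, not the <tt>oldclasslib</tt> implementation: the name <tt>leveled_stack</tt>, its internals, the free function <tt>size</tt> as written here, and the choice that <TT>--</TT> discards whatever is left on the level being closed are all assumptions made for the example.

```cpp
// Hypothetical sketch of a multi-level stack; NOT the oldclasslib code.
#include <cassert>
#include <cstddef>
#include <vector>

template <class T>
class leveled_stack {
  std::vector<T> items;            // every pushed item, oldest first
  std::vector<std::size_t> marks;  // index at which each open level starts
public:
  // Number of items pushed since the current level was opened.
  std::size_t level_size() const {
    return items.size() - (marks.empty() ? 0 : marks.back());
  }

  // << pushes an item; returning *this is what allows chaining.
  leveled_stack &operator<<(const T &x) {
    items.push_back(x);
    return *this;
  }

  // >> pops the top item of the current level into x; also chainable.
  leveled_stack &operator>>(T &x) {
    assert(level_size() > 0);
    x = items.back();
    items.pop_back();
    return *this;
  }

  // Pre-increment opens a new, empty level.
  leveled_stack &operator++() {
    marks.push_back(items.size());
    return *this;
  }

  // Pre-decrement closes the current level (assumption: leftovers are dropped).
  leveled_stack &operator--() {
    assert(!marks.empty());
    items.erase(items.begin() + static_cast<std::ptrdiff_t>(marks.back()),
                items.end());
    marks.pop_back();
    return *this;
  }
};

// Free function giving the size of the current level.
template <class T>
std::size_t size(const leveled_stack<T> &s) { return s.level_size(); }
```

<P>Because each operator returns a reference to the stack itself, a statement such as <TT>s << 1 << 2;</TT> pushes both values in order, and <TT>s >> a >> b;</TT> pops them in reverse order. Items pushed before a <TT>++</TT> are untouched by pops and by the matching <TT>--</TT>, which is the property that makes such a structure safe to share among recursive callers.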
<P>Several overloaded functions are defined consistently where they are defined at all: <TABLE WIDTH="100%"> <TR> <TD ROWSPAN=3 WIDTH="4%"> </TD> <TD><tt>reset(</tt><i>object</i><tt>)</tt></TD> <TD>restores initial state </TD> </TR> <TR> <TD><tt>size(</tt><i>object</i><tt>)</tt></TD> <TD>returns size </TD> </TR> <TR> <TD><tt>error(</tt><i>object</i><tt>)</tt> </TD> <TD>returns error flag </TD> </TR> </TABLE> The macro preprocessor uses instances of the above classes for global data storage and manipulation: <PRE>
 extern stack<char *> paths;
 extern string_accumulator sa;
 extern string_dictionary td;
</PRE> <TT>paths </TT>is used to hold a list of search paths to look for include files whose names are enclosed in angle brackets. <P><TT>sa </TT>is used in the token scanner to accumulate the strings that constitute C tokens. Once complete, each string is added to the <tt>string_dictionary</tt> <TT>td </TT>to get a handle which identifies the string uniquely. <TT>td </TT>is generally referred to as the "token dictionary". <P>In the main program, an output file is defined in terms of these classes: <PRE> output_file file;</PRE> <P> <BR> <H2> Token Classes</H2> A number of class and structure definitions specific to the macro preprocessor are given in <tt>token.h</tt>. Member functions are defined in <tt>token.cpp</tt>. <P>The definitions in <tt>token.h</tt> are geared toward the transmission and sharing of data among the modules that make up the macro preprocessor. An enumeration defines constants of type <tt>token_id</tt> for all the different kinds of terminal tokens a C parser can expect to see. <P>A structure definition defines a token as a pair consisting of a <tt>token_id</tt> and an unsigned integer, the handle in the token dictionary of the string of characters that constitutes the actual token as defined in K&R. 
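<P>As a rough sketch of how a token and the token dictionary fit together, the fragment below pairs a <tt>token_id</tt> with a dictionary handle. It is hypothetical: <tt>tiny_dictionary</tt>, <tt>make_token</tt> and the field names <tt>id</tt> and <tt>handle</tt> are stand-ins invented for this example, the two enumerators are only a sample (the real list in <tt>token.h</tt> is longer), and the actual <tt>string_dictionary</tt> interface may differ.

```cpp
// Hypothetical sketch; the names below are not taken from token.h.
#include <cassert>
#include <deque>
#include <map>
#include <string>

// Illustrative subset of the token kinds.
enum token_id { identifier, STRINGliteral };

struct token {
  token_id id;      // what kind of token this is
  unsigned handle;  // handle of its spelling in the dictionary
};

// Stand-in dictionary: handles start at 1 so that 0 can mean "absent".
class tiny_dictionary {
  std::map<std::string, unsigned> handles;
  std::deque<std::string> strings;  // strings[h - 1] is the text for handle h
public:
  // Add s if it is new; either way, return its handle.
  unsigned add(const std::string &s) {
    std::map<std::string, unsigned>::iterator it = handles.find(s);
    if (it != handles.end()) return it->second;
    strings.push_back(s);
    unsigned h = static_cast<unsigned>(strings.size());
    handles[s] = h;
    return h;
  }

  // [] with a string: the handle, or zero if the string is absent.
  unsigned operator[](const std::string &s) const {
    std::map<std::string, unsigned>::const_iterator it = handles.find(s);
    return it == handles.end() ? 0u : it->second;
  }

  // [] with a handle: pointer to the stored string.
  const char *operator[](unsigned h) const {
    assert(h >= 1 && h <= strings.size());
    return strings[h - 1].c_str();
  }
};

// Build a token by registering its spelling and keeping only the handle.
token make_token(tiny_dictionary &d, token_id id, const std::string &text) {
  token t;
  t.id = id;
  t.handle = d.add(text);
  return t;
}
```

<P>Since identical spellings always map to the same handle, two tokens can be compared by comparing two small integers, and the text is recovered from the dictionary only when it is actually needed, for instance when writing output.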
<P>Then, to facilitate working with these tokens, a set of classes is defined using the <tt>character_sink</tt> class and its derived classes <!-- more or less --> as a model: <UL> <LI><tt>token_sink</tt></LI> <LI><tt>token_accumulator</tt></LI> <LI><tt>token_translator</tt></LI> <LI><tt>expression_evaluator</tt></LI> <LI><tt>c_parser</tt></LI> </UL> Like <tt>character_sink</tt>, <tt>token_sink</tt> is an abstract class that serves as a general purpose output device for processes which create a stream of tokens. <P>A <tt>token_accumulator</tt> is a species of <tt>token_sink</tt>. It is a repository for sequences of tokens. It has multiple levels, like a <tt>string_accumulator</tt>, so it can be used safely in recursive procedures. <P>A <tt>token_translator</tt> is a species of <tt>token_sink</tt> which converts a stream of tokens to a stream of characters. The constructor for a <tt>token_translator</tt> takes a pointer to a <tt>character_sink</tt>, so that tokens handed to a <tt>token_translator</tt> are converted to strings and passed on to the specified character sink. <P>The <tt>expression_evaluator</tt> class is a class wrapper around the expression evaluation module, <tt>ex.syn</tt>. It is a species of <tt>token_sink</tt>, so that tokens may be passed to the <tt>expression_evaluator</tt> just as they are to a <tt>token_accumulator</tt> or a <tt>token_translator</tt>. <P>The <tt>c_parser</tt> class is a class wrapper around a C parser module. Implementations of this class are found in both <tt>jrc.syn</tt> and <tt>krc.syn</tt>. The <tt>c_parser</tt> class is also a <tt>token_sink</tt>. <P>The macro preprocessor uses several global variables built on the token classes defined above: <PRE>
 extern token_sink *scanner_sink;
 extern token_accumulator ta;
 extern expression_evaluator condition;
</PRE> <tt>scanner_sink</tt> is the generic output device for the token scanner. 
As the token scanner produces tokens, it sends them to the <tt>token_sink</tt> pointed to by <tt>scanner_sink</tt>. <P><tt>condition</tt> is used to evaluate constant expressions in <TT>#if</TT> and <TT>#elif</TT> statements. The token scanner diverts its output to the expression evaluator with the statement: <PRE> scanner_sink = &condition; </PRE> Until <tt>scanner_sink</tt> is restored to its previous value, all output from the token scanner flows to the <tt>expression_evaluator</tt>, <tt>condition</tt>. <P><TT>ta</TT> is a <tt>token_accumulator</tt>, used in the token scanner and in <tt>mas.syn</tt> to accumulate sequences of tokens. As with the <tt>expression_evaluator</tt>, output from the token scanner can be diverted to <TT>ta</TT> by means of one simple statement: <PRE> scanner_sink = &ta; </PRE> This diversion simplifies the gathering of the tokens which make up the body of a macro or an argument to a macro call. <P>In the main program, two local variables are defined in terms of these token classes: <PRE>
 c_parser cp;
 token_translator tt(&file);
</PRE> Thus either <tt>cp</tt> or <tt>tt</tt> can serve as an output destination for the token scanner. The main program sets <tt>scanner_sink</tt> to point to one or the other depending on the <tt>-c</tt> command line switch. </P> <BR> <IMG ALIGN="bottom" SRC="../../images/rbline6j.gif" ALT="----------------------" WIDTH=1010 HEIGHT=2 > <P> <IMG ALIGN="right" SRC="../../images/pslrb6d.gif" ALT="Parsifal Software" WIDTH=181 HEIGHT=25> <BR CLEAR="right"> <P> Back to <A HREF="../../index.html">Index</A> <P> <ADDRESS> <FONT SIZE=-1>AnaGram parser generator - examples</FONT> <BR><FONT SIZE=-1>Macro preprocessor and C Parser</FONT> <BR><FONT SIZE=-1>Copyright © 1993-1999, Parsifal Software.</FONT> <BR><FONT SIZE=-1>All Rights Reserved.</FONT> <BR> </ADDRESS> </BODY> </HTML>