<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> <HTML> <HEAD> <TITLE> Token Scanner - Macro preprocessor and C Parser </TITLE> </HEAD> <BODY BGCOLOR="#ffffff" BACKGROUND="tilbl6h.gif" TEXT="#000000" LINK="#0033CC" VLINK="#CC0033" ALINK="#CC0099"> <P> <IMG ALIGN="right" SRC="../../images/agrsl6c.gif" ALT="AnaGram" WIDTH=124 HEIGHT=30 > <BR CLEAR="all"> Back to : <A HREF="../../index.html">Index</A> | <A HREF="index.html">Macro preprocessor overview</A> <P> <IMG ALIGN="bottom" SRC="../../images/rbline6j.gif" ALT="----------------------" WIDTH=1010 HEIGHT=2 > <P> <H1> Token Scanner - Macro preprocessor and C Parser </H1> <IMG ALIGN="bottom" SRC="../../images/rbline6j.gif" ALT="----------------------" WIDTH=1010 HEIGHT=2 > <P> <BR> <H2>Introduction</H2> The token scanner module, <tt>ts.syn</tt>, accomplishes the following tasks: <OL> <LI> It reads the raw input, gathers tokens and identifies them. </LI> <LI> It analyzes conditional compilation directives and skips over text that is to be omitted. </LI> <LI> It analyzes macro definitions and maintains the macro tables. </LI> <LI> It identifies macro calls in the input stream and calls the <tt>macro_expand()</tt> function to expand them. </LI> <LI> It recognizes <tt>#include</tt> statements and calls itself recursively to parse the include file. </LI> </OL> The token scanner parser, <tt>ts()</tt>, is called from a shell function, <tt>scan_input(char *)</tt>, which takes the name of a file as an argument. <tt>scan_input()</tt> opens the file, calls <tt>ts()</tt>, and closes the file. <tt>scan_input()</tt> is called recursively by <tt>include_file()</tt> when an <tt>#include</tt> statement is found in the input. <P> Output from the token scanner is directed to a <tt>token_sink</tt> pointed to by the <tt>scanner_sink</tt> global variable. The main program may set <tt>scanner_sink</tt> to point to either a <tt>token_translator</tt> or a <tt>c_parser</tt>. 
During the course of processing, the token scanner redirects output to a token accumulator or to the conditional expression evaluator, as necessary, by temporarily changing the value of <tt>scanner_sink</tt>. <P> The token scanner module contains two syntax error diagnostic procedures: <tt>syntax_error(char *)</tt> and <tt>syntax_error_scanning(char *)</tt>. The former is set up to provide correct line and column numbers for functions called from reduction procedures in the token scanner. The latter is set up to provide line and column numbers for errors discovered in the scanner itself. Both functions accept a pointer to an error message. <P> <BR> <H2> Theory of Operation </H2> The primary purpose of the token scanner is to identify the C language tokens in the input file and pass them on to another module for further processing. In order to package them for transmission, the token scanner maintains a "token dictionary", <tt>td</tt>, which enables it to characterize each distinct input token with a single number. The token scanner also classifies tokens according to the definitions of the C language. The "token" that it passes on for further processing is a pair consisting of an id field, and a value field. The id field is defined by the <tt>token_id</tt> enumeration in <tt>token.h</tt>. The value field is the index of the token in the token dictionary, <tt>td</tt>. <P> To support its primary purpose, the token scanner deals with several other problems. First, it identifies preprocessor control lines which control conditional compilation and skips input appropriately. Second, it fields <tt>#include</tt> statements, and recurses to process include files. Third, it fields <tt>#define</tt> statements and manages the macro definition tables. Finally, it checks the tokens it identifies and calls the macro/argument expansion module to expand them if they turn out to be macros. 
<P> The conditional compilation logic in the token scanner is carried out in its entirety by syntactic means. The only C code involved deals with evaluating conditional statements. <tt>#ifdef</tt> and <tt>#ifndef</tt> are quite straightforward. <tt>#if</tt> is another matter. To deal with the generality of this statement, token scanner output is diverted to the expression evaluator module, <tt>ex.syn</tt>, where the expression is evaluated. The outcome of the calculation is then used to control a semantically determined production in the token scanner. <P> Processing <tt>#include</tt> statements is reasonably straightforward. Token scanner output is diverted to the token accumulator, <tt>ta</tt>. The content of the token accumulator is then translated back to ASCII string form. This takes care of macro calls in the <tt>#include</tt> statement. Once the file has been identified, <tt>scan_input()</tt> is called recursively to deal with it. <P> The only complication with macro definitions is that the tokens which comprise the body of a macro must not be expanded until the macro is invoked. For that reason, there are two different definitions of token in the token scanner: "simple token" and "expanded token". The difference is that simple tokens are not checked for macro calls. When a macro definition is encountered, the token scanner output is diverted to the token accumulator, so that the body of the macro can be captured and stored. <P> When a macro call is recognized, the token scanner must pick up the arguments for the macro. There are three complications here: First, the tokens must not be scanned for macros; second, the scan must distinguish the commas that separate arguments from commas that may be contained inside balanced parentheses within an argument; and finally, leading white space tokens do not count as argument tokens. 
<P> <BR> <H2> Elements of the Token Scanner </H2> The remainder of this document describes the macro definitions, the structure definitions, the static data definitions, all configuration parameter settings, and all non-terminal parsing tokens used in the token scanner. It also explains each configuration parameter setting in the syntax file. In <tt>ts.syn</tt>, each function that is defined is preceded by a short explanation of its purpose. <P> <BR> <H2> Macro definitions </H2> <DL> <DT> <tt>GET_CONTEXT</tt> <DD> The <tt>GET_CONTEXT</tt> macro provides the parser with context information for the input character. (Instead of writing a <tt>GET_CONTEXT</tt> macro, the context information could be stored as part of <tt>GET_INPUT</tt>.) <DT> <tt>GET_INPUT</tt> <DD> The <tt>GET_INPUT</tt> macro provides the next input character for the parser. If the parser used <b>pointer input</b> or <b>event driven</b> input, a <tt>GET_INPUT</tt> macro would not be necessary. The default for <tt>GET_INPUT</tt> would read <tt>stdin</tt> and so is not satisfactory for this parser. <DT> <tt>PCB</tt> <DD> Since the <b>declare pcb</b> switch has been turned off, AnaGram will not define <tt>PCB</tt>. Making the parser control block part of the file descriptor structure simplifies saving and restoring the pcb for nested #include files. <DT> <tt>SYNTAX_ERROR</tt> <DD> <tt>ts.syn</tt> defines the <tt>SYNTAX_ERROR</tt> macro, since otherwise the generated parser would use the default definition of <tt>SYNTAX_ERROR</tt>, which would not provide the name of the file currently being read. </DL> <P> <BR> <H2> Local Structure Definitions </H2> <DL><DT> <tt>location</tt> <DD> <tt>location</tt> is a structure which records a line number and a column number. It is handed to AnaGram with the context type statement found in the configuration segment. AnaGram then declares two member fields of type <tt>location</tt> in the parser control block: <tt>input_context</tt> and a stack, <tt>cs</tt>. 
In <tt>scan_input()</tt>, the <tt>input_context</tt> variable is set explicitly with the current line and column number. In <tt>syntax_error()</tt> the <tt>CONTEXT</tt> macro is used to extract the line and column number at which the rule currently being reduced started. <DT> <tt>file_descriptor</tt> <DD> <tt>file_descriptor</tt> contains the information that needs to be saved and restored when nested include files are processed. </DL> <P> <BR> <H2> Static Variables </H2> <DL><DT> <tt>error_modifier</tt> <DD> Type: <tt>char *</tt><BR> The string identified by <tt>error_modifier</tt> is added to the error diagnostic printed by <tt>syntax_error()</tt>. Normally it is an empty string; however, when macros are being expanded it is set so that the diagnostic will specify that the error was found inside a macro expansion. <DT> <tt>input</tt> <DD> Type: <tt>file_descriptor</tt><BR> <tt>input</tt> provides the name and stream pointer for the currently active input file. <DT> <tt>save_sink</tt> <DD> Type: <tt>stack&lt;token_sink *&gt;</tt><BR> This stack provides for saving and restoring <tt>scanner_sink</tt> when it is necessary to divert the scanner output for dealing with conditional expressions, macro definitions and macro arguments. Actually, a stack is not necessary, since such diversions never nest more than one level deep, but it seems clearer to use a stack. </DL> <P> <BR> <H2> Configuration Parameters </H2> <DL><DT> <tt>~allow macros</tt> <DD> This statement turns off the <b>allow macros</b> switch so that AnaGram implements all reduction procedures as explicit function definitions. This simplifies debugging at the cost of a slight performance degradation. <DT> <tt>auto resynch</tt> <DD> This switch turns on automatic resynchronization in case a syntax error is encountered by the token scanner. <DT> <tt>context type = location</tt> <DD> This statement specifies that the generated parser is to track context automatically. 
The context variables have type <tt>location</tt>. <tt>location</tt> is defined elsewhere to consist of two fields: line number and column number. <DT> <tt>~declare pcb</tt> <DD> This statement tells AnaGram not to declare a parser control block for the parser. The parser control block is declared later as part of the <tt>file_descriptor</tt> structure. <DT> <tt>~error frame</tt> <DD> This turns off the error frame portion of the automatic syntax error diagnostic generator, since the context of the error in the scanner syntax is of little interest. If an error frame were to be used in diagnostics, that of the C parser would be more appropriate. <DT> <tt>error trace</tt> <DD> This turns on the <b>error trace</b> functionality, so that if the token scanner encounters a syntax error it will write an <tt>.etr</tt> file. <DT> <tt>line numbers</tt> <DD> This statement causes AnaGram to include <tt>#line</tt> statements in the parser file so that your compiler can provide diagnostics keyed to your syntax file. <DT> <tt>subgrammar</tt> <DD> The basic token grammar for C is usually implemented using some sort of regular expression parser, such as <tt>lex</tt>, which always looks for the longest match to the regular expression. In no case does the regular expression parser use what follows a match to determine the nature of the match. An LALR parser generator, on the other hand, normally looks not only at the content of a token but also looks ahead. The subgrammar declaration tells AnaGram not to look ahead but to parse these tokens based only on their internal structure. Thus the conflicts that would normally be detected are not seen. To see what happens if lookahead is allowed, simply comment out any one of these subgrammar statements and look at the conflicts that result. <DT> <tt>~test range</tt> <DD> This statement tells AnaGram not to check input characters to see if they are within allowable limits. 
This checking is not necessary since the token scanner is reading a text file and cannot possibly get an out of range token. </DL> <P> <BR> <H2> Scanner Tokens, in alphabetical order </H2> <DL><DT> any text <DD> These productions are used when skipping over text. "any text" consists of all characters other than eof, newline and backslash, as well as any character (including newline and backslash) that is quoted with a preceding backslash character. <DT> arg element <DD> An "arg element" is a token in the argument list of a macro. It is essentially the same as "simple token" except that commas must be detected as separators and nested parentheses must be recognized. An "arg element" is either a space or an "initial arg element". <DT> character constant <DD> A "character constant" is a quoted character or escape sequence. The token scanner does not inquire closely into the internal nature of the character constant. <DT> comment <DD> A "comment" consists of a comment head followed by the closing "*/". <DT> comment head <DD> A "comment head" consists of the entire comment up to the closing "*/". If a complete comment is found following a comment head, its treatment depends on whether one believes, with ANSI, that comments should not be nested, or whether one prefers to allow nested comments. Followers of the ANSI principle will want "comment head, comment" to reduce to "comment". Believers in nested comments will want to finish the comment that was in progress when the nested comment was encountered, so they will want "comment head, comment" to reduce to "comment head", which will allow the search for "*/" to continue. <DT> conditional block <DD> A "conditional block" is an #if, #ifdef, or #ifndef line and all following lines through the terminating #endif. If the initial condition turns out to be true, then everything has to be skipped following an #elif or #else line. 
If the initial condition is false, everything has to be skipped until a true #elif condition or an #else line is found. <DT> confusion <DD> This token is designed to deal with a curious anomaly of C. Integers which begin with a zero are octal, but floating point numbers may have leading zeroes without losing their fundamental decimal nature. "confusion" is an octal integer that is followed by an eight or a nine. This will become legitimate if eventually a decimal point or an exponent field is encountered. <DT> control line <DD> "control line" consists of any preprocessor control line other than those associated with conditional compilation. <DT> decimal constant <DD> A "decimal constant" is a "decimal integer" and any following qualifiers. <DT> decimal integer <DD> The digits which comprise the integer are pushed onto the string accumulator. When the integer is complete, the string will be entered into the token dictionary and subsequently it will be described by its index in the token dictionary. <DT> defined <DD> See "expanded word". id_macro will recognize "defined" only when the if_clause switch is set. <DT> eof <DD> end of file: equal to the null character. <DT> eol <DD> end of line: a newline and all immediately following white space or newline characters. eol is declared to be a subgrammar since it is used in circumstances where space can legitimately follow, according to the syntax as written. <DT> else if header <DD> This production is simply a portion of the rule for the #elif statement. It is separated out in order to provide a hook on which to hang the call to init_condition(), which diverts scanner output to the expression_evaluator which will calculate the value of the conditional expression. <DT> else section <DD> An "else section" is an #else line and all immediately following complete sections. 
An "else section" and a "skip else section" are the same except that in an "else section" tokens are sent to the scanner output and in a "skip else section" they are discarded. <DT> endif line <DD> An "endif line" is simply a line that begins with #endif. <DT> expanded token <DD> The word "token" is used here in the sense of Kernighan and Ritchie, 2nd Edition, Appendix A, p. 191. In this program a "simple token" is one which is simply passed on without regard to macro processing. An "expanded token" is one which has been checked to see if it is a macro identifier and, if so, expanded. "simple tokens" are recognized only in the bodies of macro definitions. Therefore spaces and '#' characters are passed on. For "expanded tokens" they are discarded. <DT> expanded word <DD> This is the treatment of a simple identifier as an "expanded token". "variable", "simple macro", "macro", and "defined" are the various outcomes of semantic analysis of "name string" performed by id_macro(). In this case reserved words and identifiers which are not the names of macros are subsumed under the rubric "variable". These tokens are simply passed on to the scanner output. <P> The distinction between "macro" and "simple macro" depends on whether the macro was defined with or without following parentheses. A "simple macro" is expanded by calling expand(). expand() simply serves as a local interface to the expand_text() function defined in <tt>mas.syn</tt>. <P> If a "macro" was defined with parentheses but appears bereft of an argument list, it is treated as a simple identifier and passed on to the output. Otherwise the argument tokens for the macro are gathered and stacked on the token accumulator, using "macro arg list". Finally, the macro is expanded in the same way as a "simple macro". Note that "macro arg list" provides a count of the number of arguments found inside the balanced parentheses. 
<P> If "if_clause" is set, it means that the conditional expression of an #if or #elif line is being evaluated. In this case, the pseudo-function defined() must be recognized to determine whether a macro has or has not been defined. The defined() function returns a "1" or "0" token depending on whether the macro has been defined. <DT> exponent <DD> This is simply the exponent field on a floating point number with optional sign. <DT> false condition <DD> The "true condition" and "false condition" tokens are semantically determined. They consist of #if, #ifdef, or #ifndef lines. If the result of the test is true the reduction token is "true condition", otherwise it is "false condition". <DT> false else condition <DD> The "true else condition" and "false else condition" tokens are semantically determined. They consist of an #elif line. If the value of the conditional expression is true the reduction token is "true else condition", otherwise it is "false else condition". <DT> false if section <DD> A "false if section" is a #if, #ifdef, or #ifndef condition that turns out to be false, followed by any number, including zero, of complete sections or false #elif condition lines. All of the text within a "false if section" is discarded. <DT> floating qualifier <DD> These productions are simply the optional qualifiers to specify that a constant is to be treated as a float or as a long double. <DT> hex constant <DD> A "hex constant" is simply a "hex integer" plus any following qualifiers. <DT> hex integer <DD> The digits which comprise the integer are pushed onto the string accumulator. When the integer is complete, the string will be entered into the token dictionary and subsequently it will be described by its index in the token dictionary. <DT> if header <DD> This production is simply a portion of the rule for the #if statement. 
It is separated out in order to provide a hook on which to hang the call to init_condition(), which diverts scanner output to the expression evaluator which will calculate the value of the conditional expression. <DT> include header <DD> "include header" simply represents the initial portion of an #include line and provides a hook for a reduction procedure which diverts scanner output to the token accumulator. This diversion allows the text which follows #include to be scanned for macros and accumulated. The include_file() function will be called to actually identify and scan the specified file. <DT> initial arg element <DD> In gathering macro arguments, spaces must not be confused with a true argument. Therefore, the arg element token is broken down into two pieces so that each argument begins with a nonblank token. <DT> input file <DD> This is the grammar, or start token. It describes the entire file as alternating sections and eols, terminated by an eof. <DT> integer constant <DD> These productions simply gather together the varieties of integer constants under one umbrella. <DT> integer qualifier <DD> These productions are simply the optional qualifiers to specify that an "integer constant" is to be treated as unsigned, long, or both. <DT> macro <DD> See "expanded word". id_macro specifies "macro" or "simple macro" depending on whether the named macro was defined with or without following parentheses. <DT> macro arg list <DD> A "macro arg list" can be either empty or can consist of any number of token sequences separated by commas. Commas that are protected by nested parentheses do not separate arguments. Argument strings are accumulated on the token accumulator and counted by "macro args". <DT> macro args <DD> Each argument to a macro is gathered on a separate level of the token accumulator, so the token accumulator level is incremented before each argument, and the arguments are counted. 
<DT> macro definition header <DD> The "macro definition header" consists of the #define line up to the beginning of the body text of the macro. It serves as a hook to call init_macro_def() which begins the macro definition and diverts scanner output to the token accumulator. The macro definition will be completed by the save_macro_body() function once the entire macro body has been accumulated. Note that the tokens for the macro body are not examined for macro calls. <DT> name string <DD> "name string" is simply an accumulation on the string accumulator of the characters which make up an identifier. <DT> nested elements <DD> "nested elements" are "arg elements" that are found inside nested parentheses. <DT> not control mark <DD> This consists of any input character excepting eof, newline, backslash and '#', but including any of these if preceded by a backslash. It serves, at the beginning of a line, to distinguish ordinary lines of text from preprocessor control lines. <DT> octal integer <DD> The digits which comprise the integer are pushed onto the string accumulator. When the integer is complete, the string will be entered into the token dictionary and subsequently it will be described by its index in the token dictionary. <DT> operator <DD> This is simply an inventory of all the multi-character operators in C. <DT> parameter list <DD> "parameter list" is simply a wrapper about "names" which allows for empty parentheses. Note that both the "names" token and the "parameter list" tokens provide the count of the number of parameter names found inside the parentheses. The names themselves have been stacked on the string accumulator. <DT> qualified real <DD> This production exists to allow the "floating qualifier" to be appended to a "real constant". <DT> real <DD> These productions itemize the various ways of writing a floating point number with and without decimal points and with and without exponent fields. 
<DT> real constant <DD> This production is simply an envelope to contain "real" and write the output code once instead of four times. <DT> section <DD> This is a logical block of input. It is either a single line of ordinary code, a control line such as #define or #undef, or an entire conditional compilation block, i.e., everything from the #if to the closing #endif. Notice that the eol that terminates a "section" is not part of the "section". The only difference between a "section" and a "skip section" is that in a "section", all tokens are sent to the scanner output while in a "skip section", all input is discarded. <DT> separator <DD> This is simply a gathering together of all the tokens that are neither white space nor identifiers, since they are treated uniformly throughout the grammar. <DT> simple macro <DD> See "expanded word". <DT> simple real <DD> A "simple real" is one which has a decimal point and has digits on at least one side of the decimal point. Unaccompanied decimal points will be turned away at the door. <DT> simple token <DD> The word "token" is used here in the sense of Kernighan and Ritchie, 2nd Edition, Appendix A, p. 191. In this program a "simple token" is one which is simply passed on without regard to macro processing. An "expanded token" is one which has been checked to see if it is a macro identifier and, if so, expanded. "simple tokens" are recognized only in the bodies of macro definitions. Therefore spaces and '#' characters are passed on. For "expanded tokens" they are discarded. <DT> skip else line <DD> For purposes of skipping over complete conditional sections #elif and #else lines are equivalent. <DT> skip else section <DD> A "skip else section" consists of the #else or #elif line following a satisfied conditional and all subsequent sections and #elif and #else lines. All input in the "skip else section" is discarded. 
<DT> skip if section <DD> A "skip if section" consists of an #if, #ifdef, or #ifndef line, and all following complete "sections" (represented as "skip sections", so their content will be ignored) and #else and #elif lines. <DT> skip line <DD> When skipping text, we have to distinguish between lines which begin with the control mark ('#') and those which don't so that we deal correctly with nested #endif statements. We wouldn't want to terminate a block of uncompiled code with the wrong #endif. <DT> skip section <DD> A "skip section" is simply a "section" that follows an unsatisfied conditional. In a "skip section", all input is discarded. <DT> space <DD> space consists of either a blank or a comment. If a comment is found, it is replaced with a blank. <DT> simple chars <DD> "simple chars" consists of the body of a character constant up to but not including the final quote. <DT> string chars <DD> "string chars" consists of the body of a string literal up to but not including the final double quote. <DT> string literal <DD> A "string literal" is simply a quoted string. It is accumulated on the string accumulator. <DT> true condition <DD> The "true condition" and "false condition" tokens are semantically determined. They consist of #if, #ifdef, or #ifndef lines. If the result of the test is true the reduction token is "true condition", otherwise it is "false condition". <DT> true else condition <DD> The "true else condition" and "false else condition" tokens are semantically determined. They consist of an #elif line. If the value of the conditional expression is true the reduction token is "true else condition", otherwise it is "false else condition". 
<DT> true if section <DD> A "true if section" is a true #if, #ifdef, or #ifndef, followed by any number of complete sections, including zero. Alternatively, it could be a "false if section" that is followed by a true #elif condition, followed by any number of complete "sections". All input in a "true if section" subsequent to the true condition is passed on to the scanner output. <DT> variable <DD> See "expanded word". <DT> word <DD> This is the treatment of a simple identifier as a "simple token". The name_token() procedure is called to pop the name string from the string accumulator, identify it in the token dictionary and assign a token_id to it by checking to see if it is a reserved word. <DT> ws <DD> The definition for ws as space... simply allows a briefer reference in those places in the grammar where it is necessary to skip over white space. </DL> <P> <BR> <IMG ALIGN="bottom" SRC="../../images/rbline6j.gif" ALT="----------------------" WIDTH=1010 HEIGHT=2 > <P> <IMG ALIGN="right" SRC="../../images/pslrb6d.gif" ALT="Parsifal Software" WIDTH=181 HEIGHT=25> <BR CLEAR="right"> <P> Back to : <A HREF="../../index.html">Index</A> | <A HREF="index.html">Macro preprocessor overview</A> <P> <ADDRESS><FONT SIZE="-1"> AnaGram parser generator - examples<BR> Token Scanner - Macro preprocessor and C Parser <BR> Copyright © 1993-1999, Parsifal Software. <BR> All Rights Reserved.<BR> </FONT></ADDRESS> </BODY> </HTML>