Mercurial > ~dholland > hg > ag > index.cgi
view doc/misc/html/gloss.html @ 8:ec2b657edf13
Add explicit lint-comment-style fallthrough annotations.
GCC now assumes that if you don't have these you're making a mistake,
which is annoying.
XXX: This changeset updates the AG output files only (by hand) and is
XXX: abusive - rebuilding them will erase the change. However, I need
XXX: to get things to build before I can try to get AG to issue the
XXX: annotations itself, so this seems like a reasonable expedient.
author | David A. Holland |
---|---|
date | Mon, 30 May 2022 23:51:43 -0400 |
parents | 13d2b8934445 |
children |
line wrap: on
line source
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> <HTML> <HEAD> <TITLE>Glossary of Parsing Terms</TITLE> </HEAD> <BODY BGCOLOR="#ffffff" BACKGROUND="tilbl6h.gif" TEXT="#000000" LINK="#0033CC" VLINK="#CC0033" ALINK="#CC0099"> <P> <IMG ALIGN="right" SRC="images/agrsl6c.gif" ALT="AnaGram" WIDTH=124 HEIGHT=30> <BR CLEAR="RIGHT"> </P> Back to <A HREF="index.html">Index</A> <P> <IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" WIDTH=1010 HEIGHT=2 > <P> <H2>Glossary of Parsing Terms</H2> <IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" WIDTH=1010 HEIGHT=2 > </P> <P> <TABLE WIDTH="100%" ALIGN="center" CELLPADDING=15 CELLSPACING=15> <TR ALIGN="left"> <TD VALIGN=TOP NOWRAP WIDTH="50%"> <A HREF="#ParserAction">Action</A><BR> <A HREF="#AcceptAction">Accept Action</A><BR> <A HREF="#Associativity">Associativity</A><BR> <A HREF="#Backtracking">Backtracking</A><BR> <A HREF="#BNF">Backus-Naur Form</A><BR> <A HREF="#BinaryOperator">Binary operator</A><BR> <A HREF="#Binding">Binding</A><BR> <A HREF="#CharacterSets">Character sets</A><BR> <A HREF="#Universe">Character Universe</A><BR> <A HREF="#CharacteristicRule">Characteristic Rule</A><BR> <A HREF="#CharacteristicToken">Characteristic Token</A><BR> <A HREF="#SetComplement">Complement of a set</A><BR> <A HREF="#CompletedRule">Completed Rule</A><BR> <A HREF="#Conflict">Conflict</A><BR> <A HREF="#ContextFreeGrammar">Context Free Grammars</A><BR> <A HREF="#DefaultReduction">Default reductions</A><BR> <A HREF="#SetDifference">Difference of two sets</A><BR> <A HREF="#ErrorAction">Error Action</A><BR> <A HREF="#EventDriven">Event Driven Parser</A><BR> <A HREF="#ExpansionRule">Expansion Rule</A><BR> <A HREF="#Grammar">Grammar</A><BR> <A HREF="#GrammarAnalysis">Grammar Analysis</A><BR> <A HREF="#GrammarRule">Grammar Rule</A><BR> <A HREF="#GrammarToken">Grammar Token</A><BR> <A HREF="#SetIntersection">Intersection</A><BR> <A HREF="#LALRParsing">LALR parsing</A><BR> <A HREF="#LexicalScanner">Lexical Scanner</A><BR> <A HREF="#Lookahead">Lookahead Token</A><BR> <A HREF="#LRParsing">LR Parsing</A><BR> <A HREF="#MarkedRule">Marked Rule</A><BR> <A HREF="#NonAssociative">Non-associative</A><BR> </TD> <TD VALIGN=TOP NOWRAP WIDTH="50%"> <A HREF="#Nonterminal">Nonterminal Token</A><BR> <A HREF="#NullProduction">Null Production</A><BR> <A HREF="#ParameterAssignment">Parameter Assignment</A><BR> <A HREF="#Parser">Parser</A><BR> <A HREF="#ParserGenerator">Parser Generator</A><BR> <A HREF="#ParserStateStack">Parser State Stack</A><BR> <A HREF="#ParserValueStack">Parser Value Stack</A><BR> <A HREF="#ParsingEngine">Parsing Engine</A><BR> <A HREF="#SetPartition">Partition</A><BR> <A HREF="#Precedence">Precedence</A><BR> <A HREF="#Production">Production</A><BR> <A HREF="#ReduceAction">Reduce Action</A><BR> <A HREF="#ReduceReduce">Reduce-Reduce Conflict</A><BR> <A HREF="#ReducingToken">Reducing Token</A><BR> <A HREF="#ReductionProcedure">Reduction Procedure</A><BR> <A HREF="#ReductionToken">Reduction Token</A><BR> <A HREF="#Resynchronization">Resynchronization</A><BR> <A HREF="#RuleElement">Rule Element</A><BR> <A HREF="#SemanticAction">Semantic Action</A><BR> <A HREF="#SemanticValue">Semantic Value</A><BR> <A HREF="#SemanticallyDetermined">Semantically Determined Production</A><BR> <A HREF="#SetExpression">Set Expression</A><BR> <A HREF="#ShiftAction">Shift Action</A><BR> <A HREF="#ShiftReduce">Shift-Reduce Conflict</A><BR> <A HREF="#SyntaxDirectedParsing">Syntax Directed Parsing</A><BR> <A HREF="#Terminal">Terminal Token</A><BR> <A HREF="#Token">Token</A><BR> <A HREF="#UnaryOperator">Unary Operator</A><BR> <A HREF="#SetUnion">Union</A><BR> <A HREF="#VirtualProduction">Virtual Production</A><BR> </TD> </TABLE> <HR> <DL> <DT><B><A NAME="ParserAction">Action</A></B></DT> <DD>A "parser action" is one of the basic executable elements of a <A HREF="#ParsingEngine">parsing engine</A>. In a traditional parsing engine there are four actions: the <A HREF="#ShiftAction">shift action</A>, the <A HREF="#ReduceAction">reduce action</A>, the <A HREF="#AcceptAction">accept action</A>, and the <A HREF="#ErrorAction">error action</A>. The parsing engines which AnaGram generates use a substantially greater number of actions, in order to provide certain short cuts in the interest of greater speed and greater compactness of the tables which control the action of the parsing engine. <P></P> </DD> <DT><B><A NAME="AcceptAction">Accept Action</A></B></DT> <DD> The accept <A HREF="#ParserAction">action</A> is one of the four actions of a traditional <A HREF="#ParsingEngine">parsing engine</A>. The accept action is performed when the <A HREF="#Parser">parser</A> has succeeded in identifying the <A HREF="#GrammarToken">goal token</A> for the <A HREF="#Grammar">grammar</A>. When the parser executes the accept action, it returns to the calling program. The accept action is thus the last action of the parsing engine and occurs only once for each successful execution of the parser. <P></P> </DD> <DT><B><A NAME="Associativity">Associativity</A></B></DT> <DD> This is a term used to describe <A HREF="#BinaryOperator">binary operators </A>. It describes how you interpret a sequence of operations which all involve the same operator. Consider a - b - c, for instance. Here the convention is that we subtract b from a, and then c from the difference. We say that subtraction is <B>left associative</B>, since if there is a sequence of subtraction operations, the leftmost one is to be performed first. As a second example, consider exponentiation. FORTRAN uses ** as an operator to indicate exponentiation. In order to conform as closely as possible to ordinary mathematical notation, the expression a**b**c is interpreted to mean that first b is raised to the power c, and then the resultant value is used as the power to which a is raised. We say that exponentiation is <B>right associative</B> since the rightmost of a sequence of exponentiation operations is to be performed first. If a programming language does not allow for unparenthesized sequences of a particular operator, the operator is said to be non-associative.<P> Another way to view left and right associativity is as implied parenthesization. Left associative operators are parenthesized from the left. Right associative operators are parenthesized from the right. Thus a - b - c = ((a-b) - c) and a**b**c = (a**(b**c))</P><P> AnaGram offers two ways to specify associativity of <A HREF="#BinaryOperator">binary operators</A>. The first way is to use conventional <A HREF="#BNF">Backus-Naur Form</A> syntax descriptions. You would userecursion to describe both subtraction and exponentiation. Subtraction, since it is left associative, uses left recursion, and exponentiation, being right associative, uses right recursion. Thus</P><P> </P><PRE> difference -> value | difference, '-', value exponential -> value | value, "**", exponential </PRE> could be used to describe differences and exponents.<P> The second way to specify associativity is to use an ambiguous <A HREF="#Grammar">grammar</A> and precedence declarations. (See Chapter 9, AnaGram User's Guide.)</P><P> </P></DD> <DT><B><A NAME="Backtracking">Backtracking</A></B></DT> <DD> In order to make parse tables more compact and <A HREF="#Parser">parsers</A> faster, it is common to use <A HREF="#DefaultReduction">default reductions</A>. In case of error, it is necessary to undo default reductions before diagnostics can be properly determined. In AnaGram, this undo operation is called backtracking.<P> </P></DD> <DT><B><A NAME="BNF">Backus-Naur Form</A></B></DT> <DD> Backus-Naur Form, or BNF, is a conventional notation for describing <A HREF="#ContextFreeGrammar">context free grammars</A>. AnaGram uses an extended notation for <A HREF="#Grammar">grammars</A>, which, except for <A HREF="#SemanticallyDetermined">semantically determined productions</A>, can be shown to be equivalent to BNF. The term "BNF" is often used colloquially to refer to a grammar specification.<P> AnaGram's syntax specification language differs from BNF in the following respects: </P><OL> <LI>In conventional BNF, a symbol not enclosed in angle brackets (< >) is taken to represent itself literally. In AnaGram, literal character representations must be enclosed in single quotes and literal strings within double quotes. </LI> <LI>In BNF, the right side of a <A HREF="#Production">production</A> is simply a list of symbols without any delimiters or punctuation. AnaGram requires that the <A HREF="#RuleElement">rule elements</A> which make up a <A HREF="#GrammarRule">grammar rule</A>, or right side, be joined by commas. </LI> <LI>BNF makes no provision for identifying <A HREF="#ReductionProcedure">reduction procedures</A> or their arguments. AnaGram provides both reduction procedures, introduced by an "=" at the end of the production, and named arguments, introduced by a ":" at the end of any <A HREF="#Token">token</A> on the right side of the production. </LI> <LI>AnaGram allows <A HREF="#VirtualProduction">virtual productions</A> to be used freely. </LI> <LI>BNF is "pure" in that if you wish to define a <A HREF="#Nonterminal">nonterminal token</A> called "digit" to represent the digits from zero to nine, you must provide ten explicit productions to define it. AnaGram treats the concept of <A HREF="#Terminal">"terminal token"</A> as used in language theory as an abstraction, and interposes <A HREF="#CharacterSets">character set</A> functionality between actual character input and the terminal tokens of BNF. You can define digit to be the character range '0-9', for instance, and AnaGram will determine and define precisely the minimum number of productions necessary, taking into account any other usage of the characters '0-9' in your grammar. This makes your grammar more compact, more manageable, and easier to modify. </LI> <LI>AnaGram allows for <A HREF="#SemanticallyDetermined"> semantically determined productions</A>, which provide a significant mechanism for melding semantic analysis with syntactic analysis. </LI> </OL> </DD> <DT><B><A NAME="BinaryOperator">Binary operator</A></B></DT> <DD> A binary operator is an operator that works on two operands to create a result. It is often written between its operands and is sometimes called an infix operator. In ordinary programming languages "+" and "-" are binary operators. <P></P> </DD> <DT><B><A NAME="Binding">Binding</A></B></DT> <DD> Most programming languages, rather than executing arithmetic operators in simple left to right order, conform to the traditional conventions of ordinary algebra, which dictate that, except where parenthesis indicate otherwise, exponentiation is done before multiplication, multiplication before addition, and addition before comparison. One says that exponentiation is "more tightly binding" than multiplication, multiplication is more tightly binding than addition, and so on. The sense of the word here is that the operator binds together its operands into a single entity. An operand which falls between two different operators is "bound" by the more tightly binding operator. An operator that is more tightly binding than another is also said to have higher <A HREF="#Precedence">precedence</A>. <P></P> </DD> <DT><B><A NAME="CharacterSets">Character sets</A></B></DT> <DD> One of the traditional problems with syntax directed programming is that caused by the fact that most formal language specifications have productions of the form: <PRE> letter -> a | b | c | d ... | z </PRE> Since the letters are not often distinguished in the <A HREF = "#Grammar">grammar</A>, this large number of essentially identical <A HREF="#Production">productions</A> causes a correspondingly large number of states in the <A HREF="#Parser">parser</A>. This problem is often attacked by using a<A HREF="#LexicalScanner"> "lexical scanner"</A> which simply specifies a "letter token" whose value indicates precisely which letter is intended. This works fine as long as there is nothing in the grammar which distinguishes any letter at all. If there is, and there is usually some special use of h, x, q, o, e, or even a-f, the situation gets more complicated. Often the result is to abandon the use of syntax directed programming for elemental units of the language and use a more complex lexical scanner which identifies names, numbers, key words and operators. Such lexical scanners are often built using conventional programming languages or simple pattern recognizing languages such as LEX.<P> AnaGram avoids this problem by incorporation the notion of character set into its input specifications. Briefly, the way it works is the following: You specify the set of characters which make up any ostensibly <A HREF="#Terminal">terminal token</A>. AnaGram then analyzes the overlaps among these definitions and creates a minimal set of <A HREF="#Token">tokens</A> which match your specifications exactly. For instance, suppose you have the following definitions: </P><PRE> letter = 'a-z' + 'A-Z' hex digit = '0-9' + 'a-f' + 'A-F' </PRE> AnaGram will define a token for letter which has two productions: <PRE> letter -> 'a-f' + 'A-F' | 'g-z' + 'G-Z' </PRE> With this technique, you have the freedom to specify your grammar in the easiest possible manner, without the penalty of an absurdly large, slow parser.<P> Character sets may be specified in terms of ranges of characters, as in the above example, as <A HREF="#SetUnion">unions</A>, denoted by "+", as <A HREF="#SetIntersection">intersections</A>, denoted by "&", as <A HREF="#SetDifference">differences,</A> denoted by "-", and as <A HREF="#SetComplement">complements</A>, using "~". You may also specify a set consisting of a single character either with a literal character or with a numeric specification. If you specify a character numerically, you may use either decimal, octal or hexadecimal notation, as in C. </P><P></P> </DD> <DT><B><A NAME="Universe">Character Universe</A></B></DT> <DD> In ordinary set theory, the "universe" is the set of all entities which may be elements of a set. It must be defined so that the complement of a set can mean something unambiguous. In AnaGram, the character universe normally comprises the ascii characters and the IBM extended character set, that is, all characters in the range from 0 through 255. If, however, you have defined input characters outside this range, the character universe will be extended to the least character you have used and to the greatest character you have used in your <A HREF = "#Grammar">grammar</A>. <P></P> </DD> <DT><B><A NAME="CharacteristicRule">Characteristic Rule</A></B></DT> <DD> Each parser state is characterized by a particular set of <A HREF="#GrammarRule">grammar rules</A>, and for each such rule, a marked token which is the next <A HREF="#Token">token</A> expected. These rules are called the characteristic rules for the state. In the course of doing <A HREF="#GrammarAnalysis">grammar analysis</A>, AnaGram determines the characteristic rules for each parser state. After analyzing your grammar, you may inspect the State Definition Table to see the characteristic rules for any state in your <A HREF="#Parser">parser</A>. <P></P> </DD> <DT><B><A NAME="CharacteristicToken">Characteristic Token</A></B></DT> <DD> Every state in a <A HREF="#Parser">parser</A>, except state zero, can be characterized by the one, unique <A HREF="#Token">token</A> which causes a jump to that state. That token is called the characteristic token of the state, because to get to that parser state you must have just seen precisely that token in the input. Note that several states could have the same characteristic token.<P> When you have a list of states, such as is given by the parser state stack, it is equivalent to a list of characteristic tokens. This list of tokens is the list of tokens that have been recognized so far by the parser. Some of these tokens, of course, may be <A HREF="#Nonterminal">nonterminal tokens</A> and may thus represent the result of reducing a sequence of previously recognized tokens. </P> <P></P> </DD> <DT><B><A NAME="SetComplement">Complement of a set</A></B></DT> <DD> In set theory, the complement of a set, S, is the collection of all elements of the <A HREF="#Universe">character universe</A> which are not elements of S. Using AnaGram's notation, the complement of S is written ~S. The complement operator has higher <A HREF="#Precedence">precedence</A> than the <A HREF="#SetDifference">difference</A>, <A HREF="#SetIntersection">intersection</A>, or <A HREF="#SetUnion">union</A> operators. <P></P> </DD> <DT><B><A NAME="CompletedRule">Completed Rule</A></B></DT> <DD> A "completed rule" is a <A HREF="#CharacteristicRule">characteristic rule</A> whose index is pointing beyond the last <A HREF="#RuleElement">rule element</A>. In other words, it has been completely matched and will be <A HREF="#ReduceAction">reduced</A> by the next input. <P> If a state has more than one completed rule, the decision as to which to reduce is made based on the next input <A HREF="#Token">token</A>. If there is only one completed rule in a state, it will be reduced by default unless the default reductions switch has been reset, i.e., turned off. </P></DD> <DT><B><A NAME="Conflict">Conflict</A></B></DT> <DD> Conflicts arise during a <A HREF="#GrammarAnalysis">grammar analysis</A> when AnaGram cannot determine how to treat a given input <A HREF="#Token">token</A>. There are two sorts of conflicts: <A HREF="#ShiftReduce">shift-reduce conflicts</A> and <A HREF="#ReduceReduce">reduce-reduce conflicts</A>. Conflicts may arise either because the <A HREF = "#Grammar">grammar</A> is inherently ambiguous, or simply because the grammar analyzer cannot look far enough ahead to resolve the conflict. In the latter case, it is often possible to rewrite the grammar in such a way as to eliminate the conflict. <A HREF="#NullProduction">Null productions</A> are a common source of conflict. <P> There are a number of ways to deal with conflicts. If you understand the conflict well, you may simply choose to ignore it. When AnaGram encounters a shift-reduce conflict while building parse tables it resolves it by choosing the <A HREF="#ShiftAction">shift action</A>. When AnaGram encounters a reduce-reduce conflict while building parse tables, it resolves it by selecting the <A HREF="#GrammarRule">grammar rule</A> which occurred first in the grammar. </P><P> A second way to deal with conflicts is to set operator <A HREF="#Precedence">precedence</A> parameters. If you set these parameters, AnaGram will use them preferentially to resolve conflicts. Any conflicts so resolved will be listed in the Resolved Conflicts window. </P><P> A third way to resolve a conflict is to declare some tokens as sticky or as subgrammars. This is particularly useful for <A HREF="#Production">productions</A> whose sole purpose is to skip over uninteresting input.</P><P> </P><P> The fourth way to deal with conflicts is to rewrite the grammar to eliminate them. Many people prefer this approach since it yields the highest level of confidence in the resulting program. </P></DD> <DT><B><A NAME="ContextFreeGrammar">Context Free Grammars</A></B></DT> <DD> Context free grammars are <A HREF = "#Grammar">grammars</A> wherein the definition of a grammar unit, or <A HREF="#Nonterminal">nonterminal token</A>, does not depend on the context in which the nonterminal is used. That means that if you define "widget" in a certain way, that definition is valid in all contexts in which "widget" might occur. Context free grammars can be represented in <A HREF="#BNF">Backus-Naur Form</A>. AnaGram's syntax specification language has no facility for representing grammars which are not context free. <P></P> </DD> <DT><B><A NAME="DefaultReduction">Default reductions</A></B></DT> <DD> A "default reduction" is a parser action which may be used in your <A HREF="#Parser">parser</A> in any state which has precisely one <A HREF="#CompletedRule">completed rule</A>. <P> If a given parser state has, among its <A HREF="#CharacteristicRule">characteristic rules</A>, exactly one completed rule, it is usually faster to reduce it on any input than to check specifically for correct input before reducing it. The only time this default reduction causes trouble is in the event of a syntax error. In this situation you may get an erroneous reduction. Normally when you are parsing a file, this is inconsequential because you are not going to continue performing <A HREF="#SemanticAction">semantic actions</A> in the presence of errors. But, if you are using your parser to handle real-time interactive input, you have to be able to continue semantic processing after notifying your user that he has entered erroneous input. In this case you would want default reductions to have been turned off so that <A HREF="#Production">production</A>s are reduced only when there is correct input. </P></DD> <DT><B><A NAME="SetDifference">Difference of two sets</A></B></DT> <DD> In set theory, the difference of two sets, A and B, is defined to be the set of all elements of A that are not elements of B. In an AnaGram syntax file, you represent the difference of two <A HREF="#CharacterSets">character sets</A> by using the "-" operator. Thus the difference of A and B is A - B. The difference operator is <A HREF="#Associativity">left associative</A>. <P></P> </DD> <DT><B><A NAME="ErrorAction">Error Action</A></B></DT> <DD> The error action is one of the four <A HREF="#ParserAction">actions</A> of a traditional <A HREF="#ParsingEngine">parsing engine</A>. The error action is performed when the <A HREF="#Parser">parser</A> has encountered an input <A HREF="#Token">token</A> which is not admissible in the current state. The further behavior of a traditional parser is undefined. <P></P> </DD> <DT><B><A NAME="EventDriven">Event Driven Parser</A></B></DT> <DD> An <I>event driven</I> <A HREF="#Parser">parser</A> is one in which the relationship between the host program and the parser is turned inside out. In a conventional parser, the host program calls the parser, the parser analyzes the complete input text and returns to the host program only when it has finished with the entire input. <P>In an event driven parser, the parser does not read its input directly from a file or from memory. Instead, the host program, after initializing the parser, calls it once for each input token. Each time the parser is called, it updates its state appropriately, calls any <A HREF="#ReductionProcedure">reduction procedures</A> that need to be called and finally when it needs more input, returns to the host program. The effect is that parsing can occur in parallel with other processing performed by the host program. This technique is especially useful in situations where the token stream to be parsed is developed on the fly, as when using <A HREF="#LexicalScanner">lexical scanners</A>, for instance. <P></P> </DD> <DT><B><A NAME="ExpansionRule">Expansion Rule</A></B></DT> <DD> In analyzing a <A HREF = "#Grammar">grammar</A>, we are often interested in the full range of input that can be expected at a certain point. The expansion of a <A HREF="#Token">token</A> or state shows us all the expected input. An expansion yields a set of <A HREF="#MarkedRule">marked rules</A>. Looking at the marked token shows us what input to expect. <P> The set of expansion rules of a <A HREF="#Nonterminal">nonterminal token</A> shows all the expected input that can occur whenever the token appears in the grammar. The set consists of all the <A HREF="#GrammarRule">grammar rules</A> produced by the token, plus all the rules produced by the first token of any rule in the set. The marked tokens for the expansion rules of a token are always the first element in the rule. </P><P> The expansion of a state consists of its <A HREF="#CharacteristicRule">characteristic rules</A> plus the expansion rules of the marked token in each characteristic rule. </P> </DD> <DT><B><A NAME="Grammar">Grammar</A></B></DT> <DD> Traditionally, a grammar is a set of <A HREF="#Production">productions</A> which taken together specify precisely a set of acceptable input sequences in terms of an abstract set of <A HREF="#Terminal">terminal tokens</A>. The set of acceptable input sequences is often called the "language" defined by the grammar. <P> In AnaGram, the term grammar also includes configuration segments, definitions of <A HREF="#CharacterSets">character sets</A>, <A HREF="#VirtualProduction">virtual productions</A>, etc. that augment the collection of productions. A grammar is often called a "syntax", and the <A HREF="#GrammarRule">rules of the grammar</A> are often called syntactic rules. </P> <P></P> </DD> <DT><B><A NAME="GrammarAnalysis">Grammar Analysis</A></B></DT> <DD> The major function of AnaGram is the analysis of <A HREF = "#ContextFreeGrammar">context free grammars</A> written in a particular variant of <A HREF="#BNF">Backus-Naur Form</A>. <P> The analysis of a <A HREF="#Grammar">grammar</A> proceeds in four stages. In the first, the input grammar is analyzed and a number of tables are built which describe all of the <A HREF="#Production">productions</A> and components of the grammar. </P><P> In the second stage, AnaGram analyzes all of the <A HREF="#CharacterSets">character sets</A> defined in the grammar, and where necessary, defines auxiliary <A HREF="#Token">tokens</A> and productions. </P><P> In the third stage, AnaGram identifies all of the states of the <A HREF="#Parser">parser</A> and builds the go-to table for the parser. </P><P> In the fourth stage, Anagram identifies <A HREF="#ReductionToken">reduction tokens</A> for each completed <A HREF="#GrammarRule">grammar rule</A> in each state and checks for <A HREF="#Conflict">conflicts</A>. </P> <P></P> </DD> <DT><B><A NAME="GrammarRule">Grammar Rule</A></B></DT> <DD> A "grammar rule" is the right side of a <A HREF="#Production">production</A>. It consists of a sequence of <A HREF="#RuleElement">rule elements</A>. Each rule element identifies some <A HREF="#Token">token</A>, which can be either a <A HREF="#Terminal">terminal token</A> or <A HREF="#Nonterminal">nonterminal token</A>. It is "matched" by a corresponding sequence of tokens in the input stream to the <A HREF="#Parser">parser</A>. The left side of the production identifies one or more nonterminal tokens, or <A HREF="#ReductionToken">reduction tokens</A>, to which the rule reduces when matched. If there is more than one reduction token, there should be a <A HREF="#ReductionProcedure">reduction procedure</A> to make the choice. <P></P> </DD> <DT><B><A NAME="GrammarToken">Grammar Token</A></B></DT> <DD> The grammar <A HREF="#Token">token</A> is the token which represents the "top level" in your grammar. Some people refer to it as the "goal" or "goal token" and others as the "start token". Whatever it is called, it is the single token which describes the complete input to your <A HREF="#Parser">parser</A>. <P></P> </DD> <DT><B><A NAME="SetIntersection">Intersection</A></B></DT> <DD> In set theory, the intersection of two sets, A and B, is defined to be the set of all elements of A which are also elements of B. In an AnaGram syntax file, the intersection of two <A HREF="#CharacterSets">character sets</A> is represented with the "&" operator. The intersection operator has lower <A HREF="#Precedence">precedence</A> than the complement operator, but higher precedence than the <A HREF="#SetUnion">union</A> and <A HREF="#SetDifference">difference</A> operators. The intersection operator is <A HREF="#Associativity">left associative</A>. <P></P> </DD> <DT><B><A NAME="LALRParsing">LALR parsing</A></B></DT> <DD> LALR, or "lookahead LR" parsing, a slightly less powerful variant of <A HREF="#LRParsing">LR parsing</A>, produces much smaller parsing tables. Thus an LALR <A HREF="#Parser">parser</A> has very significant advantages in size over its LR counterpart. It is possible to construct <A HREF = "#Grammar">grammar</A>s which can be parsed by an LR parser but cannot be parsed by an LALR parser. That is, the LALR parser will find these grammars to be ambiguous and the LR parser will not. In practice, however, such grammars seem to be rare and the ambiguities can usually be eliminated by minor revisions. <P> AnaGram generates LALR parsers and, in addition, uses LALR parsing to interpret syntax files. </P><P></P> </DD> <DT><B><A NAME="LexicalScanner">Lexical Scanner</A></B></DT> <DD> A lexical scanner is a program or procedure used as a preprocessor for a <A HREF="#Parser">parser</A>. It scans input characters and lumps them together into <A HREF="#Token">tokens</A>. Many systems for generating parsers, most notably YACC, can scarcely be used without a lexical scanner. <P> AnaGram, although it can support the use of a lexical scanner, does not need a lexical scanner for most practical problems. AnaGram avoids the need for a lexical scanner by using <A HREF="#CharacterSets">character sets</A> and disregard and lexeme statements. </P><P> If your problem does in fact need a lexical scanner, you can use AnaGram itself to write it, so you don't need to know different languages for the scanner and for the parser. </P> <P></P> </DD> <DT><B><A NAME="Lookahead">Lookahead Token</A></B></DT> <DD> The current input <A HREF="#Token">token</A> to a <A HREF="#Parser">parser</A> is often called the "lookahead" token. In many situations, it is not shifted into the input buffer immediately, but acts like a catalyst to cause a number of rules to be reduced before it is eventually shifted in. <P></P> </DD> <DT><B><A NAME="LRParsing">LR Parsing</A></B></DT> <DD> An LR(k) <A HREF="#Parser">parser</A> is a type of deterministic bottom-up shift-reduce parser using at most k lookahead symbols to determine its next <A HREF="#ParserAction">action</A> at any point in the parse. If (k) is omitted, then k is assumed to be 1. Discussion of the technical details of LR(k) parsing is beyond the scope of this manual. The interested reader may consult any of a variety of works covering the subject. <P> AnaGram produces parsers which employ a variant of LR parsing known as LALR parsing. It is not necessary to understand the details of LR and LALR parsing theory in order to use AnaGram. </P> <P></P> </DD> <DT><B><A NAME="MarkedRule">Marked Rule</A></B></DT> <DD> A "marked rule" is a <A HREF="#GrammarRule">grammar rule</A> together with a "marked token" that indicates how much of the rule has already been matched and what input should be expected, starting with the marked token, if the remainder of the rule is to be matched. Each <A HREF="#Parser">parser</A> state is defined by a small number of <A HREF="#CharacteristicRule">characteristic rules</A> which indicate what matches have already been made and what input can be expected. Note that when a state has more than one characteristic rule, any two characteristic rules with the same number of <A HREF="#Token">tokens</A> to the left of the marked token match identically up to the marked token. If one rule has fewer tokens to the left than the other, all of its tokens preceding the marked token match the corresponding tokens of the longer rule exactly. <P> When marked rules are displayed in AnaGram windows, the marked token is displayed in a distinctive font which the developer can select. The default is italic. </P> <P></P> </DD> <DT><B><A NAME="NonAssociative">Non-associative</A></B></DT> <DD> A <A HREF="#BinaryOperator">binary operator</A>, say #, is said to be non-associative if the sequence x # y # z is not permissible. If this sequence is to be interpreted as (x#y)#z, the operator is said to be <B>left associative</B>. If the sequence is to be interpreted as x#(y#z), the operator is said to be <B>right associative</B>. (See <A HREF="#Associativity">associativity</A>.) <P></P> </DD> <DT><B><A NAME="Nonterminal">Nonterminal Token</A></B></DT> <DD> A "nonterminal token" is a <A HREF="#Token">token</A> which results from reducing the sequence of tokens which match a <A HREF="#GrammarRule">grammar rule</A> and replacing them with the appropriate <A HREF="#ReductionToken">reduction token</A>. Nonterminal tokens are to be distinguished from <A HREF="#Terminal">terminal tokens</A> or input tokens. <P></P> </DD> <DT><B><A NAME="NullProduction">Null Production</A></B></DT> <DD> A "null production" is one that has no <A HREF="#Token">tokens</A> on the right side whatsoever. Null <A HREF="#Production">production</A>s essentially are identified by the first following input token. Null productions are extremely convenient syntactic elements when you wish to make some input optional. For example, suppose that you wish to allow an optional semicolon at some point in your <A HREF = "#Grammar">grammar</A>. You could write the following pair of productions: <PRE> optional semicolon -> | ';' </PRE> Note that a null production can never follow a '|'. The above could also be written on multiple lines thus: <PRE> optional semicolon -> -> ';' </PRE> You can always rewrite your grammar to eliminate null productions if you wish, but you usually pay a price in conciseness and clarity. Sometimes, however, it is necessary to do such a rewrite in order to avoid <A HREF="#Conflict">conflicts</A>, to which null productions are especially prone. <P> If you have a null production with no <A HREF="#ReductionProcedure">reduction procedure</A> specified, your <A HREF="#Parser">parser</A> will automatically assign the value zero to the <A HREF="#ReductionToken">reduction token</A>. </P><P> Null productions can be generated by <A HREF="#VirtualProduction">virtual productions</A>. </P> <P></P> </DD> <DT><B><A NAME="ParameterAssignment">Parameter Assignment</A></B></DT> <DD>In any <A HREF="#GrammarRule"> grammar rule</A>, the <A HREF="#SemanticValue">semantic value</A> of any <A HREF="#RuleElement">rule element</A> may be passed to a <A HREF="#ReductionProcedure">reduction procedure</A> by means of a parameter assignment. In AnaGram this is particularly convenient since you can use an ordinary variable name for the parameter. Simply follow the rule element with a colon and a C variable name. The C variable name can then be used in the reduction procedure to reference the semantic value of the token it is attached to. AnaGram will automatically provide necessary declarations. Here are some examples of rule elements with parameter assignments: <PRE> '0-9':d integer:n expression:x declaration : declaration_descriptor </PRE> </DD> <DT><B><A NAME="Parser">Parser</A></B></DT> <DD> A parser is a program, or more likely a procedure within a program, which scans a sequence of input characters or input <A HREF="#Token">tokens</A> and accumulates them in an input buffer or stack as determined by a set of <A HREF="#Production">productions</A> which constitute a <A HREF = "#Grammar">grammar</A>. When the parser discovers a sequence of tokens as defined by a <A HREF="#GrammarRule">grammar rule</A>, or right side of a production, it "reduces" the sequence to a single <A HREF="#ReductionToken">reduction token</A> as defined by the left side of the <A HREF="#GrammarRule">grammar rule</A>. This <A HREF="#Nonterminal">nonterminal token</A> now replaces the tokens which matched the grammar rule and the search for matches continues. If an input token is encountered which will not yield a match for any rule, it is considered a syntax error and some kind of error recovery may be required to continue. If a match, or reduction action, yields the grammar token, sometimes called the goal token or start token, the parser deems its work complete and returns to whatever procedure may have called it. <P> Tokens may have <A HREF="#SemanticValue">semantic values</A>. If the input values configuration switch is on, your parser will expect semantic values to be provided by the input process along with the token identification code. If the input values switch is off, your parser will take the ascii value of the input character, that is, the actual input code, as the value of the character. When the parser <A HREF="#ReduceAction">reduces</A> a <A HREF="#Production">production</A>, it can call a <A HREF="#ReductionProcedure">reduction procedure</A> or <A HREF="#SemanticAction">semantic action</A> to analyze the values of the constituent tokens. This reduction procedure can then return a value which characterizes the reduced token. </P> <P></P> </DD> <DT><B><A NAME="ParserGenerator">Parser Generator</A></B></DT> <DD> A parser generator, such as AnaGram, is a program that converts a <A HREF="#Grammar">grammar</A>, a rule-based description of the input to a program, into a conventional, procedural module called a <A HREF="#Parser">parser</A>. The parsers AnaGram generates are simple C modules which can be compiled on almost any platform. AnaGram parsers are also compatible with C++. <P></P> </DD> <DT><B><A NAME="ParserStateStack">Parser State Stack</A></B></DT> <DD> The parser state stack is a stack maintained by an AnaGram parser and which is an integral part of the parsing process. At any point in the parse of your input stream, the parser state stack provides a list of the states which were traversed in order to arrive at the current state. <P></P> </DD> <DT><B><A NAME="ParserValueStack">Parser Value Stack</A></B></DT> <DD>In parallel with the <A HREF="#ParserStateStack">parser state stack</A>, an AnaGram <A HREF="#Parser"> parser</A> maintains a <I>value stack</I> each entry of which corresponds to the <A HREF="#SemanticValue">semantic value</A> of the token identified at that state. Since the semantic values of different tokens might well have different data types, AnaGram gives you the opportunity, in your syntax file, to define the data type for any token. AnaGram then builds a typedef statement creating a data type which is a union of the all the types you have defined. AnaGram creates the name for this data type by appending "_vs_type" to the name of the parser. AnaGram uses this data type to define the value stack. <P></P> </DD> <DT><B><A NAME="ParsingEngine">Parsing Engine</A></B></DT> <DD> A <A HREF="#Parser">parser</A> consists of three basic components: A set of syntax tables, a set of <A HREF="#ReductionProcedure">reduction procedures</A> and a parsing engine. The parsing engine is the body of code that interprets the parsing table, invokes input functions, and calls the reduction procedures. The Build Parser command configures a parsing engine according to the implicit requirements of the syntax specification and according to the explicit values of the configuration parameters. <P> The parsing engine itself is a simple automaton, characterized by a set of states and a set of inputs. The inputs are the <A HREF="#Token">tokens</A> of your <A HREF = "#Grammar">grammar</A>. Each state is represented by a list of tokens which are admissible in that state and for each token an <A HREF="#ParserAction">action</A> to perform and a parameter which further defines the action. In a traditional <A HREF="#LALRParsing">LALR parser</A>, there are only four <A HREF="#ParserAction">actions</A>: the <A HREF="#ShiftAction">shift action</A>, the <A HREF="#ReduceAction">reduce action</A>, the <A HREF="#AcceptAction">accept action</A> and the <A HREF="#ErrorAction">error action</A>. AnaGram, in doing its <A HREF="#GrammarAnalysis">grammar analysis</A>, identifies a number of special cases, and creates a number of extra actions which make for faster processing, but which can be represented as combinations of these primitive actions. </P><P> When a shift action is performed, the current state number is pushed onto the parser state stack and the new state number is determined by the current state number and the current lookahead token. Different tokens cause different new states. </P><P> When a reduce action is performed, the length of the rule being reduced is subtracted from the parser stack index and the new state number is read from the top of the parser state stack. The <A HREF="#ReductionToken">reduction token</A> for the rule being reduced is then used as an input token. </P><P> Each state in the grammar, with the exception of state zero, has a <A HREF="#CharacteristicToken">characteristic token</A> which must have been recognized in order to jump to that state. Therefore, the parser state stack, which is essentially a list of state numbers, can also be thought of as a list of token numbers. This is the list of tokens that have been seen so far in the parse of your input stream. Some of these tokens, of course, may be <A HREF="#Nonterminal">nonterminal tokens</A> which have resulted from reducing other sequences of tokens. </P> <P></P> </DD> <DT><B><A NAME="SetPartition">Partition</A></B></DT> <DD> If you use <A HREF="#CharacterSets">character sets</A> in your <A HREF = "#Grammar">grammar</A>, AnaGram will compute a "partition" of the <A HREF="#Universe">character universe</A>. This partition is a collection of non-overlapping character sets such that every one of the sets you have defined can be written as a <A HREF="#SetUnion">union</A> of partition sets. Each partition set is identified by a unique reference number called the partition set number. <P> Each partition set is assigned a unique <A HREF="#Token">token</A>. If one of your character sets requires more than one partition set to represent it, AnaGram will create appropriate <A HREF="#Production">productions</A> and add them to your grammar so your <A HREF="#Parser">parser</A> can make the necessary distinctions. </P> <P></P> </DD> <DT><B><A NAME="Precedence">Precedence</A></B></DT> <DD> "Precedence" is an attribute of <A HREF="#BinaryOperator">binary operators</A> and <A HREF = "#UnaryOperator">unary operators</A> which can be used to resolve <A HREF="#Conflict">conflicts</A>. Operator precedence is also referred to as <A HREF = "#Binding">binding</A>. Suppose that # and @ are two binary operators. The question is how to interpret x # y @ z. If # has higher precedence than @, it is interpreted as (x#y)@z. On the other hand if # has lower precedence than @, it would be x#(y@z). If # and @ have the same precedence then the question must be resolved by <A HREF="#Associativity">associativity</A>. <P> Note that all operators at the same precedence level must have the same associativity. </P><P> The situation is somewhat simpler for unary operators. If # and @ were both prefix operators, or both suffix operators, there would be no ambiguity in interpretation, since neither # @ x and x # @ offer more than one possible interpretation. The only difficulty arises when both a prefix and a suffix operator are applied to the same operand. </P><P> Suppose that # is a prefix operator and @ is a suffix operator. The question then is how to interpret # x @. If # has higher precedence than @, it would be interpreted as (# x) @. On the other hand, if # has lower precedence than @, it would be interpreted as # (x @). If # and @ have the same precedence then the question must be resolved by associativity. </P> <P></P> </DD> <DT><B><A NAME="Production">Production</A></B></DT> <DD> Productions are the mechanism used to describe how complex input structures are built up out of simpler ones. Each production has a left side and a right side. The right side, or <A HREF="#GrammarRule">grammar rule</A>, is a sequence of <A HREF="#RuleElement">rule elements</A>, which may represent either <A HREF="#Terminal">terminal tokens</A> or <A HREF="#Nonterminal">nonterminal tokens</A>. The left side is a list of <A HREF="#ReductionToken">reduction tokens</A>. In most cases there would be only a single reduction token. Productions with more than one <A HREF="#Token">token</A> on the left side are called <A HREF = "#SemanticallyDetermined">semantically determined productions</A>. <p>The "<CODE>-></CODE>" symbol is used to separate the left side from the right side. If you have several productions with the same left hand side, you can avoid rewriting the left hand side either by using '|' or by using another "->". <p> A <A HREF="#NullProduction"> null production</A>, or empty right hand side, cannot follow a '|'. <P> Productions may be written thus: <pre> name -> letter -> name, digit </pre> This could also be written <pre> name -> letter | name, digit </pre> In order to accommodate semantic analysis of the data, you may attach to any grammar rule a <A HREF="#ReductionProcedure">reduction procedure</A> which will be executed when the rule is identified. Each token may have a <A HREF="#SemanticValue"> semantic value</A>. By using <A HREF="#ParameterAssignment"> parameter assignments</A>, you may provide the reduction procedure with access to the semantic values of tokens that comprise the grammar rule. When it finishes, the reduction procedure may return a value which will be saved on the <A HREF="#ParserValueStack"> parser value stack</A> as the semantic value of the <A HREF="#ReductionToken">reduction token</A>. <P></P> </DD> <DT><B><A NAME="ReduceAction">Reduce Action</A></B></DT> <DD> The reduce action is one of the four <A HREF="#ParserAction">actions</A> of a traditional parsing engine. The reduce action is performed when the <A HREF="#Parser">parser</A> has succeeded in matching all the elements of a <A HREF="#GrammarRule">grammar rule</A> and the next input <A HREF="#Token">token</A> is not erroneous. Reducing the grammar rule amounts to subtracting the length of the rule from the parser stack index, identifying the <A HREF="#ReductionToken">reduction token</A>, stacking its <A HREF="#SemanticValue">semantic value</A> and then doing a <A HREF="#ShiftAction">shift action</A> with the reduction token as though it had been input directly. <P></P> </DD> <DT><B><A NAME="ReduceReduce">Reduce-Reduce Conflict</A></B></DT> <DD> A <A HREF = "#Grammar">grammar</A> has a "reduce-reduce" conflict at some state if a single <A HREF="#Token">token</A> turns out to be a <A HREF="#ReducingToken">reducing token</A> for more than one <A HREF="#CompletedRule">completed rule</A>. <P></P> </DD> <DT><B><A NAME="ReducingToken">Reducing Token</A></B></DT> <DD> In a state with more than one <A HREF="#CompletedRule">completed rule</A>, your <A HREF="#Parser">parser</A> must be able to determine which one was actually found. AnaGram deals with this problem by looking at all the states the parser will branch to once each rule is reduced. The acceptable input <A HREF="#Token">tokens</A> for those states are the "reducing tokens" for the completed rules in the state under investigation. If there is a single token which is a reducing token for more than one rule, then the <A HREF = "#Grammar">grammar</A> is said to have a <A HREF="#ReduceReduce">reduce-reduce conflict</A> at that state. If in a particular state there is both a <A HREF="#ShiftAction">shift action</A> and a <A HREF="#ReduceAction">reduce action</A> for the same token the grammar is said to have a <A HREF="#ShiftReduce">shift-reduce conflict</A> in that state. <P> A reducing token is not the same as a <A HREF="#ReductionToken">reduction token</A>. </P> <P></P> </DD> <DT><B><A NAME="ReductionProcedure">Reduction Procedure</A></B></DT> <DD> A "reduction procedure" is a function you write which your <A HREF="#Parser">parser</A> executes when it has identified the <A HREF="#GrammarRule">grammar rule</A> to which the reduction procedure is attached in your <A HREF = "#Grammar">grammar</A>. There are two formats for reduction procedures, depending on the size and complexity of the procedure. <P></P> </DD> <DT><B><A NAME="ReductionToken">Reduction Token</A></B></DT> <DD> A <A HREF="#Token">token</A> which appears on the left side of a <A HREF="#Production">production</A> is called a reduction token. It is so called because when the <A HREF="#GrammarRule">grammar rule</A> on the right side of the production is matched in the input stream, your <A HREF="#Parser">parser</A> will <A HREF="#ReduceAction">"reduce"</A> the sequence of tokens which matches the rule by replacing the sequence of tokens with the reduction token. Note that, if more than one reduction token is specified, your <A HREF="#ReductionProcedure">reduction procedure</A> should choose the exact one. If it does not, your parser will use the leftmost syntactically correct one in the list as the default. <P> A reduction token is not the same as a <A HREF="#ReducingToken">reducing token</A>. </P> <P></P> </DD> <DT><B><A NAME="Resynchronization">Resynchronization</A></B></DT> <DD> "Resynchronization" is the process of getting your <A HREF="#Parser">parser</A> back in step with its input after encountering a syntax error. As such, it is one method of error recovery. Of course, you would resynchronize only if it is necessary to continue after the error. There are several options available when using AnaGram. You could use the auto resynch option, which causes AnaGram to incorporate an automatic resynchronizing procedure into your parser, or you could use the error resynch option, which is similar to the technique used by YACC programmers. </DD> <P></P> <DT><B><A NAME="RuleElement">Rule Element</A></B></DT> <DD> A <A HREF="#GrammarRule">grammar rule</A> is a list of "rule elements", separated by commas. Rule elements may be <A HREF="#Token">token</A> names, <A HREF="#CharacterSets">character sets</A>, keywords, immediate actions, or <A HREF="#VirtualProduction">virtual productions</A>. When AnaGram encounters a rule element for which no token presently exists, it creates one. <P></P> </DD> <DT><B><A NAME="SemanticAction">Semantic Action</A></B></DT> <DD> A semantic action is a piece of C or C++ code that is executed when a <A HREF="#GrammarRule">grammar rule</A> has been identified by a <A HREF="#Parser">parser</A>. Semantic actions are also called <A HREF="#ReductionProcedure">reduction procedures</A>, since they are executed when a grammar rule is <A HREF="#ReduceAction">reduced</A> to the token on the left side of the <A HREF="#Production">production.</A> <P></P> </DD> <DT><B><A NAME="SemanticValue">Semantic Value</A></B></DT> <DD> <A HREF="#Token">Tokens</A>, whether <A HREF="#Terminal">terminal</A> or <A HREF="#Nonterminal">nonterminal</A>, may have a semantic value. In the case of terminal tokens, this may be a value assigned by the <A HREF="#LexicalScanner">lexical scanner</A> or, if the parser is using direct character input, it will be the ascii value of the character itself. The values of nonterminal tokens are created by <A HREF="#ReductionProcedure">reduction procedures</A>. As a parse progresses, token values are <A HREF="#ShiftAction">shifted</A> onto the stack, so that in a reduction procedure, the values of the tokens that comprise the <A HREF="#GrammarRule">grammar rule</A> that is being <A HREF="#ReduceAction">reduced</A> are available for inspection. <P></P> </DD> <DT><B><A NAME="SemanticallyDetermined">Semantically Determined Production</A></B></DT> <DD> A "semantically determined production" is one which has more than one <A HREF="#ReductionToken">reduction token</A> specified on the left side of the production. You would write such a <A HREF="#Production">production</A> when the reduction tokens are syntactically indistinguishable. The <A HREF="#ReductionProcedure">reduction procedure</A> may then specify which of the listed reduction tokens the <A HREF="#GrammarRule">grammar rule</A> is to reduce to based on semantic considerations. If there is no reduction procedure, or the reduction procedure does not specify a reduction token, the <A HREF="#Parser">parser</A> will use the leftmost syntactically correct token. <P></P> </DD> <DT><B><A NAME="SetExpression">Set Expression</A></B></DT> <DD> A set expression is an algebraic expression used to define a <A HREF="#CharacterSets">character set</A> in terms of individual characters, ranges of characters, and/or other sets of characters as constructed using <A HREF="#SetComplement">complements</A>, <A HREF="#SetUnion">unions</A>, <A HREF="#SetIntersection">intersections</A>, and <A HREF="#SetDifference">differences</A>. <P></P> </DD> <DT><B><A NAME="ShiftAction">Shift Action</A></B></DT> <DD> The shift action is one of the four <A HREF="#ParserAction">actions</A> of a traditional parsing engine. The shift action is performed when the input <A HREF="#Token">token</A> matches one of the acceptable input tokens for the current parser state. The <A HREF="#SemanticValue">semantic value</A> of the token and the current state number are stacked, the parser stack index is incremented and the state number is set to a value determined by the previous state and the input token. <P></P> </DD> <DT><B><A NAME="ShiftReduce">Shift-Reduce Conflict</A></B></DT> <DD> A "shift-reduce" conflict occurs if in some state there is a single <A HREF="#Token">token</A> that should be shifted, because it is legitimate input for one of the rules of the state, but should also be used to reduce some other rule because it is a <A HREF="#ReducingToken">reducing token</A> for that rule. <P></P> </DD> <DT><B><A NAME="SyntaxDirectedParsing">Syntax Directed Parsing</A></B></DT> <DD> <P>Syntax directed parsing, or formal parsing, is an approach to building <A HREF="#Parser">parsers</A> based on formal language theory. Given a suitable description of a language, called a <A HREF="#Grammar">grammar</A>, there are algorithms which can be used to create parsers for the language automatically. In this context, the set of all possible inputs to a program may be considered to constitute a language, and the rules for formulating the input to the program constitute the grammar for the language. <P>The parsers built from a grammar have the advantage that they can recognize any input that conforms to the rules, and can reject as erroneous any input that fails to conform. <P>Since the program logic necessary to parse input is often extremely intricate, programs which use formal parsing are usually much more reliable than those built by hand. They are also much easier to maintain, since it is much easier to modify a grammar specification than it is to modify complex program logic. <P></P> </DD> <DT><B><A NAME="Terminal">Terminal Token</A></B></DT> <DD> A "terminal token" is a <A HREF="#Token">token</A> which does not appear on the left side of a <A HREF="#Production">production</A>. It represents, therefore, a basic unit of input to your <A HREF="#Parser">parser</A>. If the input to your parser consists of ascii characters, you may define terminal tokens explicitly as ascii characters or as <A HREF="#CharacterSets">sets of ascii characters</A>. If you have an input procedure which produces numeric codes, you may define the terminal tokens directly in terms of these numeric codes. <P></P> </DD> <DT><B><A NAME="Token">Token</A></B></DT> <DD> Tokens are the units with which your <A HREF="#Parser">parser</A> works. There are two kinds of tokens: <A HREF="#Terminal">terminal tokens</A> and <A HREF="#Nonterminal">nonterminal tokens</A>. These latter are identified by the parser as sequences of tokens. The grouping of tokens into more complex tokens is governed by the <A HREF="#GrammarRule">grammar rules</A>, or <A HREF="#Production">productions</A> in your <A HREF = "#Grammar">grammar</A>. In your grammar, tokens may be denoted by token names, by <A HREF="#VirtualProduction">virtual productions</A>, by immediate actions, by explicit character representations, or by expressions which yield <A HREF="#CharacterSets">character sets</A>. <P> A "token" should be thought of more or less as being something like a cloakroom token. Imagine you had taken a kindergarten class to the museum and all the kids checked their coats. Each one gets a token. What is the probability of losing one? You take all of the tokens from the children, put them in a paper bag, tie it up, and check it. You get one token in return. We would say that the kids' coats represent the actual input to the system, their individual tokens are the <A HREF="#Terminal">terminal tokens</A>, and your one token which enables you to redeem the paper bag is a <A HREF="#Nonterminal">nonterminal token</A>. Nonterminal tokens are just single token numbers to represent the result of compacting a sequence of more primitive tokens. </P><P> Actually, the word "token" is commonly used to refer to a number of apparently different things. The tokens you use in a grammar are abstract. The tokens identified by a parser are concrete instances of the abstract tokens in the grammar. Furthermore, we have to identify tokens in some manner, so we have token names and token numbers with which to refer to particular tokens. As a result the word "token", in any particular context, may refer directly to either an abstract token, or a concrete instance of an abstract token, or may refer only indirectly to the token by means of a token name or token number. </P><P> As an example, consider a C compiler which has a lexical scanner to identify input tokens. According to Kernighan and Ritchie, there are six classes of C tokens: identifiers, keywords, constants, string literals, operators, and other separators. Consider a particular token, a hexadecimal constant, 0XF00. When the lexical scanner hands this over to the parser, it has to provide an identification code which says what kind of token it is, a hexadecimal constant, as well as the <A HREF="#SemanticValue">"semantic value"</A> of the token, 3840. The identification code is usually an integer value, determined by the designer of the lexical scanner, which by agreement with the designer of the parser uniquely represents a hexadecimal constant. This identification code can be thought of as an external token number. </P><P> The grammar for the parser, on the other hand, has a token, "HEXconstant", let us say, to represent the abstract usage of hexadecimal constants in the C language. When AnaGram, or any other parser generator for that matter, analyzes the grammar, it assigns its own reference number to identify "HEXconstant". This reference number can be thought of as an internal token number to contrast it with the external token number provided by the lexical scanner. The parsers AnaGram builds contain a table, ag_ tcv, which provides a translation from the external to the internal token number. Both of these token numbers, of course, differ from the semantic value of the token, in this case, 3840. </P><P> Even when an AnaGram parser is using simple character input, the word "token" may represent several different things. Suppose a grammar has a token 'A-Z', which represents the set of upper case letters. There will be an internal token number, assigned by AnaGram, which identifies the (abstract) token in parser tables. The actual (concrete) input token to a parser will, however, be a single letter, say "Q". In this case, the ascii value of the character "Q", 81, is used both as the external token number and as the semantic value of the token. </P><P></P> </DD> <DT><B><A NAME="UnaryOperator">Unary Operator</A></B></DT> <DD>A "unary operator" is a <A HREF="#Token">token</A> which represents some sort of operation which operates on a single entity to produce a new resultant entity. It is normally represented by a symbol which is written either preceding the operand or following the operand. In the first case it is called a "prefix operator" and in the second case a "suffix operator". </P><P></P> </DD> <DT><B><A NAME="SetUnion">Union</A></B></DT> <DD> The union of two sets is the set of all elements that are to be found in one or another of the two sets. In an AnaGram syntax file the union of two <A HREF="#CharacterSets">character sets</A> A and B is represented using the plus sign, as in A + B. The union operator has the same <A HREF="#Precedence">precedence</A> as the <A HREF="#SetDifference">difference</A> operator: lower than that of <A HREF="#SetIntersection">intersection</A> and <A HREF="#SetComplement">complement</A>. The union operator is <A HREF="#Associativity">left associative</A>. <P></P> </DD> <DT><B><A NAME="VirtualProduction">Virtual Production</A></B></DT> <DD> <P>Virtual productions are a special short hand representation of <A HREF="#GrammarRule">grammar rules</A> which can be used to indicate a choice of inputs. They are an important convenience, especially useful when you are first building a <A HREF="#Grammar">grammar</A>. <P>AnaGram rewrites virtual productions, so that when you look at the syntax tables in AnaGram, there will be actual <A HREF="#Production">productions</A> replacing the virtual productions. <P>A virtual production appears as one of the <A HREF="#RuleElement">rule elements</A> in a <A HREF="#GrammarRule">grammar rule</A>, i.e. as one of the members of the list on the right side of a production. <P>The simplest virtual production is the "optional" <A HREF="#Token">token</A>. If x is an arbitrary token, x? can be used to indicate an optional x. <P>Related virtual productions are x... and x?... where the three dots indicate repetition. x... represents an arbitrary number of occurrences of x, but at least one. x?... represents zero or more occurrences of x. <P>The remaining virtual productions use curly or square brackets to enclose a sequence of rules. The brackets may be followed variously by nothing, a string of three dots, or a slash, to indicate the choices to be made from the rules. Note that rules may be used, not merely tokens. <P>If r1 through rn are a set of grammar rules, then <PRE> {r1 | r2 | ... | rn} </PRE> is a virtual production that allows a choice of exactly one of the rules. Similarly, <PRE> {r1 | r2 | ... | rn}... </PRE> is a virtual production that allows a choice of one or more of the rules. And, finally, <PRE> {r1 | r2 | ... | rn}/... </PRE> is a virtual production that allows a choice of one or more of the rules subject to the side condition that rules must alternate, that is, that no rule can follow itself immediately without the interposition of some other rule. This is a case that is not particularly easy to write by hand, but is quite useful in a number of contexts. <P>If the above virtual productions are written with [ ] instead of { }, they all become optional. [ ] is an optional choice, [ ]... is zero or more choices, and [ ]/... is zero or more alternating choices. <P><A HREF="#NullProduction">Null productions</A> are not permitted in virtual productions in those cases where they would cause an intrinsic ambiguity. <P>You may use a definition statement to assign a name to a virtual production. </DD> </DL> <P> <BR> <IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" WIDTH=1010 HEIGHT=2 > <P> <IMG ALIGN="right" SRC="images/pslrb6d.gif" ALT="Parsifal Software" WIDTH=181 HEIGHT=25> <BR CLEAR="right"> Back to <A HREF="index.html">Index</A> <P> <ADDRESS><FONT SIZE="-1"> AnaGram parser generator - documentation<BR> Glossary<BR> Copyright © 1993-1999, Parsifal Software. <BR> All Rights Reserved.<BR> </FONT></ADDRESS> </BODY> </HTML>