Mercurial > ~dholland > hg > ag > index.cgi
diff doc/misc/html/gloss.html @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/misc/html/gloss.html Sat Dec 22 17:52:45 2007 -0500 @@ -0,0 +1,1226 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> +<HTML> +<HEAD> +<TITLE>Glossary of Parsing Terms</TITLE> + +</HEAD> + +<BODY BGCOLOR="#ffffff" BACKGROUND="tilbl6h.gif" + TEXT="#000000" LINK="#0033CC" + VLINK="#CC0033" ALINK="#CC0099"> + +<P> +<IMG ALIGN="right" SRC="images/agrsl6c.gif" ALT="AnaGram" + WIDTH=124 HEIGHT=30> +<BR CLEAR="RIGHT"> +</P> + +Back to <A HREF="index.html">Index</A> +<P> +<IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" + WIDTH=1010 HEIGHT=2 > + +<P> +<H2>Glossary of Parsing Terms</H2> +<IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" + WIDTH=1010 HEIGHT=2 > +</P> + +<P> +<TABLE WIDTH="100%" ALIGN="center" CELLPADDING=15 CELLSPACING=15> +<TR ALIGN="left"> + +<TD VALIGN=TOP NOWRAP WIDTH="50%"> + + + +<A HREF="#ParserAction">Action</A><BR> +<A HREF="#AcceptAction">Accept Action</A><BR> +<A HREF="#Associativity">Associativity</A><BR> +<A HREF="#Backtracking">Backtracking</A><BR> +<A HREF="#BNF">Backus-Naur Form</A><BR> +<A HREF="#BinaryOperator">Binary operator</A><BR> +<A HREF="#Binding">Binding</A><BR> +<A HREF="#CharacterSets">Character sets</A><BR> +<A HREF="#Universe">Character Universe</A><BR> +<A HREF="#CharacteristicRule">Characteristic Rule</A><BR> +<A HREF="#CharacteristicToken">Characteristic Token</A><BR> +<A HREF="#SetComplement">Complement of a set</A><BR> +<A HREF="#CompletedRule">Completed Rule</A><BR> +<A HREF="#Conflict">Conflict</A><BR> +<A HREF="#ContextFreeGrammar">Context Free Grammars</A><BR> +<A HREF="#DefaultReduction">Default reductions</A><BR> +<A HREF="#SetDifference">Difference of two sets</A><BR> +<A HREF="#ErrorAction">Error Action</A><BR> +<A HREF="#EventDriven">Event Driven Parser</A><BR> +<A HREF="#ExpansionRule">Expansion Rule</A><BR> +<A HREF="#Grammar">Grammar</A><BR> +<A HREF="#GrammarAnalysis">Grammar Analysis</A><BR> +<A HREF="#GrammarRule">Grammar Rule</A><BR> +<A HREF="#GrammarToken">Grammar Token</A><BR> +<A HREF="#SetIntersection">Intersection</A><BR> +<A HREF="#LALRParsing">LALR parsing</A><BR> +<A HREF="#LexicalScanner">Lexical Scanner</A><BR> +<A HREF="#Lookahead">Lookahead Token</A><BR> +<A HREF="#LRParsing">LR Parsing</A><BR> +<A HREF="#MarkedRule">Marked Rule</A><BR> +<A HREF="#NonAssociative">Non-associative</A><BR> +</TD> + +<TD VALIGN=TOP NOWRAP WIDTH="50%"> +<A HREF="#Nonterminal">Nonterminal Token</A><BR> +<A HREF="#NullProduction">Null Production</A><BR> +<A HREF="#ParameterAssignment">Parameter Assignment</A><BR> +<A HREF="#Parser">Parser</A><BR> +<A HREF="#ParserGenerator">Parser Generator</A><BR> +<A HREF="#ParserStateStack">Parser State Stack</A><BR> +<A HREF="#ParserValueStack">Parser Value Stack</A><BR> +<A HREF="#ParsingEngine">Parsing Engine</A><BR> +<A HREF="#SetPartition">Partition</A><BR> +<A HREF="#Precedence">Precedence</A><BR> +<A HREF="#Production">Production</A><BR> +<A HREF="#ReduceAction">Reduce Action</A><BR> +<A HREF="#ReduceReduce">Reduce-Reduce Conflict</A><BR> +<A HREF="#ReducingToken">Reducing Token</A><BR> +<A HREF="#ReductionProcedure">Reduction Procedure</A><BR> +<A HREF="#ReductionToken">Reduction Token</A><BR> +<A HREF="#Resynchronization">Resynchronization</A><BR> +<A HREF="#RuleElement">Rule Element</A><BR> +<A HREF="#SemanticAction">Semantic Action</A><BR> +<A HREF="#SemanticValue">Semantic Value</A><BR> +<A HREF="#SemanticallyDetermined">Semantically Determined Production</A><BR> +<A HREF="#SetExpression">Set Expression</A><BR> +<A HREF="#ShiftAction">Shift Action</A><BR> +<A HREF="#ShiftReduce">Shift-Reduce Conflict</A><BR> +<A HREF="#SyntaxDirectedParsing">Syntax Directed Parsing</A><BR> +<A HREF="#Terminal">Terminal Token</A><BR> +<A HREF="#Token">Token</A><BR> +<A HREF="#UnaryOperator">Unary Operator</A><BR> +<A HREF="#SetUnion">Union</A><BR> +<A HREF="#VirtualProduction">Virtual Production</A><BR> + +</TD> + +</TABLE> + +<HR> + + + +<DL> +<DT><B><A NAME="ParserAction">Action</A></B></DT> +<DD>A "parser action" is one of the basic executable elements of a +<A HREF="#ParsingEngine">parsing engine</A>. In a traditional parsing engine +there are four actions: the +<A HREF="#ShiftAction">shift action</A>, the <A HREF="#ReduceAction">reduce +action</A>, the <A HREF="#AcceptAction">accept action</A>, and the +<A HREF="#ErrorAction">error action</A>. The parsing engines which AnaGram +generates use a substantially greater number of actions, in order to provide +certain short cuts in the interest of greater speed and greater compactness of +the tables which control the action of the parsing engine. +<P></P> +</DD> + +<DT><B><A NAME="AcceptAction">Accept Action</A></B></DT> +<DD> The accept <A HREF="#ParserAction">action</A> is one of the four actions +of a traditional <A HREF="#ParsingEngine">parsing engine</A>. The accept action +is performed when the <A HREF="#Parser">parser</A> has succeeded in identifying +the <A HREF="#GrammarToken">goal token</A> for the <A HREF="#Grammar">grammar</A>. +When the parser executes the accept action, it returns to the calling program. +The accept action is thus the last action of +the parsing engine and occurs only once for each successful execution +of the parser. + +<P></P> +</DD> + +<DT><B><A NAME="Associativity">Associativity</A></B></DT> +<DD> This is a term used to describe <A HREF="#BinaryOperator">binary +operators </A>. It describes how you interpret a sequence of operations which +all involve the same operator. Consider a - b - c, for instance. Here the +convention is that we subtract b from a, and then c from the difference. We say +that subtraction is <B>left associative</B>, since if there is a sequence of +subtraction operations, the leftmost one is to be performed first. As a second +example, consider exponentiation. FORTRAN uses ** as an operator to indicate +exponentiation. In order to conform as closely as possible to ordinary +mathematical notation, the expression a**b**c is interpreted to mean that first +b is raised to the power c, and then the resultant value is used as the power to +which a is raised. We say that exponentiation is <B>right associative</B> since the +rightmost of a sequence of exponentiation operations is to be performed first. +If a programming language does not allow for unparenthesized sequences of a +particular operator, the operator is said to be non-associative.<P> +Another way to view left and right associativity is as implied parenthesization. +Left associative operators are parenthesized from the left. Right associative +operators are parenthesized from the right. Thus a - b - c = ((a-b) - c) and +a**b**c = (a**(b**c))</P><P> +AnaGram offers two ways to specify associativity of + <A HREF="#BinaryOperator">binary operators</A>. The first +way is to use conventional <A HREF="#BNF">Backus-Naur Form</A> syntax +descriptions. You would userecursion to describe both subtraction and +exponentiation. Subtraction, since it is left associative, uses left recursion, +and exponentiation, being right associative, uses right recursion. Thus</P><P> +</P><PRE> + difference -> value | difference, '-', value + exponential -> value | value, "**", exponential +</PRE> could be used to describe differences and exponents.<P> +The second way to specify associativity is to use an ambiguous <A HREF="#Grammar">grammar</A> and +precedence declarations. (See Chapter 9, AnaGram User's Guide.)</P><P> +</P></DD> + +<DT><B><A NAME="Backtracking">Backtracking</A></B></DT> +<DD> In order to make parse tables more compact and <A HREF="#Parser">parsers</A> faster, it is +common to use <A HREF="#DefaultReduction">default reductions</A>. In case of +error, it is necessary to undo default reductions before diagnostics can be +properly determined. In AnaGram, this undo operation is called backtracking.<P> + +</P></DD> + +<DT><B><A NAME="BNF">Backus-Naur Form</A></B></DT> +<DD> Backus-Naur Form, or BNF, is a conventional notation for describing +<A HREF="#ContextFreeGrammar">context free grammars</A>. AnaGram uses an +extended notation for <A HREF="#Grammar">grammars</A>, which, except for +<A HREF="#SemanticallyDetermined">semantically determined +productions</A>, can be shown to be equivalent to BNF. The term "BNF" is +often used colloquially to refer to a grammar specification.<P> +AnaGram's syntax specification language differs from BNF in the following +respects: +</P><OL> +<LI>In conventional BNF, a symbol not enclosed in angle brackets (< >) is +taken to represent itself literally. In AnaGram, literal character +representations must be enclosed in single quotes and literal strings within +double quotes. +</LI> +<LI>In BNF, the right side of a <A HREF="#Production">production</A> is simply a list of symbols without +any delimiters or punctuation. AnaGram requires that the <A HREF="#RuleElement">rule elements</A> which +make up a <A HREF="#GrammarRule">grammar rule</A>, or right side, be joined by commas. +</LI> +<LI>BNF makes no provision for identifying <A HREF="#ReductionProcedure">reduction procedures</A> or their +arguments. AnaGram provides both reduction procedures, introduced by an "=" +at the end of the production, and named arguments, introduced by a ":" +at the end of any <A HREF="#Token">token</A> on the right side of the production. +</LI> +<LI>AnaGram allows <A HREF="#VirtualProduction">virtual productions</A> to be used freely. +</LI> +<LI>BNF is "pure" in that if you wish to define a <A HREF="#Nonterminal">nonterminal token</A> +called "digit" to represent the digits from zero to nine, you must +provide ten explicit productions to define it. AnaGram treats the concept of <A HREF="#Terminal">"terminal +token"</A> as used in language theory as an abstraction, and interposes +<A HREF="#CharacterSets">character set</A> functionality between actual character input and the terminal +tokens of BNF. You can define digit to be the character range '0-9', for +instance, and AnaGram will determine and define precisely the minimum number of +productions necessary, taking into account any other usage of the characters +'0-9' in your grammar. This makes your grammar more compact, more manageable, +and easier to modify. +</LI> +<LI>AnaGram allows for <A HREF="#SemanticallyDetermined"> +semantically determined productions</A>, which provide a +significant mechanism for melding semantic analysis with syntactic +analysis. + +</LI> +</OL> +</DD> + +<DT><B><A NAME="BinaryOperator">Binary operator</A></B></DT> +<DD> A binary operator is an operator that works on two operands to create a +result. It is often written between its operands and is sometimes called an +infix operator. In ordinary programming languages "+" and "-" +are binary operators. +<P></P> +</DD> + +<DT><B><A NAME="Binding">Binding</A></B></DT> +<DD> Most programming languages, rather than executing arithmetic operators in +simple left to right order, conform to the traditional conventions of ordinary +algebra, which dictate that, except where parenthesis indicate otherwise, +exponentiation is done before multiplication, multiplication before addition, +and addition before comparison. One says that exponentiation is "more +tightly binding" than multiplication, multiplication is more tightly +binding than addition, and so on. The sense of the word here is that the +operator binds together its operands into a single entity. An operand which +falls between two different operators is "bound" by the more tightly +binding operator. An operator that is more tightly binding than another is also +said to have higher <A HREF="#Precedence">precedence</A>. +<P></P> +</DD> + +<DT><B><A NAME="CharacterSets">Character sets</A></B></DT> +<DD> One of the traditional problems with syntax directed programming is that +caused by the fact that most formal language specifications have productions of +the form: +<PRE> + letter -> a | b | c | d ... | z +</PRE> Since the letters are not often distinguished in the <A HREF = "#Grammar">grammar</A>, this large +number of essentially identical <A HREF="#Production">productions</A> causes a correspondingly large +number of states in the <A HREF="#Parser">parser</A>. This problem is often attacked by using a<A +HREF="#LexicalScanner"> "lexical scanner"</A> which simply specifies +a "letter token" whose value indicates precisely which letter is +intended. This works fine as long as there is nothing in the grammar which +distinguishes any letter at all. If there is, and there is usually some special +use of h, x, q, o, e, or even a-f, the situation gets more complicated. Often the +result is to abandon the use of syntax directed programming for elemental units +of the language and use a more complex lexical scanner which identifies names, +numbers, key words and operators. Such lexical scanners are often built using +conventional programming languages or simple pattern recognizing languages such +as LEX.<P> +AnaGram avoids this problem by incorporation the notion of character set into +its input specifications. Briefly, the way it works is the following: You +specify the set of characters which make up any ostensibly <A HREF="#Terminal">terminal token</A>. +AnaGram then analyzes the overlaps among these definitions and creates a minimal +set of <A HREF="#Token">tokens</A> which match your specifications exactly. For instance, suppose you +have the following definitions: +</P><PRE> + letter = 'a-z' + 'A-Z' + hex digit = '0-9' + 'a-f' + 'A-F' +</PRE> AnaGram will define a token for letter which has two productions: +<PRE> + letter -> 'a-f' + 'A-F' | 'g-z' + 'G-Z' +</PRE> With this technique, you have the freedom to specify your grammar in the +easiest possible manner, without the penalty of an absurdly large, slow parser.<P> +Character sets may be specified in terms of ranges of characters, as in the +above example, as <A HREF="#SetUnion">unions</A>, denoted by "+", as +<A HREF="#SetIntersection">intersections</A>, denoted by "&", as +<A HREF="#SetDifference">differences,</A> denoted by "-", and as +<A HREF="#SetComplement">complements</A>, using "~". You may also +specify a set consisting of a single character either with a literal character +or with a numeric specification. If you specify a character numerically, you may +use either decimal, octal or hexadecimal notation, as in C. +</P><P></P> +</DD> + +<DT><B><A NAME="Universe">Character Universe</A></B></DT> +<DD> In ordinary set theory, the "universe" is the set of all +entities which may be elements of a set. It must be defined so that the +complement of a set can mean something unambiguous. In AnaGram, the character +universe normally comprises the ascii characters and the IBM extended character +set, that is, all characters in the range from 0 through 255. If, however, you +have defined input characters outside this range, the character universe will be +extended to the least character you have used and to the greatest character you +have used in your <A HREF = "#Grammar">grammar</A>. +<P></P> +</DD> + +<DT><B><A NAME="CharacteristicRule">Characteristic Rule</A></B></DT> +<DD> Each parser state is characterized by a particular set of <A HREF="#GrammarRule">grammar rules</A>, +and for each such rule, a marked token which is the next <A HREF="#Token">token</A> expected. These +rules are called the characteristic rules for the state. In the course of doing +<A HREF="#GrammarAnalysis">grammar analysis</A>, AnaGram determines the characteristic rules for each parser +state. After analyzing your grammar, you may inspect the State Definition Table +to see the characteristic rules for any state in your <A HREF="#Parser">parser</A>. +<P></P> +</DD> + +<DT><B><A NAME="CharacteristicToken">Characteristic Token</A></B></DT> +<DD> Every state in a <A HREF="#Parser">parser</A>, except state zero, can be characterized by the +one, unique <A HREF="#Token">token</A> which causes a jump to that state. That token is called the +characteristic token of the state, because to get to that parser state you must +have just seen precisely that token in the input. Note that several states could +have the same characteristic token.<P> +When you have a list of states, such as is given by the parser state stack, it +is equivalent to a list of characteristic tokens. This list of tokens is the +list of tokens that have been recognized so far by the parser. Some of these +tokens, of course, may be <A HREF="#Nonterminal">nonterminal tokens</A> and may thus represent the result +of reducing a sequence of previously recognized tokens. +</P> +<P></P> +</DD> + +<DT><B><A NAME="SetComplement">Complement of a set</A></B></DT> +<DD> In set theory, the complement of a set, S, is the collection of all +elements of the <A HREF="#Universe">character universe</A> which are not elements of S. Using AnaGram's +notation, the complement of S is written ~S. The complement operator has higher +<A HREF="#Precedence">precedence</A> than the +<A HREF="#SetDifference">difference</A>, <A HREF="#SetIntersection">intersection</A>, or <A HREF="#SetUnion">union</A> operators. +<P></P> +</DD> + +<DT><B><A NAME="CompletedRule">Completed Rule</A></B></DT> +<DD> A "completed rule" is a <A HREF="#CharacteristicRule">characteristic rule</A> whose index is +pointing beyond the last <A HREF="#RuleElement">rule element</A>. In other words, it has been completely +matched and will be <A HREF="#ReduceAction">reduced</A> by the next input. +<P> +If a state has more than one completed rule, the decision as to which to reduce +is made based on the next input <A HREF="#Token">token</A>. If there is only one completed rule in a +state, it will be reduced by default unless the default reductions switch has +been reset, i.e., turned off. +</P></DD> +<DT><B><A NAME="Conflict">Conflict</A></B></DT> +<DD> Conflicts arise during a <A HREF="#GrammarAnalysis">grammar analysis</A> when AnaGram cannot determine +how to treat a given input <A HREF="#Token">token</A>. There are two sorts of conflicts: <A HREF="#ShiftReduce">shift-reduce +conflicts</A> and <A HREF="#ReduceReduce">reduce-reduce conflicts</A>. Conflicts may arise either because the +<A HREF = "#Grammar">grammar</A> is inherently ambiguous, or simply because the grammar analyzer cannot +look far enough ahead to resolve the conflict. In the latter case, it is often +possible to rewrite the grammar in such a way as to eliminate the conflict. <A HREF="#NullProduction">Null +productions</A> are a common source of conflict. +<P> +There are a number of ways to deal with conflicts. If you understand the +conflict well, you may simply choose to ignore it. When AnaGram encounters a +shift-reduce conflict while building parse tables it resolves it by choosing the +<A HREF="#ShiftAction">shift action</A>. When AnaGram encounters a reduce-reduce conflict while building +parse tables, it resolves it by selecting the <A HREF="#GrammarRule">grammar rule</A> which occurred first +in the grammar. +</P><P> +A second way to deal with conflicts is to set operator <A HREF="#Precedence">precedence</A> parameters. If +you set these parameters, AnaGram will use them preferentially to resolve +conflicts. Any conflicts so resolved will be listed in the Resolved Conflicts +window. +</P><P> +A third way to resolve a conflict is to declare some tokens as sticky or as +subgrammars. This is particularly useful for <A HREF="#Production">productions</A> whose sole purpose is +to skip over uninteresting input.</P><P> +</P><P> +The fourth way to deal with conflicts is to rewrite the grammar to eliminate +them. Many people prefer this approach since it yields the highest level of +confidence in the resulting program. +</P></DD> +<DT><B><A NAME="ContextFreeGrammar">Context Free Grammars</A></B></DT> +<DD> Context free grammars are <A HREF = "#Grammar">grammars</A> wherein the definition of a grammar +unit, or <A HREF="#Nonterminal">nonterminal token</A>, does not depend on the context in which the +nonterminal is used. That means that if you define "widget" in a +certain way, that definition is valid in all contexts in which "widget" +might occur. Context free grammars can be represented in <A HREF="#BNF">Backus-Naur Form</A>. +AnaGram's syntax specification language has no facility for representing +grammars which are not context free. +<P></P> +</DD> +<DT><B><A NAME="DefaultReduction">Default reductions</A></B></DT> +<DD> A "default reduction" is a parser action which may be used in +your <A HREF="#Parser">parser</A> in any state which has precisely one <A HREF="#CompletedRule">completed rule</A>. +<P> +If a given parser state has, among its <A HREF="#CharacteristicRule">characteristic rules</A>, exactly one +completed rule, it is usually faster to reduce it on any input than to check +specifically for correct input before reducing it. The only time this default +reduction causes trouble is in the event of a syntax error. In this situation +you may get an erroneous reduction. Normally when you are parsing a file, this +is inconsequential because you are not going to continue performing +<A HREF="#SemanticAction">semantic actions</A> in the presence of +errors. But, if you are using your parser to handle real-time +interactive input, you have to be able to continue semantic +processing after notifying your user that he has entered erroneous +input. In this case you would want default reductions to have been +turned off so that <A HREF="#Production">production</A>s are reduced +only when there is correct input. + +</P></DD> + +<DT><B><A NAME="SetDifference">Difference of two sets</A></B></DT> +<DD> In set theory, the difference of two sets, A and B, is defined to be the +set of all elements of A that are not elements of B. In an AnaGram syntax file, +you represent the difference of two <A HREF="#CharacterSets">character sets</A> by using the "-" +operator. Thus the difference of A and B is A - B. The difference operator is +<A HREF="#Associativity">left associative</A>. +<P></P> +</DD> + +<DT><B><A NAME="ErrorAction">Error Action</A></B></DT> +<DD> The error action is one of the four <A HREF="#ParserAction">actions</A> of a traditional <A HREF="#ParsingEngine">parsing +engine</A>. The error action is performed when the <A HREF="#Parser">parser</A> has encountered an input +<A HREF="#Token">token</A> which is not admissible in the current state. The further behavior of a +traditional parser is undefined. +<P></P> +</DD> + +<DT><B><A NAME="EventDriven">Event Driven Parser</A></B></DT> +<DD> +An <I>event driven</I> <A HREF="#Parser">parser</A> is one in which +the relationship between the host program and the parser is turned +inside out. In a conventional parser, the host program calls the +parser, the parser analyzes the complete input text and returns to +the host program only when it has finished with the entire input. +<P>In an event driven parser, the parser does not read its input +directly from a file or from memory. Instead, the host program, after +initializing the parser, calls it once for each input token. Each +time the parser is called, it updates its state +appropriately, calls any <A HREF="#ReductionProcedure">reduction +procedures</A> that need to be called and finally when it needs more +input, returns to the host program. The effect is that parsing can +occur in parallel with other processing performed by the host +program. This technique is especially useful in situations where the +token stream to be parsed is developed on the fly, as when using <A +HREF="#LexicalScanner">lexical scanners</A>, for instance. +<P></P> +</DD> + + +<DT><B><A NAME="ExpansionRule">Expansion Rule</A></B></DT> +<DD> In analyzing a <A HREF = "#Grammar">grammar</A>, we are often interested in the full range of input +that can be expected at a certain point. The expansion of a <A HREF="#Token">token</A> or state shows +us all the expected input. An expansion yields a set of +<A HREF="#MarkedRule">marked rules</A>. Looking +at the marked token shows us what input to expect. +<P> +The set of expansion rules of a <A HREF="#Nonterminal">nonterminal token</A> shows all the expected input +that can occur whenever the token appears in the grammar. The set consists of +all the <A HREF="#GrammarRule">grammar rules</A> produced by the token, plus all the rules produced by the +first token of any rule in the set. The marked tokens for the expansion rules of a +token are always the first element in the rule. +</P><P> +The expansion of a state consists of its +<A HREF="#CharacteristicRule">characteristic rules</A> plus the expansion +rules of the marked token in each characteristic rule. +</P> +</DD> + + +<DT><B><A NAME="Grammar">Grammar</A></B></DT> +<DD> Traditionally, a grammar is a set of <A HREF="#Production">productions</A> which taken together +specify precisely a set of acceptable input sequences in terms of an abstract +set of <A HREF="#Terminal">terminal tokens</A>. The set of acceptable input sequences is often called +the "language" defined by the grammar. +<P> +In AnaGram, the term grammar also includes configuration segments, definitions +of <A HREF="#CharacterSets">character sets</A>, <A HREF="#VirtualProduction">virtual productions</A>, etc. that augment the collection of +productions. A grammar is often called a "syntax", and the <A HREF="#GrammarRule">rules of +the grammar</A> are often called syntactic rules. +</P> +<P></P> +</DD> + +<DT><B><A NAME="GrammarAnalysis">Grammar Analysis</A></B></DT> +<DD> The major function of AnaGram is the analysis of <A HREF = "#ContextFreeGrammar">context free grammars</A> +written in a particular variant of <A HREF="#BNF">Backus-Naur Form</A>. +<P> +The analysis of a <A HREF="#Grammar">grammar</A> proceeds in four stages. In the first, the input +grammar is analyzed and a number of tables are built which describe all of the +<A HREF="#Production">productions</A> and components of the grammar. +</P><P> +In the second stage, AnaGram analyzes all of the <A HREF="#CharacterSets">character sets</A> defined in the +grammar, and where necessary, defines auxiliary <A HREF="#Token">tokens</A> and productions. +</P><P> +In the third stage, AnaGram identifies all of the states of the <A HREF="#Parser">parser</A> and +builds the go-to table for the parser. +</P><P> +In the fourth stage, Anagram identifies <A HREF="#ReductionToken">reduction tokens</A> for each completed +<A HREF="#GrammarRule">grammar rule</A> in each state and checks for <A HREF="#Conflict">conflicts</A>. +</P> +<P></P> +</DD> + +<DT><B><A NAME="GrammarRule">Grammar Rule</A></B></DT> +<DD> A "grammar rule" is the right side of a <A HREF="#Production">production</A>. It consists +of a sequence of <A HREF="#RuleElement">rule elements</A>. Each rule element identifies some <A HREF="#Token">token</A>, which +can be either a <A HREF="#Terminal">terminal token</A> or <A HREF="#Nonterminal">nonterminal token</A>. It is "matched" +by a corresponding sequence of tokens in the input stream to the <A HREF="#Parser">parser</A>. The +left side of the production identifies one or more nonterminal tokens, or +<A HREF="#ReductionToken">reduction tokens</A>, to which the rule reduces when matched. If there is more than +one reduction token, there should be a <A HREF="#ReductionProcedure">reduction procedure</A> to make the choice. +<P></P> +</DD> + +<DT><B><A NAME="GrammarToken">Grammar Token</A></B></DT> +<DD> The grammar <A HREF="#Token">token</A> is the token which represents the "top level" +in your grammar. Some people refer to it as the "goal" or "goal +token" and others as the "start token". Whatever it is called, it +is the single token which describes the complete input to your <A HREF="#Parser">parser</A>. +<P></P> +</DD> + +<DT><B><A NAME="SetIntersection">Intersection</A></B></DT> +<DD> In set theory, the intersection of two sets, A and B, is defined to be the +set of all elements of A which are also elements of B. In an AnaGram syntax +file, the intersection of two <A HREF="#CharacterSets">character sets</A> is represented with the "&" +operator. The intersection operator has lower <A HREF="#Precedence">precedence</A> than the complement +operator, but higher precedence than the <A HREF="#SetUnion">union</A> and <A HREF="#SetDifference">difference</A> operators. The +intersection operator is <A HREF="#Associativity">left associative</A>. +<P></P> +</DD> + +<DT><B><A NAME="LALRParsing">LALR parsing</A></B></DT> +<DD> LALR, or "lookahead LR" parsing, a slightly less powerful +variant of <A HREF="#LRParsing">LR parsing</A>, produces much smaller parsing tables. Thus an LALR +<A HREF="#Parser">parser</A> has very significant advantages in size over its LR counterpart. It is +possible to construct <A HREF = "#Grammar">grammar</A>s which can be parsed by an LR parser but cannot be +parsed by an LALR parser. That is, the LALR parser will find these grammars to +be ambiguous and the LR parser will not. In practice, however, such grammars +seem to be rare and the ambiguities can usually be eliminated by minor revisions. +<P> +AnaGram generates LALR parsers and, in addition, uses LALR parsing to interpret +syntax files. +</P><P></P> +</DD> + +<DT><B><A NAME="LexicalScanner">Lexical Scanner</A></B></DT> +<DD> A lexical scanner is a program or procedure used as a preprocessor for a +<A HREF="#Parser">parser</A>. It scans input characters and lumps them together into <A HREF="#Token">tokens</A>. Many +systems for generating parsers, most notably YACC, can scarcely be used without +a lexical scanner. +<P> +AnaGram, although it can support the use of a lexical scanner, does not need a +lexical scanner for most practical problems. AnaGram avoids the need for a +lexical scanner by using <A HREF="#CharacterSets">character sets</A> and disregard and lexeme statements. +</P><P> +If your problem does in fact need a lexical scanner, you can use AnaGram itself +to write it, so you don't need to know different languages for the scanner and +for the parser. +</P> +<P></P> +</DD> + +<DT><B><A NAME="Lookahead">Lookahead Token</A></B></DT> +<DD> The current input <A HREF="#Token">token</A> to a <A HREF="#Parser">parser</A> is often called the "lookahead" +token. In many situations, it is not shifted into the input buffer immediately, +but acts like a catalyst to cause a number of rules to be reduced before it is +eventually shifted in. +<P></P> +</DD> + +<DT><B><A NAME="LRParsing">LR Parsing</A></B></DT> +<DD> An LR(k) <A HREF="#Parser">parser</A> is a type of deterministic bottom-up shift-reduce parser +using at most k lookahead symbols to determine its next <A HREF="#ParserAction">action</A> at any point in +the parse. If (k) is omitted, then k is assumed to be 1. Discussion of the +technical details of LR(k) parsing is beyond the scope of this manual. The +interested reader may consult any of a variety of works covering the subject. +<P> +AnaGram produces parsers which employ a variant of LR parsing known as LALR +parsing. It is not necessary to understand the details of LR and LALR parsing +theory in order to use AnaGram. +</P> +<P></P> +</DD> + +<DT><B><A NAME="MarkedRule">Marked Rule</A></B></DT> +<DD> A "marked rule" is a <A HREF="#GrammarRule">grammar rule</A> + together with a "marked token" +that indicates how much of the rule has already been matched and what input +should be expected, starting with the marked token, if the remainder of the rule is to be matched. Each <A HREF="#Parser">parser</A> +state is defined by a small number of <A HREF="#CharacteristicRule">characteristic rules</A> which indicate what +matches have already been made and what input can be expected. Note that when a +state has more than one characteristic rule, any two characteristic rules with +the same number of <A HREF="#Token">tokens</A> to the left of the marked token +match identically up to the +marked token. If one rule has fewer tokens to the left than the other, all of its tokens +preceding the marked token match the corresponding tokens of the longer rule +exactly. +<P> +When marked rules are displayed in AnaGram windows, the marked token is displayed in +a distinctive font which the developer can select. The default is italic. +</P> +<P></P> +</DD> + +<DT><B><A NAME="NonAssociative">Non-associative</A></B></DT> +<DD> A <A HREF="#BinaryOperator">binary operator</A>, say #, is said to be non-associative if the sequence x +# y # z is not permissible. If this sequence is to be interpreted as (x#y)#z, +the operator is said to be <B>left associative</B>. If the sequence is to be +interpreted as x#(y#z), the operator is said to be <B>right associative</B>. (See +<A HREF="#Associativity">associativity</A>.) +<P></P> +</DD> + +<DT><B><A NAME="Nonterminal">Nonterminal Token</A></B></DT> +<DD> A "nonterminal token" is a <A HREF="#Token">token</A> which results from reducing the +sequence of tokens which match a <A HREF="#GrammarRule">grammar rule</A> and replacing them with the +appropriate <A HREF="#ReductionToken">reduction token</A>. Nonterminal tokens are to be distinguished from +<A HREF="#Terminal">terminal tokens</A> or input tokens. +<P></P> +</DD> + +<DT><B><A NAME="NullProduction">Null Production</A></B></DT> +<DD> A "null production" is one that has no <A HREF="#Token">tokens</A> on the right side +whatsoever. Null <A HREF="#Production">production</A>s essentially are identified by the first following +input token. Null productions are extremely convenient syntactic elements when +you wish to make some input optional. For example, suppose that you wish to +allow an optional semicolon at some point in your <A HREF = "#Grammar">grammar</A>. You could write the +following pair of productions: +<PRE> + optional semicolon -> | ';' +</PRE> Note that a null production can never follow a '|'. The above could also +be written on multiple lines thus: +<PRE> + optional semicolon + -> + -> ';' +</PRE> You can always rewrite your grammar to eliminate null productions if you +wish, but you usually pay a price in conciseness and clarity. Sometimes, +however, it is necessary to do such a rewrite in order to avoid <A HREF="#Conflict">conflicts</A>, to +which null productions are especially prone. +<P> +If you have a null production with no <A HREF="#ReductionProcedure">reduction procedure</A> specified, your <A HREF="#Parser">parser</A> +will automatically assign the value zero to the <A HREF="#ReductionToken">reduction token</A>. +</P><P> +Null productions can be generated by <A HREF="#VirtualProduction">virtual productions</A>. +</P> +<P></P> +</DD> + +<DT><B><A NAME="ParameterAssignment">Parameter Assignment</A></B></DT> + +<DD>In any <A HREF="#GrammarRule"> grammar rule</A>, the +<A HREF="#SemanticValue">semantic value</A> of any +<A HREF="#RuleElement">rule element</A> may be passed to a +<A HREF="#ReductionProcedure">reduction procedure</A> +by means of a parameter assignment. In AnaGram this is particularly +convenient since you can use an ordinary variable name for the +parameter. Simply follow the +rule element with a colon and a C variable name. The C +variable name can then be used in the reduction +procedure to reference the semantic value of the token +it is attached to. AnaGram will automatically provide +necessary declarations. + +Here are some examples of rule elements with parameter +assignments: +<PRE> + '0-9':d + integer:n + expression:x + declaration : declaration_descriptor +</PRE> +</DD> + +<DT><B><A NAME="Parser">Parser</A></B></DT> +<DD> A parser is a program, or more likely a procedure within a program, which +scans a sequence of input characters or input <A HREF="#Token">tokens</A> and accumulates them in an +input buffer or stack as determined by a set of <A HREF="#Production">productions</A> which constitute a +<A HREF = "#Grammar">grammar</A>. When the parser discovers a sequence of tokens as defined by a <A HREF="#GrammarRule">grammar +rule</A>, or right side of a production, it "reduces" the sequence to a +single <A HREF="#ReductionToken">reduction token</A> as defined by the left side of the <A HREF="#GrammarRule">grammar rule</A>. This +<A HREF="#Nonterminal">nonterminal token</A> now replaces the tokens which matched the grammar rule and the +search for matches continues. If an input token is encountered which will not +yield a match for any rule, it is considered a syntax error and some kind of +error recovery may be required to continue. If a match, or reduction action, +yields the grammar token, sometimes called the goal token or start token, the +parser deems its work complete and returns to whatever procedure may have called +it. +<P> +Tokens may have <A HREF="#SemanticValue">semantic values</A>. If the +input values configuration switch is on, your parser will expect +semantic values to be provided by the input process along with the +token identification code. If the input values switch is off, your +parser will take the ascii value of the input character, that is, the +actual input code, as the value of the character. When the parser <A +HREF="#ReduceAction">reduces</A> a <A +HREF="#Production">production</A>, it can call a <A +HREF="#ReductionProcedure">reduction procedure</A> or <A +HREF="#SemanticAction">semantic action</A> to analyze the values of +the constituent tokens. This reduction procedure can then return a +value which characterizes the reduced token. +</P> +<P></P> +</DD> + +<DT><B><A NAME="ParserGenerator">Parser Generator</A></B></DT> +<DD> +A parser generator, such as AnaGram, is a program that +converts a <A HREF="#Grammar">grammar</A>, a rule-based description of the +input to a program, into a conventional, procedural +module called a <A HREF="#Parser">parser</A>. The parsers AnaGram generates +are simple C modules which can be compiled on almost +any platform. AnaGram parsers are also compatible with +C++. +<P></P> +</DD> + +<DT><B><A NAME="ParserStateStack">Parser State Stack</A></B></DT> +<DD> +The parser state stack is a stack maintained by an AnaGram +parser and which is an integral part of the parsing +process. At any point in the parse of your input +stream, the parser state stack provides a list of the +states which were traversed in order to arrive at the +current state. + +<P></P> +</DD> + + +<DT><B><A NAME="ParserValueStack">Parser Value Stack</A></B></DT> +<DD>In parallel with the <A HREF="#ParserStateStack">parser state stack</A>, +an AnaGram <A HREF="#Parser"> parser</A> +maintains a <I>value stack</I> each entry of +which corresponds to the <A HREF="#SemanticValue">semantic value</A> of the token +identified at that state. Since the semantic values of +different tokens might well have different data types, +AnaGram gives you the opportunity, in your syntax +file, to define the data type for any token. AnaGram +then builds a typedef statement creating a data type +which is a union of the all the types you have defined. +AnaGram creates the name for this data type by +appending "_vs_type" to the name of the parser. AnaGram uses +this data type to define the value stack. +<P></P> +</DD> + + +<DT><B><A NAME="ParsingEngine">Parsing Engine</A></B></DT> +<DD> A <A HREF="#Parser">parser</A> consists of three basic components: A set of syntax tables, a set +of <A HREF="#ReductionProcedure">reduction procedures</A> and a parsing engine. The parsing engine is the body of +code that interprets the parsing table, invokes input functions, and calls the +reduction procedures. The Build Parser command configures a parsing engine +according to the implicit requirements of the syntax specification and according +to the explicit values of the configuration parameters. +<P> +The parsing engine itself is a simple automaton, characterized by a set of +states and a set of inputs. The inputs are the <A HREF="#Token">tokens</A> of your <A HREF = "#Grammar">grammar</A>. Each +state is represented by a list of tokens which are admissible in that state and +for each token an <A HREF="#ParserAction">action</A> to perform and a parameter which further defines the +action. In a traditional <A HREF="#LALRParsing">LALR parser</A>, there are only four <A HREF="#ParserAction">actions</A>: the <A HREF="#ShiftAction">shift +action</A>, the <A HREF="#ReduceAction">reduce action</A>, the <A HREF="#AcceptAction">accept action</A> and the <A HREF="#ErrorAction">error action</A>. AnaGram, in +doing its <A HREF="#GrammarAnalysis">grammar analysis</A>, identifies a number of special cases, and creates a +number of extra actions which make for faster processing, but which can be +represented as combinations of these primitive actions. +</P><P> +When a shift action is performed, the current state number is pushed onto the +parser state stack and the new state number is determined by the current state +number and the current lookahead token. Different tokens cause different new +states. +</P><P> +When a reduce action is performed, the length of the rule being reduced is +subtracted from the parser stack index and the new state number is read from the +top of the parser state stack. The <A HREF="#ReductionToken">reduction token</A> for the rule being reduced is +then used as an input token. +</P><P> +Each state in the grammar, with the exception of state zero, has a +<A HREF="#CharacteristicToken">characteristic token</A> which must have been recognized in order to jump to that +state. Therefore, the parser state stack, which is essentially a list of state +numbers, can also be thought of as a list of token numbers. This is the list of +tokens that have been seen so far in the parse of your input stream. Some of +these tokens, of course, may be <A HREF="#Nonterminal">nonterminal tokens</A> which have resulted from +reducing other sequences of tokens. +</P> +<P></P> +</DD> + +<DT><B><A NAME="SetPartition">Partition</A></B></DT> +<DD> If you use <A HREF="#CharacterSets">character sets</A> in your <A HREF = "#Grammar">grammar</A>, AnaGram will compute a "partition" +of the <A HREF="#Universe">character universe</A>. This partition is a collection of non-overlapping +character sets such that every one of the sets you have defined can be written +as a <A HREF="#SetUnion">union</A> of partition sets. Each partition set is identified by a unique +reference number called the partition set number. +<P> +Each partition set is assigned a unique <A HREF="#Token">token</A>. If one of your character sets +requires more than one partition set to represent it, AnaGram will create +appropriate <A HREF="#Production">productions</A> and add them to your grammar so your <A HREF="#Parser">parser</A> can make the +necessary distinctions. +</P> +<P></P> +</DD> + +<DT><B><A NAME="Precedence">Precedence</A></B></DT> +<DD> "Precedence" is an attribute of <A HREF="#BinaryOperator">binary operators</A> and +<A HREF = "#UnaryOperator">unary +operators</A> which can be used to resolve <A HREF="#Conflict">conflicts</A>. Operator precedence is also +referred to as <A HREF = "#Binding">binding</A>. Suppose that # and @ are two binary operators. The +question is how to interpret x # y @ z. If # has higher precedence than @, it is +interpreted as (x#y)@z. On the other hand if # has lower precedence than @, it +would be x#(y@z). If # and @ have the same precedence then the question must be +resolved by <A HREF="#Associativity">associativity</A>. +<P> +Note that all operators at the same precedence level must have the same +associativity. +</P><P> +The situation is somewhat simpler for unary operators. If # and @ were both +prefix operators, or both suffix operators, there would be no ambiguity in +interpretation, since neither # @ x and x # @ offer more than one possible +interpretation. The only difficulty arises when both a prefix and a suffix +operator are applied to the same operand. +</P><P> +Suppose that # is a prefix operator and @ is a suffix operator. The question +then is how to interpret # x @. If # has higher precedence than @, it would be +interpreted as (# x) @. On the other hand, if # has lower precedence than @, it +would be interpreted as # (x @). If # and @ have the same precedence then the +question must be resolved by associativity. +</P> +<P></P> +</DD> + +<DT><B><A NAME="Production">Production</A></B></DT> +<DD> Productions are the mechanism used to describe how complex input +structures are built up out of simpler ones. Each production has a left side and +a right side. The right side, or <A HREF="#GrammarRule">grammar rule</A>, +is a sequence of <A HREF="#RuleElement">rule elements</A>, +which may represent either <A HREF="#Terminal">terminal tokens</A> or +<A HREF="#Nonterminal">nonterminal tokens</A>. The left side +is a list of <A HREF="#ReductionToken">reduction tokens</A>. +In most cases there would be only a single +reduction token. Productions with more than one <A HREF="#Token">token</A> +on the left side are +called <A HREF = "#SemanticallyDetermined">semantically determined productions</A>. +<p>The +"<CODE>-></CODE>" symbol is used +to separate the left side from the right side. + +If you have several +productions with the same left hand side, you can avoid +rewriting the left hand side either by using '|' or by +using another "->". +<p> + +A <A HREF="#NullProduction"> null production</A>, or empty right hand side, cannot +follow a '|'. +<P> +Productions may be written thus: +<pre> + name + -> letter + -> name, digit +</pre> +This could also be written +<pre> + name -> letter | name, digit +</pre> +In order to accommodate semantic analysis of the data, +you may attach to any grammar rule a <A HREF="#ReductionProcedure">reduction +procedure</A> which will be executed when the rule is +identified. Each token may have a <A HREF="#SemanticValue"> semantic value</A>. By +using <A HREF="#ParameterAssignment"> parameter assignments</A>, you may provide the +reduction procedure with access to the semantic values of +tokens that comprise the grammar rule. When it finishes, the +reduction procedure may return a value which will be +saved on the <A HREF="#ParserValueStack"> parser value stack</A> as the semantic value of the +<A HREF="#ReductionToken">reduction token</A>. + +<P></P> +</DD> + +<DT><B><A NAME="ReduceAction">Reduce Action</A></B></DT> +<DD> The reduce action is one of the four <A HREF="#ParserAction">actions</A> of a traditional parsing +engine. The reduce action is performed when the <A HREF="#Parser">parser</A> has succeeded in matching +all the elements of a <A HREF="#GrammarRule">grammar rule</A> and the next input <A HREF="#Token">token</A> is not erroneous. +Reducing the grammar rule amounts to subtracting the length of the rule from the +parser stack index, identifying the <A HREF="#ReductionToken">reduction token</A>, stacking its +<A HREF="#SemanticValue">semantic value</A> +and then doing a <A HREF="#ShiftAction">shift action</A> with the reduction token as though it had been +input directly. +<P></P> +</DD> + +<DT><B><A NAME="ReduceReduce">Reduce-Reduce Conflict</A></B></DT> +<DD> A <A HREF = "#Grammar">grammar</A> has a "reduce-reduce" conflict at some state if a +single <A HREF="#Token">token</A> turns out to be a <A HREF="#ReducingToken">reducing token</A> for more than one <A HREF="#CompletedRule">completed rule</A>. +<P></P> +</DD> + +<DT><B><A NAME="ReducingToken">Reducing Token</A></B></DT> +<DD> In a state with more than one <A HREF="#CompletedRule">completed rule</A>, your <A HREF="#Parser">parser</A> must be able to +determine which one was actually found. AnaGram deals with this problem by +looking at all the states the parser will branch to once each rule is reduced. +The acceptable input <A HREF="#Token">tokens</A> for those states are the "reducing tokens" +for the completed rules in the state under investigation. If there is a single +token which is a reducing token for more than one rule, then the <A HREF = "#Grammar">grammar</A> is said +to have a <A HREF="#ReduceReduce">reduce-reduce conflict</A> at that state. If in a particular state there +is both a <A HREF="#ShiftAction">shift action</A> and a <A HREF="#ReduceAction">reduce action</A> for the same token the grammar is +said to have a <A HREF="#ShiftReduce">shift-reduce conflict</A> in that state. +<P> +A reducing token is not the same as a <A HREF="#ReductionToken">reduction token</A>. +</P> +<P></P> +</DD> + +<DT><B><A NAME="ReductionProcedure">Reduction Procedure</A></B></DT> +<DD> A "reduction procedure" is a function you write which your +<A HREF="#Parser">parser</A> executes when it has identified the <A HREF="#GrammarRule">grammar rule</A> to which the reduction +procedure is attached in your <A HREF = "#Grammar">grammar</A>. There are two formats for reduction +procedures, depending on the size and complexity of the procedure. +<P></P> +</DD> + +<DT><B><A NAME="ReductionToken">Reduction Token</A></B></DT> +<DD> A <A HREF="#Token">token</A> which appears on the left side of a <A HREF="#Production">production</A> is called a +reduction token. It is so called because when the <A HREF="#GrammarRule">grammar rule</A> on the right side +of the production is matched in the input stream, your <A HREF="#Parser">parser</A> will <A HREF="#ReduceAction">"reduce"</A> +the sequence of tokens which matches the rule by replacing the sequence of +tokens with the reduction token. Note that, if more than one reduction token is +specified, your <A HREF="#ReductionProcedure">reduction procedure</A> should choose the exact one. If it does not, +your parser will use the leftmost syntactically correct one in the list as the default. +<P> +A reduction token is not the same as a <A HREF="#ReducingToken">reducing token</A>. +</P> +<P></P> +</DD> + +<DT><B><A NAME="Resynchronization">Resynchronization</A></B></DT> +<DD> "Resynchronization" is the process of getting your <A HREF="#Parser">parser</A> back +in step with its input after encountering a syntax error. As such, it is one +method of error recovery. Of course, you would resynchronize only if it +is necessary to continue after the error. There are several options available +when using AnaGram. You could use the auto resynch option, which causes AnaGram +to incorporate an automatic resynchronizing procedure into your parser, or you +could use the error resynch option, which is similar to the technique used by +YACC programmers. +</DD> + +<P></P> +<DT><B><A NAME="RuleElement">Rule Element</A></B></DT> +<DD> A <A HREF="#GrammarRule">grammar rule</A> is a list of "rule elements", separated by +commas. Rule elements may be <A HREF="#Token">token</A> names, <A HREF="#CharacterSets">character sets</A>, keywords, immediate +actions, or <A HREF="#VirtualProduction">virtual productions</A>. When AnaGram encounters a rule element for +which no token presently exists, it creates one. +<P></P> +</DD> + +<DT><B><A NAME="SemanticAction">Semantic Action</A></B></DT> +<DD> +A semantic action is a piece of C or C++ code that is executed when +a <A HREF="#GrammarRule">grammar rule</A> has been identified by a +<A HREF="#Parser">parser</A>. Semantic actions are also called +<A HREF="#ReductionProcedure">reduction procedures</A>, since they +are executed when a grammar rule is <A +HREF="#ReduceAction">reduced</A> to the token on the left side of the +<A HREF="#Production">production.</A> +<P></P> +</DD> + +<DT><B><A NAME="SemanticValue">Semantic Value</A></B></DT> +<DD> +<A HREF="#Token">Tokens</A>, whether <A HREF="#Terminal">terminal</A> +or <A HREF="#Nonterminal">nonterminal</A>, may have a semantic value. +In the case of terminal tokens, this may be a value assigned by the +<A HREF="#LexicalScanner">lexical scanner</A> or, if the parser is +using direct character input, it will be the ascii value of the +character itself. The values of nonterminal tokens are created by +<A HREF="#ReductionProcedure">reduction procedures</A>. As a parse +progresses, token values are <A HREF="#ShiftAction">shifted</A> +onto the stack, so that in a reduction procedure, the values of the +tokens that comprise the <A HREF="#GrammarRule">grammar rule</A> that +is being <A HREF="#ReduceAction">reduced</A> are available +for inspection. +<P></P> +</DD> + +<DT><B><A NAME="SemanticallyDetermined">Semantically Determined Production</A></B></DT> +<DD> A "semantically determined production" is one which has more +than one <A HREF="#ReductionToken">reduction token</A> specified on the left side of the production. You would +write such a <A HREF="#Production">production</A> when the reduction tokens are syntactically +indistinguishable. The <A HREF="#ReductionProcedure">reduction procedure</A> may then specify which of the listed +reduction tokens the <A HREF="#GrammarRule">grammar rule</A> is to reduce to based on semantic +considerations. If there is no reduction procedure, or the reduction procedure +does not specify a reduction token, the <A HREF="#Parser">parser</A> will use the leftmost syntactically correct token. + +<P></P> +</DD> + +<DT><B><A NAME="SetExpression">Set Expression</A></B></DT> +<DD> A set expression is an algebraic expression used to define a <A HREF="#CharacterSets">character set</A> +in terms of individual characters, ranges of characters, and/or other sets of +characters as constructed using <A HREF="#SetComplement">complements</A>, <A HREF="#SetUnion">unions</A>, <A HREF="#SetIntersection">intersections</A>, and +<A HREF="#SetDifference">differences</A>. +<P></P> +</DD> + +<DT><B><A NAME="ShiftAction">Shift Action</A></B></DT> +<DD> The shift action is one of the four <A HREF="#ParserAction">actions</A> of a traditional parsing +engine. The shift action is performed when the input <A HREF="#Token">token</A> matches one of the +acceptable input tokens for the current parser state. The +<A HREF="#SemanticValue">semantic value</A> of the +token and the current state number are stacked, the parser stack index is +incremented and the state number is set to a value determined by the previous +state and the input token. +<P></P> +</DD> + +<DT><B><A NAME="ShiftReduce">Shift-Reduce Conflict</A></B></DT> +<DD> A "shift-reduce" conflict occurs if in some state there is a +single <A HREF="#Token">token</A> that should be shifted, because it is legitimate input for one of +the rules of the state, but should also be used to reduce some other rule +because it is a <A HREF="#ReducingToken">reducing token</A> for that rule. +<P></P> +</DD> +<DT><B><A NAME="SyntaxDirectedParsing">Syntax Directed Parsing</A></B></DT> +<DD> +<P>Syntax directed parsing, or formal parsing, is an +approach to building <A HREF="#Parser">parsers</A> based on formal language +theory. Given a suitable description of a language, +called a <A HREF="#Grammar">grammar</A>, there are algorithms which can be +used to create parsers for the language automatically. +In this context, the set of all possible inputs to a +program may be considered to constitute a language, and +the rules for formulating the input to the program +constitute the grammar for the language. + +<P>The parsers built from a grammar have the advantage +that they can recognize any input that conforms to the +rules, and can reject as erroneous any input that fails +to conform. + +<P>Since the program logic necessary to parse input is +often extremely intricate, programs which use formal +parsing are usually much more reliable than those built +by hand. They are also much easier to maintain, since +it is much easier to modify a grammar specification +than it is to modify complex program logic. + +<P></P> +</DD> + +<DT><B><A NAME="Terminal">Terminal Token</A></B></DT> +<DD> A "terminal token" is a <A HREF="#Token">token</A> which does not appear on the left +side of a <A HREF="#Production">production</A>. It represents, therefore, a basic unit of input to your +<A HREF="#Parser">parser</A>. If the input to your parser consists of ascii characters, you may define +terminal tokens explicitly as ascii characters or as <A HREF="#CharacterSets">sets of ascii characters</A>. +If you have an input procedure which produces numeric codes, you may define the +terminal tokens directly in terms of these numeric codes. +<P></P> +</DD> + +<DT><B><A NAME="Token">Token</A></B></DT> +<DD> Tokens are the units with which your <A HREF="#Parser">parser</A> works. There are two kinds of +tokens: <A HREF="#Terminal">terminal tokens</A> and <A HREF="#Nonterminal">nonterminal tokens</A>. These latter are identified by +the parser as sequences of tokens. The grouping of tokens into more complex +tokens is governed by the <A HREF="#GrammarRule">grammar rules</A>, or <A HREF="#Production">productions</A> in your <A HREF = "#Grammar">grammar</A>. In your +grammar, tokens may be denoted by token names, by <A HREF="#VirtualProduction">virtual productions</A>, by +immediate actions, by explicit character representations, or by expressions +which yield <A HREF="#CharacterSets">character sets</A>. +<P> +A "token" should be thought of more or less as being something like a +cloakroom token. Imagine you had taken a kindergarten class to the museum and +all the kids checked their coats. Each one gets a token. What is the probability +of losing one? You take all of the tokens from the children, put them in a paper +bag, tie it up, and check it. You get one token in return. We would say that the +kids' coats represent the actual input to the system, their individual tokens +are the <A HREF="#Terminal">terminal tokens</A>, and your one token which enables you to redeem the +paper bag is a <A HREF="#Nonterminal">nonterminal token</A>. Nonterminal tokens are just single token +numbers to represent the result of compacting a sequence of more primitive +tokens. +</P><P> +Actually, the word "token" is commonly used to refer to a number of +apparently different things. The tokens you use in a grammar are abstract. The +tokens identified by a parser are concrete instances of the abstract tokens in +the grammar. Furthermore, we have to identify tokens in some manner, so we have +token names and token numbers with which to refer to particular tokens. As a +result the word "token", in any particular context, may refer directly +to either an abstract token, or a concrete instance of an abstract token, or may +refer only indirectly to the token by means of a token name or token number. +</P><P> +As an example, consider a C compiler which has a lexical scanner to identify +input tokens. According to Kernighan and Ritchie, there are six classes of C +tokens: identifiers, keywords, constants, string literals, operators, and other +separators. Consider a particular token, a hexadecimal constant, 0XF00. When the +lexical scanner hands this over to the parser, it has to provide an +identification code which says what kind of token it is, a hexadecimal constant, +as well as the +<A HREF="#SemanticValue">"semantic value"</A> of the token, +3840. The identification code is usually an integer value, +determined by the designer of the lexical scanner, which by agreement +with the designer of the parser uniquely represents a hexadecimal +constant. This identification code can be thought of as an external +token number. + +</P><P> +The grammar for the parser, on the other hand, has a token, "HEXconstant", +let us say, to represent the abstract usage of hexadecimal constants in the C +language. When AnaGram, or any other parser generator for that matter, analyzes +the grammar, it assigns its own reference number to identify "HEXconstant". +This reference number can be thought of as an internal token number to contrast +it with the external token number provided by the lexical scanner. The parsers +AnaGram builds contain a table, ag_ tcv, which provides a translation from the +external to the internal token number. Both of these token numbers, of course, +differ from the semantic value of the token, in this case, 3840. +</P><P> +Even when an AnaGram parser is using simple character input, the word "token" +may represent several different things. Suppose a grammar has a token 'A-Z', +which represents the set of upper case letters. There will be an internal token +number, assigned by AnaGram, which identifies the (abstract) token in parser +tables. The actual (concrete) input token to a parser will, however, be a single +letter, say "Q". In this case, the ascii value of the character "Q", +81, is used both as the external token number and as the semantic value of the +token. +</P><P></P> +</DD> + +<DT><B><A NAME="UnaryOperator">Unary Operator</A></B></DT> +<DD>A "unary operator" is a <A HREF="#Token">token</A> which represents +some sort of operation which operates on a single entity to produce a new +resultant entity. It is normally represented by a symbol which is written either +preceding the operand or following the operand. In the first case it is called a +"prefix operator" and in the second case a "suffix operator". +</P><P></P> +</DD> + +<DT><B><A NAME="SetUnion">Union</A></B></DT> +<DD> +The union of two sets is the set of all elements that are to be found in one or +another of the two sets. In an AnaGram syntax file the union of two <A HREF="#CharacterSets">character +sets</A> A and B is represented using the plus sign, as in A + B. The union operator +has the same <A HREF="#Precedence">precedence</A> as the <A HREF="#SetDifference">difference</A> operator: lower than that of +<A HREF="#SetIntersection">intersection</A> and <A HREF="#SetComplement">complement</A>. The union operator is <A HREF="#Associativity">left associative</A>. +<P></P> +</DD> + +<DT><B><A NAME="VirtualProduction">Virtual Production</A></B></DT> +<DD> +<P>Virtual productions are a special short hand +representation of <A HREF="#GrammarRule">grammar rules</A> which can be used to +indicate a choice of inputs. They are an important +convenience, especially useful when you are first +building a <A HREF="#Grammar">grammar</A>. + +<P>AnaGram rewrites virtual productions, so that when you +look at the syntax tables in AnaGram, there will be +actual <A HREF="#Production">productions</A> replacing the virtual productions. + +<P>A virtual production appears as one of the <A HREF="#RuleElement">rule +elements</A> in a <A HREF="#GrammarRule">grammar rule</A>, i.e. as one of the members +of the list on the right side of a production. + +<P>The simplest virtual production is the "optional" +<A HREF="#Token">token</A>. If x is an arbitrary token, x? can be used to +indicate an optional x. + +<P>Related virtual productions are x... and x?... where +the three dots indicate repetition. x... represents an +arbitrary number of occurrences of x, but at least one. +x?... represents zero or more occurrences of x. + +<P>The remaining virtual productions use curly or square +brackets to enclose a sequence of rules. The brackets +may be followed variously by nothing, a string of three +dots, or a slash, to indicate the choices to be made +from the rules. Note that rules may be used, not merely +tokens. + +<P>If r1 through rn are a set of grammar rules, then +<PRE> {r1 | r2 | ... | rn} </PRE> +is a virtual production that allows a choice of exactly +one of the rules. Similarly, +<PRE> {r1 | r2 | ... | rn}... </PRE> +is a virtual production that allows a choice of one or +more of the rules. And, finally, +<PRE> {r1 | r2 | ... | rn}/... </PRE> +is a virtual production that allows a choice of one or +more of the rules subject to the side condition that +rules must alternate, that is, that no rule can follow +itself immediately without the interposition of some +other rule. This is a case that is not particularly +easy to write by hand, but is quite useful in a number +of contexts. + +<P>If the above virtual productions are written with [ ] +instead of { }, they all become optional. [ ] is an +optional choice, [ ]... is zero or more choices, and +[ ]/... is zero or more alternating choices. + +<P><A HREF="#NullProduction">Null productions</A> are not permitted in virtual +productions in those cases where they would cause an +intrinsic ambiguity. + +<P>You may use a definition statement to assign a name to +a virtual production. + +</DD> + +</DL> + + +<P> +<BR> + +<IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" + WIDTH=1010 HEIGHT=2 > +<P> +<IMG ALIGN="right" SRC="images/pslrb6d.gif" ALT="Parsifal Software" + WIDTH=181 HEIGHT=25> +<BR CLEAR="right"> +Back to <A HREF="index.html">Index</A> +<P> +<ADDRESS><FONT SIZE="-1"> + AnaGram parser generator - documentation<BR> + Glossary<BR> + Copyright © 1993-1999, Parsifal Software. <BR> + All Rights Reserved.<BR> +</FONT></ADDRESS> + +</BODY> +</HTML> + +