diff doc/manual/gl.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author: David A. Holland
date: Sat, 22 Dec 2007 17:52:45 -0500
\chapter{Glossary}

% XXX Add:
% parameter assignment (?)
% parser (state) stack
% parser value stack
% (may be defined in some other version of the glossary)
% (and appear in the help)
% (there are apparently dangling links to these in some version)

\index{Action}
\index{Parser action}
\index{Parsing engine}
\paragraph{Action.}
A ``parser action'' is one of the basic executable elements of a
parsing engine. In a traditional parsing engine there are four
actions: the shift action, the reduce action, the accept action, and
the error action. The parsing engines which AnaGram generates use a
substantially greater number of actions, in order to provide certain
shortcuts in the interest of greater speed and greater compactness of
the tables which control the parsing engine.

\paragraph{Accept Action.}
The accept action is one of the four actions of a traditional parsing
engine. It is performed when the parser has succeeded in identifying
the goal token for the grammar. When the parser executes the accept
action, it returns to the calling program.
% , returning the semantic value of the goal token as its value.
The accept action is thus the last action of the parsing engine and
occurs only once for each successful execution of the parser.

\paragraph{Associativity.}
This is a term used to describe binary operators (q.v.). It describes
how to interpret a sequence of operations which all involve the same
operator. Consider $a - b - c$, for instance. Here the convention is
that we subtract $b$ from $a$, and then $c$ from the difference. We
say that subtraction is left associative, since in a sequence of
subtraction operations the leftmost one is performed first. As a
second example, consider exponentiation.
FORTRAN uses
\agcode{**} as an operator to indicate exponentiation. In order to
conform as closely as possible to ordinary mathematical notation, the
expression \agcode{a**b**c} is interpreted to mean that first
\agcode{b} is raised to the power \agcode{c}, and then the resultant
value is used as the power to which \agcode{a} is raised. We say that
exponentiation is right associative since the rightmost of a sequence
of exponentiation operations is performed first. If a programming
language does not allow unparenthesized sequences of a particular
operator, the operator is said to be non-associative.

Another way to view left and right associativity is as implied
parenthesization. Left associative operators are parenthesized from
the left, and right associative operators from the right. Thus
$a - b - c = ((a-b) - c)$ and \agcode{a**b**c = (a**(b**c))}.

AnaGram offers two ways to specify the associativity of binary
operators. The first is to use conventional Backus-Naur Form syntax
descriptions, describing both subtraction and exponentiation with
recursion. Subtraction, since it is left associative, uses left
recursion, and exponentiation, being right associative, uses right
recursion. Thus
\begin{indentingcode}{0.4in}
difference -> value | difference, '-', value
exponential -> value | value, "**", exponential
\end{indentingcode}
could be used to describe differences and exponents.

The second way to specify associativity is to use an ambiguous grammar
and precedence declarations. (See Chapter 9.)

\paragraph{Backtracking.}
To make parse tables more compact and parsers faster, it is common to
use default reductions. In case of error, it is necessary to undo
default reductions before diagnostics can be properly determined. In
AnaGram, this undo operation is called backtracking.
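The behavioral difference between the left and right recursive
definitions of associativity can be mimicked procedurally. This is a
purely illustrative Python sketch (not AnaGram code): left and right
associative folds of the same operand list give different results.

```python
from functools import reduce

def fold_left(op, items):
    # ((a op b) op c): left associative, like subtraction
    return reduce(op, items)

def fold_right(op, items):
    # (a op (b op c)): right associative, like exponentiation
    if len(items) == 1:
        return items[0]
    return op(items[0], fold_right(op, items[1:]))

print(fold_left(lambda a, b: a - b, [10, 4, 3]))   # (10 - 4) - 3 = 3
print(fold_right(lambda a, b: a ** b, [2, 3, 2]))  # 2 ** (3 ** 2) = 512
```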
\index{Backus-Naur form}
\index{BNF}
\paragraph{Backus-Naur Form.}
Backus-Naur Form, or BNF, is a conventional notation for describing
context free grammars. AnaGram uses an extended notation for grammars,
which, except for semantically determined productions, can be shown to
be equivalent to BNF. The term ``BNF'' is often used colloquially to
refer to a grammar specification.
% XXX: shouldn't this say ``any grammar specification''?

AnaGram's syntax specification language differs from BNF in the
following respects:

\begin{enumerate}

\item
In conventional BNF, a symbol not enclosed in angle brackets ($< >$)
is taken to represent itself literally. In AnaGram, literal character
representations must be enclosed in single quotes and literal strings
within double quotes.

\item
In BNF, the right side of a production is simply a list of symbols
without any delimiters or punctuation. AnaGram requires that the rule
elements which make up a grammar rule, or right side, be joined by
commas.

\item
BNF makes no provision for identifying reduction procedures or their
arguments. AnaGram provides both reduction procedures, introduced by
an ``\agcode{=}'' at the end of the production, and named arguments,
introduced by a ``\agcode{:}'' at the end of any token on the right
side of the production.

\item
AnaGram allows virtual productions to be used freely.

\item
BNF is ``pure'' in that if you wish to define a nonterminal token
called ``digit'' to represent the digits from zero to nine, you must
provide ten explicit productions to define it. AnaGram treats the
concept of ``terminal token'' as used in language theory as an
abstraction, and interposes character set functionality between actual
character input and the terminal tokens of BNF.
You can define digit
to be the character range \agcode{'0-9'}, for instance, and AnaGram
will determine and define precisely the minimum number of productions
necessary, taking into account any other usage of the characters
\agcode{'0'} through \agcode{'9'} in your grammar. This makes your
grammar more compact, more manageable, and easier to modify.

\item
AnaGram allows for semantically determined productions, which provide
a significant increment in the ease with which semantic analysis may
be joined to syntactic analysis.

\end{enumerate}

\paragraph{Binary operator.}
A binary operator is an operator that works on two operands to create
a result. It is often written between its operands and is sometimes
called an infix operator. In ordinary programming languages
``\agcode{+}'' and ``\agcode{-}'' are binary operators.

\paragraph{Binding.}
Most programming languages, rather than executing arithmetic operators
in simple left to right order, conform to the traditional conventions
of ordinary algebra, which dictate that, except where parentheses
indicate otherwise, exponentiation is done before multiplication,
multiplication before addition, and addition before comparison. One
says that exponentiation is ``more tightly binding'' than
multiplication, multiplication is more tightly binding than addition,
and so on. The sense of the word here is that the operator binds
together its operands into a single entity. An operand which falls
between two different operators is ``bound'' by the more tightly
binding operator. An operator that is more tightly binding than
another is also said to have higher precedence.

\index{Character sets}
\paragraph{Character sets.}
One of the traditional problems with syntax directed programming is
caused by the fact that most formal language specifications have
productions of the form
\begin{indentingcode}{0.4in}
letter -> a | b | c | d ...
| z
\end{indentingcode}

Since the letters are not often distinguished in the grammar, this
large number of essentially identical productions causes a
correspondingly large number of states in the parser. This problem is
often attacked by using a ``lexical scanner'' which simply delivers a
``letter token'' whose value indicates precisely which letter is
intended. This works fine as long as there is nothing in the grammar
which distinguishes any letter at all. If there is, and there is
usually some special use of \agcode{h}, \agcode{x}, \agcode{q},
\agcode{o}, \agcode{e}, or even \agcode{a-f}, the situation gets more
complicated. Often the result is to abandon the use of syntax directed
programming for elemental units of the language and use a more complex
lexical scanner which identifies names, numbers, key words and
operators. Such lexical scanners are often built using conventional
programming languages or simple pattern recognizing languages such as
\agfile{lex}.

AnaGram avoids this problem by incorporating the notion of character
set into its input specifications. Briefly, the way it works is the
following: You specify the set of characters which make up any
ostensibly terminal token. AnaGram then analyzes the overlaps among
these definitions and creates a minimal set of tokens which match your
specifications exactly. For instance, suppose you have the following
definitions:

\begin{indentingcode}{0.4in}
letter = 'a-z' + 'A-Z'
hex digit = '0-9' + 'a-f' + 'A-F'
\end{indentingcode}

AnaGram will define a token for letter which has two productions:

\begin{indentingcode}{0.4in}
letter -> 'a-f' + 'A-F' | 'g-z' + 'G-Z'
\end{indentingcode}

With this technique, you have the freedom to specify your grammar in
the easiest possible manner, without the penalty of an absurdly large,
slow parser.
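The overlap analysis just described can be illustrated with a small
Python sketch. This is an illustration of the idea, not AnaGram's
actual algorithm, and it uses lowercase letters only for brevity:
given overlapping set definitions, it computes the disjoint cells
from which each defined set can be rebuilt.

```python
def partition(named_sets):
    """Split overlapping sets into disjoint cells (illustrative only)."""
    cells = []
    for s in named_sets.values():
        new_cells, rest = [], set(s)
        for c in cells:
            if c & rest:
                new_cells.append(c & rest)      # part shared with the new set
                if c - rest:
                    new_cells.append(c - rest)  # part outside the new set
                rest -= c
            else:
                new_cells.append(c)
        if rest:
            new_cells.append(rest)              # part in no previous cell
        cells = new_cells
    return cells

letter = set('abcdefghijklmnopqrstuvwxyz')
hex_digit = set('0123456789abcdef')
cells = partition({'letter': letter, 'hex digit': hex_digit})
# Three cells result: 'a-f' (shared), 'g-z', and '0-9'.
print(sorted(''.join(sorted(c)) for c in cells))
```

Each cell corresponds to what the manual calls a partition set; `letter`
is then the union of the `'a-f'` and `'g-z'` cells, exactly as in the
two-production example above.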
Character sets may be specified in terms of ranges of characters, as
in the above example, as unions, denoted by ``\agcode{+}'', as
intersections, denoted by ``\agcode{\&}'', as differences, denoted by
``\agcode{-}'', and as complements, using ``\agcode{\~{}}''. You may
also specify a set consisting of a single character either with a
literal character or with a numeric specification. If you specify a
character numerically, you may use decimal, octal, or hexadecimal
notation, as in C.

\index{Character universe}
\index{Universe}
\paragraph{Character Universe.}
In ordinary set theory, the ``universe'' is the set of all entities
which may be elements of a set. It must be defined so that the
complement of a set means something unambiguous. In AnaGram, the
character universe normally comprises the ASCII characters and the IBM
extended character set, that is, all characters in the range from 0
through 255. If, however, you have defined input characters outside
this range, the character universe will be extended to cover both the
least and the greatest characters you have used in your grammar.
% XXX this should mention character sets, and it should mention that
% it's limited to character sets of size 65536.

\index{Characteristic rules}
\index{Rule}
\paragraph{Characteristic Rule.}
Each parser state is characterized by a particular set of grammar
rules, and for each such rule, a marked token which is the next token
expected. These rules are called the characteristic rules for the
state. In the course of doing grammar analysis, AnaGram determines the
characteristic rules for each parser state. After analyzing your
grammar, you may inspect the \agwindow{State Definition Table} to see
the characteristic rules for any state in your parser.
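The set operators described under ``Character sets'' map directly onto
Python's built-in set operations. The following sketch is illustrative
only (AnaGram implements these internally); it mirrors union,
intersection, difference, and complement against the default 0--255
universe.

```python
UNIVERSE = frozenset(range(256))          # default character universe, 0-255

def union(a, b): return a | b             # AnaGram '+'
def intersection(a, b): return a & b      # AnaGram '&'
def difference(a, b): return a - b        # AnaGram '-'
def complement(a): return UNIVERSE - a    # AnaGram '~', relative to universe

digits = frozenset(range(ord('0'), ord('9') + 1))
lower = frozenset(range(ord('a'), ord('z') + 1))
hex_lower = frozenset(range(ord('a'), ord('f') + 1))

print(len(union(digits, hex_lower)))        # 16 characters: '0-9' + 'a-f'
print(len(intersection(lower, hex_lower)))  # 6 characters: 'a-f'
print(len(difference(lower, hex_lower)))    # 20 characters: 'g-z'
print(len(complement(UNIVERSE)))            # 0: complement of the universe
```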
\index{Characteristic token}
\index{Token}
\paragraph{Characteristic Token.}
Every state in a parser, except state zero, can be characterized by
the one unique token which causes a jump to that state. That token is
called the characteristic token of the state, because to get to that
parser state you must have just seen precisely that token in the
input. Note that several states could have the same characteristic
token.

When you have a list of states, such as is given by the parser state
stack, it is equivalent to a list of characteristic tokens. This list
of tokens is the list of tokens that have been recognized so far by
the parser. Some of these tokens, of course, may be nonterminal tokens
and may thus represent the result of reducing a sequence of previously
recognized tokens.

\index{Complement}
\paragraph{Complement of a set.}
In set theory, the complement of a set, $S$, is the collection of all
elements of the character universe which are not elements of $S$. Using
AnaGram's notation, the complement of \agcode{S} is written
\agcode{\~{}S}. The complement operator has higher precedence than the
difference, intersection, or union operators.

\index{Rule}
\index{Completed rule}
\paragraph{Completed Rule.}
A ``completed rule'' is a characteristic rule whose index is pointing
beyond the last rule element. In other words, it has been completely
matched and will be reduced by the next input.

If a state has more than one completed rule, the decision as to which
to reduce is made based on the next input token. If there is only one
completed rule in a state, it will be reduced by default unless the
default reductions switch has been reset, i.e., turned off.

\index{Conflicts}
\paragraph{Conflict.}
Conflicts arise during grammar analysis when AnaGram cannot
determine how to treat a given input token. There are two sorts of
conflicts: \agterm{shift-reduce} conflicts and \agterm{reduce-reduce}
conflicts.
Conflicts may arise either because the grammar is
inherently ambiguous, or simply because the grammar analyzer cannot
look far enough ahead to resolve the conflict. In the latter case, it
is often possible to rewrite the grammar in such a way as to eliminate
the conflict. Null productions are a common source of conflict.

There are a number of ways to deal with conflicts. If you understand
the conflict well, you may simply choose to ignore it. When AnaGram
encounters a shift-reduce conflict while building parse tables, it
resolves it by choosing the shift action. When AnaGram encounters a
reduce-reduce conflict while building parse tables, it resolves it by
selecting the grammar rule which occurred first in the grammar.

A second way to deal with conflicts is to set operator precedence
parameters. If you set these parameters, AnaGram will use them
preferentially to resolve conflicts. Any conflicts so resolved will be
listed in the \agwindow{Resolved Conflicts} window.

A third way to resolve a conflict is to declare some tokens as
\agparam{sticky} or as \agparam{subgrammar}s. This is particularly
useful for productions whose sole purpose is to skip over
uninteresting input.

The fourth way to deal with conflicts is to rewrite the grammar to
eliminate them. Many people prefer this approach since it yields the
highest level of confidence in the resulting program.

\index{Context free grammar}
\paragraph{Context Free Grammars.}
Context free grammars are grammars wherein the definition of a grammar
unit, or nonterminal token, does not depend on the context in which
the nonterminal is used. That means that if you define ``widget'' in a
certain way, that definition is valid in all contexts in which
``widget'' might occur. Context free grammars can be represented in
Backus-Naur Form. AnaGram's syntax specification language has no
facility for representing grammars which are not context free.
\index{Default reductions}
\index{Reduction}
\paragraph{Default reductions.}
A ``default reduction'' is a parser action which may be used in your
parser in any state which has precisely one completed rule.

If a given parser state has, among its characteristic rules, exactly
one completed rule, it is usually faster to reduce it on any input
than to check specifically for correct input before reducing it. The
only time this default reduction causes trouble is in the event of a
syntax error. In this situation you may get an erroneous reduction.
Normally when you are parsing a file, this is inconsequential because
you are not going to continue semantic action in the presence of
error. But, if you are using your parser to handle real-time
interactive input, you have to be able to continue semantic processing
after notifying your user that he has entered erroneous input. In this
case you would want \agparam{default reductions} to be turned off so
that productions are reduced only when there is correct input.

\index{Difference}
% XXX: \index{Set difference} ?
\paragraph{Difference of two sets.}
In set theory, the difference of two sets, $A$ and $B$, is defined to
be the set of all elements of $A$ that are not elements of $B$. In an
AnaGram syntax file, you represent the difference of two character
sets by using the ``\agcode{-}'' operator. Thus the difference of
\agcode{A} and \agcode{B} is \agcode{A - B}. The difference operator
is left associative.

\paragraph{Error Action.}
The error action is one of the four actions of a traditional parsing
engine. The error action is performed when the parser has encountered
an input token which is not admissible in the current state. The
further behavior of a traditional parser is undefined.

\index{Expansion rule}
\index{Rule}
\paragraph{Expansion Rule.}
In analyzing a grammar, we are often interested in the full range of
input that can be expected at a certain point.
The expansion of a
token or state shows us all the expected input. An expansion yields a
set of marked rules. Looking at the marked token shows us what input
to expect.

The set of expansion rules of a (nonterminal) token shows all the
expected input that can occur whenever the token appears in the
grammar. The set consists of all the grammar rules produced by the
token, plus all the rules produced by the first token of any rule in
the set. The marked token for each expansion rule of a token is
always the first element in the rule.

The expansion of a state consists of its characteristic rules plus the
expansion rules of the marked token in each characteristic rule.

\index{Grammar}
\paragraph{Grammar.}
Traditionally, a grammar is a set of productions which taken together
specify precisely a set of acceptable input sequences in terms of an
abstract set of terminal tokens. The set of acceptable input sequences
is often called the ``language'' defined by the grammar.

In AnaGram, the term grammar also includes configuration sections,
definitions of character sets, virtual productions, etc. that augment
the collection of productions. A grammar is often called a
``syntax'', and the rules of the grammar are often called syntactic
rules.

\paragraph{Grammar Analysis.}
The major function of AnaGram is the analysis of context free grammars
written in a particular variant of Backus-Naur Form.

The analysis of a grammar proceeds in four stages. In the first, the
syntax file is read and a number of tables are built which describe
all of the productions and components of the grammar.

In the second stage, AnaGram analyzes all of the character sets
defined in the grammar, and where necessary, defines auxiliary tokens
and productions.

In the third stage, AnaGram identifies all of the states of the parser
and builds the go-to table for the parser.
In the fourth stage, AnaGram identifies reduction tokens for each
completed grammar rule in each state and checks for conflicts.

\index{Grammar rule}
\paragraph{Grammar Rule.}
A ``grammar rule'' is the right side of a production. It consists of a
sequence of rule elements. Each rule element identifies some token,
which can be either a terminal token or a nonterminal token. It is
``matched'' by a corresponding sequence of tokens in the input stream
to the parser. The left side of the production identifies one or more
nonterminal tokens, or reduction tokens, to which the rule reduces
when matched. If there is more than one reduction token, there should
be a reduction procedure to make the choice.

\index{Grammar token}
\index{Token}
\paragraph{Grammar Token.}
The grammar token is the token which represents the ``top level'' in
your grammar. Some people refer to it as the ``goal'' or ``goal
token'' and others as the ``start token''. Whatever it is called, it
is the single token which describes the complete input to your parser.

\index{Intersection}
\paragraph{Intersection.}
In set theory, the intersection of two sets, $A$ and $B$, is defined
to be the set of all elements of $A$ which are also elements of $B$.
In an AnaGram syntax file, the intersection of two character sets is
represented with the ``\agcode{\&}'' operator. The intersection
operator has lower precedence than the complement operator, but higher
precedence than the union and difference operators. The intersection
operator is left associative.

\index{LALR parsing}
\paragraph{LALR parsing.}
LALR, or ``lookahead LR'', parsing is a somewhat less powerful variant
of LR parsing which produces much smaller parsing tables. Thus an LALR
parser has very significant advantages in size over its LR
counterpart. It is possible to construct grammars which can be parsed
by an LR parser but cannot be parsed by an LALR parser.
That is, the
LALR parser will find these grammars to be ambiguous and the LR parser
will not. In practice, however, such grammars seem to be rare and the
ambiguities can often be eliminated by minor revisions.

AnaGram generates LALR parsers and, in addition, uses LALR parsing to
interpret syntax files.

\index{Lexical scanner}
\paragraph{Lexical Scanner.}
A lexical scanner is a program or procedure used as a preprocessor for
a parser. It scans input characters and lumps them together into
tokens. Many systems for generating parsers, most notably
\agfile{yacc}, can scarcely be used without a lexical scanner.

AnaGram, although it can support the use of a lexical scanner, does
not need one for most practical problems. AnaGram avoids the need for
a lexical scanner by using character sets and disregard and lexeme
statements.

If your problem does in fact need a lexical scanner, you can use
AnaGram itself to write it, so you don't need to know different
languages for the scanner and for the parser.

\index{Lookahead token}
\index{Token}
\paragraph{Lookahead Token.}
The current input token to a parser is often called the ``lookahead''
token. In many situations, it is not shifted into the input buffer
immediately, but acts like a catalyst to cause a number of rules to be
reduced before it is eventually shifted in.

\index{LR parsing}
\paragraph{LR Parsing.}
An LR($k$) parser is a type of deterministic bottom-up shift-reduce
parser using at most $k$ lookahead symbols to determine its next
action at any point in the parse. If $k$ is omitted, it is
assumed to be 1. Discussion of the technical details of LR($k$)
parsing is beyond the scope of this manual. The interested reader may
consult any of a variety of works covering the subject\footnote{See,
for example, Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman.
\textit{Compilers: Principles, Techniques, and Tools}.
(Reading,
Massachusetts: Addison-Wesley, 1988).}.

AnaGram produces parsers which employ a variant of LR parsing known as
LALR parsing. It is not necessary to understand the details of LR and
LALR parsing theory in order to use AnaGram.

\index{Marked rule}
\index{Rule}
\paragraph{Marked Rule.}
A ``marked rule'' is a grammar rule together with a ``marked token''
that indicates how much of the rule has already been matched and what
input should be expected if the remainder of the rule is to be
matched. Each parser state is defined by a small number of
characteristic rules which indicate what matches have already been
made and what input can be expected. Note that when a state has more
than one characteristic rule, any two characteristic rules with the
same number of tokens match identically up to the marked token. Even
if one has fewer tokens to the left than the other, its tokens match
the other exactly up to the marked token.

When marked rules are displayed in AnaGram windows, the marked token
is displayed in a distinctive font which the developer can select.

\paragraph{Non-associative.}
A binary operator, say \agcode{\#}, is said to be non-associative if
the sequence \agcode{x \# y \# z} is not permissible. If this sequence
is to be interpreted as \agcode{(x \# y) \# z}, the operator is said
to be left associative. If the sequence is to be interpreted as
\agcode{x \# (y \# z)}, the operator is said to be right
associative. (See \agterm{associativity}.)

\index{Nonterminal token}
\index{Token}
\paragraph{Nonterminal Token.}
A ``nonterminal token'' is a token which results from reducing the
sequence of tokens which match a grammar rule and replacing them with
the appropriate reduction token. Nonterminal tokens are to be
distinguished from terminal tokens or input tokens.

\index{Null productions}
\index{Production}
\paragraph{Null Production.}
A ``null production'' is one that has no tokens on the right side
whatsoever.
Null productions essentially are identified by the first
following input token. Null productions are extremely convenient
syntactic elements when you wish to make some input optional. For
example, suppose that you wish to allow an optional semicolon at some
point in your grammar. You could write the following pair of
productions:

\begin{indentingcode}{0.4in}
optional semicolon -> | ';'
\end{indentingcode}

Note that a null production can never follow a ``\agcode{|}''.

The above could also be written on multiple lines thus:

\begin{indentingcode}{0.4in}
optional semicolon
 ->
 -> ';'
\end{indentingcode}

You can always rewrite your grammar to eliminate null productions if
you wish, but you usually pay a price in conciseness and
clarity. Sometimes, however, it is necessary to do such a rewrite in
order to avoid conflicts, to which null productions are especially
prone.

If you have a null production with no reduction procedure specified,
your parser will automatically assign the value zero to the reduction
token.

Null productions can be generated by virtual productions.

\index{Parser}
\paragraph{Parser.}
A parser is a program, or more likely a procedure within a program,
which scans a sequence of input characters or input tokens and
accumulates them in an input buffer or stack as determined by a set of
productions which constitute a grammar. When the parser discovers a
sequence of tokens as defined by a grammar rule, or right side of a
production, it ``reduces'' the sequence to a single reduction token as
defined by the left side of the grammar rule. This nonterminal token
now replaces the tokens which matched the grammar rule and the search
for matches continues. If an input token is encountered which will not
yield a match for any rule, it is considered a syntax error and some
kind of error recovery may be required to continue.
If a match, or
reduction action, yields the grammar token, sometimes called the goal
token or start token, the parser deems its work complete and returns
to whatever procedure may have called it.

Tokens may have semantic values. If the \agparam{input values}
configuration switch is on, your parser will expect semantic values to
be provided by the input process along with the token identification
code. If the \agparam{input values} switch is off, your parser will
take the ASCII value of the input character, that is, the actual input
code, as the value of the character. When the parser reduces a
production, it can call a reduction procedure or semantic action to
analyze the values of the constituent tokens. This reduction procedure
can then return a value which characterizes the reduced token.

\paragraph{Parser Generator.}
A parser generator, such as AnaGram, is a program that converts a
grammar, a rule-based description of the input to a program, into a
conventional, procedural module called a parser. The parsers AnaGram
generates are simple C modules which can be compiled on almost any
platform. AnaGram parsers are also compatible with C++.

\index{Parsing engine}
\paragraph{Parsing Engine.}
A parser consists of three basic components: a set of syntax tables, a
set of reduction procedures, and a parsing engine. The parsing engine
is the body of code that interprets the parsing table, invokes input
functions, and calls the reduction procedures. The \agmenu{Build
Parser} command configures a parsing engine according to the implicit
requirements of the syntax specification and according to the explicit
values of the configuration parameters.

The parsing engine itself is a simple automaton, characterized by a
set of states and a set of inputs. The inputs are the tokens of your
grammar.
Each state is represented by a list of tokens which are
admissible in that state and, for each token, an action to perform and
a parameter which further defines the action. In a traditional LR
parser, there are only four actions: the shift action, the reduce
action, the accept action and the error action. AnaGram, in doing its
grammar analysis, identifies a number of special cases, and creates a
number of extra actions which make for faster processing, but which
can be represented as combinations of these primitive actions.

When a shift action is performed, the current state number is pushed
onto the parser state stack and the new state number is determined by
the current state number and the current lookahead token. Different
tokens cause different new states.

When a reduce action is performed, the length of the rule being
reduced is subtracted from the parser stack index and the new state
number is read from the top of the parser state stack. The reduction
token for the rule being reduced is then used as an input token.

Each state in the grammar, with the exception of state zero, has a
characteristic token which must have been recognized in order to jump
to that state. Therefore, the parser state stack, which is essentially
a list of state numbers, can also be thought of as a list of token
numbers. This is the list of tokens that have been seen so far in the
parse of your input stream. Some of these tokens, of course, may be
nonterminal tokens which have resulted from reducing other sequences
of tokens.

\index{Partition}
\paragraph{Partition.}
If you use character sets in your grammar, AnaGram will compute a
``partition'' of the character universe. This partition is a
collection of non-overlapping character sets such that every one of
the sets you have defined can be written as a union of partition
sets. Each partition set is identified by a unique reference number
called the partition set number.
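The shift and reduce actions described under ``Parsing Engine'' can be
sketched as a small table-driven automaton. This is a textbook-style
Python illustration with hand-built tables for the tiny grammar
\agcode{S -> S, 'x' | 'x'}; the details differ from AnaGram's actual
engine (for instance, this sketch pushes the new state rather than the
current one, and real tables are machine-generated).

```python
# Hand-built tables for the grammar  S -> S 'x' | 'x'  ('$' marks end of input).
ACTION = {
    (0, 'x'): ('shift', 1),
    (1, 'x'): ('reduce', 0), (1, '$'): ('reduce', 0),   # S -> x
    (2, 'x'): ('shift', 3),  (2, '$'): ('accept',),
    (3, 'x'): ('reduce', 1), (3, '$'): ('reduce', 1),   # S -> S x
}
RULES = [('S', 1), ('S', 2)]       # (reduction token, rule length)
GOTO = {(0, 'S'): 2}               # state transition on the reduction token

def parse(tokens):
    stack = [0]                    # parser state stack
    tokens = tokens + ['$']
    pos = 0
    while True:
        act = ACTION.get((stack[-1], tokens[pos]))
        if act is None:
            return False           # error action: token not admissible here
        if act[0] == 'accept':
            return True            # accept action: goal token recognized
        if act[0] == 'shift':
            stack.append(act[1])   # shift action: push state, consume token
            pos += 1
        else:                      # reduce action
            lhs, length = RULES[act[1]]
            del stack[-length:]    # pop one state per rule element
            stack.append(GOTO[(stack[-1], lhs)])

print(parse(['x', 'x', 'x']))  # True
print(parse(['x', 'y']))       # False
```

Note how the reduce branch follows the text exactly: the rule length is
subtracted from the stack, and the reduction token is then used as an
input to find the next state.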
Each partition set is assigned a unique token. If one of your
character sets requires more than one partition set to represent it,
AnaGram will create appropriate productions and add them to your
grammar so your parser can make the necessary distinctions.

\index{Precedence}
\paragraph{Precedence.}
``Precedence'' is an attribute of binary operators and unary operators
which can be used to resolve conflicts. Operator precedence is also
referred to as \agterm{binding}. Suppose that \agcode{\#} and
\agcode{@} are two binary operators. The question is how to interpret
\agcode{x \# y @ z}. If \agcode{\#} has higher precedence than
\agcode{@}, it is interpreted as \agcode{(x \# y) @ z}. On the other
hand, if \agcode{\#} has lower precedence than \agcode{@}, it is
interpreted as \agcode{x \# (y @ z)}. If \agcode{\#} and \agcode{@}
have the same precedence then the question must be resolved by
associativity.

Note that all operators at the same precedence level must have the
same associativity.

The situation is somewhat simpler for unary operators. If
\agcode{\#} and \agcode{@} were both
prefix operators, or both suffix operators, there would be no
ambiguity in interpretation, since neither \agcode{\# @ x} nor
\agcode{x \# @} offers more than one possible interpretation. The only
difficulty arises when both a prefix and a suffix operator are applied
to the same operand.

Suppose that \agcode{\#} is a prefix operator and \agcode{@} is a
suffix operator. The question then is how to interpret \agcode{\# x @}.
If \agcode{\#} has higher precedence than \agcode{@}, it is
interpreted as \agcode{(\# x) @}. On the other hand, if \agcode{\#}
has lower precedence than \agcode{@}, it is interpreted as
\agcode{\# (x @)}.
If \agcode{\#} and \agcode{@} have the same
precedence, then the question must be resolved by associativity.

\index{Production}
\paragraph{Production.}
Productions are the mechanism used to describe how complex input
structures are built up out of simpler ones. Each production has a
left side and a right side. The right side, or grammar rule, is a
sequence of rule elements, which may represent either terminal tokens
or nonterminal tokens. The left side is a list of reduction tokens. In
most cases there is only a single reduction token. Productions
with more than one token on the left side are called semantically
determined productions. The ``\agcode{->}'' symbol is used to separate
the left side from the right side.

\paragraph{Reduce Action.}
The reduce action is one of the four actions of a traditional parsing
engine. The reduce action is performed when the parser has succeeded
in matching all the elements of a grammar rule and the next input
token is not erroneous. Reducing the grammar rule amounts to
subtracting the length of the rule from the parser stack index,
identifying the reduction token, stacking its semantic value, and then
doing a shift action with the reduction token as though it had been
input directly.

\index{Conflicts}
\index{Reduce-reduce conflicts}
\paragraph{Reduce-Reduce Conflict.}
A grammar has a ``reduce-reduce'' conflict at some state if a single
token turns out to be a reducing token for more than one completed
rule. For example, if both \agcode{expression -> name} and
\agcode{statement -> name} are completed rules in some state, and the
same token may legally follow either an expression or a statement,
then that token is a reducing token for both rules.

\index{Reducing token}
\index{Token}
\paragraph{Reducing Token.}
In a state with more than one completed rule, your parser must be able
to determine which one was actually found. AnaGram deals with this
problem by looking at all the states the parser will branch to once
each rule is reduced. The acceptable input tokens for those states are
the ``reducing tokens'' for the completed rules in the state under
investigation.
If there is a single token which is a reducing token
for more than one rule, the grammar is said to have a
reduce-reduce conflict at that state. If, in a particular state, there
is both a shift action and a reduce action for the same token, the
grammar is said to have a shift-reduce conflict in that state.

A reducing token is not the same as a reduction token.

\index{Reduction procedure}
\paragraph{Reduction Procedure.}
A ``reduction procedure'' is a function you write which your parser
executes when it has identified the grammar rule to which the
reduction procedure is attached in your grammar. There are two formats
for reduction procedures, depending on the size and complexity of the
procedure.

\index{Reduction token}
\index{Token}
\paragraph{Reduction Token.}
A token which appears on the left side of a production is called a
reduction token. It is so called because when the grammar rule on the
right side of the production is matched in the input stream, your
parser will ``reduce'' the sequence of tokens which matches the rule
by replacing it with the reduction token. Note
that, if more than one reduction token is specified, your reduction
procedure should choose the correct one. If it does not, your parser
will use the leftmost syntactically correct one in the list as the
default.

A reduction token is not the same as a reducing token.

\index{Resynchronization}
\paragraph{Resynchronization.}
``Resynchronization'' is the process of getting your parser back in
step with its input after encountering a syntax error. As such, it is
one method of error recovery. Of course, you would resynchronize only
if it is necessary to continue after the error. There are several
options available when using AnaGram.
You could use the \agparam{auto
resynch} option, which causes AnaGram to incorporate an automatic
resynchronizing procedure into your parser, or you could use the
\agparam{error token} option, which is similar to the technique used
by \agfile{yacc} programmers.

\index{Rule elements}
\paragraph{Rule Element.}
A grammar rule is a list of ``rule elements'', separated by commas.
Rule elements may be token names, character sets, keywords, immediate
actions, or virtual productions. When AnaGram encounters a rule
element for which no token presently exists, it creates one.

\index{Semantic action}
\index{Action}
\paragraph{Semantic Action.}
A semantic action is a piece of C or C++ code that is executed when a
grammar rule has been identified by a parser. Semantic actions are
also called reduction procedures, since they are executed when a
grammar rule is reduced to the token on the left side of the
production.

\index{Semantic value}
\index{Value}
\paragraph{Semantic Value.}
Tokens, whether terminal or nonterminal, may have a semantic value. In
the case of terminal tokens, this may be a value assigned by the
lexical scanner or, if the parser is using direct character input, it
will be the ASCII value of the character itself. The values of
nonterminal tokens are created by reduction procedures. As a parse
progresses, token values are shifted onto the stack, so that in a
reduction procedure, the values of the tokens that comprise the
grammar rule being reduced are available for inspection.

\index{Semantically determined production}
\index{Production}
\paragraph{Semantically Determined Production.}
A ``semantically determined production'' is one which has more than
one reduction token specified on the left side of the production. You
would write such a production when the reduction tokens are
syntactically indistinguishable.
The reduction procedure may then
specify which of the listed reduction tokens the grammar rule is to
reduce to, based on semantic considerations. If there is no reduction
procedure, or the reduction procedure does not specify a reduction
token, the parser will use the leftmost syntactically correct
reduction token.

\index{Set expressions}
\paragraph{Set Expression.}
A set expression is an algebraic expression used to define a character
set in terms of individual characters, ranges of characters, and/or
other sets of characters, as constructed using complements, unions,
intersections, and differences.

\paragraph{Shift Action.}
The shift action is one of the four actions of a traditional parsing
engine. The shift action is performed when the input token matches one
of the acceptable input tokens for the current parser state. The
semantic value of the token and the current state number are stacked,
the parser stack index is incremented, and the state number is set to a
value determined by the previous state and the input token.

\index{Shift-reduce conflict}
\index{Conflicts}
\paragraph{Shift-Reduce Conflict.}
A ``shift-reduce'' conflict occurs if in some state there is a single
token that should be shifted, because it is legitimate input for one
of the rules of the state, but should also be used to reduce some
other rule because it is a reducing token for that rule. The classic
example is the ``dangling \agcode{else}'' of C: after seeing
\agcode{if (a) if (b) s} with \agcode{else} as the next token, the
parser could either shift the \agcode{else}, attaching it to the
inner \agcode{if}, or reduce the inner \agcode{if} statement,
attaching the \agcode{else} to the outer \agcode{if}.

\index{Parsing}
\index{Syntax directed parsing}
\paragraph{Syntax Directed Parsing.}
Syntax directed parsing, or formal parsing, is an approach to building
parsers based on formal language theory. Given a suitable description
of a language, called a grammar, there are algorithms which can be
used to create parsers for the language automatically.
In this
context, the set of all possible inputs to a program may be considered
to constitute a language, and the rules for formulating the input to
the program constitute the grammar for the language.

The parsers built from a grammar have the advantage that they can
recognize any input that conforms to the rules, and can reject as
erroneous any input that fails to conform.

Since the program logic necessary to parse input is often extremely
intricate, programs which use formal parsing are usually much more
reliable than those built by hand. They are also much easier to
maintain, since it is much easier to modify a grammar specification
than it is to modify complex program logic.

\index{Terminal token}
\index{Token}
\paragraph{Terminal Token.}
A ``terminal token'' is a token which does not appear on the left side
of a production. It represents, therefore, a basic unit of input to
your parser. If the input to your parser consists of ASCII
characters, you may define terminal tokens explicitly as ASCII
characters or as sets of ASCII characters. If you have an input
procedure which produces numeric codes, you may define the terminal
tokens directly in terms of these numeric codes.

\index{Token}
\paragraph{Token.}
Tokens are the units with which your parser works. There are two kinds
of tokens: terminal tokens and nonterminal tokens. The latter are
identified by the parser as sequences of tokens. The grouping of
tokens into more complex tokens is governed by the grammar rules, or
productions, in your grammar. In your grammar, tokens may be denoted by
token names, by virtual productions, by immediate actions, by explicit
character representations, or by expressions which yield character
sets.

A ``token'' should be thought of more or less as being something like
a cloakroom token. Imagine you had taken a kindergarten class to the
museum and all the kids checked their coats. Each one gets a token.
What is the probability of losing one? You take all of the tokens
from the children, put them in a paper bag, tie it up, and check
it. You get one token in return. We would say that the kids' coats
represent the actual input to the system, their individual tokens are
the terminal tokens, and your one token which enables you to redeem
the paper bag is a nonterminal token. Nonterminal tokens are simply
single token numbers which represent the result of compacting a
sequence of more primitive tokens.

Actually, the word ``token'' is commonly used to refer to a number of
apparently different things. The tokens you use in a grammar are
abstract. The tokens identified by a parser are concrete instances of
the abstract tokens in the grammar. Furthermore, we have to identify
tokens in some manner, so we have token names and token numbers with
which to refer to particular tokens. As a result, the word ``token'',
in any particular context, may refer directly to either an abstract
token or a concrete instance of an abstract token, or may refer only
indirectly to the token by means of a token name or token number.

As an example, consider a C compiler which has a lexical scanner to
identify input tokens. According to Kernighan and
Ritchie\footnote{Brian W. Kernighan and Dennis M. Ritchie.
\textit{The
C Programming Language}, 2nd ed. (Englewood Cliffs, New Jersey:
Prentice-Hall, 1988), p. 191.}, there are six classes of C tokens:
identifiers, keywords, constants, string literals, operators, and
other separators. Consider a particular token, a hexadecimal constant,
\agcode{0XF00}. When the lexical scanner hands this over to the
parser, it has to provide an identification code which says what kind
of token it is, a hexadecimal constant, as well as the ``semantic
value'' of the token, 3840. The identification code is usually an
integer value, determined by the designer of the lexical scanner,
which by agreement with the designer of the parser uniquely represents
a hexadecimal constant. This identification code can be thought of as
an external token number.

The grammar for the parser, on the other hand, has a token,
\agcode{HEXconstant}, let us say, to represent the abstract usage of
hexadecimal constants in the C language. When AnaGram, or any other
parser generator for that matter, analyzes the grammar, it assigns its
own reference number to identify \agcode{HEXconstant}. This reference
number can be thought of as an internal token number, to contrast it
with the external token number provided by the lexical scanner. The
parsers AnaGram builds contain a table, \agcode{ag{\us}tcv}, which
provides a translation from the external to the internal token
number. Both of these token numbers, of course, differ from the
semantic value of the token, in this case 3840.

Even when an AnaGram parser is using simple character input, the word
``token'' may represent several different things. Suppose a grammar
has a token \agcode{'A-Z'}, which represents the set of upper case
letters. There will be an internal token number, assigned by AnaGram,
which identifies the (abstract) token in parser tables. The actual
(concrete) input token to a parser will, however, be a single letter,
say ``Q''.
In this case, the ASCII value of the character ``Q'', 81,
is used both as the external token number and as the semantic value of
the token.

\paragraph{Unary Operator.}
A ``unary operator'' is a token which represents some sort of
operation which acts on a single operand to produce a new resultant
entity. It is normally represented by a symbol which is written either
preceding the operand or following the operand. In the first case it
is called a ``prefix operator'' and in the second case a ``suffix
operator''.

\index{Union}
\paragraph{Union.}
The union of two sets is the set of all elements that are to be found
in one or the other of the two sets. In an AnaGram syntax file, the
union of two character sets \agcode{A} and \agcode{B} is represented
using the plus sign, as in \agcode{A + B}. The union operator has the
same precedence as the difference operator: lower than that of
intersection and complement. The union operator is left associative.
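To see why left associativity matters when union and difference share
a precedence level, consider the set expression \agcode{A - B + C}.
The following Python sketch (Python's set operators merely stand in
for AnaGram's; the sets chosen are arbitrary examples) shows that the
two possible groupings produce different results:

```python
# Python sets standing in for AnaGram character sets. Union (+) and
# difference (-) share a precedence level and are left associative
# in AnaGram, so A - B + C groups as (A - B) + C.
A = set("abcd")
B = set("bc")
C = set("c")

left_assoc = (A - B) | C    # AnaGram's reading of A - B + C
other = A - (B | C)         # the right-associative grouping

# The groupings differ: 'c' survives in the left-associative reading
# (the union adds it back after the difference removes it), but is
# removed outright in the other grouping.
```

Here \agcode{left{\us}assoc} contains \agcode{'a'}, \agcode{'c'}, and
\agcode{'d'}, while the other grouping contains only \agcode{'a'} and
\agcode{'d'}, so the associativity rule is not a mere formality.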