diff doc/manual/gl.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author: David A. Holland
date: Sat, 22 Dec 2007 17:52:45 -0500
\chapter{Glossary}

% XXX Add:
% parameter assignment (?)
% parser (state) stack
% parser value stack
% (may be defined in some other version of the glossary)
% (and appear in the help)
% (there are apparently dangling links to these in some version)

\index{Action}
\index{Parser action}
\index{Parsing engine}
\paragraph{Action.}
A ``parser action'' is one of the basic executable elements of a
parsing engine. In a traditional parsing engine there are four
actions: the shift action, the reduce action, the accept action, and
the error action. The parsing engines which AnaGram generates use a
substantially greater number of actions, in order to provide certain
shortcuts in the interest of greater speed and greater compactness of
the tables which control the parsing engine.

\paragraph{Accept Action.}
The accept action is one of the four actions of a traditional parsing
engine. It is performed when the parser has succeeded in identifying
the goal token for the grammar. When the parser executes the accept
action, it returns to the calling program.
% , returning the semantic value of the goal token as its value.
The accept action is thus the last action of the parsing engine and
occurs only once for each successful execution of the parser.

\paragraph{Associativity.}
This is a term used to describe binary operators (q.v.). It describes
how to interpret a sequence of operations which all involve the same
operator. Consider $a - b - c$, for instance. Here the convention is
that we subtract $b$ from $a$, and then $c$ from the difference. We
say that subtraction is left associative, since in a sequence of
subtraction operations the leftmost one is performed first. As a
second example, consider exponentiation.
FORTRAN uses
\agcode{**} as an operator to indicate exponentiation. In order to
conform as closely as possible to ordinary mathematical notation, the
expression \agcode{a**b**c} is interpreted to mean that first
\agcode{b} is raised to the power \agcode{c}, and then the resultant
value is used as the power to which \agcode{a} is raised. We say that
exponentiation is right associative since the rightmost of a sequence
of exponentiation operations is performed first. If a programming
language does not allow unparenthesized sequences of a particular
operator, the operator is said to be non-associative.

Another way to view left and right associativity is as implied
parenthesization. Left associative operators are parenthesized from
the left, and right associative operators from the right. Thus
$a - b - c = ((a-b) - c)$ and \agcode{a**b**c = (a**(b**c))}.

AnaGram offers two ways to specify the associativity of binary
operators. The first is to use conventional Backus-Naur Form syntax
descriptions, describing both subtraction and exponentiation with
recursion. Subtraction, since it is left associative, uses left
recursion, and exponentiation, being right associative, uses right
recursion. Thus
\begin{indentingcode}{0.4in}
difference -> value | difference, '-', value
exponential -> value | value, "**", exponential
\end{indentingcode}
could be used to describe differences and exponents.

The second way to specify associativity is to use an ambiguous grammar
and precedence declarations. (See Chapter 9.)

\paragraph{Backtracking.}
To make parse tables more compact and parsers faster, it is common to
use default reductions. In case of error, it is necessary to undo
default reductions before diagnostics can be properly determined. In
AnaGram, this undo operation is called backtracking.
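The behavioral difference between the left and right recursive
definitions of associativity can be mimicked procedurally. This is a
purely illustrative Python sketch (not AnaGram code): left and right
associative folds of the same operand list give different results.

```python
from functools import reduce

def fold_left(op, items):
    # ((a op b) op c): left associative, like subtraction
    return reduce(op, items)

def fold_right(op, items):
    # (a op (b op c)): right associative, like exponentiation
    if len(items) == 1:
        return items[0]
    return op(items[0], fold_right(op, items[1:]))

print(fold_left(lambda a, b: a - b, [10, 4, 3]))   # (10 - 4) - 3 = 3
print(fold_right(lambda a, b: a ** b, [2, 3, 2]))  # 2 ** (3 ** 2) = 512
```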
\index{Backus-Naur form}
\index{BNF}
\paragraph{Backus-Naur Form.}
Backus-Naur Form, or BNF, is a conventional notation for describing
context free grammars. AnaGram uses an extended notation for grammars,
which, except for semantically determined productions, can be shown to
be equivalent to BNF. The term ``BNF'' is often used colloquially to
refer to a grammar specification.
% XXX: shouldn't this say ``any grammar specification''?

AnaGram's syntax specification language differs from BNF in the
following respects:

\begin{enumerate}

\item
In conventional BNF, a symbol not enclosed in angle brackets ($< >$)
is taken to represent itself literally. In AnaGram, literal character
representations must be enclosed in single quotes and literal strings
within double quotes.

\item
In BNF, the right side of a production is simply a list of symbols
without any delimiters or punctuation. AnaGram requires that the rule
elements which make up a grammar rule, or right side, be joined by
commas.

\item
BNF makes no provision for identifying reduction procedures or their
arguments. AnaGram provides both reduction procedures, introduced by
an ``\agcode{=}'' at the end of the production, and named arguments,
introduced by a ``\agcode{:}'' at the end of any token on the right
side of the production.

\item
AnaGram allows virtual productions to be used freely.

\item
BNF is ``pure'' in that if you wish to define a nonterminal token
called ``digit'' to represent the digits from zero to nine, you must
provide ten explicit productions to define it. AnaGram treats the
concept of ``terminal token'' as used in language theory as an
abstraction, and interposes character set functionality between actual
character input and the terminal tokens of BNF.
You can define digit
to be the character range \agcode{'0-9'}, for instance, and AnaGram
will determine and define precisely the minimum number of productions
necessary, taking into account any other usage of the characters
\agcode{'0'} through \agcode{'9'} in your grammar. This makes your
grammar more compact, more manageable, and easier to modify.

\item
AnaGram allows for semantically determined productions, which provide
a significant increment in the ease with which semantic analysis may
be joined to syntactic analysis.

\end{enumerate}

\paragraph{Binary operator.}
A binary operator is an operator that works on two operands to create
a result. It is often written between its operands and is sometimes
called an infix operator. In ordinary programming languages
``\agcode{+}'' and ``\agcode{-}'' are binary operators.

\paragraph{Binding.}
Most programming languages, rather than executing arithmetic operators
in simple left to right order, conform to the traditional conventions
of ordinary algebra, which dictate that, except where parentheses
indicate otherwise, exponentiation is done before multiplication,
multiplication before addition, and addition before comparison. One
says that exponentiation is ``more tightly binding'' than
multiplication, multiplication is more tightly binding than addition,
and so on. The sense of the word here is that the operator binds
together its operands into a single entity. An operand which falls
between two different operators is ``bound'' by the more tightly
binding operator. An operator that is more tightly binding than
another is also said to have higher precedence.

\index{Character sets}
\paragraph{Character sets.}
One of the traditional problems with syntax directed programming is
caused by the fact that most formal language specifications have
productions of the form
\begin{indentingcode}{0.4in}
letter -> a | b | c | d ...
| z
\end{indentingcode}

Since the letters are not often distinguished in the grammar, this
large number of essentially identical productions causes a
correspondingly large number of states in the parser. This problem is
often attacked by using a ``lexical scanner'' which simply delivers a
``letter token'' whose value indicates precisely which letter is
intended. This works fine as long as there is nothing in the grammar
which distinguishes any letter at all. If there is, and there is
usually some special use of \agcode{h}, \agcode{x}, \agcode{q},
\agcode{o}, \agcode{e}, or even \agcode{a-f}, the situation gets more
complicated. Often the result is to abandon the use of syntax directed
programming for elemental units of the language and use a more complex
lexical scanner which identifies names, numbers, key words and
operators. Such lexical scanners are often built using conventional
programming languages or simple pattern recognizing languages such as
\agfile{lex}.

AnaGram avoids this problem by incorporating the notion of character
set into its input specifications. Briefly, the way it works is the
following: You specify the set of characters which make up any
ostensibly terminal token. AnaGram then analyzes the overlaps among
these definitions and creates a minimal set of tokens which match your
specifications exactly. For instance, suppose you have the following
definitions:

\begin{indentingcode}{0.4in}
letter = 'a-z' + 'A-Z'
hex digit = '0-9' + 'a-f' + 'A-F'
\end{indentingcode}

AnaGram will define a token for letter which has two productions:

\begin{indentingcode}{0.4in}
letter -> 'a-f' + 'A-F' | 'g-z' + 'G-Z'
\end{indentingcode}

With this technique, you have the freedom to specify your grammar in
the easiest possible manner, without the penalty of an absurdly large,
slow parser.
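The overlap analysis just described can be illustrated with a small
Python sketch. This is an illustration of the idea, not AnaGram's
actual algorithm, and it uses lowercase letters only for brevity:
given overlapping set definitions, it computes the disjoint cells
from which each defined set can be rebuilt.

```python
def partition(named_sets):
    """Split overlapping sets into disjoint cells (illustrative only)."""
    cells = []
    for s in named_sets.values():
        new_cells, rest = [], set(s)
        for c in cells:
            if c & rest:
                new_cells.append(c & rest)      # part shared with the new set
                if c - rest:
                    new_cells.append(c - rest)  # part outside the new set
                rest -= c
            else:
                new_cells.append(c)
        if rest:
            new_cells.append(rest)              # part in no previous cell
        cells = new_cells
    return cells

letter = set('abcdefghijklmnopqrstuvwxyz')
hex_digit = set('0123456789abcdef')
cells = partition({'letter': letter, 'hex digit': hex_digit})
# Three cells result: 'a-f' (shared), 'g-z', and '0-9'.
print(sorted(''.join(sorted(c)) for c in cells))
```

Each cell corresponds to what the manual calls a partition set; `letter`
is then the union of the `'a-f'` and `'g-z'` cells, exactly as in the
two-production example above.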
Character sets may be specified in terms of ranges of characters, as
in the above example, as unions, denoted by ``\agcode{+}'', as
intersections, denoted by ``\agcode{\&}'', as differences, denoted by
``\agcode{-}'', and as complements, using ``\agcode{\~{}}''. You may
also specify a set consisting of a single character either with a
literal character or with a numeric specification. If you specify a
character numerically, you may use decimal, octal, or hexadecimal
notation, as in C.

\index{Character universe}
\index{Universe}
\paragraph{Character Universe.}
In ordinary set theory, the ``universe'' is the set of all entities
which may be elements of a set. It must be defined so that the
complement of a set means something unambiguous. In AnaGram, the
character universe normally comprises the ASCII characters and the IBM
extended character set, that is, all characters in the range from 0
through 255. If, however, you have defined input characters outside
this range, the character universe will be extended to cover both the
least and the greatest characters you have used in your grammar.
% XXX this should mention character sets, and it should mention that
% it's limited to character sets of size 65536.

\index{Characteristic rules}
\index{Rule}
\paragraph{Characteristic Rule.}
Each parser state is characterized by a particular set of grammar
rules, and for each such rule, a marked token which is the next token
expected. These rules are called the characteristic rules for the
state. In the course of doing grammar analysis, AnaGram determines the
characteristic rules for each parser state. After analyzing your
grammar, you may inspect the \agwindow{State Definition Table} to see
the characteristic rules for any state in your parser.
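The set operators described under ``Character sets'' map directly onto
Python's built-in set operations. The following sketch is illustrative
only (AnaGram implements these internally); it mirrors union,
intersection, difference, and complement against the default 0--255
universe.

```python
UNIVERSE = frozenset(range(256))          # default character universe, 0-255

def union(a, b): return a | b             # AnaGram '+'
def intersection(a, b): return a & b      # AnaGram '&'
def difference(a, b): return a - b        # AnaGram '-'
def complement(a): return UNIVERSE - a    # AnaGram '~', relative to universe

digits = frozenset(range(ord('0'), ord('9') + 1))
lower = frozenset(range(ord('a'), ord('z') + 1))
hex_lower = frozenset(range(ord('a'), ord('f') + 1))

print(len(union(digits, hex_lower)))        # 16 characters: '0-9' + 'a-f'
print(len(intersection(lower, hex_lower)))  # 6 characters: 'a-f'
print(len(difference(lower, hex_lower)))    # 20 characters: 'g-z'
print(len(complement(UNIVERSE)))            # 0: complement of the universe
```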
\index{Characteristic token}
\index{Token}
\paragraph{Characteristic Token.}
Every state in a parser, except state zero, can be characterized by
the one unique token which causes a jump to that state. That token is
called the characteristic token of the state, because to get to that
parser state you must have just seen precisely that token in the
input. Note that several states could have the same characteristic
token.

When you have a list of states, such as is given by the parser state
stack, it is equivalent to a list of characteristic tokens. This list
of tokens is the list of tokens that have been recognized so far by
the parser. Some of these tokens, of course, may be nonterminal tokens
and may thus represent the result of reducing a sequence of previously
recognized tokens.

\index{Complement}
\paragraph{Complement of a set.}
In set theory, the complement of a set, $S$, is the collection of all
elements of the character universe which are not elements of $S$. Using
AnaGram's notation, the complement of \agcode{S} is written
\agcode{\~{}S}. The complement operator has higher precedence than the
difference, intersection, or union operators.

\index{Rule}
\index{Completed rule}
\paragraph{Completed Rule.}
A ``completed rule'' is a characteristic rule whose index is pointing
beyond the last rule element. In other words, it has been completely
matched and will be reduced by the next input.

If a state has more than one completed rule, the decision as to which
to reduce is made based on the next input token. If there is only one
completed rule in a state, it will be reduced by default unless the
default reductions switch has been reset, i.e., turned off.

\index{Conflicts}
\paragraph{Conflict.}
Conflicts arise during grammar analysis when AnaGram cannot
determine how to treat a given input token. There are two sorts of
conflicts: \agterm{shift-reduce} conflicts and \agterm{reduce-reduce}
conflicts.
Conflicts may arise either because the grammar is
inherently ambiguous, or simply because the grammar analyzer cannot
look far enough ahead to resolve the conflict. In the latter case, it
is often possible to rewrite the grammar in such a way as to eliminate
the conflict. Null productions are a common source of conflict.

There are a number of ways to deal with conflicts. If you understand
the conflict well, you may simply choose to ignore it. When AnaGram
encounters a shift-reduce conflict while building parse tables, it
resolves it by choosing the shift action. When AnaGram encounters a
reduce-reduce conflict while building parse tables, it resolves it by
selecting the grammar rule which occurred first in the grammar.

A second way to deal with conflicts is to set operator precedence
parameters. If you set these parameters, AnaGram will use them
preferentially to resolve conflicts. Any conflicts so resolved will be
listed in the \agwindow{Resolved Conflicts} window.

A third way to resolve a conflict is to declare some tokens as
\agparam{sticky} or as \agparam{subgrammar}s. This is particularly
useful for productions whose sole purpose is to skip over
uninteresting input.

The fourth way to deal with conflicts is to rewrite the grammar to
eliminate them. Many people prefer this approach since it yields the
highest level of confidence in the resulting program.

\index{Context free grammar}
\paragraph{Context Free Grammars.}
Context free grammars are grammars wherein the definition of a grammar
unit, or nonterminal token, does not depend on the context in which
the nonterminal is used. That means that if you define ``widget'' in a
certain way, that definition is valid in all contexts in which
``widget'' might occur. Context free grammars can be represented in
Backus-Naur Form. AnaGram's syntax specification language has no
facility for representing grammars which are not context free.
\index{Default reductions}
\index{Reduction}
\paragraph{Default reductions.}
A ``default reduction'' is a parser action which may be used in your
parser in any state which has precisely one completed rule.

If a given parser state has, among its characteristic rules, exactly
one completed rule, it is usually faster to reduce it on any input
than to check specifically for correct input before reducing it. The
only time this default reduction causes trouble is in the event of a
syntax error. In this situation you may get an erroneous reduction.
Normally when you are parsing a file, this is inconsequential because
you are not going to continue semantic action in the presence of
error. But, if you are using your parser to handle real-time
interactive input, you have to be able to continue semantic processing
after notifying your user that he has entered erroneous input. In this
case you would want \agparam{default reductions} to be turned off so
that productions are reduced only when there is correct input.

\index{Difference}
% XXX: \index{Set difference} ?
\paragraph{Difference of two sets.}
In set theory, the difference of two sets, $A$ and $B$, is defined to
be the set of all elements of $A$ that are not elements of $B$. In an
AnaGram syntax file, you represent the difference of two character
sets by using the ``\agcode{-}'' operator. Thus the difference of
\agcode{A} and \agcode{B} is \agcode{A - B}. The difference operator
is left associative.

\paragraph{Error Action.}
The error action is one of the four actions of a traditional parsing
engine. The error action is performed when the parser has encountered
an input token which is not admissible in the current state. The
further behavior of a traditional parser is undefined.

\index{Expansion rule}
\index{Rule}
\paragraph{Expansion Rule.}
In analyzing a grammar, we are often interested in the full range of
input that can be expected at a certain point.
The expansion of a
token or state shows us all the expected input. An expansion yields a
set of marked rules. Looking at the marked token shows us what input
to expect.

The set of expansion rules of a (nonterminal) token shows all the
expected input that can occur whenever the token appears in the
grammar. The set consists of all the grammar rules produced by the
token, plus all the rules produced by the first token of any rule in
the set. The marked token for each expansion rule of a token is
always the first element in the rule.

The expansion of a state consists of its characteristic rules plus the
expansion rules of the marked token in each characteristic rule.

\index{Grammar}
\paragraph{Grammar.}
Traditionally, a grammar is a set of productions which taken together
specify precisely a set of acceptable input sequences in terms of an
abstract set of terminal tokens. The set of acceptable input sequences
is often called the ``language'' defined by the grammar.

In AnaGram, the term grammar also includes configuration sections,
definitions of character sets, virtual productions, etc. that augment
the collection of productions. A grammar is often called a
``syntax'', and the rules of the grammar are often called syntactic
rules.

\paragraph{Grammar Analysis.}
The major function of AnaGram is the analysis of context free grammars
written in a particular variant of Backus-Naur Form.

The analysis of a grammar proceeds in four stages. In the first, the
syntax file is read and a number of tables are built which describe
all of the productions and components of the grammar.

In the second stage, AnaGram analyzes all of the character sets
defined in the grammar, and where necessary, defines auxiliary tokens
and productions.

In the third stage, AnaGram identifies all of the states of the parser
and builds the go-to table for the parser.
In the fourth stage, AnaGram identifies reduction tokens for each
completed grammar rule in each state and checks for conflicts.

\index{Grammar rule}
\paragraph{Grammar Rule.}
A ``grammar rule'' is the right side of a production. It consists of a
sequence of rule elements. Each rule element identifies some token,
which can be either a terminal token or a nonterminal token. It is
``matched'' by a corresponding sequence of tokens in the input stream
to the parser. The left side of the production identifies one or more
nonterminal tokens, or reduction tokens, to which the rule reduces
when matched. If there is more than one reduction token, there should
be a reduction procedure to make the choice.

\index{Grammar token}
\index{Token}
\paragraph{Grammar Token.}
The grammar token is the token which represents the ``top level'' in
your grammar. Some people refer to it as the ``goal'' or ``goal
token'' and others as the ``start token''. Whatever it is called, it
is the single token which describes the complete input to your parser.

\index{Intersection}
\paragraph{Intersection.}
In set theory, the intersection of two sets, $A$ and $B$, is defined
to be the set of all elements of $A$ which are also elements of $B$.
In an AnaGram syntax file, the intersection of two character sets is
represented with the ``\agcode{\&}'' operator. The intersection
operator has lower precedence than the complement operator, but higher
precedence than the union and difference operators. The intersection
operator is left associative.

\index{LALR parsing}
\paragraph{LALR parsing.}
LALR, or ``lookahead LR'', parsing is a somewhat less powerful variant
of LR parsing which produces much smaller parsing tables. Thus an LALR
parser has very significant advantages in size over its LR
counterpart. It is possible to construct grammars which can be parsed
by an LR parser but cannot be parsed by an LALR parser.
That is, the
LALR parser will find these grammars to be ambiguous and the LR parser
will not. In practice, however, such grammars seem to be rare and the
ambiguities can often be eliminated by minor revisions.

AnaGram generates LALR parsers and, in addition, uses LALR parsing to
interpret syntax files.

\index{Lexical scanner}
\paragraph{Lexical Scanner.}
A lexical scanner is a program or procedure used as a preprocessor for
a parser. It scans input characters and lumps them together into
tokens. Many systems for generating parsers, most notably
\agfile{yacc}, can scarcely be used without a lexical scanner.

AnaGram, although it can support the use of a lexical scanner, does
not need one for most practical problems. AnaGram avoids the need for
a lexical scanner by using character sets and disregard and lexeme
statements.

If your problem does in fact need a lexical scanner, you can use
AnaGram itself to write it, so you don't need to know different
languages for the scanner and for the parser.

\index{Lookahead token}
\index{Token}
\paragraph{Lookahead Token.}
The current input token to a parser is often called the ``lookahead''
token. In many situations, it is not shifted into the input buffer
immediately, but acts like a catalyst to cause a number of rules to be
reduced before it is eventually shifted in.

\index{LR parsing}
\paragraph{LR Parsing.}
An LR($k$) parser is a type of deterministic bottom-up shift-reduce
parser using at most $k$ lookahead symbols to determine its next
action at any point in the parse. If $k$ is omitted, it is
assumed to be 1. Discussion of the technical details of LR($k$)
parsing is beyond the scope of this manual. The interested reader may
consult any of a variety of works covering the subject\footnote{See,
for example, Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman.
\textit{Compilers: Principles, Techniques, and Tools}.
(Reading,
Massachusetts: Addison-Wesley, 1988).}.

AnaGram produces parsers which employ a variant of LR parsing known as
LALR parsing. It is not necessary to understand the details of LR and
LALR parsing theory in order to use AnaGram.

\index{Marked rule}
\index{Rule}
\paragraph{Marked Rule.}
A ``marked rule'' is a grammar rule together with a ``marked token''
that indicates how much of the rule has already been matched and what
input should be expected if the remainder of the rule is to be
matched. Each parser state is defined by a small number of
characteristic rules which indicate what matches have already been
made and what input can be expected. Note that when a state has more
than one characteristic rule, any two characteristic rules with the
same number of tokens match identically up to the marked token. Even
if one has fewer tokens to the left than the other, its tokens match
the other exactly up to the marked token.

When marked rules are displayed in AnaGram windows, the marked token
is displayed in a distinctive font which the developer can select.

\paragraph{Non-associative.}
A binary operator, say \agcode{\#}, is said to be non-associative if
the sequence \agcode{x \# y \# z} is not permissible. If this sequence
is to be interpreted as \agcode{(x \# y) \# z}, the operator is said
to be left associative. If the sequence is to be interpreted as
\agcode{x \# (y \# z)}, the operator is said to be right
associative. (See \agterm{associativity}.)

\index{Nonterminal token}
\index{Token}
\paragraph{Nonterminal Token.}
A ``nonterminal token'' is a token which results from reducing the
sequence of tokens which match a grammar rule and replacing them with
the appropriate reduction token. Nonterminal tokens are to be
distinguished from terminal tokens or input tokens.

\index{Null productions}
\index{Production}
\paragraph{Null Production.}
A ``null production'' is one that has no tokens on the right side
whatsoever.
Null productions essentially are identified by the first
following input token. Null productions are extremely convenient
syntactic elements when you wish to make some input optional. For
example, suppose that you wish to allow an optional semicolon at some
point in your grammar. You could write the following pair of
productions:

\begin{indentingcode}{0.4in}
optional semicolon -> | ';'
\end{indentingcode}

Note that a null production can never follow a ``\agcode{|}''.

The above could also be written on multiple lines thus:

\begin{indentingcode}{0.4in}
optional semicolon
 ->
 -> ';'
\end{indentingcode}

You can always rewrite your grammar to eliminate null productions if
you wish, but you usually pay a price in conciseness and
clarity. Sometimes, however, it is necessary to do such a rewrite in
order to avoid conflicts, to which null productions are especially
prone.

If you have a null production with no reduction procedure specified,
your parser will automatically assign the value zero to the reduction
token.

Null productions can be generated by virtual productions.

\index{Parser}
\paragraph{Parser.}
A parser is a program, or more likely a procedure within a program,
which scans a sequence of input characters or input tokens and
accumulates them in an input buffer or stack as determined by a set of
productions which constitute a grammar. When the parser discovers a
sequence of tokens as defined by a grammar rule, or right side of a
production, it ``reduces'' the sequence to a single reduction token as
defined by the left side of the grammar rule. This nonterminal token
now replaces the tokens which matched the grammar rule and the search
for matches continues. If an input token is encountered which will not
yield a match for any rule, it is considered a syntax error and some
kind of error recovery may be required to continue.
If a match, or
reduction action, yields the grammar token, sometimes called the goal
token or start token, the parser deems its work complete and returns
to whatever procedure may have called it.

Tokens may have semantic values. If the \agparam{input values}
configuration switch is on, your parser will expect semantic values to
be provided by the input process along with the token identification
code. If the \agparam{input values} switch is off, your parser will
take the ASCII value of the input character, that is, the actual input
code, as the value of the character. When the parser reduces a
production, it can call a reduction procedure or semantic action to
analyze the values of the constituent tokens. This reduction procedure
can then return a value which characterizes the reduced token.

\paragraph{Parser Generator.}
A parser generator, such as AnaGram, is a program that converts a
grammar, a rule-based description of the input to a program, into a
conventional, procedural module called a parser. The parsers AnaGram
generates are simple C modules which can be compiled on almost any
platform. AnaGram parsers are also compatible with C++.

\index{Parsing engine}
\paragraph{Parsing Engine.}
A parser consists of three basic components: a set of syntax tables, a
set of reduction procedures, and a parsing engine. The parsing engine
is the body of code that interprets the parsing table, invokes input
functions, and calls the reduction procedures. The \agmenu{Build
Parser} command configures a parsing engine according to the implicit
requirements of the syntax specification and according to the explicit
values of the configuration parameters.

The parsing engine itself is a simple automaton, characterized by a
set of states and a set of inputs. The inputs are the tokens of your
grammar.
Each state is represented by a list of tokens which are
admissible in that state and, for each token, an action to perform and
a parameter which further defines the action. In a traditional LR
parser, there are only four actions: the shift action, the reduce
action, the accept action and the error action. AnaGram, in doing its
grammar analysis, identifies a number of special cases, and creates a
number of extra actions which make for faster processing, but which
can be represented as combinations of these primitive actions.

When a shift action is performed, the current state number is pushed
onto the parser state stack and the new state number is determined by
the current state number and the current lookahead token. Different
tokens cause different new states.

When a reduce action is performed, the length of the rule being
reduced is subtracted from the parser stack index and the new state
number is read from the top of the parser state stack. The reduction
token for the rule being reduced is then used as an input token.

Each state in the grammar, with the exception of state zero, has a
characteristic token which must have been recognized in order to jump
to that state. Therefore, the parser state stack, which is essentially
a list of state numbers, can also be thought of as a list of token
numbers. This is the list of tokens that have been seen so far in the
parse of your input stream. Some of these tokens, of course, may be
nonterminal tokens which have resulted from reducing other sequences
of tokens.

\index{Partition}
\paragraph{Partition.}
If you use character sets in your grammar, AnaGram will compute a
``partition'' of the character universe. This partition is a
collection of non-overlapping character sets such that every one of
the sets you have defined can be written as a union of partition
sets. Each partition set is identified by a unique reference number
called the partition set number.
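The shift and reduce actions described under ``Parsing Engine'' can be
sketched as a small table-driven automaton. This is a textbook-style
Python illustration with hand-built tables for the tiny grammar
\agcode{S -> S, 'x' | 'x'}; the details differ from AnaGram's actual
engine (for instance, this sketch pushes the new state rather than the
current one, and real tables are machine-generated).

```python
# Hand-built tables for the grammar  S -> S 'x' | 'x'  ('$' marks end of input).
ACTION = {
    (0, 'x'): ('shift', 1),
    (1, 'x'): ('reduce', 0), (1, '$'): ('reduce', 0),   # S -> x
    (2, 'x'): ('shift', 3),  (2, '$'): ('accept',),
    (3, 'x'): ('reduce', 1), (3, '$'): ('reduce', 1),   # S -> S x
}
RULES = [('S', 1), ('S', 2)]       # (reduction token, rule length)
GOTO = {(0, 'S'): 2}               # state transition on the reduction token

def parse(tokens):
    stack = [0]                    # parser state stack
    tokens = tokens + ['$']
    pos = 0
    while True:
        act = ACTION.get((stack[-1], tokens[pos]))
        if act is None:
            return False           # error action: token not admissible here
        if act[0] == 'accept':
            return True            # accept action: goal token recognized
        if act[0] == 'shift':
            stack.append(act[1])   # shift action: push state, consume token
            pos += 1
        else:                      # reduce action
            lhs, length = RULES[act[1]]
            del stack[-length:]    # pop one state per rule element
            stack.append(GOTO[(stack[-1], lhs)])

print(parse(['x', 'x', 'x']))  # True
print(parse(['x', 'y']))       # False
```

Note how the reduce branch follows the text exactly: the rule length is
subtracted from the stack, and the reduction token is then used as an
input to find the next state.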
Each partition set is assigned a unique token. If one of your
character sets requires more than one partition set to represent it,
AnaGram will create appropriate productions and add them to your
grammar so your parser can make the necessary distinctions.

\index{Precedence}
\paragraph{Precedence.}
``Precedence'' is an attribute of binary operators and unary operators
which can be used to resolve conflicts. Operator precedence is also
referred to as \agterm{binding}. Suppose that \agcode{\#} and
\agcode{@} are two binary operators. The question is how to interpret
\agcode{x \# y @ z}. If \agcode{\#} has higher precedence than
\agcode{@}, it is interpreted as \agcode{(x \# y) @ z}. On the other
hand, if \agcode{\#} has lower precedence than \agcode{@}, it is
interpreted as \agcode{x \# (y @ z)}. If \agcode{\#} and \agcode{@}
have the same precedence then the question must be resolved by
associativity.

Note that all operators at the same precedence level must have the
same associativity.

The situation is somewhat simpler for unary operators. If
\agcode{\#} and \agcode{@} were both
prefix operators, or both suffix operators, there would be no
ambiguity in interpretation, since neither \agcode{\# @ x} nor
\agcode{x \# @} offers more than one possible interpretation. The only
difficulty arises when both a prefix and a suffix operator are applied
to the same operand.

Suppose that \agcode{\#} is a prefix operator and \agcode{@} is a
suffix operator. The question then is how to interpret \agcode{\# x @}.
If \agcode{\#} has higher precedence than \agcode{@}, it is
interpreted as \agcode{(\# x) @}. On the other hand, if \agcode{\#}
has lower precedence than \agcode{@}, it is interpreted as
\agcode{\# (x @)}.
If \agcode{\#} and \agcode{@} have the same
precedence, then the question must be resolved by associativity.

\index{Production}
\paragraph{Production.}
Productions are the mechanism used to describe how complex input
structures are built up out of simpler ones. Each production has a
left side and a right side. The right side, or grammar rule, is a
sequence of rule elements, which may represent either terminal tokens
or nonterminal tokens. The left side is a list of reduction tokens. In
most cases there is only a single reduction token. Productions
with more than one token on the left side are called semantically
determined productions. The ``\agcode{->}'' symbol is used to separate
the left side from the right side.

\paragraph{Reduce Action.}
The reduce action is one of the four actions of a traditional parsing
engine. The reduce action is performed when the parser has succeeded
in matching all the elements of a grammar rule and the next input
token is not erroneous. Reducing the grammar rule amounts to
subtracting the length of the rule from the parser stack index,
identifying the reduction token, stacking its semantic value, and then
doing a shift action with the reduction token as though it had been
input directly.

\index{Conflicts}
\index{Reduce-reduce conflicts}
\paragraph{Reduce-Reduce Conflict.}
A grammar has a ``reduce-reduce'' conflict at some state if a single
token turns out to be a reducing token for more than one completed
rule. For example, if both \agcode{expression -> name} and
\agcode{statement -> name} are completed rules in some state, and the
same token may legally follow either an expression or a statement,
then that token is a reducing token for both rules.

\index{Reducing token}
\index{Token}
\paragraph{Reducing Token.}
In a state with more than one completed rule, your parser must be able
to determine which one was actually found. AnaGram deals with this
problem by looking at all the states the parser will branch to once
each rule is reduced. The acceptable input tokens for those states are
the ``reducing tokens'' for the completed rules in the state under
investigation.
If there is a single token which is a reducing token
for more than one rule, the grammar is said to have a
reduce-reduce conflict at that state. If, in a particular state, there
is both a shift action and a reduce action for the same token, the
grammar is said to have a shift-reduce conflict in that state.

A reducing token is not the same as a reduction token.

\index{Reduction procedure}
\paragraph{Reduction Procedure.}
A ``reduction procedure'' is a function you write which your parser
executes when it has identified the grammar rule to which the
reduction procedure is attached in your grammar. There are two formats
for reduction procedures, depending on the size and complexity of the
procedure.

\index{Reduction token}
\index{Token}
\paragraph{Reduction Token.}
A token which appears on the left side of a production is called a
reduction token. It is so called because when the grammar rule on the
right side of the production is matched in the input stream, your
parser will ``reduce'' the sequence of tokens which matches the rule
by replacing it with the reduction token. Note
that, if more than one reduction token is specified, your reduction
procedure should choose the correct one. If it does not, your parser
will use the leftmost syntactically correct one in the list as the
default.

A reduction token is not the same as a reducing token.

\index{Resynchronization}
\paragraph{Resynchronization.}
``Resynchronization'' is the process of getting your parser back in
step with its input after encountering a syntax error. As such, it is
one method of error recovery. Of course, you would resynchronize only
if it is necessary to continue after the error. There are several
options available when using AnaGram.
You could use the \agparam{auto
resynch} option, which causes AnaGram to incorporate an automatic
resynchronizing procedure into your parser, or you could use the
\agparam{error token} option, which is similar to the technique used
by \agfile{yacc} programmers.

\index{Rule elements}
\paragraph{Rule Element.}
A grammar rule is a list of ``rule elements'', separated by commas.
Rule elements may be token names, character sets, keywords, immediate
actions, or virtual productions. When AnaGram encounters a rule
element for which no token presently exists, it creates one.

\index{Semantic action}
\index{Action}
\paragraph{Semantic Action.}
A semantic action is a piece of C or C++ code that is executed when a
grammar rule has been identified by a parser. Semantic actions are
also called reduction procedures, since they are executed when a
grammar rule is reduced to the token on the left side of the
production.

\index{Semantic value}
\index{Value}
\paragraph{Semantic Value.}
Tokens, whether terminal or nonterminal, may have a semantic value. In
the case of terminal tokens, this may be a value assigned by the
lexical scanner or, if the parser is using direct character input, it
will be the ASCII value of the character itself. The values of
nonterminal tokens are created by reduction procedures. As a parse
progresses, token values are shifted onto the stack, so that in a
reduction procedure, the values of the tokens that comprise the
grammar rule being reduced are available for inspection.

\index{Semantically determined production}
\index{Production}
\paragraph{Semantically Determined Production.}
A ``semantically determined production'' is one which has more than
one reduction token specified on the left side of the production. You
would write such a production when the reduction tokens are
syntactically indistinguishable.
The reduction procedure may then
specify which of the listed reduction tokens the grammar rule is to
reduce to, based on semantic considerations. If there is no reduction
procedure, or the reduction procedure does not specify a reduction
token, the parser will use the leftmost syntactically correct
reduction token.

\index{Set expressions}
\paragraph{Set Expression.}
A set expression is an algebraic expression used to define a character
set in terms of individual characters, ranges of characters, and/or
other sets of characters, as constructed using complements, unions,
intersections, and differences.

\paragraph{Shift Action.}
The shift action is one of the four actions of a traditional parsing
engine. The shift action is performed when the input token matches one
of the acceptable input tokens for the current parser state. The
semantic value of the token and the current state number are stacked,
the parser stack index is incremented, and the state number is set to a
value determined by the previous state and the input token.

\index{Shift-reduce conflict}
\index{Conflicts}
\paragraph{Shift-Reduce Conflict.}
A ``shift-reduce'' conflict occurs if in some state there is a single
token that should be shifted, because it is legitimate input for one
of the rules of the state, but should also be used to reduce some
other rule because it is a reducing token for that rule. The classic
example is the ``dangling \agcode{else}'' of C: after seeing
\agcode{if (a) if (b) s} with \agcode{else} as the next token, the
parser could either shift the \agcode{else}, attaching it to the
inner \agcode{if}, or reduce the inner \agcode{if} statement,
attaching the \agcode{else} to the outer \agcode{if}.

\index{Parsing}
\index{Syntax directed parsing}
\paragraph{Syntax Directed Parsing.}
Syntax directed parsing, or formal parsing, is an approach to building
parsers based on formal language theory. Given a suitable description
of a language, called a grammar, there are algorithms which can be
used to create parsers for the language automatically.
In this
context, the set of all possible inputs to a program may be considered
to constitute a language, and the rules for formulating the input to
the program constitute the grammar for the language.

The parsers built from a grammar have the advantage that they can
recognize any input that conforms to the rules, and can reject as
erroneous any input that fails to conform.

Since the program logic necessary to parse input is often extremely
intricate, programs which use formal parsing are usually much more
reliable than those built by hand. They are also much easier to
maintain, since it is much easier to modify a grammar specification
than it is to modify complex program logic.

\index{Terminal token}
\index{Token}
\paragraph{Terminal Token.}
A ``terminal token'' is a token which does not appear on the left side
of a production. It represents, therefore, a basic unit of input to
your parser. If the input to your parser consists of ASCII
characters, you may define terminal tokens explicitly as ASCII
characters or as sets of ASCII characters. If you have an input
procedure which produces numeric codes, you may define the terminal
tokens directly in terms of these numeric codes.

\index{Token}
\paragraph{Token.}
Tokens are the units with which your parser works. There are two kinds
of tokens: terminal tokens and nonterminal tokens. The latter are
identified by the parser as sequences of tokens. The grouping of
tokens into more complex tokens is governed by the grammar rules, or
productions, in your grammar. In your grammar, tokens may be denoted by
token names, by virtual productions, by immediate actions, by explicit
character representations, or by expressions which yield character
sets.

A ``token'' should be thought of more or less as being something like
a cloakroom token. Imagine you had taken a kindergarten class to the
museum and all the kids checked their coats. Each one gets a token.
What is the probability of losing one? You take all of the tokens
from the children, put them in a paper bag, tie it up, and check
it. You get one token in return. We would say that the kids' coats
represent the actual input to the system, their individual tokens are
the terminal tokens, and your one token which enables you to redeem
the paper bag is a nonterminal token. Nonterminal tokens are simply
single token numbers which represent the result of compacting a
sequence of more primitive tokens.

Actually, the word ``token'' is commonly used to refer to a number of
apparently different things. The tokens you use in a grammar are
abstract. The tokens identified by a parser are concrete instances of
the abstract tokens in the grammar. Furthermore, we have to identify
tokens in some manner, so we have token names and token numbers with
which to refer to particular tokens. As a result, the word ``token'',
in any particular context, may refer directly to either an abstract
token or a concrete instance of an abstract token, or may refer only
indirectly to the token by means of a token name or token number.

As an example, consider a C compiler which has a lexical scanner to
identify input tokens. According to Kernighan and
Ritchie\footnote{Brian W. Kernighan and Dennis M. Ritchie.
\textit{The
C Programming Language}, 2nd ed. (Englewood Cliffs, New Jersey:
Prentice-Hall, 1988), p. 191.}, there are six classes of C tokens:
identifiers, keywords, constants, string literals, operators, and
other separators. Consider a particular token, a hexadecimal constant,
\agcode{0XF00}. When the lexical scanner hands this over to the
parser, it has to provide an identification code which says what kind
of token it is, a hexadecimal constant, as well as the ``semantic
value'' of the token, 3840. The identification code is usually an
integer value, determined by the designer of the lexical scanner,
which by agreement with the designer of the parser uniquely represents
a hexadecimal constant. This identification code can be thought of as
an external token number.

The grammar for the parser, on the other hand, has a token,
\agcode{HEXconstant}, let us say, to represent the abstract usage of
hexadecimal constants in the C language. When AnaGram, or any other
parser generator for that matter, analyzes the grammar, it assigns its
own reference number to identify \agcode{HEXconstant}. This reference
number can be thought of as an internal token number, to contrast it
with the external token number provided by the lexical scanner. The
parsers AnaGram builds contain a table, \agcode{ag{\us}tcv}, which
provides a translation from the external to the internal token
number. Both of these token numbers, of course, differ from the
semantic value of the token, in this case 3840.

Even when an AnaGram parser is using simple character input, the word
``token'' may represent several different things. Suppose a grammar
has a token \agcode{'A-Z'}, which represents the set of upper case
letters. There will be an internal token number, assigned by AnaGram,
which identifies the (abstract) token in parser tables. The actual
(concrete) input token to a parser will, however, be a single letter,
say ``Q''.
In this case, the ASCII value of the character ``Q'', 81,
is used both as the external token number and as the semantic value of
the token.

\paragraph{Unary Operator.}
A ``unary operator'' is a token which represents some sort of
operation which acts on a single operand to produce a new resultant
entity. It is normally represented by a symbol which is written either
preceding the operand or following the operand. In the first case it
is called a ``prefix operator'' and in the second case a ``suffix
operator''.

\index{Union}
\paragraph{Union.}
The union of two sets is the set of all elements that are to be found
in one or the other of the two sets. In an AnaGram syntax file, the
union of two character sets \agcode{A} and \agcode{B} is represented
using the plus sign, as in \agcode{A + B}. The union operator has the
same precedence as the difference operator: lower than that of
intersection and complement. The union operator is left associative.
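To see why left associativity matters when union and difference share
a precedence level, consider the set expression \agcode{A - B + C}.
The following Python sketch (Python's set operators merely stand in
for AnaGram's; the sets chosen are arbitrary examples) shows that the
two possible groupings produce different results:

```python
# Python sets standing in for AnaGram character sets. Union (+) and
# difference (-) share a precedence level and are left associative
# in AnaGram, so A - B + C groups as (A - B) + C.
A = set("abcd")
B = set("bc")
C = set("c")

left_assoc = (A - B) | C    # AnaGram's reading of A - B + C
other = A - (B | C)         # the right-associative grouping

# The groupings differ: 'c' survives in the left-associative reading
# (the union adds it back after the difference removes it), but is
# removed outright in the other grouping.
```

Here \agcode{left{\us}assoc} contains \agcode{'a'}, \agcode{'c'}, and
\agcode{'d'}, while the other grouping contains only \agcode{'a'} and
\agcode{'d'}, so the associativity rule is not a mere formality.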