
\chapter{Glossary}

% XXX Add:
%   parameter assignment  (?)
%   parser (state) stack
%   parser value stack
% (may be defined in some other version of the glossary)
% (and appear in the help)
% (there are apparently dangling links to these in some version)

\index{Action}
\index{Parser action}
\index{Parsing engine}
\paragraph{Action.}
A ``parser action'' is one of the basic executable elements of a
parsing engine. In a traditional parsing engine there are four
actions: the shift action, the reduce action, the accept action, and
the error action. The parsing engines which AnaGram generates use a
substantially greater number of actions, in order to provide certain
short cuts in the interest of greater speed and greater compactness of
the tables which control the action of the parsing engine.

\paragraph{Accept Action.}
The accept action is one of the four actions of a traditional parsing
engine. The accept action is performed when the parser has succeeded
in identifying the goal token for the grammar. When the parser
executes the accept action, it returns to the calling program.
% , returning the semantic value of the goal token as its value.
The accept action is thus the last action of the parsing engine and
occurs only once for each successful execution of the parser.

\paragraph{Associativity.}
This is a term used to describe binary operators (q.v.). It describes
how you interpret a sequence of operations which all involve the same
operator.  Consider $a - b - c$, for instance. Here the convention is
that we subtract $b$ from $a$, and then $c$ from the difference. We
say that subtraction is left associative, since if there is a sequence
of subtraction operations, the leftmost one is to be performed
first. As a second example, consider exponentiation. FORTRAN uses
\agcode{**} as an operator to indicate exponentiation. In order to
conform as closely as possible to ordinary mathematical notation, the
expression \agcode{a**b**c} is interpreted to mean that first
\agcode{b} is raised to the power \agcode{c}, and then the resultant
value is used as the power to which \agcode{a} is raised. We say that
exponentiation is right associative since the rightmost of a sequence
of exponentiation operations is to be performed first. If a
programming language does not allow for unparenthesized sequences of a
particular operator, the operator is said to be non-associative.

Another way to view left and right associativity is as implied
parenthesization.  Left associative operators are parenthesized from
the left. Right associative operators are parenthesized from the
right. Thus $a - b - c = ((a-b) - c)$ and \agcode{a**b**c = (a**(b**c))}.

AnaGram offers two ways to specify associativity of binary
operators. The first way is to use conventional Backus-Naur Form
syntax descriptions. You would use recursion to describe both
subtraction and exponentiation. Subtraction, since it is left
associative, uses left recursion, and exponentiation, being right
associative, uses right recursion. Thus
\begin{indentingcode}{0.4in}
difference  -> value | difference, '-', value
exponential -> value | value, "**", exponential
\end{indentingcode}
could be used to describe differences and exponents.

The second way to specify associativity is to use an ambiguous grammar
and precedence declarations. (See Chapter 9.)
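
With the second approach, a single ambiguous description covers both
operators. As a sketch (the token names are illustrative; the
precedence declarations themselves are described in Chapter 9):

\begin{indentingcode}{0.4in}
expression
  -> value
  -> expression, '-', expression
  -> expression, "**", expression
\end{indentingcode}

Without precedence declarations this grammar has conflicts; the
declarations tell AnaGram how to resolve them.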

\paragraph{Backtracking.}
In order to make parse tables more compact and parsers faster, it is
common to use default reductions. In case of error, it is necessary to
undo default reductions before diagnostics can be properly
determined. In AnaGram, this undo operation is called backtracking.

\index{Backus-Naur form}
\index{BNF}
\paragraph{Backus-Naur Form.}
Backus-Naur Form, or BNF, is a conventional notation for describing
context free grammars. AnaGram uses an extended notation for grammars,
which, except for semantically determined productions, can be shown to
be equivalent to BNF. The term ``BNF'' is often used colloquially to
refer to any grammar specification.

AnaGram's syntax specification language differs from BNF in the
following respects:

\begin{enumerate}

\item
In conventional BNF, a symbol not enclosed in angle brackets ($< >$)
is taken to represent itself literally. In AnaGram, literal character
representations must be enclosed in single quotes and literal strings
within double quotes.

\item
In BNF, the right side of a production is simply a list of symbols
without any delimiters or punctuation. AnaGram requires that the rule
elements which make up a grammar rule, or right side, be joined by
commas.

\item
BNF makes no provision for identifying reduction procedures or their
arguments. AnaGram provides both reduction procedures, introduced by
an ``\agcode{=}'' at the end of the production, and named arguments,
introduced by a ``\agcode{:}'' at the end of any token on the right
side of the production. (See the sketch following this list.)

\item
AnaGram allows virtual productions to be used freely.

\item
BNF is ``pure'' in that if you wish to define a nonterminal token
called ``digit'' to represent the digits from zero to nine, you must
provide ten explicit productions to define it. AnaGram treats the
concept of ``terminal token'' as used in language theory as an
abstraction, and interposes character set functionality between actual
character input and the terminal tokens of BNF. You can define digit
to be the character range \agcode{'0-9'}, for instance, and AnaGram
will determine and define precisely the minimum number of productions
necessary, taking into account any other usage of the characters
\agcode{'0'} through \agcode{'9'} in your grammar. This makes your
grammar more compact, more manageable, and easier to modify.

\item
AnaGram allows for semantically determined productions, which provide
a significant increment in the ease with which semantic analysis may
be joined to syntactic analysis.

\end{enumerate}
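
As an illustration of the second and third points, here is a
hypothetical production written in AnaGram notation. Commas join the
rule elements, ``\agcode{:}'' names the arguments, and ``\agcode{=}''
introduces a reduction procedure; the token names and the
\agcode{(int)} type declaration are purely illustrative:

\begin{indentingcode}{0.4in}
(int) sum
  -> term:t              = t;
  -> sum:s, '+', term:t  = s + t;
\end{indentingcode}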

\paragraph{Binary operator.}
A binary operator is an operator that works on two operands to create
a result. It is often written between its operands and is sometimes
called an infix operator. In ordinary programming languages
``\agcode{+}'' and ``\agcode{-}'' are binary operators.

\paragraph{Binding.}
Most programming languages, rather than executing arithmetic operators
in simple left to right order, conform to the traditional conventions
of ordinary algebra, which dictate that, except where parentheses
indicate otherwise, exponentiation is done before multiplication,
multiplication before addition, and addition before comparison. One
says that exponentiation is ``more tightly binding'' than
multiplication, multiplication is more tightly binding than addition,
and so on. The sense of the word here is that the operator binds
together its operands into a single entity. An operand which falls
between two different operators is ``bound'' by the more tightly
binding operator. An operator that is more tightly binding than
another is also said to have higher precedence.

\index{Character sets}
\paragraph{Character sets.}
One of the traditional problems with syntax directed programming is
that caused by the fact that most formal language specifications have
productions of the form
\begin{indentingcode}{0.4in}
letter -> a | b | c | d ... | z
\end{indentingcode}

Since the letters are not often distinguished in the grammar, this
large number of essentially identical productions causes a
correspondingly large number of states in the parser. This problem is
often attacked by using a ``lexical scanner'' which simply specifies a
``letter token'' whose value indicates precisely which letter is
intended. This works fine as long as there is nothing in the grammar
which distinguishes any letter at all. If there is, and there is
usually some special use of \agcode{h}, \agcode{x}, \agcode{q},
\agcode{o}, \agcode{e}, or even \agcode{a-f}, the situation gets more
complicated. Often the result is to abandon the use of syntax directed
programming for elemental units of the language and use a more complex
lexical scanner which identifies names, numbers, key words and
operators. Such lexical scanners are often built using conventional
programming languages or simple pattern recognizing languages such as
\agfile{lex}.

AnaGram avoids this problem by incorporating the notion of character
set into its input specifications. Briefly, the way it works is the
following: You specify the set of characters which make up any
ostensibly terminal token. AnaGram then analyzes the overlaps among
these definitions and creates a minimal set of tokens which match your
specifications exactly. For instance, suppose you have the following
definitions:

\begin{indentingcode}{0.4in}
letter = 'a-z' + 'A-Z'
hex digit = '0-9' + 'a-f' + 'A-F'
\end{indentingcode}

AnaGram will define a token for letter which has two productions:

\begin{indentingcode}{0.4in}
letter -> 'a-f' + 'A-F' | 'g-z' + 'G-Z'
\end{indentingcode}

With this technique, you have the freedom to specify your grammar in
the easiest possible manner, without the penalty of an absurdly large,
slow parser.

Character sets may be specified in terms of ranges of characters, as
in the above example, as unions, denoted by ``\agcode{+}'', as
intersections, denoted by ``\agcode{\&}'', as differences, denoted by
``\agcode{-}'', and as complements, using ``\agcode{\~{}}''. You may
also specify a set consisting of a single character either with a
literal character or with a numeric specification. If you specify a
character numerically, you may use either decimal, octal or
hexadecimal notation, as in C.
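
For example, the following definitions (purely illustrative) use
ranges, unions, a difference, a complement, and a numeric character
specification:

\begin{indentingcode}{0.4in}
digit       = '0-9'
letter      = 'a-z' + 'A-Z'
octal digit = digit - '8-9'
not a quote = ~'"'
tab         = 9
\end{indentingcode}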

\index{Character universe}
\index{Universe}
\paragraph{Character Universe.}
In ordinary set theory, the ``universe'' is the set of all entities
which may be elements of a set. It must be defined so that the
complement of a set can mean something unambiguous. In AnaGram, the
character universe normally comprises the ASCII characters and the IBM
extended character set, that is, all characters in the range from 0
through 255. If, however, you have defined input characters outside
this range, the character universe will be extended to the least
character you have used and to the greatest character you have used in
your grammar.
% XXX this should mention character sets, and it should mention that
% it's limited to character sets of size 65536.

\index{Characteristic rules}
\index{Rule}
\paragraph{Characteristic Rule.}
Each parser state is characterized by a particular set of grammar
rules, and for each such rule, a marked token which is the next token
expected. These rules are called the characteristic rules for the
state. In the course of doing grammar analysis, AnaGram determines the
characteristic rules for each parser state. After analyzing your
grammar, you may inspect the \agwindow{State Definition Table} to see
the characteristic rules for any state in your parser.

\index{Characteristic token}
\index{Token}
\paragraph{Characteristic Token.}
Every state in a parser, except state zero, can be characterized by
the one unique token which causes a jump to that state. That token is
called the characteristic token of the state, because to get to that
parser state you must have just seen precisely that token in the
input. Note that several states could have the same characteristic
token.

When you have a list of states, such as is given by the parser state
stack, it is equivalent to a list of characteristic tokens. This list
of tokens is the list of tokens that have been recognized so far by
the parser. Some of these tokens, of course, may be nonterminal tokens
and may thus represent the result of reducing a sequence of previously
recognized tokens.

\index{Complement}
\paragraph{Complement of a set.}
In set theory, the complement of a set, $S$, is the collection of all
elements of the character universe which are not elements of $S$. Using
AnaGram's notation, the complement of \agcode{S} is written
\agcode{\~{}S}. The complement operator has higher precedence than the
difference, intersection, or union operators.

\index{Rule}
\index{Completed rule}
\paragraph{Completed Rule.}
A ``completed rule'' is a characteristic rule whose index is pointing
beyond the last rule element. In other words, it has been completely
matched and will be reduced by the next input.

If a state has more than one completed rule, the decision as to which
to reduce is made based on the next input token. If there is only one
completed rule in a state, it will be reduced by default unless the
default reductions switch has been reset, i.e., turned off.

\index{Conflicts}
\paragraph{Conflict.}
Conflicts arise during a grammar analysis when AnaGram cannot
determine how to treat a given input token. There are two sorts of
conflicts: \agterm{shift-reduce} conflicts and \agterm{reduce-reduce}
conflicts. Conflicts may arise either because the grammar is
inherently ambiguous, or simply because the grammar analyzer cannot
look far enough ahead to resolve the conflict. In the latter case, it
is often possible to rewrite the grammar in such a way as to eliminate
the conflict. Null productions are a common source of conflict.

There are a number of ways to deal with conflicts. If you understand
the conflict well, you may simply choose to ignore it. When AnaGram
encounters a shift-reduce conflict while building parse tables it
resolves it by choosing the shift action. When AnaGram encounters a
reduce-reduce conflict while building parse tables, it resolves it by
selecting the grammar rule which occurred first in the grammar.

A second way to deal with conflicts is to set operator precedence
parameters. If you set these parameters, AnaGram will use them
preferentially to resolve conflicts. Any conflicts so resolved will be
listed in the Resolved Conflicts window.

A third way to resolve a conflict is to declare some tokens as
\agparam{sticky} or as \agparam{subgrammar}s. This is particularly
useful for productions whose sole purpose is to skip over
uninteresting input.

The fourth way to deal with conflicts is to rewrite the grammar to
eliminate them.  Many people prefer this approach since it yields the
highest level of confidence in the resulting program.

\index{Context free grammar}
\paragraph{Context Free Grammars.}
Context free grammars are grammars wherein the definition of a grammar
unit, or nonterminal token, does not depend on the context in which
the nonterminal is used. That means that if you define ``widget'' in a
certain way, that definition is valid in all contexts in which
``widget'' might occur.  Context free grammars can be represented in
Backus-Naur Form. AnaGram's syntax specification language has no
facility for representing grammars which are not context free.

\index{Default reductions}
\index{Reduction}
\paragraph{Default reductions.}
A ``default reduction'' is a parser action which may be used in your
parser in any state which has precisely one completed rule.

If a given parser state has, among its characteristic rules, exactly
one completed rule, it is usually faster to reduce it on any input
than to check specifically for correct input before reducing it. The
only time this default reduction causes trouble is in the event of a
syntax error.  In this situation you may get an erroneous reduction.
Normally when you are parsing a file, this is inconsequential because
you are not going to continue semantic action in the presence of
error.  But, if you are using your parser to handle real-time
interactive input, you have to be able to continue semantic processing
after notifying your user that he has entered erroneous input. In this
case you would want \agparam{default reductions} to have been turned
off so that productions are reduced only when there is correct input.

\index{Difference}
% XXX: \index{Set difference} ?
\paragraph{Difference of two sets.}
In set theory, the difference of two sets, $A$ and $B$, is defined to
be the set of all elements of $A$ that are not elements of $B$. In an
AnaGram syntax file, you represent the difference of two character
sets by using the ``\agcode{-}'' operator. Thus the difference of
\agcode{A} and \agcode{B} is \agcode{A - B}. The difference operator
is left associative.

\paragraph{Error Action.}
The error action is one of the four actions of a traditional parsing
engine. The error action is performed when the parser has encountered
an input token which is not admissible in the current state. The
further behavior of a traditional parser is undefined.

\index{Expansion rule}
\index{Rule}
\paragraph{Expansion Rule.}
In analyzing a grammar, we are often interested in the full range of
input that can be expected at a certain point. The expansion of a
token or state shows us all the expected input. An expansion yields a
set of marked rules.  Looking at the marked token shows us what input
to expect.

The set of expansion rules of a (nonterminal) token shows all the
expected input that can occur whenever the token appears in the
grammar. The set consists of all the grammar rules produced by the
token, plus all the rules produced by the first token of any rule in
the set. The marked tokens for the expansion rules of a token are
always the first element in the rule.

The expansion of a state consists of its characteristic rules plus the
expansion rules of the marked token in each characteristic rule.

\index{Grammar}
\paragraph{Grammar.}
Traditionally, a grammar is a set of productions which taken together
specify precisely a set of acceptable input sequences in terms of an
abstract set of terminal tokens. The set of acceptable input sequences
is often called the ``language'' defined by the grammar.

In AnaGram, the term grammar also includes configuration sections,
definitions of character sets, virtual productions, etc. that augment
the collection of productions.  A grammar is often called a
``syntax'', and the rules of the grammar are often called syntactic
rules.

\paragraph{Grammar Analysis.}
The major function of AnaGram is the analysis of context free grammars
written in a particular variant of Backus-Naur Form.

The analysis of a grammar proceeds in four stages. In the first, the
input grammar is read and a number of tables are built which describe
all of the productions and components of the grammar.

In the second stage, AnaGram analyzes all of the character sets
defined in the grammar, and where necessary, defines auxiliary tokens
and productions.

In the third stage, AnaGram identifies all of the states of the parser
and builds the go-to table for the parser.

In the fourth stage, AnaGram identifies reduction tokens for each
completed grammar rule in each state and checks for conflicts.

\index{Grammar rule}
\paragraph{Grammar Rule.}
A ``grammar rule'' is the right side of a production. It consists of a
sequence of rule elements. Each rule element identifies some token,
which can be either a terminal token or nonterminal token. It is
``matched'' by a corresponding sequence of tokens in the input stream
to the parser. The left side of the production identifies one or more
nonterminal tokens, or reduction tokens, to which the rule reduces
when matched. If there is more than one reduction token, there should
be a reduction procedure to make the choice.

\index{Grammar token}
\index{Token}
\paragraph{Grammar Token.}
The grammar token is the token which represents the ``top level'' in
your grammar. Some people refer to it as the ``goal'' or ``goal
token'' and others as the ``start token''. Whatever it is called, it
is the single token which describes the complete input to your parser.

\index{Intersection}
\paragraph{Intersection.}
In set theory, the intersection of two sets, $A$ and $B$, is defined
to be the set of all elements of $A$ which are also elements of $B$.
In an AnaGram syntax file, the intersection of two character sets is
represented with the ``\agcode{\&}'' operator.  The intersection
operator has lower precedence than the complement operator, but higher
precedence than the union and difference operators. The intersection
operator is left associative.

\index{LALR parsing}
\paragraph{LALR parsing.}
LALR, or ``lookahead LR'' parsing, is a somewhat less powerful variant
of LR parsing which produces much smaller parsing tables. Thus an LALR
parser has very significant advantages in size over its LR
counterpart. It is possible to construct grammars which can be parsed
by an LR parser but cannot be parsed by an LALR parser. That is, the
LALR parser will find these grammars to be ambiguous and the LR parser
will not. In practice, however, such grammars seem to be rare and the
ambiguities can often be eliminated by minor revisions.

AnaGram generates LALR parsers and, in addition, uses LALR parsing to
interpret syntax files.

\index{Lexical scanner}
\paragraph{Lexical Scanner.}
A lexical scanner is a program or procedure used as a preprocessor for
a parser. It scans input characters and lumps them together into
tokens.  Many systems for generating parsers, most notably
\agfile{yacc}, can scarcely be used without a lexical scanner.

AnaGram, although it can support the use of a lexical scanner, does
not need a lexical scanner for most practical problems. AnaGram avoids
the need for a lexical scanner by using character sets and disregard
and lexeme statements.

If your problem does in fact need a lexical scanner, you can use
AnaGram itself to write it, so you don't need to know different
languages for the scanner and for the parser.

\index{Lookahead token}
\index{Token}
\paragraph{Lookahead Token.}
The current input token to a parser is often called the ``lookahead''
token. In many situations, it is not shifted into the input buffer
immediately, but acts like a catalyst to cause a number of rules to be
reduced before it is eventually shifted in.

\index{LR parsing}
\paragraph{LR Parsing.}
An LR($k$) parser is a type of deterministic bottom-up shift-reduce
parser using at most $k$ lookahead symbols to determine its next
action at any point in the parse. If $k$ is omitted, it is
assumed to be 1. Discussion of the technical details of LR($k$)
parsing is beyond the scope of this manual. The interested reader may
consult any of a variety of works covering the subject\footnote{See,
for example, Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman.
\textit{Compilers: Principles, Techniques, and Tools}. (Reading,
Massachusetts: Addison-Wesley, 1988).}.

AnaGram produces parsers which employ a variant of LR parsing known as
LALR parsing. It is not necessary to understand the details of LR and
LALR parsing theory in order to use AnaGram.

\index{Marked rule}
\index{Rule}
\paragraph{Marked Rule.}
A ``marked rule'' is a grammar rule together with a ``marked token''
that indicates how much of the rule has already been matched and what
input should be expected if the remainder of the rule is to be
matched. Each parser state is defined by a small number of
characteristic rules which indicate what matches have already been
made and what input can be expected. Note that when a state has more
than one characteristic rule, any two characteristic rules with the
same number of tokens match identically up to the marked token. Even
if one has fewer tokens to the left than the other, its tokens match
the other exactly up to the marked token.
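
For example, if the rule \agcode{sum -> sum, '+', term} has been
matched through the \agcode{'+'}, the corresponding marked rule could
be written (using, purely for exposition, a caret to stand for the
mark):

\begin{indentingcode}{0.4in}
sum -> sum, '+', ^ term
\end{indentingcode}

The expected input is whatever can begin a \agcode{term}.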

When marked rules are displayed in AnaGram windows, the marked token
is displayed in a distinctive font which the developer can select.

\paragraph{Non-associative.}
A binary operator, say \agcode{\#}, is said to be non-associative if
the sequence \agcode{x \# y \# z} is not permissible. If this sequence
is to be interpreted as \agcode{(x \# y) \# z}, the operator is said
to be left associative. If the sequence is to be interpreted as 
\agcode{x \# (y \# z)}, the operator is said to be right
associative. (See \agterm{associativity}.) 

\index{Nonterminal token}
\index{Token}
\paragraph{Nonterminal Token.}
A ``nonterminal token'' is a token which results from reducing the
sequence of tokens which match a grammar rule and replacing them with
the appropriate reduction token. Nonterminal tokens are to be
distinguished from terminal tokens or input tokens.

\index{Null productions}
\index{Production}
\paragraph{Null Production.}
A ``null production'' is one that has no tokens on the right side
whatsoever. A null production is essentially identified by the first
input token that follows it. Null productions are extremely convenient
syntactic elements when you wish to make some input optional. For
example, suppose that you wish to allow an optional semicolon at some
point in your grammar. You could write the following pair of
productions:

\begin{indentingcode}{0.4in}
optional semicolon -> | ';'
\end{indentingcode}

Note that a null production can never follow a ``\agcode{|}''.

The above could also be written on multiple lines thus:

\begin{indentingcode}{0.4in}
optional semicolon
  ->
  -> ';'
\end{indentingcode}

You can always rewrite your grammar to eliminate null productions if
you wish, but you usually pay a price in conciseness and
clarity. Sometimes, however, it is necessary to do such a rewrite in
order to avoid conflicts, to which null productions are especially
prone.

If you have a null production with no reduction procedure specified,
your parser will automatically assign the value zero to the reduction
token.

Null productions can be generated by virtual productions.

\index{Parser}
\paragraph{Parser.}
A parser is a program, or more likely a procedure within a program,
which scans a sequence of input characters or input tokens and
accumulates them in an input buffer or stack as determined by a set of
productions which constitute a grammar. When the parser discovers a
sequence of tokens as defined by a grammar rule, or right side of a
production, it ``reduces'' the sequence to a single reduction token as
defined by the left side of the grammar rule. This nonterminal token
now replaces the tokens which matched the grammar rule and the search
for matches continues. If an input token is encountered which will not
yield a match for any rule, it is considered a syntax error and some
kind of error recovery may be required to continue. If a match, or
reduction action, yields the grammar token, sometimes called the goal
token or start token, the parser deems its work complete and returns
to whatever procedure may have called it.

Tokens may have semantic values. If the \agparam{input values}
configuration switch is on, your parser will expect semantic values to
be provided by the input process along with the token identification
code. If the
\agparam{input values} switch is off, your parser will take the ASCII
value of the input character, that is, the actual input code, as the
value of the character. When the parser reduces a production, it can
call a reduction procedure or semantic action to analyze the values of
the constituent tokens.  This reduction procedure can then return a
value which characterizes the reduced token.

\paragraph{Parser Generator.}
A parser generator, such as AnaGram, is a program that converts a
grammar, a rule-based description of the input to a program, into a
conventional, procedural module called a parser. The parsers AnaGram
generates are simple C modules which can be compiled on almost any
platform. AnaGram parsers are also compatible with C++.

\index{Parsing engine}
\paragraph{Parsing Engine.}
A parser consists of three basic components: a set of syntax tables, a
set of reduction procedures, and a parsing engine. The parsing engine
is the body of code that interprets the parsing table, invokes input
functions, and calls the reduction procedures. The \agmenu{Build
Parser} command configures a parsing engine according to the implicit
requirements of the syntax specification and according to the explicit
values of the configuration parameters.

The parsing engine itself is a simple automaton, characterized by a
set of states and a set of inputs. The inputs are the tokens of your
grammar. Each state is represented by a list of tokens which are
admissible in that state and for each token an action to perform and a
parameter which further defines the action. In a traditional LR
parser, there are only four actions: the shift action, the reduce
action, the accept action and the error action. AnaGram, in doing its
grammar analysis, identifies a number of special cases, and creates a
number of extra actions which make for faster processing, but which
can be represented as combinations of these primitive actions.

When a shift action is performed, the current state number is pushed
onto the parser state stack and the new state number is determined by
the current state number and the current lookahead token. Different
tokens cause different new states.

When a reduce action is performed, the length of the rule being
reduced is subtracted from the parser stack index and the new state
number is read from the top of the parser state stack. The reduction
token for the rule being reduced is then used as an input token.

Each state in the grammar, with the exception of state zero, has a
characteristic token which must have been recognized in order to jump
to that state. Therefore, the parser state stack, which is essentially
a list of state numbers, can also be thought of as a list of token
numbers. This is the list of tokens that have been seen so far in the
parse of your input stream. Some of these tokens, of course, may be
nonterminal tokens which have resulted from reducing other sequences
of tokens.

\index{Partition}
\paragraph{Partition.}
If you use character sets in your grammar, AnaGram will compute a
``partition'' of the character universe. This partition is a
collection of non-overlapping character sets such that every one of
the sets you have defined can be written as a union of partition
sets. Each partition set is identified by a unique reference number
called the partition set number.

Each partition set is assigned a unique token. If one of your
character sets requires more than one partition set to represent it,
AnaGram will create appropriate productions and add them to your
grammar so your parser can make the necessary distinctions.
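
For example, given the definitions of \agcode{letter} and \agcode{hex
digit} shown under \agterm{Character sets} above, the partition sets
would be \agcode{'a-f' + 'A-F'}, \agcode{'g-z' + 'G-Z'}, and
\agcode{'0-9'}, and AnaGram would generate productions along these
lines:

\begin{indentingcode}{0.4in}
letter    -> 'a-f' + 'A-F' | 'g-z' + 'G-Z'
hex digit -> '0-9' | 'a-f' + 'A-F'
\end{indentingcode}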

\index{Precedence}
\paragraph{Precedence.}
``Precedence'' is an attribute of binary operators and unary operators
which can be used to resolve conflicts. Operator precedence is also
referred to as \agterm{binding}. Suppose that \agcode{\#} and
\agcode{@} are two binary operators. The question is how to interpret
\agcode{x \# y @ z}. If \agcode{\#} has higher precedence than
\agcode{@}, it is interpreted as \agcode{(x \# y) @ z}. On the other
hand, if \agcode{\#} has lower precedence than \agcode{@}, it is
interpreted as \agcode{x \# (y @ z)}. If \agcode{\#} and
\agcode{@} have the same
precedence then the question must be resolved by associativity.

Note that all operators at the same precedence level must have the
same associativity.

The situation is somewhat simpler for unary operators. If
\agcode{\#} and \agcode{@} were both prefix operators, or both suffix
operators, there would be no ambiguity in interpretation, since
neither \agcode{\# @ x} nor \agcode{x \# @} offers more than one
possible interpretation.  The only
difficulty arises when both a prefix and a suffix operator are applied
to the same operand.

Suppose that \agcode{\#} is a prefix operator and \agcode{@} is a
suffix operator. The question then is how to interpret \agcode{\# x @}.
If \agcode{\#} has higher precedence than \agcode{@}, it is
interpreted as \agcode{(\# x) @}. On the other hand, if \agcode{\#}
has lower precedence than \agcode{@}, it is interpreted as
\agcode{\# (x @)}. If \agcode{\#} and \agcode{@} have the same
precedence then the question must be resolved by associativity.

\index{Production}
\paragraph{Production.}
Productions are the mechanism used to describe how complex input
structures are built up out of simpler ones. Each production has a
left side and a right side. The right side, or grammar rule, is a
sequence of rule elements, which may represent either terminal tokens
or nonterminal tokens. The left side is a list of reduction tokens. In
most cases there would be only a single reduction token.  Productions
with more than one token on the left side are called semantically
determined productions. The ``\agcode{->}'' symbol is used to separate
the left side from the right side.

\paragraph{Reduce Action.}
The reduce action is one of the four actions of a traditional parsing
engine. The reduce action is performed when the parser has succeeded
in matching all the elements of a grammar rule and the next input
token is not erroneous. Reducing the grammar rule amounts to
subtracting the length of the rule from the parser stack index,
identifying the reduction token, stacking its semantic value and then
doing a shift action with the reduction token as though it had been
input directly.

\index{Conflicts}
\index{Reduce-reduce conflicts}
\paragraph{Reduce-Reduce Conflict.}
A grammar has a ``reduce-reduce'' conflict at some state if a single
token turns out to be a reducing token for more than one completed
rule.
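
As a minimal, contrived illustration:

\begin{indentingcode}{0.4in}
thing -> alpha, '1' | beta, '1'
alpha -> 'x'
beta  -> 'x'
\end{indentingcode}

After \agcode{'x'} has been shifted, the resulting state has two
completed rules, and \agcode{'1'} is a reducing token for both of
them; the parser cannot tell whether to reduce \agcode{'x'} to
\agcode{alpha} or to \agcode{beta}.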

\index{Reducing token}
\index{Token}
\paragraph{Reducing Token.}
In a state with more than one completed rule, your parser must be able
to determine which one was actually found. AnaGram deals with this
problem by looking at all the states the parser will branch to once
each rule is reduced. The acceptable input tokens for those states are
the ``reducing tokens'' for the completed rules in the state under
investigation. If there is a single token which is a reducing token
for more than one rule, then the grammar is said to have a
reduce-reduce conflict at that state. If in a particular state there
is both a shift action and a reduce action for the same token, the
grammar is said to have a shift-reduce conflict in that state.

A reducing token is not the same as a reduction token.

\index{Reduction procedure}
\paragraph{Reduction Procedure.}
A ``reduction procedure'' is a function you write which your parser
executes when it has identified the grammar rule to which the
reduction procedure is attached in your grammar. There are two formats
for reduction procedures, depending on the size and complexity of the
procedure.
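
For instance (the token is hypothetical, and the exact conventions are
described in the chapter on reduction procedures), a short procedure
can be written as a simple expression following the ``\agcode{=}'',
while a longer one can be written as a brace-enclosed block of C code:

\begin{indentingcode}{0.4in}
(int) digit string
  -> '0-9':d                  = d - '0';
  -> digit string:n, '0-9':d  = {
       return 10*n + (d - '0');
     }
\end{indentingcode}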

\index{Reduction token}
\index{Token}
\paragraph{Reduction Token.}
A token which appears on the left side of a production is called a
reduction token. It is so called because when the grammar rule on the
right side of the production is matched in the input stream, your
parser will ``reduce'' the sequence of tokens which matches the rule
by replacing the sequence of tokens with the reduction token. Note
that, if more than one reduction token is specified, your reduction
procedure should choose the correct one. If it does not, your parser
will use the leftmost syntactically correct one in the list as the
default.

A reduction token is not the same as a reducing token.

\index{Resynchronization}
\paragraph{Resynchronization.}
``Resynchronization'' is the process of getting your parser back in
step with its input after encountering a syntax error. As such, it is
one method of error recovery. Of course, you would resynchronize only
if it is necessary to continue after the error. There are several
options available when using AnaGram. You could use the \agparam{auto
resynch} option, which causes AnaGram to incorporate an automatic
resynchronizing procedure into your parser, or you could use the
\agparam{error token} option, which is similar to the technique used
by \agfile{yacc} programmers.

\index{Rule elements}
\paragraph{Rule Element.}
A grammar rule is a list of ``rule elements'', separated by commas.
Rule elements may be token names, character sets, keywords, immediate
actions, or virtual productions. When AnaGram encounters a rule
element for which no token presently exists, it creates one.

\index{Semantic action}
\index{Action}
\paragraph{Semantic Action.}
A semantic action is a piece of C or C++ code that is executed when a
grammar rule has been identified by a parser. Semantic actions are
also called reduction procedures, since they are executed when a
grammar rule is reduced to the token on the left side of the
production.

\index{Semantic value}
\index{Value}
\paragraph{Semantic Value.}
Tokens, whether terminal or nonterminal, may have a semantic value. In
the case of terminal tokens, this may be a value assigned by the
lexical scanner or, if the parser is using direct character input, it
will be the ASCII value of the character itself. The values of
nonterminal tokens are created by reduction procedures. As a parse
progresses, token values are shifted onto the stack, so that in a
reduction procedure, the values of the tokens that comprise the
grammar rule that is being reduced are available for inspection.

\index{Semantically determined production}
\index{Production}
\paragraph{Semantically Determined Production.}
A ``semantically determined production'' is one which has more than
one reduction token specified on the left side of the production. You
would write such a production when the reduction tokens are
syntactically indistinguishable. The reduction procedure may then
specify which of the listed reduction tokens the grammar rule is to
reduce to based on semantic considerations. If there is no reduction
procedure, or the reduction procedure does not specify a reduction
token, the parser will use the leftmost syntactically correct
reduction token.
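
For example (with hypothetical token names), if variable names and
function names are syntactically indistinguishable, you might write:

\begin{indentingcode}{0.4in}
variable name, function name -> identifier
\end{indentingcode}

The reduction procedure attached to this rule could consult a symbol
table and select the appropriate reduction token; if it does not, the
parser reduces to \agcode{variable name}, the leftmost token in the
list.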

\index{Set expressions}
\paragraph{Set Expression.}
A set expression is an algebraic expression used to define a character
set in terms of individual characters, ranges of characters, and/or
other sets of characters as constructed using complements, unions,
intersections, and differences.

\paragraph{Shift Action.}
The shift action is one of the four actions of a traditional parsing
engine. The shift action is performed when the input token matches one
of the acceptable input tokens for the current parser state. The
semantic value of the token and the current state number are stacked,
the parser stack index is incremented and the state number is set to a
value determined by the previous state and the input token.

\index{Shift-reduce conflict}
\index{Conflicts}
\paragraph{Shift-Reduce Conflict.}
A ``shift-reduce'' conflict occurs if in some state there is a single
token that should be shifted, because it is legitimate input for one
of the rules of the state, but should also be used to reduce some
other rule because it is a reducing token for that rule.
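
The classic example is the ``dangling else'', sketched here in AnaGram
notation (the token names are illustrative):

\begin{indentingcode}{0.4in}
statement
  -> "if", '(', condition, ')', statement
  -> "if", '(', condition, ')', statement, "else", statement
  -> simple statement
\end{indentingcode}

When \agcode{"else"} appears in the input after a complete inner
\agcode{statement}, it is a reducing token for the first rule but is
also legitimate shiftable input for the second, so the state has a
shift-reduce conflict. Resolving it in favor of the shift, as AnaGram
does, attaches the \agcode{"else"} to the nearest \agcode{"if"}.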

\index{Parsing}
\index{Syntax directed parsing}
\paragraph{Syntax Directed Parsing.}
Syntax directed parsing, or formal parsing, is an approach to building
parsers based on formal language theory. Given a suitable description
of a language, called a grammar, there are algorithms which can be
used to create parsers for the language automatically. In this
context, the set of all possible inputs to a program may be considered
to constitute a language, and the rules for formulating the input to
the program constitute the grammar for the language.

The parsers built from a grammar have the advantage that they can
recognize any input that conforms to the rules, and can reject as
erroneous any input that fails to conform.

Since the program logic necessary to parse input is often extremely
intricate, programs which use formal parsing are usually much more
reliable than those built by hand. They are also much easier to
maintain, since it is much easier to modify a grammar specification
than it is to modify complex program logic.

\index{Terminal token}
\index{Token}
\paragraph{Terminal Token.}
A ``terminal token'' is a token which does not appear on the left side
of a production. It represents, therefore, a basic unit of input to
your parser.  If the input to your parser consists of ASCII
characters, you may define terminal tokens explicitly as ASCII
characters or as sets of ASCII characters. If you have an input
procedure which produces numeric codes, you may define the terminal
tokens directly in terms of these numeric codes.

\index{Token}
\paragraph{Token.}
Tokens are the units with which your parser works. There are two kinds
of tokens: terminal tokens and nonterminal tokens. These latter are
identified by the parser as sequences of tokens. The grouping of
tokens into more complex tokens is governed by the grammar rules, or
productions in your grammar. In your grammar, tokens may be denoted by
token names, by virtual productions, by immediate actions, by explicit
character representations, or by expressions which yield character
sets.

A ``token'' should be thought of more or less as being something like
a cloakroom token. Imagine you had taken a kindergarten class to the
museum and all the kids checked their coats. Each one gets a token.
What is the probability of losing one?  You take all of the tokens
from the children, put them in a paper bag, tie it up, and check
it. You get one token in return. We would say that the kids' coats
represent the actual input to the system, their individual tokens are
the terminal tokens, and your one token which enables you to redeem
the paper bag is a nonterminal token. Nonterminal tokens are just
single token numbers to represent the result of compacting a sequence
of more primitive tokens.
%Thus we sometimes refer to the input character as the input even though
%there is a translation process, token conversion, required to turn the
%input character, the winter coat, into a token number, or numbered brass
%slug.

Actually, the word ``token'' is commonly used to refer to a number of
apparently different things. The tokens you use in a grammar are
abstract. The tokens identified by a parser are concrete instances of
the abstract tokens in the grammar.  Furthermore, we have to identify
tokens in some manner, so we have token names and token numbers with
which to refer to particular tokens. As a result the word ``token'',
in any particular context, may refer directly to either an abstract
token, or a concrete instance of an abstract token, or may refer only
indirectly to the token by means of a token name or token number.
%
% concrete  whether concrete or abstract, can be confused
%
%we can easily confuse the identification with the token itself, whether
%concrete or abstract. 
%
%
%the token, whether concrete or abstract, can be confused with its
%identification.
%
%"token" is sometimes used to refer to the token identification rather than
%the token itself. The correct meaning is usually clear from the context, at
%least to an experienced reader.
%

As an example, consider a C compiler which has a lexical scanner to
identify input tokens. According to Kernighan and
Ritchie\footnote{Brian W. Kernighan and Dennis M. Ritchie. \textit{The
C Programming Language}, 2nd ed. (Englewood Cliffs, New Jersey:
Prentice-Hall, 1988), p. 191.}, there are six classes of C tokens:
identifiers, keywords, constants, string literals, operators, and
other separators. Consider a particular token, a hexadecimal constant,
\agcode{0XF00}. When the lexical scanner hands this over to the
parser, it has to provide an identification code which says what kind
of token it is, a hexadecimal constant, as well as the ``semantic
value'' of the token, 3840. The identification code is usually an
integer value, determined by the designer of the lexical scanner,
which by agreement with the designer of the parser uniquely represents
a hexadecimal constant. This identification code can be thought of as
an external token number.

The grammar for the parser, on the other hand, has a token,
\agcode{HEXconstant}, let us say, to represent the abstract usage of
hexadecimal constants in the C language.  When AnaGram, or any other
parser generator for that matter, analyzes the grammar, it assigns its
own reference number to identify \agcode{HEXconstant}. This reference
number can be thought of as an internal token number to contrast it
with the external token number provided by the lexical scanner. The
parsers AnaGram builds contain a table, \agcode{ag{\us}tcv}, which
provides a translation from the external to the internal token
number. Both of these token numbers, of course, differ from the
semantic value of the token, in this case, 3840.

Even when an AnaGram parser is using simple character input, the word
``token'' may represent several different things. Suppose a grammar
has a token \agcode{'A-Z'}, which represents the set of upper case
letters. There will be an internal token number, assigned by AnaGram,
which identifies the (abstract) token in parser tables. The actual
(concrete) input token to a parser will, however, be a single letter,
say ``Q''. In this case, the ASCII value of the character ``Q'', 81,
is used both as the external token number and as the semantic value of
the token.

\paragraph{Unary Operator.}
A ``unary operator'' is a token which represents some sort of
operation which operates on a single entity to produce a new resultant
entity. It is normally represented by a symbol which is written either
preceding the operand or following the operand. In the first case it
is called a ``prefix operator'' and in the second case a ``suffix
operator''.

\index{Union}
\paragraph{Union.}
The union of two sets is the set of all elements that are to be found
in one or another of the two sets. In an AnaGram syntax file the union
of two character sets \agcode{A} and \agcode{B} is represented using
the plus sign, as in \agcode{A + B}. The union operator has the same
precedence as the difference operator: lower than that of intersection
and complement. The union operator is left associative.