comparison doc/misc/html/gloss.html @ 0:13d2b8934445

Import AnaGram (near-)release tree into Mercurial.
author David A. Holland
date Sat, 22 Dec 2007 17:52:45 -0500
parents
children
comparison
equal deleted inserted replaced
-1:000000000000 0:13d2b8934445
1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
2 <HTML>
3 <HEAD>
4 <TITLE>Glossary of Parsing Terms</TITLE>
5
6 </HEAD>
7
8 <BODY BGCOLOR="#ffffff" BACKGROUND="tilbl6h.gif"
9 TEXT="#000000" LINK="#0033CC"
10 VLINK="#CC0033" ALINK="#CC0099">
11
12 <P>
13 <IMG ALIGN="right" SRC="images/agrsl6c.gif" ALT="AnaGram"
14 WIDTH=124 HEIGHT=30>
15 <BR CLEAR="RIGHT">
16 </P>
17
18 Back to <A HREF="index.html">Index</A>
19 <P>
20 <IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------"
21 WIDTH=1010 HEIGHT=2 >
22
23 <P>
24 <H2>Glossary of Parsing Terms</H2>
25 <IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------"
26 WIDTH=1010 HEIGHT=2 >
27 </P>
28
29 <P>
30 <TABLE WIDTH="100%" ALIGN="center" CELLPADDING=15 CELLSPACING=15>
31 <TR ALIGN="left">
32
33 <TD VALIGN=TOP NOWRAP WIDTH="50%">
34
35
36
37 <A HREF="#ParserAction">Action</A><BR>
38 <A HREF="#AcceptAction">Accept Action</A><BR>
39 <A HREF="#Associativity">Associativity</A><BR>
40 <A HREF="#Backtracking">Backtracking</A><BR>
41 <A HREF="#BNF">Backus-Naur Form</A><BR>
42 <A HREF="#BinaryOperator">Binary operator</A><BR>
43 <A HREF="#Binding">Binding</A><BR>
44 <A HREF="#CharacterSets">Character sets</A><BR>
45 <A HREF="#Universe">Character Universe</A><BR>
46 <A HREF="#CharacteristicRule">Characteristic Rule</A><BR>
47 <A HREF="#CharacteristicToken">Characteristic Token</A><BR>
48 <A HREF="#SetComplement">Complement of a set</A><BR>
49 <A HREF="#CompletedRule">Completed Rule</A><BR>
50 <A HREF="#Conflict">Conflict</A><BR>
51 <A HREF="#ContextFreeGrammar">Context Free Grammars</A><BR>
52 <A HREF="#DefaultReduction">Default reductions</A><BR>
53 <A HREF="#SetDifference">Difference of two sets</A><BR>
54 <A HREF="#ErrorAction">Error Action</A><BR>
55 <A HREF="#EventDriven">Event Driven Parser</A><BR>
56 <A HREF="#ExpansionRule">Expansion Rule</A><BR>
57 <A HREF="#Grammar">Grammar</A><BR>
58 <A HREF="#GrammarAnalysis">Grammar Analysis</A><BR>
59 <A HREF="#GrammarRule">Grammar Rule</A><BR>
60 <A HREF="#GrammarToken">Grammar Token</A><BR>
61 <A HREF="#SetIntersection">Intersection</A><BR>
62 <A HREF="#LALRParsing">LALR parsing</A><BR>
63 <A HREF="#LexicalScanner">Lexical Scanner</A><BR>
64 <A HREF="#Lookahead">Lookahead Token</A><BR>
65 <A HREF="#LRParsing">LR Parsing</A><BR>
66 <A HREF="#MarkedRule">Marked Rule</A><BR>
67 <A HREF="#NonAssociative">Non-associative</A><BR>
68 </TD>
69
70 <TD VALIGN=TOP NOWRAP WIDTH="50%">
71 <A HREF="#Nonterminal">Nonterminal Token</A><BR>
72 <A HREF="#NullProduction">Null Production</A><BR>
73 <A HREF="#ParameterAssignment">Parameter Assignment</A><BR>
74 <A HREF="#Parser">Parser</A><BR>
75 <A HREF="#ParserGenerator">Parser Generator</A><BR>
76 <A HREF="#ParserStateStack">Parser State Stack</A><BR>
77 <A HREF="#ParserValueStack">Parser Value Stack</A><BR>
78 <A HREF="#ParsingEngine">Parsing Engine</A><BR>
79 <A HREF="#SetPartition">Partition</A><BR>
80 <A HREF="#Precedence">Precedence</A><BR>
81 <A HREF="#Production">Production</A><BR>
82 <A HREF="#ReduceAction">Reduce Action</A><BR>
83 <A HREF="#ReduceReduce">Reduce-Reduce Conflict</A><BR>
84 <A HREF="#ReducingToken">Reducing Token</A><BR>
85 <A HREF="#ReductionProcedure">Reduction Procedure</A><BR>
86 <A HREF="#ReductionToken">Reduction Token</A><BR>
87 <A HREF="#Resynchronization">Resynchronization</A><BR>
88 <A HREF="#RuleElement">Rule Element</A><BR>
89 <A HREF="#SemanticAction">Semantic Action</A><BR>
90 <A HREF="#SemanticValue">Semantic Value</A><BR>
91 <A HREF="#SemanticallyDetermined">Semantically Determined Production</A><BR>
92 <A HREF="#SetExpression">Set Expression</A><BR>
93 <A HREF="#ShiftAction">Shift Action</A><BR>
94 <A HREF="#ShiftReduce">Shift-Reduce Conflict</A><BR>
95 <A HREF="#SyntaxDirectedParsing">Syntax Directed Parsing</A><BR>
96 <A HREF="#Terminal">Terminal Token</A><BR>
97 <A HREF="#Token">Token</A><BR>
98 <A HREF="#UnaryOperator">Unary Operator</A><BR>
99 <A HREF="#SetUnion">Union</A><BR>
100 <A HREF="#VirtualProduction">Virtual Production</A><BR>
101
102 </TD>
103
104 </TABLE>
105
106 <HR>
107
108
109
110 <DL>
111 <DT><B><A NAME="ParserAction">Action</A></B></DT>
112 <DD>A "parser action" is one of the basic executable elements of a
113 <A HREF="#ParsingEngine">parsing engine</A>. In a traditional parsing engine
114 there are four actions: the
115 <A HREF="#ShiftAction">shift action</A>, the <A HREF="#ReduceAction">reduce
116 action</A>, the <A HREF="#AcceptAction">accept action</A>, and the
117 <A HREF="#ErrorAction">error action</A>. The parsing engines which AnaGram
118 generates use a substantially greater number of actions, in order to provide
119 certain short cuts in the interest of greater speed and greater compactness of
120 the tables which control the action of the parsing engine.
121 <P></P>
122 </DD>
123
124 <DT><B><A NAME="AcceptAction">Accept Action</A></B></DT>
125 <DD> The accept <A HREF="#ParserAction">action</A> is one of the four actions
126 of a traditional <A HREF="#ParsingEngine">parsing engine</A>. The accept action
127 is performed when the <A HREF="#Parser">parser</A> has succeeded in identifying
128 the <A HREF="#GrammarToken">goal token</A> for the <A HREF="#Grammar">grammar</A>.
129 When the parser executes the accept action, it returns to the calling program.
130 The accept action is thus the last action of
131 the parsing engine and occurs only once for each successful execution
132 of the parser.
133
134 <P></P>
135 </DD>
136
137 <DT><B><A NAME="Associativity">Associativity</A></B></DT>
138 <DD> This is a term used to describe <A HREF="#BinaryOperator">binary
139 operators </A>. It describes how you interpret a sequence of operations which
140 all involve the same operator. Consider a - b - c, for instance. Here the
141 convention is that we subtract b from a, and then c from the difference. We say
142 that subtraction is <B>left associative</B>, since if there is a sequence of
143 subtraction operations, the leftmost one is to be performed first. As a second
144 example, consider exponentiation. FORTRAN uses ** as an operator to indicate
145 exponentiation. In order to conform as closely as possible to ordinary
146 mathematical notation, the expression a**b**c is interpreted to mean that first
147 b is raised to the power c, and then the resultant value is used as the power to
148 which a is raised. We say that exponentiation is <B>right associative</B> since the
149 rightmost of a sequence of exponentiation operations is to be performed first.
150 If a programming language does not allow for unparenthesized sequences of a
151 particular operator, the operator is said to be non-associative.<P>
152 Another way to view left and right associativity is as implied parenthesization.
153 Left associative operators are parenthesized from the left. Right associative
154 operators are parenthesized from the right. Thus a - b - c = ((a-b) - c) and
155 a**b**c = (a**(b**c))</P><P>
156 AnaGram offers two ways to specify associativity of
157 <A HREF="#BinaryOperator">binary operators</A>. The first
158 way is to use conventional <A HREF="#BNF">Backus-Naur Form</A> syntax
159 descriptions. You would userecursion to describe both subtraction and
160 exponentiation. Subtraction, since it is left associative, uses left recursion,
161 and exponentiation, being right associative, uses right recursion. Thus</P><P>
162 </P><PRE>
163 difference -&gt; value | difference, '-', value
164 exponential -&gt; value | value, "**", exponential
165 </PRE> could be used to describe differences and exponents.<P>
166 The second way to specify associativity is to use an ambiguous <A HREF="#Grammar">grammar</A> and
167 precedence declarations. (See Chapter 9, AnaGram User's Guide.)</P><P>
168 </P></DD>
169
170 <DT><B><A NAME="Backtracking">Backtracking</A></B></DT>
171 <DD> In order to make parse tables more compact and <A HREF="#Parser">parsers</A> faster, it is
172 common to use <A HREF="#DefaultReduction">default reductions</A>. In case of
173 error, it is necessary to undo default reductions before diagnostics can be
174 properly determined. In AnaGram, this undo operation is called backtracking.<P>
175
176 </P></DD>
177
178 <DT><B><A NAME="BNF">Backus-Naur Form</A></B></DT>
179 <DD> Backus-Naur Form, or BNF, is a conventional notation for describing
180 <A HREF="#ContextFreeGrammar">context free grammars</A>. AnaGram uses an
181 extended notation for <A HREF="#Grammar">grammars</A>, which, except for
182 <A HREF="#SemanticallyDetermined">semantically determined
183 productions</A>, can be shown to be equivalent to BNF. The term "BNF" is
184 often used colloquially to refer to a grammar specification.<P>
185 AnaGram's syntax specification language differs from BNF in the following
186 respects:
187 </P><OL>
188 <LI>In conventional BNF, a symbol not enclosed in angle brackets (&lt; &gt;) is
189 taken to represent itself literally. In AnaGram, literal character
190 representations must be enclosed in single quotes and literal strings within
191 double quotes.
192 </LI>
193 <LI>In BNF, the right side of a <A HREF="#Production">production</A> is simply a list of symbols without
194 any delimiters or punctuation. AnaGram requires that the <A HREF="#RuleElement">rule elements</A> which
195 make up a <A HREF="#GrammarRule">grammar rule</A>, or right side, be joined by commas.
196 </LI>
197 <LI>BNF makes no provision for identifying <A HREF="#ReductionProcedure">reduction procedures</A> or their
198 arguments. AnaGram provides both reduction procedures, introduced by an "="
199 at the end of the production, and named arguments, introduced by a ":"
200 at the end of any <A HREF="#Token">token</A> on the right side of the production.
201 </LI>
202 <LI>AnaGram allows <A HREF="#VirtualProduction">virtual productions</A> to be used freely.
203 </LI>
204 <LI>BNF is "pure" in that if you wish to define a <A HREF="#Nonterminal">nonterminal token</A>
205 called "digit" to represent the digits from zero to nine, you must
206 provide ten explicit productions to define it. AnaGram treats the concept of <A HREF="#Terminal">"terminal
207 token"</A> as used in language theory as an abstraction, and interposes
208 <A HREF="#CharacterSets">character set</A> functionality between actual character input and the terminal
209 tokens of BNF. You can define digit to be the character range '0-9', for
210 instance, and AnaGram will determine and define precisely the minimum number of
211 productions necessary, taking into account any other usage of the characters
212 '0-9' in your grammar. This makes your grammar more compact, more manageable,
213 and easier to modify.
214 </LI>
215 <LI>AnaGram allows for <A HREF="#SemanticallyDetermined">
216 semantically determined productions</A>, which provide a
217 significant mechanism for melding semantic analysis with syntactic
218 analysis.
219
220 </LI>
221 </OL>
222 </DD>
223
224 <DT><B><A NAME="BinaryOperator">Binary operator</A></B></DT>
225 <DD> A binary operator is an operator that works on two operands to create a
226 result. It is often written between its operands and is sometimes called an
227 infix operator. In ordinary programming languages "+" and "-"
228 are binary operators.
229 <P></P>
230 </DD>
231
232 <DT><B><A NAME="Binding">Binding</A></B></DT>
233 <DD> Most programming languages, rather than executing arithmetic operators in
234 simple left to right order, conform to the traditional conventions of ordinary
235 algebra, which dictate that, except where parenthesis indicate otherwise,
236 exponentiation is done before multiplication, multiplication before addition,
237 and addition before comparison. One says that exponentiation is "more
238 tightly binding" than multiplication, multiplication is more tightly
239 binding than addition, and so on. The sense of the word here is that the
240 operator binds together its operands into a single entity. An operand which
241 falls between two different operators is "bound" by the more tightly
242 binding operator. An operator that is more tightly binding than another is also
243 said to have higher <A HREF="#Precedence">precedence</A>.
244 <P></P>
245 </DD>
246
247 <DT><B><A NAME="CharacterSets">Character sets</A></B></DT>
248 <DD> One of the traditional problems with syntax directed programming is that
249 caused by the fact that most formal language specifications have productions of
250 the form:
251 <PRE>
252 letter -&gt; a | b | c | d ... | z
253 </PRE> Since the letters are not often distinguished in the <A HREF = "#Grammar">grammar</A>, this large
254 number of essentially identical <A HREF="#Production">productions</A> causes a correspondingly large
255 number of states in the <A HREF="#Parser">parser</A>. This problem is often attacked by using a<A
256 HREF="#LexicalScanner"> "lexical scanner"</A> which simply specifies
257 a "letter token" whose value indicates precisely which letter is
258 intended. This works fine as long as there is nothing in the grammar which
259 distinguishes any letter at all. If there is, and there is usually some special
260 use of h, x, q, o, e, or even a-f, the situation gets more complicated. Often the
261 result is to abandon the use of syntax directed programming for elemental units
262 of the language and use a more complex lexical scanner which identifies names,
263 numbers, key words and operators. Such lexical scanners are often built using
264 conventional programming languages or simple pattern recognizing languages such
265 as LEX.<P>
266 AnaGram avoids this problem by incorporation the notion of character set into
267 its input specifications. Briefly, the way it works is the following: You
268 specify the set of characters which make up any ostensibly <A HREF="#Terminal">terminal token</A>.
269 AnaGram then analyzes the overlaps among these definitions and creates a minimal
270 set of <A HREF="#Token">tokens</A> which match your specifications exactly. For instance, suppose you
271 have the following definitions:
272 </P><PRE>
273 letter = 'a-z' + 'A-Z'
274 hex digit = '0-9' + 'a-f' + 'A-F'
275 </PRE> AnaGram will define a token for letter which has two productions:
276 <PRE>
277 letter -&gt; 'a-f' + 'A-F' | 'g-z' + 'G-Z'
278 </PRE> With this technique, you have the freedom to specify your grammar in the
279 easiest possible manner, without the penalty of an absurdly large, slow parser.<P>
280 Character sets may be specified in terms of ranges of characters, as in the
281 above example, as <A HREF="#SetUnion">unions</A>, denoted by "+", as
282 <A HREF="#SetIntersection">intersections</A>, denoted by "&amp;", as
283 <A HREF="#SetDifference">differences,</A> denoted by "-", and as
284 <A HREF="#SetComplement">complements</A>, using "~". You may also
285 specify a set consisting of a single character either with a literal character
286 or with a numeric specification. If you specify a character numerically, you may
287 use either decimal, octal or hexadecimal notation, as in C.
288 </P><P></P>
289 </DD>
290
291 <DT><B><A NAME="Universe">Character Universe</A></B></DT>
292 <DD> In ordinary set theory, the "universe" is the set of all
293 entities which may be elements of a set. It must be defined so that the
294 complement of a set can mean something unambiguous. In AnaGram, the character
295 universe normally comprises the ascii characters and the IBM extended character
296 set, that is, all characters in the range from 0 through 255. If, however, you
297 have defined input characters outside this range, the character universe will be
298 extended to the least character you have used and to the greatest character you
299 have used in your <A HREF = "#Grammar">grammar</A>.
300 <P></P>
301 </DD>
302
303 <DT><B><A NAME="CharacteristicRule">Characteristic Rule</A></B></DT>
304 <DD> Each parser state is characterized by a particular set of <A HREF="#GrammarRule">grammar rules</A>,
305 and for each such rule, a marked token which is the next <A HREF="#Token">token</A> expected. These
306 rules are called the characteristic rules for the state. In the course of doing
307 <A HREF="#GrammarAnalysis">grammar analysis</A>, AnaGram determines the characteristic rules for each parser
308 state. After analyzing your grammar, you may inspect the State Definition Table
309 to see the characteristic rules for any state in your <A HREF="#Parser">parser</A>.
310 <P></P>
311 </DD>
312
313 <DT><B><A NAME="CharacteristicToken">Characteristic Token</A></B></DT>
314 <DD> Every state in a <A HREF="#Parser">parser</A>, except state zero, can be characterized by the
315 one, unique <A HREF="#Token">token</A> which causes a jump to that state. That token is called the
316 characteristic token of the state, because to get to that parser state you must
317 have just seen precisely that token in the input. Note that several states could
318 have the same characteristic token.<P>
319 When you have a list of states, such as is given by the parser state stack, it
320 is equivalent to a list of characteristic tokens. This list of tokens is the
321 list of tokens that have been recognized so far by the parser. Some of these
322 tokens, of course, may be <A HREF="#Nonterminal">nonterminal tokens</A> and may thus represent the result
323 of reducing a sequence of previously recognized tokens.
324 </P>
325 <P></P>
326 </DD>
327
328 <DT><B><A NAME="SetComplement">Complement of a set</A></B></DT>
329 <DD> In set theory, the complement of a set, S, is the collection of all
330 elements of the <A HREF="#Universe">character universe</A> which are not elements of S. Using AnaGram's
331 notation, the complement of S is written ~S. The complement operator has higher
332 <A HREF="#Precedence">precedence</A> than the
333 <A HREF="#SetDifference">difference</A>, <A HREF="#SetIntersection">intersection</A>, or <A HREF="#SetUnion">union</A> operators.
334 <P></P>
335 </DD>
336
337 <DT><B><A NAME="CompletedRule">Completed Rule</A></B></DT>
338 <DD> A "completed rule" is a <A HREF="#CharacteristicRule">characteristic rule</A> whose index is
339 pointing beyond the last <A HREF="#RuleElement">rule element</A>. In other words, it has been completely
340 matched and will be <A HREF="#ReduceAction">reduced</A> by the next input.
341 <P>
342 If a state has more than one completed rule, the decision as to which to reduce
343 is made based on the next input <A HREF="#Token">token</A>. If there is only one completed rule in a
344 state, it will be reduced by default unless the default reductions switch has
345 been reset, i.e., turned off.
346 </P></DD>
347 <DT><B><A NAME="Conflict">Conflict</A></B></DT>
348 <DD> Conflicts arise during a <A HREF="#GrammarAnalysis">grammar analysis</A> when AnaGram cannot determine
349 how to treat a given input <A HREF="#Token">token</A>. There are two sorts of conflicts: <A HREF="#ShiftReduce">shift-reduce
350 conflicts</A> and <A HREF="#ReduceReduce">reduce-reduce conflicts</A>. Conflicts may arise either because the
351 <A HREF = "#Grammar">grammar</A> is inherently ambiguous, or simply because the grammar analyzer cannot
352 look far enough ahead to resolve the conflict. In the latter case, it is often
353 possible to rewrite the grammar in such a way as to eliminate the conflict. <A HREF="#NullProduction">Null
354 productions</A> are a common source of conflict.
355 <P>
356 There are a number of ways to deal with conflicts. If you understand the
357 conflict well, you may simply choose to ignore it. When AnaGram encounters a
358 shift-reduce conflict while building parse tables it resolves it by choosing the
359 <A HREF="#ShiftAction">shift action</A>. When AnaGram encounters a reduce-reduce conflict while building
360 parse tables, it resolves it by selecting the <A HREF="#GrammarRule">grammar rule</A> which occurred first
361 in the grammar.
362 </P><P>
363 A second way to deal with conflicts is to set operator <A HREF="#Precedence">precedence</A> parameters. If
364 you set these parameters, AnaGram will use them preferentially to resolve
365 conflicts. Any conflicts so resolved will be listed in the Resolved Conflicts
366 window.
367 </P><P>
368 A third way to resolve a conflict is to declare some tokens as sticky or as
369 subgrammars. This is particularly useful for <A HREF="#Production">productions</A> whose sole purpose is
370 to skip over uninteresting input.</P><P>
371 </P><P>
372 The fourth way to deal with conflicts is to rewrite the grammar to eliminate
373 them. Many people prefer this approach since it yields the highest level of
374 confidence in the resulting program.
375 </P></DD>
376 <DT><B><A NAME="ContextFreeGrammar">Context Free Grammars</A></B></DT>
377 <DD> Context free grammars are <A HREF = "#Grammar">grammars</A> wherein the definition of a grammar
378 unit, or <A HREF="#Nonterminal">nonterminal token</A>, does not depend on the context in which the
379 nonterminal is used. That means that if you define "widget" in a
380 certain way, that definition is valid in all contexts in which "widget"
381 might occur. Context free grammars can be represented in <A HREF="#BNF">Backus-Naur Form</A>.
382 AnaGram's syntax specification language has no facility for representing
383 grammars which are not context free.
384 <P></P>
385 </DD>
386 <DT><B><A NAME="DefaultReduction">Default reductions</A></B></DT>
387 <DD> A "default reduction" is a parser action which may be used in
388 your <A HREF="#Parser">parser</A> in any state which has precisely one <A HREF="#CompletedRule">completed rule</A>.
389 <P>
390 If a given parser state has, among its <A HREF="#CharacteristicRule">characteristic rules</A>, exactly one
391 completed rule, it is usually faster to reduce it on any input than to check
392 specifically for correct input before reducing it. The only time this default
393 reduction causes trouble is in the event of a syntax error. In this situation
394 you may get an erroneous reduction. Normally when you are parsing a file, this
395 is inconsequential because you are not going to continue performing
396 <A HREF="#SemanticAction">semantic actions</A> in the presence of
397 errors. But, if you are using your parser to handle real-time
398 interactive input, you have to be able to continue semantic
399 processing after notifying your user that he has entered erroneous
400 input. In this case you would want default reductions to have been
401 turned off so that <A HREF="#Production">production</A>s are reduced
402 only when there is correct input.
403
404 </P></DD>
405
406 <DT><B><A NAME="SetDifference">Difference of two sets</A></B></DT>
407 <DD> In set theory, the difference of two sets, A and B, is defined to be the
408 set of all elements of A that are not elements of B. In an AnaGram syntax file,
409 you represent the difference of two <A HREF="#CharacterSets">character sets</A> by using the "-"
410 operator. Thus the difference of A and B is A - B. The difference operator is
411 <A HREF="#Associativity">left associative</A>.
412 <P></P>
413 </DD>
414
415 <DT><B><A NAME="ErrorAction">Error Action</A></B></DT>
416 <DD> The error action is one of the four <A HREF="#ParserAction">actions</A> of a traditional <A HREF="#ParsingEngine">parsing
417 engine</A>. The error action is performed when the <A HREF="#Parser">parser</A> has encountered an input
418 <A HREF="#Token">token</A> which is not admissible in the current state. The further behavior of a
419 traditional parser is undefined.
420 <P></P>
421 </DD>
422
423 <DT><B><A NAME="EventDriven">Event Driven Parser</A></B></DT>
424 <DD>
425 An <I>event driven</I> <A HREF="#Parser">parser</A> is one in which
426 the relationship between the host program and the parser is turned
427 inside out. In a conventional parser, the host program calls the
428 parser, the parser analyzes the complete input text and returns to
429 the host program only when it has finished with the entire input.
430 <P>In an event driven parser, the parser does not read its input
431 directly from a file or from memory. Instead, the host program, after
432 initializing the parser, calls it once for each input token. Each
433 time the parser is called, it updates its state
434 appropriately, calls any <A HREF="#ReductionProcedure">reduction
435 procedures</A> that need to be called and finally when it needs more
436 input, returns to the host program. The effect is that parsing can
437 occur in parallel with other processing performed by the host
438 program. This technique is especially useful in situations where the
439 token stream to be parsed is developed on the fly, as when using <A
440 HREF="#LexicalScanner">lexical scanners</A>, for instance.
441 <P></P>
442 </DD>
443
444
445 <DT><B><A NAME="ExpansionRule">Expansion Rule</A></B></DT>
446 <DD> In analyzing a <A HREF = "#Grammar">grammar</A>, we are often interested in the full range of input
447 that can be expected at a certain point. The expansion of a <A HREF="#Token">token</A> or state shows
448 us all the expected input. An expansion yields a set of
449 <A HREF="#MarkedRule">marked rules</A>. Looking
450 at the marked token shows us what input to expect.
451 <P>
452 The set of expansion rules of a <A HREF="#Nonterminal">nonterminal token</A> shows all the expected input
453 that can occur whenever the token appears in the grammar. The set consists of
454 all the <A HREF="#GrammarRule">grammar rules</A> produced by the token, plus all the rules produced by the
455 first token of any rule in the set. The marked tokens for the expansion rules of a
456 token are always the first element in the rule.
457 </P><P>
458 The expansion of a state consists of its
459 <A HREF="#CharacteristicRule">characteristic rules</A> plus the expansion
460 rules of the marked token in each characteristic rule.
461 </P>
462 </DD>
463
464
465 <DT><B><A NAME="Grammar">Grammar</A></B></DT>
466 <DD> Traditionally, a grammar is a set of <A HREF="#Production">productions</A> which taken together
467 specify precisely a set of acceptable input sequences in terms of an abstract
468 set of <A HREF="#Terminal">terminal tokens</A>. The set of acceptable input sequences is often called
469 the "language" defined by the grammar.
470 <P>
471 In AnaGram, the term grammar also includes configuration segments, definitions
472 of <A HREF="#CharacterSets">character sets</A>, <A HREF="#VirtualProduction">virtual productions</A>, etc. that augment the collection of
473 productions. A grammar is often called a "syntax", and the <A HREF="#GrammarRule">rules of
474 the grammar</A> are often called syntactic rules.
475 </P>
476 <P></P>
477 </DD>
478
479 <DT><B><A NAME="GrammarAnalysis">Grammar Analysis</A></B></DT>
480 <DD> The major function of AnaGram is the analysis of <A HREF = "#ContextFreeGrammar">context free grammars</A>
481 written in a particular variant of <A HREF="#BNF">Backus-Naur Form</A>.
482 <P>
483 The analysis of a <A HREF="#Grammar">grammar</A> proceeds in four stages. In the first, the input
484 grammar is analyzed and a number of tables are built which describe all of the
485 <A HREF="#Production">productions</A> and components of the grammar.
486 </P><P>
487 In the second stage, AnaGram analyzes all of the <A HREF="#CharacterSets">character sets</A> defined in the
488 grammar, and where necessary, defines auxiliary <A HREF="#Token">tokens</A> and productions.
489 </P><P>
490 In the third stage, AnaGram identifies all of the states of the <A HREF="#Parser">parser</A> and
491 builds the go-to table for the parser.
492 </P><P>
493 In the fourth stage, Anagram identifies <A HREF="#ReductionToken">reduction tokens</A> for each completed
494 <A HREF="#GrammarRule">grammar rule</A> in each state and checks for <A HREF="#Conflict">conflicts</A>.
495 </P>
496 <P></P>
497 </DD>
498
499 <DT><B><A NAME="GrammarRule">Grammar Rule</A></B></DT>
500 <DD> A "grammar rule" is the right side of a <A HREF="#Production">production</A>. It consists
501 of a sequence of <A HREF="#RuleElement">rule elements</A>. Each rule element identifies some <A HREF="#Token">token</A>, which
502 can be either a <A HREF="#Terminal">terminal token</A> or <A HREF="#Nonterminal">nonterminal token</A>. It is "matched"
503 by a corresponding sequence of tokens in the input stream to the <A HREF="#Parser">parser</A>. The
504 left side of the production identifies one or more nonterminal tokens, or
505 <A HREF="#ReductionToken">reduction tokens</A>, to which the rule reduces when matched. If there is more than
506 one reduction token, there should be a <A HREF="#ReductionProcedure">reduction procedure</A> to make the choice.
507 <P></P>
508 </DD>
509
510 <DT><B><A NAME="GrammarToken">Grammar Token</A></B></DT>
511 <DD> The grammar <A HREF="#Token">token</A> is the token which represents the "top level"
512 in your grammar. Some people refer to it as the "goal" or "goal
513 token" and others as the "start token". Whatever it is called, it
514 is the single token which describes the complete input to your <A HREF="#Parser">parser</A>.
515 <P></P>
516 </DD>
517
518 <DT><B><A NAME="SetIntersection">Intersection</A></B></DT>
519 <DD> In set theory, the intersection of two sets, A and B, is defined to be the
520 set of all elements of A which are also elements of B. In an AnaGram syntax
521 file, the intersection of two <A HREF="#CharacterSets">character sets</A> is represented with the "&amp;"
522 operator. The intersection operator has lower <A HREF="#Precedence">precedence</A> than the complement
523 operator, but higher precedence than the <A HREF="#SetUnion">union</A> and <A HREF="#SetDifference">difference</A> operators. The
524 intersection operator is <A HREF="#Associativity">left associative</A>.
525 <P></P>
526 </DD>
527
528 <DT><B><A NAME="LALRParsing">LALR parsing</A></B></DT>
529 <DD> LALR, or "lookahead LR" parsing, a slightly less powerful
530 variant of <A HREF="#LRParsing">LR parsing</A>, produces much smaller parsing tables. Thus an LALR
531 <A HREF="#Parser">parser</A> has very significant advantages in size over its LR counterpart. It is
532 possible to construct <A HREF = "#Grammar">grammar</A>s which can be parsed by an LR parser but cannot be
533 parsed by an LALR parser. That is, the LALR parser will find these grammars to
534 be ambiguous and the LR parser will not. In practice, however, such grammars
535 seem to be rare and the ambiguities can usually be eliminated by minor revisions.
536 <P>
537 AnaGram generates LALR parsers and, in addition, uses LALR parsing to interpret
538 syntax files.
539 </P><P></P>
540 </DD>
541
542 <DT><B><A NAME="LexicalScanner">Lexical Scanner</A></B></DT>
543 <DD> A lexical scanner is a program or procedure used as a preprocessor for a
544 <A HREF="#Parser">parser</A>. It scans input characters and lumps them together into <A HREF="#Token">tokens</A>. Many
545 systems for generating parsers, most notably YACC, can scarcely be used without
546 a lexical scanner.
547 <P>
548 AnaGram, although it can support the use of a lexical scanner, does not need a
549 lexical scanner for most practical problems. AnaGram avoids the need for a
550 lexical scanner by using <A HREF="#CharacterSets">character sets</A> and disregard and lexeme statements.
551 </P><P>
552 If your problem does in fact need a lexical scanner, you can use AnaGram itself
553 to write it, so you don't need to know different languages for the scanner and
554 for the parser.
555 </P>
556 <P></P>
557 </DD>
558
559 <DT><B><A NAME="Lookahead">Lookahead Token</A></B></DT>
560 <DD> The current input <A HREF="#Token">token</A> to a <A HREF="#Parser">parser</A> is often called the "lookahead"
561 token. In many situations, it is not shifted into the input buffer immediately,
562 but acts like a catalyst to cause a number of rules to be reduced before it is
563 eventually shifted in.
564 <P></P>
565 </DD>
566
567 <DT><B><A NAME="LRParsing">LR Parsing</A></B></DT>
568 <DD> An LR(k) <A HREF="#Parser">parser</A> is a type of deterministic bottom-up shift-reduce parser
569 using at most k lookahead symbols to determine its next <A HREF="#ParserAction">action</A> at any point in
570 the parse. If (k) is omitted, then k is assumed to be 1. Discussion of the
571 technical details of LR(k) parsing is beyond the scope of this manual. The
572 interested reader may consult any of a variety of works covering the subject.
573 <P>
574 AnaGram produces parsers which employ a variant of LR parsing known as LALR
575 parsing. It is not necessary to understand the details of LR and LALR parsing
576 theory in order to use AnaGram.
577 </P>
578 <P></P>
579 </DD>
580
581 <DT><B><A NAME="MarkedRule">Marked Rule</A></B></DT>
582 <DD> A "marked rule" is a <A HREF="#GrammarRule">grammar rule</A>
583 together with a "marked token"
584 that indicates how much of the rule has already been matched and what input
585 should be expected, starting with the marked token, if the remainder of the rule is to be matched. Each <A HREF="#Parser">parser</A>
586 state is defined by a small number of <A HREF="#CharacteristicRule">characteristic rules</A> which indicate what
587 matches have already been made and what input can be expected. Note that when a
588 state has more than one characteristic rule, any two characteristic rules with
589 the same number of <A HREF="#Token">tokens</A> to the left of the marked token
590 match identically up to the
591 marked token. If one rule has fewer tokens to the left than the other, all of its tokens
592 preceding the marked token match the corresponding tokens of the longer rule
593 exactly.
594 <P>
595 When marked rules are displayed in AnaGram windows, the marked token is displayed in
596 a distinctive font which the developer can select. The default is italic.
597 </P>
598 <P></P>
599 </DD>
600
601 <DT><B><A NAME="NonAssociative">Non-associative</A></B></DT>
602 <DD> A <A HREF="#BinaryOperator">binary operator</A>, say #, is said to be non-associative if the sequence x
603 # y # z is not permissible. If this sequence is to be interpreted as (x#y)#z,
604 the operator is said to be <B>left associative</B>. If the sequence is to be
605 interpreted as x#(y#z), the operator is said to be <B>right associative</B>. (See
606 <A HREF="#Associativity">associativity</A>.)
607 <P></P>
608 </DD>
609
610 <DT><B><A NAME="Nonterminal">Nonterminal Token</A></B></DT>
611 <DD> A "nonterminal token" is a <A HREF="#Token">token</A> which results from reducing the
612 sequence of tokens which match a <A HREF="#GrammarRule">grammar rule</A> and replacing them with the
613 appropriate <A HREF="#ReductionToken">reduction token</A>. Nonterminal tokens are to be distinguished from
614 <A HREF="#Terminal">terminal tokens</A> or input tokens.
615 <P></P>
616 </DD>
617
618 <DT><B><A NAME="NullProduction">Null Production</A></B></DT>
619 <DD> A "null production" is one that has no <A HREF="#Token">tokens</A> on the right side
620 whatsoever. Null <A HREF="#Production">production</A>s essentially are identified by the first following
621 input token. Null productions are extremely convenient syntactic elements when
622 you wish to make some input optional. For example, suppose that you wish to
623 allow an optional semicolon at some point in your <A HREF = "#Grammar">grammar</A>. You could write the
624 following pair of productions:
625 <PRE>
626 optional semicolon -&gt; | ';'
627 </PRE> Note that a null production can never follow a '|'. The above could also
628 be written on multiple lines thus:
629 <PRE>
630 optional semicolon
631 -&gt;
632 -&gt; ';'
633 </PRE> You can always rewrite your grammar to eliminate null productions if you
634 wish, but you usually pay a price in conciseness and clarity. Sometimes,
635 however, it is necessary to do such a rewrite in order to avoid <A HREF="#Conflict">conflicts</A>, to
636 which null productions are especially prone.
637 <P>
638 If you have a null production with no <A HREF="#ReductionProcedure">reduction procedure</A> specified, your <A HREF="#Parser">parser</A>
639 will automatically assign the value zero to the <A HREF="#ReductionToken">reduction token</A>.
640 </P><P>
641 Null productions can be generated by <A HREF="#VirtualProduction">virtual productions</A>.
642 </P>
643 <P></P>
644 </DD>
645
646 <DT><B><A NAME="ParameterAssignment">Parameter Assignment</A></B></DT>
647
648 <DD>In any <A HREF="#GrammarRule"> grammar rule</A>, the
649 <A HREF="#SemanticValue">semantic value</A> of any
650 <A HREF="#RuleElement">rule element</A> may be passed to a
651 <A HREF="#ReductionProcedure">reduction procedure</A>
652 by means of a parameter assignment. In AnaGram this is particularly
653 convenient since you can use an ordinary variable name for the
654 parameter. Simply follow the
655 rule element with a colon and a C variable name. The C
656 variable name can then be used in the reduction
657 procedure to reference the semantic value of the token
658 it is attached to. AnaGram will automatically provide
659 necessary declarations.
660
661 Here are some examples of rule elements with parameter
662 assignments:
663 <PRE>
664 '0-9':d
665 integer:n
666 expression:x
667 declaration : declaration_descriptor
668 </PRE>
669 </DD>
670
671 <DT><B><A NAME="Parser">Parser</A></B></DT>
672 <DD> A parser is a program, or more likely a procedure within a program, which
673 scans a sequence of input characters or input <A HREF="#Token">tokens</A> and accumulates them in an
674 input buffer or stack as determined by a set of <A HREF="#Production">productions</A> which constitute a
675 <A HREF = "#Grammar">grammar</A>. When the parser discovers a sequence of tokens as defined by a <A HREF="#GrammarRule">grammar
676 rule</A>, or right side of a production, it "reduces" the sequence to a
677 single <A HREF="#ReductionToken">reduction token</A> as defined by the left side of the <A HREF="#GrammarRule">grammar rule</A>. This
678 <A HREF="#Nonterminal">nonterminal token</A> now replaces the tokens which matched the grammar rule and the
679 search for matches continues. If an input token is encountered which will not
680 yield a match for any rule, it is considered a syntax error and some kind of
681 error recovery may be required to continue. If a match, or reduction action,
682 yields the grammar token, sometimes called the goal token or start token, the
683 parser deems its work complete and returns to whatever procedure may have called
684 it.
685 <P>
686 Tokens may have <A HREF="#SemanticValue">semantic values</A>. If the
687 input values configuration switch is on, your parser will expect
688 semantic values to be provided by the input process along with the
689 token identification code. If the input values switch is off, your
690 parser will take the ascii value of the input character, that is, the
691 actual input code, as the value of the character. When the parser <A
692 HREF="#ReduceAction">reduces</A> a <A
693 HREF="#Production">production</A>, it can call a <A
694 HREF="#ReductionProcedure">reduction procedure</A> or <A
695 HREF="#SemanticAction">semantic action</A> to analyze the values of
696 the constituent tokens. This reduction procedure can then return a
697 value which characterizes the reduced token.
698 </P>
699 <P></P>
700 </DD>
701
702 <DT><B><A NAME="ParserGenerator">Parser Generator</A></B></DT>
703 <DD>
704 A parser generator, such as AnaGram, is a program that
705 converts a <A HREF="#Grammar">grammar</A>, a rule-based description of the
706 input to a program, into a conventional, procedural
707 module called a <A HREF="#Parser">parser</A>. The parsers AnaGram generates
708 are simple C modules which can be compiled on almost
709 any platform. AnaGram parsers are also compatible with
710 C++.
711 <P></P>
712 </DD>
713
714 <DT><B><A NAME="ParserStateStack">Parser State Stack</A></B></DT>
715 <DD>
716 The parser state stack is a stack maintained by an AnaGram
717 parser and which is an integral part of the parsing
718 process. At any point in the parse of your input
719 stream, the parser state stack provides a list of the
720 states which were traversed in order to arrive at the
721 current state.
722
723 <P></P>
724 </DD>
725
726
727 <DT><B><A NAME="ParserValueStack">Parser Value Stack</A></B></DT>
728 <DD>In parallel with the <A HREF="#ParserStateStack">parser state stack</A>,
729 an AnaGram <A HREF="#Parser"> parser</A>
730 maintains a <I>value stack</I> each entry of
731 which corresponds to the <A HREF="#SemanticValue">semantic value</A> of the token
732 identified at that state. Since the semantic values of
733 different tokens might well have different data types,
734 AnaGram gives you the opportunity, in your syntax
735 file, to define the data type for any token. AnaGram
736 then builds a typedef statement creating a data type
737 which is a union of the all the types you have defined.
738 AnaGram creates the name for this data type by
739 appending "_vs_type" to the name of the parser. AnaGram uses
740 this data type to define the value stack.
741 <P></P>
742 </DD>
743
744
745 <DT><B><A NAME="ParsingEngine">Parsing Engine</A></B></DT>
746 <DD> A <A HREF="#Parser">parser</A> consists of three basic components: A set of syntax tables, a set
747 of <A HREF="#ReductionProcedure">reduction procedures</A> and a parsing engine. The parsing engine is the body of
748 code that interprets the parsing table, invokes input functions, and calls the
749 reduction procedures. The Build Parser command configures a parsing engine
750 according to the implicit requirements of the syntax specification and according
751 to the explicit values of the configuration parameters.
752 <P>
753 The parsing engine itself is a simple automaton, characterized by a set of
754 states and a set of inputs. The inputs are the <A HREF="#Token">tokens</A> of your <A HREF = "#Grammar">grammar</A>. Each
755 state is represented by a list of tokens which are admissible in that state and
756 for each token an <A HREF="#ParserAction">action</A> to perform and a parameter which further defines the
757 action. In a traditional <A HREF="#LALRParsing">LALR parser</A>, there are only four <A HREF="#ParserAction">actions</A>: the <A HREF="#ShiftAction">shift
758 action</A>, the <A HREF="#ReduceAction">reduce action</A>, the <A HREF="#AcceptAction">accept action</A> and the <A HREF="#ErrorAction">error action</A>. AnaGram, in
759 doing its <A HREF="#GrammarAnalysis">grammar analysis</A>, identifies a number of special cases, and creates a
760 number of extra actions which make for faster processing, but which can be
761 represented as combinations of these primitive actions.
762 </P><P>
763 When a shift action is performed, the current state number is pushed onto the
764 parser state stack and the new state number is determined by the current state
765 number and the current lookahead token. Different tokens cause different new
766 states.
767 </P><P>
768 When a reduce action is performed, the length of the rule being reduced is
769 subtracted from the parser stack index and the new state number is read from the
770 top of the parser state stack. The <A HREF="#ReductionToken">reduction token</A> for the rule being reduced is
771 then used as an input token.
772 </P><P>
773 Each state in the grammar, with the exception of state zero, has a
774 <A HREF="#CharacteristicToken">characteristic token</A> which must have been recognized in order to jump to that
775 state. Therefore, the parser state stack, which is essentially a list of state
776 numbers, can also be thought of as a list of token numbers. This is the list of
777 tokens that have been seen so far in the parse of your input stream. Some of
778 these tokens, of course, may be <A HREF="#Nonterminal">nonterminal tokens</A> which have resulted from
779 reducing other sequences of tokens.
780 </P>
781 <P></P>
782 </DD>
783
784 <DT><B><A NAME="SetPartition">Partition</A></B></DT>
785 <DD> If you use <A HREF="#CharacterSets">character sets</A> in your <A HREF = "#Grammar">grammar</A>, AnaGram will compute a "partition"
786 of the <A HREF="#Universe">character universe</A>. This partition is a collection of non-overlapping
787 character sets such that every one of the sets you have defined can be written
788 as a <A HREF="#SetUnion">union</A> of partition sets. Each partition set is identified by a unique
789 reference number called the partition set number.
790 <P>
791 Each partition set is assigned a unique <A HREF="#Token">token</A>. If one of your character sets
792 requires more than one partition set to represent it, AnaGram will create
793 appropriate <A HREF="#Production">productions</A> and add them to your grammar so your <A HREF="#Parser">parser</A> can make the
794 necessary distinctions.
795 </P>
796 <P></P>
797 </DD>
798
799 <DT><B><A NAME="Precedence">Precedence</A></B></DT>
800 <DD> "Precedence" is an attribute of <A HREF="#BinaryOperator">binary operators</A> and
801 <A HREF = "#UnaryOperator">unary
802 operators</A> which can be used to resolve <A HREF="#Conflict">conflicts</A>. Operator precedence is also
803 referred to as <A HREF = "#Binding">binding</A>. Suppose that # and @ are two binary operators. The
804 question is how to interpret x # y @ z. If # has higher precedence than @, it is
805 interpreted as (x#y)@z. On the other hand if # has lower precedence than @, it
806 would be x#(y@z). If # and @ have the same precedence then the question must be
807 resolved by <A HREF="#Associativity">associativity</A>.
808 <P>
809 Note that all operators at the same precedence level must have the same
810 associativity.
811 </P><P>
812 The situation is somewhat simpler for unary operators. If # and @ were both
813 prefix operators, or both suffix operators, there would be no ambiguity in
814 interpretation, since neither # @ x and x # @ offer more than one possible
815 interpretation. The only difficulty arises when both a prefix and a suffix
816 operator are applied to the same operand.
817 </P><P>
818 Suppose that # is a prefix operator and @ is a suffix operator. The question
819 then is how to interpret # x @. If # has higher precedence than @, it would be
820 interpreted as (# x) @. On the other hand, if # has lower precedence than @, it
821 would be interpreted as # (x @). If # and @ have the same precedence then the
822 question must be resolved by associativity.
823 </P>
824 <P></P>
825 </DD>
826
827 <DT><B><A NAME="Production">Production</A></B></DT>
828 <DD> Productions are the mechanism used to describe how complex input
829 structures are built up out of simpler ones. Each production has a left side and
830 a right side. The right side, or <A HREF="#GrammarRule">grammar rule</A>,
831 is a sequence of <A HREF="#RuleElement">rule elements</A>,
832 which may represent either <A HREF="#Terminal">terminal tokens</A> or
833 <A HREF="#Nonterminal">nonterminal tokens</A>. The left side
834 is a list of <A HREF="#ReductionToken">reduction tokens</A>.
835 In most cases there would be only a single
836 reduction token. Productions with more than one <A HREF="#Token">token</A>
837 on the left side are
838 called <A HREF = "#SemanticallyDetermined">semantically determined productions</A>.
839 <p>The
840 "<CODE>-&gt;</CODE>" symbol is used
841 to separate the left side from the right side.
842
843 If you have several
844 productions with the same left hand side, you can avoid
845 rewriting the left hand side either by using '|' or by
846 using another "->".
847 <p>
848
849 A <A HREF="#NullProduction"> null production</A>, or empty right hand side, cannot
850 follow a '|'.
851 <P>
852 Productions may be written thus:
853 <pre>
854 name
855 -> letter
856 -> name, digit
857 </pre>
858 This could also be written
859 <pre>
860 name -> letter | name, digit
861 </pre>
862 In order to accommodate semantic analysis of the data,
863 you may attach to any grammar rule a <A HREF="#ReductionProcedure">reduction
864 procedure</A> which will be executed when the rule is
865 identified. Each token may have a <A HREF="#SemanticValue"> semantic value</A>. By
866 using <A HREF="#ParameterAssignment"> parameter assignments</A>, you may provide the
867 reduction procedure with access to the semantic values of
868 tokens that comprise the grammar rule. When it finishes, the
869 reduction procedure may return a value which will be
870 saved on the <A HREF="#ParserValueStack"> parser value stack</A> as the semantic value of the
871 <A HREF="#ReductionToken">reduction token</A>.
872
873 <P></P>
874 </DD>
875
876 <DT><B><A NAME="ReduceAction">Reduce Action</A></B></DT>
877 <DD> The reduce action is one of the four <A HREF="#ParserAction">actions</A> of a traditional parsing
878 engine. The reduce action is performed when the <A HREF="#Parser">parser</A> has succeeded in matching
879 all the elements of a <A HREF="#GrammarRule">grammar rule</A> and the next input <A HREF="#Token">token</A> is not erroneous.
880 Reducing the grammar rule amounts to subtracting the length of the rule from the
881 parser stack index, identifying the <A HREF="#ReductionToken">reduction token</A>, stacking its
882 <A HREF="#SemanticValue">semantic value</A>
883 and then doing a <A HREF="#ShiftAction">shift action</A> with the reduction token as though it had been
884 input directly.
885 <P></P>
886 </DD>
887
888 <DT><B><A NAME="ReduceReduce">Reduce-Reduce Conflict</A></B></DT>
889 <DD> A <A HREF = "#Grammar">grammar</A> has a "reduce-reduce" conflict at some state if a
890 single <A HREF="#Token">token</A> turns out to be a <A HREF="#ReducingToken">reducing token</A> for more than one <A HREF="#CompletedRule">completed rule</A>.
891 <P></P>
892 </DD>
893
894 <DT><B><A NAME="ReducingToken">Reducing Token</A></B></DT>
895 <DD> In a state with more than one <A HREF="#CompletedRule">completed rule</A>, your <A HREF="#Parser">parser</A> must be able to
896 determine which one was actually found. AnaGram deals with this problem by
897 looking at all the states the parser will branch to once each rule is reduced.
898 The acceptable input <A HREF="#Token">tokens</A> for those states are the "reducing tokens"
899 for the completed rules in the state under investigation. If there is a single
900 token which is a reducing token for more than one rule, then the <A HREF = "#Grammar">grammar</A> is said
901 to have a <A HREF="#ReduceReduce">reduce-reduce conflict</A> at that state. If in a particular state there
902 is both a <A HREF="#ShiftAction">shift action</A> and a <A HREF="#ReduceAction">reduce action</A> for the same token the grammar is
903 said to have a <A HREF="#ShiftReduce">shift-reduce conflict</A> in that state.
904 <P>
905 A reducing token is not the same as a <A HREF="#ReductionToken">reduction token</A>.
906 </P>
907 <P></P>
908 </DD>
909
910 <DT><B><A NAME="ReductionProcedure">Reduction Procedure</A></B></DT>
911 <DD> A "reduction procedure" is a function you write which your
912 <A HREF="#Parser">parser</A> executes when it has identified the <A HREF="#GrammarRule">grammar rule</A> to which the reduction
913 procedure is attached in your <A HREF = "#Grammar">grammar</A>. There are two formats for reduction
914 procedures, depending on the size and complexity of the procedure.
915 <P></P>
916 </DD>
917
918 <DT><B><A NAME="ReductionToken">Reduction Token</A></B></DT>
919 <DD> A <A HREF="#Token">token</A> which appears on the left side of a <A HREF="#Production">production</A> is called a
920 reduction token. It is so called because when the <A HREF="#GrammarRule">grammar rule</A> on the right side
921 of the production is matched in the input stream, your <A HREF="#Parser">parser</A> will <A HREF="#ReduceAction">"reduce"</A>
922 the sequence of tokens which matches the rule by replacing the sequence of
923 tokens with the reduction token. Note that, if more than one reduction token is
924 specified, your <A HREF="#ReductionProcedure">reduction procedure</A> should choose the exact one. If it does not,
925 your parser will use the leftmost syntactically correct one in the list as the default.
926 <P>
927 A reduction token is not the same as a <A HREF="#ReducingToken">reducing token</A>.
928 </P>
929 <P></P>
930 </DD>
931
932 <DT><B><A NAME="Resynchronization">Resynchronization</A></B></DT>
933 <DD> "Resynchronization" is the process of getting your <A HREF="#Parser">parser</A> back
934 in step with its input after encountering a syntax error. As such, it is one
935 method of error recovery. Of course, you would resynchronize only if it
936 is necessary to continue after the error. There are several options available
937 when using AnaGram. You could use the auto resynch option, which causes AnaGram
938 to incorporate an automatic resynchronizing procedure into your parser, or you
939 could use the error resynch option, which is similar to the technique used by
940 YACC programmers.
941 </DD>
942
943 <P></P>
944 <DT><B><A NAME="RuleElement">Rule Element</A></B></DT>
945 <DD> A <A HREF="#GrammarRule">grammar rule</A> is a list of "rule elements", separated by
946 commas. Rule elements may be <A HREF="#Token">token</A> names, <A HREF="#CharacterSets">character sets</A>, keywords, immediate
947 actions, or <A HREF="#VirtualProduction">virtual productions</A>. When AnaGram encounters a rule element for
948 which no token presently exists, it creates one.
949 <P></P>
950 </DD>
951
952 <DT><B><A NAME="SemanticAction">Semantic Action</A></B></DT>
953 <DD>
954 A semantic action is a piece of C or C++ code that is executed when
955 a <A HREF="#GrammarRule">grammar rule</A> has been identified by a
956 <A HREF="#Parser">parser</A>. Semantic actions are also called
957 <A HREF="#ReductionProcedure">reduction procedures</A>, since they
958 are executed when a grammar rule is <A
959 HREF="#ReduceAction">reduced</A> to the token on the left side of the
960 <A HREF="#Production">production.</A>
961 <P></P>
962 </DD>
963
964 <DT><B><A NAME="SemanticValue">Semantic Value</A></B></DT>
965 <DD>
966 <A HREF="#Token">Tokens</A>, whether <A HREF="#Terminal">terminal</A>
967 or <A HREF="#Nonterminal">nonterminal</A>, may have a semantic value.
968 In the case of terminal tokens, this may be a value assigned by the
969 <A HREF="#LexicalScanner">lexical scanner</A> or, if the parser is
970 using direct character input, it will be the ascii value of the
971 character itself. The values of nonterminal tokens are created by
972 <A HREF="#ReductionProcedure">reduction procedures</A>. As a parse
973 progresses, token values are <A HREF="#ShiftAction">shifted</A>
974 onto the stack, so that in a reduction procedure, the values of the
975 tokens that comprise the <A HREF="#GrammarRule">grammar rule</A> that
976 is being <A HREF="#ReduceAction">reduced</A> are available
977 for inspection.
978 <P></P>
979 </DD>
980
981 <DT><B><A NAME="SemanticallyDetermined">Semantically Determined Production</A></B></DT>
982 <DD> A "semantically determined production" is one which has more
983 than one <A HREF="#ReductionToken">reduction token</A> specified on the left side of the production. You would
984 write such a <A HREF="#Production">production</A> when the reduction tokens are syntactically
985 indistinguishable. The <A HREF="#ReductionProcedure">reduction procedure</A> may then specify which of the listed
986 reduction tokens the <A HREF="#GrammarRule">grammar rule</A> is to reduce to based on semantic
987 considerations. If there is no reduction procedure, or the reduction procedure
988 does not specify a reduction token, the <A HREF="#Parser">parser</A> will use the leftmost syntactically correct token.
989
990 <P></P>
991 </DD>
992
993 <DT><B><A NAME="SetExpression">Set Expression</A></B></DT>
994 <DD> A set expression is an algebraic expression used to define a <A HREF="#CharacterSets">character set</A>
995 in terms of individual characters, ranges of characters, and/or other sets of
996 characters as constructed using <A HREF="#SetComplement">complements</A>, <A HREF="#SetUnion">unions</A>, <A HREF="#SetIntersection">intersections</A>, and
997 <A HREF="#SetDifference">differences</A>.
998 <P></P>
999 </DD>
1000
1001 <DT><B><A NAME="ShiftAction">Shift Action</A></B></DT>
1002 <DD> The shift action is one of the four <A HREF="#ParserAction">actions</A> of a traditional parsing
1003 engine. The shift action is performed when the input <A HREF="#Token">token</A> matches one of the
1004 acceptable input tokens for the current parser state. The
1005 <A HREF="#SemanticValue">semantic value</A> of the
1006 token and the current state number are stacked, the parser stack index is
1007 incremented and the state number is set to a value determined by the previous
1008 state and the input token.
1009 <P></P>
1010 </DD>
1011
1012 <DT><B><A NAME="ShiftReduce">Shift-Reduce Conflict</A></B></DT>
1013 <DD> A "shift-reduce" conflict occurs if in some state there is a
1014 single <A HREF="#Token">token</A> that should be shifted, because it is legitimate input for one of
1015 the rules of the state, but should also be used to reduce some other rule
1016 because it is a <A HREF="#ReducingToken">reducing token</A> for that rule.
1017 <P></P>
1018 </DD>
1019 <DT><B><A NAME="SyntaxDirectedParsing">Syntax Directed Parsing</A></B></DT>
1020 <DD>
1021 <P>Syntax directed parsing, or formal parsing, is an
1022 approach to building <A HREF="#Parser">parsers</A> based on formal language
1023 theory. Given a suitable description of a language,
1024 called a <A HREF="#Grammar">grammar</A>, there are algorithms which can be
1025 used to create parsers for the language automatically.
1026 In this context, the set of all possible inputs to a
1027 program may be considered to constitute a language, and
1028 the rules for formulating the input to the program
1029 constitute the grammar for the language.
1030
1031 <P>The parsers built from a grammar have the advantage
1032 that they can recognize any input that conforms to the
1033 rules, and can reject as erroneous any input that fails
1034 to conform.
1035
1036 <P>Since the program logic necessary to parse input is
1037 often extremely intricate, programs which use formal
1038 parsing are usually much more reliable than those built
1039 by hand. They are also much easier to maintain, since
1040 it is much easier to modify a grammar specification
1041 than it is to modify complex program logic.
1042
1043 <P></P>
1044 </DD>
1045
1046 <DT><B><A NAME="Terminal">Terminal Token</A></B></DT>
1047 <DD> A "terminal token" is a <A HREF="#Token">token</A> which does not appear on the left
1048 side of a <A HREF="#Production">production</A>. It represents, therefore, a basic unit of input to your
1049 <A HREF="#Parser">parser</A>. If the input to your parser consists of ascii characters, you may define
1050 terminal tokens explicitly as ascii characters or as <A HREF="#CharacterSets">sets of ascii characters</A>.
1051 If you have an input procedure which produces numeric codes, you may define the
1052 terminal tokens directly in terms of these numeric codes.
1053 <P></P>
1054 </DD>
1055
1056 <DT><B><A NAME="Token">Token</A></B></DT>
1057 <DD> Tokens are the units with which your <A HREF="#Parser">parser</A> works. There are two kinds of
1058 tokens: <A HREF="#Terminal">terminal tokens</A> and <A HREF="#Nonterminal">nonterminal tokens</A>. These latter are identified by
1059 the parser as sequences of tokens. The grouping of tokens into more complex
1060 tokens is governed by the <A HREF="#GrammarRule">grammar rules</A>, or <A HREF="#Production">productions</A> in your <A HREF = "#Grammar">grammar</A>. In your
1061 grammar, tokens may be denoted by token names, by <A HREF="#VirtualProduction">virtual productions</A>, by
1062 immediate actions, by explicit character representations, or by expressions
1063 which yield <A HREF="#CharacterSets">character sets</A>.
1064 <P>
1065 A "token" should be thought of more or less as being something like a
1066 cloakroom token. Imagine you had taken a kindergarten class to the museum and
1067 all the kids checked their coats. Each one gets a token. What is the probability
1068 of losing one? You take all of the tokens from the children, put them in a paper
1069 bag, tie it up, and check it. You get one token in return. We would say that the
1070 kids' coats represent the actual input to the system, their individual tokens
1071 are the <A HREF="#Terminal">terminal tokens</A>, and your one token which enables you to redeem the
1072 paper bag is a <A HREF="#Nonterminal">nonterminal token</A>. Nonterminal tokens are just single token
1073 numbers to represent the result of compacting a sequence of more primitive
1074 tokens.
1075 </P><P>
1076 Actually, the word "token" is commonly used to refer to a number of
1077 apparently different things. The tokens you use in a grammar are abstract. The
1078 tokens identified by a parser are concrete instances of the abstract tokens in
1079 the grammar. Furthermore, we have to identify tokens in some manner, so we have
1080 token names and token numbers with which to refer to particular tokens. As a
1081 result the word "token", in any particular context, may refer directly
1082 to either an abstract token, or a concrete instance of an abstract token, or may
1083 refer only indirectly to the token by means of a token name or token number.
1084 </P><P>
1085 As an example, consider a C compiler which has a lexical scanner to identify
1086 input tokens. According to Kernighan and Ritchie, there are six classes of C
1087 tokens: identifiers, keywords, constants, string literals, operators, and other
1088 separators. Consider a particular token, a hexadecimal constant, 0XF00. When the
1089 lexical scanner hands this over to the parser, it has to provide an
1090 identification code which says what kind of token it is, a hexadecimal constant,
1091 as well as the
1092 <A HREF="#SemanticValue">"semantic value"</A> of the token,
1093 3840. The identification code is usually an integer value,
1094 determined by the designer of the lexical scanner, which by agreement
1095 with the designer of the parser uniquely represents a hexadecimal
1096 constant. This identification code can be thought of as an external
1097 token number.
1098
1099 </P><P>
1100 The grammar for the parser, on the other hand, has a token, "HEXconstant",
1101 let us say, to represent the abstract usage of hexadecimal constants in the C
1102 language. When AnaGram, or any other parser generator for that matter, analyzes
1103 the grammar, it assigns its own reference number to identify "HEXconstant".
1104 This reference number can be thought of as an internal token number to contrast
1105 it with the external token number provided by the lexical scanner. The parsers
1106 AnaGram builds contain a table, ag_ tcv, which provides a translation from the
1107 external to the internal token number. Both of these token numbers, of course,
1108 differ from the semantic value of the token, in this case, 3840.
1109 </P><P>
1110 Even when an AnaGram parser is using simple character input, the word "token"
1111 may represent several different things. Suppose a grammar has a token 'A-Z',
1112 which represents the set of upper case letters. There will be an internal token
1113 number, assigned by AnaGram, which identifies the (abstract) token in parser
1114 tables. The actual (concrete) input token to a parser will, however, be a single
1115 letter, say "Q". In this case, the ascii value of the character "Q",
1116 81, is used both as the external token number and as the semantic value of the
1117 token.
1118 </P><P></P>
1119 </DD>
1120
1121 <DT><B><A NAME="UnaryOperator">Unary Operator</A></B></DT>
1122 <DD>A "unary operator" is a <A HREF="#Token">token</A> which represents
1123 some sort of operation which operates on a single entity to produce a new
1124 resultant entity. It is normally represented by a symbol which is written either
1125 preceding the operand or following the operand. In the first case it is called a
1126 "prefix operator" and in the second case a "suffix operator".
1127 </P><P></P>
1128 </DD>
1129
1130 <DT><B><A NAME="SetUnion">Union</A></B></DT>
1131 <DD>
1132 The union of two sets is the set of all elements that are to be found in one or
1133 another of the two sets. In an AnaGram syntax file the union of two <A HREF="#CharacterSets">character
1134 sets</A> A and B is represented using the plus sign, as in A + B. The union operator
1135 has the same <A HREF="#Precedence">precedence</A> as the <A HREF="#SetDifference">difference</A> operator: lower than that of
1136 <A HREF="#SetIntersection">intersection</A> and <A HREF="#SetComplement">complement</A>. The union operator is <A HREF="#Associativity">left associative</A>.
1137 <P></P>
1138 </DD>
1139
1140 <DT><B><A NAME="VirtualProduction">Virtual Production</A></B></DT>
1141 <DD>
1142 <P>Virtual productions are a special short hand
1143 representation of <A HREF="#GrammarRule">grammar rules</A> which can be used to
1144 indicate a choice of inputs. They are an important
1145 convenience, especially useful when you are first
1146 building a <A HREF="#Grammar">grammar</A>.
1147
1148 <P>AnaGram rewrites virtual productions, so that when you
1149 look at the syntax tables in AnaGram, there will be
1150 actual <A HREF="#Production">productions</A> replacing the virtual productions.
1151
1152 <P>A virtual production appears as one of the <A HREF="#RuleElement">rule
1153 elements</A> in a <A HREF="#GrammarRule">grammar rule</A>, i.e. as one of the members
1154 of the list on the right side of a production.
1155
1156 <P>The simplest virtual production is the "optional"
1157 <A HREF="#Token">token</A>. If x is an arbitrary token, x? can be used to
1158 indicate an optional x.
1159
1160 <P>Related virtual productions are x... and x?... where
1161 the three dots indicate repetition. x... represents an
1162 arbitrary number of occurrences of x, but at least one.
1163 x?... represents zero or more occurrences of x.
1164
1165 <P>The remaining virtual productions use curly or square
1166 brackets to enclose a sequence of rules. The brackets
1167 may be followed variously by nothing, a string of three
1168 dots, or a slash, to indicate the choices to be made
1169 from the rules. Note that rules may be used, not merely
1170 tokens.
1171
1172 <P>If r1 through rn are a set of grammar rules, then
1173 <PRE> {r1 | r2 | ... | rn} </PRE>
1174 is a virtual production that allows a choice of exactly
1175 one of the rules. Similarly,
1176 <PRE> {r1 | r2 | ... | rn}... </PRE>
1177 is a virtual production that allows a choice of one or
1178 more of the rules. And, finally,
1179 <PRE> {r1 | r2 | ... | rn}/... </PRE>
1180 is a virtual production that allows a choice of one or
1181 more of the rules subject to the side condition that
1182 rules must alternate, that is, that no rule can follow
1183 itself immediately without the interposition of some
1184 other rule. This is a case that is not particularly
1185 easy to write by hand, but is quite useful in a number
1186 of contexts.
1187
1188 <P>If the above virtual productions are written with [ ]
1189 instead of { }, they all become optional. [ ] is an
1190 optional choice, [ ]... is zero or more choices, and
1191 [ ]/... is zero or more alternating choices.
1192
1193 <P><A HREF="#NullProduction">Null productions</A> are not permitted in virtual
1194 productions in those cases where they would cause an
1195 intrinsic ambiguity.
1196
1197 <P>You may use a definition statement to assign a name to
1198 a virtual production.
1199
1200 </DD>
1201
1202 </DL>
1203
1204
1205 <P>
1206 <BR>
1207
1208 <IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------"
1209 WIDTH=1010 HEIGHT=2 >
1210 <P>
1211 <IMG ALIGN="right" SRC="images/pslrb6d.gif" ALT="Parsifal Software"
1212 WIDTH=181 HEIGHT=25>
1213 <BR CLEAR="right">
1214 Back to <A HREF="index.html">Index</A>
1215 <P>
1216 <ADDRESS><FONT SIZE="-1">
1217 AnaGram parser generator - documentation<BR>
1218 Glossary<BR>
1219 Copyright &copy; 1993-1999, Parsifal Software. <BR>
1220 All Rights Reserved.<BR>
1221 </FONT></ADDRESS>
1222
1223 </BODY>
1224 </HTML>
1225
1226