comparison doc/manual/gl.tex @ 0:13d2b8934445

Import AnaGram (near-)release tree into Mercurial.
author David A. Holland
date Sat, 22 Dec 2007 17:52:45 -0500
1 \chapter{Glossary}
2
3 % XXX Add:
4 % parameter assignment (?)
5 % parser (state) stack
6 % parser value stack
7 % (may be defined in some other version of the glossary)
8 % (and appear in the help)
9 % (there are apparently dangling links to these in some version)
10
11 \index{Action}
12 \index{Parser action}
13 \index{Parsing engine}
14 \paragraph{Action.}
15 A ``parser action'' is one of the basic executable elements of a
16 parsing engine. In a traditional parsing engine there are four
17 actions: the shift action, the reduce action, the accept action, and
18 the error action. The parsing engines which AnaGram generates use a
19 substantially greater number of actions, in order to provide certain
20 short cuts in the interest of greater speed and greater compactness of
21 the tables which control the action of the parsing engine.
22
23 \paragraph{Accept Action.}
24 The accept action is one of the four actions of a traditional parsing
25 engine. The accept action is performed when the parser has succeeded
26 in identifying the goal token for the grammar. When the parser
27 executes the accept action, it returns to the calling program.
28 % , returning the semantic value of the goal token as its value.
29 The accept action is thus the last action of the parsing engine and
30 occurs only once for each successful execution of the parser.
31
32 \paragraph{Associativity.}
33 This is a term used to describe binary operators (q.v.). It describes
34 how you interpret a sequence of operations which all involve the same
35 operator. Consider $a - b - c$, for instance. Here the convention is
36 that we subtract $b$ from $a$, and then $c$ from the difference. We
37 say that subtraction is left associative, since if there is a sequence
38 of subtraction operations, the leftmost one is to be performed
39 first. As a second example, consider exponentiation. FORTRAN uses
40 \agcode{**} as an operator to indicate exponentiation. In order to
41 conform as closely as possible to ordinary mathematical notation, the
42 expression \agcode{a**b**c} is interpreted to mean that first
43 \agcode{b} is raised to the power \agcode{c}, and then the resultant
44 value is used as the power to which \agcode{a} is raised. We say that
45 exponentiation is right associative since the rightmost of a sequence
46 of exponentiation operations is to be performed first. If a
47 programming language does not allow for unparenthesized sequences of a
48 particular operator, the operator is said to be non-associative.
49
50 Another way to view left and right associativity is as implied
51 parenthesization. Left associative operators are parenthesized from
52 the left. Right associative operators are parenthesized from the
53 right. Thus $a - b - c = ((a-b) - c)$ and \agcode{a**b**c = (a**(b**c))}.
54
55 AnaGram offers two ways to specify associativity of binary
56 operators. The first way is to use conventional Backus-Naur Form
57 syntax descriptions. You would use recursion to describe both
58 subtraction and exponentiation. Subtraction, since it is left
59 associative, uses left recursion, and exponentiation, being right
60 associative, uses right recursion. Thus
61 \begin{indentingcode}{0.4in}
62 difference -> value | difference, '-', value
63 exponential -> value | value, "**", exponential
64 \end{indentingcode}
65 could be used to describe differences and exponents.
66
67 The second way to specify associativity is to use an ambiguous grammar
68 and precedence declarations. (See Chapter 9.)
69
70 \paragraph{Backtracking.}
71 In order to make parse tables more compact and parsers faster, it is
72 common to use default reductions. In case of error, it is necessary to
73 undo default reductions before diagnostics can be properly
74 determined. In AnaGram, this undo operation is called backtracking.
75
76 \index{Backus-Naur form}
77 \index{BNF}
78 \paragraph{Backus-Naur Form.}
79 Backus-Naur Form, or BNF, is a conventional notation for describing
80 context free grammars. AnaGram uses an extended notation for grammars,
81 which, except for semantically determined productions, can be shown to
82 be equivalent to BNF. The term ``BNF'' is often used colloquially to
83 refer to any grammar specification.
85
86 AnaGram's syntax specification language differs from BNF in the
87 following respects:
88
89 \begin{enumerate}
90
91 \item
92 In conventional BNF, a symbol not enclosed in angle brackets ($< >$)
93 is taken to represent itself literally. In AnaGram, literal character
94 representations must be enclosed in single quotes and literal strings
95 within double quotes.
96
97 \item
98 In BNF, the right side of a production is simply a list of symbols
99 without any delimiters or punctuation. AnaGram requires that the rule
100 elements which make up a grammar rule, or right side, be joined by
101 commas.
102
103 \item
104 BNF makes no provision for identifying reduction procedures or their
105 arguments. AnaGram provides both reduction procedures, introduced by
106 an ``\agcode{=}'' at the end of the production, and named arguments,
107 introduced by a ``\agcode{:}'' at the end of any token on the right
108 side of the production.
109
110 \item
111 AnaGram allows virtual productions to be used freely.
112
113 \item
114 BNF is ``pure'' in that if you wish to define a nonterminal token
115 called ``digit'' to represent the digits from zero to nine, you must
116 provide ten explicit productions to define it. AnaGram treats the
117 concept of ``terminal token'' as used in language theory as an
118 abstraction, and interposes character set functionality between actual
119 character input and the terminal tokens of BNF. You can define digit
120 to be the character range \agcode{'0-9'}, for instance, and AnaGram
121 will determine and define precisely the minimum number of productions
122 necessary, taking into account any other usage of the characters
123 \agcode{'0'} through \agcode{'9'} in your grammar. This makes your
124 grammar more compact, more manageable, and easier to modify.
125
126 \item
127 AnaGram allows for semantically determined productions, which provide
128 a significant increment in the ease with which semantic analysis may
129 be joined to syntactic analysis.
130
131 \end{enumerate}
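Points 2 and 3 above can be illustrated with a small sketch in AnaGram
notation. The token names and the C code in the reduction procedures
here are purely illustrative:

\begin{indentingcode}{0.4in}
(int) difference
 -> value:v = v;
 -> difference:d, '-', value:v = d - v;
\end{indentingcode}

The rule elements are joined by commas, the named arguments are
introduced by ``\agcode{:}'', and the C expression following
``\agcode{=}'' computes the semantic value of the reduction token.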
132
133 \paragraph{Binary operator.}
134 A binary operator is an operator that works on two operands to create
135 a result. It is often written between its operands and is sometimes
136 called an infix operator. In ordinary programming languages
137 ``\agcode{+}'' and ``\agcode{-}'' are binary operators.
138
139 \paragraph{Binding.}
140 Most programming languages, rather than executing arithmetic operators
141 in simple left to right order, conform to the traditional conventions
142 of ordinary algebra, which dictate that, except where parentheses
143 indicate otherwise, exponentiation is done before multiplication,
144 multiplication before addition, and addition before comparison. One
145 says that exponentiation is ``more tightly binding'' than
146 multiplication, multiplication is more tightly binding than addition,
147 and so on. The sense of the word here is that the operator binds
148 together its operands into a single entity. An operand which falls
149 between two different operators is ``bound'' by the more tightly
150 binding operator. An operator that is more tightly binding than
151 another is also said to have higher precedence.
152
153 \index{Character sets}
154 \paragraph{Character sets.}
155 One of the traditional problems with syntax directed programming is
156 that caused by the fact that most formal language specifications have
157 productions of the form
158 \begin{indentingcode}{0.4in}
159 letter -> a | b | c | d ... | z
160 \end{indentingcode}
161
162 Since the letters are not often distinguished in the grammar, this
163 large number of essentially identical productions causes a
164 correspondingly large number of states in the parser. This problem is
165 often attacked by using a ``lexical scanner'' which simply specifies a
166 ``letter token'' whose value indicates precisely which letter is
167 intended. This works fine as long as there is nothing in the grammar
168 which distinguishes any letter at all. If there is, and there is
169 usually some special use of \agcode{h}, \agcode{x}, \agcode{q},
170 \agcode{o}, \agcode{e}, or even \agcode{a-f}, the situation gets more
171 complicated. Often the result is to abandon the use of syntax directed
172 programming for elemental units of the language and use a more complex
173 lexical scanner which identifies names, numbers, key words and
174 operators. Such lexical scanners are often built using conventional
175 programming languages or simple pattern recognizing languages such as
176 \agfile{lex}.
177
178 AnaGram avoids this problem by incorporating the notion of character
179 set into its input specifications. Briefly, the way it works is the
180 following: You specify the set of characters which make up any
181 ostensibly terminal token. AnaGram then analyzes the overlaps among
182 these definitions and creates a minimal set of tokens which match your
183 specifications exactly. For instance, suppose you have the following
184 definitions:
185
186 \begin{indentingcode}{0.4in}
187 letter = 'a-z' + 'A-Z'
188 hex digit = '0-9' + 'a-f' + 'A-F'
189 \end{indentingcode}
190
191 AnaGram will define a token for letter which has two productions:
192
193 \begin{indentingcode}{0.4in}
194 letter -> 'a-f' + 'A-F' | 'g-z' + 'G-Z'
195 \end{indentingcode}
196
197 With this technique, you have the freedom to specify your grammar in
198 the easiest possible manner, without the penalty of an absurdly large,
199 slow parser.
200
201 Character sets may be specified in terms of ranges of characters, as
202 in the above example, as unions, denoted by ``\agcode{+}'', as
203 intersections, denoted by ``\agcode{\&}'', as differences, denoted by
204 ``\agcode{-}'', and as complements, using ``\agcode{\~{}}''. You may
205 also specify a set consisting of a single character either with a
206 literal character or with a numeric specification. If you specify a
207 character numerically, you may use either decimal, octal or
208 hexadecimal notation, as in C.
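For instance, the following definitions use ranges, unions, a
difference, and a decimal numeric specification. The set names are
illustrative only, and previously defined names are assumed, as
elsewhere in this chapter, to be usable in later definitions:

\begin{indentingcode}{0.4in}
letter = 'a-z' + 'A-Z'
digit = '0-9'
octal digit = '0-7'
nonoctal digit = digit - octal digit
tab = 9
\end{indentingcode}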
209
210 \index{Character universe}
211 \index{Universe}
212 \paragraph{Character Universe.}
213 In ordinary set theory, the ``universe'' is the set of all entities
214 which may be elements of a set. It must be defined so that the
215 complement of a set can mean something unambiguous. In AnaGram, the
216 character universe normally comprises the ASCII characters and the IBM
217 extended character set, that is, all characters in the range from 0
218 through 255. If, however, you have defined input characters outside
219 this range, the character universe will be extended to the least
220 character you have used and to the greatest character you have used in
221 your grammar.
222 % XXX this should mention character sets, and it should mention that
223 % it's limited to character sets of size 65536.
224
225 \index{Characteristic rules}
226 \index{Rule}
227 \paragraph{Characteristic Rule.}
228 Each parser state is characterized by a particular set of grammar
229 rules, and for each such rule, a marked token which is the next token
230 expected. These rules are called the characteristic rules for the
231 state. In the course of doing grammar analysis, AnaGram determines the
232 characteristic rules for each parser state. After analyzing your
233 grammar, you may inspect the \agwindow{State Definition Table} to see
234 the characteristic rules for any state in your parser.
235
236 \index{Characteristic token}
237 \index{Token}
238 \paragraph{Characteristic Token.}
239 Every state in a parser, except state zero, can be characterized by
240 the one unique token which causes a jump to that state. That token is
241 called the characteristic token of the state, because to get to that
242 parser state you must have just seen precisely that token in the
243 input. Note that several states could have the same characteristic
244 token.
246
247 When you have a list of states, such as is given by the parser state
248 stack, it is equivalent to a list of characteristic tokens. This list
249 of tokens is the list of tokens that have been recognized so far by
250 the parser. Some of these tokens, of course, may be nonterminal tokens
251 and may thus represent the result of reducing a sequence of previously
252 recognized tokens.
253
254 \index{Complement}
255 \paragraph{Complement of a set.}
256 In set theory, the complement of a set, $S$, is the collection of all
257 elements of the character universe which are not elements of $S$. Using
258 AnaGram's notation, the complement of \agcode{S} is written
259 \agcode{\~{}S}. The complement operator has higher precedence than the
260 difference, intersection, or union operators.
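Since the complement operator binds more tightly than the difference
operator, the last definition in this sketch is equivalent to
\agcode{(\~{}letter) - '0-9'}. The set names are illustrative only:

\begin{indentingcode}{0.4in}
letter = 'a-z' + 'A-Z'
nonletter = ~letter
separator = ~letter - '0-9'
\end{indentingcode}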
261
262 \index{Rule}
263 \index{Completed rule}
264 \paragraph{Completed Rule.}
265 A ``completed rule'' is a characteristic rule whose index is pointing
266 beyond the last rule element. In other words, it has been completely
267 matched and will be reduced by the next input.
268
269 If a state has more than one completed rule, the decision as to which
270 to reduce is made based on the next input token. If there is only one
271 completed rule in a state, it will be reduced by default unless the
272 default reductions switch has been reset, i.e., turned off.
273
274 \index{Conflicts}
275 \paragraph{Conflict.}
276 Conflicts arise during a grammar analysis when AnaGram cannot
277 determine how to treat a given input token. There are two sorts of
278 conflicts: \agterm{shift-reduce} conflicts and \agterm{reduce-reduce}
279 conflicts. Conflicts may arise either because the grammar is
280 inherently ambiguous, or simply because the grammar analyzer cannot
281 look far enough ahead to resolve the conflict. In the latter case, it
282 is often possible to rewrite the grammar in such a way as to eliminate
283 the conflict. Null productions are a common source of conflict.
284
285 There are a number of ways to deal with conflicts. If you understand
286 the conflict well, you may simply choose to ignore it. When AnaGram
287 encounters a shift-reduce conflict while building parse tables it
288 resolves it by choosing the shift action. When AnaGram encounters a
289 reduce-reduce conflict while building parse tables, it resolves it by
290 selecting the grammar rule which occurred first in the grammar.
291
292 A second way to deal with conflicts is to set operator precedence
293 parameters. If you set these parameters, AnaGram will use them
294 preferentially to resolve conflicts. Any conflicts so resolved will be
295 listed in the Resolved Conflicts window.
296
297 A third way to resolve a conflict is to declare some tokens as
298 \agparam{sticky} or as \agparam{subgrammar}s. This is particularly
299 useful for productions whose sole purpose is to skip over
300 uninteresting input.
301
302 The fourth way to deal with conflicts is to rewrite the grammar to
303 eliminate them. Many people prefer this approach since it yields the
304 highest level of confidence in the resulting program.
305
306 \index{Context free grammar}
307 \paragraph{Context Free Grammars.}
308 Context free grammars are grammars wherein the definition of a grammar
309 unit, or nonterminal token, does not depend on the context in which
310 the nonterminal is used. That means that if you define ``widget'' in a
311 certain way, that definition is valid in all contexts in which
312 ``widget'' might occur. Context free grammars can be represented in
313 Backus-Naur Form. AnaGram's syntax specification language has no
314 facility for representing grammars which are not context free.
315
316 \index{Default reductions}
317 \index{Reduction}
318 \paragraph{Default reductions.}
319 A ``default reduction'' is a parser action which may be used in your
320 parser in any state which has precisely one completed rule.
321
322 If a given parser state has, among its characteristic rules, exactly
323 one completed rule, it is usually faster to reduce it on any input
324 than to check specifically for correct input before reducing it. The
325 only time this default reduction causes trouble is in the event of a
326 syntax error. In this situation you may get an erroneous reduction.
327 Normally when you are parsing a file, this is inconsequential because
328 you are not going to continue semantic action in the presence of
329 error. But, if you are using your parser to handle real-time
330 interactive input, you have to be able to continue semantic processing
331 after notifying your user that he has entered erroneous input. In this
332 case you would want \agparam{default reductions} to have been turned
333 off so that productions are reduced only when there is correct input.
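Schematically, the switch is a configuration parameter; the statement
form shown here is an assumption for illustration, and the exact
syntax of configuration parameters is described elsewhere in this
manual:

\begin{indentingcode}{0.4in}
default reductions = off
\end{indentingcode}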
334
335 \index{Difference}
336 % XXX: \index{Set difference} ?
337 \paragraph{Difference of two sets.}
338 In set theory, the difference of two sets, $A$ and $B$, is defined to
339 be the set of all elements of $A$ that are not elements of $B$. In an
340 AnaGram syntax file, you represent the difference of two character
341 sets by using the ``\agcode{-}'' operator. Thus the difference of
342 \agcode{A} and \agcode{B} is \agcode{A - B}. The difference operator
343 is left associative.
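Left associativity means that an unparenthesized chain of differences
is grouped from the left, so the two expressions below denote the same
set, namely \agcode{'f-w'}:

\begin{indentingcode}{0.4in}
'a-z' - 'a-e' - 'x-z'
('a-z' - 'a-e') - 'x-z'
\end{indentingcode}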
344
345 \paragraph{Error Action.}
346 The error action is one of the four actions of a traditional parsing
347 engine. The error action is performed when the parser has encountered
348 an input token which is not admissible in the current state. The
349 further behavior of a traditional parser is undefined.
350
351 \index{Expansion rule}
352 \index{Rule}
353 \paragraph{Expansion Rule.}
354 In analyzing a grammar, we are often interested in the full range of
355 input that can be expected at a certain point. The expansion of a
356 token or state shows us all the expected input. An expansion yields a
357 set of marked rules. Looking at the marked token shows us what input
358 to expect.
359
360 The set of expansion rules of a (nonterminal) token shows all the
361 expected input that can occur whenever the token appears in the
362 grammar. The set consists of all the grammar rules produced by the
363 token, plus all the rules produced by the first token of any rule in
364 the set. The marked tokens for the expansion rules of a token are
365 always the first element in the rule.
366
367 The expansion of a state consists of its characteristic rules plus the
368 expansion rules of the marked token in each characteristic rule.
369
370 \index{Grammar}
371 \paragraph{Grammar.}
372 Traditionally, a grammar is a set of productions which taken together
373 specify precisely a set of acceptable input sequences in terms of an
374 abstract set of terminal tokens. The set of acceptable input sequences
375 is often called the ``language'' defined by the grammar.
376
377 In AnaGram, the term grammar also includes configuration sections,
378 definitions of character sets, virtual productions, etc. that augment
379 the collection of productions. A grammar is often called a
380 ``syntax'', and the rules of the grammar are often called syntactic
381 rules.
382
384 \paragraph{Grammar Analysis.}
385 The major function of AnaGram is the analysis of context free grammars
386 written in a particular variant of Backus-Naur Form.
387
388 The analysis of a grammar proceeds in four stages. In the first, the
389 syntax file is read and parsed, and a number of tables are built which
390 describe all of the productions and components of the grammar.
391
392 In the second stage, AnaGram analyzes all of the character sets
393 defined in the grammar, and where necessary, defines auxiliary tokens
394 and productions.
395
396 In the third stage, AnaGram identifies all of the states of the parser
397 and builds the go-to table for the parser.
398
399 In the fourth stage, AnaGram identifies reduction tokens for each
400 completed grammar rule in each state and checks for conflicts.
401
402 \index{Grammar rule}
403 \paragraph{Grammar Rule.}
404 A ``grammar rule'' is the right side of a production. It consists of a
405 sequence of rule elements. Each rule element identifies some token,
406 which can be either a terminal token or nonterminal token. It is
407 ``matched'' by a corresponding sequence of tokens in the input stream
408 to the parser. The left side of the production identifies one or more
409 nonterminal tokens, or reduction tokens, to which the rule reduces
410 when matched. If there is more than one reduction token, there should
411 be a reduction procedure to make the choice.
412
413 \index{Grammar token}
414 \index{Token}
415 \paragraph{Grammar Token.}
416 The grammar token is the token which represents the ``top level'' in
417 your grammar. Some people refer to it as the ``goal'' or ``goal
418 token'' and others as the ``start token''. Whatever it is called, it
419 is the single token which describes the complete input to your parser.
420
421 \index{Intersection}
422 \paragraph{Intersection.}
423 In set theory, the intersection of two sets, $A$ and $B$, is defined
424 to be the set of all elements of $A$ which are also elements of $B$.
425 In an AnaGram syntax file, the intersection of two character sets is
426 represented with the ``\agcode{\&}'' operator. The intersection
427 operator has lower precedence than the complement operator, but higher
428 precedence than the union and difference operators. The intersection
429 operator is left associative.
430
431 \index{LALR parsing}
432 \paragraph{LALR parsing.}
433 LALR, or ``lookahead LR'' parsing, is a somewhat less powerful variant
434 of LR parsing which produces much smaller parsing tables. Thus an LALR
435 parser has very significant advantages in size over its LR
436 counterpart. It is possible to construct grammars which can be parsed
437 by an LR parser but cannot be parsed by an LALR parser. That is, the
438 LALR parser will find these grammars to be ambiguous and the LR parser
439 will not. In practice, however, such grammars seem to be rare and the
440 ambiguities can often be eliminated by minor revisions.
441
442 AnaGram generates LALR parsers and, in addition, uses LALR parsing to
443 interpret syntax files.
444
445 \index{Lexical scanner}
446 \paragraph{Lexical Scanner.}
447 A lexical scanner is a program or procedure used as a preprocessor for
448 a parser. It scans input characters and lumps them together into
449 tokens. Many systems for generating parsers, most notably
450 \agfile{yacc}, can scarcely be used without a lexical scanner.
451
452 AnaGram, although it can support the use of a lexical scanner, does
453 not need a lexical scanner for most practical problems. AnaGram avoids
454 the need for a lexical scanner by using character sets and
455 \agparam{disregard} and \agparam{lexeme} statements.
456
457 If your problem does in fact need a lexical scanner, you can use
458 AnaGram itself to write it, so you don't need to know different
459 languages for the scanner and for the parser.
460
461 \index{Lookahead token}
462 \index{Token}
463 \paragraph{Lookahead Token.}
464 The current input token to a parser is often called the ``lookahead''
465 token. In many situations, it is not shifted into the input buffer
466 immediately, but acts like a catalyst to cause a number of rules to be
467 reduced before it is eventually shifted in.
468
469 \index{LR parsing}
470 \paragraph{LR Parsing.}
471 An LR($k$) parser is a type of deterministic bottom-up shift-reduce
472 parser using at most $k$ lookahead symbols to determine its next
473 action at any point in the parse. If ($k$) is omitted, then $k$ is
474 assumed to be 1. Discussion of the technical details of LR($k$)
475 parsing is beyond the scope of this manual. The interested reader may
476 consult any of a variety of works covering the subject\footnote{See,
477 for example, Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman.
478 \textit{Compilers: Principles, Techniques, and Tools}. (Reading,
479 Massachusetts: Addison-Wesley, 1988).}.
480
481 AnaGram produces parsers which employ a variant of LR parsing known as
482 LALR parsing. It is not necessary to understand the details of LR and
483 LALR parsing theory in order to use AnaGram.
484
485 \index{Marked rule}
486 \index{Rule}
487 \paragraph{Marked Rule.}
488 A ``marked rule'' is a grammar rule together with a ``marked token''
489 that indicates how much of the rule has already been matched and what
490 input should be expected if the remainder of the rule is to be
491 matched. Each parser state is defined by a small number of
492 characteristic rules which indicate what matches have already been
493 made and what input can be expected. Note that when a state has more
494 than one characteristic rule, any two characteristic rules with the
495 same number of tokens match identically up to the marked token. Even
496 if one has fewer tokens to the left than the other, its tokens match
497 the other exactly up to the marked token.
498
499 When marked rules are displayed in AnaGram windows, the marked token
500 is displayed in a distinctive font which the developer can select.
501
502 \paragraph{Non-associative.}
503 A binary operator, say \agcode{\#}, is said to be non-associative if
504 the sequence \agcode{x \# y \# z} is not permissible. If this sequence
505 is to be interpreted as \agcode{(x \# y) \# z}, the operator is said
506 to be left associative. If the sequence is to be interpreted as
507 \agcode{x \# (y \# z)}, the operator is said to be right
508 associative. (See \agterm{associativity}.)
509
510 \index{Nonterminal token}
511 \index{Token}
512 \paragraph{Nonterminal Token.}
513 A ``nonterminal token'' is a token which results from reducing the
514 sequence of tokens which match a grammar rule and replacing them with
515 the appropriate reduction token. Nonterminal tokens are to be
516 distinguished from terminal tokens or input tokens.
517
518 \index{Null productions}
519 \index{Production}
520 \paragraph{Null Production.}
521 A ``null production'' is one that has no tokens on the right side
522 whatsoever. Null productions are, in effect, identified by the first
523 following input token. Null productions are extremely convenient
524 syntactic elements when you wish to make some input optional. For
525 example, suppose that you wish to allow an optional semicolon at some
526 point in your grammar. You could write the following pair of
527 productions:
528
529 \begin{indentingcode}{0.4in}
530 optional semicolon -> | ';'
531 \end{indentingcode}
532
533 Note that a null production can never follow a ``\agcode{|}''.
534
535 The above could also be written on multiple lines thus:
536
537 \begin{indentingcode}{0.4in}
538 optional semicolon
539 ->
540 -> ';'
541 \end{indentingcode}
542
543 You can always rewrite your grammar to eliminate null productions if
544 you wish, but you usually pay a price in conciseness and
545 clarity. Sometimes, however, it is necessary to do such a rewrite in
546 order to avoid conflicts, to which null productions are especially
547 prone.
548
549 If you have a null production with no reduction procedure specified,
550 your parser will automatically assign the value zero to the reduction
551 token.
552
553 Null productions can be generated by virtual productions.
554
555 \index{Parser}
556 \paragraph{Parser.}
557 A parser is a program, or more likely a procedure within a program,
558 which scans a sequence of input characters or input tokens and
559 accumulates them in an input buffer or stack as determined by a set of
560 productions which constitute a grammar. When the parser discovers a
561 sequence of tokens as defined by a grammar rule, or right side of a
562 production, it ``reduces'' the sequence to a single reduction token as
563 defined by the left side of the grammar rule. This nonterminal token
564 now replaces the tokens which matched the grammar rule and the search
565 for matches continues. If an input token is encountered which will not
566 yield a match for any rule, it is considered a syntax error and some
567 kind of error recovery may be required to continue. If a match, or
568 reduction action, yields the grammar token, sometimes called the goal
569 token or start token, the parser deems its work complete and returns
570 to whatever procedure may have called it.
571
572 Tokens may have semantic values. If the \agparam{input values}
573 configuration switch is on, your parser will expect semantic values to be provided
574 by the input process along with the token identification code. If the
575 \agparam{input values} switch is off, your parser will take the ASCII
576 value of the input character, that is, the actual input code, as the
577 value of the character. When the parser reduces a production, it can
578 call a reduction procedure or semantic action to analyze the values of
579 the constituent tokens. This reduction procedure can then return a
580 value which characterizes the reduced token.
581
582 \paragraph{Parser Generator.}
583 A parser generator, such as AnaGram, is a program that converts a
584 grammar, a rule-based description of the input to a program, into a
585 conventional, procedural module called a parser. The parsers AnaGram
586 generates are simple C modules which can be compiled on almost any
587 platform. AnaGram parsers are also compatible with C++.
588
589 \index{Parsing engine}
590 \paragraph{Parsing Engine.}
591 A parser consists of three basic components: A set of syntax tables, a
592 set of reduction procedures and a parsing engine. The parsing engine
593 is the body of code that interprets the parsing table, invokes input
594 functions, and calls the reduction procedures. The \agmenu{Build
595 Parser} command configures a parsing engine according to the implicit
596 requirements of the syntax specification and according to the explicit
597 values of the configuration parameters.
598
599 The parsing engine itself is a simple automaton, characterized by a
600 set of states and a set of inputs. The inputs are the tokens of your
601 grammar. Each state is represented by a list of tokens which are
602 admissible in that state and for each token an action to perform and a
603 parameter which further defines the action. In a traditional LR
604 parser, there are only four actions: the shift action, the reduce
605 action, the accept action and the error action. AnaGram, in doing its
606 grammar analysis, identifies a number of special cases, and creates a
607 number of extra actions which make for faster processing, but which
608 can be represented as combinations of these primitive actions.
609
610 When a shift action is performed, the current state number is pushed
611 onto the parser state stack and the new state number is determined by
612 the current state number and the current lookahead token. Different
613 tokens cause different new states.
614
615 When a reduce action is performed, the length of the rule being
616 reduced is subtracted from the parser stack index and the new state
617 number is read from the top of the parser state stack. The reduction
618 token for the rule being reduced is then used as an input token.
619
Each state in the grammar, with the exception of state zero, has a
characteristic token which must have been recognized in order to jump
to that state. Therefore, the parser state stack, which is essentially
a list of state numbers, can also be thought of as a list of token
numbers. This is the list of tokens that have been seen so far in the
parse of your input stream. Some of these tokens, of course, may be
nonterminal tokens which have resulted from reducing other sequences
of tokens.

\index{Partition}
\paragraph{Partition.}
If you use character sets in your grammar, AnaGram will compute a
``partition'' of the character universe. This partition is a
collection of non-overlapping character sets such that every one of
the sets you have defined can be written as a union of partition
sets. Each partition set is identified by a unique reference number
called the partition set number.

Each partition set is assigned a unique token. If one of your
character sets requires more than one partition set to represent it,
AnaGram will create appropriate productions and add them to your
grammar so your parser can make the necessary distinctions.

\index{Precedence}
\paragraph{Precedence.}
``Precedence'' is an attribute of binary operators and unary operators
which can be used to resolve conflicts. Operator precedence is also
referred to as \agterm{binding}. Suppose that \agcode{\#} and
\agcode{@} are two binary operators. The question is how to interpret
\agcode{x \# y @ z}. If \agcode{\#} has higher precedence than
\agcode{@}, it is interpreted as \agcode{(x \# y) @ z}. On the other
hand, if \agcode{\#} has lower precedence than \agcode{@}, it is
interpreted as \agcode{x \# (y @ z)}. If \agcode{\#} and \agcode{@}
have the same precedence then the question must be resolved by
associativity.

Note that all operators at the same precedence level must have the
same associativity.

The situation is somewhat simpler for unary operators. If
\agcode{\#} and \agcode{@} were both
prefix operators, or both suffix operators, there would be no
ambiguity in interpretation, since neither \agcode{\# @ x} nor
\agcode{x \# @} offers more than one possible interpretation. The only
difficulty arises when both a prefix and a suffix operator are applied
to the same operand.

Suppose that \agcode{\#} is a prefix operator and \agcode{@} is a
suffix operator. The question then is how to interpret \agcode{\# x @}.
If \agcode{\#} has higher precedence than \agcode{@}, it will be
interpreted as \agcode{(\# x) @}. On the other hand, if \agcode{\#}
has lower precedence than \agcode{@}, it will be interpreted as
\agcode{\# (x @)}. If \agcode{\#} and \agcode{@} have the same
precedence then the question must be resolved by associativity.

\index{Production}
\paragraph{Production.}
Productions are the mechanism used to describe how complex input
structures are built up out of simpler ones. Each production has a
left side and a right side. The right side, or grammar rule, is a
sequence of rule elements, which may represent either terminal tokens
or nonterminal tokens. The left side is a list of reduction tokens. In
most cases there is only a single reduction token. Productions
with more than one token on the left side are called semantically
determined productions. The ``\agcode{->}'' symbol is used to separate
the left side from the right side.

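For example, a token for sums might be defined by a pair of
productions along the following lines (the token names are invented;
the precise notation is described elsewhere in this manual):

```
sum
 -> term
 -> sum, '+', term
```

Here \agcode{sum} is the reduction token for both grammar rules, and
each rule lists its elements separated by commas.
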
\paragraph{Reduce Action.}
The reduce action is one of the four actions of a traditional parsing
engine. The reduce action is performed when the parser has succeeded
in matching all the elements of a grammar rule and the next input
token is not erroneous. Reducing the grammar rule amounts to
subtracting the length of the rule from the parser stack index,
identifying the reduction token, stacking its semantic value and then
doing a shift action with the reduction token as though it had been
input directly.

\index{Conflicts}
\index{Reduce-reduce conflicts}
\paragraph{Reduce-Reduce Conflict.}
A grammar has a ``reduce-reduce'' conflict at some state if a single
token turns out to be a reducing token for more than one completed
rule. For example, if in some state an identifier could complete
either a rule for a variable name or a rule for a function name, and
the same input token could follow either, the parser has no way to
decide which rule to reduce.

\index{Reducing token}
\index{Token}
\paragraph{Reducing Token.}
In a state with more than one completed rule, your parser must be able
to determine which one was actually found. AnaGram deals with this
problem by looking at all the states the parser will branch to once
each rule is reduced. The acceptable input tokens for those states are
the ``reducing tokens'' for the completed rules in the state under
investigation. If there is a single token which is a reducing token
for more than one rule, then the grammar is said to have a
reduce-reduce conflict at that state. If in a particular state there
is both a shift action and a reduce action for the same token, the
grammar is said to have a shift-reduce conflict in that state.

A reducing token is not the same as a reduction token.

\index{Reduction procedure}
\paragraph{Reduction Procedure.}
A ``reduction procedure'' is a function you write which your parser
executes when it has identified the grammar rule to which the
reduction procedure is attached in your grammar. There are two formats
for reduction procedures, depending on the size and complexity of the
procedure.

\index{Reduction token}
\index{Token}
\paragraph{Reduction Token.}
A token which appears on the left side of a production is called a
reduction token. It is so called because when the grammar rule on the
right side of the production is matched in the input stream, your
parser will ``reduce'' the sequence of tokens which matches the rule
by replacing the sequence of tokens with the reduction token. Note
that, if more than one reduction token is specified, your reduction
procedure should choose the exact one. If it does not, your parser
will use the leftmost syntactically correct one in the list as the
default.

A reduction token is not the same as a reducing token.

\index{Resynchronization}
\paragraph{Resynchronization.}
``Resynchronization'' is the process of getting your parser back in
step with its input after encountering a syntax error. As such, it is
one method of error recovery. Of course, you would resynchronize only
if it is necessary to continue after the error. There are several
options available when using AnaGram. You could use the \agparam{auto
resynch} option, which causes AnaGram to incorporate an automatic
resynchronizing procedure into your parser, or you could use the
\agparam{error token} option, which is similar to the technique used
by \agfile{yacc} programmers.

\index{Rule elements}
\paragraph{Rule Element.}
A grammar rule is a list of ``rule elements'', separated by commas.
Rule elements may be token names, character sets, keywords, immediate
actions, or virtual productions. When AnaGram encounters a rule
element for which no token presently exists, it creates one.

\index{Semantic action}
\index{Action}
\paragraph{Semantic Action.}
A semantic action is a piece of C or C++ code that is executed when a
grammar rule has been identified by a parser. Semantic actions are
also called reduction procedures, since they are executed when a
grammar rule is reduced to the token on the left side of the
production.

\index{Semantic value}
\index{Value}
\paragraph{Semantic Value.}
Tokens, whether terminal or nonterminal, may have a semantic value. In
the case of terminal tokens, this may be a value assigned by the
lexical scanner or, if the parser is using direct character input, it
will be the ASCII value of the character itself. The values of
nonterminal tokens are created by reduction procedures. As a parse
progresses, token values are shifted onto the stack, so that in a
reduction procedure, the values of the tokens that make up the
grammar rule being reduced are available for inspection.

\index{Semantically determined production}
\index{Production}
\paragraph{Semantically Determined Production.}
A ``semantically determined production'' is one which has more than
one reduction token specified on the left side of the production. You
would write such a production when the reduction tokens are
syntactically indistinguishable. The reduction procedure may then
specify which of the listed reduction tokens the grammar rule is to
reduce to based on semantic considerations. If there is no reduction
procedure, or the reduction procedure does not specify a reduction
token, the parser will use the leftmost syntactically correct
reduction token.

\index{Set expressions}
\paragraph{Set Expression.}
A set expression is an algebraic expression used to define a character
set in terms of individual characters, ranges of characters, and/or
other sets of characters, combined using complements, unions,
intersections, and differences.

\paragraph{Shift Action.}
The shift action is one of the four actions of a traditional parsing
engine. The shift action is performed when the input token matches one
of the acceptable input tokens for the current parser state. The
semantic value of the token and the current state number are stacked,
the parser stack index is incremented, and the state number is set to
a value determined by the previous state and the input token.

\index{Shift-reduce conflict}
\index{Conflicts}
\paragraph{Shift-Reduce Conflict.}
A ``shift-reduce'' conflict occurs if in some state there is a single
token that should be shifted, because it is legitimate input for one
of the rules of the state, but should also be used to reduce some
other rule because it is a reducing token for that rule. The classic
example is the ``dangling else'': when an \agcode{else} appears after
a nested \agcode{if} statement, it could be shifted to continue the
inner \agcode{if}, or the inner \agcode{if} could be reduced so that
the \agcode{else} attaches to the outer one.

\index{Parsing}
\index{Syntax directed parsing}
\paragraph{Syntax Directed Parsing.}
Syntax directed parsing, or formal parsing, is an approach to building
parsers based on formal language theory. Given a suitable description
of a language, called a grammar, there are algorithms which can be
used to create parsers for the language automatically. In this
context, the set of all possible inputs to a program may be considered
to constitute a language, and the rules for formulating the input to
the program constitute the grammar for the language.

The parsers built from a grammar have the advantage that they can
recognize any input that conforms to the rules, and can reject as
erroneous any input that fails to conform.

Since the program logic necessary to parse input is often extremely
intricate, programs which use formal parsing are usually much more
reliable than those built by hand. They are also much easier to
maintain, since it is much easier to modify a grammar specification
than it is to modify complex program logic.

\index{Terminal token}
\index{Token}
\paragraph{Terminal Token.}
A ``terminal token'' is a token which does not appear on the left side
of a production. It represents, therefore, a basic unit of input to
your parser. If the input to your parser consists of ASCII
characters, you may define terminal tokens explicitly as ASCII
characters or as sets of ASCII characters. If you have an input
procedure which produces numeric codes, you may define the terminal
tokens directly in terms of these numeric codes.

\index{Token}
\paragraph{Token.}
Tokens are the units with which your parser works. There are two kinds
of tokens: terminal tokens and nonterminal tokens. These latter are
identified by the parser as sequences of tokens. The grouping of
tokens into more complex tokens is governed by the grammar rules, or
productions, in your grammar. In your grammar, tokens may be denoted
by token names, by virtual productions, by immediate actions, by
explicit character representations, or by expressions which yield
character sets.

A ``token'' should be thought of more or less as being something like
a cloakroom token. Imagine you had taken a kindergarten class to the
museum and all the kids checked their coats. Each one gets a token.
What is the probability of losing one? You take all of the tokens
from the children, put them in a paper bag, tie it up, and check
it. You get one token in return. We would say that the kids' coats
represent the actual input to the system, their individual tokens are
the terminal tokens, and your one token which enables you to redeem
the paper bag is a nonterminal token. Nonterminal tokens are just
single token numbers to represent the result of compacting a sequence
of more primitive tokens.

Actually, the word ``token'' is commonly used to refer to a number of
apparently different things. The tokens you use in a grammar are
abstract. The tokens identified by a parser are concrete instances of
the abstract tokens in the grammar. Furthermore, we have to identify
tokens in some manner, so we have token names and token numbers with
which to refer to particular tokens. As a result the word ``token'',
in any particular context, may refer directly to either an abstract
token, or a concrete instance of an abstract token, or may refer only
indirectly to the token by means of a token name or token number.

As an example, consider a C compiler which has a lexical scanner to
identify input tokens. According to Kernighan and
Ritchie\footnote{Brian W. Kernighan and Dennis M. Ritchie. \textit{The
C Programming Language}, 2nd ed. (Englewood Cliffs, New Jersey:
Prentice-Hall, 1988), p. 191.}, there are six classes of C tokens:
identifiers, keywords, constants, string literals, operators, and
other separators. Consider a particular token, a hexadecimal constant,
\agcode{0XF00}. When the lexical scanner hands this over to the
parser, it has to provide an identification code which says what kind
of token it is, a hexadecimal constant, as well as the ``semantic
value'' of the token, 3840. The identification code is usually an
integer value, determined by the designer of the lexical scanner,
which by agreement with the designer of the parser uniquely represents
a hexadecimal constant. This identification code can be thought of as
an external token number.

The grammar for the parser, on the other hand, has a token,
\agcode{HEXconstant}, let us say, to represent the abstract usage of
hexadecimal constants in the C language. When AnaGram, or any other
parser generator for that matter, analyzes the grammar, it assigns its
own reference number to identify \agcode{HEXconstant}. This reference
number can be thought of as an internal token number to contrast it
with the external token number provided by the lexical scanner. The
parsers AnaGram builds contain a table, \agcode{ag{\us}tcv}, which
provides a translation from the external to the internal token
number. Both of these token numbers, of course, differ from the
semantic value of the token, in this case, 3840.

Even when an AnaGram parser is using simple character input, the word
``token'' may represent several different things. Suppose a grammar
has a token \agcode{'A-Z'}, which represents the set of upper case
letters. There will be an internal token number, assigned by AnaGram,
which identifies the (abstract) token in parser tables. The actual
(concrete) input token to a parser will, however, be a single letter,
say ``Q''. In this case, the ASCII value of the character ``Q'', 81,
is used both as the external token number and as the semantic value of
the token.

\paragraph{Unary Operator.}
A ``unary operator'' is a token which represents an operation on a
single operand, producing a new resultant entity. It is normally
represented by a symbol written either before the operand or after
it. In the first case it is called a ``prefix operator'' and in the
second case a ``suffix operator''.

\index{Union}
\paragraph{Union.}
The union of two sets is the set of all elements that are to be found
in one or another of the two sets. In an AnaGram syntax file the union
of two character sets \agcode{A} and \agcode{B} is represented using
the plus sign, as in \agcode{A + B}. The union operator has the same
precedence as the difference operator: lower than that of intersection
and complement. The union operator is left associative.