doc/manual/gl.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.

author: David A. Holland
date: Sat, 22 Dec 2007 17:52:45 -0500

\chapter{Glossary}

% XXX Add:
% parameter assignment (?)
% parser (state) stack
% parser value stack
% (may be defined in some other version of the glossary)
% (and appear in the help)
% (there are apparently dangling links to these in some version)

\index{Action}
\index{Parser action}
\index{Parsing engine}
\paragraph{Action.}
A ``parser action'' is one of the basic executable elements of a
parsing engine. In a traditional parsing engine there are four
actions: the shift action, the reduce action, the accept action, and
the error action. The parsing engines which AnaGram generates use a
substantially greater number of actions, in order to provide certain
shortcuts in the interest of greater speed and greater compactness of
the tables which control the action of the parsing engine.

\paragraph{Accept Action.}
The accept action is one of the four actions of a traditional parsing
engine. The accept action is performed when the parser has succeeded
in identifying the goal token for the grammar. When the parser
executes the accept action, it returns to the calling program.
% , returning the semantic value of the goal token as its value.
The accept action is thus the last action of the parsing engine and
occurs only once for each successful execution of the parser.

\paragraph{Associativity.}
This is a term used to describe binary operators (q.v.). It describes
how you interpret a sequence of operations which all involve the same
operator. Consider $a - b - c$, for instance. Here the convention is
that we subtract $b$ from $a$, and then $c$ from the difference. We
say that subtraction is left associative, since if there is a sequence
of subtraction operations, the leftmost one is to be performed
first. As a second example, consider exponentiation. FORTRAN uses
\agcode{**} as an operator to indicate exponentiation. In order to
conform as closely as possible to ordinary mathematical notation, the
expression \agcode{a**b**c} is interpreted to mean that first
\agcode{b} is raised to the power \agcode{c}, and then the resultant
value is used as the power to which \agcode{a} is raised. We say that
exponentiation is right associative since the rightmost of a sequence
of exponentiation operations is to be performed first. If a
programming language does not allow for unparenthesized sequences of a
particular operator, the operator is said to be non-associative.

Another way to view left and right associativity is as implied
parenthesization. Left associative operators are parenthesized from
the left. Right associative operators are parenthesized from the
right. Thus $a - b - c = ((a-b) - c)$ and \agcode{a**b**c = (a**(b**c))}.

AnaGram offers two ways to specify associativity of binary
operators. The first way is to use conventional Backus-Naur Form
syntax descriptions. You would use recursion to describe both
subtraction and exponentiation. Subtraction, since it is left
associative, uses left recursion, and exponentiation, being right
associative, uses right recursion. Thus
\begin{indentingcode}{0.4in}
difference -> value | difference, '-', value
exponential -> value | value, "**", exponential
\end{indentingcode}
could be used to describe differences and exponents.

The second way to specify associativity is to use an ambiguous grammar
and precedence declarations. (See Chapter 9.)

\paragraph{Backtracking.}
In order to make parse tables more compact and parsers faster, it is
common to use default reductions. In case of error, it is necessary to
undo default reductions before diagnostics can be properly
determined. In AnaGram, this undo operation is called backtracking.

\index{Backus-Naur form}
\index{BNF}
\paragraph{Backus-Naur Form.}
Backus-Naur Form, or BNF, is a conventional notation for describing
context free grammars. AnaGram uses an extended notation for grammars,
which, except for semantically determined productions, can be shown to
be equivalent to BNF. The term ``BNF'' is often used colloquially to
refer to any grammar specification.

AnaGram's syntax specification language differs from BNF in the
following respects:

\begin{enumerate}

\item
In conventional BNF, a symbol not enclosed in angle brackets ($< >$)
is taken to represent itself literally. In AnaGram, literal character
representations must be enclosed in single quotes and literal strings
within double quotes.

\item
In BNF, the right side of a production is simply a list of symbols
without any delimiters or punctuation. AnaGram requires that the rule
elements which make up a grammar rule, or right side, be joined by
commas.

\item
BNF makes no provision for identifying reduction procedures or their
arguments. AnaGram provides both reduction procedures, introduced by
an ``\agcode{=}'' at the end of the production, and named arguments,
introduced by a ``\agcode{:}'' at the end of any token on the right
side of the production.

\item
AnaGram allows virtual productions to be used freely.

\item
BNF is ``pure'' in that if you wish to define a nonterminal token
called ``digit'' to represent the digits from zero to nine, you must
provide ten explicit productions to define it. AnaGram treats the
concept of ``terminal token'' as used in language theory as an
abstraction, and interposes character set functionality between actual
character input and the terminal tokens of BNF. You can define digit
to be the character range \agcode{'0-9'}, for instance, and AnaGram
will determine and define precisely the minimum number of productions
necessary, taking into account any other usage of the characters
\agcode{'0'} through \agcode{'9'} in your grammar. This makes your
grammar more compact, more manageable, and easier to modify.

\item
AnaGram allows for semantically determined productions, which provide
a significant increment in the ease with which semantic analysis may
be joined to syntactic analysis.

\end{enumerate}

\paragraph{Binary operator.}
A binary operator is an operator that works on two operands to create
a result. It is often written between its operands and is sometimes
called an infix operator. In ordinary programming languages
``\agcode{+}'' and ``\agcode{-}'' are binary operators.

\paragraph{Binding.}
Most programming languages, rather than executing arithmetic operators
in simple left to right order, conform to the traditional conventions
of ordinary algebra, which dictate that, except where parentheses
indicate otherwise, exponentiation is done before multiplication,
multiplication before addition, and addition before comparison. One
says that exponentiation is ``more tightly binding'' than
multiplication, multiplication is more tightly binding than addition,
and so on. The sense of the word here is that the operator binds
together its operands into a single entity. An operand which falls
between two different operators is ``bound'' by the more tightly
binding operator. An operator that is more tightly binding than
another is also said to have higher precedence.

\index{Character sets}
\paragraph{Character sets.}
One of the traditional problems with syntax directed programming is
caused by the fact that most formal language specifications have
productions of the form
\begin{indentingcode}{0.4in}
letter -> a | b | c | d ... | z
\end{indentingcode}

Since the letters are not often distinguished in the grammar, this
large number of essentially identical productions causes a
correspondingly large number of states in the parser. This problem is
often attacked by using a ``lexical scanner'' which simply specifies a
``letter token'' whose value indicates precisely which letter is
intended. This works fine as long as there is nothing in the grammar
which distinguishes any letter at all. If there is, and there is
usually some special use of \agcode{h}, \agcode{x}, \agcode{q},
\agcode{o}, \agcode{e}, or even \agcode{a-f}, the situation gets more
complicated. Often the result is to abandon the use of syntax directed
programming for elemental units of the language and use a more complex
lexical scanner which identifies names, numbers, key words and
operators. Such lexical scanners are often built using conventional
programming languages or simple pattern recognizing languages such as
\agfile{lex}.

AnaGram avoids this problem by incorporating the notion of character
set into its input specifications. Briefly, the way it works is the
following: You specify the set of characters which make up any
ostensibly terminal token. AnaGram then analyzes the overlaps among
these definitions and creates a minimal set of tokens which match your
specifications exactly. For instance, suppose you have the following
definitions:

\begin{indentingcode}{0.4in}
letter = 'a-z' + 'A-Z'
hex digit = '0-9' + 'a-f' + 'A-F'
\end{indentingcode}

AnaGram will define a token for letter which has two productions:

\begin{indentingcode}{0.4in}
letter -> 'a-f' + 'A-F' | 'g-z' + 'G-Z'
\end{indentingcode}

With this technique, you have the freedom to specify your grammar in
the easiest possible manner, without the penalty of an absurdly large,
slow parser.

Character sets may be specified in terms of ranges of characters, as
in the above example, as unions, denoted by ``\agcode{+}'', as
intersections, denoted by ``\agcode{\&}'', as differences, denoted by
``\agcode{-}'', and as complements, using ``\agcode{\~{}}''. You may
also specify a set consisting of a single character either with a
literal character or with a numeric specification. If you specify a
character numerically, you may use either decimal, octal or
hexadecimal notation, as in C.

\index{Character universe}
\index{Universe}
\paragraph{Character Universe.}
In ordinary set theory, the ``universe'' is the set of all entities
which may be elements of a set. It must be defined so that the
complement of a set can mean something unambiguous. In AnaGram, the
character universe normally comprises the ASCII characters and the IBM
extended character set, that is, all characters in the range from 0
through 255. If, however, you have defined input characters outside
this range, the character universe will be extended to the least
character you have used and to the greatest character you have used in
your grammar.
% XXX this should mention character sets, and it should mention that
% it's limited to character sets of size 65536.

\index{Characteristic rules}
\index{Rule}
\paragraph{Characteristic Rule.}
Each parser state is characterized by a particular set of grammar
rules, and for each such rule, a marked token which is the next token
expected. These rules are called the characteristic rules for the
state. In the course of doing grammar analysis, AnaGram determines the
characteristic rules for each parser state. After analyzing your
grammar, you may inspect the \agwindow{State Definition Table} to see
the characteristic rules for any state in your parser.

\index{Characteristic token}
\index{Token}
\paragraph{Characteristic Token.}
Every state in a parser, except state zero, can be characterized by
the one unique token which causes a jump to that state. That token is
called the characteristic token of the state, because to get to that
parser state you must have just seen precisely that token in the
input. Note that several states could have the same characteristic
token.

When you have a list of states, such as is given by the parser state
stack, it is equivalent to a list of characteristic tokens. This list
of tokens is the list of tokens that have been recognized so far by
the parser. Some of these tokens, of course, may be nonterminal tokens
and may thus represent the result of reducing a sequence of previously
recognized tokens.

\index{Complement}
\paragraph{Complement of a set.}
In set theory, the complement of a set, $S$, is the collection of all
elements of the character universe which are not elements of $S$. Using
AnaGram's notation, the complement of \agcode{S} is written
\agcode{\~{}S}. The complement operator has higher precedence than the
difference, intersection, or union operators.

\index{Rule}
\index{Completed rule}
\paragraph{Completed Rule.}
A ``completed rule'' is a characteristic rule whose index is pointing
beyond the last rule element. In other words, it has been completely
matched and will be reduced by the next input.

If a state has more than one completed rule, the decision as to which
to reduce is made based on the next input token. If there is only one
completed rule in a state, it will be reduced by default unless the
default reductions switch has been reset, i.e., turned off.

\index{Conflicts}
\paragraph{Conflict.}
Conflicts arise during a grammar analysis when AnaGram cannot
determine how to treat a given input token. There are two sorts of
conflicts: \agterm{shift-reduce} conflicts and \agterm{reduce-reduce}
conflicts. Conflicts may arise either because the grammar is
inherently ambiguous, or simply because the grammar analyzer cannot
look far enough ahead to resolve the conflict. In the latter case, it
is often possible to rewrite the grammar in such a way as to eliminate
the conflict. Null productions are a common source of conflict.

There are a number of ways to deal with conflicts. If you understand
the conflict well, you may simply choose to ignore it. When AnaGram
encounters a shift-reduce conflict while building parse tables it
resolves it by choosing the shift action. When AnaGram encounters a
reduce-reduce conflict while building parse tables, it resolves it by
selecting the grammar rule which occurred first in the grammar.

A second way to deal with conflicts is to set operator precedence
parameters. If you set these parameters, AnaGram will use them
preferentially to resolve conflicts. Any conflicts so resolved will be
listed in the Resolved Conflicts window.

A third way to resolve a conflict is to declare some tokens as
\agparam{sticky} or as \agparam{subgrammar}s. This is particularly
useful for productions whose sole purpose is to skip over
uninteresting input.

The fourth way to deal with conflicts is to rewrite the grammar to
eliminate them. Many people prefer this approach since it yields the
highest level of confidence in the resulting program.

\index{Context free grammar}
\paragraph{Context Free Grammars.}
Context free grammars are grammars wherein the definition of a grammar
unit, or nonterminal token, does not depend on the context in which
the nonterminal is used. That means that if you define ``widget'' in a
certain way, that definition is valid in all contexts in which
``widget'' might occur. Context free grammars can be represented in
Backus-Naur Form. AnaGram's syntax specification language has no
facility for representing grammars which are not context free.

\index{Default reductions}
\index{Reduction}
\paragraph{Default reductions.}
A ``default reduction'' is a parser action which may be used in your
parser in any state which has precisely one completed rule.

If a given parser state has, among its characteristic rules, exactly
one completed rule, it is usually faster to reduce it on any input
than to check specifically for correct input before reducing it. The
only time this default reduction causes trouble is in the event of a
syntax error. In this situation you may get an erroneous reduction.
Normally when you are parsing a file, this is inconsequential because
you are not going to continue semantic action in the presence of
error. But, if you are using your parser to handle real-time
interactive input, you have to be able to continue semantic processing
after notifying your user that he has entered erroneous input. In this
case you would want \agparam{default reductions} to have been turned
off so that productions are reduced only when there is correct input.

\index{Difference}
% XXX: \index{Set difference} ?
\paragraph{Difference of two sets.}
In set theory, the difference of two sets, $A$ and $B$, is defined to
be the set of all elements of $A$ that are not elements of $B$. In an
AnaGram syntax file, you represent the difference of two character
sets by using the ``\agcode{-}'' operator. Thus the difference of
\agcode{A} and \agcode{B} is \agcode{A - B}. The difference operator
is left associative.

\paragraph{Error Action.}
The error action is one of the four actions of a traditional parsing
engine. The error action is performed when the parser has encountered
an input token which is not admissible in the current state. The
further behavior of a traditional parser is undefined.

\index{Expansion rule}
\index{Rule}
\paragraph{Expansion Rule.}
In analyzing a grammar, we are often interested in the full range of
input that can be expected at a certain point. The expansion of a
token or state shows us all the expected input. An expansion yields a
set of marked rules. Looking at the marked token shows us what input
to expect.

The set of expansion rules of a (nonterminal) token shows all the
expected input that can occur whenever the token appears in the
grammar. The set consists of all the grammar rules produced by the
token, plus all the rules produced by the first token of any rule in
the set. The marked token for each expansion rule of a token is
always the first element in the rule.

The expansion of a state consists of its characteristic rules plus the
expansion rules of the marked token in each characteristic rule.

\index{Grammar}
\paragraph{Grammar.}
Traditionally, a grammar is a set of productions which taken together
specify precisely a set of acceptable input sequences in terms of an
abstract set of terminal tokens. The set of acceptable input sequences
is often called the ``language'' defined by the grammar.

In AnaGram, the term grammar also includes configuration sections,
definitions of character sets, virtual productions, etc. that augment
the collection of productions. A grammar is often called a
``syntax'', and the rules of the grammar are often called syntactic
rules.

\paragraph{Grammar Analysis.}
The major function of AnaGram is the analysis of context free grammars
written in a particular variant of Backus-Naur Form.

The analysis of a grammar proceeds in four stages. In the first, the
input grammar is read and a number of tables are built which
describe all of the productions and components of the grammar.

In the second stage, AnaGram analyzes all of the character sets
defined in the grammar, and where necessary, defines auxiliary tokens
and productions.

In the third stage, AnaGram identifies all of the states of the parser
and builds the go-to table for the parser.

In the fourth stage, AnaGram identifies reduction tokens for each
completed grammar rule in each state and checks for conflicts.

\index{Grammar rule}
\paragraph{Grammar Rule.}
A ``grammar rule'' is the right side of a production. It consists of a
sequence of rule elements. Each rule element identifies some token,
which can be either a terminal token or a nonterminal token. It is
``matched'' by a corresponding sequence of tokens in the input stream
to the parser. The left side of the production identifies one or more
nonterminal tokens, or reduction tokens, to which the rule reduces
when matched. If there is more than one reduction token, there should
be a reduction procedure to make the choice.

\index{Grammar token}
\index{Token}
\paragraph{Grammar Token.}
The grammar token is the token which represents the ``top level'' in
your grammar. Some people refer to it as the ``goal'' or ``goal
token'' and others as the ``start token''. Whatever it is called, it
is the single token which describes the complete input to your parser.

\index{Intersection}
\paragraph{Intersection.}
In set theory, the intersection of two sets, $A$ and $B$, is defined
to be the set of all elements of $A$ which are also elements of $B$.
In an AnaGram syntax file, the intersection of two character sets is
represented with the ``\agcode{\&}'' operator. The intersection
operator has lower precedence than the complement operator, but higher
precedence than the union and difference operators. The intersection
operator is left associative.

\index{LALR parsing}
\paragraph{LALR parsing.}
LALR, or ``lookahead LR'' parsing, is a somewhat less powerful variant
of LR parsing which produces much smaller parsing tables. Thus an LALR
parser has very significant advantages in size over its LR
counterpart. It is possible to construct grammars which can be parsed
by an LR parser but cannot be parsed by an LALR parser. That is, the
LALR parser will find these grammars to be ambiguous and the LR parser
will not. In practice, however, such grammars seem to be rare and the
ambiguities can often be eliminated by minor revisions.

AnaGram generates LALR parsers and, in addition, uses LALR parsing to
interpret syntax files.

\index{Lexical scanner}
\paragraph{Lexical Scanner.}
A lexical scanner is a program or procedure used as a preprocessor for
a parser. It scans input characters and lumps them together into
tokens. Many systems for generating parsers, most notably
\agfile{yacc}, can scarcely be used without a lexical scanner.

AnaGram, although it can support the use of a lexical scanner, does
not need a lexical scanner for most practical problems. AnaGram avoids
the need for a lexical scanner by using character sets and disregard
and lexeme statements.

If your problem does in fact need a lexical scanner, you can use
AnaGram itself to write it, so you don't need to know different
languages for the scanner and for the parser.

\index{Lookahead token}
\index{Token}
\paragraph{Lookahead Token.}
The current input token to a parser is often called the ``lookahead''
token. In many situations, it is not shifted into the input buffer
immediately, but acts like a catalyst to cause a number of rules to be
reduced before it is eventually shifted in.

\index{LR parsing}
\paragraph{LR Parsing.}
An LR($k$) parser is a type of deterministic bottom-up shift-reduce
parser using at most $k$ lookahead symbols to determine its next
action at any point in the parse. If $(k)$ is omitted, then $k$ is
assumed to be 1. Discussion of the technical details of LR($k$)
parsing is beyond the scope of this manual. The interested reader may
consult any of a variety of works covering the subject\footnote{See,
for example, Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman.
\textit{Compilers: Principles, Techniques, and Tools}. (Reading,
Massachusetts: Addison-Wesley, 1988).}.

AnaGram produces parsers which employ a variant of LR parsing known as
LALR parsing. It is not necessary to understand the details of LR and
LALR parsing theory in order to use AnaGram.

\index{Marked rule}
\index{Rule}
\paragraph{Marked Rule.}
A ``marked rule'' is a grammar rule together with a ``marked token''
that indicates how much of the rule has already been matched and what
input should be expected if the remainder of the rule is to be
matched. Each parser state is defined by a small number of
characteristic rules which indicate what matches have already been
made and what input can be expected. Note that when a state has more
than one characteristic rule, any two characteristic rules with the
same number of tokens match identically up to the marked token. Even
if one has fewer tokens to the left than the other, its tokens match
the other exactly up to the marked token.

When marked rules are displayed in AnaGram windows, the marked token
is displayed in a distinctive font which the developer can select.

\paragraph{Non-associative.}
A binary operator, say \agcode{\#}, is said to be non-associative if
the sequence \agcode{x \# y \# z} is not permissible. If this sequence
is to be interpreted as \agcode{(x \# y) \# z}, the operator is said
to be left associative. If the sequence is to be interpreted as
\agcode{x \# (y \# z)}, the operator is said to be right
associative. (See \agterm{associativity}.)

\index{Nonterminal token}
\index{Token}
\paragraph{Nonterminal Token.}
A ``nonterminal token'' is a token which results from reducing the
sequence of tokens which match a grammar rule and replacing them with
the appropriate reduction token. Nonterminal tokens are to be
distinguished from terminal tokens or input tokens.

\index{Null productions}
\index{Production}
\paragraph{Null Production.}
A ``null production'' is one that has no tokens on the right side
whatsoever. Null productions are essentially identified by the first
following input token. Null productions are extremely convenient
syntactic elements when you wish to make some input optional. For
example, suppose that you wish to allow an optional semicolon at some
point in your grammar. You could write the following pair of
productions:

\begin{indentingcode}{0.4in}
optional semicolon -> | ';'
\end{indentingcode}

Note that a null production can never follow a ``\agcode{|}''.

The above could also be written on multiple lines thus:

\begin{indentingcode}{0.4in}
optional semicolon
->
-> ';'
\end{indentingcode}

You can always rewrite your grammar to eliminate null productions if
you wish, but you usually pay a price in conciseness and
clarity. Sometimes, however, it is necessary to do such a rewrite in
order to avoid conflicts, to which null productions are especially
prone.

If you have a null production with no reduction procedure specified,
your parser will automatically assign the value zero to the reduction
token.

Null productions can be generated by virtual productions.

\index{Parser}
\paragraph{Parser.}
A parser is a program, or more likely a procedure within a program,
which scans a sequence of input characters or input tokens and
accumulates them in an input buffer or stack as determined by a set of
productions which constitute a grammar. When the parser discovers a
sequence of tokens as defined by a grammar rule, or right side of a
production, it ``reduces'' the sequence to a single reduction token as
defined by the left side of the grammar rule. This nonterminal token
now replaces the tokens which matched the grammar rule and the search
for matches continues. If an input token is encountered which will not
yield a match for any rule, it is considered a syntax error and some
kind of error recovery may be required to continue. If a match, or
reduction action, yields the grammar token, sometimes called the goal
token or start token, the parser deems its work complete and returns
to whatever procedure may have called it.

Tokens may have semantic values. If the \agparam{input values}
configuration switch is on, your parser will expect semantic values to
be provided by the input process along with the token identification
code. If the \agparam{input values} switch is off, your parser will
take the ASCII value of the input character, that is, the actual input
code, as the value of the character. When the parser reduces a
production, it can call a reduction procedure or semantic action to
analyze the values of the constituent tokens. This reduction procedure
can then return a value which characterizes the reduced token.

\paragraph{Parser Generator.}
A parser generator, such as AnaGram, is a program that converts a
grammar, a rule-based description of the input to a program, into a
conventional, procedural module called a parser. The parsers AnaGram
generates are simple C modules which can be compiled on almost any
platform. AnaGram parsers are also compatible with C++.

589 \index{Parsing engine} | |
590 \paragraph{Parsing Engine.} | |
591 A parser consists of three basic components: A set of syntax tables, a | |
592 set of reduction procedures and a parsing engine. The parsing engine | |
593 is the body of code that interprets the parsing table, invokes input | |
594 functions, and calls the reduction procedures. The \agmenu{Build | |
595 Parser} command configures a parsing engine according to the implicit | |
596 requirements of the syntax specification and according to the explicit | |
597 values of the configuration parameters. | |
598 | |
599 The parsing engine itself is a simple automaton, characterized by a | |
600 set of states and a set of inputs. The inputs are the tokens of your | |
601 grammar. Each state is represented by a list of tokens which are | |
602 admissible in that state and for each token an action to perform and a | |
603 parameter which further defines the action. In a traditional LR | |
604 parser, there are only four actions: the shift action, the reduce | |
605 action, the accept action and the error action. AnaGram, in doing its | |
606 grammar analysis, identifies a number of special cases, and creates a | |
607 number of extra actions which make for faster processing, but which | |
608 can be represented as combinations of these primitive actions. | |

When a shift action is performed, the current state number is pushed
onto the parser state stack and the new state number is determined by
the current state number and the current lookahead token. Different
tokens cause different new states.

When a reduce action is performed, the length of the rule being
reduced is subtracted from the parser stack index and the new state
number is read from the top of the parser state stack. The reduction
token for the rule being reduced is then used as an input token.

Each state in the grammar, with the exception of state zero, has a
characteristic token which must have been recognized in order to jump
to that state. Therefore, the parser state stack, which is essentially
a list of state numbers, can also be thought of as a list of token
numbers. This is the list of tokens that have been seen so far in the
parse of your input stream. Some of these tokens, of course, may be
nonterminal tokens which have resulted from reducing other sequences
of tokens.

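The shift and reduce mechanics described above can be sketched as a
miniature table-driven engine. The following C fragment is an
illustrative toy, not code that AnaGram generates: the grammar
(\agcode{E -> E '+' n | n}), the state numbering, and all names are
invented for the example, and the tables are built by hand.

```c
#include <string.h>

/* Toy table-driven LR engine for the grammar  E -> E '+' n | n.
 * A generated parser has much larger tables and extra short-cut
 * actions, but its control loop has this general shape. */

enum { TK_N, TK_PLUS, TK_EOF, NTOKENS };
enum { SHIFT, REDUCE, ACCEPT, ERROR };

typedef struct { int act, arg; } Cell;   /* arg: next state or rule number */

static const Cell action_table[5][NTOKENS] = {
    /* state 0 */ { {SHIFT, 1}, {ERROR, 0},  {ERROR, 0}  },
    /* state 1 */ { {ERROR, 0}, {REDUCE, 2}, {REDUCE, 2} }, /* E -> n .     */
    /* state 2 */ { {ERROR, 0}, {SHIFT, 3},  {ACCEPT, 0} },
    /* state 3 */ { {SHIFT, 4}, {ERROR, 0},  {ERROR, 0}  },
    /* state 4 */ { {ERROR, 0}, {REDUCE, 1}, {REDUCE, 1} }, /* E -> E + n . */
};
static const int rule_length[3] = { 0, 3, 1 };
/* State to enter after reducing to E, indexed by the uncovered state. */
static const int goto_E[5] = { 2, -1, -1, -1, -1 };

/* Returns 1 if the input is accepted, 0 on a syntax error. */
int parse(const char *input)
{
    int stack[64], sp = 0;     /* parser state stack; top is current state */
    size_t i = 0, n = strlen(input);
    stack[0] = 0;
    for (;;) {
        int tok = i < n ? (input[i] == 'n' ? TK_N : TK_PLUS) : TK_EOF;
        Cell c = action_table[stack[sp]][tok];
        switch (c.act) {
        case SHIFT:            /* push the new state, consume the token */
            stack[++sp] = c.arg;
            i++;
            break;
        case REDUCE:           /* pop the rule, then "shift" the E token */
            sp -= rule_length[c.arg];
            if (goto_E[stack[sp]] < 0)
                return 0;
            stack[sp + 1] = goto_E[stack[sp]];
            sp++;
            break;
        case ACCEPT:
            return 1;
        default:
            return 0;
        }
    }
}
```

Tracing \agcode{n+n} through the loop shows the reduce action doing
exactly what the text describes: the rule length is subtracted from
the stack index, and the reduction token \agcode{E} is then consumed
as though it had been input directly.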
\index{Partition}
\paragraph{Partition.}
If you use character sets in your grammar, AnaGram will compute a
``partition'' of the character universe. This partition is a
collection of non-overlapping character sets such that every one of
the sets you have defined can be written as a union of partition
sets. Each partition set is identified by a unique reference number
called the partition set number.

Each partition set is assigned a unique token. If one of your
character sets requires more than one partition set to represent it,
AnaGram will create appropriate productions and add them to your
grammar so your parser can make the necessary distinctions.

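The idea behind such a partition can be sketched in a few lines of C.
This is an illustrative reconstruction, not AnaGram's algorithm: the
eight-character universe and the function names are invented, and sets
are represented as bitmasks. Two characters belong to the same
partition set exactly when the same user-defined sets contain both, so
each character's ``signature'' (which sets it belongs to) identifies
its partition set.

```c
/* Partitioning a character universe, sketched over eight characters
 * 'a'..'h', represented as bits 0..7 of a bitmask. */

#define UNIVERSE 8

/* sig[c] = bitmask of the user-defined sets that contain character c. */
void signatures(const unsigned char *sets, int nsets, unsigned char *sig)
{
    for (int c = 0; c < UNIVERSE; c++) {
        sig[c] = 0;
        for (int s = 0; s < nsets; s++)
            if (sets[s] & (1u << c))
                sig[c] |= (unsigned char)(1u << s);
    }
}

/* Number of distinct nonzero signatures = number of partition sets. */
int count_partitions(const unsigned char *sig)
{
    unsigned char seen[UNIVERSE];
    int n = 0;
    for (int c = 0; c < UNIVERSE; c++) {
        int dup = 0;
        if (sig[c] == 0)
            continue;
        for (int i = 0; i < n; i++)
            if (seen[i] == sig[c])
                dup = 1;
        if (!dup)
            seen[n++] = sig[c];
    }
    return n;
}
```

For example, with one set containing a through d and another
containing c through f, the partition has three sets: one holding a
and b, one holding c and d, and one holding e and f; each user set is
then a union of two partition sets.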
\index{Precedence}
\paragraph{Precedence.}
``Precedence'' is an attribute of binary operators and unary operators
which can be used to resolve conflicts. Operator precedence is also
referred to as \agterm{binding}. Suppose that \agcode{\#} and
\agcode{@} are two binary operators. The question is how to interpret
\agcode{x \# y @ z}. If \agcode{\#} has higher precedence than
\agcode{@}, it is interpreted as \agcode{(x \# y) @ z}. On the other
hand, if \agcode{\#} has lower precedence than \agcode{@}, it is
interpreted as \agcode{x \# (y @ z)}. If \agcode{\#} and \agcode{@}
have the same precedence, the question must be resolved by
associativity.

Note that all operators at the same precedence level must have the
same associativity.

The situation is somewhat simpler for unary operators. If
\agcode{\#} and \agcode{@} were both
prefix operators, or both suffix operators, there would be no
ambiguity in interpretation, since neither \agcode{\# @ x} nor
\agcode{x \# @} offers more than one possible interpretation. The only
difficulty arises when both a prefix and a suffix operator are applied
to the same operand.

Suppose that \agcode{\#} is a prefix operator and \agcode{@} is a
suffix operator. The question then is how to interpret \agcode{\# x @}.
If \agcode{\#} has higher precedence than \agcode{@}, it will be
interpreted as \agcode{(\# x) @}. On the other hand, if \agcode{\#}
has lower precedence than \agcode{@}, it will be interpreted as
\agcode{\# (x @)}. If \agcode{\#} and \agcode{@} have the same
precedence, the question must be resolved by associativity.

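Outside a parser generator, this resolution rule is often implemented
directly with the standard ``precedence climbing'' technique, which
makes the discussion above concrete. The following C sketch evaluates
expressions over single-digit operands with \agcode{+} (precedence 1)
and \agcode{*} (precedence 2), both left associative; it is a generic
illustration, not AnaGram's mechanism, and all names are invented.

```c
/* Precedence climbing over single-digit operands.  Because '*' has
 * higher precedence than '+', "2+3*4" is interpreted as 2+(3*4). */

static const char *cursor;

static int prec(char op) { return op == '+' ? 1 : op == '*' ? 2 : 0; }

static int expr(int min_prec)
{
    int lhs = *cursor++ - '0';              /* operand: one digit */
    while (prec(*cursor) >= min_prec) {     /* non-operators have prec 0 */
        char op = *cursor++;
        /* The +1 makes equal-precedence operators group to the left. */
        int rhs = expr(prec(op) + 1);
        lhs = (op == '+') ? lhs + rhs : lhs * rhs;
    }
    return lhs;
}

int eval(const char *s) { cursor = s; return expr(1); }
```

In a grammar-based parser the same decision is made once, at
generation time, when precedence declarations choose between the
conflicting table entries.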
\index{Production}
\paragraph{Production.}
Productions are the mechanism used to describe how complex input
structures are built up out of simpler ones. Each production has a
left side and a right side. The right side, or grammar rule, is a
sequence of rule elements, which may represent either terminal tokens
or nonterminal tokens. The left side is a list of reduction tokens. In
most cases there is only a single reduction token. Productions
with more than one token on the left side are called semantically
determined productions. The ``\agcode{->}'' symbol is used to separate
the left side from the right side.

\paragraph{Reduce Action.}
The reduce action is one of the four actions of a traditional parsing
engine. The reduce action is performed when the parser has succeeded
in matching all the elements of a grammar rule and the next input
token is not erroneous. Reducing the grammar rule amounts to
subtracting the length of the rule from the parser stack index,
identifying the reduction token, stacking its semantic value and then
doing a shift action with the reduction token as though it had been
input directly.

\index{Conflicts}
\index{Reduce-reduce conflicts}
\paragraph{Reduce-Reduce Conflict.}
A grammar has a ``reduce-reduce'' conflict at some state if a single
token turns out to be a reducing token for more than one completed
rule. For example, if \agcode{variable -> name} and
\agcode{function -> name} are both completed rules in the same state,
and some token could legitimately follow either a variable or a
function, the parser cannot tell which reduction to perform.

\index{Reducing token}
\index{Token}
\paragraph{Reducing Token.}
In a state with more than one completed rule, your parser must be able
to determine which one was actually found. AnaGram deals with this
problem by looking at all the states the parser will branch to once
each rule is reduced. The acceptable input tokens for those states are
the ``reducing tokens'' for the completed rules in the state under
investigation. If there is a single token which is a reducing token
for more than one rule, then the grammar is said to have a
reduce-reduce conflict at that state. If in a particular state there
is both a shift action and a reduce action for the same token, the
grammar is said to have a shift-reduce conflict in that state.

A reducing token is not the same as a reduction token.

\index{Reduction procedure}
\paragraph{Reduction Procedure.}
A ``reduction procedure'' is a function you write which your parser
executes when it has identified the grammar rule to which the
reduction procedure is attached in your grammar. There are two formats
for reduction procedures, depending on the size and complexity of the
procedure.

\index{Reduction token}
\index{Token}
\paragraph{Reduction Token.}
A token which appears on the left side of a production is called a
reduction token. It is so called because when the grammar rule on the
right side of the production is matched in the input stream, your
parser will ``reduce'' the sequence of tokens which matches the rule
by replacing the sequence of tokens with the reduction token. Note
that, if more than one reduction token is specified, your reduction
procedure should choose the exact one. If it does not, your parser
will use the leftmost syntactically correct one in the list as the
default.

A reduction token is not the same as a reducing token.

\index{Resynchronization}
\paragraph{Resynchronization.}
``Resynchronization'' is the process of getting your parser back in
step with its input after encountering a syntax error. As such, it is
one method of error recovery. Of course, you would resynchronize only
if it is necessary to continue after the error. There are several
options available when using AnaGram. You could use the \agparam{auto
resynch} option, which causes AnaGram to incorporate an automatic
resynchronizing procedure into your parser, or you could use the
\agparam{error token} option, which is similar to the technique used
by \agfile{yacc} programmers.

\index{Rule elements}
\paragraph{Rule Element.}
A grammar rule is a list of ``rule elements'', separated by commas.
Rule elements may be token names, character sets, keywords, immediate
actions, or virtual productions. When AnaGram encounters a rule
element for which no token presently exists, it creates one.

\index{Semantic action}
\index{Action}
\paragraph{Semantic Action.}
A semantic action is a piece of C or C++ code that is executed when a
grammar rule has been identified by a parser. Semantic actions are
also called reduction procedures, since they are executed when a
grammar rule is reduced to the token on the left side of the
production.

\index{Semantic value}
\index{Value}
\paragraph{Semantic Value.}
Tokens, whether terminal or nonterminal, may have a semantic value. In
the case of terminal tokens, this may be a value assigned by the
lexical scanner or, if the parser is using direct character input, it
will be the ASCII value of the character itself. The values of
nonterminal tokens are created by reduction procedures. As a parse
progresses, token values are shifted onto the stack, so that in a
reduction procedure, the values of the tokens that comprise the
grammar rule that is being reduced are available for inspection.

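The flow of values during a reduction can be sketched with a small
value stack. This fragment is illustrative only: the stack layout and
all names are invented, and a real generated parser manages its value
stack for you, exposing the constituent values to each reduction
procedure.

```c
/* A miniature semantic value stack.  When a rule of length three,
 * such as  sum -> sum, '+', term , is reduced, the values of its
 * constituents occupy the top three stack slots; the reduction
 * procedure combines them, and the result replaces them as the
 * value of the reduction token. */

static int value_stack[64];
static int vsp;                         /* value stack index */

void push_value(int v) { value_stack[vsp++] = v; }

/* Reduce a rule of the given length, adding the first and third
 * constituent values (the second is the '+' token itself). */
int reduce_sum(int rule_len)
{
    int result = value_stack[vsp - rule_len] + value_stack[vsp - 1];
    vsp -= rule_len;                    /* pop the constituents...   */
    push_value(result);                 /* ...push the reduced value */
    return result;
}
```

After the reduction, the single value on the stack plays exactly the
role described above: it is the semantic value of the nonterminal
token that replaced the matched sequence.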
\index{Semantically determined production}
\index{Production}
\paragraph{Semantically Determined Production.}
A ``semantically determined production'' is one which has more than
one reduction token specified on the left side of the production. You
would write such a production when the reduction tokens are
syntactically indistinguishable. The reduction procedure may then
specify which of the listed reduction tokens the grammar rule is to
reduce to based on semantic considerations. If there is no reduction
procedure, or the reduction procedure does not specify a reduction
token, the parser will use the leftmost syntactically correct
reduction token.

\index{Set expressions}
\paragraph{Set Expression.}
A set expression is an algebraic expression used to define a character
set in terms of individual characters, ranges of characters, and/or
other sets of characters as constructed using complements, unions,
intersections, and differences.

\paragraph{Shift Action.}
The shift action is one of the four actions of a traditional parsing
engine. The shift action is performed when the input token matches one
of the acceptable input tokens for the current parser state. The
semantic value of the token and the current state number are stacked,
the parser stack index is incremented, and the state number is set to
a value determined by the previous state and the input token.

\index{Shift-reduce conflict}
\index{Conflicts}
\paragraph{Shift-Reduce Conflict.}
A ``shift-reduce'' conflict occurs if in some state there is a single
token that should be shifted, because it is legitimate input for one
of the rules of the state, but should also be used to reduce some
other rule because it is a reducing token for that rule. The classic
example is the ``dangling else'': after \agcode{if (x) if (y) f();}
an \agcode{else} could be shifted to continue the inner \agcode{if},
or the inner \agcode{if} statement could be reduced, which would
attach the \agcode{else} to the outer \agcode{if}.

\index{Parsing}
\index{Syntax directed parsing}
\paragraph{Syntax Directed Parsing.}
Syntax directed parsing, or formal parsing, is an approach to building
parsers based on formal language theory. Given a suitable description
of a language, called a grammar, there are algorithms which can be
used to create parsers for the language automatically. In this
context, the set of all possible inputs to a program may be considered
to constitute a language, and the rules for formulating the input to
the program constitute the grammar for the language.

The parsers built from a grammar have the advantage that they can
recognize any input that conforms to the rules, and can reject as
erroneous any input that fails to conform.

Since the program logic necessary to parse input is often extremely
intricate, programs which use formal parsing are usually much more
reliable than those built by hand. They are also much easier to
maintain, since it is much easier to modify a grammar specification
than it is to modify complex program logic.

\index{Terminal token}
\index{Token}
\paragraph{Terminal Token.}
A ``terminal token'' is a token which does not appear on the left side
of a production. It represents, therefore, a basic unit of input to
your parser. If the input to your parser consists of ASCII
characters, you may define terminal tokens explicitly as ASCII
characters or as sets of ASCII characters. If you have an input
procedure which produces numeric codes, you may define the terminal
tokens directly in terms of these numeric codes.

\index{Token}
\paragraph{Token.}
Tokens are the units with which your parser works. There are two kinds
of tokens: terminal tokens and nonterminal tokens. These latter are
identified by the parser as sequences of tokens. The grouping of
tokens into more complex tokens is governed by the grammar rules, or
productions, in your grammar. In your grammar, tokens may be denoted
by token names, by virtual productions, by immediate actions, by
explicit character representations, or by expressions which yield
character sets.

A ``token'' should be thought of more or less as being something like
a cloakroom token. Imagine you had taken a kindergarten class to the
museum and all the kids checked their coats. Each one gets a token.
What is the probability of losing one? You take all of the tokens
from the children, put them in a paper bag, tie it up, and check
it. You get one token in return. We would say that the kids' coats
represent the actual input to the system, their individual tokens are
the terminal tokens, and your one token which enables you to redeem
the paper bag is a nonterminal token. Nonterminal tokens are just
single token numbers that represent the result of compacting a
sequence of more primitive tokens.

Actually, the word ``token'' is commonly used to refer to a number of
apparently different things. The tokens you use in a grammar are
abstract. The tokens identified by a parser are concrete instances of
the abstract tokens in the grammar. Furthermore, we have to identify
tokens in some manner, so we have token names and token numbers with
which to refer to particular tokens. As a result the word ``token'',
in any particular context, may refer directly to either an abstract
token, or a concrete instance of an abstract token, or may refer only
indirectly to the token by means of a token name or token number.

As an example, consider a C compiler which has a lexical scanner to
identify input tokens. According to Kernighan and
Ritchie\footnote{Brian W. Kernighan and Dennis M. Ritchie. \textit{The
C Programming Language}, 2nd ed. (Englewood Cliffs, New Jersey:
Prentice-Hall, 1988), p. 191.}, there are six classes of C tokens:
identifiers, keywords, constants, string literals, operators, and
other separators. Consider a particular token, a hexadecimal constant,
\agcode{0XF00}. When the lexical scanner hands this over to the
parser, it has to provide an identification code which says what kind
of token it is, a hexadecimal constant, as well as the ``semantic
value'' of the token, 3840. The identification code is usually an
integer value, determined by the designer of the lexical scanner,
which by agreement with the designer of the parser uniquely represents
a hexadecimal constant. This identification code can be thought of as
an external token number.

The grammar for the parser, on the other hand, has a token,
\agcode{HEXconstant}, let us say, to represent the abstract usage of
hexadecimal constants in the C language. When AnaGram, or any other
parser generator for that matter, analyzes the grammar, it assigns its
own reference number to identify \agcode{HEXconstant}. This reference
number can be thought of as an internal token number to contrast it
with the external token number provided by the lexical scanner. The
parsers AnaGram builds contain a table, \agcode{ag{\us}tcv}, which
provides a translation from the external to the internal token
number. Both of these token numbers, of course, differ from the
semantic value of the token, in this case, 3840.

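The distinction between external token number, internal token number,
and semantic value can be made concrete with a toy translation
function. All codes and names below are invented for the example;
AnaGram's actual \agcode{ag{\us}tcv} table is generated automatically
and covers the whole range of input codes.

```c
/* External-to-internal token translation in miniature.  The scanner
 * and parser agree on the external codes; the generator assigns its
 * own internal numbers for use as indices into the parser tables. */

enum { EXT_HEXCONST = 257, EXT_IDENT = 258 };      /* scanner's codes  */
enum { TK_ERROR, TK_HEXCONSTANT, TK_IDENTIFIER };  /* grammar's tokens */

int tcv(int external_code)
{
    switch (external_code) {
    case EXT_HEXCONST: return TK_HEXCONSTANT;
    case EXT_IDENT:    return TK_IDENTIFIER;
    default:           return TK_ERROR;
    }
}
```

Neither number is the semantic value: for the input \agcode{0XF00} the
scanner reports the external code for a hexadecimal constant and,
separately, the value 3840.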
Even when an AnaGram parser is using simple character input, the word
``token'' may represent several different things. Suppose a grammar
has a token \agcode{'A-Z'}, which represents the set of upper case
letters. There will be an internal token number, assigned by AnaGram,
which identifies the (abstract) token in parser tables. The actual
(concrete) input token to a parser will, however, be a single letter,
say ``Q''. In this case, the ASCII value of the character ``Q'', 81,
is used both as the external token number and as the semantic value of
the token.

\paragraph{Unary Operator.}
A ``unary operator'' is a token which represents some sort of
operation which operates on a single entity to produce a new resultant
entity. It is normally represented by a symbol which is written either
preceding the operand or following the operand. In the first case it
is called a ``prefix operator'' and in the second case a ``suffix
operator''.

\index{Union}
\paragraph{Union.}
The union of two sets is the set of all elements that are to be found
in one or the other of the two sets. In an AnaGram syntax file the
union of two character sets \agcode{A} and \agcode{B} is represented
using the plus sign, as in \agcode{A + B}. The union operator has the
same precedence as the difference operator: lower than that of
intersection and complement. The union operator is left associative.
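
The set operations named in this and the preceding entries have a
direct bitmask reading, which is also how such sets are commonly
implemented. The following C sketch works over a 32-character universe
with one bit per character; the representation and the helper names
are invented for the example, and only the \agcode{+} notation for
union is taken from the text above.

```c
/* Character-set algebra over a 32-character universe, one bit per
 * character.  Union is the '+' of an AnaGram set expression; the
 * other three operations correspond to difference, intersection,
 * and complement as described in the glossary. */

typedef unsigned long CharSet;

CharSet set_union(CharSet a, CharSet b)      { return a | b; }
CharSet set_difference(CharSet a, CharSet b) { return a & ~b; }
CharSet set_intersect(CharSet a, CharSet b)  { return a & b; }
CharSet set_complement(CharSet a)            { return ~a & 0xFFFFFFFFul; }
```

The precedence rule stated above means that in a set expression the
intersections and complements are formed first, and the resulting sets
are then combined by unions and differences from left to right.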