Mercurial > ~dholland > hg > ag > index.cgi
comparison doc/misc/html/gloss.html @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
-1:000000000000 | 0:13d2b8934445 |
---|---|
1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> | |
2 <HTML> | |
3 <HEAD> | |
4 <TITLE>Glossary of Parsing Terms</TITLE> | |
5 | |
6 </HEAD> | |
7 | |
8 <BODY BGCOLOR="#ffffff" BACKGROUND="tilbl6h.gif" | |
9 TEXT="#000000" LINK="#0033CC" | |
10 VLINK="#CC0033" ALINK="#CC0099"> | |
11 | |
12 <P> | |
13 <IMG ALIGN="right" SRC="images/agrsl6c.gif" ALT="AnaGram" | |
14 WIDTH=124 HEIGHT=30> | |
15 <BR CLEAR="RIGHT"> | |
16 </P> | |
17 | |
18 Back to <A HREF="index.html">Index</A> | |
19 <P> | |
20 <IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" | |
21 WIDTH=1010 HEIGHT=2 > | |
22 | |
23 <P> | |
24 <H2>Glossary of Parsing Terms</H2> | |
25 <IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" | |
26 WIDTH=1010 HEIGHT=2 > | |
27 </P> | |
28 | |
29 <P> | |
30 <TABLE WIDTH="100%" ALIGN="center" CELLPADDING=15 CELLSPACING=15> | |
31 <TR ALIGN="left"> | |
32 | |
33 <TD VALIGN=TOP NOWRAP WIDTH="50%"> | |
34 | |
35 | |
36 | |
37 <A HREF="#ParserAction">Action</A><BR> | |
38 <A HREF="#AcceptAction">Accept Action</A><BR> | |
39 <A HREF="#Associativity">Associativity</A><BR> | |
40 <A HREF="#Backtracking">Backtracking</A><BR> | |
41 <A HREF="#BNF">Backus-Naur Form</A><BR> | |
42 <A HREF="#BinaryOperator">Binary operator</A><BR> | |
43 <A HREF="#Binding">Binding</A><BR> | |
44 <A HREF="#CharacterSets">Character sets</A><BR> | |
45 <A HREF="#Universe">Character Universe</A><BR> | |
46 <A HREF="#CharacteristicRule">Characteristic Rule</A><BR> | |
47 <A HREF="#CharacteristicToken">Characteristic Token</A><BR> | |
48 <A HREF="#SetComplement">Complement of a set</A><BR> | |
49 <A HREF="#CompletedRule">Completed Rule</A><BR> | |
50 <A HREF="#Conflict">Conflict</A><BR> | |
51 <A HREF="#ContextFreeGrammar">Context Free Grammars</A><BR> | |
52 <A HREF="#DefaultReduction">Default reductions</A><BR> | |
53 <A HREF="#SetDifference">Difference of two sets</A><BR> | |
54 <A HREF="#ErrorAction">Error Action</A><BR> | |
55 <A HREF="#EventDriven">Event Driven Parser</A><BR> | |
56 <A HREF="#ExpansionRule">Expansion Rule</A><BR> | |
57 <A HREF="#Grammar">Grammar</A><BR> | |
58 <A HREF="#GrammarAnalysis">Grammar Analysis</A><BR> | |
59 <A HREF="#GrammarRule">Grammar Rule</A><BR> | |
60 <A HREF="#GrammarToken">Grammar Token</A><BR> | |
61 <A HREF="#SetIntersection">Intersection</A><BR> | |
62 <A HREF="#LALRParsing">LALR parsing</A><BR> | |
63 <A HREF="#LexicalScanner">Lexical Scanner</A><BR> | |
64 <A HREF="#Lookahead">Lookahead Token</A><BR> | |
65 <A HREF="#LRParsing">LR Parsing</A><BR> | |
66 <A HREF="#MarkedRule">Marked Rule</A><BR> | |
67 <A HREF="#NonAssociative">Non-associative</A><BR> | |
68 </TD> | |
69 | |
70 <TD VALIGN=TOP NOWRAP WIDTH="50%"> | |
71 <A HREF="#Nonterminal">Nonterminal Token</A><BR> | |
72 <A HREF="#NullProduction">Null Production</A><BR> | |
73 <A HREF="#ParameterAssignment">Parameter Assignment</A><BR> | |
74 <A HREF="#Parser">Parser</A><BR> | |
75 <A HREF="#ParserGenerator">Parser Generator</A><BR> | |
76 <A HREF="#ParserStateStack">Parser State Stack</A><BR> | |
77 <A HREF="#ParserValueStack">Parser Value Stack</A><BR> | |
78 <A HREF="#ParsingEngine">Parsing Engine</A><BR> | |
79 <A HREF="#SetPartition">Partition</A><BR> | |
80 <A HREF="#Precedence">Precedence</A><BR> | |
81 <A HREF="#Production">Production</A><BR> | |
82 <A HREF="#ReduceAction">Reduce Action</A><BR> | |
83 <A HREF="#ReduceReduce">Reduce-Reduce Conflict</A><BR> | |
84 <A HREF="#ReducingToken">Reducing Token</A><BR> | |
85 <A HREF="#ReductionProcedure">Reduction Procedure</A><BR> | |
86 <A HREF="#ReductionToken">Reduction Token</A><BR> | |
87 <A HREF="#Resynchronization">Resynchronization</A><BR> | |
88 <A HREF="#RuleElement">Rule Element</A><BR> | |
89 <A HREF="#SemanticAction">Semantic Action</A><BR> | |
90 <A HREF="#SemanticValue">Semantic Value</A><BR> | |
91 <A HREF="#SemanticallyDetermined">Semantically Determined Production</A><BR> | |
92 <A HREF="#SetExpression">Set Expression</A><BR> | |
93 <A HREF="#ShiftAction">Shift Action</A><BR> | |
94 <A HREF="#ShiftReduce">Shift-Reduce Conflict</A><BR> | |
95 <A HREF="#SyntaxDirectedParsing">Syntax Directed Parsing</A><BR> | |
96 <A HREF="#Terminal">Terminal Token</A><BR> | |
97 <A HREF="#Token">Token</A><BR> | |
98 <A HREF="#UnaryOperator">Unary Operator</A><BR> | |
99 <A HREF="#SetUnion">Union</A><BR> | |
100 <A HREF="#VirtualProduction">Virtual Production</A><BR> | |
101 | |
102 </TD> | |
103 | |
104 </TABLE> | |
105 | |
106 <HR> | |
107 | |
108 | |
109 | |
110 <DL> | |
111 <DT><B><A NAME="ParserAction">Action</A></B></DT> | |
112 <DD>A "parser action" is one of the basic executable elements of a | |
113 <A HREF="#ParsingEngine">parsing engine</A>. In a traditional parsing engine | |
114 there are four actions: the | |
115 <A HREF="#ShiftAction">shift action</A>, the <A HREF="#ReduceAction">reduce | |
116 action</A>, the <A HREF="#AcceptAction">accept action</A>, and the | |
117 <A HREF="#ErrorAction">error action</A>. The parsing engines which AnaGram | |
118 generates use a substantially greater number of actions, in order to provide | |
119 certain short cuts in the interest of greater speed and greater compactness of | |
120 the tables which control the action of the parsing engine. | |
121 <P></P> | |
122 </DD> | |
123 | |
124 <DT><B><A NAME="AcceptAction">Accept Action</A></B></DT> | |
125 <DD> The accept <A HREF="#ParserAction">action</A> is one of the four actions | |
126 of a traditional <A HREF="#ParsingEngine">parsing engine</A>. The accept action | |
127 is performed when the <A HREF="#Parser">parser</A> has succeeded in identifying | |
128 the <A HREF="#GrammarToken">goal token</A> for the <A HREF="#Grammar">grammar</A>. | |
129 When the parser executes the accept action, it returns to the calling program. | |
130 The accept action is thus the last action of | |
131 the parsing engine and occurs only once for each successful execution | |
132 of the parser. | |
133 | |
134 <P></P> | |
135 </DD> | |
136 | |
137 <DT><B><A NAME="Associativity">Associativity</A></B></DT> | |
138 <DD> This is a term used to describe <A HREF="#BinaryOperator">binary | |
139 operators </A>. It describes how you interpret a sequence of operations which | |
140 all involve the same operator. Consider a - b - c, for instance. Here the | |
141 convention is that we subtract b from a, and then c from the difference. We say | |
142 that subtraction is <B>left associative</B>, since if there is a sequence of | |
143 subtraction operations, the leftmost one is to be performed first. As a second | |
144 example, consider exponentiation. FORTRAN uses ** as an operator to indicate | |
145 exponentiation. In order to conform as closely as possible to ordinary | |
146 mathematical notation, the expression a**b**c is interpreted to mean that first | |
147 b is raised to the power c, and then the resultant value is used as the power to | |
148 which a is raised. We say that exponentiation is <B>right associative</B> since the | |
149 rightmost of a sequence of exponentiation operations is to be performed first. | |
150 If a programming language does not allow for unparenthesized sequences of a | |
151 particular operator, the operator is said to be non-associative.<P> | |
152 Another way to view left and right associativity is as implied parenthesization. | |
153 Left associative operators are parenthesized from the left. Right associative | |
154 operators are parenthesized from the right. Thus a - b - c = ((a-b) - c) and | |
155 a**b**c = (a**(b**c))</P><P> | |
156 AnaGram offers two ways to specify associativity of | |
157 <A HREF="#BinaryOperator">binary operators</A>. The first | |
158 way is to use conventional <A HREF="#BNF">Backus-Naur Form</A> syntax | |
159 descriptions. You would userecursion to describe both subtraction and | |
160 exponentiation. Subtraction, since it is left associative, uses left recursion, | |
161 and exponentiation, being right associative, uses right recursion. Thus</P><P> | |
162 </P><PRE> | |
163 difference -> value | difference, '-', value | |
164 exponential -> value | value, "**", exponential | |
165 </PRE> could be used to describe differences and exponents.<P> | |
166 The second way to specify associativity is to use an ambiguous <A HREF="#Grammar">grammar</A> and | |
167 precedence declarations. (See Chapter 9, AnaGram User's Guide.)</P><P> | |
168 </P></DD> | |
169 | |
170 <DT><B><A NAME="Backtracking">Backtracking</A></B></DT> | |
171 <DD> In order to make parse tables more compact and <A HREF="#Parser">parsers</A> faster, it is | |
172 common to use <A HREF="#DefaultReduction">default reductions</A>. In case of | |
173 error, it is necessary to undo default reductions before diagnostics can be | |
174 properly determined. In AnaGram, this undo operation is called backtracking.<P> | |
175 | |
176 </P></DD> | |
177 | |
178 <DT><B><A NAME="BNF">Backus-Naur Form</A></B></DT> | |
179 <DD> Backus-Naur Form, or BNF, is a conventional notation for describing | |
180 <A HREF="#ContextFreeGrammar">context free grammars</A>. AnaGram uses an | |
181 extended notation for <A HREF="#Grammar">grammars</A>, which, except for | |
182 <A HREF="#SemanticallyDetermined">semantically determined | |
183 productions</A>, can be shown to be equivalent to BNF. The term "BNF" is | |
184 often used colloquially to refer to a grammar specification.<P> | |
185 AnaGram's syntax specification language differs from BNF in the following | |
186 respects: | |
187 </P><OL> | |
188 <LI>In conventional BNF, a symbol not enclosed in angle brackets (< >) is | |
189 taken to represent itself literally. In AnaGram, literal character | |
190 representations must be enclosed in single quotes and literal strings within | |
191 double quotes. | |
192 </LI> | |
193 <LI>In BNF, the right side of a <A HREF="#Production">production</A> is simply a list of symbols without | |
194 any delimiters or punctuation. AnaGram requires that the <A HREF="#RuleElement">rule elements</A> which | |
195 make up a <A HREF="#GrammarRule">grammar rule</A>, or right side, be joined by commas. | |
196 </LI> | |
197 <LI>BNF makes no provision for identifying <A HREF="#ReductionProcedure">reduction procedures</A> or their | |
198 arguments. AnaGram provides both reduction procedures, introduced by an "=" | |
199 at the end of the production, and named arguments, introduced by a ":" | |
200 at the end of any <A HREF="#Token">token</A> on the right side of the production. | |
201 </LI> | |
202 <LI>AnaGram allows <A HREF="#VirtualProduction">virtual productions</A> to be used freely. | |
203 </LI> | |
204 <LI>BNF is "pure" in that if you wish to define a <A HREF="#Nonterminal">nonterminal token</A> | |
205 called "digit" to represent the digits from zero to nine, you must | |
206 provide ten explicit productions to define it. AnaGram treats the concept of <A HREF="#Terminal">"terminal | |
207 token"</A> as used in language theory as an abstraction, and interposes | |
208 <A HREF="#CharacterSets">character set</A> functionality between actual character input and the terminal | |
209 tokens of BNF. You can define digit to be the character range '0-9', for | |
210 instance, and AnaGram will determine and define precisely the minimum number of | |
211 productions necessary, taking into account any other usage of the characters | |
212 '0-9' in your grammar. This makes your grammar more compact, more manageable, | |
213 and easier to modify. | |
214 </LI> | |
215 <LI>AnaGram allows for <A HREF="#SemanticallyDetermined"> | |
216 semantically determined productions</A>, which provide a | |
217 significant mechanism for melding semantic analysis with syntactic | |
218 analysis. | |
219 | |
220 </LI> | |
221 </OL> | |
222 </DD> | |
223 | |
224 <DT><B><A NAME="BinaryOperator">Binary operator</A></B></DT> | |
225 <DD> A binary operator is an operator that works on two operands to create a | |
226 result. It is often written between its operands and is sometimes called an | |
227 infix operator. In ordinary programming languages "+" and "-" | |
228 are binary operators. | |
229 <P></P> | |
230 </DD> | |
231 | |
232 <DT><B><A NAME="Binding">Binding</A></B></DT> | |
233 <DD> Most programming languages, rather than executing arithmetic operators in | |
234 simple left to right order, conform to the traditional conventions of ordinary | |
235 algebra, which dictate that, except where parenthesis indicate otherwise, | |
236 exponentiation is done before multiplication, multiplication before addition, | |
237 and addition before comparison. One says that exponentiation is "more | |
238 tightly binding" than multiplication, multiplication is more tightly | |
239 binding than addition, and so on. The sense of the word here is that the | |
240 operator binds together its operands into a single entity. An operand which | |
241 falls between two different operators is "bound" by the more tightly | |
242 binding operator. An operator that is more tightly binding than another is also | |
243 said to have higher <A HREF="#Precedence">precedence</A>. | |
244 <P></P> | |
245 </DD> | |
246 | |
247 <DT><B><A NAME="CharacterSets">Character sets</A></B></DT> | |
248 <DD> One of the traditional problems with syntax directed programming is that | |
249 caused by the fact that most formal language specifications have productions of | |
250 the form: | |
251 <PRE> | |
252 letter -> a | b | c | d ... | z | |
253 </PRE> Since the letters are not often distinguished in the <A HREF = "#Grammar">grammar</A>, this large | |
254 number of essentially identical <A HREF="#Production">productions</A> causes a correspondingly large | |
255 number of states in the <A HREF="#Parser">parser</A>. This problem is often attacked by using a<A | |
256 HREF="#LexicalScanner"> "lexical scanner"</A> which simply specifies | |
257 a "letter token" whose value indicates precisely which letter is | |
258 intended. This works fine as long as there is nothing in the grammar which | |
259 distinguishes any letter at all. If there is, and there is usually some special | |
260 use of h, x, q, o, e, or even a-f, the situation gets more complicated. Often the | |
261 result is to abandon the use of syntax directed programming for elemental units | |
262 of the language and use a more complex lexical scanner which identifies names, | |
263 numbers, key words and operators. Such lexical scanners are often built using | |
264 conventional programming languages or simple pattern recognizing languages such | |
265 as LEX.<P> | |
266 AnaGram avoids this problem by incorporation the notion of character set into | |
267 its input specifications. Briefly, the way it works is the following: You | |
268 specify the set of characters which make up any ostensibly <A HREF="#Terminal">terminal token</A>. | |
269 AnaGram then analyzes the overlaps among these definitions and creates a minimal | |
270 set of <A HREF="#Token">tokens</A> which match your specifications exactly. For instance, suppose you | |
271 have the following definitions: | |
272 </P><PRE> | |
273 letter = 'a-z' + 'A-Z' | |
274 hex digit = '0-9' + 'a-f' + 'A-F' | |
275 </PRE> AnaGram will define a token for letter which has two productions: | |
276 <PRE> | |
277 letter -> 'a-f' + 'A-F' | 'g-z' + 'G-Z' | |
278 </PRE> With this technique, you have the freedom to specify your grammar in the | |
279 easiest possible manner, without the penalty of an absurdly large, slow parser.<P> | |
280 Character sets may be specified in terms of ranges of characters, as in the | |
281 above example, as <A HREF="#SetUnion">unions</A>, denoted by "+", as | |
282 <A HREF="#SetIntersection">intersections</A>, denoted by "&", as | |
283 <A HREF="#SetDifference">differences,</A> denoted by "-", and as | |
284 <A HREF="#SetComplement">complements</A>, using "~". You may also | |
285 specify a set consisting of a single character either with a literal character | |
286 or with a numeric specification. If you specify a character numerically, you may | |
287 use either decimal, octal or hexadecimal notation, as in C. | |
288 </P><P></P> | |
289 </DD> | |
290 | |
291 <DT><B><A NAME="Universe">Character Universe</A></B></DT> | |
292 <DD> In ordinary set theory, the "universe" is the set of all | |
293 entities which may be elements of a set. It must be defined so that the | |
294 complement of a set can mean something unambiguous. In AnaGram, the character | |
295 universe normally comprises the ascii characters and the IBM extended character | |
296 set, that is, all characters in the range from 0 through 255. If, however, you | |
297 have defined input characters outside this range, the character universe will be | |
298 extended to the least character you have used and to the greatest character you | |
299 have used in your <A HREF = "#Grammar">grammar</A>. | |
300 <P></P> | |
301 </DD> | |
302 | |
303 <DT><B><A NAME="CharacteristicRule">Characteristic Rule</A></B></DT> | |
304 <DD> Each parser state is characterized by a particular set of <A HREF="#GrammarRule">grammar rules</A>, | |
305 and for each such rule, a marked token which is the next <A HREF="#Token">token</A> expected. These | |
306 rules are called the characteristic rules for the state. In the course of doing | |
307 <A HREF="#GrammarAnalysis">grammar analysis</A>, AnaGram determines the characteristic rules for each parser | |
308 state. After analyzing your grammar, you may inspect the State Definition Table | |
309 to see the characteristic rules for any state in your <A HREF="#Parser">parser</A>. | |
310 <P></P> | |
311 </DD> | |
312 | |
313 <DT><B><A NAME="CharacteristicToken">Characteristic Token</A></B></DT> | |
314 <DD> Every state in a <A HREF="#Parser">parser</A>, except state zero, can be characterized by the | |
315 one, unique <A HREF="#Token">token</A> which causes a jump to that state. That token is called the | |
316 characteristic token of the state, because to get to that parser state you must | |
317 have just seen precisely that token in the input. Note that several states could | |
318 have the same characteristic token.<P> | |
319 When you have a list of states, such as is given by the parser state stack, it | |
320 is equivalent to a list of characteristic tokens. This list of tokens is the | |
321 list of tokens that have been recognized so far by the parser. Some of these | |
322 tokens, of course, may be <A HREF="#Nonterminal">nonterminal tokens</A> and may thus represent the result | |
323 of reducing a sequence of previously recognized tokens. | |
324 </P> | |
325 <P></P> | |
326 </DD> | |
327 | |
328 <DT><B><A NAME="SetComplement">Complement of a set</A></B></DT> | |
329 <DD> In set theory, the complement of a set, S, is the collection of all | |
330 elements of the <A HREF="#Universe">character universe</A> which are not elements of S. Using AnaGram's | |
331 notation, the complement of S is written ~S. The complement operator has higher | |
332 <A HREF="#Precedence">precedence</A> than the | |
333 <A HREF="#SetDifference">difference</A>, <A HREF="#SetIntersection">intersection</A>, or <A HREF="#SetUnion">union</A> operators. | |
334 <P></P> | |
335 </DD> | |
336 | |
337 <DT><B><A NAME="CompletedRule">Completed Rule</A></B></DT> | |
338 <DD> A "completed rule" is a <A HREF="#CharacteristicRule">characteristic rule</A> whose index is | |
339 pointing beyond the last <A HREF="#RuleElement">rule element</A>. In other words, it has been completely | |
340 matched and will be <A HREF="#ReduceAction">reduced</A> by the next input. | |
341 <P> | |
342 If a state has more than one completed rule, the decision as to which to reduce | |
343 is made based on the next input <A HREF="#Token">token</A>. If there is only one completed rule in a | |
344 state, it will be reduced by default unless the default reductions switch has | |
345 been reset, i.e., turned off. | |
346 </P></DD> | |
347 <DT><B><A NAME="Conflict">Conflict</A></B></DT> | |
348 <DD> Conflicts arise during a <A HREF="#GrammarAnalysis">grammar analysis</A> when AnaGram cannot determine | |
349 how to treat a given input <A HREF="#Token">token</A>. There are two sorts of conflicts: <A HREF="#ShiftReduce">shift-reduce | |
350 conflicts</A> and <A HREF="#ReduceReduce">reduce-reduce conflicts</A>. Conflicts may arise either because the | |
351 <A HREF = "#Grammar">grammar</A> is inherently ambiguous, or simply because the grammar analyzer cannot | |
352 look far enough ahead to resolve the conflict. In the latter case, it is often | |
353 possible to rewrite the grammar in such a way as to eliminate the conflict. <A HREF="#NullProduction">Null | |
354 productions</A> are a common source of conflict. | |
355 <P> | |
356 There are a number of ways to deal with conflicts. If you understand the | |
357 conflict well, you may simply choose to ignore it. When AnaGram encounters a | |
358 shift-reduce conflict while building parse tables it resolves it by choosing the | |
359 <A HREF="#ShiftAction">shift action</A>. When AnaGram encounters a reduce-reduce conflict while building | |
360 parse tables, it resolves it by selecting the <A HREF="#GrammarRule">grammar rule</A> which occurred first | |
361 in the grammar. | |
362 </P><P> | |
363 A second way to deal with conflicts is to set operator <A HREF="#Precedence">precedence</A> parameters. If | |
364 you set these parameters, AnaGram will use them preferentially to resolve | |
365 conflicts. Any conflicts so resolved will be listed in the Resolved Conflicts | |
366 window. | |
367 </P><P> | |
368 A third way to resolve a conflict is to declare some tokens as sticky or as | |
369 subgrammars. This is particularly useful for <A HREF="#Production">productions</A> whose sole purpose is | |
370 to skip over uninteresting input.</P><P> | |
371 </P><P> | |
372 The fourth way to deal with conflicts is to rewrite the grammar to eliminate | |
373 them. Many people prefer this approach since it yields the highest level of | |
374 confidence in the resulting program. | |
375 </P></DD> | |
376 <DT><B><A NAME="ContextFreeGrammar">Context Free Grammars</A></B></DT> | |
377 <DD> Context free grammars are <A HREF = "#Grammar">grammars</A> wherein the definition of a grammar | |
378 unit, or <A HREF="#Nonterminal">nonterminal token</A>, does not depend on the context in which the | |
379 nonterminal is used. That means that if you define "widget" in a | |
380 certain way, that definition is valid in all contexts in which "widget" | |
381 might occur. Context free grammars can be represented in <A HREF="#BNF">Backus-Naur Form</A>. | |
382 AnaGram's syntax specification language has no facility for representing | |
383 grammars which are not context free. | |
384 <P></P> | |
385 </DD> | |
386 <DT><B><A NAME="DefaultReduction">Default reductions</A></B></DT> | |
387 <DD> A "default reduction" is a parser action which may be used in | |
388 your <A HREF="#Parser">parser</A> in any state which has precisely one <A HREF="#CompletedRule">completed rule</A>. | |
389 <P> | |
390 If a given parser state has, among its <A HREF="#CharacteristicRule">characteristic rules</A>, exactly one | |
391 completed rule, it is usually faster to reduce it on any input than to check | |
392 specifically for correct input before reducing it. The only time this default | |
393 reduction causes trouble is in the event of a syntax error. In this situation | |
394 you may get an erroneous reduction. Normally when you are parsing a file, this | |
395 is inconsequential because you are not going to continue performing | |
396 <A HREF="#SemanticAction">semantic actions</A> in the presence of | |
397 errors. But, if you are using your parser to handle real-time | |
398 interactive input, you have to be able to continue semantic | |
399 processing after notifying your user that he has entered erroneous | |
400 input. In this case you would want default reductions to have been | |
401 turned off so that <A HREF="#Production">production</A>s are reduced | |
402 only when there is correct input. | |
403 | |
404 </P></DD> | |
405 | |
406 <DT><B><A NAME="SetDifference">Difference of two sets</A></B></DT> | |
407 <DD> In set theory, the difference of two sets, A and B, is defined to be the | |
408 set of all elements of A that are not elements of B. In an AnaGram syntax file, | |
409 you represent the difference of two <A HREF="#CharacterSets">character sets</A> by using the "-" | |
410 operator. Thus the difference of A and B is A - B. The difference operator is | |
411 <A HREF="#Associativity">left associative</A>. | |
412 <P></P> | |
413 </DD> | |
414 | |
415 <DT><B><A NAME="ErrorAction">Error Action</A></B></DT> | |
416 <DD> The error action is one of the four <A HREF="#ParserAction">actions</A> of a traditional <A HREF="#ParsingEngine">parsing | |
417 engine</A>. The error action is performed when the <A HREF="#Parser">parser</A> has encountered an input | |
418 <A HREF="#Token">token</A> which is not admissible in the current state. The further behavior of a | |
419 traditional parser is undefined. | |
420 <P></P> | |
421 </DD> | |
422 | |
423 <DT><B><A NAME="EventDriven">Event Driven Parser</A></B></DT> | |
424 <DD> | |
425 An <I>event driven</I> <A HREF="#Parser">parser</A> is one in which | |
426 the relationship between the host program and the parser is turned | |
427 inside out. In a conventional parser, the host program calls the | |
428 parser, the parser analyzes the complete input text and returns to | |
429 the host program only when it has finished with the entire input. | |
430 <P>In an event driven parser, the parser does not read its input | |
431 directly from a file or from memory. Instead, the host program, after | |
432 initializing the parser, calls it once for each input token. Each | |
433 time the parser is called, it updates its state | |
434 appropriately, calls any <A HREF="#ReductionProcedure">reduction | |
435 procedures</A> that need to be called and finally when it needs more | |
436 input, returns to the host program. The effect is that parsing can | |
437 occur in parallel with other processing performed by the host | |
438 program. This technique is especially useful in situations where the | |
439 token stream to be parsed is developed on the fly, as when using <A | |
440 HREF="#LexicalScanner">lexical scanners</A>, for instance. | |
441 <P></P> | |
442 </DD> | |
443 | |
444 | |
445 <DT><B><A NAME="ExpansionRule">Expansion Rule</A></B></DT> | |
446 <DD> In analyzing a <A HREF = "#Grammar">grammar</A>, we are often interested in the full range of input | |
447 that can be expected at a certain point. The expansion of a <A HREF="#Token">token</A> or state shows | |
448 us all the expected input. An expansion yields a set of | |
449 <A HREF="#MarkedRule">marked rules</A>. Looking | |
450 at the marked token shows us what input to expect. | |
451 <P> | |
452 The set of expansion rules of a <A HREF="#Nonterminal">nonterminal token</A> shows all the expected input | |
453 that can occur whenever the token appears in the grammar. The set consists of | |
454 all the <A HREF="#GrammarRule">grammar rules</A> produced by the token, plus all the rules produced by the | |
455 first token of any rule in the set. The marked tokens for the expansion rules of a | |
456 token are always the first element in the rule. | |
457 </P><P> | |
458 The expansion of a state consists of its | |
459 <A HREF="#CharacteristicRule">characteristic rules</A> plus the expansion | |
460 rules of the marked token in each characteristic rule. | |
461 </P> | |
462 </DD> | |
463 | |
464 | |
465 <DT><B><A NAME="Grammar">Grammar</A></B></DT> | |
466 <DD> Traditionally, a grammar is a set of <A HREF="#Production">productions</A> which taken together | |
467 specify precisely a set of acceptable input sequences in terms of an abstract | |
468 set of <A HREF="#Terminal">terminal tokens</A>. The set of acceptable input sequences is often called | |
469 the "language" defined by the grammar. | |
470 <P> | |
471 In AnaGram, the term grammar also includes configuration segments, definitions | |
472 of <A HREF="#CharacterSets">character sets</A>, <A HREF="#VirtualProduction">virtual productions</A>, etc. that augment the collection of | |
473 productions. A grammar is often called a "syntax", and the <A HREF="#GrammarRule">rules of | |
474 the grammar</A> are often called syntactic rules. | |
475 </P> | |
476 <P></P> | |
477 </DD> | |
478 | |
479 <DT><B><A NAME="GrammarAnalysis">Grammar Analysis</A></B></DT> | |
480 <DD> The major function of AnaGram is the analysis of <A HREF = "#ContextFreeGrammar">context free grammars</A> | |
481 written in a particular variant of <A HREF="#BNF">Backus-Naur Form</A>. | |
482 <P> | |
483 The analysis of a <A HREF="#Grammar">grammar</A> proceeds in four stages. In the first, the input | |
484 grammar is analyzed and a number of tables are built which describe all of the | |
485 <A HREF="#Production">productions</A> and components of the grammar. | |
486 </P><P> | |
487 In the second stage, AnaGram analyzes all of the <A HREF="#CharacterSets">character sets</A> defined in the | |
488 grammar, and where necessary, defines auxiliary <A HREF="#Token">tokens</A> and productions. | |
489 </P><P> | |
490 In the third stage, AnaGram identifies all of the states of the <A HREF="#Parser">parser</A> and | |
491 builds the go-to table for the parser. | |
492 </P><P> | |
493 In the fourth stage, Anagram identifies <A HREF="#ReductionToken">reduction tokens</A> for each completed | |
494 <A HREF="#GrammarRule">grammar rule</A> in each state and checks for <A HREF="#Conflict">conflicts</A>. | |
495 </P> | |
496 <P></P> | |
497 </DD> | |
498 | |
499 <DT><B><A NAME="GrammarRule">Grammar Rule</A></B></DT> | |
500 <DD> A "grammar rule" is the right side of a <A HREF="#Production">production</A>. It consists | |
501 of a sequence of <A HREF="#RuleElement">rule elements</A>. Each rule element identifies some <A HREF="#Token">token</A>, which | |
502 can be either a <A HREF="#Terminal">terminal token</A> or <A HREF="#Nonterminal">nonterminal token</A>. It is "matched" | |
503 by a corresponding sequence of tokens in the input stream to the <A HREF="#Parser">parser</A>. The | |
504 left side of the production identifies one or more nonterminal tokens, or | |
505 <A HREF="#ReductionToken">reduction tokens</A>, to which the rule reduces when matched. If there is more than | |
506 one reduction token, there should be a <A HREF="#ReductionProcedure">reduction procedure</A> to make the choice. | |
507 <P></P> | |
508 </DD> | |
509 | |
510 <DT><B><A NAME="GrammarToken">Grammar Token</A></B></DT> | |
511 <DD> The grammar <A HREF="#Token">token</A> is the token which represents the "top level" | |
512 in your grammar. Some people refer to it as the "goal" or "goal | |
513 token" and others as the "start token". Whatever it is called, it | |
514 is the single token which describes the complete input to your <A HREF="#Parser">parser</A>. | |
515 <P></P> | |
516 </DD> | |
517 | |
518 <DT><B><A NAME="SetIntersection">Intersection</A></B></DT> | |
519 <DD> In set theory, the intersection of two sets, A and B, is defined to be the | |
520 set of all elements of A which are also elements of B. In an AnaGram syntax | |
521 file, the intersection of two <A HREF="#CharacterSets">character sets</A> is represented with the "&" | |
522 operator. The intersection operator has lower <A HREF="#Precedence">precedence</A> than the complement | |
523 operator, but higher precedence than the <A HREF="#SetUnion">union</A> and <A HREF="#SetDifference">difference</A> operators. The | |
524 intersection operator is <A HREF="#Associativity">left associative</A>. | |
525 <P></P> | |
526 </DD> | |
527 | |
528 <DT><B><A NAME="LALRParsing">LALR parsing</A></B></DT> | |
529 <DD> LALR, or "lookahead LR" parsing, a slightly less powerful | |
530 variant of <A HREF="#LRParsing">LR parsing</A>, produces much smaller parsing tables. Thus an LALR | |
531 <A HREF="#Parser">parser</A> has very significant advantages in size over its LR counterpart. It is | |
532 possible to construct <A HREF = "#Grammar">grammar</A>s which can be parsed by an LR parser but cannot be | |
533 parsed by an LALR parser. That is, the LALR parser will find these grammars to | |
534 be ambiguous and the LR parser will not. In practice, however, such grammars | |
535 seem to be rare and the ambiguities can usually be eliminated by minor revisions. | |
536 <P> | |
537 AnaGram generates LALR parsers and, in addition, uses LALR parsing to interpret | |
538 syntax files. | |
539 </P><P></P> | |
540 </DD> | |
541 | |
542 <DT><B><A NAME="LexicalScanner">Lexical Scanner</A></B></DT> | |
543 <DD> A lexical scanner is a program or procedure used as a preprocessor for a | |
544 <A HREF="#Parser">parser</A>. It scans input characters and lumps them together into <A HREF="#Token">tokens</A>. Many | |
545 systems for generating parsers, most notably YACC, can scarcely be used without | |
546 a lexical scanner. | |
547 <P> | |
548 AnaGram, although it can support the use of a lexical scanner, does not need a | |
549 lexical scanner for most practical problems. AnaGram avoids the need for a | |
550 lexical scanner by using <A HREF="#CharacterSets">character sets</A> and disregard and lexeme statements. | |
551 </P><P> | |
552 If your problem does in fact need a lexical scanner, you can use AnaGram itself | |
553 to write it, so you don't need to know different languages for the scanner and | |
554 for the parser. | |
555 </P> | |
556 <P></P> | |
557 </DD> | |
558 | |
559 <DT><B><A NAME="Lookahead">Lookahead Token</A></B></DT> | |
560 <DD> The current input <A HREF="#Token">token</A> to a <A HREF="#Parser">parser</A> is often called the "lookahead" | |
561 token. In many situations, it is not shifted into the input buffer immediately, | |
562 but acts like a catalyst to cause a number of rules to be reduced before it is | |
563 eventually shifted in. | |
564 <P></P> | |
565 </DD> | |
566 | |
567 <DT><B><A NAME="LRParsing">LR Parsing</A></B></DT> | |
568 <DD> An LR(k) <A HREF="#Parser">parser</A> is a type of deterministic bottom-up shift-reduce parser | |
569 using at most k lookahead symbols to determine its next <A HREF="#ParserAction">action</A> at any point in | |
570 the parse. If (k) is omitted, then k is assumed to be 1. Discussion of the | |
571 technical details of LR(k) parsing is beyond the scope of this manual. The | |
572 interested reader may consult any of a variety of works covering the subject. | |
573 <P> | |
574 AnaGram produces parsers which employ a variant of LR parsing known as LALR | |
575 parsing. It is not necessary to understand the details of LR and LALR parsing | |
576 theory in order to use AnaGram. | |
577 </P> | |
578 <P></P> | |
579 </DD> | |
580 | |
581 <DT><B><A NAME="MarkedRule">Marked Rule</A></B></DT> | |
582 <DD> A "marked rule" is a <A HREF="#GrammarRule">grammar rule</A> | |
583 together with a "marked token" | |
584 that indicates how much of the rule has already been matched and what input | |
585 should be expected, starting with the marked token, if the remainder of the rule is to be matched. Each <A HREF="#Parser">parser</A> | |
586 state is defined by a small number of <A HREF="#CharacteristicRule">characteristic rules</A> which indicate what | |
587 matches have already been made and what input can be expected. Note that when a | |
588 state has more than one characteristic rule, any two characteristic rules with | |
589 the same number of <A HREF="#Token">tokens</A> to the left of the marked token | |
590 match identically up to the | |
591 marked token. If one rule has fewer tokens to the left than the other, all of its tokens | |
592 preceding the marked token match the corresponding tokens of the longer rule | |
593 exactly. | |
594 <P> | |
595 When marked rules are displayed in AnaGram windows, the marked token is displayed in | |
596 a distinctive font which the developer can select. The default is italic. | |
597 </P> | |
598 <P></P> | |
599 </DD> | |
600 | |
601 <DT><B><A NAME="NonAssociative">Non-associative</A></B></DT> | |
602 <DD> A <A HREF="#BinaryOperator">binary operator</A>, say #, is said to be non-associative if the sequence x | |
603 # y # z is not permissible. If this sequence is to be interpreted as (x#y)#z, | |
604 the operator is said to be <B>left associative</B>. If the sequence is to be | |
605 interpreted as x#(y#z), the operator is said to be <B>right associative</B>. (See | |
606 <A HREF="#Associativity">associativity</A>.) | |
607 <P></P> | |
608 </DD> | |
609 | |
610 <DT><B><A NAME="Nonterminal">Nonterminal Token</A></B></DT> | |
611 <DD> A "nonterminal token" is a <A HREF="#Token">token</A> which results from reducing the | |
612 sequence of tokens which match a <A HREF="#GrammarRule">grammar rule</A> and replacing them with the | |
613 appropriate <A HREF="#ReductionToken">reduction token</A>. Nonterminal tokens are to be distinguished from | |
614 <A HREF="#Terminal">terminal tokens</A> or input tokens. | |
615 <P></P> | |
616 </DD> | |
617 | |
618 <DT><B><A NAME="NullProduction">Null Production</A></B></DT> | |
619 <DD> A "null production" is one that has no <A HREF="#Token">tokens</A> on the right side | |
620 whatsoever. Null <A HREF="#Production">production</A>s essentially are identified by the first following | |
621 input token. Null productions are extremely convenient syntactic elements when | |
622 you wish to make some input optional. For example, suppose that you wish to | |
623 allow an optional semicolon at some point in your <A HREF = "#Grammar">grammar</A>. You could write the | |
624 following pair of productions: | |
625 <PRE> | |
626 optional semicolon -> | ';' | |
627 </PRE> Note that a null production can never follow a '|'. The above could also | |
628 be written on multiple lines thus: | |
629 <PRE> | |
630 optional semicolon | |
631 -> | |
632 -> ';' | |
633 </PRE> You can always rewrite your grammar to eliminate null productions if you | |
634 wish, but you usually pay a price in conciseness and clarity. Sometimes, | |
635 however, it is necessary to do such a rewrite in order to avoid <A HREF="#Conflict">conflicts</A>, to | |
636 which null productions are especially prone. | |
637 <P> | |
638 If you have a null production with no <A HREF="#ReductionProcedure">reduction procedure</A> specified, your <A HREF="#Parser">parser</A> | |
639 will automatically assign the value zero to the <A HREF="#ReductionToken">reduction token</A>. | |
640 </P><P> | |
641 Null productions can be generated by <A HREF="#VirtualProduction">virtual productions</A>. | |
642 </P> | |
643 <P></P> | |
644 </DD> | |
645 | |
646 <DT><B><A NAME="ParameterAssignment">Parameter Assignment</A></B></DT> | |
647 | |
648 <DD>In any <A HREF="#GrammarRule"> grammar rule</A>, the | |
649 <A HREF="#SemanticValue">semantic value</A> of any | |
650 <A HREF="#RuleElement">rule element</A> may be passed to a | |
651 <A HREF="#ReductionProcedure">reduction procedure</A> | |
652 by means of a parameter assignment. In AnaGram this is particularly | |
653 convenient since you can use an ordinary variable name for the | |
654 parameter. Simply follow the | |
655 rule element with a colon and a C variable name. The C | |
656 variable name can then be used in the reduction | |
657 procedure to reference the semantic value of the token | |
658 it is attached to. AnaGram will automatically provide | |
659 necessary declarations. | |
660 | |
661 Here are some examples of rule elements with parameter | |
662 assignments: | |
663 <PRE> | |
664 '0-9':d | |
665 integer:n | |
666 expression:x | |
667 declaration : declaration_descriptor | |
668 </PRE> | |
669 </DD> | |
670 | |
671 <DT><B><A NAME="Parser">Parser</A></B></DT> | |
672 <DD> A parser is a program, or more likely a procedure within a program, which | |
673 scans a sequence of input characters or input <A HREF="#Token">tokens</A> and accumulates them in an | |
674 input buffer or stack as determined by a set of <A HREF="#Production">productions</A> which constitute a | |
675 <A HREF = "#Grammar">grammar</A>. When the parser discovers a sequence of tokens as defined by a <A HREF="#GrammarRule">grammar | |
676 rule</A>, or right side of a production, it "reduces" the sequence to a | |
677 single <A HREF="#ReductionToken">reduction token</A> as defined by the left side of the <A HREF="#GrammarRule">grammar rule</A>. This | |
678 <A HREF="#Nonterminal">nonterminal token</A> now replaces the tokens which matched the grammar rule and the | |
679 search for matches continues. If an input token is encountered which will not | |
680 yield a match for any rule, it is considered a syntax error and some kind of | |
681 error recovery may be required to continue. If a match, or reduction action, | |
682 yields the grammar token, sometimes called the goal token or start token, the | |
683 parser deems its work complete and returns to whatever procedure may have called | |
684 it. | |
685 <P> | |
686 Tokens may have <A HREF="#SemanticValue">semantic values</A>. If the | |
687 input values configuration switch is on, your parser will expect | |
688 semantic values to be provided by the input process along with the | |
689 token identification code. If the input values switch is off, your | |
690 parser will take the ascii value of the input character, that is, the | |
691 actual input code, as the value of the character. When the parser <A | |
692 HREF="#ReduceAction">reduces</A> a <A | |
693 HREF="#Production">production</A>, it can call a <A | |
694 HREF="#ReductionProcedure">reduction procedure</A> or <A | |
695 HREF="#SemanticAction">semantic action</A> to analyze the values of | |
696 the constituent tokens. This reduction procedure can then return a | |
697 value which characterizes the reduced token. | |
698 </P> | |
699 <P></P> | |
700 </DD> | |
701 | |
702 <DT><B><A NAME="ParserGenerator">Parser Generator</A></B></DT> | |
703 <DD> | |
704 A parser generator, such as AnaGram, is a program that | |
705 converts a <A HREF="#Grammar">grammar</A>, a rule-based description of the | |
706 input to a program, into a conventional, procedural | |
707 module called a <A HREF="#Parser">parser</A>. The parsers AnaGram generates | |
708 are simple C modules which can be compiled on almost | |
709 any platform. AnaGram parsers are also compatible with | |
710 C++. | |
711 <P></P> | |
712 </DD> | |
713 | |
714 <DT><B><A NAME="ParserStateStack">Parser State Stack</A></B></DT> | |
715 <DD> | |
716 The parser state stack is a stack maintained by an AnaGram | |
717 parser and which is an integral part of the parsing | |
718 process. At any point in the parse of your input | |
719 stream, the parser state stack provides a list of the | |
720 states which were traversed in order to arrive at the | |
721 current state. | |
722 | |
723 <P></P> | |
724 </DD> | |
725 | |
726 | |
727 <DT><B><A NAME="ParserValueStack">Parser Value Stack</A></B></DT> | |
728 <DD>In parallel with the <A HREF="#ParserStateStack">parser state stack</A>, | |
729 an AnaGram <A HREF="#Parser"> parser</A> | |
730 maintains a <I>value stack</I> each entry of | |
731 which corresponds to the <A HREF="#SemanticValue">semantic value</A> of the token | |
732 identified at that state. Since the semantic values of | |
733 different tokens might well have different data types, | |
734 AnaGram gives you the opportunity, in your syntax | |
735 file, to define the data type for any token. AnaGram | |
736 then builds a typedef statement creating a data type | |
737 which is a union of the all the types you have defined. | |
738 AnaGram creates the name for this data type by | |
739 appending "_vs_type" to the name of the parser. AnaGram uses | |
740 this data type to define the value stack. | |
741 <P></P> | |
742 </DD> | |
743 | |
744 | |
745 <DT><B><A NAME="ParsingEngine">Parsing Engine</A></B></DT> | |
746 <DD> A <A HREF="#Parser">parser</A> consists of three basic components: A set of syntax tables, a set | |
747 of <A HREF="#ReductionProcedure">reduction procedures</A> and a parsing engine. The parsing engine is the body of | |
748 code that interprets the parsing table, invokes input functions, and calls the | |
749 reduction procedures. The Build Parser command configures a parsing engine | |
750 according to the implicit requirements of the syntax specification and according | |
751 to the explicit values of the configuration parameters. | |
752 <P> | |
753 The parsing engine itself is a simple automaton, characterized by a set of | |
754 states and a set of inputs. The inputs are the <A HREF="#Token">tokens</A> of your <A HREF = "#Grammar">grammar</A>. Each | |
755 state is represented by a list of tokens which are admissible in that state and | |
756 for each token an <A HREF="#ParserAction">action</A> to perform and a parameter which further defines the | |
757 action. In a traditional <A HREF="#LALRParsing">LALR parser</A>, there are only four <A HREF="#ParserAction">actions</A>: the <A HREF="#ShiftAction">shift | |
758 action</A>, the <A HREF="#ReduceAction">reduce action</A>, the <A HREF="#AcceptAction">accept action</A> and the <A HREF="#ErrorAction">error action</A>. AnaGram, in | |
759 doing its <A HREF="#GrammarAnalysis">grammar analysis</A>, identifies a number of special cases, and creates a | |
760 number of extra actions which make for faster processing, but which can be | |
761 represented as combinations of these primitive actions. | |
762 </P><P> | |
763 When a shift action is performed, the current state number is pushed onto the | |
764 parser state stack and the new state number is determined by the current state | |
765 number and the current lookahead token. Different tokens cause different new | |
766 states. | |
767 </P><P> | |
768 When a reduce action is performed, the length of the rule being reduced is | |
769 subtracted from the parser stack index and the new state number is read from the | |
770 top of the parser state stack. The <A HREF="#ReductionToken">reduction token</A> for the rule being reduced is | |
771 then used as an input token. | |
772 </P><P> | |
773 Each state in the grammar, with the exception of state zero, has a | |
774 <A HREF="#CharacteristicToken">characteristic token</A> which must have been recognized in order to jump to that | |
775 state. Therefore, the parser state stack, which is essentially a list of state | |
776 numbers, can also be thought of as a list of token numbers. This is the list of | |
777 tokens that have been seen so far in the parse of your input stream. Some of | |
778 these tokens, of course, may be <A HREF="#Nonterminal">nonterminal tokens</A> which have resulted from | |
779 reducing other sequences of tokens. | |
780 </P> | |
781 <P></P> | |
782 </DD> | |
783 | |
784 <DT><B><A NAME="SetPartition">Partition</A></B></DT> | |
785 <DD> If you use <A HREF="#CharacterSets">character sets</A> in your <A HREF = "#Grammar">grammar</A>, AnaGram will compute a "partition" | |
786 of the <A HREF="#Universe">character universe</A>. This partition is a collection of non-overlapping | |
787 character sets such that every one of the sets you have defined can be written | |
788 as a <A HREF="#SetUnion">union</A> of partition sets. Each partition set is identified by a unique | |
789 reference number called the partition set number. | |
790 <P> | |
791 Each partition set is assigned a unique <A HREF="#Token">token</A>. If one of your character sets | |
792 requires more than one partition set to represent it, AnaGram will create | |
793 appropriate <A HREF="#Production">productions</A> and add them to your grammar so your <A HREF="#Parser">parser</A> can make the | |
794 necessary distinctions. | |
795 </P> | |
796 <P></P> | |
797 </DD> | |
798 | |
799 <DT><B><A NAME="Precedence">Precedence</A></B></DT> | |
800 <DD> "Precedence" is an attribute of <A HREF="#BinaryOperator">binary operators</A> and | |
801 <A HREF = "#UnaryOperator">unary | |
802 operators</A> which can be used to resolve <A HREF="#Conflict">conflicts</A>. Operator precedence is also | |
803 referred to as <A HREF = "#Binding">binding</A>. Suppose that # and @ are two binary operators. The | |
804 question is how to interpret x # y @ z. If # has higher precedence than @, it is | |
805 interpreted as (x#y)@z. On the other hand if # has lower precedence than @, it | |
806 would be x#(y@z). If # and @ have the same precedence then the question must be | |
807 resolved by <A HREF="#Associativity">associativity</A>. | |
808 <P> | |
809 Note that all operators at the same precedence level must have the same | |
810 associativity. | |
811 </P><P> | |
812 The situation is somewhat simpler for unary operators. If # and @ were both | |
813 prefix operators, or both suffix operators, there would be no ambiguity in | |
814 interpretation, since neither # @ x and x # @ offer more than one possible | |
815 interpretation. The only difficulty arises when both a prefix and a suffix | |
816 operator are applied to the same operand. | |
817 </P><P> | |
818 Suppose that # is a prefix operator and @ is a suffix operator. The question | |
819 then is how to interpret # x @. If # has higher precedence than @, it would be | |
820 interpreted as (# x) @. On the other hand, if # has lower precedence than @, it | |
821 would be interpreted as # (x @). If # and @ have the same precedence then the | |
822 question must be resolved by associativity. | |
823 </P> | |
824 <P></P> | |
825 </DD> | |
826 | |
827 <DT><B><A NAME="Production">Production</A></B></DT> | |
828 <DD> Productions are the mechanism used to describe how complex input | |
829 structures are built up out of simpler ones. Each production has a left side and | |
830 a right side. The right side, or <A HREF="#GrammarRule">grammar rule</A>, | |
831 is a sequence of <A HREF="#RuleElement">rule elements</A>, | |
832 which may represent either <A HREF="#Terminal">terminal tokens</A> or | |
833 <A HREF="#Nonterminal">nonterminal tokens</A>. The left side | |
834 is a list of <A HREF="#ReductionToken">reduction tokens</A>. | |
835 In most cases there would be only a single | |
836 reduction token. Productions with more than one <A HREF="#Token">token</A> | |
837 on the left side are | |
838 called <A HREF = "#SemanticallyDetermined">semantically determined productions</A>. | |
839 <p>The | |
840 "<CODE>-></CODE>" symbol is used | |
841 to separate the left side from the right side. | |
842 | |
843 If you have several | |
844 productions with the same left hand side, you can avoid | |
845 rewriting the left hand side either by using '|' or by | |
846 using another "->". | |
847 <p> | |
848 | |
849 A <A HREF="#NullProduction"> null production</A>, or empty right hand side, cannot | |
850 follow a '|'. | |
851 <P> | |
852 Productions may be written thus: | |
853 <pre> | |
854 name | |
855 -> letter | |
856 -> name, digit | |
857 </pre> | |
858 This could also be written | |
859 <pre> | |
860 name -> letter | name, digit | |
861 </pre> | |
862 In order to accommodate semantic analysis of the data, | |
863 you may attach to any grammar rule a <A HREF="#ReductionProcedure">reduction | |
864 procedure</A> which will be executed when the rule is | |
865 identified. Each token may have a <A HREF="#SemanticValue"> semantic value</A>. By | |
866 using <A HREF="#ParameterAssignment"> parameter assignments</A>, you may provide the | |
867 reduction procedure with access to the semantic values of | |
868 tokens that comprise the grammar rule. When it finishes, the | |
869 reduction procedure may return a value which will be | |
870 saved on the <A HREF="#ParserValueStack"> parser value stack</A> as the semantic value of the | |
871 <A HREF="#ReductionToken">reduction token</A>. | |
872 | |
873 <P></P> | |
874 </DD> | |
875 | |
876 <DT><B><A NAME="ReduceAction">Reduce Action</A></B></DT> | |
877 <DD> The reduce action is one of the four <A HREF="#ParserAction">actions</A> of a traditional parsing | |
878 engine. The reduce action is performed when the <A HREF="#Parser">parser</A> has succeeded in matching | |
879 all the elements of a <A HREF="#GrammarRule">grammar rule</A> and the next input <A HREF="#Token">token</A> is not erroneous. | |
880 Reducing the grammar rule amounts to subtracting the length of the rule from the | |
881 parser stack index, identifying the <A HREF="#ReductionToken">reduction token</A>, stacking its | |
882 <A HREF="#SemanticValue">semantic value</A> | |
883 and then doing a <A HREF="#ShiftAction">shift action</A> with the reduction token as though it had been | |
884 input directly. | |
885 <P></P> | |
886 </DD> | |
887 | |
888 <DT><B><A NAME="ReduceReduce">Reduce-Reduce Conflict</A></B></DT> | |
889 <DD> A <A HREF = "#Grammar">grammar</A> has a "reduce-reduce" conflict at some state if a | |
890 single <A HREF="#Token">token</A> turns out to be a <A HREF="#ReducingToken">reducing token</A> for more than one <A HREF="#CompletedRule">completed rule</A>. | |
891 <P></P> | |
892 </DD> | |
893 | |
894 <DT><B><A NAME="ReducingToken">Reducing Token</A></B></DT> | |
895 <DD> In a state with more than one <A HREF="#CompletedRule">completed rule</A>, your <A HREF="#Parser">parser</A> must be able to | |
896 determine which one was actually found. AnaGram deals with this problem by | |
897 looking at all the states the parser will branch to once each rule is reduced. | |
898 The acceptable input <A HREF="#Token">tokens</A> for those states are the "reducing tokens" | |
899 for the completed rules in the state under investigation. If there is a single | |
900 token which is a reducing token for more than one rule, then the <A HREF = "#Grammar">grammar</A> is said | |
901 to have a <A HREF="#ReduceReduce">reduce-reduce conflict</A> at that state. If in a particular state there | |
902 is both a <A HREF="#ShiftAction">shift action</A> and a <A HREF="#ReduceAction">reduce action</A> for the same token the grammar is | |
903 said to have a <A HREF="#ShiftReduce">shift-reduce conflict</A> in that state. | |
904 <P> | |
905 A reducing token is not the same as a <A HREF="#ReductionToken">reduction token</A>. | |
906 </P> | |
907 <P></P> | |
908 </DD> | |
909 | |
910 <DT><B><A NAME="ReductionProcedure">Reduction Procedure</A></B></DT> | |
911 <DD> A "reduction procedure" is a function you write which your | |
912 <A HREF="#Parser">parser</A> executes when it has identified the <A HREF="#GrammarRule">grammar rule</A> to which the reduction | |
913 procedure is attached in your <A HREF = "#Grammar">grammar</A>. There are two formats for reduction | |
914 procedures, depending on the size and complexity of the procedure. | |
915 <P></P> | |
916 </DD> | |
917 | |
918 <DT><B><A NAME="ReductionToken">Reduction Token</A></B></DT> | |
919 <DD> A <A HREF="#Token">token</A> which appears on the left side of a <A HREF="#Production">production</A> is called a | |
920 reduction token. It is so called because when the <A HREF="#GrammarRule">grammar rule</A> on the right side | |
921 of the production is matched in the input stream, your <A HREF="#Parser">parser</A> will <A HREF="#ReduceAction">"reduce"</A> | |
922 the sequence of tokens which matches the rule by replacing the sequence of | |
923 tokens with the reduction token. Note that, if more than one reduction token is | |
924 specified, your <A HREF="#ReductionProcedure">reduction procedure</A> should choose the exact one. If it does not, | |
925 your parser will use the leftmost syntactically correct one in the list as the default. | |
926 <P> | |
927 A reduction token is not the same as a <A HREF="#ReducingToken">reducing token</A>. | |
928 </P> | |
929 <P></P> | |
930 </DD> | |
931 | |
932 <DT><B><A NAME="Resynchronization">Resynchronization</A></B></DT> | |
933 <DD> "Resynchronization" is the process of getting your <A HREF="#Parser">parser</A> back | |
934 in step with its input after encountering a syntax error. As such, it is one | |
935 method of error recovery. Of course, you would resynchronize only if it | |
936 is necessary to continue after the error. There are several options available | |
937 when using AnaGram. You could use the auto resynch option, which causes AnaGram | |
938 to incorporate an automatic resynchronizing procedure into your parser, or you | |
939 could use the error resynch option, which is similar to the technique used by | |
940 YACC programmers. | |
941 </DD> | |
942 | |
943 <P></P> | |
944 <DT><B><A NAME="RuleElement">Rule Element</A></B></DT> | |
945 <DD> A <A HREF="#GrammarRule">grammar rule</A> is a list of "rule elements", separated by | |
946 commas. Rule elements may be <A HREF="#Token">token</A> names, <A HREF="#CharacterSets">character sets</A>, keywords, immediate | |
947 actions, or <A HREF="#VirtualProduction">virtual productions</A>. When AnaGram encounters a rule element for | |
948 which no token presently exists, it creates one. | |
949 <P></P> | |
950 </DD> | |
951 | |
952 <DT><B><A NAME="SemanticAction">Semantic Action</A></B></DT> | |
953 <DD> | |
954 A semantic action is a piece of C or C++ code that is executed when | |
955 a <A HREF="#GrammarRule">grammar rule</A> has been identified by a | |
956 <A HREF="#Parser">parser</A>. Semantic actions are also called | |
957 <A HREF="#ReductionProcedure">reduction procedures</A>, since they | |
958 are executed when a grammar rule is <A | |
959 HREF="#ReduceAction">reduced</A> to the token on the left side of the | |
960 <A HREF="#Production">production.</A> | |
961 <P></P> | |
962 </DD> | |
963 | |
964 <DT><B><A NAME="SemanticValue">Semantic Value</A></B></DT> | |
965 <DD> | |
966 <A HREF="#Token">Tokens</A>, whether <A HREF="#Terminal">terminal</A> | |
967 or <A HREF="#Nonterminal">nonterminal</A>, may have a semantic value. | |
968 In the case of terminal tokens, this may be a value assigned by the | |
969 <A HREF="#LexicalScanner">lexical scanner</A> or, if the parser is | |
970 using direct character input, it will be the ascii value of the | |
971 character itself. The values of nonterminal tokens are created by | |
972 <A HREF="#ReductionProcedure">reduction procedures</A>. As a parse | |
973 progresses, token values are <A HREF="#ShiftAction">shifted</A> | |
974 onto the stack, so that in a reduction procedure, the values of the | |
975 tokens that comprise the <A HREF="#GrammarRule">grammar rule</A> that | |
976 is being <A HREF="#ReduceAction">reduced</A> are available | |
977 for inspection. | |
978 <P></P> | |
979 </DD> | |
980 | |
981 <DT><B><A NAME="SemanticallyDetermined">Semantically Determined Production</A></B></DT> | |
982 <DD> A "semantically determined production" is one which has more | |
983 than one <A HREF="#ReductionToken">reduction token</A> specified on the left side of the production. You would | |
984 write such a <A HREF="#Production">production</A> when the reduction tokens are syntactically | |
985 indistinguishable. The <A HREF="#ReductionProcedure">reduction procedure</A> may then specify which of the listed | |
986 reduction tokens the <A HREF="#GrammarRule">grammar rule</A> is to reduce to based on semantic | |
987 considerations. If there is no reduction procedure, or the reduction procedure | |
988 does not specify a reduction token, the <A HREF="#Parser">parser</A> will use the leftmost syntactically correct token. | |
989 | |
990 <P></P> | |
991 </DD> | |
992 | |
993 <DT><B><A NAME="SetExpression">Set Expression</A></B></DT> | |
994 <DD> A set expression is an algebraic expression used to define a <A HREF="#CharacterSets">character set</A> | |
995 in terms of individual characters, ranges of characters, and/or other sets of | |
996 characters as constructed using <A HREF="#SetComplement">complements</A>, <A HREF="#SetUnion">unions</A>, <A HREF="#SetIntersection">intersections</A>, and | |
997 <A HREF="#SetDifference">differences</A>. | |
998 <P></P> | |
999 </DD> | |
1000 | |
1001 <DT><B><A NAME="ShiftAction">Shift Action</A></B></DT> | |
1002 <DD> The shift action is one of the four <A HREF="#ParserAction">actions</A> of a traditional parsing | |
1003 engine. The shift action is performed when the input <A HREF="#Token">token</A> matches one of the | |
1004 acceptable input tokens for the current parser state. The | |
1005 <A HREF="#SemanticValue">semantic value</A> of the | |
1006 token and the current state number are stacked, the parser stack index is | |
1007 incremented and the state number is set to a value determined by the previous | |
1008 state and the input token. | |
1009 <P></P> | |
1010 </DD> | |
1011 | |
1012 <DT><B><A NAME="ShiftReduce">Shift-Reduce Conflict</A></B></DT> | |
1013 <DD> A "shift-reduce" conflict occurs if in some state there is a | |
1014 single <A HREF="#Token">token</A> that should be shifted, because it is legitimate input for one of | |
1015 the rules of the state, but should also be used to reduce some other rule | |
1016 because it is a <A HREF="#ReducingToken">reducing token</A> for that rule. | |
1017 <P></P> | |
1018 </DD> | |
1019 <DT><B><A NAME="SyntaxDirectedParsing">Syntax Directed Parsing</A></B></DT> | |
1020 <DD> | |
1021 <P>Syntax directed parsing, or formal parsing, is an | |
1022 approach to building <A HREF="#Parser">parsers</A> based on formal language | |
1023 theory. Given a suitable description of a language, | |
1024 called a <A HREF="#Grammar">grammar</A>, there are algorithms which can be | |
1025 used to create parsers for the language automatically. | |
1026 In this context, the set of all possible inputs to a | |
1027 program may be considered to constitute a language, and | |
1028 the rules for formulating the input to the program | |
1029 constitute the grammar for the language. | |
1030 | |
1031 <P>The parsers built from a grammar have the advantage | |
1032 that they can recognize any input that conforms to the | |
1033 rules, and can reject as erroneous any input that fails | |
1034 to conform. | |
1035 | |
1036 <P>Since the program logic necessary to parse input is | |
1037 often extremely intricate, programs which use formal | |
1038 parsing are usually much more reliable than those built | |
1039 by hand. They are also much easier to maintain, since | |
1040 it is much easier to modify a grammar specification | |
1041 than it is to modify complex program logic. | |
1042 | |
1043 <P></P> | |
1044 </DD> | |
1045 | |
1046 <DT><B><A NAME="Terminal">Terminal Token</A></B></DT> | |
1047 <DD> A "terminal token" is a <A HREF="#Token">token</A> which does not appear on the left | |
1048 side of a <A HREF="#Production">production</A>. It represents, therefore, a basic unit of input to your | |
1049 <A HREF="#Parser">parser</A>. If the input to your parser consists of ascii characters, you may define | |
1050 terminal tokens explicitly as ascii characters or as <A HREF="#CharacterSets">sets of ascii characters</A>. | |
1051 If you have an input procedure which produces numeric codes, you may define the | |
1052 terminal tokens directly in terms of these numeric codes. | |
1053 <P></P> | |
1054 </DD> | |
1055 | |
1056 <DT><B><A NAME="Token">Token</A></B></DT> | |
1057 <DD> Tokens are the units with which your <A HREF="#Parser">parser</A> works. There are two kinds of | |
1058 tokens: <A HREF="#Terminal">terminal tokens</A> and <A HREF="#Nonterminal">nonterminal tokens</A>. These latter are identified by | |
1059 the parser as sequences of tokens. The grouping of tokens into more complex | |
1060 tokens is governed by the <A HREF="#GrammarRule">grammar rules</A>, or <A HREF="#Production">productions</A> in your <A HREF = "#Grammar">grammar</A>. In your | |
1061 grammar, tokens may be denoted by token names, by <A HREF="#VirtualProduction">virtual productions</A>, by | |
1062 immediate actions, by explicit character representations, or by expressions | |
1063 which yield <A HREF="#CharacterSets">character sets</A>. | |
1064 <P> | |
1065 A "token" should be thought of more or less as being something like a | |
1066 cloakroom token. Imagine you had taken a kindergarten class to the museum and | |
1067 all the kids checked their coats. Each one gets a token. What is the probability | |
1068 of losing one? You take all of the tokens from the children, put them in a paper | |
1069 bag, tie it up, and check it. You get one token in return. We would say that the | |
1070 kids' coats represent the actual input to the system, their individual tokens | |
1071 are the <A HREF="#Terminal">terminal tokens</A>, and your one token which enables you to redeem the | |
1072 paper bag is a <A HREF="#Nonterminal">nonterminal token</A>. Nonterminal tokens are just single token | |
1073 numbers to represent the result of compacting a sequence of more primitive | |
1074 tokens. | |
1075 </P><P> | |
1076 Actually, the word "token" is commonly used to refer to a number of | |
1077 apparently different things. The tokens you use in a grammar are abstract. The | |
1078 tokens identified by a parser are concrete instances of the abstract tokens in | |
1079 the grammar. Furthermore, we have to identify tokens in some manner, so we have | |
1080 token names and token numbers with which to refer to particular tokens. As a | |
1081 result the word "token", in any particular context, may refer directly | |
1082 to either an abstract token, or a concrete instance of an abstract token, or may | |
1083 refer only indirectly to the token by means of a token name or token number. | |
1084 </P><P> | |
1085 As an example, consider a C compiler which has a lexical scanner to identify | |
1086 input tokens. According to Kernighan and Ritchie, there are six classes of C | |
1087 tokens: identifiers, keywords, constants, string literals, operators, and other | |
1088 separators. Consider a particular token, a hexadecimal constant, 0XF00. When the | |
1089 lexical scanner hands this over to the parser, it has to provide an | |
1090 identification code which says what kind of token it is, a hexadecimal constant, | |
1091 as well as the | |
1092 <A HREF="#SemanticValue">"semantic value"</A> of the token, | |
1093 3840. The identification code is usually an integer value, | |
1094 determined by the designer of the lexical scanner, which by agreement | |
1095 with the designer of the parser uniquely represents a hexadecimal | |
1096 constant. This identification code can be thought of as an external | |
1097 token number. | |
1098 | |
1099 </P><P> | |
1100 The grammar for the parser, on the other hand, has a token, "HEXconstant", | |
1101 let us say, to represent the abstract usage of hexadecimal constants in the C | |
1102 language. When AnaGram, or any other parser generator for that matter, analyzes | |
1103 the grammar, it assigns its own reference number to identify "HEXconstant". | |
1104 This reference number can be thought of as an internal token number to contrast | |
1105 it with the external token number provided by the lexical scanner. The parsers | |
1106 AnaGram builds contain a table, ag_ tcv, which provides a translation from the | |
1107 external to the internal token number. Both of these token numbers, of course, | |
1108 differ from the semantic value of the token, in this case, 3840. | |
1109 </P><P> | |
1110 Even when an AnaGram parser is using simple character input, the word "token" | |
1111 may represent several different things. Suppose a grammar has a token 'A-Z', | |
1112 which represents the set of upper case letters. There will be an internal token | |
1113 number, assigned by AnaGram, which identifies the (abstract) token in parser | |
1114 tables. The actual (concrete) input token to a parser will, however, be a single | |
1115 letter, say "Q". In this case, the ascii value of the character "Q", | |
1116 81, is used both as the external token number and as the semantic value of the | |
1117 token. | |
1118 </P><P></P> | |
1119 </DD> | |
1120 | |
1121 <DT><B><A NAME="UnaryOperator">Unary Operator</A></B></DT> | |
1122 <DD>A "unary operator" is a <A HREF="#Token">token</A> which represents | |
1123 some sort of operation which operates on a single entity to produce a new | |
1124 resultant entity. It is normally represented by a symbol which is written either | |
1125 preceding the operand or following the operand. In the first case it is called a | |
1126 "prefix operator" and in the second case a "suffix operator". | |
1127 </P><P></P> | |
1128 </DD> | |
1129 | |
1130 <DT><B><A NAME="SetUnion">Union</A></B></DT> | |
1131 <DD> | |
1132 The union of two sets is the set of all elements that are to be found in one or | |
1133 another of the two sets. In an AnaGram syntax file the union of two <A HREF="#CharacterSets">character | |
1134 sets</A> A and B is represented using the plus sign, as in A + B. The union operator | |
1135 has the same <A HREF="#Precedence">precedence</A> as the <A HREF="#SetDifference">difference</A> operator: lower than that of | |
1136 <A HREF="#SetIntersection">intersection</A> and <A HREF="#SetComplement">complement</A>. The union operator is <A HREF="#Associativity">left associative</A>. | |
1137 <P></P> | |
1138 </DD> | |
1139 | |
1140 <DT><B><A NAME="VirtualProduction">Virtual Production</A></B></DT> | |
1141 <DD> | |
1142 <P>Virtual productions are a special short hand | |
1143 representation of <A HREF="#GrammarRule">grammar rules</A> which can be used to | |
1144 indicate a choice of inputs. They are an important | |
1145 convenience, especially useful when you are first | |
1146 building a <A HREF="#Grammar">grammar</A>. | |
1147 | |
1148 <P>AnaGram rewrites virtual productions, so that when you | |
1149 look at the syntax tables in AnaGram, there will be | |
1150 actual <A HREF="#Production">productions</A> replacing the virtual productions. | |
1151 | |
1152 <P>A virtual production appears as one of the <A HREF="#RuleElement">rule | |
1153 elements</A> in a <A HREF="#GrammarRule">grammar rule</A>, i.e. as one of the members | |
1154 of the list on the right side of a production. | |
1155 | |
1156 <P>The simplest virtual production is the "optional" | |
1157 <A HREF="#Token">token</A>. If x is an arbitrary token, x? can be used to | |
1158 indicate an optional x. | |
1159 | |
1160 <P>Related virtual productions are x... and x?... where | |
1161 the three dots indicate repetition. x... represents an | |
1162 arbitrary number of occurrences of x, but at least one. | |
1163 x?... represents zero or more occurrences of x. | |
1164 | |
1165 <P>The remaining virtual productions use curly or square | |
1166 brackets to enclose a sequence of rules. The brackets | |
1167 may be followed variously by nothing, a string of three | |
1168 dots, or a slash, to indicate the choices to be made | |
1169 from the rules. Note that rules may be used, not merely | |
1170 tokens. | |
1171 | |
1172 <P>If r1 through rn are a set of grammar rules, then | |
1173 <PRE> {r1 | r2 | ... | rn} </PRE> | |
1174 is a virtual production that allows a choice of exactly | |
1175 one of the rules. Similarly, | |
1176 <PRE> {r1 | r2 | ... | rn}... </PRE> | |
1177 is a virtual production that allows a choice of one or | |
1178 more of the rules. And, finally, | |
1179 <PRE> {r1 | r2 | ... | rn}/... </PRE> | |
1180 is a virtual production that allows a choice of one or | |
1181 more of the rules subject to the side condition that | |
1182 rules must alternate, that is, that no rule can follow | |
1183 itself immediately without the interposition of some | |
1184 other rule. This is a case that is not particularly | |
1185 easy to write by hand, but is quite useful in a number | |
1186 of contexts. | |
1187 | |
1188 <P>If the above virtual productions are written with [ ] | |
1189 instead of { }, they all become optional. [ ] is an | |
1190 optional choice, [ ]... is zero or more choices, and | |
1191 [ ]/... is zero or more alternating choices. | |
1192 | |
1193 <P><A HREF="#NullProduction">Null productions</A> are not permitted in virtual | |
1194 productions in those cases where they would cause an | |
1195 intrinsic ambiguity. | |
1196 | |
1197 <P>You may use a definition statement to assign a name to | |
1198 a virtual production. | |
1199 | |
1200 </DD> | |
1201 | |
1202 </DL> | |
1203 | |
1204 | |
1205 <P> | |
1206 <BR> | |
1207 | |
1208 <IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" | |
1209 WIDTH=1010 HEIGHT=2 > | |
1210 <P> | |
1211 <IMG ALIGN="right" SRC="images/pslrb6d.gif" ALT="Parsifal Software" | |
1212 WIDTH=181 HEIGHT=25> | |
1213 <BR CLEAR="right"> | |
1214 Back to <A HREF="index.html">Index</A> | |
1215 <P> | |
1216 <ADDRESS><FONT SIZE="-1"> | |
1217 AnaGram parser generator - documentation<BR> | |
1218 Glossary<BR> | |
1219 Copyright © 1993-1999, Parsifal Software. <BR> | |
1220 All Rights Reserved.<BR> | |
1221 </FONT></ADDRESS> | |
1222 | |
1223 </BODY> | |
1224 </HTML> | |
1225 | |
1226 |