Mercurial > ~dholland > hg > ag > index.cgi
comparison doc/manual/xg-iii.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
-1:000000000000 | 0:13d2b8934445 |
---|---|
1 \chapter{Exploring Your Grammar III: Diagnostic Windows} | |
2 | |
3 If AnaGram finds problems with your grammar, it creates a number of | |
4 windows to help you identify and correct the problems. These windows | |
5 are the \agwindow{Warnings} window, the \agwindow{Conflicts} window, | |
6 the \agwindow{Resolved Conflicts} window, and the \agwindow{Keyword | |
7 Anomalies} window. To expand upon these windows, AnaGram also | |
8 provides, using the \agmenu{Auxiliary Windows} menu, the following | |
9 supporting windows: \agwindow{Conflict Trace}, \agwindow{Keyword | |
10 Anomaly Trace}, \agwindow{Reduction Trace}, \agwindow{Rule | |
11 Derivation}, \agwindow{Token Derivation} and \agwindow{Problem | |
12 States}. This chapter describes these windows and their use. | |
13 | |
14 | |
15 \section{The Warnings Window} | |
16 \index{Warnings}\index{Window} | |
17 | |
18 If while analyzing your syntax file, AnaGram finds something | |
19 suspicious, it issues a warning, regardless of the severity of the | |
20 problem. The \agwindow{Warnings} window will pop up automatically | |
21 when the analysis has been completed. If the warning is for a syntax | |
22 error in your input file, you will have to fix it, because AnaGram | |
23 cannot successfully interpret your syntax file. Otherwise, no matter | |
24 how serious the warnings may be, AnaGram will create a parser if it is | |
25 instructed to do so. Thus, if you decide the warnings are of no | |
26 consequence, you may choose to ignore them. Note that AnaGram will | |
27 resolve shift-reduce conflicts by choosing the shift action. It will | |
28 resolve reduce-reduce conflicts by choosing the rule that occurred | |
29 first in the grammar. | |
30 | |
31 If you have \index{Syntax error}\index{Errors}syntax errors, AnaGram | |
32 will synchronize the cursor in the syntax file window with the cursor | |
33 in the \agwindow{Warnings} window so that as you move through the | |
34 warning messages, the cursor in the syntax file window will move to | |
35 the point of the error. | |
36 | |
37 AnaGram's warning messages are listed, in alphabetical order, in | |
38 Appendix B, together with explanations of the messages and suggestions | |
39 as to what to do about them. | |
40 | |
41 | |
42 \section{The Conflicts Window} | |
43 \index{Conflicts}\index{Window} | |
44 | |
45 The \agwindow{Conflicts} window lists the conflicts which AnaGram | |
46 found in your grammar. The table identifies the parser states in | |
47 which it found conflicts. For each conflict state it identifies the | |
48 reducing token for which it had more than one option, and the marked | |
49 rules which characterize each such option. If AnaGram was able to | |
50 resolve a conflict by using \agparam{precedence} or \agparam{sticky} | |
51 declarations, the conflict will be listed in the \index{Resolved | |
52 Conflicts}\index{Window}\agwindow{Resolved Conflicts} window instead. | |
53 | |
54 The \agwindow{Conflicts} window and the \agwindow{Resolved Conflicts} | |
55 window each require several lines to display a conflict. For each | |
56 conflict, the first, or header, line identifies the conflict state and | |
57 the conflict token. The conflict token is a terminal token for which | |
58 there is more than one parser action consistent with the rules of the | |
59 grammar. If both shift and reduce actions are consistent with the | |
60 grammar, the conflict is a \agterm{shift-reduce} conflict. If more | |
61 than one reduce action is consistent with the grammar, the conflict is | |
62 said to be a \agterm{reduce-reduce} conflict. | |
63 | |
64 Following the header line, AnaGram lists the grammar rules which the | |
65 conflict token could satisfy. The rules which correspond to shift | |
66 actions are listed first, followed by those which correspond to reduce | |
67 actions. Rules which correspond to shift actions have not yet been | |
68 completed and have a marked token, shown in a distinctive type face, | |
69 which represents the next token expected. For such rules, the | |
70 conflict token is an instance of the marked token and can be shifted | |
71 into the input buffer. If the rule has been completely matched there | |
72 is no marked token and the conflict token can reduce the rule. In the | |
73 following discussion, such a rule is referred to as a \agterm{conflict | |
74 rule}. If there is a more than one conflict rule, the conflict is | |
75 called a \agterm{reduce-reduce} conflict; otherwise it is a | |
76 \agterm{shift-reduce} conflict. Although it is theoretically possible | |
77 for a conflict to be both a shift-reduce conflict and a reduce-reduce | |
78 conflict, it is not a common event. | |
79 | |
80 The \agmenu{Auxiliary Windows} menu for the \agwindow{Conflicts} window | |
81 provides eleven auxiliary windows to help you understand the | |
82 circumstances of a conflict. Five of these windows address the | |
83 conflict directly. These windows are the \agwindow{Conflict Trace}, the | |
84 \agwindow{Reduction Trace}, the \agwindow{Rule Derivation}, the | |
85 \agwindow{Token Derivation}, and the \agwindow{Problem States} | |
86 windows. These windows are keyed to the conflict state, the conflict | |
87 token, and the conflict rule. If the cursor bar is not highlighting a | |
88 conflict rule, AnaGram scans down the \agwindow{Conflicts} window for | |
89 the first conflict rule for the selected state and token. The use of | |
90 these windows in analyzing conflicts is described below. | |
91 | |
92 The other windows in the \agmenu{Auxiliary Windows} menu have been | |
93 described in Chapter 6. These are the \agwindow{Expansion Chain} | |
94 window, the \agwindow{Rule Context} window, the \agwindow{Reduction | |
95 States} window, the \agwindow{State Definition} window, the | |
96 \agwindow{State Expansion} window and the \agwindow{Token Usage} | |
97 window. The \agwindow{Rule Context} window is keyed to the particular | |
98 rule highlighted by the cursor. The | |
99 \index{Expansion Chain}\index{Window}\agwindow{Expansion Chain} window | |
100 is keyed to the highlighted rule. If the highlighted rule is a | |
101 characteristic rule no expansion chain is possible, so the option will | |
102 be greyed out. The \agwindow{Reduction States} window is keyed to the | |
103 highlighted rule if it is a conflict rule, otherwise to the first | |
104 conflict rule following the highlighted line. The \agwindow{State | |
105 Definition} and \agwindow{State Expansion} windows are keyed to the | |
106 conflict state. The \agwindow{Token Usage} window is keyed to the | |
107 conflict token. | |
108 | |
109 | |
110 \section{The Resolved Conflicts Window} | |
111 \index{Resolved Conflicts}\index{Window} | |
112 | |
113 AnaGram creates the \agwindow{Resolved Conflicts} window only when the | |
114 grammar it is analyzing has conflicts and when some of those conflicts | |
115 have been resolved by precedence rules (see Chapter 9) or by use of | |
116 the \agparam{sticky} directive. In this case, it will list, in the | |
117 \agwindow{Resolved Conflicts} window, the conflicts that were resolved | |
118 and the way in which they were resolved. The layout of the | |
119 \agwindow{Resolved Conflicts} window is identical to that of the | |
120 \agwindow{Conflicts} window with one exception. The rules which your | |
121 parser will observe, selected in accordance with your \agparam{sticky} | |
122 declarations or precedence rules, are marked with asterisks. Those | |
123 which your parser will ignore will not be so marked. The | |
124 \agwindow{Auxiliary Windows} available from the \agwindow{Resolved | |
125 Conflicts} window are the same as those available from the | |
126 \agwindow{Conflicts} window. | |
127 | |
128 | |
129 \section{Conflicts} | |
130 \index{Conflicts} | |
131 | |
132 Chapter 4 presented a brief introduction to the problem of conflicts | |
133 in a grammar. There, conflicts were simply classified as shift-reduce | |
134 conflicts or as reduce-reduce conflicts. This section examines | |
135 conflicts in somewhat greater detail. It looks at the causes of | |
136 conflicts, how to identify the cause of a conflict and how to deal | |
137 with it. | |
138 | |
139 Conflicts occur when, in a particular state, a given token, the | |
140 \index{Conflict token}\index{Token}\agterm{conflict token}, reduces | |
141 more than one rule, or reduces a rule when the token could also be | |
142 shifted into the input buffer in accordance with another rule. | |
143 | |
144 Note that there are no shift-shift conflicts, that is, a particular | |
145 state could have several | |
146 \index{Characteristic rules}\index{Rules}characteristic rules | |
147 accepting the same token as a shift token. A conflict is always | |
148 associated with reduction. This is because LALR parsers keep track of | |
149 multiple possibilities in the input stream as long as the input tokens | |
150 only cause shifts. At the point where a reduction occurs the parser | |
151 must be able to identify exactly one of the possibilities as | |
152 consistent with the input. | |
153 | |
154 \subsection{Shift-Reduce Conflicts} | |
155 \index{Shift-reduce conflict}\index{Conflicts} | |
156 | |
157 Unless a conflict is deliberate, the fact that the conflict token | |
158 reduces a particular rule is usually a surprise. Although some | |
159 conflicts arise because an LALR-1 parser generator cannot look far | |
160 enough ahead, most shift-reduce conflicts result from true | |
161 \index{Ambiguities}ambiguities in your grammar. For example, an | |
162 erroneous Pascal grammar contained the following productions, which | |
163 illustrate one of the more common classes of shift-reduce conflicts: | |
164 | |
165 \begin{indentingcode}{0.4in} | |
166 parameter list | |
167 -> expression | |
168 -> parameter list, expression | |
169 \end{indentingcode} | |
170 | |
171 Pascal actually requires that the expressions in a parameter list be | |
172 separated by commas. In other words, the second rule should read: | |
173 | |
174 \begin{indentingcode}{0.4in} | |
175 -> parameter list, comma, expression | |
176 \end{indentingcode} | |
177 | |
178 Lacking a requirement for a comma, the erroneous rules specify that a | |
179 parameter list is simply a sequence of expressions with nothing | |
180 separating them. For example, using this definition, | |
181 | |
182 \begin{indentingcode}{0.4in} | |
183 (a + b) - (c * d) | |
184 \end{indentingcode} | |
185 | |
186 can be interpreted either as one expression, if the minus sign is | |
187 interpreted as a subtraction operator, or as a sequence of two | |
188 expressions if it is interpreted as a unary minus sign. | |
189 | |
190 When AnaGram analyzed the grammar with the erroneous rule it found a | |
191 conflict in the state where the minus sign was expected. If the minus | |
192 sign is the subtraction operator, it should be shifted onto the input | |
193 stack. On the other hand, if the minus sign is a unary minus sign, | |
194 the sum \agcode{(a+b)} should be reduced to \agcode{expression}, and | |
195 then to \agcode{parameter list}. Thus, this ambiguity gives rise to a | |
196 shift-reduce conflict, where it is impossible to determine a unique | |
197 parser action. | |
198 | |
199 \subsection{Identifying Ambiguities} | |
200 \index{Ambiguities} | |
201 | |
202 Unfortunately, shift-reduce conflicts usually show up in parser states | |
203 which are far removed from the true source of the problem. When your | |
204 grammar has a shift-reduce conflict of the sort illustrated above, | |
205 there may be no readily apparent link to the erroneous | |
206 production. AnaGram provides several options to help you establish the | |
207 link, all available using the \agmenu{Auxiliary Windows} menu from the | |
208 \agwindow{Conflicts} window. These options comprise two pre-built | |
209 grammar traces, the \agwindow{Conflict Trace} and the | |
210 \agwindow{Reduction Trace}; two special windows called \agwindow{Rule | |
211 Derivation} and \agwindow{Token Derivation}; the \agwindow{Expansion | |
212 Chain} window and a special version of the \agwindow{Reduction States} | |
213 window called the \agwindow{Problem States} window. The four trace | |
214 and derivation windows are coordinated, so that AnaGram actually | |
215 computes a mutually consistent set of four whenever you ask for one of | |
216 them. | |
217 | |
218 The \agwindow{Conflict} and \agwindow{Reduction Traces} are simply | |
219 pre-initialized \agwindow{Grammar Traces} which you may use to become | |
220 more familiar with the circumstances of the conflict. Bear in mind, | |
221 however, that unless you have set the | |
222 \index{Traditional engine}\index{Configuration switches} | |
223 \agparam{traditional engine} configuration switch, | |
224 the traces will use compound actions, which in some cases will use the | |
225 default resolution of conflicts. Therefore, unless you set | |
226 \agparam{traditional engine} it is not always possible to work through | |
227 an alternate resolution of a conflict. | |
228 | |
229 The | |
230 \index{Conflict Trace}\index{Trace}\index{Window}\agwindow{Conflict Trace} | |
231 shows you one possible sequence of input which will lead to the | |
232 conflict state. Unless a conflict results from a typo or other | |
233 clerical error, it is usually a surprise and not something you | |
234 anticipated. The \agwindow{Conflict Trace} helps you to understand | |
235 the context of the problem. It is often helpful to look at the | |
236 \agwindow{Rule Stack} for the \agwindow{Conflict Trace} to see what | |
237 productions are involved in the conflict. You can inspect the | |
238 \agwindow{State Definition} window for the conflict state to see why | |
239 the conflict token can be shifted. | |
240 | |
241 \index{Reduction Trace}\index{Trace}\index{Window} | |
242 The \agwindow{Reduction Trace}, for a particular conflict rule, | |
243 carries forward a \agwindow{Conflict Trace} by selecting the reduce | |
244 action. In other words, for a parser in the conflict state, the | |
245 \agwindow{Reduction Trace} shows the result of using the conflict | |
246 token as a reducing token. The trace shows the result of taking all | |
247 possible reductions until a state is reached where the lookahead token | |
248 will shift into the input buffer. If you compare the | |
249 \agwindow{Reduction Trace} to the \agwindow{Conflict Trace} you will | |
250 notice that they start off with identical tokens. At some point they | |
251 diverge. All of the tokens in the \agwindow{Conflict Trace} following | |
252 this point have reduced to the token in the \agwindow{Reduction | |
253 Trace}. Usually there are no further tokens in the | |
254 \agwindow{Reduction Trace}, but if there are, you will notice that | |
255 they are all zero-length tokens. That is, they all have a null | |
256 production, so that they come for free, as it were. If you press | |
257 Enter, to display the \agwindow{Select Input} window for the | |
258 \agwindow{Reduction Trace}, you will see that the conflict token is | |
259 acceptable input. | |
260 | |
261 The \agwindow{Reduction Trace} illustrates the ambiguity in your | |
262 grammar that caused the conflict: One of the acceptable input tokens | |
263 in the final state of this trace cannot be distinguished from its | |
264 predecessor. | |
265 | |
266 The \index{Expansion Chain}\index{Window}\agwindow{Expansion Chain} | |
267 window is extremely useful for indicating why a particular grammar | |
268 rule is an expansion rule in a particular state. Sometimes the chain | |
269 of substitutions to produce a given rule is quite long, and changes to | |
270 a grammar can have unexpected effects. The \agwindow{Expansion Chain} | |
271 window enables you to see how a rule in a conflict state got to be | |
272 there. | |
273 | |
274 \index{Problem States}\index{Window}The \agwindow{Problem States} | |
275 window for a conflict rule is essentially a trimmed | |
276 \agwindow{Reduction States} window which shows only those reduction | |
277 states in which the conflict token is acceptable input. Often it is | |
278 sufficient to look at the \agwindow{Problem States} window to see the | |
279 source of the problem. The last state in the \agwindow{Reduction | |
280 Trace} window, will be, of course, one of the states in the | |
281 \agwindow{Problem States} window. | |
282 | |
283 \subsection{Rule and Token Derivation Windows} | |
284 | |
285 If after inspecting the \agwindow{Problem States}, \agwindow{Conflict | |
286 Trace} and the \agwindow{Reduction Trace}, the conflict is still a | |
287 mystery, you may pinpoint the problem by looking at the | |
288 \agwindow{Rule} and \agwindow{Token Derivation} windows for the | |
289 conflict. Both windows contain a sequence of rules, and both begin | |
290 with the same rule, the rule which is the root cause of the conflict. | |
291 This rule is one of the characteristic rules of the last state in the | |
292 \agwindow{Reduction Trace} and is the rule which produces the ambiguity. | |
293 | |
294 \index{Rule Derivation}\index{Window}In both windows, each rule, | |
295 except the last line of the \agwindow{Rule Derivation}, contains a | |
296 marked token in a distinctive typeface. The essence of the ambiguity | |
297 is that it is impossible to tell where in the input stream the marked | |
298 token in the top line of the \agwindow{Rule Derivation} ends and the | |
299 marked token in the top line of the \agwindow{Token Derivation} | |
300 begins. | |
301 | |
302 In both derivation windows, each line after the first consists of a | |
303 rule produced by the marked token on the previous line. In the | |
304 \agwindow{Rule Derivation}, except on the first line, the marked token | |
305 is either the last token in the rule, or all subsequent tokens have | |
306 null productions. The last rule in the derivation window is the | |
307 conflict rule you selected in the \agwindow{Conflicts} window. Thus | |
308 the \agwindow{Rule Derivation} window shows you how the rule involved | |
309 in the conflict derives from the root. | |
310 | |
311 \index{Window}\index{Token Derivation}In the \agwindow{Token | |
312 Derivation} window, except on the first line, the marked token is | |
313 either the first token in the rule, or all preceding tokens have null | |
314 productions. The marked token in the last line of the list is the | |
315 conflict token. | |
316 | |
317 The \agwindow{Rule Derivation} and \agwindow{Token Derivation} windows | |
318 each have five auxiliary windows. The \agwindow{Rule Context} window | |
319 is keyed simply to the highlighted rule. The other four windows, the | |
320 \agwindow{Expansion Rules} window, the \agwindow{Productions} window, | |
321 the \agwindow{Set Elements} window, and the \agwindow{Token Usage} | |
322 window are keyed to the marked token. Remember that there is no marked | |
323 token on the last line of the \agwindow{Rule Derivation} window. | |
324 | |
325 \subsection{Resolving Ambiguities} | |
326 \index{Ambiguities} | |
327 | |
328 The shift-reduce conflict discussed earlier stemmed from an oversight | |
329 in writing a grammar: the grammar specified a list of items, in this | |
330 case expressions, which could not be distinguished without a separator | |
331 of some sort. This conflict can be resolved simply by rewriting the | |
332 grammar to require a comma separating the expressions. | |
333 | |
334 A similar problem comes from specifying a list of lists without any | |
335 separator. Sometimes such a specification is made inadvertently. The | |
336 erroneous Pascal grammar cited above had such a conflict. Here are | |
337 the productions that caused trouble: | |
338 | |
339 \begin{indentingcode}{0.4in} | |
340 declaration part | |
341 -> declaration section | |
342 -> declaration part, declaration section | |
343 | |
344 declaration section | |
345 -> proc and func declaration part | |
346 | |
347 proc and func declaration part | |
348 -> proc or func declaration | |
349 -> proc and func declaration part, proc or func declaration | |
350 | |
351 proc or func declaration | |
352 -> proc declaration | |
353 -> func declaration | |
354 \end{indentingcode} | |
355 | |
356 Of course, there are other productions for \agcode{declaration | |
357 section} so that this grammar looks perfectly reasonable on the | |
358 surface. A \agcode{declaration part} is a sequence of | |
359 \agcode{declaration sections}. One type of \agcode{declaration | |
360 section}, the \agcode{proc and func declaration part}, is a sequence | |
361 of procedure or function definitions. Unfortunately, if you have two | |
362 procedure definitions in a row, it could be interpreted as one | |
363 \agcode{proc and func declaration part} consisting of two declarations | |
364 or two \agcode{proc and func declaration parts} each consisting of one | |
365 declaration. The problem here is that there is a little too much | |
366 recursion. You can fix this problem by replacing | |
367 | |
368 \begin{indentingcode}{0.4in} | |
369 declaration section | |
370 -> proc and func declaration part | |
371 | |
372 proc and func declaration part | |
373 -> proc or func declaration | |
374 -> proc and func declaration part, proc or func declaration | |
375 \end{indentingcode} | |
376 | |
377 with | |
378 | |
379 \begin{indentingcode}{0.4in} | |
380 declaration section | |
381 -> proc or func declaration | |
382 \end{indentingcode} | |
383 | |
384 After a little reflection you will see that this simplification, which | |
385 eliminates one source of recursion, entails no loss of generality. | |
386 | |
387 In the above examples, the ambiguities were the results of errors in | |
388 the specification of the grammar, and could be easily fixed by | |
389 correcting the grammar. Sometimes, however, ambiguities arise which | |
390 are not so easily corrected. Consider, for instance, a simple token | |
391 scanner: | |
392 | |
393 \begin{indentingcode}{0.4in} | |
394 tokens | |
395 -> [token, white space?...]... | |
396 | |
397 token | |
398 -> name | |
399 -> number | |
400 -> string | |
401 -> character constant | |
402 -> punctuation character | |
403 \end{indentingcode} | |
404 | |
405 where \agcode{name}, \agcode{number}, etc., are defined in the | |
406 conventional manner. | |
407 | |
408 Such token scanners are normally written using conventional | |
409 programming techniques, or using programs such as \agfile{lex} which | |
410 are based on the analysis of regular expressions. Such programs | |
411 simply work on a character by character basis and do not see that | |
412 ``maryjane'' could be interpreted as one name or as two, or as even | |
413 more, since the separating white space is declared to be optional. | |
414 AnaGram, however, since it takes a broader view, sees that the | |
415 specifications are basically ambiguous and complains. Note, however, | |
416 that the default resolution of conflicts, in this case choosing the | |
417 shift action, will provide a parser that does precisely what you would | |
418 normally intend. Thus, the major concern is to make sure that the | |
419 conflicts caused by this situation do not mask other more serious | |
420 conditions. | |
421 | |
422 While it is possible to rewrite this grammar so that it is not | |
423 ambiguous, such a grammar would lack clarity and conciseness. AnaGram | |
424 provides two alternatives for dealing with this problem: the | |
425 \index{Declaration}\index{Subgrammar declaration}\agparam{subgrammar} | |
426 declaration and the | |
427 \index{Sticky declaration}\index{Declaration}\agparam{sticky} | |
428 declaration, which accomplish the desired end using very different | |
429 techniques. | |
430 | |
431 In the definition above, a grammar for token, taken all by itself, | |
432 would not normally show any ambiguities. We could write a | |
433 \index{Lexical scanner}lexical scanner that would identify a token, | |
434 pass it on to the parser, and then go on to the next. We would never | |
435 worry about any ambiguities. In this situation, it is clear that what | |
436 we mean by \agcode{token} is defined entirely by the grammar for | |
437 \agcode{token}, not by its usage within the rest of the grammar. When | |
438 we say that we have a sequence of tokens, \agcode{token...}, we | |
439 usually mean that, of course, two tokens, such as ``mary'' and | |
440 ``jane'' have to be separated by punctuation or space to be | |
441 distinguished as separate tokens. The | |
442 \index{Declaration}\index{Subgrammar declaration}\agparam{subgrammar} | |
443 declaration allows you to specify that a particular \index{Nonterminal | |
444 token}\index{Token}nonterminal token in your grammar is to be treated | |
445 as though it were defined by its own complete grammar, without regard | |
446 to its usage in the grander scheme of things. In the example above, | |
447 add the following: | |
448 | |
449 \begin{indentingcode}{0.4in} | |
450 [ subgrammar \bra token \ket ] | |
451 \end{indentingcode} | |
452 | |
453 Now, when AnaGram conducts its search for reducing tokens, it will | |
454 limit its search to those it would find if your grammar consisted only | |
455 of the grammar rules necessary to define token. When it does this, it | |
456 never even finds the reducing tokens that appear to cause conflicts. | |
457 When you fix a conflict with the \agparam{subgrammar} declaration, the | |
458 conflict vanishes entirely. It is not recorded in the | |
459 \agwindow{Resolved Conflicts} window. For this reason, the | |
460 \agparam{subgrammar} declaration should be used only when you are | |
461 absolutely sure of the nature of the problem you are trying to solve. | |
462 | |
463 A somewhat more precise mechanism for dealing with the conflicts | |
464 described above is to declare name and number as \index{Sticky | |
465 declaration}\index{Declaration}\agparam{sticky}: | |
466 | |
467 \begin{indentingcode}{0.4in} | |
468 [ sticky \bra name, number \ket ] | |
469 \end{indentingcode} | |
470 | |
471 This means that whenever the last token in your parser's input buffer | |
472 is \agcode{name} or \agcode{number}, it will shift in any acceptable | |
473 input character, and will disregard conflicting reduce options. | |
474 AnaGram will continue to identify conflicts in your grammar due to | |
475 this cause, but will record them in the \agwindow{Resolved Conflicts} | |
476 window. Any other conflicts you may have will still show up in the | |
477 \agwindow{Conflicts} window. | |
478 | |
479 Sometimes it may be desirable to introduce ambiguities deliberately | |
480 into a grammar. For example, although it is easy enough to write | |
481 normal grammar rules for the conventional expression logic used in | |
482 programming languages, it is possible to produce a somewhat more | |
483 compact, faster running parser by writing an ambiguous grammar and | |
484 using AnaGram's precedence declarations to resolve the ambiguities. | |
485 An example of this technique is given in Chapter 9. | |
486 | |
487 \subsection{Null Productions} | |
488 | |
489 \index{Null productions}\index{Production}Null productions can be a | |
490 fertile source of conflicts. Fortunately, many of them can be easily | |
491 resolved by simply rewriting a few rules. Consider, for example, the | |
492 following: | |
493 | |
494 \begin{indentingcode}{0.4in} | |
495 name | |
496 -> capital letter, letter..., white space | |
497 | |
498 initial | |
499 -> capital letter, period, white space | |
500 | |
501 full name | |
502 -> name, initial?, name | |
503 \end{indentingcode} | |
504 | |
505 In this example, \agcode{initial?} is a virtual production which | |
506 AnaGram replaces with | |
507 % not ``with''... | |
508 | |
509 \begin{indentingcode}{0.4in} | |
510 initial? | |
511 -> | |
512 -> initial | |
513 \end{indentingcode} | |
514 | |
515 so \agcode{initial?} involves a null production. | |
516 | |
517 Recall that LALR-1 parsers cannot look ahead more than one character. | |
518 The problem with this grammar, then, is that given the input ``John | |
519 Q'', there is no way to tell whether the ``Q'' is an initial or the | |
520 first letter of the last name. Is this ``John Q. Public'' or ``John | |
521 Quincy''? Adding another production for ``full name'' so that | |
522 ``initial'' need not be marked as optional allows the parser to defer | |
523 its decision: | |
524 | |
525 \begin{indentingcode}{0.4in} | |
526 full name | |
527 -> name, name | |
528 -> name, initial, name | |
529 \end{indentingcode} | |
530 | |
531 Notice that the technique here is to \emph{avoid a reduction} until | |
532 the parser has enough information to proceed. | |
533 | |
534 \subsection{Reduce-Reduce Conflicts} | |
535 | |
536 \index{Reduce-reduce conflicts}\index{Conflicts}Reduce-reduce | |
537 conflicts are usually caused by the limited lookahead of LR-1 parsing. | |
538 Very occasionally, you might find a grammar with a reduce-reduce | |
539 conflict that is caused by the LALR-1 approximation to LR-1 parsing | |
540 that AnaGram supports. Such conflicts are quite easy to resolve, | |
541 however, by almost trivial rewriting. | |
542 | |
543 Reduce-reduce conflicts commonly involve productions that are | |
544 intrinsically indistinguishable, for instance: | |
545 | |
546 \begin{indentingcode}{0.4in} | |
547 tweedledee | |
548 -> tiddly, wink | |
549 | |
550 tweedledum | |
551 -> tiddly, wink | |
552 \end{indentingcode} | |
553 | |
554 However, it may be possible to distinguish \agcode{tweedledee} from | |
555 \agcode{tweedledum} by context. Suppose | |
556 | |
557 \begin{indentingcode}{0.4in} | |
558 gibber | |
559 -> tweedledee, apples | |
560 -> tweedledum, oranges | |
561 \end{indentingcode} | |
562 | |
563 Now if you can distinguish \agcode{apples} from \agcode{oranges} on | |
564 the basis of the very first character, your parser will be able to | |
565 tell \agcode{tweedledee} from \agcode{tweedledum}. If, on the other | |
566 hand, it takes more than one input character to tell whether you have | |
567 \agcode{apples} or \agcode{oranges}, you have a reduce-reduce conflict. | |
568 | |
569 This example, of course, is rather far-fetched, since the potential | |
570 confusion between \agcode{tweedledee} and \agcode{tweedledum} is | |
571 obvious. More commonly, the definitions and usage of | |
572 \agcode{tweedledee} and \agcode{tweedledum} are physically far apart | |
573 in the grammar. At one place in a grammar: | |
574 | |
575 \begin{indentingcode}{0.4in} | |
576 balderdash | |
577 -> tweedledee, apples | |
578 \end{indentingcode} | |
579 | |
580 Elsewhere, seemingly unrelated: | |
581 | |
582 \begin{indentingcode}{0.4in} | |
583 buncombe | |
584 -> tweedledum, oranges | |
585 \end{indentingcode} | |
586 | |
587 In a dark corner: | |
588 | |
589 \begin{indentingcode}{0.4in} | |
590 gossip | |
591 -> buncombe, slander | |
592 \end{indentingcode} | |
593 | |
594 And in yet another dark corner: | |
595 | |
596 \begin{indentingcode}{0.4in} | |
597 small talk | |
598 -> gossip | |
599 -> rumors | |
600 -> balderdash | |
601 \end{indentingcode} | |
602 | |
603 Once again, if \agcode{apples} and \agcode{oranges} cannot be | |
604 distinguished by the very first character, you have a reduce-reduce | |
605 conflict that is somewhat harder to find than the one discussed above, | |
606 simply because the trail is somewhat more complicated to follow. | |
607 | |
608 \subsection{Identifying Reduce-Reduce Conflicts} | |
609 \index{Reduce-reduce conflicts}\index{Conflicts} | |
610 | |
611 Although the machinery described above for identifying shift-reduce | |
612 conflicts can be used to investigate reduce-reduce conflicts, many | |
613 yield to comparison of the \agwindow{Reduction States} windows for the | |
614 two rules. For the two rules to be distinguishable, no token which is | |
615 acceptable input for any of the reduction states belonging to one rule | |
616 can be acceptable input for any reduction state belonging to the | |
617 other. | |
618 | |
619 Therefore, when you have a reduce-reduce conflict, there is an overlap | |
620 among the tokens in the reduction states for one rule and the tokens | |
621 in the reduction states for the other rule. Often, however, there are | |
622 many reduction states which are irrelevant as far as a particular | |
623 conflict is concerned. So AnaGram provides a special window, the | |
624 \agwindow{Problem States} window, in the \agmenu{Auxiliary Windows} | |
625 menu for the \agwindow{Conflicts} and \agwindow{Resolved Conflicts} | |
626 windows. The \agwindow{Problem States} window for a particular rule | |
627 lists those, and only those, reduction states for the rule for which | |
628 the conflict token is acceptable input. This usually reduces the size | |
629 of the table and often makes the source of the problem immediately | |
630 obvious. | |
631 | |
632 \subsection{Resolving Reduce-Reduce Conflicts} | |
633 | |
634 If you build a parser for a grammar that includes a reduce-reduce | |
635 conflict, AnaGram will arbitrarily reduce the first rule it | |
636 encountered while reading your grammar, that is, the rule with the | |
637 smaller rule number. In some circumstances this default may be | |
638 acceptable. In other circumstances more may be required. | |
639 | |
640 Often, it is possible to get rid of reduce-reduce conflicts by | |
641 rewriting your grammar so that tokens that were once reducing tokens | |
642 for a rule become part of the rule itself. Thus, in the previous | |
643 example, you could write: | |
644 | |
645 \begin{indentingcode}{0.4in} | |
646 tweedledee | |
647 -> tiddly, wink, apples | |
648 | |
649 tweedledum | |
650 -> tiddly, wink, oranges | |
651 \end{indentingcode} | |
652 | |
653 In other words, if \agcode{apples} always follows \agcode{tweedledee}, | |
654 you can move \agcode{apples} ``down'' in the grammar by making it part | |
655 of \agcode{tweedledee}. Moving tokens ``down'' in this manner always | |
656 reduces the likelihood of conflicts. But what if \agcode{apples} | |
657 doesn't always follow \agcode{tweedledee}? Then it is sometimes | |
658 worthwhile to make two versions of \agcode{tweedledee}: | |
659 | |
660 \begin{indentingcode}{0.4in} | |
661 simple tweedledee | |
662 -> tiddly, wink | |
663 | |
664 complex tweedledee | |
665 -> tiddly, wink, apples | |
666 \end{indentingcode} | |
667 | |
668 and in each circumstance use whichever is appropriate. As with the | |
669 earlier shift-reduce example, the technique is to avoid any reduction | |
670 until the tokens can be distinguished. | |
671 | |
672 Sometimes, however, reduce-reduce conflicts reflect a language design | |
673 that really demands greater lookahead than can be accommodated with an | |
674 LR-1 parser. In these cases you have two options: you can consider a | |
675 multi-stage parser or you can implement an ad hoc lookahead procedure. | |
676 In the latter case, you can use semantically determined productions to | |
677 control your parser depending on the results of your lookahead | |
678 procedure. | |
679 | |
680 \subsection{Conflicts due to LALR-1 Simplifications} | |
681 \index{LALR simplifications}\index{Conflicts} | |
682 | |
683 LALR-1 \index{Parser generator}parser generators occasionally report a | |
684 reduce-reduce conflict where an LR-1 parser would successfully resolve | |
685 the conflict. How significant is this problem? An early prototype | |
686 version of AnaGram actually had full LR-1 capability. Unfortunately | |
687 LR-1 parsers are generally much bigger than LALR-1 parsers. Creating | |
688 them takes longer and more resources so that for given computer | |
689 resources an LALR-1 parser generator can handle larger, more complex | |
690 grammars than an LR-1 parser generator. To provide a compromise, the | |
691 prototype was modified to do an LALR-1 analysis first. Then, only if | |
692 there were reduce-reduce conflicts did it do an LR-1 analysis. Over a | |
693 substantial period of testing, the only reduce-reduce conflicts that | |
694 were resolved by the LR-1 analysis were those that were created | |
695 deliberately as tests. Eventually the LR-1 capabilities were removed | |
696 from AnaGram since they added complexity without concomitant benefits. | |
697 To understand why this is so, it is worthwhile to look at an example. | |
698 Let us recall the definitions of \agcode{tweedledee} and | |
699 \agcode{tweedledum} made above: | |
700 | |
701 \begin{indentingcode}{0.4in} | |
702 tweedledee | |
703 -> tiddly, wink | |
704 tweedledum | |
705 -> tiddly, wink | |
706 \end{indentingcode} | |
707 | |
708 Remember that we defined gibber thus: | |
709 | |
710 \begin{indentingcode}{0.4in} | |
711 gibber | |
712 -> tweedledee, apples | |
713 -> tweedledum, oranges | |
714 \end{indentingcode} | |
715 | |
716 Now suppose that we also define jabber in a contrary way: | |
717 | |
718 \begin{indentingcode}{0.4in} | |
719 jabber | |
720 -> tweedledee, oranges | |
721 -> tweedledum, apples | |
722 \end{indentingcode} | |
723 | |
724 Now suppose we use \agcode{gibber} in one part of our grammar and | |
725 \agcode{jabber} in another, completely unrelated, part. An LR-1 | |
726 parser generator would succeed in keeping \agcode{tweedledee} and | |
727 \agcode{tweedledum} straight in spite of our efforts to confuse it. | |
728 An LALR-1 parser generator, however, succumbs to our efforts and | |
729 reports a conflict. A trivial rewrite of the grammar, however, | |
730 removes the conflict: | |
731 | |
732 \begin{indentingcode}{0.4in} | |
733 gibber tweedledee | |
734 -> tiddly, wink | |
735 gibber tweedledum | |
736 -> tiddly, wink | |
737 | |
738 jabber tweedledee | |
739 -> tiddly, wink | |
740 jabber tweedledum | |
741 -> tiddly, wink | |
742 | |
743 gibber | |
744 -> gibber tweedledee, apples | |
745 -> gibber tweedledum, oranges | |
746 | |
747 jabber | |
748 -> jabber tweedledee, oranges | |
749 -> jabber tweedledum, apples | |
750 \end{indentingcode} | |
751 | |
752 The effect of this rewrite is to accomplish the same separation of | |
753 states that an LR-1 parser generator would make. | |
754 | |
755 \section{Keyword Anomalies} | |
756 \index{Keyword Anomalies}\index{Anomaly} | |
757 | |
758 In syntax directed parsing, it is assumed that input tokens can be | |
759 uniquely identified. In the case of keywords, however, there is the | |
760 possibility that the individual characters making up the keyword, as | |
761 well as the keyword taken as a whole, could constitute legitimate | |
762 input under some circumstances. Thus keywords, though a powerful and | |
763 useful tool, are not completely consistent with the assumptions that | |
764 underlie syntax directed parsing. This can occasionally give rise to a | |
765 type of conflict, diagnosed by AnaGram, called a \agterm{keyword | |
766 anomaly}. AnaGram is quite conservative in its diagnoses, so that many | |
767 keyword anomalies it reports are actually innocuous and can be safely | |
768 ignored. Nevertheless, anomalies should be investigated carefully. | |
769 | |
770 AnaGram only recognizes a keyword in those states where the keyword is | |
771 allowable input. If AnaGram recognizes a keyword, the keyword takes | |
772 precedence over the sequence of characters that constitute the | |
773 keyword. If there are several allowable keywords, the longest takes | |
774 precedence. | |
775 | |
776 Under certain circumstances it is possible for a grammar to have a | |
777 state where a particular keyword may or may not be allowable input | |
778 depending on how the parser gets to that state. That is, for some | |
779 sequences of tokens which lead to the state, the keyword is allowable, | |
780 in others it isn't. In such a state, the keyword must always cause a | |
781 reduction, since if the keyword were to shift, it would always be | |
782 allowable input. When AnaGram finds such a state in your parser, it | |
783 checks to see what the parser action would be if the parser ignored | |
784 the keyword and interpreted it in terms of its constituent | |
785 characters. If the parser action would be anything other than reducing | |
786 the same rule as the keyword or a syntax error, AnaGram will diagnose | |
787 a keyword anomaly in this state. | |
788 | |
789 To summarize, the nature of a keyword anomaly is this: It is possible | |
790 for your parser to get to the anomalous state by a route for which the | |
791 keyword is not allowable. In the anomalous state, at least the first | |
792 constituent character of the keyword is allowable in its own right and | |
793 causes a parser action different from that caused by the | |
794 keyword. Nonetheless, if the keyword should appear in the input, your | |
795 parser will recognize it, perform at least one reduction action, get | |
796 to a state where the keyword is not recognized, and finally, in the | |
797 wrong state, will try to interpret the keyword in terms of its | |
798 constituent characters. | |
799 | |
800 Now, what do you do about a keyword anomaly? If in the anomalous state | |
801 the characters which constitute the keyword would never be | |
802 syntactically admissible as individual characters, the keyword anomaly | |
803 is innocuous in that the only problem is that the recognition of the | |
804 syntax error is somewhat delayed. If however, the characters which | |
805 constitute the keyword are syntactically admissible, you have a | |
806 problem that needs to be fixed. | |
807 | |
808 In many programming languages, many keywords are treated as | |
809 \agterm{reserved words}, that is, as words that may not be otherwise | |
810 used in a conforming program. Version 1.5 of AnaGram has added a | |
811 \index{Reserve keywords}\agparam{reserve keywords} statement that you | |
812 can use to specify that the text of a keyword is \emph{never} | |
813 admissible as individual characters. Thus you will never get a | |
814 keyword anomaly diagnostic for a keyword that has been listed in a | |
815 \index{Reserve keywords}\agparam{reserve keywords} statement. | |
816 The \agparam{reserve keywords} statement is described in greater | |
817 detail in \S 8.2.28. | |
818 % XXX this is the only place in the manual (so far?) that uses the | |
819 % \S section-symbol. (Also, this should be a tex crossreference.) | |
820 | |
821 \subsection{The Keyword Anomalies Window} | |
822 \index{Keyword Anomalies}\index{Window} | |
823 | |
824 If your grammar has a keyword anomaly, AnaGram will provide a warning | |
825 message and will create a \agwindow{Keyword Anomalies} window to | |
826 provide details about the anomalies in your grammar. | |
827 | |
828 Each entry in the \agwindow{Keyword Anomalies} window consists of two | |
829 lines. The first line identifies the state at which the anomaly | |
830 occurs and the offending keyword. The second line identifies the rule | |
831 which the keyword may erroneously reduce. The \agmenu{Auxiliary | |
832 Windows} menu provides three auxiliary windows keyed directly to the | |
833 anomaly to help you determine the nature of the problem: the | |
834 \agwindow{Keyword Anomaly Trace} window, the \agwindow{Reduction | |
835 Trace} window, and the \agwindow{Rule Derivation} window. In | |
836 addition, three other windows provide supporting information: the | |
837 \agwindow{Reduction States} window, the \agwindow{Rule Context} window | |
838 and the \agwindow{State Definition} window. These windows are | |
839 discussed in Chapter 6. | |
840 | |
841 The \agwindow{Keyword Anomaly Trace} window, a pre-built | |
842 \agwindow{Grammar Trace}, shows a sequence of tokens that illustrates | |
843 one way to encounter the anomaly. The \agwindow{Reduction Trace} | |
844 window shows the result of performing the reduction action. Just as | |
845 for conflicts, the true source of the problem is often at some remove | |
846 from the actual state in which the problem manifests itself. One of | |
847 the characteristic rules for the last state in the \agwindow{Reduction | |
848 Trace} window is, in fact, the true source of the problem. The | |
849 \agwindow{Rule Derivation} window can show you which of the | |
850 characteristic rules is the offending party and how the rule in the | |
851 anomalous state derives from it. | |
852 | |
853 \subsection{Dealing with Keyword Anomalies} | |
854 | |
855 If your grammar has a keyword anomaly, you should study it carefully. | |
856 It may be that the anomaly is completely innocuous, that is, the | |
857 characters which constitute the keyword are not otherwise legitimate | |
858 input. In this case you can simply ignore the message. | |
859 | |
860 The basic approach to dealing with keyword anomalies is to revise your | |
861 grammar to make a better distinction between the contexts in which the | |
862 keyword is acceptable and those in which it may be confused with | |
863 otherwise valid input. Several other techniques may be useful: | |
864 | |
865 For an anomaly to occur, there must be, \emph{in the anomaly state}, | |
866 an alternative interpretation of some of the initial characters of the | |
867 keyword. If the anomaly occurs because you have several keywords that | |
868 begin in a similar manner, you may use the | |
869 \index{Distinguish keywords}\index{Keywords}\agparam{distinguish keywords} | |
870 statement (see Chapter 8) to keep them properly separated. Note that | |
871 having several keywords that begin with the same characters does | |
872 \emph{not} necessarily cause an anomaly. | |
873 | |
874 If the keyword is an alphabetic keyword that can follow conventional | |
875 variable name tokens, you may wish to use a \agparam{sticky} | |
876 declaration or the | |
877 \index{Distinguish lexemes}\index{Configuration switch} | |
878 \agparam{distinguish lexemes} switch | |
879 to inhibit recognition of the keyword when it is preceded immediately | |
880 by a name. | |
881 | |
882 If the keyword can be treated as a reserved word, the \index{Reserve | |
883 keywords}\agparam{reserve keywords} statement will eliminate the | |
884 anomaly. |