comparison doc/manual/xg-iii.tex @ 0:13d2b8934445

Import AnaGram (near-)release tree into Mercurial.
author David A. Holland
date Sat, 22 Dec 2007 17:52:45 -0500
parents
children
comparison
equal deleted inserted replaced
-1:000000000000 0:13d2b8934445
1 \chapter{Exploring Your Grammar III: Diagnostic Windows}
2
3 If AnaGram finds problems with your grammar, it creates a number of
4 windows to help you identify and correct the problems. These windows
5 are the \agwindow{Warnings} window, the \agwindow{Conflicts} window,
6 the \agwindow{Resolved Conflicts} window, and the \agwindow{Keyword
7 Anomalies} window. To expand upon these windows, AnaGram also
8 provides, using the \agmenu{Auxiliary Windows} menu, the following
9 supporting windows: \agwindow{Conflict Trace}, \agwindow{Keyword
10 Anomaly Trace}, \agwindow{Reduction Trace}, \agwindow{Rule
11 Derivation}, \agwindow{Token Derivation} and \agwindow{Problem
12 States}. This chapter describes these windows and their use.
13
14
15 \section{The Warnings Window}
16 \index{Warnings}\index{Window}
17
18 If while analyzing your syntax file, AnaGram finds something
19 suspicious, it issues a warning, regardless of the severity of the
20 problem. The \agwindow{Warnings} window will pop up automatically
21 when the analysis has been completed. If the warning is for a syntax
22 error in your input file, you will have to fix it, because AnaGram
23 cannot successfully interpret your syntax file. Otherwise, no matter
24 how serious the warnings may be, AnaGram will create a parser if it is
25 instructed to do so. Thus, if you decide the warnings are of no
26 consequence, you may choose to ignore them. Note that AnaGram will
27 resolve shift-reduce conflicts by choosing the shift action. It will
28 resolve reduce-reduce conflicts by choosing the rule that occurred
29 first in the grammar.
30
31 If you have \index{Syntax error}\index{Errors}syntax errors, AnaGram
32 will synchronize the cursor in the syntax file window with the cursor
33 in the \agwindow{Warnings} window so that as you move through the
34 warning messages, the cursor in the syntax file window will move to
35 the point of the error.
36
37 AnaGram's warning messages are listed, in alphabetical order, in
38 Appendix B, together with explanations of the messages and suggestions
39 as to what to do about them.
40
41
42 \section{The Conflicts Window}
43 \index{Conflicts}\index{Window}
44
45 The \agwindow{Conflicts} window lists the conflicts which AnaGram
46 found in your grammar. The table identifies the parser states in
47 which it found conflicts. For each conflict state it identifies the
48 reducing token for which it had more than one option, and the marked
49 rules which characterize each such option. If AnaGram was able to
50 resolve a conflict by using \agparam{precedence} or \agparam{sticky}
51 declarations, the conflict will be listed in the \index{Resolved
52 Conflicts}\index{Window}\agwindow{Resolved Conflicts} window instead.
53
54 The \agwindow{Conflicts} window and the \agwindow{Resolved Conflicts}
55 window each require several lines to display a conflict. For each
56 conflict, the first, or header, line identifies the conflict state and
57 the conflict token. The conflict token is a terminal token for which
58 there is more than one parser action consistent with the rules of the
59 grammar. If both shift and reduce actions are consistent with the
60 grammar, the conflict is a \agterm{shift-reduce} conflict. If more
61 than one reduce action is consistent with the grammar, the conflict is
62 said to be a \agterm{reduce-reduce} conflict.
63
64 Following the header line, AnaGram lists the grammar rules which the
65 conflict token could satisfy. The rules which correspond to shift
66 actions are listed first, followed by those which correspond to reduce
67 actions. Rules which correspond to shift actions have not yet been
68 completed and have a marked token, shown in a distinctive type face,
69 which represents the next token expected. For such rules, the
70 conflict token is an instance of the marked token and can be shifted
71 into the input buffer. If the rule has been completely matched there
72 is no marked token and the conflict token can reduce the rule. In the
73 following discussion, such a rule is referred to as a \agterm{conflict
74 rule}. If there is a more than one conflict rule, the conflict is
75 called a \agterm{reduce-reduce} conflict; otherwise it is a
76 \agterm{shift-reduce} conflict. Although it is theoretically possible
77 for a conflict to be both a shift-reduce conflict and a reduce-reduce
78 conflict, it is not a common event.
79
80 The \agmenu{Auxiliary Windows} menu for the \agwindow{Conflicts} window
81 provides eleven auxiliary windows to help you understand the
82 circumstances of a conflict. Five of these windows address the
83 conflict directly. These windows are the \agwindow{Conflict Trace}, the
84 \agwindow{Reduction Trace}, the \agwindow{Rule Derivation}, the
85 \agwindow{Token Derivation}, and the \agwindow{Problem States}
86 windows. These windows are keyed to the conflict state, the conflict
87 token, and the conflict rule. If the cursor bar is not highlighting a
88 conflict rule, AnaGram scans down the \agwindow{Conflicts} window for
89 the first conflict rule for the selected state and token. The use of
90 these windows in analyzing conflicts is described below.
91
92 The other windows in the \agmenu{Auxiliary Windows} menu have been
93 described in Chapter 6. These are the \agwindow{Expansion Chain}
94 window, the \agwindow{Rule Context} window, the \agwindow{Reduction
95 States} window, the \agwindow{State Definition} window, the
96 \agwindow{State Expansion} window and the \agwindow{Token Usage}
97 window. The \agwindow{Rule Context} window is keyed to the particular
98 rule highlighted by the cursor. The
99 \index{Expansion Chain}\index{Window}\agwindow{Expansion Chain} window
100 is keyed to the highlighted rule. If the highlighted rule is a
101 characteristic rule no expansion chain is possible, so the option will
102 be greyed out. The \agwindow{Reduction States} window is keyed to the
103 highlighted rule if it is a conflict rule, otherwise to the first
104 conflict rule following the highlighted line. The \agwindow{State
105 Definition} and \agwindow{State Expansion} windows are keyed to the
106 conflict state. The \agwindow{Token Usage} window is keyed to the
107 conflict token.
108
109
110 \section{The Resolved Conflicts Window}
111 \index{Resolved Conflicts}\index{Window}
112
113 AnaGram creates the \agwindow{Resolved Conflicts} window only when the
114 grammar it is analyzing has conflicts and when some of those conflicts
115 have been resolved by precedence rules (see Chapter 9) or by use of
116 the \agparam{sticky} directive. In this case, it will list, in the
117 \agwindow{Resolved Conflicts} window, the conflicts that were resolved
118 and the way in which they were resolved. The layout of the
119 \agwindow{Resolved Conflicts} window is identical to that of the
120 \agwindow{Conflicts} window with one exception. The rules which your
121 parser will observe, selected in accordance with your \agparam{sticky}
122 declarations or precedence rules, are marked with asterisks. Those
123 which your parser will ignore will not be so marked. The
124 \agwindow{Auxiliary Windows} available from the \agwindow{Resolved
125 Conflicts} window are the same as those available from the
126 \agwindow{Conflicts} window.
127
128
129 \section{Conflicts}
130 \index{Conflicts}
131
132 Chapter 4 presented a brief introduction to the problem of conflicts
133 in a grammar. There, conflicts were simply classified as shift-reduce
134 conflicts or as reduce-reduce conflicts. This section examines
135 conflicts in somewhat greater detail. It looks at the causes of
136 conflicts, how to identify the cause of a conflict and how to deal
137 with it.
138
139 Conflicts occur when, in a particular state, a given token, the
140 \index{Conflict token}\index{Token}\agterm{conflict token}, reduces
141 more than one rule, or reduces a rule when the token could also be
142 shifted into the input buffer in accordance with another rule.
143
144 Note that there are no shift-shift conflicts, that is, a particular
145 state could have several
146 \index{Characteristic rules}\index{Rules}characteristic rules
147 accepting the same token as a shift token. A conflict is always
148 associated with reduction. This is because LALR parsers keep track of
149 multiple possibilities in the input stream as long as the input tokens
150 only cause shifts. At the point where a reduction occurs the parser
151 must be able to identify exactly one of the possibilities as
152 consistent with the input.
153
154 \subsection{Shift-Reduce Conflicts}
155 \index{Shift-reduce conflict}\index{Conflicts}
156
157 Unless a conflict is deliberate, the fact that the conflict token
158 reduces a particular rule is usually a surprise. Although some
159 conflicts arise because an LALR-1 parser generator cannot look far
160 enough ahead, most shift-reduce conflicts result from true
161 \index{Ambiguities}ambiguities in your grammar. For example, an
162 erroneous Pascal grammar contained the following productions, which
163 illustrate one of the more common classes of shift-reduce conflicts:
164
165 \begin{indentingcode}{0.4in}
166 parameter list
167 -> expression
168 -> parameter list, expression
169 \end{indentingcode}
170
171 Pascal actually requires that the expressions in a parameter list be
172 separated by commas. In other words, the second rule should read:
173
174 \begin{indentingcode}{0.4in}
175 -> parameter list, comma, expression
176 \end{indentingcode}
177
178 Lacking a requirement for a comma, the erroneous rules specify that a
179 parameter list is simply a sequence of expressions with nothing
180 separating them. For example, using this definition,
181
182 \begin{indentingcode}{0.4in}
183 (a + b) - (c * d)
184 \end{indentingcode}
185
186 can be interpreted either as one expression, if the minus sign is
187 interpreted as a subtraction operator, or as a sequence of two
188 expressions if it is interpreted as a unary minus sign.
189
190 When AnaGram analyzed the grammar with the erroneous rule it found a
191 conflict in the state where the minus sign was expected. If the minus
192 sign is the subtraction operator, it should be shifted onto the input
193 stack. On the other hand, if the minus sign is a unary minus sign,
194 the sum \agcode{(a+b)} should be reduced to \agcode{expression}, and
195 then to \agcode{parameter list}. Thus, this ambiguity gives rise to a
196 shift-reduce conflict, where it is impossible to determine a unique
197 parser action.
198
199 \subsection{Identifying Ambiguities}
200 \index{Ambiguities}
201
202 Unfortunately, shift-reduce conflicts usually show up in parser states
203 which are far removed from the true source of the problem. When your
204 grammar has a shift-reduce conflict of the sort illustrated above,
205 there may be no readily apparent link to the erroneous
206 production. AnaGram provides several options to help you establish the
207 link, all available using the \agmenu{Auxiliary Windows} menu from the
208 \agwindow{Conflicts} window. These options comprise two pre-built
209 grammar traces, the \agwindow{Conflict Trace} and the
210 \agwindow{Reduction Trace}; two special windows called \agwindow{Rule
211 Derivation} and \agwindow{Token Derivation}; the \agwindow{Expansion
212 Chain} window and a special version of the \agwindow{Reduction States}
213 window called the \agwindow{Problem States} window. The four trace
214 and derivation windows are coordinated, so that AnaGram actually
215 computes a mutually consistent set of four whenever you ask for one of
216 them.
217
218 The \agwindow{Conflict} and \agwindow{Reduction Traces} are simply
219 pre-initialized \agwindow{Grammar Traces} which you may use to become
220 more familiar with the circumstances of the conflict. Bear in mind,
221 however, that unless you have set the
222 \index{Traditional engine}\index{Configuration switches}
223 \agparam{traditional engine} configuration switch,
224 the traces will use compound actions, which in some cases will use the
225 default resolution of conflicts. Therefore, unless you set
226 \agparam{traditional engine} it is not always possible to work through
227 an alternate resolution of a conflict.
228
229 The
230 \index{Conflict Trace}\index{Trace}\index{Window}\agwindow{Conflict Trace}
231 shows you one possible sequence of input which will lead to the
232 conflict state. Unless a conflict results from a typo or other
233 clerical error, it is usually a surprise and not something you
234 anticipated. The \agwindow{Conflict Trace} helps you to understand
235 the context of the problem. It is often helpful to look at the
236 \agwindow{Rule Stack} for the \agwindow{Conflict Trace} to see what
237 productions are involved in the conflict. You can inspect the
238 \agwindow{State Definition} window for the conflict state to see why
239 the conflict token can be shifted.
240
241 \index{Reduction Trace}\index{Trace}\index{Window}
242 The \agwindow{Reduction Trace}, for a particular conflict rule,
243 carries forward a \agwindow{Conflict Trace} by selecting the reduce
244 action. In other words, for a parser in the conflict state, the
245 \agwindow{Reduction Trace} shows the result of using the conflict
246 token as a reducing token. The trace shows the result of taking all
247 possible reductions until a state is reached where the lookahead token
248 will shift into the input buffer. If you compare the
249 \agwindow{Reduction Trace} to the \agwindow{Conflict Trace} you will
250 notice that they start off with identical tokens. At some point they
251 diverge. All of the tokens in the \agwindow{Conflict Trace} following
252 this point have reduced to the token in the \agwindow{Reduction
253 Trace}. Usually there are no further tokens in the
254 \agwindow{Reduction Trace}, but if there are, you will notice that
255 they are all zero-length tokens. That is, they all have a null
256 production, so that they come for free, as it were. If you press
257 Enter, to display the \agwindow{Select Input} window for the
258 \agwindow{Reduction Trace}, you will see that the conflict token is
259 acceptable input.
260
261 The \agwindow{Reduction Trace} illustrates the ambiguity in your
262 grammar that caused the conflict: One of the acceptable input tokens
263 in the final state of this trace cannot be distinguished from its
264 predecessor.
265
266 The \index{Expansion Chain}\index{Window}\agwindow{Expansion Chain}
267 window is extremely useful for indicating why a particular grammar
268 rule is an expansion rule in a particular state. Sometimes the chain
269 of substitutions to produce a given rule is quite long, and changes to
270 a grammar can have unexpected effects. The \agwindow{Expansion Chain}
271 window enables you to see how a rule in a conflict state got to be
272 there.
273
274 \index{Problem States}\index{Window}The \agwindow{Problem States}
275 window for a conflict rule is essentially a trimmed
276 \agwindow{Reduction States} window which shows only those reduction
277 states in which the conflict token is acceptable input. Often it is
278 sufficient to look at the \agwindow{Problem States} window to see the
279 source of the problem. The last state in the \agwindow{Reduction
280 Trace} window, will be, of course, one of the states in the
281 \agwindow{Problem States} window.
282
283 \subsection{Rule and Token Derivation Windows}
284
285 If after inspecting the \agwindow{Problem States}, \agwindow{Conflict
286 Trace} and the \agwindow{Reduction Trace}, the conflict is still a
287 mystery, you may pinpoint the problem by looking at the
288 \agwindow{Rule} and \agwindow{Token Derivation} windows for the
289 conflict. Both windows contain a sequence of rules, and both begin
290 with the same rule, the rule which is the root cause of the conflict.
291 This rule is one of the characteristic rules of the last state in the
292 \agwindow{Reduction Trace} and is the rule which produces the ambiguity.
293
294 \index{Rule Derivation}\index{Window}In both windows, each rule,
295 except the last line of the \agwindow{Rule Derivation}, contains a
296 marked token in a distinctive typeface. The essence of the ambiguity
297 is that it is impossible to tell where in the input stream the marked
298 token in the top line of the \agwindow{Rule Derivation} ends and the
299 marked token in the top line of the \agwindow{Token Derivation}
300 begins.
301
302 In both derivation windows, each line after the first consists of a
303 rule produced by the marked token on the previous line. In the
304 \agwindow{Rule Derivation}, except on the first line, the marked token
305 is either the last token in the rule, or all subsequent tokens have
306 null productions. The last rule in the derivation window is the
307 conflict rule you selected in the \agwindow{Conflicts} window. Thus
308 the \agwindow{Rule Derivation} window shows you how the rule involved
309 in the conflict derives from the root.
310
311 \index{Window}\index{Token Derivation}In the \agwindow{Token
312 Derivation} window, except on the first line, the marked token is
313 either the first token in the rule, or all preceding tokens have null
314 productions. The marked token in the last line of the list is the
315 conflict token.
316
317 The \agwindow{Rule Derivation} and \agwindow{Token Derivation} windows
318 each have five auxiliary windows. The \agwindow{Rule Context} window
319 is keyed simply to the highlighted rule. The other four windows, the
320 \agwindow{Expansion Rules} window, the \agwindow{Productions} window,
321 the \agwindow{Set Elements} window, and the \agwindow{Token Usage}
322 window are keyed to the marked token. Remember that there is no marked
323 token on the last line of the \agwindow{Rule Derivation} window.
324
325 \subsection{Resolving Ambiguities}
326 \index{Ambiguities}
327
328 The shift-reduce conflict discussed earlier stemmed from an oversight
329 in writing a grammar: the grammar specified a list of items, in this
330 case expressions, which could not be distinguished without a separator
331 of some sort. This conflict can be resolved simply by rewriting the
332 grammar to require a comma separating the expressions.
333
334 A similar problem comes from specifying a list of lists without any
335 separator. Sometimes such a specification is made inadvertently. The
336 erroneous Pascal grammar cited above had such a conflict. Here are
337 the productions that caused trouble:
338
339 \begin{indentingcode}{0.4in}
340 declaration part
341 -> declaration section
342 -> declaration part, declaration section
343
344 declaration section
345 -> proc and func declaration part
346
347 proc and func declaration part
348 -> proc or func declaration
349 -> proc and func declaration part, proc or func declaration
350
351 proc or func declaration
352 -> proc declaration
353 -> func declaration
354 \end{indentingcode}
355
356 Of course, there are other productions for \agcode{declaration
357 section} so that this grammar looks perfectly reasonable on the
358 surface. A \agcode{declaration part} is a sequence of
359 \agcode{declaration sections}. One type of \agcode{declaration
360 section}, the \agcode{proc and func declaration part}, is a sequence
361 of procedure or function definitions. Unfortunately, if you have two
362 procedure definitions in a row, it could be interpreted as one
363 \agcode{proc and func declaration part} consisting of two declarations
364 or two \agcode{proc and func declaration parts} each consisting of one
365 declaration. The problem here is that there is a little too much
366 recursion. You can fix this problem by replacing
367
368 \begin{indentingcode}{0.4in}
369 declaration section
370 -> proc and func declaration part
371
372 proc and func declaration part
373 -> proc or func declaration
374 -> proc and func declaration part, proc or func declaration
375 \end{indentingcode}
376
377 with
378
379 \begin{indentingcode}{0.4in}
380 declaration section
381 -> proc or func declaration
382 \end{indentingcode}
383
384 After a little reflection you will see that this simplification, which
385 eliminates one source of recursion, entails no loss of generality.
386
387 In the above examples, the ambiguities were the results of errors in
388 the specification of the grammar, and could be easily fixed by
389 correcting the grammar. Sometimes, however, ambiguities arise which
390 are not so easily corrected. Consider, for instance, a simple token
391 scanner:
392
393 \begin{indentingcode}{0.4in}
394 tokens
395 -> [token, white space?...]...
396
397 token
398 -> name
399 -> number
400 -> string
401 -> character constant
402 -> punctuation character
403 \end{indentingcode}
404
405 where \agcode{name}, \agcode{number}, etc., are defined in the
406 conventional manner.
407
408 Such token scanners are normally written using conventional
409 programming techniques, or using programs such as \agfile{lex} which
410 are based on the analysis of regular expressions. Such programs
411 simply work on a character by character basis and do not see that
412 ``maryjane'' could be interpreted as one name or as two, or as even
413 more, since the separating white space is declared to be optional.
414 AnaGram, however, since it takes a broader view, sees that the
415 specifications are basically ambiguous and complains. Note, however,
416 that the default resolution of conflicts, in this case choosing the
417 shift action, will provide a parser that does precisely what you would
418 normally intend. Thus, the major concern is to make sure that the
419 conflicts caused by this situation do not mask other more serious
420 conditions.
421
422 While it is possible to rewrite this grammar so that it is not
423 ambiguous, such a grammar would lack clarity and conciseness. AnaGram
424 provides two alternatives for dealing with this problem: the
425 \index{Declaration}\index{Subgrammar declaration}\agparam{subgrammar}
426 declaration and the
427 \index{Sticky declaration}\index{Declaration}\agparam{sticky}
428 declaration, which accomplish the desired end using very different
429 techniques.
430
431 In the definition above, a grammar for token, taken all by itself,
432 would not normally show any ambiguities. We could write a
433 \index{Lexical scanner}lexical scanner that would identify a token,
434 pass it on to the parser, and then go on to the next. We would never
435 worry about any ambiguities. In this situation, it is clear that what
436 we mean by \agcode{token} is defined entirely by the grammar for
437 \agcode{token}, not by its usage within the rest of the grammar. When
438 we say that we have a sequence of tokens, \agcode{token...}, we
439 usually mean that, of course, two tokens, such as ``mary'' and
440 ``jane'' have to be separated by punctuation or space to be
441 distinguished as separate tokens. The
442 \index{Declaration}\index{Subgrammar declaration}\agparam{subgrammar}
443 declaration allows you to specify that a particular \index{Nonterminal
444 token}\index{Token}nonterminal token in your grammar is to be treated
445 as though it were defined by its own complete grammar, without regard
446 to its usage in the grander scheme of things. In the example above,
447 add the following:
448
449 \begin{indentingcode}{0.4in}
450 [ subgrammar \bra token \ket ]
451 \end{indentingcode}
452
453 Now, when AnaGram conducts its search for reducing tokens, it will
454 limit its search to those it would find if your grammar consisted only
455 of the grammar rules necessary to define token. When it does this, it
456 never even finds the reducing tokens that appear to cause conflicts.
457 When you fix a conflict with the \agparam{subgrammar} declaration, the
458 conflict vanishes entirely. It is not recorded in the
459 \agwindow{Resolved Conflicts} window. For this reason, the
460 \agparam{subgrammar} declaration should be used only when you are
461 absolutely sure of the nature of the problem you are trying to solve.
462
463 A somewhat more precise mechanism for dealing with the conflicts
464 described above is to declare name and number as \index{Sticky
465 declaration}\index{Declaration}\agparam{sticky}:
466
467 \begin{indentingcode}{0.4in}
468 [ sticky \bra name, number \ket ]
469 \end{indentingcode}
470
471 This means that whenever the last token in your parser's input buffer
472 is \agcode{name} or \agcode{number}, it will shift in any acceptable
473 input character, and will disregard conflicting reduce options.
474 AnaGram will continue to identify conflicts in your grammar due to
475 this cause, but will record them in the \agwindow{Resolved Conflicts}
476 window. Any other conflicts you may have will still show up in the
477 \agwindow{Conflicts} window.
478
479 Sometimes it may be desirable to introduce ambiguities deliberately
480 into a grammar. For example, although it is easy enough to write
481 normal grammar rules for the conventional expression logic used in
482 programming languages, it is possible to produce a somewhat more
483 compact, faster running parser by writing an ambiguous grammar and
484 using AnaGram's precedence declarations to resolve the ambiguities.
485 An example of this technique is given in Chapter 9.
486
487 \subsection{Null Productions}
488
489 \index{Null productions}\index{Production}Null productions can be a
490 fertile source of conflicts. Fortunately, many of them can be easily
491 resolved by simply rewriting a few rules. Consider, for example, the
492 following:
493
494 \begin{indentingcode}{0.4in}
495 name
496 -> capital letter, letter..., white space
497
498 initial
499 -> capital letter, period, white space
500
501 full name
502 -> name, initial?, name
503 \end{indentingcode}
504
505 In this example, \agcode{initial?} is a virtual production which
506 AnaGram replaces with
507 % not ``with''...
508
509 \begin{indentingcode}{0.4in}
510 initial?
511 ->
512 -> initial
513 \end{indentingcode}
514
515 so \agcode{initial?} involves a null production.
516
517 Recall that LALR-1 parsers cannot look ahead more than one character.
518 The problem with this grammar, then, is that given the input ``John
519 Q'', there is no way to tell whether the ``Q'' is an initial or the
520 first letter of the last name. Is this ``John Q. Public'' or ``John
521 Quincy''? Adding another production for ``full name'' so that
522 ``initial'' need not be marked as optional allows the parser to defer
523 its decision:
524
525 \begin{indentingcode}{0.4in}
526 full name
527 -> name, name
528 -> name, initial, name
529 \end{indentingcode}
530
531 Notice that the technique here is to \emph{avoid a reduction} until
532 the parser has enough information to proceed.
533
534 \subsection{Reduce-Reduce Conflicts}
535
536 \index{Reduce-reduce conflicts}\index{Conflicts}Reduce-reduce
537 conflicts are usually caused by the limited lookahead of LR-1 parsing.
538 Very occasionally, you might find a grammar with a reduce-reduce
539 conflict that is caused by the LALR-1 approximation to LR-1 parsing
540 that AnaGram supports. Such conflicts are quite easy to resolve,
541 however, by almost trivial rewriting.
542
543 Reduce-reduce conflicts commonly involve productions that are
544 intrinsically indistinguishable, for instance:
545
546 \begin{indentingcode}{0.4in}
547 tweedledee
548 -> tiddly, wink
549
550 tweedledum
551 -> tiddly, wink
552 \end{indentingcode}
553
554 However, it may be possible to distinguish \agcode{tweedledee} from
555 \agcode{tweedledum} by context. Suppose
556
557 \begin{indentingcode}{0.4in}
558 gibber
559 -> tweedledee, apples
560 -> tweedledum, oranges
561 \end{indentingcode}
562
563 Now if you can distinguish \agcode{apples} from \agcode{oranges} on
564 the basis of the very first character, your parser will be able to
565 tell \agcode{tweedledee} from \agcode{tweedledum}. If, on the other
566 hand, it takes more than one input character to tell whether you have
567 \agcode{apples} or \agcode{oranges}, you have a reduce-reduce conflict.
568
569 This example, of course, is rather far-fetched, since the potential
570 confusion between \agcode{tweedledee} and \agcode{tweedledum} is
571 obvious. More commonly, the definitions and usage of
572 \agcode{tweedledee} and \agcode{tweedledum} are physically far apart
573 in the grammar. At one place in a grammar:
574
575 \begin{indentingcode}{0.4in}
576 balderdash
577 -> tweedledee, apples
578 \end{indentingcode}
579
580 Elsewhere, seemingly unrelated:
581
582 \begin{indentingcode}{0.4in}
583 buncombe
584 -> tweedledum, oranges
585 \end{indentingcode}
586
587 In a dark corner:
588
589 \begin{indentingcode}{0.4in}
590 gossip
591 -> buncombe, slander
592 \end{indentingcode}
593
594 And in yet another dark corner:
595
596 \begin{indentingcode}{0.4in}
597 small talk
598 -> gossip
599 -> rumors
600 -> balderdash
601 \end{indentingcode}
602
603 Once again, if \agcode{apples} and \agcode{oranges} cannot be
604 distinguished by the very first character, you have a reduce-reduce
605 conflict that is somewhat harder to find than the one discussed above,
606 simply because the trail is somewhat more complicated to follow.
607
608 \subsection{Identifying Reduce-Reduce Conflicts}
609 \index{Reduce-reduce conflicts}\index{Conflicts}
610
611 Although the machinery described above for identifying shift-reduce
612 conflicts can be used to investigate reduce-reduce conflicts, many
613 yield to comparison of the \agwindow{Reduction States} windows for the
614 two rules. For the two rules to be distinguishable, no token which is
615 acceptable input for any of the reduction states belonging to one rule
616 can be acceptable input for any reduction state belonging to the
617 other.
618
619 Therefore, when you have a reduce-reduce conflict, there is an overlap
620 among the tokens in the reduction states for one rule and the tokens
621 in the reduction states for the other rule. Often, however, there are
622 many reduction states which are irrelevant as far as a particular
623 conflict is concerned. So AnaGram provides a special window, the
624 \agwindow{Problem States} window, in the \agmenu{Auxiliary Windows}
625 menu for the \agwindow{Conflicts} and \agwindow{Resolved Conflicts}
626 windows. The \agwindow{Problem States} window for a particular rule
627 lists those, and only those, reduction states for the rule for which
628 the conflict token is acceptable input. This usually reduces the size
629 of the table and often makes the source of the problem immediately
630 obvious.
631
632 \subsection{Resolving Reduce-Reduce Conflicts}
633
634 If you build a parser for a grammar that includes a reduce-reduce
635 conflict, AnaGram will arbitrarily reduce the first rule it
636 encountered while reading your grammar, that is, the rule with the
637 smaller rule number. In some circumstances this default may be
638 acceptable. In other circumstances more may be required.
639
640 Often, it is possible to get rid of reduce-reduce conflicts by
641 rewriting your grammar so that tokens that were once reducing tokens
642 for a rule become part of the rule itself. Thus, in the previous
643 example, you could write:
644
645 \begin{indentingcode}{0.4in}
646 tweedledee
647 -> tiddly, wink, apples
648
649 tweedledum
650 -> tiddly, wink, oranges
651 \end{indentingcode}
652
653 In other words, if \agcode{apples} always follows \agcode{tweedledee},
654 you can move \agcode{apples} ``down'' in the grammar by making it part
655 of \agcode{tweedledee}. Moving tokens ``down'' in this manner always
656 reduces the likelihood of conflicts. But what if \agcode{apples}
657 doesn't always follow \agcode{tweedledee}? Then it is sometimes
658 worthwhile to make two versions of \agcode{tweedledee}:
659
660 \begin{indentingcode}{0.4in}
661 simple tweedledee
662 -> tiddly, wink
663
664 complex tweedledee
665 -> tiddly, wink, apples
666 \end{indentingcode}
667
668 and in each circumstance use whichever is appropriate. As with the
669 earlier shift-reduce example, the technique is to avoid any reduction
670 until the tokens can be distinguished.
671
672 Sometimes, however, reduce-reduce conflicts reflect a language design
673 that really demands greater lookahead than can be accommodated with an
674 LR-1 parser. In these cases you have two options: you can consider a
675 multi-stage parser or you can implement an ad hoc lookahead procedure.
676 In the latter case, you can use semantically determined productions to
677 control your parser depending on the results of your lookahead
678 procedure.
679
680 \subsection{Conflicts due to LALR-1 Simplifications}
681 \index{LALR simplifications}\index{Conflicts}
682
683 LALR-1 \index{Parser generator}parser generators occasionally report a
684 reduce-reduce conflict where an LR-1 parser would successfully resolve
685 the conflict. How significant is this problem? An early prototype
686 version of AnaGram actually had full LR-1 capability. Unfortunately
687 LR-1 parsers are generally much bigger than LALR-1 parsers. Creating
688 them takes longer and more resources so that for given computer
689 resources an LALR-1 parser generator can handle larger, more complex
690 grammars than an LR-1 parser generator. To provide a compromise, the
691 prototype was modified to do an LALR-1 analysis first. Then, only if
692 there were reduce-reduce conflicts did it do an LR-1 analysis. Over a
693 substantial period of testing, the only reduce-reduce conflicts that
694 were resolved by the LR-1 analysis were those that were created
695 deliberately as tests. Eventually the LR-1 capabilities were removed
696 from AnaGram since they added complexity without concomitant benefits.
697 To understand why this is so, it is worthwhile to look at an example.
698 Let us recall the definitions of \agcode{tweedledee} and
699 \agcode{tweedledum} made above:
700
701 \begin{indentingcode}{0.4in}
702 tweedledee
703 -> tiddly, wink
704 tweedledum
705 -> tiddly, wink
706 \end{indentingcode}
707
708 Remember that we defined gibber thus:
709
710 \begin{indentingcode}{0.4in}
711 gibber
712 -> tweedledee, apples
713 -> tweedledum, oranges
714 \end{indentingcode}
715
716 Now suppose that we also define jabber in a contrary way:
717
718 \begin{indentingcode}{0.4in}
719 jabber
720 -> tweedledee, oranges
721 -> tweedledum, apples
722 \end{indentingcode}
723
724 Now suppose we use \agcode{gibber} in one part of our grammar and
725 \agcode{jabber} in another, completely unrelated, part. An LR-1
726 parser generator would succeed in keeping \agcode{tweedledee} and
727 \agcode{tweedledum} straight in spite of our efforts to confuse it.
728 An LALR-1 parser generator, however, succumbs to our efforts and
729 reports a conflict. A trivial rewrite of the grammar, however,
730 removes the conflict:
731
732 \begin{indentingcode}{0.4in}
733 gibber tweedledee
734 -> tiddly, wink
735 gibber tweedledum
736 -> tiddly, wink
737
738 jabber tweedledee
739 -> tiddly, wink
740 jabber tweedledum
741 -> tiddly, wink
742
743 gibber
744 -> gibber tweedledee, apples
745 -> gibber tweedledum, oranges
746
747 jabber
748 -> jabber tweedledee, oranges
749 -> jabber tweedledum, apples
750 \end{indentingcode}
751
752 The effect of this rewrite is to accomplish the same separation of
753 states that an LR-1 parser generator would make.
754
755 \section{Keyword Anomalies}
756 \index{Keyword Anomalies}\index{Anomaly}
757
758 In syntax directed parsing, it is assumed that input tokens can be
759 uniquely identified. In the case of keywords, however, there is the
760 possibility that the individual characters making up the keyword, as
761 well as the keyword taken as a whole, could constitute legitimate
762 input under some circumstances. Thus keywords, though a powerful and
763 useful tool, are not completely consistent with the assumptions that
764 underlie syntax directed parsing. This can occasionally give rise to a
765 type of conflict, diagnosed by AnaGram, called a \agterm{keyword
766 anomaly}. AnaGram is quite conservative in its diagnoses, so that many
767 keyword anomalies it reports are actually innocuous and can be safely
768 ignored. Nevertheless, anomalies should be investigated carefully.
769
770 AnaGram only recognizes a keyword in those states where the keyword is
771 allowable input. If AnaGram recognizes a keyword, the keyword takes
772 precedence over the sequence of characters that constitute the
773 keyword. If there are several allowable keywords, the longest takes
774 precedence.
775
776 Under certain circumstances it is possible for a grammar to have a
777 state where a particular keyword may or may not be allowable input
778 depending on how the parser gets to that state. That is, for some
779 sequences of tokens which lead to the state, the keyword is allowable,
780 in others it isn't. In such a state, the keyword must always cause a
781 reduction, since if the keyword were to shift, it would always be
782 allowable input. When AnaGram finds such a state in your parser, it
783 checks to see what the parser action would be if the parser ignored
784 the keyword and interpreted it in terms of its constituent
785 characters. If the parser action would be anything other than reducing
786 the same rule as the keyword or a syntax error, AnaGram will diagnose
787 a keyword anomaly in this state.
788
789 To summarize, the nature of a keyword anomaly is this: It is possible
790 for your parser to get to the anomalous state by a route for which the
791 keyword is not allowable. In the anomalous state, at least the first
792 constituent character of the keyword is allowable in its own right and
793 causes a parser action different from that caused by the
794 keyword. Nonetheless, if the keyword should appear in the input, your
795 parser will recognize it, perform at least one reduction action, get
796 to a state where the keyword is not recognized, and finally, in the
797 wrong state, will try to interpret the keyword in terms of its
798 constituent characters.
799
800 Now, what do you do about a keyword anomaly? If in the anomalous state
801 the characters which constitute the keyword would never be
802 syntactically admissible as individual characters, the keyword anomaly
803 is innocuous in that the only problem is that the recognition of the
804 syntax error is somewhat delayed. If however, the characters which
805 constitute the keyword are syntactically admissible, you have a
806 problem that needs to be fixed.
807
808 In many programming languages, many keywords are treated as
809 \agterm{reserved words}, that is, as words that may not be otherwise
810 used in a conforming program. Version 1.5 of AnaGram has added a
811 \index{Reserve keywords}\agparam{reserve keywords} statement that you
812 can use to specify that the text of a keyword is \emph{never}
813 admissible as individual characters. Thus you will never get a
814 keyword anomaly diagnostic for a keyword that has been listed in a
815 \index{Reserve keywords}\agparam{reserve keywords} statement.
816 The \agparam{reserve keywords} statement is described in greater
817 detail in \S 8.2.28.
818 % XXX this is the only place in the manual (so far?) that uses the
819 % \S section-symbol. (Also, this should be a tex crossreference.)
820
821 \subsection{The Keyword Anomalies Window}
822 \index{Keyword Anomalies}\index{Window}
823
824 If your grammar has a keyword anomaly, AnaGram will provide a warning
825 message and will create a \agwindow{Keyword Anomalies} window to
826 provide details about the anomalies in your grammar.
827
828 Each entry in the \agwindow{Keyword Anomalies} window consists of two
829 lines. The first line identifies the state at which the anomaly
830 occurs and the offending keyword. The second line identifies the rule
831 which the keyword may erroneously reduce. The \agmenu{Auxiliary
832 Windows} menu provides three auxiliary windows keyed directly to the
833 anomaly to help you determine the nature of the problem: the
834 \agwindow{Keyword Anomaly Trace} window, the \agwindow{Reduction
835 Trace} window, and the \agwindow{Rule Derivation} window. In
836 addition, three other windows provide supporting information: the
837 \agwindow{Reduction States} window, the \agwindow{Rule Context} window
838 and the \agwindow{State Definition} window. These windows are
839 discussed in Chapter 6.
840
841 The \agwindow{Keyword Anomaly Trace} window, a pre-built
842 \agwindow{Grammar Trace}, shows a sequence of tokens that illustrates
843 one way to encounter the anomaly. The \agwindow{Reduction Trace}
844 window shows the result of performing the reduction action. Just as
845 for conflicts, the true source of the problem is often at some remove
846 from the actual state in which the problem manifests itself. One of
847 the characteristic rules for the last state in the \agwindow{Reduction
848 Trace} window is, in fact, the true source of the problem. The
849 \agwindow{Rule Derivation} window can show you which of the
850 characteristic rules is the offending party and how the rule in the
851 anomalous state derives from it.
852
853 \subsection{Dealing with Keyword Anomalies}
854
855 If your grammar has a keyword anomaly, you should study it carefully.
856 It may be that the anomaly is completely innocuous, that is, the
857 characters which constitute the keyword are not otherwise legitimate
858 input. In this case you can simply ignore the message.
859
860 The basic approach to dealing with keyword anomalies is to revise your
861 grammar to make a better distinction between the contexts in which the
862 keyword is acceptable and those in which it may be confused with
863 otherwise valid input. Several other techniques may be useful:
864
865 For an anomaly to occur, there must be, \emph{in the anomaly state},
866 an alternative interpretation of some of the initial characters of the
867 keyword. If the anomaly occurs because you have several keywords that
868 begin in a similar manner, you may use the
869 \index{Distinguish keywords}\index{Keywords}\agparam{distinguish keywords}
870 statement (see Chapter 8) to keep them properly separated. Note that
871 having several keywords that begin with the same characters does
872 \emph{not} necessarily cause an anomaly.
873
874 If the keyword is an alphabetic keyword that can follow conventional
875 variable name tokens, you may wish to use a \agparam{sticky}
876 declaration or the
877 \index{Distinguish lexemes}\index{Configuration switch}
878 \agparam{distinguish lexemes} switch
879 to inhibit recognition of the keyword when it is preceded immediately
880 by a name.
881
882 If the keyword can be treated as a reserved word, the \index{Reserve
883 keywords}\agparam{reserve keywords} statement will eliminate the
884 anomaly.