Mercurial > ~dholland > hg > ag > index.cgi
view doc/manual/xg-iii.tex @ 24:a4899cdfc2d6 default tip
Obfuscate the regexps to strip off the IBM compiler's copyright banners.
I don't want bots scanning github to think they're real copyright
notices because that could cause real problems.
author | David A. Holland |
---|---|
date | Mon, 13 Jun 2022 00:40:23 -0400 |
parents | 13d2b8934445 |
children |
line wrap: on
line source
\chapter{Exploring Your Grammar III: Diagnostic Windows} If AnaGram finds problems with your grammar, it creates a number of windows to help you identify and correct the problems. These windows are the \agwindow{Warnings} window, the \agwindow{Conflicts} window, the \agwindow{Resolved Conflicts} window, and the \agwindow{Keyword Anomalies} window. To expand upon these windows, AnaGram also provides, using the \agmenu{Auxiliary Windows} menu, the following supporting windows: \agwindow{Conflict Trace}, \agwindow{Keyword Anomaly Trace}, \agwindow{Reduction Trace}, \agwindow{Rule Derivation}, \agwindow{Token Derivation} and \agwindow{Problem States}. This chapter describes these windows and their use. \section{The Warnings Window} \index{Warnings}\index{Window} If while analyzing your syntax file, AnaGram finds something suspicious, it issues a warning, regardless of the severity of the problem. The \agwindow{Warnings} window will pop up automatically when the analysis has been completed. If the warning is for a syntax error in your input file, you will have to fix it, because AnaGram cannot successfully interpret your syntax file. Otherwise, no matter how serious the warnings may be, AnaGram will create a parser if it is instructed to do so. Thus, if you decide the warnings are of no consequence, you may choose to ignore them. Note that AnaGram will resolve shift-reduce conflicts by choosing the shift action. It will resolve reduce-reduce conflicts by choosing the rule that occurred first in the grammar. If you have \index{Syntax error}\index{Errors}syntax errors, AnaGram will synchronize the cursor in the syntax file window with the cursor in the \agwindow{Warnings} window so that as you move through the warning messages, the cursor in the syntax file window will move to the point of the error. AnaGram's warning messages are listed, in alphabetical order, in Appendix B, together with explanations of the messages and suggestions as to what to do about them. \section{The Conflicts Window} \index{Conflicts}\index{Window} The \agwindow{Conflicts} window lists the conflicts which AnaGram found in your grammar. The table identifies the parser states in which it found conflicts. For each conflict state it identifies the reducing token for which it had more than one option, and the marked rules which characterize each such option. If AnaGram was able to resolve a conflict by using \agparam{precedence} or \agparam{sticky} declarations, the conflict will be listed in the \index{Resolved Conflicts}\index{Window}\agwindow{Resolved Conflicts} window instead. The \agwindow{Conflicts} window and the \agwindow{Resolved Conflicts} window each require several lines to display a conflict. For each conflict, the first, or header, line identifies the conflict state and the conflict token. The conflict token is a terminal token for which there is more than one parser action consistent with the rules of the grammar. If both shift and reduce actions are consistent with the grammar, the conflict is a \agterm{shift-reduce} conflict. If more than one reduce action is consistent with the grammar, the conflict is said to be a \agterm{reduce-reduce} conflict. Following the header line, AnaGram lists the grammar rules which the conflict token could satisfy. The rules which correspond to shift actions are listed first, followed by those which correspond to reduce actions. Rules which correspond to shift actions have not yet been completed and have a marked token, shown in a distinctive type face, which represents the next token expected. For such rules, the conflict token is an instance of the marked token and can be shifted into the input buffer. If the rule has been completely matched there is no marked token and the conflict token can reduce the rule. In the following discussion, such a rule is referred to as a \agterm{conflict rule}. If there is a more than one conflict rule, the conflict is called a \agterm{reduce-reduce} conflict; otherwise it is a \agterm{shift-reduce} conflict. Although it is theoretically possible for a conflict to be both a shift-reduce conflict and a reduce-reduce conflict, it is not a common event. The \agmenu{Auxiliary Windows} menu for the \agwindow{Conflicts} window provides eleven auxiliary windows to help you understand the circumstances of a conflict. Five of these windows address the conflict directly. These windows are the \agwindow{Conflict Trace}, the \agwindow{Reduction Trace}, the \agwindow{Rule Derivation}, the \agwindow{Token Derivation}, and the \agwindow{Problem States} windows. These windows are keyed to the conflict state, the conflict token, and the conflict rule. If the cursor bar is not highlighting a conflict rule, AnaGram scans down the \agwindow{Conflicts} window for the first conflict rule for the selected state and token. The use of these windows in analyzing conflicts is described below. The other windows in the \agmenu{Auxiliary Windows} menu have been described in Chapter 6. These are the \agwindow{Expansion Chain} window, the \agwindow{Rule Context} window, the \agwindow{Reduction States} window, the \agwindow{State Definition} window, the \agwindow{State Expansion} window and the \agwindow{Token Usage} window. The \agwindow{Rule Context} window is keyed to the particular rule highlighted by the cursor. The \index{Expansion Chain}\index{Window}\agwindow{Expansion Chain} window is keyed to the highlighted rule. If the highlighted rule is a characteristic rule no expansion chain is possible, so the option will be greyed out. The \agwindow{Reduction States} window is keyed to the highlighted rule if it is a conflict rule, otherwise to the first conflict rule following the highlighted line. The \agwindow{State Definition} and \agwindow{State Expansion} windows are keyed to the conflict state. The \agwindow{Token Usage} window is keyed to the conflict token. \section{The Resolved Conflicts Window} \index{Resolved Conflicts}\index{Window} AnaGram creates the \agwindow{Resolved Conflicts} window only when the grammar it is analyzing has conflicts and when some of those conflicts have been resolved by precedence rules (see Chapter 9) or by use of the \agparam{sticky} directive. In this case, it will list, in the \agwindow{Resolved Conflicts} window, the conflicts that were resolved and the way in which they were resolved. The layout of the \agwindow{Resolved Conflicts} window is identical to that of the \agwindow{Conflicts} window with one exception. The rules which your parser will observe, selected in accordance with your \agparam{sticky} declarations or precedence rules, are marked with asterisks. Those which your parser will ignore will not be so marked. The \agwindow{Auxiliary Windows} available from the \agwindow{Resolved Conflicts} window are the same as those available from the \agwindow{Conflicts} window. \section{Conflicts} \index{Conflicts} Chapter 4 presented a brief introduction to the problem of conflicts in a grammar. There, conflicts were simply classified as shift-reduce conflicts or as reduce-reduce conflicts. This section examines conflicts in somewhat greater detail. It looks at the causes of conflicts, how to identify the cause of a conflict and how to deal with it. Conflicts occur when, in a particular state, a given token, the \index{Conflict token}\index{Token}\agterm{conflict token}, reduces more than one rule, or reduces a rule when the token could also be shifted into the input buffer in accordance with another rule. Note that there are no shift-shift conflicts, that is, a particular state could have several \index{Characteristic rules}\index{Rules}characteristic rules accepting the same token as a shift token. A conflict is always associated with reduction. This is because LALR parsers keep track of multiple possibilities in the input stream as long as the input tokens only cause shifts. At the point where a reduction occurs the parser must be able to identify exactly one of the possibilities as consistent with the input. \subsection{Shift-Reduce Conflicts} \index{Shift-reduce conflict}\index{Conflicts} Unless a conflict is deliberate, the fact that the conflict token reduces a particular rule is usually a surprise. Although some conflicts arise because an LALR-1 parser generator cannot look far enough ahead, most shift-reduce conflicts result from true \index{Ambiguities}ambiguities in your grammar. For example, an erroneous Pascal grammar contained the following productions, which illustrate one of the more common classes of shift-reduce conflicts: \begin{indentingcode}{0.4in} parameter list -> expression -> parameter list, expression \end{indentingcode} Pascal actually requires that the expressions in a parameter list be separated by commas. In other words, the second rule should read: \begin{indentingcode}{0.4in} -> parameter list, comma, expression \end{indentingcode} Lacking a requirement for a comma, the erroneous rules specify that a parameter list is simply a sequence of expressions with nothing separating them. For example, using this definition, \begin{indentingcode}{0.4in} (a + b) - (c * d) \end{indentingcode} can be interpreted either as one expression, if the minus sign is interpreted as a subtraction operator, or as a sequence of two expressions if it is interpreted as a unary minus sign. When AnaGram analyzed the grammar with the erroneous rule it found a conflict in the state where the minus sign was expected. If the minus sign is the subtraction operator, it should be shifted onto the input stack. On the other hand, if the minus sign is a unary minus sign, the sum \agcode{(a+b)} should be reduced to \agcode{expression}, and then to \agcode{parameter list}. Thus, this ambiguity gives rise to a shift-reduce conflict, where it is impossible to determine a unique parser action. \subsection{Identifying Ambiguities} \index{Ambiguities} Unfortunately, shift-reduce conflicts usually show up in parser states which are far removed from the true source of the problem. When your grammar has a shift-reduce conflict of the sort illustrated above, there may be no readily apparent link to the erroneous production. AnaGram provides several options to help you establish the link, all available using the \agmenu{Auxiliary Windows} menu from the \agwindow{Conflicts} window. These options comprise two pre-built grammar traces, the \agwindow{Conflict Trace} and the \agwindow{Reduction Trace}; two special windows called \agwindow{Rule Derivation} and \agwindow{Token Derivation}; the \agwindow{Expansion Chain} window and a special version of the \agwindow{Reduction States} window called the \agwindow{Problem States} window. The four trace and derivation windows are coordinated, so that AnaGram actually computes a mutually consistent set of four whenever you ask for one of them. The \agwindow{Conflict} and \agwindow{Reduction Traces} are simply pre-initialized \agwindow{Grammar Traces} which you may use to become more familiar with the circumstances of the conflict. Bear in mind, however, that unless you have set the \index{Traditional engine}\index{Configuration switches} \agparam{traditional engine} configuration switch, the traces will use compound actions, which in some cases will use the default resolution of conflicts. Therefore, unless you set \agparam{traditional engine} it is not always possible to work through an alternate resolution of a conflict. The \index{Conflict Trace}\index{Trace}\index{Window}\agwindow{Conflict Trace} shows you one possible sequence of input which will lead to the conflict state. Unless a conflict results from a typo or other clerical error, it is usually a surprise and not something you anticipated. The \agwindow{Conflict Trace} helps you to understand the context of the problem. It is often helpful to look at the \agwindow{Rule Stack} for the \agwindow{Conflict Trace} to see what productions are involved in the conflict. You can inspect the \agwindow{State Definition} window for the conflict state to see why the conflict token can be shifted. \index{Reduction Trace}\index{Trace}\index{Window} The \agwindow{Reduction Trace}, for a particular conflict rule, carries forward a \agwindow{Conflict Trace} by selecting the reduce action. In other words, for a parser in the conflict state, the \agwindow{Reduction Trace} shows the result of using the conflict token as a reducing token. The trace shows the result of taking all possible reductions until a state is reached where the lookahead token will shift into the input buffer. If you compare the \agwindow{Reduction Trace} to the \agwindow{Conflict Trace} you will notice that they start off with identical tokens. At some point they diverge. All of the tokens in the \agwindow{Conflict Trace} following this point have reduced to the token in the \agwindow{Reduction Trace}. Usually there are no further tokens in the \agwindow{Reduction Trace}, but if there are, you will notice that they are all zero-length tokens. That is, they all have a null production, so that they come for free, as it were. If you press Enter, to display the \agwindow{Select Input} window for the \agwindow{Reduction Trace}, you will see that the conflict token is acceptable input. The \agwindow{Reduction Trace} illustrates the ambiguity in your grammar that caused the conflict: One of the acceptable input tokens in the final state of this trace cannot be distinguished from its predecessor. The \index{Expansion Chain}\index{Window}\agwindow{Expansion Chain} window is extremely useful for indicating why a particular grammar rule is an expansion rule in a particular state. Sometimes the chain of substitutions to produce a given rule is quite long, and changes to a grammar can have unexpected effects. The \agwindow{Expansion Chain} window enables you to see how a rule in a conflict state got to be there. \index{Problem States}\index{Window}The \agwindow{Problem States} window for a conflict rule is essentially a trimmed \agwindow{Reduction States} window which shows only those reduction states in which the conflict token is acceptable input. Often it is sufficient to look at the \agwindow{Problem States} window to see the source of the problem. The last state in the \agwindow{Reduction Trace} window, will be, of course, one of the states in the \agwindow{Problem States} window. \subsection{Rule and Token Derivation Windows} If after inspecting the \agwindow{Problem States}, \agwindow{Conflict Trace} and the \agwindow{Reduction Trace}, the conflict is still a mystery, you may pinpoint the problem by looking at the \agwindow{Rule} and \agwindow{Token Derivation} windows for the conflict. Both windows contain a sequence of rules, and both begin with the same rule, the rule which is the root cause of the conflict. This rule is one of the characteristic rules of the last state in the \agwindow{Reduction Trace} and is the rule which produces the ambiguity. \index{Rule Derivation}\index{Window}In both windows, each rule, except the last line of the \agwindow{Rule Derivation}, contains a marked token in a distinctive typeface. The essence of the ambiguity is that it is impossible to tell where in the input stream the marked token in the top line of the \agwindow{Rule Derivation} ends and the marked token in the top line of the \agwindow{Token Derivation} begins. In both derivation windows, each line after the first consists of a rule produced by the marked token on the previous line. In the \agwindow{Rule Derivation}, except on the first line, the marked token is either the last token in the rule, or all subsequent tokens have null productions. The last rule in the derivation window is the conflict rule you selected in the \agwindow{Conflicts} window. Thus the \agwindow{Rule Derivation} window shows you how the rule involved in the conflict derives from the root. \index{Window}\index{Token Derivation}In the \agwindow{Token Derivation} window, except on the first line, the marked token is either the first token in the rule, or all preceding tokens have null productions. The marked token in the last line of the list is the conflict token. The \agwindow{Rule Derivation} and \agwindow{Token Derivation} windows each have five auxiliary windows. The \agwindow{Rule Context} window is keyed simply to the highlighted rule. The other four windows, the \agwindow{Expansion Rules} window, the \agwindow{Productions} window, the \agwindow{Set Elements} window, and the \agwindow{Token Usage} window are keyed to the marked token. Remember that there is no marked token on the last line of the \agwindow{Rule Derivation} window. \subsection{Resolving Ambiguities} \index{Ambiguities} The shift-reduce conflict discussed earlier stemmed from an oversight in writing a grammar: the grammar specified a list of items, in this case expressions, which could not be distinguished without a separator of some sort. This conflict can be resolved simply by rewriting the grammar to require a comma separating the expressions. A similar problem comes from specifying a list of lists without any separator. Sometimes such a specification is made inadvertently. The erroneous Pascal grammar cited above had such a conflict. Here are the productions that caused trouble: \begin{indentingcode}{0.4in} declaration part -> declaration section -> declaration part, declaration section declaration section -> proc and func declaration part proc and func declaration part -> proc or func declaration -> proc and func declaration part, proc or func declaration proc or func declaration -> proc declaration -> func declaration \end{indentingcode} Of course, there are other productions for \agcode{declaration section} so that this grammar looks perfectly reasonable on the surface. A \agcode{declaration part} is a sequence of \agcode{declaration sections}. One type of \agcode{declaration section}, the \agcode{proc and func declaration part}, is a sequence of procedure or function definitions. Unfortunately, if you have two procedure definitions in a row, it could be interpreted as one \agcode{proc and func declaration part} consisting of two declarations or two \agcode{proc and func declaration parts} each consisting of one declaration. The problem here is that there is a little too much recursion. You can fix this problem by replacing \begin{indentingcode}{0.4in} declaration section -> proc and func declaration part proc and func declaration part -> proc or func declaration -> proc and func declaration part, proc or func declaration \end{indentingcode} with \begin{indentingcode}{0.4in} declaration section -> proc or func declaration \end{indentingcode} After a little reflection you will see that this simplification, which eliminates one source of recursion, entails no loss of generality. In the above examples, the ambiguities were the results of errors in the specification of the grammar, and could be easily fixed by correcting the grammar. Sometimes, however, ambiguities arise which are not so easily corrected. Consider, for instance, a simple token scanner: \begin{indentingcode}{0.4in} tokens -> [token, white space?...]... token -> name -> number -> string -> character constant -> punctuation character \end{indentingcode} where \agcode{name}, \agcode{number}, etc., are defined in the conventional manner. Such token scanners are normally written using conventional programming techniques, or using programs such as \agfile{lex} which are based on the analysis of regular expressions. Such programs simply work on a character by character basis and do not see that ``maryjane'' could be interpreted as one name or as two, or as even more, since the separating white space is declared to be optional. AnaGram, however, since it takes a broader view, sees that the specifications are basically ambiguous and complains. Note, however, that the default resolution of conflicts, in this case choosing the shift action, will provide a parser that does precisely what you would normally intend. Thus, the major concern is to make sure that the conflicts caused by this situation do not mask other more serious conditions. While it is possible to rewrite this grammar so that it is not ambiguous, such a grammar would lack clarity and conciseness. AnaGram provides two alternatives for dealing with this problem: the \index{Declaration}\index{Subgrammar declaration}\agparam{subgrammar} declaration and the \index{Sticky declaration}\index{Declaration}\agparam{sticky} declaration, which accomplish the desired end using very different techniques. In the definition above, a grammar for token, taken all by itself, would not normally show any ambiguities. We could write a \index{Lexical scanner}lexical scanner that would identify a token, pass it on to the parser, and then go on to the next. We would never worry about any ambiguities. In this situation, it is clear that what we mean by \agcode{token} is defined entirely by the grammar for \agcode{token}, not by its usage within the rest of the grammar. When we say that we have a sequence of tokens, \agcode{token...}, we usually mean that, of course, two tokens, such as ``mary'' and ``jane'' have to be separated by punctuation or space to be distinguished as separate tokens. The \index{Declaration}\index{Subgrammar declaration}\agparam{subgrammar} declaration allows you to specify that a particular \index{Nonterminal token}\index{Token}nonterminal token in your grammar is to be treated as though it were defined by its own complete grammar, without regard to its usage in the grander scheme of things. In the example above, add the following: \begin{indentingcode}{0.4in} [ subgrammar \bra token \ket ] \end{indentingcode} Now, when AnaGram conducts its search for reducing tokens, it will limit its search to those it would find if your grammar consisted only of the grammar rules necessary to define token. When it does this, it never even finds the reducing tokens that appear to cause conflicts. When you fix a conflict with the \agparam{subgrammar} declaration, the conflict vanishes entirely. It is not recorded in the \agwindow{Resolved Conflicts} window. For this reason, the \agparam{subgrammar} declaration should be used only when you are absolutely sure of the nature of the problem you are trying to solve. A somewhat more precise mechanism for dealing with the conflicts described above is to declare name and number as \index{Sticky declaration}\index{Declaration}\agparam{sticky}: \begin{indentingcode}{0.4in} [ sticky \bra name, number \ket ] \end{indentingcode} This means that whenever the last token in your parser's input buffer is \agcode{name} or \agcode{number}, it will shift in any acceptable input character, and will disregard conflicting reduce options. AnaGram will continue to identify conflicts in your grammar due to this cause, but will record them in the \agwindow{Resolved Conflicts} window. Any other conflicts you may have will still show up in the \agwindow{Conflicts} window. Sometimes it may be desirable to introduce ambiguities deliberately into a grammar. For example, although it is easy enough to write normal grammar rules for the conventional expression logic used in programming languages, it is possible to produce a somewhat more compact, faster running parser by writing an ambiguous grammar and using AnaGram's precedence declarations to resolve the ambiguities. An example of this technique is given in Chapter 9. \subsection{Null Productions} \index{Null productions}\index{Production}Null productions can be a fertile source of conflicts. Fortunately, many of them can be easily resolved by simply rewriting a few rules. Consider, for example, the following: \begin{indentingcode}{0.4in} name -> capital letter, letter..., white space initial -> capital letter, period, white space full name -> name, initial?, name \end{indentingcode} In this example, \agcode{initial?} is a virtual production which AnaGram replaces with % not ``with''... \begin{indentingcode}{0.4in} initial? -> -> initial \end{indentingcode} so \agcode{initial?} involves a null production. Recall that LALR-1 parsers cannot look ahead more than one character. The problem with this grammar, then, is that given the input ``John Q'', there is no way to tell whether the ``Q'' is an initial or the first letter of the last name. Is this ``John Q. Public'' or ``John Quincy''? Adding another production for ``full name'' so that ``initial'' need not be marked as optional allows the parser to defer its decision: \begin{indentingcode}{0.4in} full name -> name, name -> name, initial, name \end{indentingcode} Notice that the technique here is to \emph{avoid a reduction} until the parser has enough information to proceed. \subsection{Reduce-Reduce Conflicts} \index{Reduce-reduce conflicts}\index{Conflicts}Reduce-reduce conflicts are usually caused by the limited lookahead of LR-1 parsing. Very occasionally, you might find a grammar with a reduce-reduce conflict that is caused by the LALR-1 approximation to LR-1 parsing that AnaGram supports. Such conflicts are quite easy to resolve, however, by almost trivial rewriting. Reduce-reduce conflicts commonly involve productions that are intrinsically indistinguishable, for instance: \begin{indentingcode}{0.4in} tweedledee -> tiddly, wink tweedledum -> tiddly, wink \end{indentingcode} However, it may be possible to distinguish \agcode{tweedledee} from \agcode{tweedledum} by context. Suppose \begin{indentingcode}{0.4in} gibber -> tweedledee, apples -> tweedledum, oranges \end{indentingcode} Now if you can distinguish \agcode{apples} from \agcode{oranges} on the basis of the very first character, your parser will be able to tell \agcode{tweedledee} from \agcode{tweedledum}. If, on the other hand, it takes more than one input character to tell whether you have \agcode{apples} or \agcode{oranges}, you have a reduce-reduce conflict. This example, of course, is rather far-fetched, since the potential confusion between \agcode{tweedledee} and \agcode{tweedledum} is obvious. More commonly, the definitions and usage of \agcode{tweedledee} and \agcode{tweedledum} are physically far apart in the grammar. At one place in a grammar: \begin{indentingcode}{0.4in} balderdash -> tweedledee, apples \end{indentingcode} Elsewhere, seemingly unrelated: \begin{indentingcode}{0.4in} buncombe -> tweedledum, oranges \end{indentingcode} In a dark corner: \begin{indentingcode}{0.4in} gossip -> buncombe, slander \end{indentingcode} And in yet another dark corner: \begin{indentingcode}{0.4in} small talk -> gossip -> rumors -> balderdash \end{indentingcode} Once again, if \agcode{apples} and \agcode{oranges} cannot be distinguished by the very first character, you have a reduce-reduce conflict that is somewhat harder to find than the one discussed above, simply because the trail is somewhat more complicated to follow. \subsection{Identifying Reduce-Reduce Conflicts} \index{Reduce-reduce conflicts}\index{Conflicts} Although the machinery described above for identifying shift-reduce conflicts can be used to investigate reduce-reduce conflicts, many yield to comparison of the \agwindow{Reduction States} windows for the two rules. For the two rules to be distinguishable, no token which is acceptable input for any of the reduction states belonging to one rule can be acceptable input for any reduction state belonging to the other. Therefore, when you have a reduce-reduce conflict, there is an overlap among the tokens in the reduction states for one rule and the tokens in the reduction states for the other rule. Often, however, there are many reduction states which are irrelevant as far as a particular conflict is concerned. So AnaGram provides a special window, the \agwindow{Problem States} window, in the \agmenu{Auxiliary Windows} menu for the \agwindow{Conflicts} and \agwindow{Resolved Conflicts} windows. The \agwindow{Problem States} window for a particular rule lists those, and only those, reduction states for the rule for which the conflict token is acceptable input. This usually reduces the size of the table and often makes the source of the problem immediately obvious. \subsection{Resolving Reduce-Reduce Conflicts} If you build a parser for a grammar that includes a reduce-reduce conflict, AnaGram will arbitrarily reduce the first rule it encountered while reading your grammar, that is, the rule with the smaller rule number. In some circumstances this default may be acceptable. In other circumstances more may be required. Often, it is possible to get rid of reduce-reduce conflicts by rewriting your grammar so that tokens that were once reducing tokens for a rule become part of the rule itself. Thus, in the previous example, you could write: \begin{indentingcode}{0.4in} tweedledee -> tiddly, wink, apples tweedledum -> tiddly, wink, oranges \end{indentingcode} In other words, if \agcode{apples} always follows \agcode{tweedledee}, you can move \agcode{apples} ``down'' in the grammar by making it part of \agcode{tweedledee}. Moving tokens ``down'' in this manner always reduces the likelihood of conflicts. But what if \agcode{apples} doesn't always follow \agcode{tweedledee}? Then it is sometimes worthwhile to make two versions of \agcode{tweedledee}: \begin{indentingcode}{0.4in} simple tweedledee -> tiddly, wink complex tweedledee -> tiddly, wink, apples \end{indentingcode} and in each circumstance use whichever is appropriate. As with the earlier shift-reduce example, the technique is to avoid any reduction until the tokens can be distinguished. Sometimes, however, reduce-reduce conflicts reflect a language design that really demands greater lookahead than can be accommodated with an LR-1 parser. In these cases you have two options: you can consider a multi-stage parser or you can implement an ad hoc lookahead procedure. In the latter case, you can use semantically determined productions to control your parser depending on the results of your lookahead procedure. \subsection{Conflicts due to LALR-1 Simplifications} \index{LALR simplifications}\index{Conflicts} LALR-1 \index{Parser generator}parser generators occasionally report a reduce-reduce conflict where an LR-1 parser would successfully resolve the conflict. How significant is this problem? An early prototype version of AnaGram actually had full LR-1 capability. Unfortunately LR-1 parsers are generally much bigger than LALR-1 parsers. Creating them takes longer and more resources so that for given computer resources an LALR-1 parser generator can handle larger, more complex grammars than an LR-1 parser generator. To provide a compromise, the prototype was modified to do an LALR-1 analysis first. Then, only if there were reduce-reduce conflicts did it do an LR-1 analysis. Over a substantial period of testing, the only reduce-reduce conflicts that were resolved by the LR-1 analysis were those that were created deliberately as tests. Eventually the LR-1 capabilities were removed from AnaGram since they added complexity without concomitant benefits. To understand why this is so, it is worthwhile to look at an example. Let us recall the definitions of \agcode{tweedledee} and \agcode{tweedledum} made above: \begin{indentingcode}{0.4in} tweedledee -> tiddly, wink tweedledum -> tiddly, wink \end{indentingcode} Remember that we defined gibber thus: \begin{indentingcode}{0.4in} gibber -> tweedledee, apples -> tweedledum, oranges \end{indentingcode} Now suppose that we also define jabber in a contrary way: \begin{indentingcode}{0.4in} jabber -> tweedledee, oranges -> tweedledum, apples \end{indentingcode} Now suppose we use \agcode{gibber} in one part of our grammar and \agcode{jabber} in another, completely unrelated, part. An LR-1 parser generator would succeed in keeping \agcode{tweedledee} and \agcode{tweedledum} straight in spite of our efforts to confuse it. An LALR-1 parser generator, however, succumbs to our efforts and reports a conflict. A trivial rewrite of the grammar, however, removes the conflict: \begin{indentingcode}{0.4in} gibber tweedledee -> tiddly, wink gibber tweedledum -> tiddly, wink jabber tweedledee -> tiddly, wink jabber tweedledum -> tiddly, wink gibber -> gibber tweedledee, apples -> gibber tweedledum, oranges jabber -> jabber tweedledee, oranges -> jabber tweedledum, apples \end{indentingcode} The effect of this rewrite is to accomplish the same separation of states that an LR-1 parser generator would make. \section{Keyword Anomalies} \index{Keyword Anomalies}\index{Anomaly} In syntax directed parsing, it is assumed that input tokens can be uniquely identified. In the case of keywords, however, there is the possibility that the individual characters making up the keyword, as well as the keyword taken as a whole, could constitute legitimate input under some circumstances. Thus keywords, though a powerful and useful tool, are not completely consistent with the assumptions that underlie syntax directed parsing. This can occasionally give rise to a type of conflict, diagnosed by AnaGram, called a \agterm{keyword anomaly}. AnaGram is quite conservative in its diagnoses, so that many keyword anomalies it reports are actually innocuous and can be safely ignored. Nevertheless, anomalies should be investigated carefully. AnaGram only recognizes a keyword in those states where the keyword is allowable input. If AnaGram recognizes a keyword, the keyword takes precedence over the sequence of characters that constitute the keyword. If there are several allowable keywords, the longest takes precedence. Under certain circumstances it is possible for a grammar to have a state where a particular keyword may or may not be allowable input depending on how the parser gets to that state. That is, for some sequences of tokens which lead to the state, the keyword is allowable, in others it isn't. In such a state, the keyword must always cause a reduction, since if the keyword were to shift, it would always be allowable input. When AnaGram finds such a state in your parser, it checks to see what the parser action would be if the parser ignored the keyword and interpreted it in terms of its constituent characters. If the parser action would be anything other than reducing the same rule as the keyword or a syntax error, AnaGram will diagnose a keyword anomaly in this state. To summarize, the nature of a keyword anomaly is this: It is possible for your parser to get to the anomalous state by a route for which the keyword is not allowable. In the anomalous state, at least the first constituent character of the keyword is allowable in its own right and causes a parser action different from that caused by the keyword. Nonetheless, if the keyword should appear in the input, your parser will recognize it, perform at least one reduction action, get to a state where the keyword is not recognized, and finally, in the wrong state, will try to interpret the keyword in terms of its constituent characters. Now, what do you do about a keyword anomaly? If in the anomalous state the characters which constitute the keyword would never be syntactically admissible as individual characters, the keyword anomaly is innocuous in that the only problem is that the recognition of the syntax error is somewhat delayed. If however, the characters which constitute the keyword are syntactically admissible, you have a problem that needs to be fixed. In many programming languages, many keywords are treated as \agterm{reserved words}, that is, as words that may not be otherwise used in a conforming program. Version 1.5 of AnaGram has added a \index{Reserve keywords}\agparam{reserve keywords} statement that you can use to specify that the text of a keyword is \emph{never} admissible as individual characters. Thus you will never get a keyword anomaly diagnostic for a keyword that has been listed in a \index{Reserve keywords}\agparam{reserve keywords} statement. The \agparam{reserve keywords} statement is described in greater detail in \S 8.2.28. % XXX this is the only place in the manual (so far?) that uses the % \S section-symbol. (Also, this should be a tex crossreference.) \subsection{The Keyword Anomalies Window} \index{Keyword Anomalies}\index{Window} If your grammar has a keyword anomaly, AnaGram will provide a warning message and will create a \agwindow{Keyword Anomalies} window to provide details about the anomalies in your grammar. Each entry in the \agwindow{Keyword Anomalies} window consists of two lines. The first line identifies the state at which the anomaly occurs and the offending keyword. The second line identifies the rule which the keyword may erroneously reduce. The \agmenu{Auxiliary Windows} menu provides three auxiliary windows keyed directly to the anomaly to help you determine the nature of the problem: the \agwindow{Keyword Anomaly Trace} window, the \agwindow{Reduction Trace} window, and the \agwindow{Rule Derivation} window. In addition, three other windows provide supporting information: the \agwindow{Reduction States} window, the \agwindow{Rule Context} window and the \agwindow{State Definition} window. These windows are discussed in Chapter 6. The \agwindow{Keyword Anomaly Trace} window, a pre-built \agwindow{Grammar Trace}, shows a sequence of tokens that illustrates one way to encounter the anomaly. The \agwindow{Reduction Trace} window shows the result of performing the reduction action. Just as for conflicts, the true source of the problem is often at some remove from the actual state in which the problem manifests itself. One of the characteristic rules for the last state in the \agwindow{Reduction Trace} window is, in fact, the true source of the problem. The \agwindow{Rule Derivation} window can show you which of the characteristic rules is the offending party and how the rule in the anomalous state derives from it. \subsection{Dealing with Keyword Anomalies} If your grammar has a keyword anomaly, you should study it carefully. It may be that the anomaly is completely innocuous, that is, the characters which constitute the keyword are not otherwise legitimate input. In this case you can simply ignore the message. The basic approach to dealing with keyword anomalies is to revise your grammar to make a better distinction between the contexts in which the keyword is acceptable and those in which it may be confused with otherwise valid input. Several other techniques may be useful: For an anomaly to occur, there must be, \emph{in the anomaly state}, an alternative interpretation of some of the initial characters of the keyword. If the anomaly occurs because you have several keywords that begin in a similar manner, you may use the \index{Distinguish keywords}\index{Keywords}\agparam{distinguish keywords} statement (see Chapter 8) to keep them properly separated. Note that having several keywords that begin with the same characters does \emph{not} necessarily cause an anomaly. If the keyword is an alphabetic keyword that can follow conventional variable name tokens, you may wish to use a \agparam{sticky} declaration or the \index{Distinguish lexemes}\index{Configuration switch} \agparam{distinguish lexemes} switch to inhibit recognition of the keyword when it is preceded immediately by a name. If the keyword can be treated as a reserved word, the \index{Reserve keywords}\agparam{reserve keywords} statement will eliminate the anomaly.