Mercurial > ~dholland > hg > ag > index.cgi
comparison doc/manual/isdp.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
-1:000000000000 | 0:13d2b8934445 |
---|---|
1 \chapter{Introduction to Syntax Directed Parsing} | |
2 | |
3 Every programmer has to deal with input data. The units of data may | |
4 consist of single characters from a keyboard or file, unit records | |
5 from a file, mouse events in a windowing system, or even more complex | |
6 data constructs. When the processing of a unit of data depends only on | |
7 the value of that data, it is usually straightforward. Usually, | |
8 however, the processing depends also on what has preceded, and often | |
9 even on what follows, the input under consideration. Keeping track of | |
10 these dependencies in order to know how to process the data is called | |
11 \index{Parsing}\agterm{parsing}. | |
12 | |
13 It is often relatively easy to keep track of simple dependencies when | |
14 first writing a program. But as the program develops, as new features | |
15 are added, as bugs are fixed, the dependencies often stop being | |
16 simple. Then input processing becomes a headache, since it is hard to | |
17 keep track of or even identify all the particular cases. Changes to | |
18 the program cause unexpected problems. Program maintenance threatens | |
19 to get out of control. | |
20 | |
21 Syntax directed parsing is a technique for dealing with input data | |
22 streams which keeps you in control of the problem, gives you a high | |
23 degree of confidence in your program, makes bugs less likely and makes | |
24 program maintenance and the addition of new features an easy task. | |
25 | |
26 To use syntax directed parsing you create a high level description of | |
27 the structure of your input data, called a | |
28 \index{Grammar}\agterm{grammar}. | |
29 A file which contains a grammar is called a | |
30 \index{Syntax file}\index{File}\agterm{syntax file}. | |
31 A | |
32 \index{Parser generator}\agterm{parser generator}, | |
33 such as AnaGram, | |
34 can create, from a syntax file, a function (or program) called a | |
35 \index{Parser}\agterm{parser}, | |
36 written in C or C++. The parser keeps track of all the dependencies | |
37 in your input, and calls certain functions, | |
38 \index{Reduction procedure}\agterm{reduction procedures}, | |
39 to deal with specific units or sequences of data as they are | |
40 encountered. | |
41 | |
42 Reduction procedures are functions you write to process your data. | |
43 They are linked, in an obvious way in the grammar, to the structures | |
44 in your input, so that your parser will call them at precisely the | |
45 right times with precisely the right data. When you write reduction | |
46 procedures, you can concentrate entirely on what you have to do with | |
47 the data. You don't have to encumber your code with switches and | |
48 tests to determine the structure of your input. | |
49 | |
50 This chapter describes in a general way how you write a grammar and | |
51 how a parser works. In particular, it defines a number of terms which | |
52 are useful in talking about grammars and syntax directed parsing. It | |
53 then introduces a number of special features of AnaGram. More | |
54 detailed treatment of writing grammars is given in Chapter 8. | |
55 Techniques for building a complete functioning parser are discussed in | |
56 Chapter 9. | |
57 | |
58 | |
59 \section{Describing an Input Sequence} | |
60 | |
61 Writing a grammar consists of describing the acceptable input | |
62 sequences for your program. The vehicle for describing an input | |
63 sequence is called a | |
64 \index{Production}\agterm{production}. | |
65 Productions show how a logical component of the input can be made up | |
66 of a sequence of more elementary components. A production that | |
67 describes a date might be written: | |
68 | |
69 \begin{indentingcode}{0.4in} | |
70 date | |
71 -> name of month, day, comma, year | |
72 \end{indentingcode} | |
73 | |
74 The components of the input are called | |
75 \index{Token}\agterm{tokens}. | |
76 The sequence of tokens on the \index{Right side}right side of the | |
77 production is called a | |
78 \index{Grammar rule}\agterm{grammar rule}, | |
79 or \index{Rule}\agterm{rule} for short. | |
80 The individual tokens on the right side of the rule are also called | |
81 \index{Rule elements}\index{Rule}\agterm{rule elements}. | |
82 The \agterm{rule length} is the number of elements in the rule. | |
83 | |
84 The token on the \index{Left side}left side of the production is | |
85 called the | |
86 \index{Reduction token}\index{Token}\agterm{reduction token} | |
87 for the rule. \index{Token}Tokens may have | |
88 \index{Value}\index{Semantic value}\agterm{semantic values}, | |
89 as distinguished from \index{Value}\agterm{syntactic values}, which | |
90 you can use in your reduction procedures. For instance, the value of | |
91 \agcode{name of month} could be an integer in the range zero to | |
92 eleven, or it could be a pointer to an ASCII string. The value of day | |
93 could be an integer in the range one to thirty-one. | |
94 | |
95 A grammar consists of a number of such productions, each of which | |
96 describes some component of the input in terms of other components. | |
97 It does not take very many productions to describe quite complex input | |
98 streams. A grammar for the C language, for instance, requires about | |
99 two hundred productions. | |
100 | |
101 Many people find the term \agterm{production} quite confusing. It | |
102 comes from theoretical linguistics where it is used to describe how | |
103 one may produce sequences which correspond to a set of grammatical | |
104 rules. Ironically, the major usage of the idea has been in parsing | |
105 where the interest is not so much in creating sequences which satisfy | |
106 the grammatical rules as in decoding and analyzing such sequences. | |
107 Nonetheless, it is convenient, in the above example, to say that the | |
108 token \agcode{date} \agterm{produces} a sequence of tokens consisting | |
109 of \agcode{name of month}, \agcode{day}, \agcode{comma}, and | |
110 \agcode{year}. We also say that the sequence \agterm{reduces} to | |
111 \agcode{date}. | |
112 | |
113 There may be more than one production to describe a given component, | |
114 if there is more than one way it may be represented. For instance, | |
115 | |
116 \begin{indentingcode}{0.4in} | |
117 date | |
118 -> day, name of month, year | |
119 \end{indentingcode} | |
120 | |
121 describes another common way of writing a date. In other words, a | |
122 reduction token may produce a number of different grammar rules. | |
123 | |
124 Tokens which appear on the left side of one or more productions are called | |
125 \index{Token}\index{Nonterminal token}\agterm{nonterminal tokens}. | |
126 Those which appear \emph{only} on the right sides of productions are | |
127 called | |
128 \index{Token}\index{Terminal token}\agterm{terminal tokens}. | |
129 Terminal tokens are the units which actually appear physically in the | |
130 input. Nonterminal tokens are identified when a sequence of tokens | |
131 that matches the right side of a production is seen in the input. | |
132 When AnaGram analyzes a grammar, it assigns a unique | |
133 \index{Token number}\index{Number}\index{Token}\agterm{token number} | |
134 to each token it finds in the grammar. | |
135 | |
136 Nonterminal tokens, such as \agcode{date} in the example above, may | |
137 appear in any grammar rule just as though they were input tokens. The | |
138 token on the left side of a production can even appear on the right | |
139 side as well. Such a production is called a | |
140 \index{Production}\index{Recursive productions}\agterm{recursive} | |
141 production. When a nonterminal token appears on the right side of a | |
142 production, it may be represented in this context by \emph{any} of | |
143 the grammar rules it produces. Grammars described in this manner are | |
144 called | |
145 \index{Context free grammar}\index{Grammar}\agterm{context free grammars} | |
146 since there is no contextual constraint on which of the rules that a | |
147 token produces can appear in any given context. | |
148 | |
149 Recursive productions may be either left recursive or right recursive. | |
150 \agterm{Left recursive} productions are those where the recursively | |
151 defined nonterminal appears as the first element in the recursive | |
152 rule. \agterm{Right recursive} productions are those where it is the | |
153 last element. If it appears anywhere between, the production is said | |
154 to be \agterm{center recursive}. | |
155 | |
156 Any nonterminal token which has a recursive production must also have | |
157 at least one simple, non-recursive production. Otherwise, it is not | |
158 possible to create a finite sequence of terminal tokens from the | |
159 nonterminal token. | |
160 | |
161 Recursion may also occur implicitly in a grammar when one of the | |
162 tokens on the right side of a production itself has a production | |
163 involving the token on the left. Such implicit recursion sometimes | |
164 may involve numerous levels of productions. Implicit recursion occurs | |
165 most commonly in describing constructs such as arithmetic expressions | |
166 or the block structure of programming languages. | |
167 | |
168 Clearly, grammars can accommodate multiple levels of structure in the | |
169 input sequences they describe. There must, at the top, be a single | |
170 token which encompasses the entire input. This special token is | |
171 variously called the | |
172 \index{Grammar token}\index{Configuration parameters}\index{Token} | |
173 \agterm{grammar token}, | |
174 the | |
175 \index{Goal token}\agterm{goal token} | |
176 or the | |
177 \index{Start token}\agterm{start token}. | |
178 | |
179 In AnaGram grammars, terminal tokens need not be declared or otherwise | |
180 defined. Any token name that appears only on the right side of a | |
181 production is taken to be a terminal token, and AnaGram presumes that | |
182 the input procedure that provides tokens to the parser will be able to | |
183 identify them to the parser appropriately. The details on how to do | |
184 this are explained in Chapter 9. | |
185 | |
186 On the other hand, AnaGram allows you to specify terminal tokens | |
187 explicitly as ASCII characters, or even as sets of ASCII characters, | |
188 right in the grammar. Thus, you may write \agcode{'0-9'} to represent | |
189 the set of ASCII digits, or \agcode{'A-Z'} to represent the set of | |
190 upper case letters. The | |
191 \index{Semantic value}\index{Value}\index{Token}semantic value | |
192 of such a token is the ASCII \index{Character codes}character code | |
193 that actually appears in the input stream. The rules for representing | |
194 \index{Character sets}character sets are given in Chapter 8. If the | |
195 various sets you use in your grammar overlap, they may not properly | |
196 represent terminal tokens. In this case, AnaGram automatically | |
197 extends your grammar appropriately. This ``set partition'' logic is | |
198 described in Chapter 6. | |
199 | |
200 | |
201 \section{How a Parser Works} | |
202 | |
203 The aim of a \index{Parser}parser is to match its input with the full | |
204 syntactic structure specified by the productions which make up the | |
205 grammar. The primary component of a parser is an input buffer, | |
206 sometimes thought of as a stack, into which tokens are | |
207 \agterm{shifted}, or stored sequentially, as they are encountered in | |
208 the input. At the same time that a token is shifted into the input | |
209 buffer, its | |
210 \index{Semantic value}\index{Token}\index{Value}semantic value is | |
211 pushed onto the | |
212 \index{Parser value stack}\index{Value stack}\index{Stack} | |
213 \agterm{value stack}. | |
214 A token is not shifted into the buffer unless it ``makes sense'', that | |
215 is, unless it is consistent with the rules of the grammar and with the | |
216 input that has preceded it. If a token does not make sense, the | |
217 parser signals a | |
218 \index{Syntax error}\index{Errors}\agterm{syntax error}. | |
219 In order to determine whether a token makes sense, the parser has a | |
220 sort of decision table which provides a list of acceptable tokens for | |
221 each of a number of | |
222 \index{Parser}\index{State}\agterm{states}. | |
223 The table also specifies what the parser is to do with each acceptable | |
224 token. When the table indicates that a token is to be shifted into | |
225 the buffer, it also specifies a new state. The parser stacks the | |
226 current state on a | |
227 \index{Parser state stack}\index{State stack}\index{Stack} | |
228 \agterm{state stack} | |
229 and jumps to the new state. Thus every time a token is shifted into | |
230 the input buffer, a state number is pushed onto the state stack. For | |
231 each state of the parser, excepting only the initial state, there is a | |
232 unique token which will cause a jump to that state. This token is | |
233 called the | |
234 \index{Characteristic token}\index{Token}\agterm{characteristic token} | |
235 for the state. | |
236 | |
237 When the rightmost, or most recent, tokens in the input buffer match | |
238 the right side of a production precisely, the parser \emph{may} | |
239 replace the tokens that match the rule with a single token, the token | |
240 on the left side of the production. This process of replacing a | |
241 sequence of tokens with a single token is called | |
242 \index{Reduction}\agterm{reduction}. | |
243 The token that replaces the sequence of tokens is called the | |
244 \index{Reduction token}\index{Token}\agterm{reduction token}. | |
245 | |
246 The actual mechanism of the reduction is quite important. At the same | |
247 time that input tokens are removed from the input buffer, state | |
248 numbers are popped from the state stack, so that when all input tokens | |
249 matching the rule have been removed, the parser | |
250 \index{Parser}\index{State}state has been restored to the value it had | |
251 at the time the first token in the rule was seen. As state numbers | |
252 are popped from the state stack, token values are popped from the | |
253 value stack. If the rule has a reduction procedure, temporary | |
254 variables are loaded with the values popped from the stack and the | |
255 reduction procedure is called. The reduction token is now shifted | |
256 into the input buffer just as though it were an input token. If the | |
257 reduction procedure returned a result it is shifted into the value | |
258 stack as the value of the reduction token. The parser stacks the | |
259 current state again and jumps to a new state as determined by the | |
260 parser tables. | |
261 | |
262 If there are no errors in your input, when the last token has been | |
263 read from the input, shifted into the input buffer, and reductions | |
264 performed, there will be precisely one token in the input buffer: the | |
265 \index{Grammar token}\index{Goal token}grammar, or goal, token which | |
266 describes your entire input. At this point your parser declares | |
267 itself finished. | |
268 | |
269 Reductions do not necessarily occur every time a rule matches the | |
270 tokens in the input buffer. If the reduction token does not make | |
271 sense, that is, if it is not consistent with the rules of the grammar | |
272 and with the input that has preceded it, the parser will not perform | |
273 the reduction. Suppose there are tokens in the input which match one | |
274 of the rules given above for \agcode{date}. A reduction will not | |
275 occur unless \agcode{date} is one of the tokens the parser is actually | |
276 looking for at that stage in the input. | |
277 | |
278 Even when the reduction token would make sense, there are still | |
279 situations where the reduction would not take place. Suppose a | |
280 grammar includes the two productions given above for \agcode{date} as | |
281 well as the following: | |
282 | |
283 \begin{indentingcode}{0.4in} | |
284 date | |
285 -> name of month, day | |
286 \end{indentingcode} | |
287 | |
288 This production is the same as the first, but with no year | |
289 specification. We often write dates without specifying the year | |
290 explicitly. The year is usually understood from context. When a | |
291 parser directed by this grammar has encountered \agcode{name of month} | |
292 and \agcode{day}, it can't tell without looking further whether it has | |
293 a short form or a long form date. In such a circumstance, the parser | |
294 looks at the next following token, which is called a | |
295 \index{Lookahead token}\index{Token}\agterm{lookahead token}. | |
296 If the lookahead token is a comma, then, in the absence of other | |
297 productions, the input is a long form date. If the lookahead token is | |
298 not a comma, then the input is certainly not a long form date, and it | |
299 is proper to reduce the short form production. | |
300 | |
301 Suppose the lookahead token were a comma and the grammar were also to | |
302 contain the following production: | |
303 | |
304 \begin{indentingcode}{0.4in} | |
305 appointment | |
306 -> date, comma, time | |
307 \end{indentingcode} | |
308 | |
309 Since a comma can follow date, according to this rule, and can also | |
310 follow day according to the first production, it is impossible to | |
311 determine, simply by looking at the lookahead token, whether the date | |
312 was being given in short form or long form. One would have to look | |
313 beyond the comma to see if what follows the comma matches the rules | |
314 for time or for year. Although it is possible to build parsers which | |
315 can do this, it is not generally feasible. This situation is called a | |
316 \index{Shift-reduce conflict}\index{Conflicts} | |
317 \agterm{shift-reduce conflict}, | |
318 because the parser cannot decide simply on the basis of the lookahead | |
319 token whether to reduce the short form rule or to ignore it and shift | |
320 the comma into the input buffer. | |
321 | |
322 AnaGram finds and warns you about the conflicts in your grammar when | |
323 it analyzes your grammar. To do this, it checks all states to find | |
324 all completed rules. A | |
325 \index{Completed rule}\index{Rule}\agterm{completed rule} | |
326 is a rule that is completely matched by the rightmost tokens in the | |
327 input buffer. For each such rule, AnaGram determines all of the | |
328 terminal tokens which could follow once that rule has been reduced. | |
329 These tokens are sometimes called the | |
330 \index{Token}\agterm{reducing tokens} | |
331 for the given rule in the given state. (Note that ``reducing tokens'' | |
332 are quite different from ``reduction tokens''.) AnaGram checks each | |
333 state for shift-reduce conflicts by comparing the set of tokens which | |
334 can be shifted against all sets of reducing tokens for that state. If | |
335 there is an overlap, there is a conflict. When AnaGram finds a | |
336 conflict, it prepares a detailed report on the conflict. This | |
337 information is then available in the \agwindow{Conflicts} window. A | |
338 shift-reduce conflict does not prevent you from building a parser. If | |
339 you build a parser for a grammar which has such a shift-reduce | |
340 conflict, the parser will resolve the issue by shifting the lookahead | |
341 token into the input buffer and ignoring the reduction. | |
342 | |
343 There is another type of conflict involving reductions. Suppose a | |
344 given state has more than one | |
345 \index{Completed rule}\index{Rule}completed rule. | |
346 AnaGram then checks the sets of reducing tokens for all the completed | |
347 rules. If there is an overlap, there is no way for AnaGram to | |
348 determine which rule should be reduced simply on the basis of the | |
349 \index{Token}\index{Lookahead token}lookahead token. | |
350 This situation is called a | |
351 \index{Reduce-reduce conflicts}\index{Conflicts} | |
352 \agterm{reduce-reduce conflict}. | |
353 AnaGram will also diagnose reduce-reduce conflicts in the Conflicts | |
354 window. If you decide to ignore a reduce-reduce conflict and ask | |
355 AnaGram to build a parser, the parser will choose the rule AnaGram | |
356 encountered first when it read your grammar. | |
357 | |
358 \index{Conflicts}Conflicts are not necessarily errors. Sometimes you | |
359 can build a more compact, faster parser by deliberately introducing | |
360 conflicts. You can rely on AnaGram's default resolution of the | |
361 conflicts, or you may use operator precedence to resolve the | |
362 conflicts. The use of operator precedence is described in Chapter 9. | |
363 | |
364 AnaGram has extensive facilities for identifying and correcting | |
365 unwanted conflicts. These facilities are described in Chapter 7. | |
366 | |
367 Traditionally, parsers are built using only shift and reduce actions | |
368 as described above. The parsers AnaGram builds normally use a number | |
369 of more complex actions, each of which is equivalent to a number of | |
370 shift and reduce actions. These complex actions permit AnaGram to | |
371 build faster, more compact parsers. If you wish, however, you may | |
372 restrict AnaGram to using only shift and reduce actions by setting the | |
373 \index{Traditional engine}\index{Configuration switches} | |
374 \agparam{traditional engine} configuration switch. See Appendix A, | |
375 Configuration Parameters. | |
376 | |
377 | |
378 \section{A Note on Notation} | |
379 | |
380 \index{Context free grammar}\index{Grammar}Context free grammars have | |
381 been traditionally represented in the literature using | |
382 \index{Backus-Naur Form}\agterm{Backus-Naur Form}, or | |
383 \index{BNF}\agterm{BNF}. | |
384 In Backus-Naur Form, certain characters, called metacharacters, are | |
385 used to punctuate productions and all other printable characters are | |
386 taken to represent themselves literally. Named tokens are denoted by | |
387 enclosing the name within angle brackets ($< >$). The left side of a | |
388 production is distinguished from the right side by the | |
389 % XXX s/characters/symbol/ | |
390 characters $::=$. | |
391 If several productions have the same left side, the vertical | |
392 bar $|$ is used to separate them. The elements of a grammar | |
393 rule are simply juxtaposed to indicate that one follows another. | |
394 Blanks are ignored. Thus, in BNF, the first production given for | |
395 \agcode{date}, above, would be: | |
396 | |
397 % ~ is a hard space | |
398 $$ | |
399 <date>~~::=~~<name~of~month>~~<day>~~,~~<year> | |
400 $$ | |
401 | |
402 AnaGram uses a notation more consonant with ordinary programming | |
403 usage. Thus token names need not be bracketed and literal characters | |
404 must be appropriately quoted. The elements of rules are joined by | |
405 commas. Using this approach, there is no need for metacharacters and | |
406 it becomes possible to make a number of useful extensions to the | |
407 notation. The detailed rules for writing grammars using AnaGram's | |
408 notation are given in Chapter 8. | |
409 | |
410 | |
411 \section{Reduction Procedures} | |
412 \index{Reduction procedure} | |
413 | |
414 Of course, the reason for parsing an input stream is to interpret the | |
415 data in the stream and to process it in some useful manner. The | |
416 primary tool for doing this is the \agterm{reduction procedure}. A | |
417 reduction procedure is a piece of C code that is executed when a | |
418 particular grammar rule is reduced. Often, a reduction procedure | |
419 calculates a value which becomes the | |
420 \index{Value}\index{Semantic value}semantic value | |
421 of the | |
422 \index{Token}\index{Reduction token}reduction token. | |
423 The input to the reduction procedure consists of the values of the | |
424 tokens that make up the grammar rule. | |
425 | |
426 AnaGram allows you to assign C variable names to the tokens in the | |
427 grammar rule, so that you can refer to them in the reduction | |
428 procedure. To assign a C variable name, simply follow the token in | |
429 the rule with a colon and the C variable name. Simple reduction | |
430 procedures can be written as C or C++ expressions. At the end of the | |
431 rule, write an equal sign and an expression, terminated by a | |
432 semicolon. For example: | |
433 | |
434 \begin{indentingcode}{0.4in} | |
435 (int) hex digit | |
436 -> '0-9':d = d-'0'; | |
437 -> 'a-f':d = d-'a'+10; | |
438 -> 'A-F':d = d-'A'+10; | |
439 \end{indentingcode} | |
440 | |
441 When any one of these rules is matched, the value of the token in the | |
442 rule is assigned to the temporary variable \agcode{d}. The expression | |
443 to the right of the equal sign is evaluated and the value of the | |
444 expression is stored as the value of \agcode{hex digit}, which has | |
445 been declared to be an \agcode{int}. These productions define | |
446 hexadecimal digits as ASCII characters, and calculate the binary value | |
447 of the digit in terms of the ASCII character code, \agcode{d}. The | |
448 binary value becomes the value of \agcode{hex digit}. Hexadecimal | |
449 digits can be combined to make hexadecimal numbers by writing the | |
450 following productions: | |
451 | |
452 \begin{indentingcode}{0.4in} | |
453 (int) hex number | |
454 -> hex digit | |
455 -> hex number:n, hex digit:d = 16*n+d; | |
456 \end{indentingcode} | |
457 | |
458 %In this example, note that if you do not specify a reduction procedure for | |
459 %a grammar rule, the value of the reduction token will be taken to be the | |
460 %value of the first token in the rule. Thus, it is not necessary to provide | |
461 %a reduction procedure in the first production for hex number. The second | |
462 %production recursively defines the value of hexadecimal numbers that have | |
463 %more than one digit. | |
464 | |
465 There are several important points to notice in this example. First, | |
466 reduction procedures are executed ``from the bottom up''. That is, | |
467 the reduction procedure for \agcode{hex digit} is executed before any | |
468 reduction procedure for \agcode{hex number}. Second, if there is no | |
469 reduction procedure for a production, the value of the first token in | |
470 the rule is assigned to the reduction token. Thus, it is not | |
471 necessary to provide a reduction procedure in the first production for | |
472 \agcode{hex number}. Third, the reduction procedures for recursive | |
473 productions are always executed \emph{after} the reduction procedure, | |
474 if any, for the non-recursive production which begins the recursion. | |
475 Finally, when an input sequence is described using left recursion, as | |
476 in this example, the elements of the sequence are processed left to | |
477 right. | |
478 | |
479 % | |
480 % XXX this is a bad example as you never want to read floats this way! | |
481 % Can we think of something else? | |
482 % | |
483 If you wish to process the elements of a sequence right to left, you | |
484 may use right recursion. For example, it is sometimes convenient to | |
485 define the fraction part of a decimal number thus: | |
486 | |
487 \begin{indentingcode}{0.4in} | |
488 (double) fraction part | |
489 -> '0-9':d = (d - '0')/10.0; | |
490 -> '0-9':d, fraction part:f = (d - '0' + f)/10.0; | |
491 \end{indentingcode} | |
492 | |
493 In this case the leading digits are stored temporarily in the parse | |
494 stack, and then the fraction part is evaluated right to left only when | |
495 the last digit has been found. | |
496 | |
497 Reduction procedures can be more complex than simple expressions. After the | |
498 equal sign you may include an arbitrary block of C or C++ code, enclosed in | |
499 braces, \agcode{\bra \ket}. To return a | |
500 \index{Semantic value}\index{Token}\index{Value}semantic value | |
501 for the reduction token simply use a return statement. Of course, | |
502 reduction procedures have the full resources of C or C++ at their | |
503 disposal. They may set and interrogate global variables and may call | |
504 functions freely. Reduction procedures are usually implemented as C | |
505 functions. Very simple procedures may be implemented as macros, | |
506 unless you have disabled this option by turning off the | |
507 \index{Macros}\index{Allow macros}\agparam{allow macros} configuration | |
508 switch. See Chapter 8 for the rules for writing reduction procedures. | |
509 When AnaGram builds a parser it copies all of the reduction procedures | |
510 you have defined to the parser file, and includes code to call them at | |
511 the right time. | |
512 | |
513 Since the reduction procedures you write will probably need some | |
514 support code, such as \agcode{\#include} statements and declarations, | |
515 you may incorporate C or C++ code into your syntax file at any point. | |
516 You need only enclose it in braces, \agcode{\bra \ket}. Such code | |
517 is called | |
518 \index{Embedded C}\agterm{embedded C}. | |
519 All embedded C code is also copied to the \index{File}parser file, and | |
520 \emph{precedes} all of your reduction procedures. | |
521 | |
522 | |
523 \section{Building a Parser} | |
524 \index{Building a Parser}\index{Parser} | |
525 | |
526 In order to round out a parser into a functioning program it needs | |
527 input procedures, as well as error diagnosis and recovery | |
528 capabilities. AnaGram has a number of options available which give | |
529 you a high degree of flexibility in configuring a parser to suit your | |
530 particular needs. All of the options are provided with reasonable | |
531 defaults, so that you can safely disregard any option until you need | |
532 the features it provides. The following paragraphs summarize the | |
533 options AnaGram provides. These topics are discussed in greater | |
534 detail in Chapter 9. | |
535 | |
536 \paragraph{Invoking a Parser.} | |
537 \index{Parser} | |
538 Normally, AnaGram configures parsers as functions which you can call | |
539 from elsewhere in your program. In this situation, you call the | |
540 parser, it processes its input, and returns either when it has | |
541 finished or cannot proceed because of errors. Alternatively, if you | |
542 set the | |
543 \index{Event driven}\agparam{event driven} | |
544 configuration switch, your parser will be configured so that you have | |
545 two procedures to call: an \index{Initializer}\agterm{initializer} and | |
546 a \index{Parser}\agterm{parser}. In the event driven configuration | |
547 you start the parse by calling the initializer and then you call the | |
548 parser once for each unit of input. Using the event driven mode makes | |
549 it quite easy to configure a parser as a filter and to chain several | |
550 parsers together so that the output from one parser is the input to | |
551 the next. Such | |
552 \index{Parsing}\index{Multi-stage parsing}\agterm{multi-stage parsing} | |
553 is a convenient way to deal with complex input that is not context free. | |
554 | |
555 \paragraph{Communicating with a Parser.} | |
556 The complete status of your parser is contained in a single data | |
557 structure called a | |
558 \index{Parser control block}\index{Control block} | |
559 \agterm{parser control block}. | |
560 All communications with a parser take place via the parser control | |
561 block. \index{Input procedures}Input procedures must place input data | |
562 in the appropriate field in the parser control block. When the parse | |
563 is complete or encounters an error, the results of the parse may be | |
564 found in the parser control block. When AnaGram builds a parser it | |
565 includes, in the \index{File}\index{Header file}header file it | |
566 generates, a \agcode{typedef} statement which defines the structure of | |
567 the parser control block. | |
568 | |
569 \paragraph{Parser Input.} | |
570 \index{Input}\index{Parser}\index{Input procedures} | |
571 The input to your parser may be either characters read directly from | |
572 an input stream, or tokens accumulated by a pre-processor or lexical | |
573 scanner. The way you provide input to your parser depends on how your | |
574 grammar defines input tokens and also on whether or not you have | |
575 requested an event driven parser. | |
576 | |
577 If your parser is event driven, you provide its input by storing the | |
578 input code and the input value, if any, into the parser control block | |
579 and calling the parser. | |
580 | |
581 If you have set the | |
582 \index{Pointer input}\agparam{pointer input} | |
583 \index{Configuration switches}configuration switch | |
584 in your syntax file, you simply initialize the pointer field in your | |
585 parser control block before you call your parser. Your parser will | |
586 then read its input directly from memory by simply incrementing the | |
587 pointer as necessary. | |
588 | |
589 Otherwise, your parser will invoke a macro called | |
590 \index{\agcode{GET{\us}INPUT}}\index{Macros}\agcode{GET{\us}INPUT} every | |
591 time it needs more input. You may define \agcode{GET{\us}INPUT} | |
592 according to your needs. You can define it so that it calls an input | |
593 function, or you can define it so that it executes in-line code each | |
594 time it is invoked. Your \agcode{GET{\us}INPUT} macro should store its | |
595 input code in the | |
596 \index{\agcode{input{\us}code}}\agcode{input{\us}code} | |
597 field of the | |
598 \index{PCB}parser control block. | |
599 If you do not write a \agcode{GET{\us}INPUT} macro, AnaGram will provide | |
600 one which will read characters from \agcode{stdin}. | |
601 | |
602 If your grammar does not define terminal tokens in terms of ASCII | |
603 characters or external token numbers, your \agcode{GET{\us}INPUT} will | |
604 have to determine the appropriate internal | |
605 \index{Token number}\index{Token}\index{Number}token number | |
606 for each input token. To assist you in determining these token | |
607 numbers AnaGram provides a \agcode{typedef enum} statement in the | |
608 \index{File}\index{Header file}header file. You can then use named | |
609 constants to specify the internal token numbers for the input tokens. | |
610 | |
611 \paragraph{Error Handling.} | |
612 \index{Error handling}\index{Error diagnosis}\index{Error recovery} | |
613 Your parser must be prepared to deal with erroneous input. There are | |
614 two aspects to error handling: diagnosing the error, and recovering | |
615 from the error. On encountering an error, your parser will invoke a | |
616 macro called | |
617 \index{\agcode{SYNTAX{\us}ERROR}}\index{Macros}\agcode{SYNTAX{\us}ERROR}. | |
618 If you do not provide a definition for \agcode{SYNTAX{\us}ERROR}, AnaGram | |
619 will provide a simple error diagnostic. AnaGram can also provide | |
620 automatic error diagnoses which pinpoint the location of the error. | |
621 | |
622 \index{Resynchronization} | |
623 AnaGram provides two options for error recovery: | |
624 \agterm{error token resynchronization} and | |
625 \agterm{automatic resynchronization}. These are | |
626 techniques for getting your parser back in synch with its input so it | |
627 can proceed after encountering an error. Normally, if you do not | |
628 select one of these recovery techniques, your parser will terminate | |
629 when it encounters an error; however, you may override this default if | |
630 you wish and provide your own recovery technique. | |
631 | |
632 | |
633 \section{Special Features of AnaGram} | |
634 | |
635 AnaGram provides a number of special features not generally available | |
636 in other parsing systems. These features extend the applicability of | |
637 syntax directed parsing, make it easier for you to write and debug | |
638 your grammars, and make it easier to interface your parser to other | |
639 parsers and to your main program. The following paragraphs provide a | |
640 brief introduction to these features. Their use is described more | |
641 fully in the appropriate chapters in this manual. | |
642 | |
643 \paragraph{Configuration Parameters.} | |
644 \index{Configuration parameters}\index{Parameters} | |
645 AnaGram has a number of configuration switches and parameters which | |
646 you can use to control the way AnaGram works in various circumstances. | |
647 Since programmers often have different requirements, AnaGram strives | |
648 to be as flexible as possible. | |
649 | |
650 Configuration parameters may be set in | |
651 \index{Configuration file}\index{File}configuration files, or in your | |
652 syntax file. The rules for writing configuration parameters are given | |
653 in Chapter 8. A summary of all configuration parameters is given in | |
654 Appendix A. The use of particular configuration parameters is | |
655 described as appropriate throughout this manual. | |
656 | |
657 \paragraph{Character Sets}. | |
658 \index{Character sets} | |
659 In your grammar, if you write terminal tokens explicitly as characters | |
660 or as sets of ASCII characters, AnaGram will set up your parser to | |
661 accept ASCII characters directly, so you will not need to build a | |
662 \index{Lexical scanner}lexical scanner and interface it to | |
663 your parser. Character sets may be specified in terms of ranges of | |
664 characters, such as \agcode{'a-z'}, as unions, intersections or | |
665 differences of sets, denoted by \agcode{+}, \agcode{\&}, or \agcode{-} | |
666 respectively, or as the complement of a set, denoted by \agcode{\~{}}. | |
667 You may also specify a set consisting of a single character either | |
668 with a literal character or with a numeric specification. The | |
669 detailed rules for character set expressions are given in Chapter 8. | |
670 Some useful character sets are: | |
671 | |
672 \begin{itemize} | |
673 \item Letters: \agcode{'a-z' + 'A-Z'} | |
674 \item Digits: \agcode{'0-9'} | |
675 \item DOS end of file: \agcode{-1 + \^{}Z} | |
676 \item Neither end of file nor end of line: | |
677 \agcode{\~{}(-1 + \^{}Z | |
678 + '{\bs}n')} | |
679 \end{itemize} | |
680 | |
681 \paragraph{Keywords.} | |
682 \index{Keywords} | |
683 AnaGram makes special provision for the use of keywords in the parsers | |
684 it builds. In your grammar, you may specify a keyword as a quoted | |
685 string. AnaGram will incorporate special string matching logic in | |
686 your parser to recognize keywords in the parser input. When a keyword | |
687 is recognized, it is treated as an individual token, completely | |
688 independent of the characters which make up the keyword. Some | |
689 examples of keywords: \agcode{"let"}, \agcode{".EQ."}, | |
690 \agcode{"<>"}, \agcode{"/*"}. The | |
691 rules for the use of keyword strings are given in Chapter 8. | |
692 | |
693 Keywords are a powerful and useful tool, but they are not completely | |
694 consistent with the assumptions underlying the use of context-free | |
695 grammars. Thus under certain circumstances it is possible for a | |
696 parser built from a grammar that uses keywords to fail | |
697 % XXX s/parse correctly/correctly parse/ | |
698 to parse correctly text that | |
699 appears to be consistent with the grammar. Such a situation is called a | |
700 \index{Keyword anomalies}\index{Anomaly}keyword anomaly. | |
701 AnaGram detects and diagnoses keyword anomalies in your grammar, and | |
702 provides tools to help you find and correct the problem. Many such | |
703 apparent anomalies, however, are completely innocuous. Keyword | |
704 anomalies are discussed in greater detail in Chapter 7. | |
705 | |
706 \paragraph{Virtual Productions.} | |
707 \index{Virtual productions}\index{Production} | |
708 Virtual productions are a powerful, shorthand notation for specifying | |
709 optional or repetitive input. Using a virtual production can often | |
710 spare you the trouble of writing out four or five regular productions | |
711 and naming them. | |
712 | |
713 Most of the notation for virtual productions should be familiar since | |
714 it has been used in programming manuals for years. The basic forms | |
715 are these: | |
716 | |
717 \index{\agcode{[]}} | |
718 \index{\_opb\_clb} % {}, apparently. XXX? | |
719 \index{\agcode{...}}\index{Ellipsis} | |
720 \index{\agcode{?}} | |
721 \index{\agcode{?...}} | |
722 \index{\agcode{/...}} | |
723 | |
724 % XXX figure out how to not have to encode the line breaks manually. | |
725 % I think we need one of the fancier extension forms of tabular. | |
726 | |
727 \begin{tabular}[t]{ll} | |
728 | |
729 \agcode{[]}&enclosing a list of one or more grammar rules separated | |
730 by \agcode{|} characters denotes an optional\\ | |
731 \phantom{}&choice of one of the rules.\\ | |
732 | |
733 \agcode{\bra \ket}&enclosing a list of one or more grammar rules separated by | |
734 \agcode{|} characters indicates a required\\ | |
735 \phantom{}&choice of one of the rules.\\ | |
736 | |
737 \agcode{...}&following a token, or following \agcode{[]} or \agcode{\bra \ket}, | |
738 indicates arbitrary repetition of the previous choice or\\ | |
739 \phantom{}&token.\\ | |
740 | |
741 \agcode{?}&following a token or set expression makes the token optional.\\ | |
742 | |
743 \agcode{?...}&following a token or set expression denotes zero or more | |
744 repetitions of the token.\\ | |
745 | |
746 \agcode{/...}&following \agcode{[]} or \agcode{\bra \ket} indicates arbitrary | |
747 repetition of the choice subject to the constraint that\\ | |
748 \phantom{}&choices must alternate.\\ | |
749 | |
750 \end{tabular} | |
751 | |
752 \index{Production}\index{Virtual productions}Virtual productions | |
753 are simply syntactic devices. When AnaGram encounters a virtual | |
754 production it translates the virtual production into conventional | |
755 productions which show up in your grammar tables. Virtual productions | |
756 are described in Chapter 8. | |
757 | |
758 \paragraph{Definition Statements.} | |
759 \index{Statement}\index{Definition statement} | |
760 AnaGram provides a mechanism for naming particular character sets, | |
761 keywords or virtual productions in your syntax file. The name you | |
762 define can be used freely in place of the thing defined. Definition | |
763 statements have the form: | |
764 | |
765 % XXX why the hell is the first line offset by a small amount? | |
766 \begin{indentingcode}{0.4in} | |
767 name = \codemeta{character set} | |
768 name = \codemeta{virtual production} | |
769 name = \codemeta{keyword} | |
770 name = \codemeta{immediate action} | |
771 name = \codemeta{token name} | |
772 \end{indentingcode} | |
773 | |
774 The name may be any name acceptable to AnaGram. The name can then be | |
775 used anywhere you might have used the expression on the right side. | |
776 For example: | |
777 | |
778 \begin{indentingcode}{0.4in} | |
779 upper case letter = 'A-Z' | |
780 lower case letter = 'a-z' | |
781 letter = upper case letter + lower case letter | |
782 statement list = statement?... | |
783 while = "WHILE" | |
784 dos eof = \^{}Z | |
785 \end{indentingcode} | |
786 | |
787 \paragraph{Disregard Statements.} | |
788 \index{Disregard statement}\index{Statement} | |
789 If you wish your parser generally to skip over certain characters or | |
790 constructs in its input, such as blanks, newlines, or comments, you | |
791 may use disregard statements to specify precisely what is to | |
792 be skipped and under what circumstances. You may use any number of | |
793 disregard statements and you may cause your parser to disregard any | |
794 number of tokens, of whatever complexity. Any \agparam{disregard} | |
795 statements must | |
796 be placed in a configuration section. They may appear anywhere in | |
797 your syntax file. | |
798 | |
799 The format of the \agparam{disregard} statement is as follows: | |
800 | |
801 \begin{indentingcode}{0.4in} | |
802 disregard \codemeta{token name} | |
803 \end{indentingcode} | |
804 | |
805 The rules for using disregard statements are given in Chapter 8. | |
806 | |
807 \paragraph{Lexeme Statements.} | |
808 \index{Lexeme statement}\index{Statement} | |
809 The lexeme statement is used to fine tune the disregard statement. | |
810 Any \agparam{lexeme} statements must be placed in a configuration | |
811 section. They may appear anywhere in your syntax file. | |
812 | |
813 The format of the \agparam{lexeme} statement is as follows: | |
814 | |
815 \begin{indentingcode}{0.4in} | |
816 lexeme \bra \codemeta{token list} \ket | |
817 \end{indentingcode} | |
818 | |
819 where \textit{token list} is a list of nonterminal tokens, separated | |
820 by commas. | |
821 | |
822 The lexeme statement allows you to specify that for the purposes of | |
823 the disregard statement, the listed tokens are to be treated as | |
824 % XXX make this: indivisible lexical units | |
825 lexical units, so that the disregard statement will be inoperative | |
826 within the tokens specified. Lexeme statements are discussed in | |
827 Chapter 8. | |
828 | |
829 \paragraph{Semantically Determined Productions.} | |
830 \index{Semantically determined production}\index{Production} | |
831 Traditionally, syntax directed parsing systems have enforced a rigid | |
832 distinction between syntactic and semantic information. In these | |
833 systems, it is not possible to use the knowledge that, for example, a | |
834 symbol is a function name and not a variable name in the syntax of a | |
835 programming language. AnaGram provides semantically determined | |
836 productions as a mechanism to allow semantic information to control | |
837 syntactic analysis. | |
838 | |
839 A semantically determined production has more than one | |
840 \index{Token}\index{Reduction token}reduction token | |
841 on the left side. The reduction procedure determines which reduction | |
842 token the parser should use. For example: | |
843 | |
844 \begin{indentingcode}{0.4in} | |
845 variable name, function name | |
846 -> symbol:s = identify{\us}symbol(s); | |
847 \end{indentingcode} | |
848 | |
849 In this situation, the parser will set the reduction token to the | |
850 leftmost token, \agcode{variable name}, before calling the reduction | |
851 procedure. If the reduction procedure finds that \agcode{s} in fact | |
852 identifies a function name, it can change the reduction token to | |
853 \agcode{function name}. Using this capability, you can write your grammar in | |
854 terms of \agcode{variable name} and \agcode{function name}, just as | |
855 though they were syntactically distinguishable. Further discussion of | |
856 semantically determined productions may be found in Chapter 9. | |
857 | |
858 \paragraph{Event Driven Parsers.} | |
859 \index{Parser}\index{Event driven parser} | |
860 Traditionally, parsers have been constructed as subroutines which are | |
861 called to analyze a stream of input and return only when they have | |
862 finished parsing the input or have encountered an unrecoverable error. | |
863 Such a parser itself calls a subroutine to obtain each unit of input. | |
864 There are a number of circumstances where this traditional | |
865 architecture is inconvenient. The most obvious is in situations where | |
866 programs must be event-driven, as in many popular windowing systems. | |
867 A somewhat less obvious situation is where one parser, often called a | |
868 \index{Lexical scanner}\agterm{lexical scanner}, provides input to a | |
869 second. In these situations it is convenient to invert the role of the | |
870 calling and the called programs. | |
871 | |
872 AnaGram provides the capability of specifying that a parser it builds | |
873 should be event-driven. In this case, the parser consists of an | |
874 initialization procedure and an event handler. The main program calls | |
875 the initialization procedure and then calls the event handler with | |
876 each input token in turn. The event handler returns every time the | |
877 parser needs more input. Details on event-driven parsers are given in | |
878 Chapter 9. | |
879 | |
880 \paragraph{Context tracking.} | |
881 \index{Context tracking} | |
882 When you are writing a reduction procedure for a particular grammar | |
883 rule, you often need to know the value one or another of your program | |
884 variables had at the time the first token in the rule was encountered. | |
885 Examples of such variables are: | |
886 | |
887 \begin{itemize} | |
888 \item Line or column number | |
889 \item Index in an input file | |
890 \item Index into an array | |
891 \item Counters, as of symbols defined, etc. | |
892 \end{itemize} | |
893 | |
894 Such variables can be thought of as representing the ``context'' of | |
895 the rule you are reducing. Sometimes it is possible to incorporate | |
896 the values of such variables into the values of reduction tokens, but | |
897 this can become quite cumbersome. AnaGram provides a feature known as | |
898 ``context tracking'' to deal with this problem. Details on context | |
899 tracking can be found in Chapter 9. | |
900 | |
901 \paragraph{Coverage Analysis.} | |
902 \index{Coverage analysis} | |
903 AnaGram has facilities to help you determine the adequacy of your test | |
904 suites. If you set appropriate configuration switches, AnaGram will | |
905 include code in your parser to count the number of times each rule in | |
906 your grammar is reduced or the number of times each reduction | |
907 procedure in your grammar is executed. You can use these counts to | |
908 determine the completeness of your testing. The use of these | |
909 facilities is discussed in Chapter 9. | |
910 | |
911 \paragraph{Immediate Actions.} | |
912 \index{Action}\index{Immediate action} | |
913 AnaGram provides for executing particular functions, called | |
914 ``immediate actions'', at any point in a grammar rule. Thus, | |
915 computation does not need to wait until the rule is complete. | |
916 Immediate actions are most useful when using AnaGram to control an | |
917 interactive process. Immediate actions are discussed in Chapter 8. | |
918 | |
919 \paragraph{Grammar Trace.} | |
920 \index{Grammar Trace}\index{Window}\index{Trace} | |
921 The \agwindow{Grammar Trace} facility of AnaGram allows you to examine | |
922 the workings of your parser in detail. You can set up a | |
923 representation of the | |
924 \index{Parser state stack}\index{State stack}\index{Stack}parser state stack | |
925 and parser state as they might appear in the course of execution of | |
926 your parser. You can then examine the possible inputs and see how the | |
927 state and the state stack change in response to any input you choose. | |
928 You can back up and try other options. You can have several grammar | |
929 traces active simultaneously so you can compare the results of | |
930 different sequences of input. For any configuration of the state | |
931 stack, you can see what grammar rules the parser is in the process of | |
932 matching. | |
933 | |
934 The use of the Grammar Trace is described in Chapter 5. | |
935 | |
936 \paragraph{File Trace.} | |
937 \index{Trace}\index{File Trace}\index{Window} | |
938 The \agwindow{File Trace} is a facility which allows you to see how | |
939 your parser will work on real data. File Trace is an interactive, | |
940 interpretive parser governed by the rules in your grammar. It lets | |
941 you see precisely how your grammar parses a test file. You can see | |
942 the contents of the parse stack at any point in the parse. You can | |
943 watch the parse progress at any level of detail you choose. If you | |
944 wish, you may back up and try again. When AnaGram runs the | |
945 \agwindow{File Trace} or \agwindow{Grammar Trace} it also builds a | |
946 \index{Coverage}\index{Trace Coverage}\index{Window} | |
947 \agwindow{Trace Coverage} table. | |
948 | |
949 Note that AnaGram normally uses a number of short-cut parsing actions. | |
950 To make the parse look like a textbook parse, set the | |
951 \index{Traditional engine}\agparam{traditional engine} | |
952 \index{Configuration parameters}configuration parameter | |
953 in your syntax file. | |
954 | |
955 \paragraph{Aids to Debugging.} | |
956 AnaGram provides a number of facilities to help you debug your parser. | |
957 It provides numerous tables that summarize your grammar and a | |
958 windowing environment to assist you in inspecting and comparing this | |
959 data. To assist in identifying a conflict in your grammar it can | |
960 provide a | |
961 \index{Trace}\index{Conflict Trace}\index{Window}\agwindow{Conflict Trace}, | |
962 a pre-built grammar trace showing a sequence of input which will | |
963 trigger the conflict. If your parser encounters a | |
964 \index{Syntax error}\index{Errors}syntax error, | |
965 you can arrange to have AnaGram build an \agwindow{Error Trace}, a | |
966 pre-built grammar trace showing the sequence of input tokens that led | |
967 to the syntax error. These facilities are described in Chapter 5. | |
968 | |
969 \paragraph{Conflict Resolution.} | |
970 \index{Conflict} | |
971 In order to help you deal with conflicts in your grammar, AnaGram | |
972 provides several methods for resolving conflicts. You may include | |
973 \agterm{precedence declarations} to assist in the construction of | |
974 simple precedence grammars, you may declare tokens to be | |
975 \agparam{sticky}, or you may declare tokens to be | |
976 \agterm{subgrammar} tokens. | |
977 These techniques are described in Chapter 7. |