1 \chapter{Introduction to Syntax Directed Parsing}
2
3 Every programmer has to deal with input data. The units of data may
4 consist of single characters from a keyboard or file, unit records
5 from a file, mouse events in a windowing system, or even more complex
6 data constructs. When the processing of a unit of data depends only on
7 the value of that data, it is usually straightforward. Usually,
8 however, the processing depends also on what has preceded, and often
9 even on what follows, the input under consideration. Keeping track of
10 these dependencies in order to know how to process the data is called
11 \index{Parsing}\agterm{parsing}.
12
13 It is often relatively easy to keep track of simple dependencies when
14 first writing a program. But as the program develops, as new features
15 are added, as bugs are fixed, the dependencies often stop being
16 simple. Then input processing becomes a headache, since it is hard to
17 keep track of or even identify all the particular cases. Changes to
18 the program cause unexpected problems. Program maintenance threatens
19 to get out of control.
20
Syntax directed parsing is a technique for dealing with input data
streams that keeps you in control of the problem, gives you a high
degree of confidence in your program, makes bugs less likely, and
makes program maintenance and the addition of new features much
easier.
25
26 To use syntax directed parsing you create a high level description of
27 the structure of your input data, called a
28 \index{Grammar}\agterm{grammar}.
29 A file which contains a grammar is called a
30 \index{Syntax file}\index{File}\agterm{syntax file}.
31 A
32 \index{Parser generator}\agterm{parser generator},
33 such as AnaGram,
34 can create, from a syntax file, a function (or program) called a
35 \index{Parser}\agterm{parser},
36 written in C or C++. The parser keeps track of all the dependencies
37 in your input, and calls certain functions,
38 \index{Reduction procedure}\agterm{reduction procedures},
39 to deal with specific units or sequences of data as they are
40 encountered.
41
42 Reduction procedures are functions you write to process your data.
43 They are linked, in an obvious way in the grammar, to the structures
44 in your input, so that your parser will call them at precisely the
45 right times with precisely the right data. When you write reduction
46 procedures, you can concentrate entirely on what you have to do with
47 the data. You don't have to encumber your code with switches and
48 tests to determine the structure of your input.
49
50 This chapter describes in a general way how you write a grammar and
51 how a parser works. In particular, it defines a number of terms which
52 are useful in talking about grammars and syntax directed parsing. It
53 then introduces a number of special features of AnaGram. More
54 detailed treatment of writing grammars is given in Chapter 8.
55 Techniques for building a complete functioning parser are discussed in
56 Chapter 9.
57
58
59 \section{Describing an Input Sequence}
60
61 Writing a grammar consists of describing the acceptable input
62 sequences for your program. The vehicle for describing an input
63 sequence is called a
64 \index{Production}\agterm{production}.
65 Productions show how a logical component of the input can be made up
66 of a sequence of more elementary components. A production that
67 describes a date might be written:
68
69 \begin{indentingcode}{0.4in}
70 date
71 -> name of month, day, comma, year
72 \end{indentingcode}
73
74 The components of the input are called
75 \index{Token}\agterm{tokens}.
76 The sequence of tokens on the \index{Right side}right side of the
77 production is called a
78 \index{Grammar rule}\agterm{grammar rule},
79 or \index{Rule}\agterm{rule} for short.
80 The individual tokens on the right side of the rule are also called
81 \index{Rule elements}\index{Rule}\agterm{rule elements}.
82 The \agterm{rule length} is the number of elements in the rule.
83
84 The token on the \index{Left side}left side of the production is
85 called the
86 \index{Reduction token}\index{Token}\agterm{reduction token}
87 for the rule. \index{Token}Tokens may have
88 \index{Value}\index{Semantic value}\agterm{semantic values},
as distinguished from \index{Value}\agterm{syntactic values}; it is
these semantic values that you use in your reduction procedures.  For
instance, the value of
91 \agcode{name of month} could be an integer in the range zero to
eleven, or it could be a pointer to an ASCII string.  The value of
\agcode{day} could be an integer in the range one to thirty-one.
94
95 A grammar consists of a number of such productions, each of which
96 describes some component of the input in terms of other components.
97 It does not take very many productions to describe quite complex input
98 streams. A grammar for the C language, for instance, requires about
99 two hundred productions.
100
101 Many people find the term \agterm{production} quite confusing. It
102 comes from theoretical linguistics where it is used to describe how
103 one may produce sequences which correspond to a set of grammatical
104 rules. Ironically, the major usage of the idea has been in parsing
105 where the interest is not so much in creating sequences which satisfy
106 the grammatical rules as in decoding and analyzing such sequences.
107 Nonetheless, it is convenient, in the above example, to say that the
108 token \agcode{date} \agterm{produces} a sequence of tokens consisting
109 of \agcode{name of month}, \agcode{day}, \agcode{comma}, and
110 \agcode{year}. We also say that the sequence \agterm{reduces} to
111 \agcode{date}.
112
113 There may be more than one production to describe a given component,
114 if there is more than one way it may be represented. For instance,
115
116 \begin{indentingcode}{0.4in}
117 date
118 -> day, name of month, year
119 \end{indentingcode}
120
121 describes another common way of writing a date. In other words, a
122 reduction token may produce a number of different grammar rules.
123
124 Tokens which appear on the left side of one or more productions are called
125 \index{Token}\index{Nonterminal token}\agterm{nonterminal tokens}.
126 Those which appear \emph{only} on the right sides of productions are
127 called
128 \index{Token}\index{Terminal token}\agterm{terminal tokens}.
129 Terminal tokens are the units which actually appear physically in the
130 input. Nonterminal tokens are identified when a sequence of tokens
131 that matches the right side of a production is seen in the input.
132 When AnaGram analyzes a grammar, it assigns a unique
133 \index{Token number}\index{Number}\index{Token}\agterm{token number}
134 to each token it finds in the grammar.
135
136 Nonterminal tokens, such as \agcode{date} in the example above, may
137 appear in any grammar rule just as though they were input tokens. The
138 token on the left side of a production can even appear on the right
139 side as well. Such a production is called a
140 \index{Production}\index{Recursive productions}\agterm{recursive}
141 production. When a nonterminal token appears on the right side of a
142 production, it may be represented in this context by \emph{any} of
143 the grammar rules it produces. Grammars described in this manner are
144 called
145 \index{Context free grammar}\index{Grammar}\agterm{context free grammars}
since there is no contextual constraint on which of the rules a token
produces may be used at any given point in the input.
148
149 Recursive productions may be either left recursive or right recursive.
150 \agterm{Left recursive} productions are those where the recursively
151 defined nonterminal appears as the first element in the recursive
152 rule. \agterm{Right recursive} productions are those where it is the
153 last element. If it appears anywhere between, the production is said
154 to be \agterm{center recursive}.
155
156 Any nonterminal token which has a recursive production must also have
157 at least one simple, non-recursive production. Otherwise, it is not
158 possible to create a finite sequence of terminal tokens from the
159 nonterminal token.
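
For example, a list of names separated by commas might be described
with the following productions (an illustrative sketch, not drawn
from any particular grammar):

\begin{indentingcode}{0.4in}
name list
-> name
-> name list, comma, name
\end{indentingcode}

Here the second production is left recursive; writing its rule
instead as \agcode{name, comma, name list} would make it right
recursive.  In either case the non-recursive first production is what
ends the recursion.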
160
161 Recursion may also occur implicitly in a grammar when one of the
162 tokens on the right side of a production itself has a production
163 involving the token on the left. Such implicit recursion sometimes
164 may involve numerous levels of productions. Implicit recursion occurs
165 most commonly in describing constructs such as arithmetic expressions
166 or the block structure of programming languages.
167
168 Clearly, grammars can accommodate multiple levels of structure in the
169 input sequences they describe. There must, at the top, be a single
170 token which encompasses the entire input. This special token is
171 variously called the
172 \index{Grammar token}\index{Configuration parameters}\index{Token}
173 \agterm{grammar token},
174 the
175 \index{Goal token}\agterm{goal token}
176 or the
177 \index{Start token}\agterm{start token}.
178
179 In AnaGram grammars, terminal tokens need not be declared or otherwise
180 defined. Any token name that appears only on the right side of a
181 production is taken to be a terminal token, and AnaGram presumes that
182 the input procedure that provides tokens to the parser will be able to
183 identify them to the parser appropriately. The details on how to do
184 this are explained in Chapter 9.
185
186 On the other hand, AnaGram allows you to specify terminal tokens
187 explicitly as ASCII characters, or even as sets of ASCII characters,
188 right in the grammar. Thus, you may write \agcode{'0-9'} to represent
189 the set of ASCII digits, or \agcode{'A-Z'} to represent the set of
190 upper case letters. The
191 \index{Semantic value}\index{Value}\index{Token}semantic value
192 of such a token is the ASCII \index{Character codes}character code
193 that actually appears in the input stream. The rules for representing
194 \index{Character sets}character sets are given in Chapter 8. If the
195 various sets you use in your grammar overlap, they may not properly
196 represent terminal tokens. In this case, AnaGram automatically
197 extends your grammar appropriately. This ``set partition'' logic is
198 described in Chapter 6.
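
For instance, a rudimentary name token might be described directly in
terms of such character sets (a sketch meant only to show the
notation):

\begin{indentingcode}{0.4in}
name
-> 'A-Z'
-> name, 'A-Z'
-> name, '0-9'
\end{indentingcode}

These productions describe a name as an upper case letter followed by
any number of upper case letters and digits, with no separate lexical
scanner involved.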
199
200
201 \section{How a Parser Works}
202
203 The aim of a \index{Parser}parser is to match its input with the full
204 syntactic structure specified by the productions which make up the
205 grammar. The primary component of a parser is an input buffer,
206 sometimes thought of as a stack, into which tokens are
207 \agterm{shifted}, or stored sequentially, as they are encountered in
208 the input. At the same time that a token is shifted into the input
209 buffer, its
210 \index{Semantic value}\index{Token}\index{Value}semantic value is
211 pushed onto the
212 \index{Parser value stack}\index{Value stack}\index{Stack}
213 \agterm{value stack}.
214 A token is not shifted into the buffer unless it ``makes sense'', that
215 is, unless it is consistent with the rules of the grammar and with the
216 input that has preceded it. If a token does not make sense, the
217 parser signals a
218 \index{Syntax error}\index{Errors}\agterm{syntax error}.
219 In order to determine whether a token makes sense, the parser has a
220 sort of decision table which provides a list of acceptable tokens for
221 each of a number of
222 \index{Parser}\index{State}\agterm{states}.
223 The table also specifies what the parser is to do with each acceptable
224 token. When the table indicates that a token is to be shifted into
225 the buffer, it also specifies a new state. The parser stacks the
226 current state on a
227 \index{Parser state stack}\index{State stack}\index{Stack}
228 \agterm{state stack}
229 and jumps to the new state. Thus every time a token is shifted into
230 the input buffer, a state number is pushed onto the state stack. For
231 each state of the parser, excepting only the initial state, there is a
232 unique token which will cause a jump to that state. This token is
233 called the
234 \index{Characteristic token}\index{Token}\agterm{characteristic token}
235 for the state.
236
237 When the rightmost, or most recent, tokens in the input buffer match
238 the right side of a production precisely, the parser \emph{may}
239 replace the tokens that match the rule with a single token, the token
240 on the left side of the production. This process of replacing a
241 sequence of tokens with a single token is called
242 \index{Reduction}\agterm{reduction}.
243 The token that replaces the sequence of tokens is called the
244 \index{Reduction token}\index{Token}\agterm{reduction token}.
245
246 The actual mechanism of the reduction is quite important. At the same
247 time that input tokens are removed from the input buffer, state
248 numbers are popped from the state stack, so that when all input tokens
249 matching the rule have been removed, the parser
250 \index{Parser}\index{State}state has been restored to the value it had
251 at the time the first token in the rule was seen. As state numbers
252 are popped from the state stack, token values are popped from the
253 value stack. If the rule has a reduction procedure, temporary
254 variables are loaded with the values popped from the stack and the
255 reduction procedure is called. The reduction token is now shifted
into the input buffer just as though it were an input token.  If the
reduction procedure returns a result, that result is pushed onto the
value stack as the value of the reduction token.  The parser stacks the
259 current state again and jumps to a new state as determined by the
260 parser tables.
261
262 If there are no errors in your input, when the last token has been
263 read from the input, shifted into the input buffer, and reductions
264 performed, there will be precisely one token in the input buffer: the
265 \index{Grammar token}\index{Goal token}grammar, or goal, token which
266 describes your entire input. At this point your parser declares
267 itself finished.
268
269 Reductions do not necessarily occur every time a rule matches the
270 tokens in the input buffer. If the reduction token does not make
271 sense, that is, if it is not consistent with the rules of the grammar
272 and with the input that has preceded it, the parser will not perform
273 the reduction. Suppose there are tokens in the input which match one
274 of the rules given above for \agcode{date}. A reduction will not
275 occur unless \agcode{date} is one of the tokens the parser is actually
276 looking for at that stage in the input.
277
278 Even when the reduction token would make sense, there are still
279 situations where the reduction would not take place. Suppose a
280 grammar includes the two productions given above for \agcode{date} as
281 well as the following:
282
283 \begin{indentingcode}{0.4in}
284 date
285 -> name of month, day
286 \end{indentingcode}
287
288 This production is the same as the first, but with no year
289 specification. We often write dates without specifying the year
290 explicitly. The year is usually understood from context. When a
291 parser directed by this grammar has encountered \agcode{name of month}
292 and \agcode{day}, it can't tell without looking further whether it has
293 a short form or a long form date. In such a circumstance, the parser
294 looks at the next following token, which is called a
295 \index{Lookahead token}\index{Token}\agterm{lookahead token}.
296 If the lookahead token is a comma, then, in the absence of other
297 productions, the input is a long form date. If the lookahead token is
298 not a comma, then the input is certainly not a long form date, and it
299 is proper to reduce the short form production.
300
301 Suppose the lookahead token were a comma and the grammar were also to
302 contain the following production:
303
304 \begin{indentingcode}{0.4in}
305 appointment
306 -> date, comma, time
307 \end{indentingcode}
308
309 Since a comma can follow date, according to this rule, and can also
310 follow day according to the first production, it is impossible to
311 determine, simply by looking at the lookahead token, whether the date
312 was being given in short form or long form. One would have to look
313 beyond the comma to see if what follows the comma matches the rules
314 for time or for year. Although it is possible to build parsers which
315 can do this, it is not generally feasible. This situation is called a
316 \index{Shift-reduce conflict}\index{Conflicts}
317 \agterm{shift-reduce conflict},
318 because the parser cannot decide simply on the basis of the lookahead
319 token whether to reduce the short form rule or to ignore it and shift
320 the comma into the input buffer.
321
322 AnaGram finds and warns you about the conflicts in your grammar when
323 it analyzes your grammar. To do this, it checks all states to find
324 all completed rules. A
325 \index{Completed rule}\index{Rule}\agterm{completed rule}
326 is a rule that is completely matched by the rightmost tokens in the
327 input buffer. For each such rule, AnaGram determines all of the
328 terminal tokens which could follow once that rule has been reduced.
329 These tokens are sometimes called the
330 \index{Token}\agterm{reducing tokens}
331 for the given rule in the given state. (Note that ``reducing tokens''
332 are quite different from ``reduction tokens''.) AnaGram checks each
333 state for shift-reduce conflicts by comparing the set of tokens which
334 can be shifted against all sets of reducing tokens for that state. If
335 there is an overlap, there is a conflict. When AnaGram finds a
336 conflict, it prepares a detailed report on the conflict. This
337 information is then available in the \agwindow{Conflicts} window. A
338 shift-reduce conflict does not prevent you from building a parser. If
339 you build a parser for a grammar which has such a shift-reduce
340 conflict, the parser will resolve the issue by shifting the lookahead
341 token into the input buffer and ignoring the reduction.
342
343 There is another type of conflict involving reductions. Suppose a
344 given state has more than one
345 \index{Completed rule}\index{Rule}completed rule.
346 AnaGram then checks the sets of reducing tokens for all the completed
347 rules. If there is an overlap, there is no way for AnaGram to
348 determine which rule should be reduced simply on the basis of the
349 \index{Token}\index{Lookahead token}lookahead token.
350 This situation is called a
351 \index{Reduce-reduce conflicts}\index{Conflicts}
352 \agterm{reduce-reduce conflict}.
353 AnaGram will also diagnose reduce-reduce conflicts in the Conflicts
354 window. If you decide to ignore a reduce-reduce conflict and ask
355 AnaGram to build a parser, the parser will choose the rule AnaGram
356 encountered first when it read your grammar.
357
358 \index{Conflicts}Conflicts are not necessarily errors. Sometimes you
359 can build a more compact, faster parser by deliberately introducing
360 conflicts. You can rely on AnaGram's default resolution of the
361 conflicts, or you may use operator precedence to resolve the
362 conflicts. The use of operator precedence is described in Chapter 9.
363
364 AnaGram has extensive facilities for identifying and correcting
365 unwanted conflicts. These facilities are described in Chapter 7.
366
367 Traditionally, parsers are built using only shift and reduce actions
368 as described above. The parsers AnaGram builds normally use a number
369 of more complex actions, each of which is equivalent to a number of
370 shift and reduce actions. These complex actions permit AnaGram to
371 build faster, more compact parsers. If you wish, however, you may
372 restrict AnaGram to using only shift and reduce actions by setting the
373 \index{Traditional engine}\index{Configuration switches}
374 \agparam{traditional engine} configuration switch. See Appendix A,
375 Configuration Parameters.
376
377
378 \section{A Note on Notation}
379
380 \index{Context free grammar}\index{Grammar}Context free grammars have
381 been traditionally represented in the literature using
382 \index{Backus-Naur Form}\agterm{Backus-Naur Form}, or
383 \index{BNF}\agterm{BNF}.
384 In Backus-Naur Form, certain characters, called metacharacters, are
385 used to punctuate productions and all other printable characters are
386 taken to represent themselves literally. Named tokens are denoted by
387 enclosing the name within angle brackets ($< >$). The left side of a
production is distinguished from the right side by the
symbol $::=$.
391 If several productions have the same left side, the vertical
392 bar $|$ is used to separate them. The elements of a grammar
393 rule are simply juxtaposed to indicate that one follows another.
394 Blanks are ignored. Thus, in BNF, the first production given for
395 \agcode{date}, above, would be:
396
397 % ~ is a hard space
398 $$
399 <date>~~::=~~<name~of~month>~~<day>~~,~~<year>
400 $$
401
402 AnaGram uses a notation more consonant with ordinary programming
403 usage. Thus token names need not be bracketed and literal characters
404 must be appropriately quoted. The elements of rules are joined by
405 commas. Using this approach, there is no need for metacharacters and
406 it becomes possible to make a number of useful extensions to the
407 notation. The detailed rules for writing grammars using AnaGram's
408 notation are given in Chapter 8.
409
410
411 \section{Reduction Procedures}
412 \index{Reduction procedure}
413
414 Of course, the reason for parsing an input stream is to interpret the
415 data in the stream and to process it in some useful manner. The
416 primary tool for doing this is the \agterm{reduction procedure}. A
417 reduction procedure is a piece of C code that is executed when a
418 particular grammar rule is reduced. Often, a reduction procedure
419 calculates a value which becomes the
420 \index{Value}\index{Semantic value}semantic value
421 of the
422 \index{Token}\index{Reduction token}reduction token.
423 The input to the reduction procedure consists of the values of the
424 tokens that make up the grammar rule.
425
426 AnaGram allows you to assign C variable names to the tokens in the
427 grammar rule, so that you can refer to them in the reduction
428 procedure. To assign a C variable name, simply follow the token in
429 the rule with a colon and the C variable name. Simple reduction
430 procedures can be written as C or C++ expressions. At the end of the
431 rule, write an equal sign and an expression, terminated by a
432 semicolon. For example:
433
434 \begin{indentingcode}{0.4in}
435 (int) hex digit
436 -> '0-9':d = d-'0';
437 -> 'a-f':d = d-'a'+10;
438 -> 'A-F':d = d-'A'+10;
439 \end{indentingcode}
440
441 When any one of these rules is matched, the value of the token in the
442 rule is assigned to the temporary variable \agcode{d}. The expression
443 to the right of the equal sign is evaluated and the value of the
444 expression is stored as the value of \agcode{hex digit}, which has
445 been declared to be an \agcode{int}. These productions define
446 hexadecimal digits as ASCII characters, and calculate the binary value
447 of the digit in terms of the ASCII character code, \agcode{d}. The
448 binary value becomes the value of \agcode{hex digit}. Hexadecimal
449 digits can be combined to make hexadecimal numbers by writing the
450 following productions:
451
452 \begin{indentingcode}{0.4in}
453 (int) hex number
454 -> hex digit
455 -> hex number:n, hex digit:d = 16*n+d;
456 \end{indentingcode}
457
464
465 There are several important points to notice in this example. First,
466 reduction procedures are executed ``from the bottom up''. That is,
467 the reduction procedure for \agcode{hex digit} is executed before any
468 reduction procedure for \agcode{hex number}. Second, if there is no
469 reduction procedure for a production, the value of the first token in
470 the rule is assigned to the reduction token. Thus, it is not
471 necessary to provide a reduction procedure in the first production for
472 \agcode{hex number}. Third, the reduction procedures for recursive
473 productions are always executed \emph{after} the reduction procedure,
474 if any, for the non-recursive production which begins the recursion.
475 Finally, when an input sequence is described using left recursion, as
476 in this example, the elements of the sequence are processed left to
477 right.
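
To see the order of events concretely, consider the input \agcode{2F}
(this is a sketch of the parse, not actual AnaGram output).  The
character \agcode{'2'} is reduced to \agcode{hex digit} with the
value $50 - 48 = 2$, and, since the first production for
\agcode{hex number} has no reduction procedure, 2 becomes the value
of \agcode{hex number}.  The character \agcode{'F'} is then reduced
to \agcode{hex digit} with the value $70 - 65 + 10 = 15$, and the
recursive production computes $16 \times 2 + 15 = 47$, the value of
the hexadecimal number \agcode{2F}.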
478
479 %
480 % XXX this is a bad example as you never want to read floats this way!
481 % Can we think of something else?
482 %
483 If you wish to process the elements of a sequence right to left, you
484 may use right recursion. For example, it is sometimes convenient to
485 define the fraction part of a decimal number thus:
486
487 \begin{indentingcode}{0.4in}
488 (double) fraction part
489 -> '0-9':d = (d - '0')/10.0;
490 -> '0-9':d, fraction part:f = (d - '0' + f)/10.0;
491 \end{indentingcode}
492
493 In this case the leading digits are stored temporarily in the parse
494 stack, and then the fraction part is evaluated right to left only when
495 the last digit has been found.
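
For the input \agcode{25}, again as a sketch, the digit \agcode{'5'}
is reduced first, by the non-recursive production, giving
$5/10 = 0.5$; the recursive production then combines that value with
the leading digit to give $(2 + 0.5)/10 = 0.25$.  The digits are thus
processed right to left.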
496
497 Reduction procedures can be more complex than simple expressions. After the
498 equal sign you may include an arbitrary block of C or C++ code, enclosed in
499 braces, \agcode{\bra \ket}. To return a
500 \index{Semantic value}\index{Token}\index{Value}semantic value
501 for the reduction token simply use a return statement. Of course,
502 reduction procedures have the full resources of C or C++ at their
503 disposal. They may set and interrogate global variables and may call
504 functions freely. Reduction procedures are usually implemented as C
505 functions. Very simple procedures may be implemented as macros,
506 unless you have disabled this option by turning off the
507 \index{Macros}\index{Allow macros}\agparam{allow macros} configuration
508 switch. See Chapter 8 for the rules for writing reduction procedures.
509 When AnaGram builds a parser it copies all of the reduction procedures
510 you have defined to the parser file, and includes code to call them at
511 the right time.
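
As an illustration of the form, the following production assembles a
two-digit number in a brace-enclosed reduction procedure.  This is
only a sketch; \agcode{record{\us}value} stands for some function of
your own, and the exact rules for writing such procedures are given
in Chapter 8.

\begin{indentingcode}{0.4in}
(int) two digit number
-> '0-9':tens, '0-9':units = \bra
    int value = 10*(tens - '0') + (units - '0');
    record{\us}value(value);   /* hypothetical function of your own */
    return value;
\ket
\end{indentingcode}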
512
513 Since the reduction procedures you write will probably need some
514 support code, such as \agcode{\#include} statements and declarations,
515 you may incorporate C or C++ code into your syntax file at any point.
516 You need only enclose it in braces, \agcode{\bra \ket}. Such code
517 is called
518 \index{Embedded C}\agterm{embedded C}.
519 All embedded C code is also copied to the \index{File}parser file, and
520 \emph{precedes} all of your reduction procedures.
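
A syntax file might, for example, begin with an embedded C block
along these lines (a sketch; the file name and the variable are
purely illustrative):

\begin{indentingcode}{0.4in}
\bra
\#include "mydefs.h"
static int symbol{\us}count = 0;
\ket
\end{indentingcode}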
521
522
523 \section{Building a Parser}
524 \index{Building a Parser}\index{Parser}
525
526 In order to round out a parser into a functioning program it needs
527 input procedures, as well as error diagnosis and recovery
528 capabilities. AnaGram has a number of options available which give
529 you a high degree of flexibility in configuring a parser to suit your
530 particular needs. All of the options are provided with reasonable
531 defaults, so that you can safely disregard any option until you need
532 the features it provides. The following paragraphs summarize the
533 options AnaGram provides. These topics are discussed in greater
534 detail in Chapter 9.
535
536 \paragraph{Invoking a Parser.}
537 \index{Parser}
538 Normally, AnaGram configures parsers as functions which you can call
from elsewhere in your program.  In this situation, you call the
parser; it processes its input and returns either when it has finished
or when it cannot proceed because of errors.  Alternatively, if you
542 set the
543 \index{Event driven}\agparam{event driven}
544 configuration switch, your parser will be configured so that you have
545 two procedures to call: an \index{Initializer}\agterm{initializer} and
546 a \index{Parser}\agterm{parser}. In the event driven configuration
547 you start the parse by calling the initializer and then you call the
548 parser once for each unit of input. Using the event driven mode makes
549 it quite easy to configure a parser as a filter and to chain several
550 parsers together so that the output from one parser is the input to
551 the next. Such
552 \index{Parsing}\index{Multi-stage parsing}\agterm{multi-stage parsing}
553 is a convenient way to deal with complex input that is not context free.
554
555 \paragraph{Communicating with a Parser.}
556 The complete status of your parser is contained in a single data
557 structure called a
558 \index{Parser control block}\index{Control block}
559 \agterm{parser control block}.
560 All communications with a parser take place via the parser control
561 block. \index{Input procedures}Input procedures must place input data
562 in the appropriate field in the parser control block. When the parse
563 is complete or encounters an error, the results of the parse may be
564 found in the parser control block. When AnaGram builds a parser it
565 includes, in the \index{File}\index{Header file}header file it
566 generates, a \agcode{typedef} statement which defines the structure of
567 the parser control block.
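
In the simplest arrangement, a main program might do no more than the
following.  This is a sketch only; the header and function names are
hypothetical and depend on how you have configured AnaGram.

\begin{indentingcode}{0.4in}
\#include "myparser.h"   /* header file generated by AnaGram */

int main(void)
\bra
    myparser();   /* run the generated parser on its input */
    /* the results of the parse are now in the parser control block */
    return 0;
\ket
\end{indentingcode}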
568
569 \paragraph{Parser Input.}
570 \index{Input}\index{Parser}\index{Input procedures}
571 The input to your parser may be either characters read directly from
572 an input stream, or tokens accumulated by a pre-processor or lexical
573 scanner. The way you provide input to your parser depends on how your
574 grammar defines input tokens and also on whether or not you have
575 requested an event driven parser.
576
577 If your parser is event driven, you provide its input by storing the
578 input code and the input value, if any, into the parser control block
579 and calling the parser.
580
581 If you have set the
582 \index{Pointer input}\agparam{pointer input}
583 \index{Configuration switches}configuration switch
584 in your syntax file, you simply initialize the pointer field in your
585 parser control block before you call your parser. Your parser will
586 then read its input directly from memory by simply incrementing the
587 pointer as necessary.
588
589 Otherwise, your parser will invoke a macro called
590 \index{\agcode{GET{\us}INPUT}}\index{Macros}\agcode{GET{\us}INPUT} every
591 time it needs more input. You may define \agcode{GET{\us}INPUT}
592 according to your needs. You can define it so that it calls an input
593 function, or you can define it so that it executes in-line code each
594 time it is invoked. Your \agcode{GET{\us}INPUT} macro should store its
595 input code in the
596 \index{\agcode{input{\us}code}}\agcode{input{\us}code}
597 field of the
598 \index{PCB}parser control block.
599 If you do not write a \agcode{GET{\us}INPUT} macro, AnaGram will provide
600 one which will read characters from \agcode{stdin}.
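
A minimal \agcode{GET{\us}INPUT} along the lines of that default
might look like this, assuming, as in the parsers AnaGram generates,
that the parser control block is accessible through the \agcode{PCB}
macro:

\begin{indentingcode}{0.4in}
\#define GET{\us}INPUT ((PCB).input{\us}code = getchar())
\end{indentingcode}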
601
602 If your grammar does not define terminal tokens in terms of ASCII
603 characters or external token numbers, your \agcode{GET{\us}INPUT} will
604 have to determine the appropriate internal
605 \index{Token number}\index{Token}\index{Number}token number
606 for each input token. To assist you in determining these token
607 numbers AnaGram provides a \agcode{typedef enum} statement in the
608 \index{File}\index{Header file}header file. You can then use named
609 constants to specify the internal token numbers for the input tokens.
610
611 \paragraph{Error Handling.}
612 \index{Error handling}\index{Error diagnosis}\index{Error recovery}
613 Your parser must be prepared to deal with erroneous input. There are
614 two aspects to error handling: diagnosing the error, and recovering
615 from the error. On encountering an error, your parser will invoke a
616 macro called
617 \index{\agcode{SYNTAX{\us}ERROR}}\index{Macros}\agcode{SYNTAX{\us}ERROR}.
618 If you do not provide a definition for \agcode{SYNTAX{\us}ERROR}, AnaGram
619 will provide a simple error diagnostic. AnaGram can also provide
620 automatic error diagnoses which pinpoint the location of the error.
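
A simple \agcode{SYNTAX{\us}ERROR} definition might do no more than
print a message; in this sketch, \agcode{line{\us}number} stands for
whatever location information your own input code maintains:

\begin{indentingcode}{0.4in}
\#define SYNTAX{\us}ERROR {\bs}
    fprintf(stderr, "syntax error near line \%d{\bs}n", line{\us}number)
\end{indentingcode}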
621
622 \index{Resynchronization}
623 AnaGram provides two options for error recovery:
624 \agterm{error token resynchronization} and
625 \agterm{automatic resynchronization}. These are
626 techniques for getting your parser back in synch with its input so it
627 can proceed after encountering an error. Normally, if you do not
628 select one of these recovery techniques, your parser will terminate
629 when it encounters an error; however, you may override this default if
630 you wish and provide your own recovery technique.
631
632
633 \section{Special Features of AnaGram}
634
635 AnaGram provides a number of special features not generally available
636 in other parsing systems. These features extend the applicability of
637 syntax directed parsing, make it easier for you to write and debug
638 your grammars, and make it easier to interface your parser to other
639 parsers and to your main program. The following paragraphs provide a
640 brief introduction to these features. Their use is described more
641 fully in the appropriate chapters in this manual.
642
643 \paragraph{Configuration Parameters.}
644 \index{Configuration parameters}\index{Parameters}
645 AnaGram has a number of configuration switches and parameters which
646 you can use to control the way AnaGram works in various circumstances.
647 Since programmers often have different requirements, AnaGram strives
648 to be as flexible as possible.
649
650 Configuration parameters may be set in
651 \index{Configuration file}\index{File}configuration files, or in your
652 syntax file. The rules for writing configuration parameters are given
653 in Chapter 8. A summary of all configuration parameters is given in
654 Appendix A. The use of particular configuration parameters is
655 described as appropriate throughout this manual.
656
\paragraph{Character Sets.}
658 \index{Character sets}
659 In your grammar, if you write terminal tokens explicitly as characters
660 or as sets of ASCII characters, AnaGram will set up your parser to
661 accept ASCII characters directly, so you will not need to build a
662 \index{Lexical scanner}lexical scanner and interface it to
663 your parser. Character sets may be specified in terms of ranges of
664 characters, such as \agcode{'a-z'}, as unions, intersections or
665 differences of sets, denoted by \agcode{+}, \agcode{\&}, or \agcode{-}
666 respectively, or as the complement of a set, denoted by \agcode{\~{}}.
667 You may also specify a set consisting of a single character either
668 with a literal character or with a numeric specification. The
669 detailed rules for character set expressions are given in Chapter 8.
670 Some useful character sets are:
671
672 \begin{itemize}
673 \item Letters: \agcode{'a-z' + 'A-Z'}
674 \item Digits: \agcode{'0-9'}
675 \item DOS end of file: \agcode{-1 + \^{}Z}
676 \item Neither end of file nor end of line:
677 \agcode{\~{}(-1 + \^{}Z
678 + '{\bs}n')}
679 \end{itemize}
680
681 \paragraph{Keywords.}
682 \index{Keywords}
683 AnaGram makes special provision for the use of keywords in the parsers
684 it builds. In your grammar, you may specify a keyword as a quoted
685 string. AnaGram will incorporate special string matching logic in
686 your parser to recognize keywords in the parser input. When a keyword
687 is recognized, it is treated as an individual token, completely
688 independent of the characters which make up the keyword. Some
689 examples of keywords: \agcode{"let"}, \agcode{".EQ."},
690 \agcode{"<>"}, \agcode{"/*"}. The
691 rules for the use of keyword strings are given in Chapter 8.
692
693 Keywords are a powerful and useful tool, but they are not completely
694 consistent with the assumptions underlying the use of context-free
695 grammars. Thus under certain circumstances it is possible for a
696 parser built from a grammar that uses keywords to fail
to correctly parse text that
699 appears to be consistent with the grammar. Such a situation is called a
700 \index{Keyword anomalies}\index{Anomaly}keyword anomaly.
701 AnaGram detects and diagnoses keyword anomalies in your grammar, and
702 provides tools to help you find and correct the problem. Many such
703 apparent anomalies, however, are completely innocuous. Keyword
704 anomalies are discussed in greater detail in Chapter 7.
705
706 \paragraph{Virtual Productions.}
707 \index{Virtual productions}\index{Production}
708 Virtual productions are a powerful, shorthand notation for specifying
709 optional or repetitive input. Using a virtual production can often
710 spare you the trouble of writing out four or five regular productions
711 and naming them.
712
713 Most of the notation for virtual productions should be familiar since
714 it has been used in programming manuals for years. The basic forms
715 are these:
716
717 \index{\agcode{[]}}
718 \index{\_opb\_clb} % {}, apparently. XXX?
719 \index{\agcode{...}}\index{Ellipsis}
720 \index{\agcode{?}}
721 \index{\agcode{?...}}
722 \index{\agcode{/...}}
723
724 % XXX figure out how to not have to encode the line breaks manually.
725 % I think we need one of the fancier extension forms of tabular.
726
727 \begin{tabular}[t]{ll}
728
729 \agcode{[]}&enclosing a list of one or more grammar rules separated
730 by \agcode{|} characters denotes an optional\\
731 \phantom{}&choice of one of the rules.\\
732
733 \agcode{\bra \ket}&enclosing a list of one or more grammar rules separated by
734 \agcode{|} characters indicates a required\\
735 \phantom{}&choice of one of the rules.\\
736
737 \agcode{...}&following a token, or following \agcode{[]} or \agcode{\bra \ket},
738 indicates arbitrary repetition of the previous choice or\\
739 \phantom{}&token.\\
740
741 \agcode{?}&following a token or set expression makes the token optional.\\
742
743 \agcode{?...}&following a token or set expression denotes zero or more
744 repetitions of the token.\\
745
746 \agcode{/...}&following \agcode{[]} or \agcode{\bra \ket} indicates arbitrary
747 repetition of the choice subject to the constraint that\\
748 \phantom{}&choices must alternate.\\
749
750 \end{tabular}
751
752 \index{Production}\index{Virtual productions}Virtual productions
753 are simply syntactic devices. When AnaGram encounters a virtual
754 production it translates the virtual production into conventional
755 productions which show up in your grammar tables. Virtual productions
756 are described in Chapter 8.
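
For example, using this notation an optionally signed integer can be
written in a single production (an illustrative sketch):

\begin{indentingcode}{0.4in}
integer
-> '-'?, '0-9'...
\end{indentingcode}

Here the \agcode{?} makes the minus sign optional and the
\agcode{...} allows the digit to repeat.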
757
758 \paragraph{Definition Statements.}
759 \index{Statement}\index{Definition statement}
760 AnaGram provides a mechanism for naming particular character sets,
761 keywords or virtual productions in your syntax file. The name you
762 define can be used freely in place of the thing defined. Definition
763 statements have the form:
764
765 % XXX why the hell is the first line offset by a small amount?
766 \begin{indentingcode}{0.4in}
767 name = \codemeta{character set}
768 name = \codemeta{virtual production}
769 name = \codemeta{keyword}
770 name = \codemeta{immediate action}
771 name = \codemeta{token name}
772 \end{indentingcode}
773
774 The name may be any name acceptable to AnaGram. The name can then be
775 used anywhere you might have used the expression on the right side.
776 For example:
777
778 \begin{indentingcode}{0.4in}
779 upper case letter = 'A-Z'
780 lower case letter = 'a-z'
781 letter = upper case letter + lower case letter
782 statement list = statement?...
783 while = "WHILE"
784 dos eof = \^{}Z
785 \end{indentingcode}
786
787 \paragraph{Disregard Statements.}
788 \index{Disregard statement}\index{Statement}
789 If you wish your parser generally to skip over certain characters or
790 constructs in its input, such as blanks, newlines, or comments, you
791 may use disregard statements to specify precisely what is to
792 be skipped and under what circumstances. You may use any number of
793 disregard statements and you may cause your parser to disregard any
794 number of tokens, of whatever complexity. Any \agparam{disregard}
795 statements must
796 be placed in a configuration section. They may appear anywhere in
797 your syntax file.
798
799 The format of the \agparam{disregard} statement is as follows:
800
801 \begin{indentingcode}{0.4in}
802 disregard \codemeta{token name}
803 \end{indentingcode}
804
805 The rules for using disregard statements are given in Chapter 8.
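
For instance, to have a parser skip blanks and tabs wherever they
occur, you might write something along these lines (a sketch; the
token definition is illustrative, and the square brackets here stand
for a configuration section, whose exact form is given in Chapter 8):

\begin{indentingcode}{0.4in}
blank or tab = ' ' + '{\bs}t'

[
  disregard blank or tab
]
\end{indentingcode}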
806
807 \paragraph{Lexeme Statements.}
808 \index{Lexeme statement}\index{Statement}
The lexeme statement is used to fine-tune the disregard statement.
810 Any \agparam{lexeme} statements must be placed in a configuration
811 section. They may appear anywhere in your syntax file.
812
813 The format of the \agparam{lexeme} statement is as follows:
814
815 \begin{indentingcode}{0.4in}
816 lexeme \bra \codemeta{token list} \ket
817 \end{indentingcode}
818
819 where \textit{token list} is a list of nonterminal tokens, separated
820 by commas.
821
822 The lexeme statement allows you to specify that for the purposes of
823 the disregard statement, the listed tokens are to be treated as
indivisible lexical units, so that the disregard statement will be inoperative
826 within the tokens specified. Lexeme statements are discussed in
827 Chapter 8.
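
Continuing the sketch above, declaring \agcode{hex number} to be a
lexeme would keep blanks from being skipped between the digits of a
single number, while still allowing them to be skipped elsewhere:

\begin{indentingcode}{0.4in}
[
  lexeme \bra hex number \ket
]
\end{indentingcode}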
828
829 \paragraph{Semantically Determined Productions.}
830 \index{Semantically determined production}\index{Production}
831 Traditionally, syntax directed parsing systems have enforced a rigid
832 distinction between syntactic and semantic information. In these
833 systems, it is not possible to use the knowledge that, for example, a
834 symbol is a function name and not a variable name in the syntax of a
835 programming language. AnaGram provides semantically determined
836 productions as a mechanism to allow semantic information to control
837 syntactic analysis.
838
839 A semantically determined production has more than one
840 \index{Token}\index{Reduction token}reduction token
841 on the left side. The reduction procedure determines which reduction
842 token the parser should use. For example:
843
844 \begin{indentingcode}{0.4in}
845 variable name, function name
846 -> symbol:s = identify{\us}symbol(s);
847 \end{indentingcode}
848
849 In this situation, the parser will set the reduction token to the
850 leftmost token, \agcode{variable name}, before calling the reduction
851 procedure. If the reduction procedure finds that \agcode{s} in fact
852 identifies a function name, it can change the reduction token to
853 \agcode{function name}. Using this capability, you can write your grammar in
854 terms of \agcode{variable name} and \agcode{function name}, just as
855 though they were syntactically distinguishable. Further discussion of
856 semantically determined productions may be found in Chapter 9.
857
858 \paragraph{Event Driven Parsers.}
859 \index{Parser}\index{Event driven parser}
860 Traditionally, parsers have been constructed as subroutines which are
861 called to analyze a stream of input and return only when they have
862 finished parsing the input or have encountered an unrecoverable error.
863 Such a parser itself calls a subroutine to obtain each unit of input.
864 There are a number of circumstances where this traditional
865 architecture is inconvenient. The most obvious is in situations where
866 programs must be event-driven, as in many popular windowing systems.
867 A somewhat less obvious situation is where one parser, often called a
868 \index{Lexical scanner}\agterm{lexical scanner}, provides input to a
869 second. In these situations it is convenient to invert the role of the
870 calling and the called programs.
871
872 AnaGram provides the capability of specifying that a parser it builds
873 should be event-driven. In this case, the parser consists of an
874 initialization procedure and an event handler. The main program calls
875 the initialization procedure and then calls the event handler with
876 each input token in turn. The event handler returns every time the
877 parser needs more input. Details on event-driven parsers are given in
878 Chapter 9.
879
880 \paragraph{Context tracking.}
881 \index{Context tracking}
882 When you are writing a reduction procedure for a particular grammar
883 rule, you often need to know the value one or another of your program
884 variables had at the time the first token in the rule was encountered.
885 Examples of such variables are:
886
887 \begin{itemize}
888 \item Line or column number
889 \item Index in an input file
890 \item Index into an array
891 \item Counters, as of symbols defined, etc.
892 \end{itemize}
893
894 Such variables can be thought of as representing the ``context'' of
895 the rule you are reducing. Sometimes it is possible to incorporate
896 the values of such variables into the values of reduction tokens, but
897 this can become quite cumbersome. AnaGram provides a feature known as
898 ``context tracking'' to deal with this problem. Details on context
899 tracking can be found in Chapter 9.
900
901 \paragraph{Coverage Analysis.}
902 \index{Coverage analysis}
903 AnaGram has facilities to help you determine the adequacy of your test
904 suites. If you set appropriate configuration switches, AnaGram will
905 include code in your parser to count the number of times each rule in
906 your grammar is reduced or the number of times each reduction
907 procedure in your grammar is executed. You can use these counts to
908 determine the completeness of your testing. The use of these
909 facilities is discussed in Chapter 9.
910
911 \paragraph{Immediate Actions.}
912 \index{Action}\index{Immediate action}
913 AnaGram provides for executing particular functions, called
914 ``immediate actions'', at any point in a grammar rule. Thus,
915 computation does not need to wait until the rule is complete.
916 Immediate actions are most useful when using AnaGram to control an
917 interactive process. Immediate actions are discussed in Chapter 8.
918
919 \paragraph{Grammar Trace.}
920 \index{Grammar Trace}\index{Window}\index{Trace}
921 The \agwindow{Grammar Trace} facility of AnaGram allows you to examine
922 the workings of your parser in detail. You can set up a
923 representation of the
924 \index{Parser state stack}\index{State stack}\index{Stack}parser state stack
925 and parser state as they might appear in the course of execution of
926 your parser. You can then examine the possible inputs and see how the
927 state and the state stack change in response to any input you choose.
928 You can back up and try other options. You can have several grammar
929 traces active simultaneously so you can compare the results of
930 different sequences of input. For any configuration of the state
931 stack, you can see what grammar rules the parser is in the process of
932 matching.
933
934 The use of the Grammar Trace is described in Chapter 5.
935
936 \paragraph{File Trace.}
937 \index{Trace}\index{File Trace}\index{Window}
938 The \agwindow{File Trace} is a facility which allows you to see how
939 your parser will work on real data. File Trace is an interactive,
940 interpretive parser governed by the rules in your grammar. It lets
941 you see precisely how your grammar parses a test file. You can see
942 the contents of the parse stack at any point in the parse. You can
943 watch the parse progress at any level of detail you choose. If you
944 wish, you may back up and try again. When AnaGram runs the
945 \agwindow{File Trace} or \agwindow{Grammar Trace} it also builds a
946 \index{Coverage}\index{Trace Coverage}\index{Window}
947 \agwindow{Trace Coverage} table.
948
949 Note that AnaGram normally uses a number of short-cut parsing actions.
950 To make the parse look like a textbook parse, set the
951 \index{Traditional engine}\agparam{traditional engine}
952 \index{Configuration parameters}configuration parameter
953 in your syntax file.
954
955 \paragraph{Aids to Debugging.}
956 AnaGram provides a number of facilities to help you debug your parser.
957 It provides numerous tables that summarize your grammar and a
958 windowing environment to assist you in inspecting and comparing this
959 data. To assist in identifying a conflict in your grammar it can
960 provide a
961 \index{Trace}\index{Conflict Trace}\index{Window}\agwindow{Conflict Trace},
962 a pre-built grammar trace showing a sequence of input which will
963 trigger the conflict. If your parser encounters a
964 \index{Syntax error}\index{Errors}syntax error,
965 you can arrange to have AnaGram build an \agwindow{Error Trace}, a
966 pre-built grammar trace showing the sequence of input tokens that led
967 to the syntax error. These facilities are described in Chapter 5.
968
969 \paragraph{Conflict Resolution.}
970 \index{Conflict}
971 In order to help you deal with conflicts in your grammar, AnaGram
972 provides several methods for resolving conflicts. You may include
973 \agterm{precedence declarations} to assist in the construction of
974 simple precedence grammars, you may declare tokens to be
975 \agparam{sticky}, or you may declare tokens to be
976 \agterm{subgrammar} tokens.
977 These techniques are described in Chapter 7.