comparison doc/manual/dd.tex @ 0:13d2b8934445

Import AnaGram (near-)release tree into Mercurial.
author David A. Holland
date Sat, 22 Dec 2007 17:52:45 -0500
parents
children
comparison
equal deleted inserted replaced
-1:000000000000 0:13d2b8934445
1 \chapter{Programming With AnaGram}
2
3 Although AnaGram has many options and features which enable you to
4 build a parser that meets your needs precisely, it has well-defined
5 defaults so that you do not generally need to learn about an option
6 until you need the facility it provides. The purpose of this chapter
7 is to show you how to use the options and features effectively.
8
9 The options and features of AnaGram can be divided roughly into three
10 groups: those that control the general aspects of your parser, those
11 that control input to the parser and those that control error
12 handling. After dealing with these three groups of options and
13 features, this chapter concludes with a discussion of various advanced
14 techniques.
15
16 Many aspects of your parser are controlled by setting configuration
17 parameters, either in a configuration file or in your syntax file.
18 This chapter presumes you are familiar with setting configuration
19 parameters. The names of configuration parameters, as they occur in
20 the text, are printed in \agparam{bold face type}. Appendix A
21 describes the use of configuration parameters and provides a detailed
22 discussion of each configuration parameter.
23
24
25 \section{General Aspects}
26
27 \subsection{Program Development}
28
29 The first step in writing a program is to write a grammar in AnaGram
30 notation which describes the input the program expects. The file
31 containing the grammar, called the syntax file, conventionally has the
32 extension \agfile{.syn}. You could also make up a few sample input
33 files at this time, but it is not necessary to write reduction
34 procedures at this stage.
35
36 Run AnaGram and use the \index{Analyze Grammar}Analyze Grammar command
37 to create parse tables. If there are syntax errors in the grammar at
38 this point, you will have to correct them before proceeding, but you
39 do not necessarily have to eliminate conflicts, if there are any, at
40 this time. There are, however, many aids available to help you with
41 conflicts. These aids are described in Chapters 5 through 7, and
42 somewhat more briefly in the Online Help topics.
43
44 Once syntax errors are corrected, you can try out your grammar on the
45 sample input files using the File Trace facility. With File Trace,
46 you can see interactively just how your grammar operates on your test
47 files. You can also use Grammar Trace to answer ``what if'' questions
48 concerning input to the grammar. The Grammar Trace does not use a
49 test file, but rather allows you to make input choices interactively.
50
51 At any time, you can write reduction procedures to process your input
52 data as its components are identified in the input stream. Each
53 procedure is associated with a grammar rule. The reduction procedures
54 will be incorporated into your parser when you create it with the
55 \index{Build Parser}Build Parser command.
56
57 By default, unless you specify an input procedure, parser input will
58 be read from \agcode{stdin}, using the default \agcode{GET{\us}INPUT}
59 macro. You will probably wish to redefine \agcode{GET{\us}INPUT}, or
60 configure your parser to use \agparam{pointer input} or \agparam{event
61 driven} input.
62
63 \subsection{The Default Parser}
64 \index{Parser}
65
66 If you apply the Build Parser command to a syntax file which contains
67 only a grammar, with no reduction procedures and no embedded C code,
68 AnaGram will still produce a complete C command line program which you
69 can compile and run. \index{Input procedures}This parser will parse
70 character input from \agcode{stdin}. If the input does not satisfy
71 the rules of your grammar, the parser will issue a syntax error
72 diagnostic to \agcode{stderr} identifying the exact line and column
73 numbers of the error. If the parser should overflow its stack, it
74 will abort with an error message to \agcode{stderr}. If the parse is
75 successful, that is if the parser succeeds in identifying the grammar
76 token without encountering an error, it will simply return to the
77 command line.
78
79 You can extend such a simple parser, often quite effectively, by
80 adding only reduction procedures. If the reduction procedures write
81 output to \agcode{stdout}, you can produce a conventional ``filter''
82 program without having to pay any attention to input handling, error
83 handling, or any of the other options AnaGram provides.
84 %CALC, in the EXAMPLES directory, is an example of such a program.
85
86 \subsection{The Content of the Parser and Header Files}
87
88 % XXX s/from your parser file/from your syntax file/
89 AnaGram creates two \index{Output files}\index{File}output files: a
90 parser file and a header file. \index{Parser file}\index{File}The
91 parser file contains the C code you need to compile and link before
92 you can run your parser. It begins with the \index{C
93 prologue}\index{Prologue}C prologue, if any, from your parser file.
94 The C prologue is an optional block of \index{Embedded C}embedded C or
95 C++ which precedes everything else in your syntax file. Although it
96 can contain anything you wish, normally it is used to place
97 identification information, \index{Copyright notice}copyright notices,
98 etc., at the beginning of your parser file. If your parser uses token
99 types that require definition, the appropriate \agcode{\#include}
100 statements and definitions should be placed in the C prologue. See
101 ``Defining Token Types'', below.
102
103 Following the C prologue, AnaGram places a number of definitions of
104 variables and macros that you might need to refer to in your embedded
105 C, and in your reduction procedures. Not the least of these
106 definitions is the parser control block, described below. Following
107 these definitions, AnaGram inserts all your embedded C, in the order
108 in which it occurred in your syntax file. Following the embedded C
109 come all your reduction procedures. Finally, AnaGram adds the tables
110 which summarize your grammar and a parsing engine customized to your
111 requirements.
112
113 The \index{Header file}\index{File}header file contains definitions
114 needed by your parser. These include definitions of the \index{Parser
115 value stack}\index{Value stack}\index{Stack}value stack type, the
116 input token type, the \index{Parser control block}parser control block
117 type, and token name enumeration constants. The definitions are
118 placed in a header file so that you can make them available to other
119 modules if necessary.
120
121 \subsection{Naming Output Files}
122 \index{Output files}\index{File}
123
124 Unless you specify otherwise, AnaGram names the parser and header
125 files following conventional programming practice. Both \index{File
126 name}\index{File name}files have the same name as your syntax file,
127 with extensions \agfile{.c} and \agfile{.h} respectively. These
128 names, however, are controlled by the configuration parameters
129 \index{Configuration parameters}\index{Name}
130 \index{Parser file name}\agparam{parser file name} and
131 \index{Header file name}\agparam{header file name}
132 respectively, so you can override AnaGram's defaults if you wish. If
133 you normally use C++ rather than C, for example, you might want to
134 include the following statement in your configuration file:
135
136 \begin{indentingcode}{0.4in}
137 parser file name = "\#.cpp"
138 \end{indentingcode}
139
140 When AnaGram names the parser file it substitutes the name of your
141 syntax file for the ``\#'' character in the file name template.
142
143 \subsection{Compiling Your Parser}
144 \index{Parser}
145
146 Although AnaGram was designed primarily with ANSI C in mind, a good
147 deal of care has been taken to ensure that its output is consistent
148 with older C compilers and with newer C++ compilers. If your compiler
149 does not support ANSI function prototypes, you should set the
150 \index{Old style}\index{Configuration switches}\agparam{old style}
151 switch in your configuration file. If you are intending to compile
152 your parser using a 16-bit compiler, you might want to turn on the
153 \index{Near functions}\index{Configuration switches}\agparam{near functions}
154 switch in your configuration file. If you are building a parser for
155 use in an embedded system, you might want to make sure the
156 \index{Const data}\index{Configuration switch}\agparam{const data}
157 configuration switch is set so that all the tables AnaGram generates
158 will be declared \agcode{const}.
159
160 \subsection{Naming Your Parser}
161 \index{Parser}
162
163 In the default case, AnaGram creates a main program for you.
164 Generally, however, you will probably want a parser function which you
165 can call from your own main program. You won't want AnaGram to define
166 \agcode{main} for you. You can stop AnaGram from defining
167 \agcode{main} in any of several ways: Include some embedded C in your
168 syntax file, turn off
169 \index{Main program}the \index{Configuration switches}\agparam{main program}
170 configuration switch, or turn on either the \agparam{event driven} or
171 \agparam{pointer input} switches. Since you almost always will have
172 some embedded C in your syntax file, you will seldom have to use the
173 \agparam{main program} switch.
174
175 Normally, AnaGram simply uses the name of your syntax file to create
176 the name of your parser. Thus if your syntax file is called
177 \agfile{ana.syn} your parser will have the name \agcode{ana}. AnaGram
178 does not check the parser name for compliance with the rules of C. If
179 you use strange characters in your file name, you will get strange
180 characters in the name of your parser, and you will get unpleasant
181 remarks from your C compiler when you try to compile your parser.
182 Thus, for example, if you were to name your parser file
183 \agfile{!@\#.syn}, AnaGram will call your parser \agcode{!@\#}. Your
184 compiler will doubtless choke.
185
186 \index{Parser}
187 If you wish AnaGram to give your parser a name other than the file
188 name, you may set the
189 \index{Parser name}\index{Name}\index{Configuration parameters}
190 \agparam{parser name}
191 configuration parameter. Thus, to make sure your parser is called
192 \agcode{periwinkle} you would include the following line in a
193 configuration section in your syntax file:
194
195 % Note: this is not actually required to be in double quotes.
196 % It'll also accept anything that's syntactically acceptable to it
197 % as a C data type, which also lets you give it things like
198 % ``periwinkle *'' that result in uncompilable code.
199
200 \begin{indentingcode}{0.4in}
201 parser name = "periwinkle"
202 \end{indentingcode}
203
204 Besides the parser itself, AnaGram generates a number of other
205 functions, variables and type definitions when it creates your parser.
206 All these entities are named using the parser name as the base. The
207 templates and their usages are as follows:
208
209 \begin{indenting}{0.4in}
210 \begin{tabular}{ll}
211
212 \index{Parser}\index{Initializer}\index{Name}
213 \agcode{init{\us}\$}&initializer for parser\\
214
215 \index{Grammar token}\index{Value}
216 \agcode{\${\us}value}&returns value of grammar token\\
217
218 \index{Parser value stack}\index{Value stack}\index{Stack}
219 \agcode{\${\us}vs{\us}type}&value stack type\\
220
221 \agcode{\${\us}it{\us}type}&input token union\\
222 \agcode{\${\us}token{\us}type}&token name enumeration typedef\\
223 \agcode{\${\us}\%{\us}token}&token name enumeration constants\\
224 \agcode{\${\us}pcb{\us}type}&typedef of parser control block\\
225
226 \index{Parser control block}
227 \agcode{\${\us}pcb}&parser control block\\
228
229 \index{Rule Count}
230 \agcode{\${\us}nrc}&rule count table\\
231
232 \agcode{\${\us}nrpc}&reduction procedure count table\\
233 \\
234 \end{tabular}
235 \end{indenting}
236
237 When AnaGram defines these entities it substitutes the parser name for
238 the dollar sign. In the token name enumeration constants it
239 substitutes the token name for the \index{{\us}prc}``\%'' character.
240 Embedded space characters are replaced with underscore characters.
241
242 \subsection{The Parser Control Block}
243 \index{Parser control block}
244
245 The complete status of a parse is kept in a structure called a
246 \agterm{parser control block}. As a default, AnaGram defines a parser
247 control block for you, and provides a macro, \index{PCB}\agcode{PCB},
248 which enables you to access it simply. The name AnaGram assigns to
249 the parser control block is
250 % XXX
251 %\agcode{\${\us}pcb}, where as above ``\$'' is replaced with the name of
252 %your parser.
253 \agcode{\textit{$<$parser name$>$}{\us}pcb}.
254 If you need to refer to the parser control block from some module
255 other than the parser module, use an \agcode{\#include} statement to
256 include the header file for your parser and refer to the parser
257 control block by its name as above. The structure of the parser
258 control block is described in Appendix E. In this chapter, particular
259 fields will be discussed as necessary.
260
261 Since the parser control block contains the complete status of a
262 parse, you may interrupt a parse and continue it later by saving and
263 restoring the control block. If you have multiple input streams, all
264 controlled by the same grammar, you may have a separate control block
265 for each stream. If you wish to call your parser recursively, you may
266 define a fresh control block for each level of recursion. To make
267 best use of these capabilities, you will need to declare the parser
268 control block yourself. This is discussed below under ``Advanced
269 Techniques''.
270
271 \subsection{Calling Your Parser}
272 % XXX should have an example of actually calling the thing.
273 % XXX should also have ``terminating your parser'' or something like
274 % that.
275
276 The parser function AnaGram defines is a simple function which takes
277 no arguments and returns no values. All communication with the parser
278 takes place via the parser control block. When your parser returns,
279 \index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} contains an exit
280 code describing the outcome of the parse. Symbols for the
281 exit codes are defined in the header file AnaGram generates.
282 \index{Exit codes}\index{Error codes}These symbols, their values,
283 and their meanings are:
284
285 \index{AG{\us}RUNNING{\us}CODE}
286 \index{AG{\us}SUCCESS{\us}CODE}
287 \index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}
288 \index{AG{\us}REDUCTION{\us}ERROR{\us}CODE}
289 \index{AG{\us}STACK{\us}ERROR{\us}CODE}
290 \index{AG{\us}SEMANTIC{\us}ERROR{\us}CODE}
291 \begin{indenting}{0.4in}
292 \begin{tabular}{lll}
293 \agcode{AG{\us}RUNNING{\us}CODE}&0&Parse is not yet complete\\
294 \agcode{AG{\us}SUCCESS{\us}CODE}&1&Parse terminated successfully\\
295 \agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE}&2&Syntax error was encountered\\
296 \agcode{AG{\us}REDUCTION{\us}ERROR{\us}CODE}&3&Bad reduction token encountered\\
297 \agcode{AG{\us}STACK{\us}ERROR{\us}CODE}&4&Parser stack overflowed\\
298 \agcode{AG{\us}SEMANTIC{\us}ERROR{\us}CODE}&5&Semantic error\\
299 \\
300 \end{tabular}
301 \end{indenting}
302
303 Only an event driven parser will return the value
304 \agcode{AG{\us}RUNNING{\us}CODE}, since any other parser continues executing
305 until it terminates successfully or encounters an unrecoverable error.
306
307 Syntax errors, reduction token errors, and stack errors are discussed
308 below under ``Error Handling''.
309
310 % XXX: this bit belongs somewhere else
311 \agcode{AG{\us}SEMANTIC{\us}ERROR{\us}CODE} is a special case. It is available
312 for you to use in your reduction procedures to terminate a parse for
313 semantic reasons.
314 % XXX add: AnaGram will never set it itself.
315 If, in a reduction procedure, you determine that parsing should not
316 continue, you need only include the statement:
317
318 \begin{indentingcode}{0.4in}
319 PCB.exit{\us}flag = AG{\us}SEMANTIC{\us}ERROR{\us}CODE;
320 \end{indentingcode}
321
322 When your reduction procedure returns, the parse will then terminate
323 and the parser will return control to the calling program.
324
325 \subsection{Parser Return Value}
326 \index{Value}
327
328 If, in your grammar, there is a value assigned to the grammar token,
329 you may retrieve it, after the parse is complete, by calling the
330 parser value function, the name of which is given by
331 \agcode{\${\us}value} where ``\$'' is the name of your parser.
332 \agcode{\${\us}value} takes no arguments, and returns a value of the type
333 assigned to the grammar token in your syntax file.
334
335 Although in theoretical discussions of parsing the result of the parse
336 is contained in the value of the grammar token, in practice, more
337 often than not, results are communicated to other procedures by
338 setting the values of global variables. Thus the value of the grammar
339 token is often of little interest.
340
341 Since the parser per se takes no arguments, it is usually convenient
342 to write a small interface function with a calling sequence
343 appropriate to the problem. The interface function can then take care
344 of appropriate initializations, call the parser, and retrieve results.
345
346 \subsection{Defining Token Types}
347
348 When you add reduction procedures to your grammar, you will often find
349 it convenient to add type declarations for the \index{Semantic
350 value}\index{Token}\index{Value}semantic values of some of the tokens
351 in your grammar. As long as the types you use are conventional C data
352 types\index{Data type}\index{Token}, you don't have to do anything
353 special. If, however, you have used types or classes that you have
354 defined yourself, you need to make sure that the appropriate
355 definition statements precede their use in the code AnaGram generates.
356 To do this, you need to have a C prologue in your syntax file. In the
357 C prologue, you should place the definition statements your parser
358 will need, or at least an \agcode{\#include} statement that will cause
359 the types or classes to be defined.
360
361 \subsection{Debugging Your Parser}
362
363 Because the ``flow of control'' of your parser is algorithmically
364 derived from your grammar, debugging your parser separates into two
365 separate exercises: debugging your grammar, discussed in Chapter 7,
366 and debugging your reduction procedures.
367
368 When debugging, it is usually a good idea to turn off the
369 \index{Macros}\index{Allow macros}\index{Configuration switches}
370 \agparam{allow macros}
371 switch. This switch is normally on and causes simple reduction
372 procedures to be implemented as macros. When you turn it off, you get
373 a proper function definition for each reduction procedure, so you can
374 put a breakpoint in any reduction procedure you choose. If the
375 \index{Line numbers}\index{Configuration switches}
376 \agparam{line numbers} switch
377 is on each reduction procedure will contain a
378 \index{\#line}\agcode{\#line} directive to show where the reduction
379 procedure is found in your syntax file. Once you have acquired
380 confidence in your reduction procedures you may turn the
381 \agparam{allow macros} switch back on for slightly improved
382 performance.
383
384 If your debugger allows you to inspect entire structures, you will
385 find it convenient to look at the parser control block while you are
386 debugging. The contents of the parser control block are described in
387 Appendix E.
388
389 A good way to begin debugging a new parser is to simply put a
390 breakpoint in each reduction procedure. Start your parser and step
391 through the reduction procedures one by one, verifying that they
392 perform as expected. After you have stepped through a reduction
393 procedure, turn off its breakpoint. If there are multiple paths,
394 leave breakpoints on the paths not taken. Liberal use of the assert
395 macro helps assure that your fixes don't break procedures you have
396 already tested.
397
398 \section{Providing Input to Your Parser}
399 \index{Parser}\index{Input}\index{Input procedures}
400
401 This section describes three methods for providing input to your
402 parser. In the first method your program calls the parser which then
403 requests input tokens as it needs them. It returns only when it has
404 completed the parse. The parser requests input tokens by invoking a
405 macro called \agcode{GET{\us}INPUT}, described below.
406
407 The second method for providing input can be used when the entire
408 sequence of input tokens is available in memory. This method is
409 controlled by the \index{Pointer input}\index{Configuration
410 switches}\agparam{pointer input} configuration switch. It is
411 discussed below.
412
413 The third method for providing input is especially convenient when
414 using \index{Lexical scanner}lexical scanners or multi-stage parsing.
415 It is controlled by the \index{Event driven}\index{Configuration
416 switches}\agparam{event driven} configuration switch.
417
418 \subsection{The \agcode{GET{\us}INPUT} Macro}
419 \index{GET{\us}INPUT}\index{Macros}
420
421 The default parser simply reads characters from \agcode{stdin}. It
422 does this by invoking a macro called \agcode{GET{\us}INPUT} every time it
423 needs an input character. The default definition of
424 \agcode{GET{\us}INPUT} is:
425
426 \index{PCB}\index{input{\us}code}
427 \begin{indentingcode}{0.4in}
428 \#define GET{\us}INPUT (PCB.input{\us}code = getchar())
429 \end{indentingcode}
430
431 \agcode{PCB.input{\us}code} is an integer field in the parser control
432 block which is used to hold the current input \index{Character
433 codes}character code.
434
435 By including your own definition of \agcode{GET{\us}INPUT} in your
436 embedded C, you override the default definition provided by AnaGram.
437 The only requirement for \agcode{GET{\us}INPUT} is that it store a
438 character in \agcode{PCB.input{\us}code}. Suppose you wish to make a
439 parser that reads characters from a file provided by the calling
440 program. You could include the following in your embedded C:
441
442 \begin{indentingcode}{0.4in}
443 extern FILE *file;
444 \#define GET{\us}INPUT (PCB.input{\us}code = fgetc(file))
445 \end{indentingcode}
446
447 Now your parser, when invoked, will read characters from the specified
448 file instead of reading them from \agcode{stdin}. Of course,
449 \agcode{GET{\us}INPUT} is not constrained to reading a file or data
450 stream. You may implement \agcode{GET{\us}INPUT} in any manner you
451 choose. You may implement it as a function call, or you may choose to
452 define \agcode{GET{\us}INPUT} so that it expands into inline code for
453 faster execution.
454
455 \subsection{Pointer Input}
456 \index{Pointer input}\index{Input procedures}
457
458 It often happens that the data you wish to parse are already in memory
459 when you are ready to call the parser. While you could rewrite
460 \agcode{GET{\us}INPUT} to simply scan the array by incrementing a
461 pointer, AnaGram provides an alternative approach since this is such a
462 common situation. In a configuration section in your syntax file
463 simply turn on the \index{Pointer input}\index{Configuration
464 switches}\agparam{pointer input} switch. Then before you call your
465 parser, load \index{pointer}\index{PCB}\agcode{PCB.pointer}, the
466 pointer field in the parser control block, with a pointer to your
467 array. Assuming your parser is called \agcode{ana}, and you wish to
468 call an interface function with an argument consisting of a character
469 string, here's what you do:
470
471 \begin{indentingcode}{0.4in}
472 {}[
473 pointer input
474 ]
475
476 \bra
477 void ana{\us}shell(char *source{\us}text) \bra
478 PCB.pointer = (unsigned char *)source{\us}text;
479 ana();
480 \ket
481 \ket
482 \end{indentingcode}
483
484 % XXX s/the//
485 The type of the \agcode{PCB.pointer} defaults to
486 \agcode{unsigned char *} to
487 minimize difficulty with full 256-character sets. If your compiler is
488 fussy, you should use a cast, as above, when you set the value. If
489 your data requires more than 256
490 \index{Character codes}character codes, you may still use pointer
491 input by using the \index{Pointer type}\index{Configuration
492 parameters}\agparam{pointer type} configuration parameter to change
493 the definition of the field in the parser control block. Normally,
494 the value of \agparam{pointer type} should be a C data type that
495 converts to integer. If \agparam{pointer type} does not convert to
496 integer, you must provide an
497 \index{INPUT{\us}CODE}\index{Macros}\agcode{INPUT{\us}CODE} macro, as
498 described below, to extract a token identifier. Do not change
499 \agparam{pointer type} to \agcode{signed char} in order to avoid the
500 cast in the above example. That will have the effect of making all
501 character codes above 127 inaccessible to your parser.
502
503 Note that if you use pointer input your parser does not need a
504 \agcode{GET{\us}INPUT} macro. Parsers that use pointer input usually
505 run somewhat faster than those that use \agcode{GET{\us}INPUT},
506 particularly if they use keywords.
507 % XXX that is unclear - I know it means that the keyword logic is
508 % particularly improved by using pointer input, but it could be read
509 % to imply that adding keywords makes the parser even faster, which is
510 % backwards.
511
512 \subsection{Event Driven Parsers}
513 \index{Event driven parser}\index{Parser}
514
515 There are many situations where the input to a parser is developed by
516 an independent process and the linkage required to implement a
517 \agcode{GET{\us}INPUT} macro is unduly cumbersome. In these
518 circumstances, it is convenient to use an \agparam{event driven}
519 parser. With an event driven parser, you do not simply call the
520 parser and wait for it to finish. Instead, you call its
521 \index{Initializer}initializer first, and then call it each time you
522 have a character for it. The parser processes the character and
523 returns as soon as it needs more input, encounters an error or finds
524 the parse complete. You can interrogate
525 \index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} to determine
526 whether the parser can accept more input.
527
528 To create an event driven parser, set the \index{Event
529 driven}\index{Configuration switches}\agparam{event driven} switch in
530 your syntax file. Then, to initialize the parser, call the
531 initialization procedure, or \index{Initializer}initializer, provided
532 by AnaGram. The name of this procedure is \agcode{init{\us}\$} where
533 ``\agcode{\$}'' represents the name of your parser. If your parser is named
534 \agcode{ana}, the
535 \index{Parser}initialization procedure is named \agcode{init{\us}ana}.
536 To process a single character, store the character in
537 \index{input{\us}code}\index{PCB}\agcode{PCB.input{\us}code}, then call
538 \agcode{ana}. When it returns, check
539 \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to see if the
540 parser is still running. When the parse is successful, you may
541 retrieve the value of the grammar token, if you wish, by using the
542 \index{Parser value function}parser value function, in this case,
543 \agcode{ana{\us}value}.
544 % XXX s/case,/case/ above. or s/function,/function;/
545
546 As an example, let us imagine we are to write a an interface function
547 for our parser which takes a list of string pointers, a count, and a
548 pointer to a location into which we may store an error flag. The
549 input to our parser is to be the concatenation of all the character
550 strings. We will set up a loop which will call the parser for all the
551 characters of the strings in turn. We will assume that the function
552 will return the value of the grammar token, which we will assume to be
553 also of type double:
554
555 \begin{indentingcode}{0.4in}
556 {}[
557 event driven
558 ]
559
560 \bra
561 double parse{\us}strings(char **ptr, int n{\us}strings, int *error) \bra
562 init{\us}ana();
563 while (PCB.exit{\us}flag == AG{\us}RUNNING{\us}CODE \&\&
564 n{\us}strings--) \bra
565 char *p = *ptr++;
566 while (PCB.exit{\us}flag == AG{\us}RUNNING{\us}CODE \&\& *p) \bra
567 PCB.input{\us}code == *p++;
568 ana();
569 \ket
570 \ket
571 assert(error);
572 *error = PCB.exit{\us}flag != AG{\us}SUCCESS{\us}CODE;
573 return ana{\us}value();
574 \ket
575 \ket
576 \end{indentingcode}
577
578 The purpose of this example is simply to show how to use an event
579 driven parser. Of course it would be possible, as far as this example
580 is concerned, to concatenate the strings and use pointer input
581 instead. A problem sufficiently complex to \emph{require} an event
582 driven parser would be too complex to serve as a simple example.
583
584 \subsection{Token Input}
585 \index{Token input}\index{Input procedures}
586
587 Thus far in this chapter, we have assumed that the input to your
588 parser consisted of ordinary characters. There are many situations
589 where it is convenient to have a
590 \index{Preprocessor}\index{Token}\index{Token}preprocessor, or
591 \index{Lexical scanner}lexical scanner, which identifies basic tokens
592 and hands them over to your parser for further processing. Accepting
593 input from such preprocessors is discussed in the remainder of this
594 section.
595
596 Sometimes preprocessors simply pass on text characters, acting as
597 filters to remove unwanted characters, such as white space or
598 comments, and to insert other text, such as macro expansions. In such
599 situations, there is no need to treat the preprocessor differently
600 from any other character source. The input methods described above
601 are sufficient to deal with the input provided by the preprocessor.
602
603 In what follows, we deal with situations where the preprocessor passes
604 on \index{Token number}\index{Token}\index{Number}\agterm{token
605 numbers} rather than character codes. The preprocessor may also pass
606 on token \emph{values}, which also need accommodation of some sort.
607 % XXX also also?
608
609 There are two principal interfacing problems to deal with. The first
610 has to do with identifying the tokens to your parser. The second has
611 to do with providing the semantic values of the tokens.
612 %
613 %If your preprocessor does not provide values with its tokens, your parser
614 %may use any of the input techniques described above for character input,
615 %the only difference being that instead of setting PCB.input{\us}code to a
616 %character value, you set it to the token identifier.
617 %
618 %If your preprocessor does provide token values, then you have to use either
619 %a GET{\us}INPUT macro, or configure your parser to be event driven. If you wish
620 %to use pointer input, you must provide an INPUT{\us}CODE macro.
621 %
622
623 \subsection{Identifying Tokens using Predefined Token Numbers}
624 \index{Token}\index{Number}\index{Token number}
625
626 If you have a pre-existing \index{Lexical scanner}lexical scanner,
627 written for use with some other parsing system, it probably outputs
628 its own set of token numbers. The most robust way of interfacing such
629 a lexical scanner is to include, in your syntax file, either an
630 \index{Enum statement}\agparam{enum} statement or a set of definition
631 statements
632 for the terminal tokens, equating
633 \index{Terminal token}\index{Token}terminal token names with the
634 numeric values output by the lexical scanner, so that AnaGram treats
635 them as character codes. In this situation, you simply set
636 \index{PCB}\index{input{\us}code}\agcode{PCB.input{\us}code} to the token
637 number determined by the lexical scanner. Generally, lexical scanners
638 written for other parsing systems expect to be called for each token.
639 Therefore, you would normally use a \agcode{GET{\us}INPUT} macro to call
640 the lexical scanner and provide input to your parser.
641 % XXX as far as I know, lex expects to call yacc, not vice versa.
642
643 \subsection{Identifying Tokens using AnaGram's Token Numbers}
644
645 If you are writing a new preprocessor, you have more freedom. You
646 could simply create a set of codes as above. On the other hand, you
647 can save a level of translation and make your system run faster by
648 providing your parser with internal token numbers directly. Here's
649 what you have to do.
650
651 First, when you write your syntax file, leave all the terminal tokens
652 undefined. That means, of course, that you have to have a name for
653 each terminal token. You can't use a literal character or a number
654 for the token. AnaGram will generate a unique token number for each
655 token in your grammar. In the header file it generates, AnaGram
656 always provides a set of
657 \index{Enumeration constants}\index{Constants}enumeration constants
658 for all the named tokens in your grammar. The names for these
659 constants are controlled by the
660 \index{Configuration parameters}\index{Enum constant name}
661 \agparam{enum constant name}
662 parameter. (See Appendix A.) These constants normally have the form
663 \agcode{\textit{$<$parser name$>$}{\us}\textit{$<$token name$>$}{\us}token}.
664 Note that embedded space in the token name will be replaced with
665 underscore characters. Assume your parser is called \agcode{ana}, and
666 in your grammar you have a token called \agcode{integer constant}.
667 The enumeration constant identifying the token is then
668 \agcode{ana{\us}integer{\us}constant{\us}token}. Now, to hand off an integer
669 constant to your parser you write:
670
671 \begin{indentingcode}{0.4in}
672 PCB.input{\us}code = ana{\us}integer{\us}constant{\us}token;
673 \end{indentingcode}
674
675 \subsection{Providing Token Values}
676
677 If your \index{Preprocessor}preprocessor provides \index{Semantic
678 value}\index{Token}\index{Value}semantic values for input tokens, you
679 must inform AnaGram by setting the
680 \index{Input values}\index{Configuration switches}\index{Value}
681 \agparam{input values}
682 configuration switch in your syntax file. Then, whenever you provide a
683 token, you must also store a value in
684 \index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}.
685 You can do this as part of your \agcode{GET{\us}INPUT} macro, or, if you
686 have an \agparam{event driven} parser, when you set
687 \index{input{\us}code}\index{PCB}\agcode{PCB.input{\us}code} prior to
688 calling the parser function. If you are using \index{Pointer
689 input}\index{Configuration switches}\agparam{pointer input}, the
690 pointer will presumably identify the token value. You must provide an
691 \index{INPUT{\us}CODE}\index{Macros}\agcode{INPUT{\us}CODE} macro to extract
692 the identification code from the token value. For example, if the
693 token value is a structure and the appropriate member field is called
694 \agcode{id}, you would write:
695
696 \begin{indentingcode}{0.4in}
697 \#define INPUT{\us}CODE(t) (t).id
698 \end{indentingcode}
699
700 Generally, the simplest way to interface the preprocessor and your
701 parser, when you are passing token values, is to use an event driven
702 parser. In this situation, the preprocessor, when it identifies a
703 token, simply loads the token identifier into
704 \agcode{PCB.input{\us}code}, loads the value into
705 \index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}, and calls
706 the parser.
707
708 \index{Token}
709 If the values of your input tokens are all of the same type, you must
710 set the
711 \index{Default input type}\index{Configuration parameters}
712 \index{Input type}\agparam{default input type}
713 configuration parameter so that AnaGram can declare
714 \index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}
715 appropriately. \index{Token type}\agparam{Default input type} will
716 default to \agcode{int} if you do not set it either in your configuration file
717 or in your syntax file.
718
719 Some \index{Lexical scanner}lexical scanners simply provide a pointer
720 to the text of the token they have identified. In this situation, you
721 would set \agparam{default input type} to \agcode{char *}. When you
722 provide a token to the parser you would set \agcode{PCB.input{\us}value}
723 to point to the text of the token.
724
725 If different tokens have values of different types, the situation
726 becomes slightly more complex. First, you must tell AnaGram about the
727 types of your input tokens. You do this by including a
728 \index{Declaration}\index{Type declarations}\agterm{type declaration}
729 in your syntax file. A type declaration is a token declaration
730 preceded by a C data type\index{Data type}\index{Token} in
731 parentheses. Assume that your \index{Preprocessor}preprocessor
732 identifies, among others, the following tokens: \agcode{name},
733 \agcode{string}, \agcode{real constant}, \agcode{integer constant},
734 and \agcode{unsigned constant}. You might then include the following
735 in your syntax file:
736
737 \begin{indentingcode}{0.4in}
738 {}[
739 input values
740 ]
741
742 (char *) name, string
743 (double) real constant
744 (long) integer constant, unsigned constant
745 \end{indentingcode}
746
747 AnaGram will then create, in the parser control block, an input value
748 field which can accommodate any of these terminal tokens in your
749 grammar.
750
751 To enable you to store data into the input value field of the parser
752 control block, AnaGram provides a convenient macro called
753 \index{INPUT{\us}VALUE}\index{Macros}\agcode{INPUT{\us}VALUE} to serve as
754 the destination of an assignment statement. \agcode{INPUT{\us}VALUE}
755 takes the type of the data as a parameter. Thus one could write:
756
757 \begin{indentingcode}{0.4in}
758 INPUT{\us}VALUE(char *) = text{\us}pointer;
759 INPUT{\us}VALUE(long) = constant{\us}value;
760 \end{indentingcode}
761
762 \section{Error Handling}
763
764 There are two classes of errors your parser needs to be able to deal
765 with. The first consists of \agterm{implementation errors} and the second
766 consists of \agterm{syntax errors}. Syntax errors arise because the input to
767 the parser does not conform to the definition of the language it is
768 designed to parse. Implementation errors arise because the programs
769 we write are never perfect and because the environment in which our
770 programs run is often something less than ideal.
771
772 \subsection{Implementation Errors}
773 \index{Implementation errors}\index{Errors}
774
775 % XXX parser stack overflow is not really an ``implementation error''
776
777 There are two implementation errors which your parser needs to be able
778 to deal with. The first is \agterm{parser stack overflow}. The
779 second comes from a bad \agterm{reduction token}.
780
781 \index{Stack}
782 \paragraph{Stack Overflow.}
783 Stack overflow is an error which your parser must be able to deal
784 with. In general, no matter how big you make your parser stack, it is
785 possible for legitimate input to cause it to overflow. The size of
786 the stack for your parser is controlled by the configuration parameter
787 \agparam{parser stack size}. This parameter defaults to a value of
788 32. This value has been found to be adequate for ordinary usage.
789
790 If your parser has only left recursive constructs, then there is a
791 maximum depth beyond which the parser stack will never grow. If your
792 parser has center recursive or right recursive productions, then no
793 matter how much stack space you allocate, there will always be a
794 syntactically correct input file which causes the stack to overflow.
795 This can be illustrated by the following set of C statements:
796
797 \begin{indentingcode}{0.4in}
798 x = y;
799 x = (y);
800 x = ((y));
801 x = (((y)));
802 .
803 .
804 .
805 \end{indentingcode}
806
807 Each set of parentheses requires another level on the parser stack.
808 When this set of statements was tried with Borland C++, it ran out of
809 stack space at 127 sets of parentheses and diagnosed the problem as
810 ``Expression is too complicated''.
811
812 AnaGram calculates the actual size of the parser stack by calculating
813 the maximum depth for left recursive constructs and adding half the
814 value of
815 \index{Parser stack size}\index{Configuration parameters}\index{Stack}
816 \index{Parser state stack}\index{State stack}
817 \agparam{parser stack size}. It then uses the larger of the calculated
818 value and \agparam{parser stack size} to allocate stack storage. You
819 may check the value actually used in your parser by inspecting the
820 definition of
821 \index{AG{\us}PARSER{\us}STACK{\us}SIZE}\agcode{AG{\us}PARSER{\us}STACK{\us}SIZE}.
822
823 If your parser runs out of stack space, it will set
824 \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to
825 \index{AG{\us}STACK{\us}ERROR{\us}CODE}\agcode{AG{\us}STACK{\us}ERROR{\us}CODE}, invoke
826 the
827 \index{Macros}\index{PARSER{\us}STACK{\us}OVERFLOW}\agcode{PARSER{\us}STACK{\us}OVERFLOW}
828 macro and return to the calling program. The default definition of
829 this macro is:
830
831 \begin{indentingcode}{0.4in}
832 \#define PARSER{\us}STACK{\us}OVERFLOW \bra fprintf(stderr, {\bs}
833 "{\bs}nParser stack overflow{\bs}n"); \ket
834 \end{indentingcode}
835
836 % XXX ``provide your own definition'', not ``redefine''
837
838 If this definition is not consistent with your needs, you may redefine
839 it in any block of embedded C in your syntax file.
840
841 \index{Reduction token error}
842 \paragraph{Reduction Token Error.}
843 A properly functioning parser should never encounter a reduction token
844 error. Therefore, reduction token errors should be taken quite
845 seriously. The only way to cause a reduction token error in an
846 otherwise properly functioning parser is to set incorrectly the
847 reduction token for a semantically determined production.
848 % XXX ``to incorrectly set''
849
850 Before your parser calls a reduction procedure, it stores the token
851 number of the token to which the production would normally reduce in
852 \index{reduction{\us}token}\index{PCB}\agcode{PCB.reduction{\us}token}. If
853 the production is a semantically determined production, you may, in
854 your reduction procedure, change the value of
855 \agcode{PCB.reduction{\us}token} to one of the alternative tokens on
856 the left side of the production. When your reduction procedure
857 returns, your parser checks to verify that
858 \agcode{PCB.reduction{\us}token} is a valid token number for the
859 current state of the parser. If it is not, it sets
860 \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to
861 \index{AG{\us}REDUCTION{\us}ERROR{\us}CODE}\agcode{AG{\us}REDUCTION{\us}ERROR{\us}CODE}
862 and invokes
863 \index{REDUCTION{\us}TOKEN{\us}ERROR}\index{Macros}\agcode{REDUCTION{\us}TOKEN{\us}ERROR}.
864 The default definition of this macro is:
865
866 \begin{indentingcode}{0.4in}
867 \#define REDUCTION{\us}TOKEN{\us}ERROR \bra fprintf(stderr,{\bs}
868 "{\bs}nReduction{\us}token error{\bs}n"); \ket
869 \end{indentingcode}
870
871 \subsection{Syntax Errors}
872 \index{Syntax error}\index{Errors}
873
874 If the input data to your parser does not conform to the rules you
875 have specified in your grammar, your parser will detect a syntax
876 error. There are two basic aspects of dealing with syntax errors:
877 \index{Error diagnosis}\agterm{diagnosing} the error and
878 \agterm{recovering} from the error, that is, restarting the parse, or
879 ``resynchronizing'' the parser.
880
881 If you use the default settings for syntax error handling, then on
882 encountering a syntax error your parser will call a diagnostic
883 procedure which will create an error message and store a pointer to it
884 in
885 \index{Error messages}\index{error{\us}message}\index{PCB}
886 \agcode{PCB.error{\us}message}.
887 Then, it will set
888 \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to
889 \index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE} and
890 call a macro called
891 \index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR}. The
892 default definition of \agcode{SYNTAX{\us}ERROR} will print the error
893 message on \agcode{stderr}. Finally, in lieu of trying to continue
894 the parse, it will return to the calling program.
895
896 AnaGram has several options which allow you to tailor diagnostic
897 messages to your requirements or help you to create your own. It also
898 provides several options for continuing the parse.
899
900 The options available to help you diagnose errors are:
901
902 \begin{itemize}
903 \item line and column tracking
904 \item creation of a diagnostic message
905 \item identification of the error frame
906 \end{itemize}
907
908 \index{Numbers}\index{Lines and columns}\index{Configuration switches}
909 \paragraph{Line and Column Tracking.}
910 Your parser will automatically track lines and columns in its input if
911 the \agparam{lines and columns} configuration switch is on. Since
912 this is a common requirement, \agparam{lines and columns} defaults to
913 on. If you don't want your parser to spend time counting lines and
914 columns you should turn the switch off, thus:
915
916 \begin{indentingcode}{0.4in}
917 \agcode{
918 \~{}lines and columns
919 }
920 \end{indentingcode}
921
922 Normally, if you are using a \index{Lexical scanner}lexical scanner,
923 you would turn lines and columns off.
924 % XXX: this should say *why*.
925
926 The line and column counts are maintained in
927 \index{line}\index{PCB}\agcode{PCB.line} and
928 \index{column}\index{PCB}\agcode{PCB.column} respectively.
929 \agcode{PCB.line} and \agcode{PCB.column} are initialized with the
930 values of the \index{FIRST{\us}LINE}\index{Macros}\agcode{FIRST{\us}LINE}
931 and \index{Macros}\index{FIRST{\us}COLUMN}\agcode{FIRST{\us}COLUMN} macros
932 respectively. These macros provide default initial values of 1 for
933 both line and column numbers. To override these definitions, simply
934 include definitions for these macros in your syntax file. If tab
935 characters are encountered, they are expanded in accordance with the
936 \index{Tab spacing}\agparam{tab spacing} parameter.
937
938 When your parser is executing a reduction procedure, \agcode{PCB.line} and
939 \agcode{PCB.column} refer to the first input character following the
940 rule that is being reduced. When your parser has encountered a syntax
941 error, and is executing your \agcode{SYNTAX{\us}ERROR} macro,
942 \agcode{PCB.line} and \agcode{PCB.column} refer to the erroneous input
943 character.
944
945 \paragraph{Diagnostic Messages.}
946 If the \index{Diagnose errors}\index{Configuration switches}
947 \agparam{diagnose errors} switch is on, its default setting, AnaGram
948 will include an error diagnostic procedure in your parser. When your
949 parser encounters a syntax error, this procedure will create a simple
950 diagnostic message and store a pointer to it in
951 \index{error{\us}message}\index{PCB}\agcode{PCB.error{\us}message} before
952 your \agcode{SYNTAX{\us}ERROR} macro is executed. The default definition
953 of \agcode{SYNTAX{\us}ERROR} prints this message on \agcode{stderr}.
954
955 If your parser was in a state where there was a single input character
956 expected or a simple named token expected, it will create a message of
957 the form:
958
959 \begin{indentingcode}{0.4in}
960 Missing ';'
961 \end{indentingcode}
962 or
963 \begin{indentingcode}{0.4in}
964 Missing semicolon
965 \end{indentingcode}
966
967 If there was more than one possible input your parser will check to
968 see if it can identify the erroneous input. If it can it will create
969 a message of the form:
970
971 \begin{indentingcode}{0.4in}
972 Unexpected ';'
973 \end{indentingcode}
974 or
975 \begin{indentingcode}{0.4in}
976 Unexpected semicolon
977 \end{indentingcode}
978
979 Otherwise, the diagnostic message will be simply:
980
981 \begin{indentingcode}{0.4in}
982 Unexpected input
983 \end{indentingcode}
984
985 If you do not need a diagnostic message, or choose to create your own,
986 you should turn \agparam{diagnose errors} off.
987
988 % XXX Somewhere there should be a discussion of what ``creating your
989 % own'' would entail.
990
991 \index{Error frame}
992 \paragraph{Error Frame.}
993 Often it is desirable to know the ``frame'' of an error, that is, what
994 the parser thought it was doing when it encountered the error. If,
995 for instance, you forget to terminate a comment in a C program, your C
996 compiler sees an unexpected end of file. When you look simply at the
997 alleged error, of course, you can't see any problem. In order to
998 understand the error, you need to know that the parser was trying to
999 find a complete comment. In this case, we can say that the comment is
1000 the ``frame'' of the error.
1001
1002 AnaGram provides an optional facility in its error diagnostic
1003 procedure, controlled by the
1004 \index{Error frame}\index{Configuration switches}\agparam{error frame}
1005 switch, for identifying the frame of a syntax error. The
1006 \agparam{diagnose errors} switch must also be on to enable the
1007 diagnostic procedure. If you enable \agparam{error frame} in your
1008 syntax file, AnaGram will include a procedure which will scan
1009 backwards on the state stack looking for the frame of the error. When
1010 it finds what appears to be the error frame, it will store the stack
1011 index in
1012 \index{error{\us}frame{\us}ssx}\index{PCB}\agcode{PCB.error{\us}frame{\us}ssx} and
1013 the token number of the nonterminal token the parser was looking for
1014 in
1015 \index{error{\us}frame{\us}token}\index{PCB}\agcode{PCB.error{\us}frame{\us}token}.
1016
1017 %
1018 % XXX. Why is the discussion of ``hidden'' inside the discussion of
1019 % ``error frame''? hidden applies to ordinary error diagnosis also.
1020 %
1021 % Furthermore, this discussion of error frame needs an example, or
1022 % nobody will ever figure out how to do it.
1023 %
1024
1025 If, in your grammar, there are nonterminal tokens that are not
1026 suitable for diagnostic use, usually because they name an intermediate
1027 stage in the parse that means nothing to your user, you can make sure
1028 that AnaGram ignores them in doing its analysis by declaring them as
1029 \index{Declaration}\index{Hidden declaration}\agparam{hidden}. To
1030 declare tokens as hidden, include a \agparam{hidden} declaration in a
1031 configuration section. (See Chapter 8.) For instance, consider:
1032
1033 \begin{indentingcode}{0.4in}
1034 comment
1035 -> comment head, "*/"
1036 comment head
1037 -> "/*"
1038 -> comment head, \~{}end of file
1039 {}[ hidden \bra comment head \ket ]
1040 \end{indentingcode}
1041
1042 We mark comment head as hidden, because we only wish to talk about
1043 complete comments with our users.
1044
1045 In order to use the error frame effectively in your diagnostics, you
1046 need to have an ASCII representation of the name of the token as well
1047 as its token number. If you turn the
1048 \index{Token names}\index{Configuration switches}\agparam{token names}
1049 configuration switch on in your syntax file, AnaGram will provide an
1050 array of ASCII strings, indexed by token number, which you may use in
1051 your diagnostics. The name of the array is created by appending
1052 \agcode{{\us}token{\us}names} to the name of your parser. If your parser is
1053 called \agcode{ana}, your token name array will have the name
1054 \agcode{ana{\us}token{\us}names}. As a convenience, AnaGram
1055 also defines a macro,
1056 \index{TOKEN{\us}NAMES}\index{Macros}\agcode{TOKEN{\us}NAMES}, which
1057 evaluates to the name of the token name array. Note that
1058 \agparam{token names}
1059 controls the generation of an array of ASCII strings and should not be
1060 confused with the \agcode{typedef enum} statement in the parser header
1061 file which provides you with a set of enumeration constants.
1062 % XXX maybe it means the *strings* should not be confused?
1063
1064 If you are tracking context, using the techniques described below, you
1065 can use the macro
1066 \index{ERROR{\us}CONTEXT}\index{Macros}\agcode{ERROR{\us}CONTEXT} or
1067 \index{PERROR{\us}CONTEXT}\index{Macros}\agcode{PERROR{\us}CONTEXT} to
1068 determine the context of the error frame token.
1069
1070 \index{SYNTAX{\us}ERROR}\index{Macros}
1071 \paragraph{SYNTAX{\us}ERROR Macro.}
1072 When your parser finds a syntax error, it first executes any of the
1073 diagnostic procedures described above that you have enabled, sets
1074 \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to
1075 \index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE},
1076 and then invokes the \agcode{SYNTAX{\us}ERROR} macro. If you have not
1077 defined \agcode{SYNTAX{\us}ERROR} it will be defined thus if you have set
1078 \index{Lines and columns}\index{Configuration switches}
1079 \agparam{lines and columns}:
1080
1081 \begin{indentingcode}{0.4in}
1082 \#define SYNTAX{\us}ERROR {\bs}
1083 fprintf(stderr,"\%s,line \%d,column \%d{\bs}n", {\bs}
1084 PCB.error{\us}message, PCB.line, PCB.column)
1085 \end{indentingcode}
1086
1087 and thus if you have not:
1088
1089 \begin{indentingcode}{0.4in}
1090 \#define SYNTAX{\us}ERROR {\bs}
1091 fprintf(stderr, "\%s{\bs}n", PCB.error{\us}message)
1092 \end{indentingcode}
1093
1094 In most circumstances, you will probably want to write your own
1095 \agcode{SYNTAX{\us}ERROR} macro, since this diagnostic is one your users
1096 will see with some frequency.
1097 % XXX yes and why exactly? is there something we have in mind better
1098 % than just printing PCB.error_message?
1099
1100 The default macro simply returns to the parser. Your macro doesn't
1101 have to. If you wish, you could call \agcode{abort} or \agcode{exit}
1102 directly from the macro. If the \agcode{SYNTAX{\us}ERROR} macro returns
1103 control to the parser, subsequent events depend on your choices for
1104 error recovery.
1105
1106 \section{Error Recovery}
1107 \index{Error recovery}\index{Syntax error}\index{Errors}
1108
1109 Syntax errors can be caused by any of a number of problems. Some come
1110 from simple typographic errors: the user skips a character or types
1111 the wrong one. Others come from true errors: he types something that
1112 might be correct in its place, but in context is totally wrong.
1113 Usually, if your parser is reading a file, you will want to continue
1114 parsing the input, checking for other syntax errors at the very least.
1115 The problem with doing this is getting the parser restarted, or
1116 ``resynchronized'', in some reasonable manner.
1117
1118 AnaGram provides a number of ways for your parser to recover from a
1119 syntax error. The least graceful, of course, is simply to call
1120 \agcode{abort} or \agcode{exit} from the \agcode{SYNTAX{\us}ERROR} macro.
1121 If you don't do this you have several options:
1122
1123 \begin{itemize}
1124 \item error token resynchronization
1125 \item auto resynchronization
1126 \item simple return to calling program
1127 \item ignore the error
1128 \end{itemize}
1129
1130 \subsection{Error Token Resynchronization}
1131 \index{Resynchronization}
1132
1133 When AnaGram builds your parser it checks to see if you have used a
1134 token called \agcode{error} in your grammar or if you have assigned a
1135 token name as the value of the configuration parameter
1136 \index{Error token}\index{token}\index{Configuration parameters}
1137 \agparam{error token}. If so, it includes a call to an error token
1138 resynchronization procedure immediately after the invocation of
1139 \index{SYNTAX{\us}ERROR}\agcode{SYNTAX{\us}ERROR}. The error token
1140 resynchronization procedure works in the following way: It scans the
1141 state stack backwards looking for the most recent state in which
1142 \agcode{error} or the token named by \agparam{error token} was valid
1143 input. It then truncates the stack to this level, and jumps to the
1144 state indicated by the error token. It then passes over any input it
1145 sees until it sees valid input for the state in which it finds itself.
1146 At this point, it returns to the parser which continues as though
1147 nothing had happened. Since this is substantially easier than it
1148 sounds, let's look at an example. Suppose we are writing a C
1149 compiler, and we wish to catch errors in ordinary statements. We add
1150 the following production to our grammar:
1151
1152 \begin{indentingcode}{0.4in}
1153 statement
1154 -> error, ';'
1155 \end{indentingcode}
1156
1157 Now, if the parser encounters a syntax error anytime while it is
1158 parsing any statement, it will pop back to the most recent state where
1159 it was looking for a statement, jump forward to the state indicated by
1160 the token \agcode{error} in the new production, and then skip input
1161 until it sees a semicolon. At this point it will continue a normal
1162 parse. The effect of continuing at this point is to recognize and
1163 reduce the above production, i.e., the parser will proceed as if it
1164 had found a complete, correct ``statement''. This production could
1165 even have a reduction procedure to do any clean-up that an error might
1166 require.
1167
1168 If you use error token resynchronization, you must identify an end of
1169 file token to guarantee that the resynchronization procedure can
1170 always terminate. To do this, either name your end of file token
1171 \agcode{eof} or use the
1172 \index{Eof token}\index{Configuration parameters}\index{Token}
1173 \agparam{eof token} configuration parameter to specify it.
1174
1175 For example, if your parser is reading conventional stream input, the
1176 end of file will be denoted by a $-1$ value. You can define the end
1177 of file token thus:
1178
1179 \begin{indentingcode}{0.4in}
1180 eof = -1
1181 \end{indentingcode}
1182
1183 % XXX as ``finally'' means something in Java, let's change this to
1184 % ``at last''
1185 On the other hand, if you have already defined a token named
1186 \agcode{finally}, you can add the following line to any configuration
1187 segment:
1188
1189 \begin{indentingcode}{0.4in}
1190 eof token = finally
1191 \end{indentingcode}
1192
1193 The end of file token, of course, must be a terminal token.
1194 % XXX this is not ``of course'' to a casual observer.
1195
1196 \subsection{Automatic Resynchronization}
1197 \index{Resynchronization}\index{Automatic resynchronization}
1198
1199 If you have not specified an \agcode{error} token in your syntax file,
1200 AnaGram checks to see if you have turned on the
1201 \index{Auto resynch}\index{Configuration switches}
1202 \agparam{auto resynch} configuration switch.
1203 If so, it includes a call to an automatic resynchronization procedure
1204 immediately after the call to \agcode{SYNTAX{\us}ERROR}. The automatic
1205 resynchronization procedure uses a heuristic based on your grammar to
1206 get back in step with the input. To use it you need do only two
1207 things: You need to turn on the \index{Auto resynch}\agparam{auto
1208 resynch} switch, and you need to specify an end of file token as for
1209 error token resynchronization, above.
1210
1211 The primary advantage of the automatic resynchronization is that it is
1212 easy to use. The disadvantage is that it turns off all reduction
1213 procedures, so that your parser is reduced to being a syntax checker
1214 after it encounters an error. If your grammar uses semantically
1215 determined productions, your reduction procedures will not be invoked
1216 so the primary reduction token will be used in all cases.
1217
1218 % XXX *why* does it do this?
1219
1220 \subsection{Other Ways to Continue}
1221
1222 % XXX the example of ``reading input from a keyboard'' should be
1223 % clarified to indicate that this means something like an application
1224 % where you press F10 for the menu, not typing at a command line.
1225 %
1226 If you do not wish to use either of the above resynchronization
1227 procedures, you still have a number of options. If your parser is
1228 reading input from a keyboard, for instance, it is probably sufficient
1229 to simply ignore bad input characters. You can do this by simply
1230 resetting \index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} to
1231 zero in your
1232 \index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR} macro.
1233 % XXX XXX should say \agcode{AG_RUNNING_CODE}, not zero!!
1234 Your parser will then continue, passing over the bad input as though
1235 it had never occurred. If you do this, you should, of course, notify
1236 your user somehow that you're skipping a character. Issuing a beep on
1237 the computer's speaker from the \agcode{SYNTAX{\us}ERROR} macro is
1238 usually enough.
1239
1240 If you do not wish to continue the parse, but want your main program
1241 to continue, you need do nothing special. \agcode{PCB.exit{\us}flag} is
1242 % XXX XXX should say \agcode{AG_SYNTAX_ERROR_CODE}, not 2!!
1243 set to 2 before the \agcode{SYNTAX{\us}ERROR} macro is called. If your
1244 macro does not change \agcode{PCB.exit{\us}flag}, when it relinquishes
1245 control to your parser, your parser will simply return to the calling
1246 program. The calling program can determine that the parse was
1247 unsuccessful by inspecting \agcode{PCB.exit{\us}flag} and take whatever
1248 action you deem appropriate.
1249
1250
1251 \section{Advanced Techniques}
1252
1253 \subsection{Semantically Determined Productions}
1254 \index{Semantically determined production}\index{Production}
1255
1256 A semantically determined production is one which has more than one
1257 token on the left side. The reduction procedure then determines which
1258 token has in fact been identified, using whatever criteria are
1259 necessary. In some cases where the purpose is simply to provide
1260 multiple syntactic options to be chosen at execution time, the
1261 determination is made simply by interrogating a switch. Other
1262 situations may require a more complex determination, such as a symbol
1263 table look-up, for instance.
1264
1265 \index{Production}
1266 The tokens on the left side of the production can be used just like
1267 any other tokens in your grammar. Their semantic values, however,
1268 must all be of the same \index{Data type}\index{Token}data type.
1269
1270 Depending on how you have defined your grammar, it may be that
1271 whenever any one of the tokens on the left side is syntactically
1272 acceptable input, all the tokens on the left are syntactically
1273 acceptable. That is, the production could reduce to any of the tokens
1274 on the left without causing an immediate error condition. In many
1275 circumstances, however, this is not the case. In a Pascal grammar,
1276 for example, a semantically determined production might be used to
1277 allow a reduction procedure to determine whether a particular
1278 identifier is a constant identifier, a type identifier, a variable
1279 identifier, or so on. In any particular context, only a subset of the
1280 tokens on the left may be syntactically acceptable.
1281
1282 Before your reduction procedure is called, your parser will set the
1283 reduction token to the first token on the left side which is
1284 syntactically correct. If you need to change this assignment you have
1285 several options. From within your reduction procedure, you may simply
1286 set
1287 \index{reduction{\us}token}\index{PCB}\index{Token}\agcode{PCB.reduction{\us}token}
1288 to the semantically correct value. For this purpose, it is convenient
1289 to use the token name enumeration constants provided in the header
1290 file for your parser. Note that if you select a reduction token that
1291 is not syntactically correct, after your reduction procedure returns,
1292 your parser will encounter a \index{Reduction token
1293 error}\agterm{reduction token error}, described above.
1294
1295 AnaGram provides several tools to help you set the reduction token
1296 correctly. First, it provides a \agterm{change reduction} function
1297 which will set the reduction token to a specified token only if the
1298 specified token is syntactically correct. It will return a flag to
1299 indicate the outcome: non-zero on success, zero on failure. The name
1300 of this function is given by appending \agcode{{\us}change{\us}reduction} to
1301 the name of your parser. Thus, if your parser is named \agcode{ana},
1302 the name of the function would be \agcode{ana{\us}change{\us}reduction}. In
1303 those cases where the semantically correct reduction token is not
1304 syntactically correct, you will want to provide error diagnostics for
1305 your user. If you wish the parse to continue, so you can check
1306 errors, you may simply return from the reduction procedure. Since the
1307 default reduction is syntactically correct, the parse can continue as
1308 though there had been no error.
1309
1310 To simplify use of the change reduction function, AnaGram provides a macro,
1311 \index{CHANGE{\us}REDUCTION}\index{Macros}\agcode{CHANGE{\us}REDUCTION}.
1312 Simply call the macro with the name of the desired token as the
1313 argument, replacing embedded blanks in the token name with
1314 underscores.
1315
1316 For example, in writing a grammar for the C language, it is quite
1317 convenient to write the following production:
1318
1319 \begin{indentingcode}{0.4in}
1320 identifier, typedef name
1321 -> name = check{\us}typedef();
1322 \end{indentingcode}
1323
1324 The reduction procedure can then check the symbol table to see if
1325 whether the name that has been found is a typedef name. If so, it can
1326 use the \agcode{CHANGE{\us}REDUCTION} macro to change the reduction token
1327 to \agcode{typedef name} and verify that this is acceptable:
1328
1329 \begin{indentingcode}{0.4in}
1330 if (!CHANGE{\us}REDUCTION(typedef{\us}name)) diagnose{\us}error();
1331 \end{indentingcode}
1332
1333 Note that the embedded space in the token name must be replaced with
1334 an underscore character.
1335
1336 Under some circumstances, in your reduction procedure, you might wish
1337 to know precisely which reduction tokens are syntactically correct.
1338 For instance, you might wish, in an error diagnostic, to tell your
1339 user what you expected to see. If you set the
1340 \index{Reduction choices}\index{Configuration switches}
1341 \agparam{reduction choices} switch,
1342 AnaGram will include in your parser file a function which will
1343 identify the acceptable choices for the reduction token in the current
1344 state. The prototype of this function is:
1345
1346 \begin{indentingcode}{0.4in}
1347 int \${\us}reduction{\us}choices(int *);
1348 \end{indentingcode}
1349
1350 where ``\agcode{\$}'' represents the name of your parser. You must provide an
1351 integer array whose length is at least as long as the maximum number
1352 of reduction choices you might have. The function will fill the array
1353 with the token numbers of those which are acceptable in the current
1354 state and return a count of the number of acceptable choices it found.
1355 You can call this function from any reduction procedure. AnaGram also
1356 provides a macro to invoke this procedure:
1357 \index{REDUCTION{\us}CHOICES}\index{Macros}\agcode{REDUCTION{\us}CHOICES}.
1358 For example, to provide a diagnostic which details the acceptable
1359 token, you might combine the use of the \agparam{reduction choices}
1360 switch with the
1361 \index{Token names}\index{Configuration switches}\agparam{token names}
1362 switch described above:
1363
1364 \begin{indentingcode}{0.4in}
1365 int ok{\us}tokens[20], n{\us}ok{\us}tokens, i;
1366 n{\us}ok{\us}tokens = REDUCTION{\us}CHOICES(ok{\us}tokens);
1367 printf("Acceptable input comprises: {\bs}n");
1368 for (i = 0; i $<$ n{\us}ok{\us}tokens; i++) \bra
1369 printf(" \%s{\bs}n", TOKEN{\us}NAMES[i]);
1370 \ket
1371 \end{indentingcode}
1372
1373 A semantically determined production can even be a null production.
1374 You can use a semantically determined null production to interrogate
1375 the settings of parameters and control parsing accordingly:
1376
1377 \begin{indentingcode}{0.4in}
1378 condition false, condition true
1379 -> = \bra if (condition) CHANGE{\us}REDUCTION(condition{\us}true); \ket
1380 \end{indentingcode}
1381
1382 There are numerous examples of the use of semantically determined
1383 productions in the examples provided in the
1384 \index{examples}\agfile{examples} directory of your AnaGram
1385 distribution disk.
1386 % XXX too much anaphora
1387 % XXX s/disk//
1388
1389 \subsection{Defining Parser Control Blocks}
1390 \index{Parser control block}
1391
1392 All references to the parser control block in your parser are made
1393 using the macro \index{PCB}\agcode{PCB}. The only intrinsic
1394 requirement on PCB is that it evaluate to an \agterm{lvalue} (see
1395 Kernighan and Ritchie) that identifies a parser control block. The
1396 actual access may be direct, indirect through a pointer, subscripted,
1397 or even more complex, although if the access is too complex, the
1398 performance of your parser could suffer. Simple indirect or
1399 subscripted references are usually enough to enable you to build a
1400 system with multiple parallel parsing processes. If you wish to
1401 define \agcode{PCB} in some way other than a simple, direct access to
1402 a compiled-in control block, you will have to declare the control
1403 block yourself.
1404
1405 When AnaGram builds a parser, it checks the status of the
1406 \index{Declare pcb}\index{Configuration switches}\agparam{declare pcb}
1407 configuration switch. If it is on, the default setting, AnaGram
1408 declares a parser control block for you. AnaGram creates the name of
1409 the parser control block variable by appending \agcode{{\us}pcb} to the
1410 name of your parser. Thus if the name of your parser is
1411 \agcode{ana}, the parser control block is \agcode{ana{\us}pcb}.
1412
1413 In the header file AnaGram generates, a typedef statement defines the
1414 structure of the parser control block. The typedef name is given by
1415 appending \agcode{{\us}pcb{\us}type} to the name of your parser. Thus if
1416 the name of your parser is \agcode{ana}, the type of the parser
1417 control block is given by \agcode{ana{\us}pcb{\us}type}. Thus, when AnaGram
1418 defines the parser control block for \agcode{ana}, it does so by
1419 including the following two lines of code:
1420
1421 \begin{indentingcode}{0.4in}
1422 ana{\us}pcb{\us}type ana{\us}pcb;
1423 \#define PCB ana{\us}pcb
1424 \end{indentingcode}
1425
1426 If you wish to declare the parser control block yourself, you should
1427 turn off the \agparam{declare pcb} switch. To turn \agparam{declare
1428 pcb} off, include the following line in a configuration segment in
1429 your syntax file:
1430
1431 \begin{indentingcode}{0.4in}
1432 \~{}declare pcb
1433 \end{indentingcode}
1434
1435 Suppose your program needs to serve up to sixteen ``clients'', each
1436 with its own input stream. You might turn \agparam{declare pcb} off
1437 and declare the parser control block in the following manner:
1438
1439 \begin{indentingcode}{0.4in}
1440 ana{\us}pcb{\us}type ana{\us}pcb[16]; /* declare control blocks */
1441 int client;
1442 \#define PCB ana{\us}pcb[client] /* tell parser about it */
1443 \end{indentingcode}
1444
1445 Perhaps you need to parse a number of input streams, but you don't
1446 know exactly how many until run time. You might make the following
1447 declarations:
1448
1449 \begin{indentingcode}{0.4in}
1450 ana{\us}pcb{\us}type *ana{\us}pcb; /* pointer to control block */
1451 \#define PCB (*ana{\us}pcb) /* tell parser about it */
1452 \end{indentingcode}
1453
1454 Note that when you declare \agcode{PCB} as a pointer, you should put
1455 parentheses around the declaration so that your compiler codes the
1456 indirection properly.
1457
1458 There are many situations where it is convenient for a parser to be
1459 reentrant. A parser used for evaluating formulas in a spreadsheet
1460 program, for instance, needs to be able to call itself recursively if
1461 it is to use natural order recalculation. A parser used to implement
1462 macro substitutions may need to be recursive to deal with embedded
1463 macros.
1464
1465 Here is an example of an interface function which is designed for
1466 recursive calls to a parser, using the definitions above:
1467
1468 % XXX can I please at least remove the nonstandard <alloc.h>?
1469 % And fix the misuse of assert, and check malloc for failure?
1470 % And use AG_SUCCESS_CODE instead of 1?
1471 \begin{indentingcode}{0.4in}
1472 \#include <assert.h>
1473 \#include <alloc.h>
1474
1475 \#define PCB (*ana{\us}pcb)
1476 ana{\us}pcb{\us}type *ana{\us}pcb;
1477
1478 void do{\us}ana(void) \bra
1479 ana{\us}pcb{\us}type *save{\us}ana = ana{\us}pcb;
1480 ana{\us}pcb = malloc(sizeof(ana{\us}pcb{\us}type));
1481 ana();
1482 assert(ana{\us}pcb.exit{\us}flag == 1);
1483 free(ana{\us}pcb);
1484 ana{\us}pcb = save{\us}ana;
1485 \ket
1486 \end{indentingcode}
1487
1488 Here is another way to accomplish the same end, this time using stack
1489 storage rather than heap storage:
1490
1491 % XXX ditto
1492 \begin{indentingcode}{0.4in}
1493 \#include <assert.h>
1494 \#include <alloc.h>
1495
1496 \#define PCB (*ana{\us}pcb)
1497 ana{\us}pcb{\us}type *ana{\us}pcb;
1498
1499 void do{\us}ana(void) \bra
1500 ana{\us}pcb{\us}type *save{\us}ana = ana{\us}pcb;
1501 ana{\us}pcb{\us}type local{\us}pcb;
1502 ana{\us}pcb = \&local{\us}pcb;
1503 ana();\\
1504 assert(ana{\us}pcb.exit{\us}flag == 1);
1505 ana{\us}pcb = save{\us}ana;
1506 \ket
1507 \end{indentingcode}
1508
1509 % XXX and here we should discuss \agparam{reentrant parser}, too.
1510
1511
1512 \subsection{Multi-stage Parsing}
1513 \index{Parsing}\index{Multi-stage parsing}
1514
1515 Multi-stage parsing consists of chaining together a number of parsers
1516 in series so that each parser provides input to the following one.
1517 Users of \agfile{lex} and \agfile{yacc} are accustomed to using
1518 two-level parsing, since the ``\index{Lexical scanner}lexical
1519 scanner'', or ``lexer'' they write in \agfile{lex} is really a very
1520 simple parser whose output becomes the input to the parser written in
1521 \agfile{yacc}. AnaGram has been developed so that you may use as many
1522 levels as are appropriate to your problem, and so that, if you wish,
1523 you may write all of the parsers in AnaGram.
1524
1525 Many problems that do not lend themselves conveniently to solution
1526 with a simple grammar can be neatly solved by using multi-stage
1527 parsing. In many cases this is because multi-stage parsing can be
1528 used to parse constructs that are not context-free. A first level
1529 parser can use semantic information to decide which tokens to pass on
1530 to the next level. Thus, a first level parser for a C compiler can
1531 use semantic information to distinguish typedef names from variable
1532 names.
1533
1534 % XXX I believe this is referring to QPL. Nowadays there's Python...
1535 As another example, a proprietary programming language used indents to
1536 control its block structure. A first level parser looked only at
1537 lines and indents, passing the text through to the second level
1538 parser. When it encountered changes in indentation level, it inserted
1539 block start and block end tokens as necessary.
1540
1541 Using AnaGram it is extremely easy to set up multi-stage parses.
1542 Simply configure the second level parser as an event-driven parser.
1543 The first level parser can then hand over tokens or characters to it
1544 as it develops them.
1545
1546 The C macro preprocessor example, found in the
1547 \index{examples}\agfile{examples} directory of your AnaGram
1548 distribution disk, illustrates the use of multi-stage parsing.
1549
1550 \subsection{Context Tracking}
1551 \index{Context tracking}
1552
1553 When you are writing a reduction procedure for a particular grammar
1554 rule, you often need to know the value one or another of your program
1555 variables had at the time the first token in the rule was encountered.
1556 Examples of such variables are:
1557
1558 \begin{itemize}
1559 \item Line or column number
1560 \item Index in an input file
1561 \item Index into an array
1562 \item Counters, as of symbols defined, etc.
1563 \end{itemize}
1564
1565 Such variables can be thought of as representing the ``context'' of
1566 the rule you are reducing. Sometimes it is possible to incorporate
1567 the values of such variables into the values of reduction tokens, but
1568 this can become quite cumbersome. AnaGram provides an optional
1569 feature known as ``context tracking'' to deal with this problem.
1570 Here's how it works:
1571
1572 First, you identify the variables which you want to track. Second,
1573 you write a typedef statement in the \index{C prologue}C prologue of
1574 your parser which defines a data structure with fields to accommodate
1575 values for all of these variables. Third, you tell AnaGram what the
1576 name of the type of your data structure is, using the
1577 \index{Context type}\index{Configuration parameters}\agparam{context type}
1578 configuration parameter. This causes AnaGram to add a field called
1579 \index{PCB}\index{input{\us}context}\agcode{input{\us}context} and a stack,
1580 the \index{Context stack}\index{Stack}\agterm{context stack}, called
1581 \index{PCB}\index{cs}\agcode{cs}, both of the type you have specified,
1582 to your parser control block. Fourth, you write code to gather the
1583 context information for each input character.
1584
1585 There are several ways to provide the initial context information.
1586 You may write a
1587 \index{GET{\us}CONTEXT}\index{Macros}\agcode{GET{\us}CONTEXT} macro which
1588 sets the context stack variables directly. Using the
1589 \index{CONTEXT}\index{Macros}\agcode{CONTEXT} macro defined below, and
1590 assuming your context type has line, column and pointer fields, you
1591 could define \agcode{GET{\us}CONTEXT} as follows:
1592
1593 \begin{indentingcode}{0.4in}
1594 \#define GET{\us}CONTEXT CONTEXT.pointer = PCB.pointer,{\bs}
1595 CONTEXT.line = PCB.line,{\bs}
1596 CONTEXT.column = PCB.column
1597 \end{indentingcode}
1598
1599 If you are using \agparam{pointer input}, you must write a
1600 \agcode{GET{\us}CONTEXT} macro to save context information. If you use a
1601 \index{GET{\us}INPUT}\index{Macros}\agcode{GET{\us}INPUT} macro or have an
1602 event-driven parser, you may either store values directly into
1603 \index{input{\us}context}\index{PCB}\agcode{PCB.input{\us}context} when you
1604 develop the input token, or you may write a \agcode{GET{\us}CONTEXT}
1605 macro. The macro will provide a slight increment in performance.
1606 % XXX say why it's faster (I assume because it won't look up context
1607 % for inputs that don't need it?)
1608
1609 AnaGram provides six macros to enable you to read values in a
1610 convenient manner from the context stack,
1611 \index{cs}\index{PCB}\agcode{PCB.cs}. Three of these macros are
1612 designed to be used from your parser itself, and three are available
1613 to use from other modules. These three macros are designed for use in
1614 your parser:
1615
1616 \begin{itemize}
1617 \item \agcode{CONTEXT}
1618 \item \agcode{RULE{\us}CONTEXT}
1619 \item \agcode{ERROR{\us}CONTEXT}
1620 \end{itemize}
1621
1622 These macros are defined at the beginning of your parser file, so they
1623 may be used anywhere within your parser.
1624
1625 \index{CONTEXT}\index{Macros}\agcode{CONTEXT}
1626 can be used to read or write the current top of the context stack as
1627 indexed by \index{PCB}\agcode{PCB.ssx}. When your parser is executing
1628 a reduction procedure for a particular grammar rule, \agcode{CONTEXT}
1629 will evaluate to the value of the input context as it was just before
1630 the very first token in the rule. The definition of \agcode{CONTEXT}
1631 is:
1632
1633 \begin{indentingcode}{0.4in}
1634 \#define CONTEXT (PCB.cs[PCB.ssx])
1635 \end{indentingcode}
1636
1637 \index{RULE{\us}CONTEXT}\index{Macros}\agcode{RULE{\us}CONTEXT} can be used
1638 within a reduction procedure to get the context for any element within
1639 the rule being reduced. For example, \agcode{RULE{\us}CONTEXT[0]} is the
1640 context of the first element in the rule, \agcode{RULE{\us}CONTEXT[1]} is
1641 the context of the second element in the rule, and so on.
1642 \agcode{RULE{\us}CONTEXT[0]} is exactly the same as \agcode{CONTEXT}.
1643
1644 % XXX There should be a way to address the context of tokens in a
1645 % rule by the symbolic names we've bound to them.
1646
1647 The definition of \agcode{RULE{\us}CONTEXT} is:
1648
1649 \begin{indentingcode}{0.4in}
1650 \#define RULE{\us}CONTEXT (\&(PCB.cs[PCB.ssx]))
1651 \end{indentingcode}
1652
1653 As an example, let us suppose that we are writing a parser to read a
1654 parameter file for a program. Let us imagine the following statements
1655 make up a part of our syntax file:
1656
1657 \begin{indentingcode}{0.4in}
1658 \bra
1659 typedef struct \bra int line, column \ket location;
1660 \#define GET{\us}INPUT {\bs}
1661 PCB.input{\us}code = fgetc(input{\us}file); {\bs}
1662 PCB.input{\us}context.line = PCB.line; {\bs}
1663 PCB.input{\us}context.column = PCB.column;
1664 \ket
1665 {}[ context type = location ]\\
1666
1667 parameter assignment
1668 -> parameter name, '=', number
1669 \end{indentingcode}
1670
1671 Let us suppose that for each parameter we have stored a range of
1672 admissible values. We have to diagnose an attempt to use an incorrect
1673 value. We could write our diagnostic message as follows:
1674
1675 \begin{indentingcode}{0.4in}
1676 fprintf(stderr, "Bad value at line \%d, column \%d in "
1677 "parameter assignment at line \%d, column \%d",
1678 RULE{\us}CONTEXT[2].line,
1679 RULE{\us}CONTEXT[2].column,
1680 CONTEXT.line,
1681 CONTEXT.column);
1682 \end{indentingcode}
1683
1684 This diagnostic message would give our user the exact location both of
1685 the bad value and of the beginning of the statement that contained the
1686 bad value.
1687
1688 \index{ERROR{\us}CONTEXT}\index{Macros}\agcode{ERROR{\us}CONTEXT} can be
1689 used within a
1690 \index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR} macro to
1691 find the context of an error if you have turned on the
1692 \index{Error frame}\index{Configuration switches}\agparam{error frame}
1693 and
1694 \index{Diagnose errors}\index{Configuration switches}
1695 \agparam{diagnose errors}
1696 switches. AnaGram itself tracks context using a structure consisting
1697 of line and column numbers. In case of errors such as encountering an
1698 end of file in a comment, it uses the \agcode{ERROR{\us}CONTEXT} macro
1699 to determine the line and column number at which the comment began.
1700 % XXX that sounds like something AG does with your grammar, not
1701 % what AG does reading its own input, which is what it is. rephrase...
1702 The definition of \agcode{ERROR{\us}CONTEXT} is:
1703
1704 \begin{indentingcode}{0.4in}
1705 \#define ERROR{\us}CONTEXT (PCB.cs[PCB.error{\us}frame{\us}ssx])
1706 \end{indentingcode}
1707
1708 Three similar macros are also available for more general use:
1709
1710 \begin{itemize}
1711 \item \index{PCONTEXT}\index{Macros}\agcode{PCONTEXT(pcb)}
1712 \item \index{PRULE{\us}CONTEXT}\index{Macros}\agcode{PRULE{\us}CONTEXT(pcb)}
1713 \item \index{PERROR{\us}CONTEXT}\index{Macros}\agcode{PERROR{\us}CONTEXT(pcb)}
1714 \end{itemize}
1715
1716 % XXX repeating ``modules other than'' is bad
1717 These macros are identical in function to the corresponding macros in
1718 the first class. The only difference is that they take the name of a
1719 parser control block, \agcode{pcb}, as an argument so they can be used
1720 in modules other than the parser module. AnaGram includes the
1721 definitions for these macros in the parser header file so that they
1722 can be used in modules other than the parser itself. Since these
1723 macros are not specific to any one parser, the definitions are
1724 conditional so that they will only be defined once in a given module,
1725 even if you include header files corresponding to several parsers.
1726 The definitions of these macros are as follows:
1727
1728 \begin{indentingcode}{0.4in}
1729 \#define PCONTEXT(pcb) (pcb.cs[pcb.ssx])
1730 \#define PRULE{\us}CONTEXT(pcb) (\&(pcb.cs[pcb.ssx]))
1731 \#define PERROR{\us}CONTEXT(pcb) (pcb.cs[pcb.error{\us}frame{\us}ssx])
1732 \end{indentingcode}
1733
1734 Note that since the context macros only make sense when called from a
1735 reduction procedure or an error procedure, there are not many
1736 occasions to use these macros. The most common situation would be
1737 when you have compiled the bulk of the code for your reduction
1738 procedures in a separate module.
1739
1740 Remember that \agcode{PRULE{\us}CONTEXT}, because it identifies an array
1741 rather than a value, requires a subscript. For an example, let us
1742 rewrite the diagnostic message given above for
1743 \agcode{RULE{\us}CONTEXT} using \agcode{PRULE{\us}CONTEXT}, assuming
1744 that the name of our parser control block is \agcode{ana{\us}pcb}:
1745
1746 \begin{indentingcode}{0.4in}
1747 fprintf(stderr, "Bad value at line \%d, column \%d in "
1748 "resource statement at line \%d, column \%d",
1749 PRULE{\us}CONTEXT(ana{\us}pcb)[2].line,
1750 PRULE{\us}CONTEXT(ana{\us}pcb)[2].column,
1751 PCONTEXT.line,
1752 PCONTEXT.column);
1753 \end{indentingcode}
1754
1755 \subsection{Coverage Analysis}
1756 \index{Coverage analysis}
1757
1758 AnaGram has simple facilities for helping you determine the adequacy
1759 of your test suites. The
1760 \index{Rule coverage}\index{Configuration switches}
1761 \agparam{rule coverage} configuration switch
1762 controls these facilities. When you set \agparam{rule coverage},
1763 AnaGram includes code in your parser to count the number of times the
1764 parser identifies each rule in your grammar. AnaGram also provides
1765 procedures you can use to write these counts to a file and accumulate
1766 them over multiple executions of your parser. Finally, it provides a
1767 window where you may inspect the counts to see the extent to which
1768 your tests have covered the options in your grammar.
1769
1770 To maintain the counts, AnaGram declares, at the beginning of your
1771 parser, an integer array, whose name is created by appending
1772 \agcode{{\us}nrc} to the name of your parser. The array contains one
1773 counter for each rule you have defined in your grammar. There are no
1774 entries for the auxiliary rules that AnaGram creates to deal with set
1775 overlaps or disregard statements. In order to identify positively all
1776 the rules that the parser reduces, AnaGram turns off certain
1777 optimization features in your parser. Therefore, a parser that has
1778 the \agparam{rule coverage} switch enabled will run slightly slower
1779 than one with the switch off.
1780
1781 AnaGram also provides procedures to write the counts to a file and to
1782 initialize the counts from a file. The procedures are named by
1783 appending \agcode{{\us}write{\us}counts} and \agcode{{\us}read{\us}counts}
1784 respectively to the name of your parser. Thus, if your parser is
1785 called \agcode{ana}, the procedures are called
1786 \agcode{ana{\us}write{\us}counts} and \agcode{ana{\us}read{\us}counts}.
1787 Neither takes any arguments nor returns a value. To accumulate counts
1788 correctly, you should include calls to the
1789 \index{read{\us}counts}\agcode{read{\us}counts} and
1790 \index{write{\us}counts}\agcode{write{\us}counts} procedures in your
1791 program. A convenient way to do this is to include statements such as
1792 the following in your main program:
1793
1794 % XXX perhaps this means ``atexit''
1795 \begin{indentingcode}{0.4in}
1796 ana{\us}read{\us}counts(); /* before calling parser */
1797 at{\us}exit(ana{\us}write{\us}counts);
1798 \end{indentingcode}
1799
1800 For your convenience, AnaGram defines two macros,
1801 \index{READ{\us}COUNTS}\index{Macros}\agcode{READ{\us}COUNTS} and
1802 \index{WRITE{\us}COUNTS}\index{Macros}\agcode{WRITE{\us}COUNTS}, in your
1803 parser. They call the \agcode{read{\us}counts} and
1804 \agcode{write{\us}counts} procedures respectively when \agparam{rule
1805 coverage} is set. Otherwise they are null. Thus you may code them
1806 into your main program and it will work whether or not the
1807 \agparam{rule coverage} switch is set. For example,
1808
1809 \begin{indentingcode}{0.4in}
1810 READ{\us}COUNTS; /* read counts if coverage enabled */
1811 my{\us}parser(); /* call parser */
1812 WRITE{\us}COUNTS; /* write updated counts */
1813 \end{indentingcode}
1814
1815 The \agcode{write{\us}counts} procedure writes an identifier code and the
1816 counts to a count file. The name of the count file is given by the
1817 \index{Coverage file name}\index{Configuration parameters}
1818 \agparam{coverage file name} parameter, which defaults to the same name as your
1819 syntax file but with the extension
1820 \index{File extension}\index{nrc}\agfile{.nrc}. The identifier code
1821 changes each time you modify your syntax file. The
1822 \agcode{read{\us}counts} procedure attempts to read the count file. If
1823 it cannot find it, or the identifier code is out of date, it simply
1824 initializes the counter array to zeroes. Otherwise, it initializes
1825 the counter arrays to the values found in the file.
1826
1827 When you run AnaGram and analyze your syntax file, if
1828 \agparam{rule coverage} is set, AnaGram will enable the \agmenu{Rule
1829 Coverage} option on the \agmenu{Browse} menu. If you select
1830 \agmenu{Rule Coverage}, AnaGram will prepare a \agwindow{Rule
1831 Coverage} window from the rule count file you select. AnaGram will
1832 warn you if the file you selected is older than the syntax file, since
1833 under those conditions, the coverage file might be invalid.
1834
1835 The \index{Rule Coverage}\index{Window}\agwindow{Rule Coverage} window
1836 shows the count for each rule, the rule number and the text of the
1837 rule. It is also synched to the syntax file so that you can see the
1838 rule in context. AnaGram also modifies the display of the
1839 \index{Reduction Procedures}\index{Window}\agwindow{Reduction
1840 Procedures} window so that each procedure descriptor is preceded by
1841 the number of times it has been called. You can use this display to
1842 verify that all your reduction procedures have been tried.
1843
1844 % XXX having this paragraph here seems confusing
1845 The \index{Trace Coverage}\index{Window}\agwindow{Trace Coverage}
1846 window, created when you use the \agwindow{File Trace} or
1847 \agwindow{Grammar Trace} option, provides information similar to that
1848 provided by \agwindow{Rule Coverage}. The differences are these:
1849 Optimizations are not turned off for the \agwindow{Trace Coverage}, so
1850 that some rules of length zero or one will not be properly counted.
1851 Also, the \agwindow{Trace Coverage} does not tell you about the
1852 reduction procedures you have tested.
1853
1854 \agwindow{File Trace} can become quite tedious to use if you have very
1855 many semantically determined productions, so in these cases the
1856 \agparam{rule coverage} approach can give you the information you need
1857 more quickly.
1858
1859 \subsection{Using Precedence Operators}
1860
1861 The conventional syntax for arithmetic expressions used in most
1862 programming languages can be parsed simply by reference to
1863 \index{Operator precedence}\index{Precedence operators}
1864 \agterm{operator precedence}. Operator precedence refers to
1865 the rules we use to determine the order in which arithmetic operations
1866 should be carried out. In normal usage, this means that
1867 multiplication and division take precedence over addition and
1868 subtraction, which in turn take precedence over comparison operations.
1869 One can formalize this usage by assigning a numeric \index{Precedence
1870 level}\agterm{precedence level} to each operator, so that the
1871 operations are carried out starting with those of highest precedence
1872 and continuing in order of declining precedence. When operators have
1873 the same precedence level, such as addition and subtraction operators,
1874 one can decide the order of operation to be left to right or right to
1875 left. Operators of equal precedence which are to be evaluated left to
1876 right are called \agterm{left associative}. Those which should be
1877 evaluated right to left are called \agterm{right associative}. If the
1878 nature of the operators is such that the question should never arise,
1879 they are called \agterm{non-associative}.
1880
1881 AnaGram provides three declarations,
1882 \index{Precedence declarations}\index{Left}\index{Right}\index{Nonassoc}
1883 \agparam{left}, \agparam{right}, and \agparam{nonassoc}, which you can
1884 use to associate precedence levels and associativity with tokens in
1885 your grammar. The syntax of these statements is given in Chapter 8.
1886
1887 When AnaGram encounters a shift-reduce \index{Conflicts}conflict in
1888 your grammar, it looks to see if the conflict can be resolved by using
1889 precedence and associativity rules. If so, it applies the rules to
1890 the conflict and records the resolution in the \index{Resolved
1891 Conflicts}\index{Window}\agwindow{Resolved Conflicts} table.
1892
1893 There are two occasions where you should consider using precedence
1894 declarations in your grammar: Where rewriting the grammar to get rid
1895 of a conflict would obscure and complicate the grammar, and where you
1896 wish to try to get a more compact, slightly faster parser by using
1897 precedence rules for parsing arithmetic expressions.
1898
1899 Here is an example of using precedence declarations to parse simple
1900 arithmetic expressions:
1901
1902 \begin{indentingcode}{0.4in}
1903 unary minus = '-'
1904 {}[
1905 left \bra '+', '-' \ket
1906 left \bra '*', '/' \ket
1907 right \bra unary minus \ket
1908 ]
1909
1910 exp
1911 -> number
1912 -> unary minus, exp
1913 -> exp, '+', exp
1914 -> exp, '-', exp
1915 -> exp, '*', exp
1916 -> exp, '/', exp
1917 \end{indentingcode}
1918
1919 A complete working calculator grammar using this syntax,
1920 \agfile{ffcalcx}, can be found in the
1921 \index{examples}\agfile{examples/ffcalc} directory of your
1922 AnaGram distribution disk.
1923 % XXX s/disk//
1924
1925 \subsection{Parser Performance}
1926
1927 The parsers AnaGram generates have been engineered to provide maximum
1928 performance subject to constraints of reliability and robustness.
1929 There are a number of steps you may take, however, to make optimize
1930 the performance of your parser.
1931
1932 \paragraph{Standard Stack Frame.} If your compiler has a switch that
1933 allows you to turn \emph{off} the standard stack frame when you
1934 compile your parser, do so. Your parser uses a large number of very
1935 small functions which run fastest when your compiler does not use the
1936 standard stack frame.
1937
1938 \paragraph{Error Diagnostic Features.} If your parser does not need
1939 to diagnose errors, turn off the
1940 \index{Diagnose errors}\index{Configuration switches}
1941 \agparam{diagnose errors} switch.
1942 Turn off the
1943 \index{Lines and columns}\index{Configuration switches}
1944 \agparam{lines and columns} switch if you don't need this information.
1945 If your parser doesn't need a diagnostic, and halts on syntax error,
1946 turn off the
1947 \index{Backtrack}\index{Configuration switches}\agparam{backtrack} switch.
1948
1949 \paragraph{Anti-optimization Switches.} Certain switches de-optimize
1950 your parser for various reasons. These switches,
1951 \index{Traditional engine}\index{Configuration switches}
1952 \agparam{traditional engine} and
1953 \index{Rule coverage}\index{Configuration switches}
1954 \agparam{rule coverage},
1955 should be turned off once you no longer need their effects.
1956
1957 \paragraph{Other Switches.} For maximum performance you should use
1958 \index{Pointer input}\index{Configuration switches}\agparam{pointer
1959 input}. If you can guarantee that your input will not have
1960 out-of-range input, you can turn off
1961 \index{Test range}\index{Configuration switches}\index{Range}
1962 \agparam{test range}.
1963 % XXX s/out-of-range input/out-of-range characters or tokens/
1964