Mercurial > ~dholland > hg > ag > index.cgi
comparison doc/manual/dd.tex @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
-1:000000000000 | 0:13d2b8934445 |
---|---|
1 \chapter{Programming With AnaGram} | |
2 | |
3 Although AnaGram has many options and features which enable you to | |
4 build a parser that meets your needs precisely, it has well-defined | |
5 defaults so that you do not generally need to learn about an option | |
6 until you need the facility it provides. The purpose of this chapter | |
7 is to show you how to use the options and features effectively. | |
8 | |
9 The options and features of AnaGram can be divided roughly into three | |
10 groups: those that control the general aspects of your parser, those | |
11 that control input to the parser and those that control error | |
12 handling. After dealing with these three groups of options and | |
13 features, this chapter concludes with a discussion of various advanced | |
14 techniques. | |
15 | |
16 Many aspects of your parser are controlled by setting configuration | |
17 parameters, either in a configuration file or in your syntax file. | |
18 This chapter presumes you are familiar with setting configuration | |
19 parameters. The names of configuration parameters, as they occur in | |
20 the text, are printed in \agparam{bold face type}. Appendix A | |
21 describes the use of configuration parameters and provides a detailed | |
22 discussion of each configuration parameter. | |
23 | |
24 | |
25 \section{General Aspects} | |
26 | |
27 \subsection{Program Development} | |
28 | |
29 The first step in writing a program is to write a grammar in AnaGram | |
30 notation which describes the input the program expects. The file | |
31 containing the grammar, called the syntax file, conventionally has the | |
32 extension \agfile{.syn}. You could also make up a few sample input | |
33 files at this time, but it is not necessary to write reduction | |
34 procedures at this stage. | |
35 | |
36 Run AnaGram and use the \index{Analyze Grammar}Analyze Grammar command | |
37 to create parse tables. If there are syntax errors in the grammar at | |
38 this point, you will have to correct them before proceeding, but you | |
39 do not necessarily have to eliminate conflicts, if there are any, at | |
40 this time. There are, however, many aids available to help you with | |
41 conflicts. These aids are described in Chapters 5 through 7, and | |
42 somewhat more briefly in the Online Help topics. | |
43 | |
44 Once syntax errors are corrected, you can try out your grammar on the | |
45 sample input files using the File Trace facility. With File Trace, | |
46 you can see interactively just how your grammar operates on your test | |
47 files. You can also use Grammar Trace to answer ``what if'' questions | |
48 concerning input to the grammar. The Grammar Trace does not use a | |
49 test file, but rather allows you to make input choices interactively. | |
50 | |
51 At any time, you can write reduction procedures to process your input | |
52 data as its components are identified in the input stream. Each | |
53 procedure is associated with a grammar rule. The reduction procedures | |
54 will be incorporated into your parser when you create it with the | |
55 \index{Build Parser}Build Parser command. | |
56 | |
57 By default, unless you specify an input procedure, parser input will | |
58 be read from \agcode{stdin}, using the default \agcode{GET{\us}INPUT} | |
59 macro. You will probably wish to redefine \agcode{GET{\us}INPUT}, or | |
60 configure your parser to use \agparam{pointer input} or \agparam{event | |
61 driven} input. | |
62 | |
63 \subsection{The Default Parser} | |
64 \index{Parser} | |
65 | |
66 If you apply the Build Parser command to a syntax file which contains | |
67 only a grammar, with no reduction procedures and no embedded C code, | |
68 AnaGram will still produce a complete C command line program which you | |
69 can compile and run. \index{Input procedures}This parser will parse | |
70 character input from \agcode{stdin}. If the input does not satisfy | |
71 the rules of your grammar, the parser will issue a syntax error | |
72 diagnostic to \agcode{stderr} identifying the exact line and column | |
73 numbers of the error. If the parser should overflow its stack, it | |
74 will abort with an error message to \agcode{stderr}. If the parse is | |
75 successful, that is if the parser succeeds in identifying the grammar | |
76 token without encountering an error, it will simply return to the | |
77 command line. | |
78 | |
79 You can extend such a simple parser, often quite effectively, by | |
80 adding only reduction procedures. If the reduction procedures write | |
81 output to \agcode{stdout}, you can produce a conventional ``filter'' | |
82 program without having to pay any attention to input handling, error | |
83 handling, or any of the other options AnaGram provides. | |
84 %CALC, in the EXAMPLES directory, is an example of such a program. | |
85 | |
86 \subsection{The Content of the Parser and Header Files} | |
87 | |
88 % XXX s/from your parser file/from your syntax file/ | |
89 AnaGram creates two \index{Output files}\index{File}output files: a | |
90 parser file and a header file. \index{Parser file}\index{File}The | |
91 parser file contains the C code you need to compile and link before | |
92 you can run your parser. It begins with the \index{C | |
93 prologue}\index{Prologue}C prologue, if any, from your parser file. | |
94 The C prologue is an optional block of \index{Embedded C}embedded C or | |
95 C++ which precedes everything else in your syntax file. Although it | |
96 can contain anything you wish, normally it is used to place | |
97 identification information, \index{Copyright notice}copyright notices, | |
98 etc., at the beginning of your parser file. If your parser uses token | |
99 types that require definition, the appropriate \agcode{\#include} | |
100 statements and definitions should be placed in the C prologue. See | |
101 ``Defining Token Types'', below. | |
102 | |
103 Following the C prologue, AnaGram places a number of definitions of | |
104 variables and macros that you might need to refer to in your embedded | |
105 C, and in your reduction procedures. Not the least of these | |
106 definitions is the parser control block, described below. Following | |
107 these definitions, AnaGram inserts all your embedded C, in the order | |
108 in which it occurred in your syntax file. Following the embedded C | |
109 come all your reduction procedures. Finally, AnaGram adds the tables | |
110 which summarize your grammar and a parsing engine customized to your | |
111 requirements. | |
112 | |
113 The \index{Header file}\index{File}header file contains definitions | |
114 needed by your parser. These include definitions of the \index{Parser | |
115 value stack}\index{Value stack}\index{Stack}value stack type, the | |
116 input token type, the \index{Parser control block}parser control block | |
117 type, and token name enumeration constants. The definitions are | |
118 placed in a header file so that you can make them available to other | |
119 modules if necessary. | |
120 | |
121 \subsection{Naming Output Files} | |
122 \index{Output files}\index{File} | |
123 | |
124 Unless you specify otherwise, AnaGram names the parser and header | |
125 files following conventional programming practice. Both \index{File | |
126 name}\index{File name}files have the same name as your syntax file, | |
127 with extensions \agfile{.c} and \agfile{.h} respectively. These | |
128 names, however, are controlled by the configuration parameters | |
129 \index{Configuration parameters}\index{Name} | |
130 \index{Parser file name}\agparam{parser file name} and | |
131 \index{Header file name}\agparam{header file name} | |
132 respectively, so you can override AnaGram's defaults if you wish. If | |
133 you normally use C++ rather than C, for example, you might want to | |
134 include the following statement in your configuration file: | |
135 | |
136 \begin{indentingcode}{0.4in} | |
137 parser file name = "\#.cpp" | |
138 \end{indentingcode} | |
139 | |
140 When AnaGram names the parser file it substitutes the name of your | |
141 syntax file for the ``\#'' character in the file name template. | |
142 | |
143 \subsection{Compiling Your Parser} | |
144 \index{Parser} | |
145 | |
146 Although AnaGram was designed primarily with ANSI C in mind, a good | |
147 deal of care has been taken to ensure that its output is consistent | |
148 with older C compilers and with newer C++ compilers. If your compiler | |
149 does not support ANSI function prototypes, you should set the | |
150 \index{Old style}\index{Configuration switches}\agparam{old style} | |
151 switch in your configuration file. If you are intending to compile | |
152 your parser using a 16-bit compiler, you might want to turn on the | |
153 \index{Near functions}\index{Configuration switches}\agparam{near functions} | |
154 switch in your configuration file. If you are building a parser for | |
155 use in an embedded system, you might want to make sure the | |
156 \index{Const data}\index{Configuration switch}\agparam{const data} | |
157 configuration switch is set so that all the tables AnaGram generates | |
158 will be declared \agcode{const}. | |
159 | |
160 \subsection{Naming Your Parser} | |
161 \index{Parser} | |
162 | |
163 In the default case, AnaGram creates a main program for you. | |
164 Generally, however, you will probably want a parser function which you | |
165 can call from your own main program. You won't want AnaGram to define | |
166 \agcode{main} for you. You can stop AnaGram from defining | |
167 \agcode{main} in any of several ways: Include some embedded C in your | |
168 syntax file, turn off | |
169 \index{Main program}the \index{Configuration switches}\agparam{main program} | |
170 configuration switch, or turn on either the \agparam{event driven} or | |
171 \agparam{pointer input} switches. Since you almost always will have | |
172 some embedded C in your syntax file, you will seldom have to use the | |
173 \agparam{main program} switch. | |
174 | |
175 Normally, AnaGram simply uses the name of your syntax file to create | |
176 the name of your parser. Thus if your syntax file is called | |
177 \agfile{ana.syn} your parser will have the name \agcode{ana}. AnaGram | |
178 does not check the parser name for compliance with the rules of C. If | |
179 you use strange characters in your file name, you will get strange | |
180 characters in the name of your parser, and you will get unpleasant | |
181 remarks from your C compiler when you try to compile your parser. | |
182 Thus, for example, if you were to name your parser file | |
183 \agfile{!@\#.syn}, AnaGram will call your parser \agcode{!@\#}. Your | |
184 compiler will doubtless choke. | |
185 | |
186 \index{Parser} | |
187 If you wish AnaGram to give your parser a name other than the file | |
188 name, you may set the | |
189 \index{Parser name}\index{Name}\index{Configuration parameters} | |
190 \agparam{parser name} | |
191 configuration parameter. Thus, to make sure your parser is called | |
192 \agcode{periwinkle} you would include the following line in a | |
193 configuration section in your syntax file: | |
194 | |
195 % Note: this is not actually required to be in double quotes. | |
196 % It'll also accept anything that's syntactically acceptable to it | |
197 % as a C data type, which also lets you give it things like | |
198 % ``periwinkle *'' that result in uncompilable code. | |
199 | |
200 \begin{indentingcode}{0.4in} | |
201 parser name = "periwinkle" | |
202 \end{indentingcode} | |
203 | |
204 Besides the parser itself, AnaGram generates a number of other | |
205 functions, variables and type definitions when it creates your parser. | |
206 All these entities are named using the parser name as the base. The | |
207 templates and their usages are as follows: | |
208 | |
209 \begin{indenting}{0.4in} | |
210 \begin{tabular}{ll} | |
211 | |
212 \index{Parser}\index{Initializer}\index{Name} | |
213 \agcode{init{\us}\$}&initializer for parser\\ | |
214 | |
215 \index{Grammar token}\index{Value} | |
216 \agcode{\${\us}value}&returns value of grammar token\\ | |
217 | |
218 \index{Parser value stack}\index{Value stack}\index{Stack} | |
219 \agcode{\${\us}vs{\us}type}&value stack type\\ | |
220 | |
221 \agcode{\${\us}it{\us}type}&input token union\\ | |
222 \agcode{\${\us}token{\us}type}&token name enumeration typedef\\ | |
223 \agcode{\${\us}\%{\us}token}&token name enumeration constants\\ | |
224 \agcode{\${\us}pcb{\us}type}&typedef of parser control block\\ | |
225 | |
226 \index{Parser control block} | |
227 \agcode{\${\us}pcb}&parser control block\\ | |
228 | |
229 \index{Rule Count} | |
230 \agcode{\${\us}nrc}&rule count table\\ | |
231 | |
232 \agcode{\${\us}nrpc}&reduction procedure count table\\ | |
233 \\ | |
234 \end{tabular} | |
235 \end{indenting} | |
236 | |
237 When AnaGram defines these entities it substitutes the parser name for | |
238 the dollar sign. In the token name enumeration constants it | |
239 substitutes the token name for the \index{{\us}prc}``\%'' character. | |
240 Embedded space characters are replaced with underscore characters. | |
241 | |
242 \subsection{The Parser Control Block} | |
243 \index{Parser control block} | |
244 | |
245 The complete status of a parse is kept in a structure called a | |
246 \agterm{parser control block}. As a default, AnaGram defines a parser | |
247 control block for you, and provides a macro, \index{PCB}\agcode{PCB}, | |
248 which enables you to access it simply. The name AnaGram assigns to | |
249 the parser control block is | |
250 % XXX | |
251 %\agcode{\${\us}pcb}, where as above ``\$'' is replaced with the name of | |
252 %your parser. | |
253 \agcode{\textit{$<$parser name$>$}{\us}pcb}. | |
254 If you need to refer to the parser control block from some module | |
255 other than the parser module, use an \agcode{\#include} statement to | |
256 include the header file for your parser and refer to the parser | |
257 control block by its name as above. The structure of the parser | |
258 control block is described in Appendix E. In this chapter, particular | |
259 fields will be discussed as necessary. | |
260 | |
261 Since the parser control block contains the complete status of a | |
262 parse, you may interrupt a parse and continue it later by saving and | |
263 restoring the control block. If you have multiple input streams, all | |
264 controlled by the same grammar, you may have a separate control block | |
265 for each stream. If you wish to call your parser recursively, you may | |
266 define a fresh control block for each level of recursion. To make | |
267 best use of these capabilities, you will need to declare the parser | |
268 control block yourself. This is discussed below under ``Advanced | |
269 Techniques''. | |
270 | |
271 \subsection{Calling Your Parser} | |
272 % XXX should have an example of actually calling the thing. | |
273 % XXX should also have ``terminating your parser'' or something like | |
274 % that. | |
275 | |
276 The parser function AnaGram defines is a simple function which takes | |
277 no arguments and returns no values. All communication with the parser | |
278 takes place via the parser control block. When your parser returns, | |
279 \index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} contains an exit | |
280 code describing the outcome of the parse. Symbols for the | |
281 exit codes are defined in the header file AnaGram generates. | |
282 \index{Exit codes}\index{Error codes}These symbols, their values, | |
283 and their meanings are: | |
284 | |
285 \index{AG{\us}RUNNING{\us}CODE} | |
286 \index{AG{\us}SUCCESS{\us}CODE} | |
287 \index{AG{\us}SYNTAX{\us}ERROR{\us}CODE} | |
288 \index{AG{\us}REDUCTION{\us}ERROR{\us}CODE} | |
289 \index{AG{\us}STACK{\us}ERROR{\us}CODE} | |
290 \index{AG{\us}SEMANTIC{\us}ERROR{\us}CODE} | |
291 \begin{indenting}{0.4in} | |
292 \begin{tabular}{lll} | |
293 \agcode{AG{\us}RUNNING{\us}CODE}&0&Parse is not yet complete\\ | |
294 \agcode{AG{\us}SUCCESS{\us}CODE}&1&Parse terminated successfully\\ | |
295 \agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE}&2&Syntax error was encountered\\ | |
296 \agcode{AG{\us}REDUCTION{\us}ERROR{\us}CODE}&3&Bad reduction token encountered\\ | |
297 \agcode{AG{\us}STACK{\us}ERROR{\us}CODE}&4&Parser stack overflowed\\ | |
298 \agcode{AG{\us}SEMANTIC{\us}ERROR{\us}CODE}&5&Semantic error\\ | |
299 \\ | |
300 \end{tabular} | |
301 \end{indenting} | |
302 | |
303 Only an event driven parser will return the value | |
304 \agcode{AG{\us}RUNNING{\us}CODE}, since any other parser continues executing | |
305 until it terminates successfully or encounters an unrecoverable error. | |
306 | |
307 Syntax errors, reduction token errors, and stack errors are discussed | |
308 below under ``Error Handling''. | |
309 | |
310 % XXX: this bit belongs somewhere else | |
311 \agcode{AG{\us}SEMANTIC{\us}ERROR{\us}CODE} is a special case. It is available | |
312 for you to use in your reduction procedures to terminate a parse for | |
313 semantic reasons. | |
314 % XXX add: AnaGram will never set it itself. | |
315 If, in a reduction procedure, you determine that parsing should not | |
316 continue, you need only include the statement: | |
317 | |
318 \begin{indentingcode}{0.4in} | |
319 PCB.exit{\us}flag = AG{\us}SEMANTIC{\us}ERROR{\us}CODE; | |
320 \end{indentingcode} | |
321 | |
322 When your reduction procedure returns, the parse will then terminate | |
323 and the parser will return control to the calling program. | |
324 | |
325 \subsection{Parser Return Value} | |
326 \index{Value} | |
327 | |
328 If, in your grammar, there is a value assigned to the grammar token, | |
329 you may retrieve it, after the parse is complete, by calling the | |
330 parser value function, the name of which is given by | |
331 \agcode{\${\us}value} where ``\$'' is the name of your parser. | |
332 \agcode{\${\us}value} takes no arguments, and returns a value of the type | |
333 assigned to the grammar token in your syntax file. | |
334 | |
335 Although in theoretical discussions of parsing the result of the parse | |
336 is contained in the value of the grammar token, in practice, more | |
337 often than not, results are communicated to other procedures by | |
338 setting the values of global variables. Thus the value of the grammar | |
339 token is often of little interest. | |
340 | |
341 Since the parser per se takes no arguments, it is usually convenient | |
342 to write a small interface function with a calling sequence | |
343 appropriate to the problem. The interface function can then take care | |
344 of appropriate initializations, call the parser, and retrieve results. | |
345 | |
346 \subsection{Defining Token Types} | |
347 | |
348 When you add reduction procedures to your grammar, you will often find | |
349 it convenient to add type declarations for the \index{Semantic | |
350 value}\index{Token}\index{Value}semantic values of some of the tokens | |
351 in your grammar. As long as the types you use are conventional C data | |
352 types\index{Data type}\index{Token}, you don't have to do anything | |
353 special. If, however, you have used types or classes that you have | |
354 defined yourself, you need to make sure that the appropriate | |
355 definition statements precede their use in the code AnaGram generates. | |
356 To do this, you need to have a C prologue in your syntax file. In the | |
357 C prologue, you should place the definition statements your parser | |
358 will need, or at least an \agcode{\#include} statement that will cause | |
359 the types or classes to be defined. | |
360 | |
361 \subsection{Debugging Your Parser} | |
362 | |
363 Because the ``flow of control'' of your parser is algorithmically | |
364 derived from your grammar, debugging your parser separates into two | |
365 separate exercises: debugging your grammar, discussed in Chapter 7, | |
366 and debugging your reduction procedures. | |
367 | |
368 When debugging, it is usually a good idea to turn off the | |
369 \index{Macros}\index{Allow macros}\index{Configuration switches} | |
370 \agparam{allow macros} | |
371 switch. This switch is normally on and causes simple reduction | |
372 procedures to be implemented as macros. When you turn it off, you get | |
373 a proper function definition for each reduction procedure, so you can | |
374 put a breakpoint in any reduction procedure you choose. If the | |
375 \index{Line numbers}\index{Configuration switches} | |
376 \agparam{line numbers} switch | |
377 is on each reduction procedure will contain a | |
378 \index{\#line}\agcode{\#line} directive to show where the reduction | |
379 procedure is found in your syntax file. Once you have acquired | |
380 confidence in your reduction procedures you may turn the | |
381 \agparam{allow macros} switch back on for slightly improved | |
382 performance. | |
383 | |
384 If your debugger allows you to inspect entire structures, you will | |
385 find it convenient to look at the parser control block while you are | |
386 debugging. The contents of the parser control block are described in | |
387 Appendix E. | |
388 | |
389 A good way to begin debugging a new parser is to simply put a | |
390 breakpoint in each reduction procedure. Start your parser and step | |
391 through the reduction procedures one by one, verifying that they | |
392 perform as expected. After you have stepped through a reduction | |
393 procedure, turn off its breakpoint. If there are multiple paths, | |
394 leave breakpoints on the paths not taken. Liberal use of the assert | |
395 macro helps assure that your fixes don't break procedures you have | |
396 already tested. | |
397 | |
398 \section{Providing Input to Your Parser} | |
399 \index{Parser}\index{Input}\index{Input procedures} | |
400 | |
401 This section describes three methods for providing input to your | |
402 parser. In the first method your program calls the parser which then | |
403 requests input tokens as it needs them. It returns only when it has | |
404 completed the parse. The parser requests input tokens by invoking a | |
405 macro called \agcode{GET{\us}INPUT}, described below. | |
406 | |
407 The second method for providing input can be used when the entire | |
408 sequence of input tokens is available in memory. This method is | |
409 controlled by the \index{Pointer input}\index{Configuration | |
410 switches}\agparam{pointer input} configuration switch. It is | |
411 discussed below. | |
412 | |
413 The third method for providing input is especially convenient when | |
414 using \index{Lexical scanner}lexical scanners or multi-stage parsing. | |
415 It is controlled by the \index{Event driven}\index{Configuration | |
416 switches}\agparam{event driven} configuration switch. | |
417 | |
418 \subsection{The \agcode{GET{\us}INPUT} Macro} | |
419 \index{GET{\us}INPUT}\index{Macros} | |
420 | |
421 The default parser simply reads characters from \agcode{stdin}. It | |
422 does this by invoking a macro called \agcode{GET{\us}INPUT} every time it | |
423 needs an input character. The default definition of | |
424 \agcode{GET{\us}INPUT} is: | |
425 | |
426 \index{PCB}\index{input{\us}code} | |
427 \begin{indentingcode}{0.4in} | |
428 \#define GET{\us}INPUT (PCB.input{\us}code = getchar()) | |
429 \end{indentingcode} | |
430 | |
431 \agcode{PCB.input{\us}code} is an integer field in the parser control | |
432 block which is used to hold the current input \index{Character | |
433 codes}character code. | |
434 | |
435 By including your own definition of \agcode{GET{\us}INPUT} in your | |
436 embedded C, you override the default definition provided by AnaGram. | |
437 The only requirement for \agcode{GET{\us}INPUT} is that it store a | |
438 character in \agcode{PCB.input{\us}code}. Suppose you wish to make a | |
439 parser that reads characters from a file provided by the calling | |
440 program. You could include the following in your embedded C: | |
441 | |
442 \begin{indentingcode}{0.4in} | |
443 extern FILE *file; | |
444 \#define GET{\us}INPUT (PCB.input{\us}code = fgetc(file)) | |
445 \end{indentingcode} | |
446 | |
447 Now your parser, when invoked, will read characters from the specified | |
448 file instead of reading them from \agcode{stdin}. Of course, | |
449 \agcode{GET{\us}INPUT} is not constrained to reading a file or data | |
450 stream. You may implement \agcode{GET{\us}INPUT} in any manner you | |
451 choose. You may implement it as a function call, or you may choose to | |
452 define \agcode{GET{\us}INPUT} so that it expands into inline code for | |
453 faster execution. | |
454 | |
455 \subsection{Pointer Input} | |
456 \index{Pointer input}\index{Input procedures} | |
457 | |
458 It often happens that the data you wish to parse are already in memory | |
459 when you are ready to call the parser. While you could rewrite | |
460 \agcode{GET{\us}INPUT} to simply scan the array by incrementing a | |
461 pointer, AnaGram provides an alternative approach since this is such a | |
462 common situation. In a configuration section in your syntax file | |
463 simply turn on the \index{Pointer input}\index{Configuration | |
464 switches}\agparam{pointer input} switch. Then before you call your | |
465 parser, load \index{pointer}\index{PCB}\agcode{PCB.pointer}, the | |
466 pointer field in the parser control block, with a pointer to your | |
467 array. Assuming your parser is called \agcode{ana}, and you wish to | |
468 call an interface function with an argument consisting of a character | |
469 string, here's what you do: | |
470 | |
471 \begin{indentingcode}{0.4in} | |
472 {}[ | |
473 pointer input | |
474 ] | |
475 | |
476 \bra | |
477 void ana{\us}shell(char *source{\us}text) \bra | |
478 PCB.pointer = (unsigned char *)source{\us}text; | |
479 ana(); | |
480 \ket | |
481 \ket | |
482 \end{indentingcode} | |
483 | |
484 % XXX s/the// | |
485 The type of the \agcode{PCB.pointer} defaults to | |
486 \agcode{unsigned char *} to | |
487 minimize difficulty with full 256-character sets. If your compiler is | |
488 fussy, you should use a cast, as above, when you set the value. If | |
489 your data requires more than 256 | |
490 \index{Character codes}character codes, you may still use pointer | |
491 input by using the \index{Pointer type}\index{Configuration | |
492 parameters}\agparam{pointer type} configuration parameter to change | |
493 the definition of the field in the parser control block. Normally, | |
494 the value of \agparam{pointer type} should be a C data type that | |
495 converts to integer. If \agparam{pointer type} does not convert to | |
496 integer, you must provide an | |
497 \index{INPUT{\us}CODE}\index{Macros}\agcode{INPUT{\us}CODE} macro, as | |
498 described below, to extract a token identifier. Do not change | |
499 \agparam{pointer type} to \agcode{signed char} in order to avoid the | |
500 cast in the above example. That will have the effect of making all | |
501 character codes above 127 inaccessible to your parser. | |
502 | |
503 Note that if you use pointer input your parser does not need a | |
504 \agcode{GET{\us}INPUT} macro. Parsers that use pointer input usually | |
505 run somewhat faster than those that use \agcode{GET{\us}INPUT}, | |
506 particularly if they use keywords. | |
507 % XXX that is unclear - I know it means that the keyword logic is | |
508 % particularly improved by using pointer input, but it could be read | |
509 % to imply that adding keywords makes the parser even faster, which is | |
510 % backwards. | |
511 | |
512 \subsection{Event Driven Parsers} | |
513 \index{Event driven parser}\index{Parser} | |
514 | |
515 There are many situations where the input to a parser is developed by | |
516 an independent process and the linkage required to implement a | |
517 \agcode{GET{\us}INPUT} macro is unduly cumbersome. In these | |
518 circumstances, it is convenient to use an \agparam{event driven} | |
519 parser. With an event driven parser, you do not simply call the | |
520 parser and wait for it to finish. Instead, you call its | |
521 \index{Initializer}initializer first, and then call it each time you | |
522 have a character for it. The parser processes the character and | |
523 returns as soon as it needs more input, encounters an error or finds | |
524 the parse complete. You can interrogate | |
525 \index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} to determine | |
526 whether the parser can accept more input. | |
527 | |
528 To create an event driven parser, set the \index{Event | |
529 driven}\index{Configuration switches}\agparam{event driven} switch in | |
530 your syntax file. Then, to initialize the parser, call the | |
531 initialization procedure, or \index{Initializer}initializer, provided | |
532 by AnaGram. The name of this procedure is \agcode{init{\us}\$} where | |
533 ``\agcode{\$}'' represents the name of your parser. If your parser is named | |
534 \agcode{ana}, the | |
535 \index{Parser}initialization procedure is named \agcode{init{\us}ana}. | |
536 To process a single character, store the character in | |
537 \index{input{\us}code}\index{PCB}\agcode{PCB.input{\us}code}, then call | |
538 \agcode{ana}. When it returns, check | |
539 \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to see if the | |
540 parser is still running. When the parse is successful, you may | |
541 retrieve the value of the grammar token, if you wish, by using the | |
542 \index{Parser value function}parser value function, in this case, | |
543 \agcode{ana{\us}value}. | |
544 % XXX s/case,/case/ above. or s/function,/function;/ | |
545 | |
546 As an example, let us imagine we are to write a an interface function | |
547 for our parser which takes a list of string pointers, a count, and a | |
548 pointer to a location into which we may store an error flag. The | |
549 input to our parser is to be the concatenation of all the character | |
550 strings. We will set up a loop which will call the parser for all the | |
551 characters of the strings in turn. We will assume that the function | |
552 will return the value of the grammar token, which we will assume to be | |
553 also of type double: | |
554 | |
555 \begin{indentingcode}{0.4in} | |
556 {}[ | |
557 event driven | |
558 ] | |
559 | |
560 \bra | |
561 double parse{\us}strings(char **ptr, int n{\us}strings, int *error) \bra | |
562 init{\us}ana(); | |
563 while (PCB.exit{\us}flag == AG{\us}RUNNING{\us}CODE \&\& | |
564 n{\us}strings--) \bra | |
565 char *p = *ptr++; | |
566 while (PCB.exit{\us}flag == AG{\us}RUNNING{\us}CODE \&\& *p) \bra | |
567 PCB.input{\us}code == *p++; | |
568 ana(); | |
569 \ket | |
570 \ket | |
571 assert(error); | |
572 *error = PCB.exit{\us}flag != AG{\us}SUCCESS{\us}CODE; | |
573 return ana{\us}value(); | |
574 \ket | |
575 \ket | |
576 \end{indentingcode} | |
577 | |
578 The purpose of this example is simply to show how to use an event | |
579 driven parser. Of course it would be possible, as far as this example | |
580 is concerned, to concatenate the strings and use pointer input | |
581 instead. A problem sufficiently complex to \emph{require} an event | |
582 driven parser would be too complex to serve as a simple example. | |
583 | |
584 \subsection{Token Input} | |
585 \index{Token input}\index{Input procedures} | |
586 | |
587 Thus far in this chapter, we have assumed that the input to your | |
588 parser consisted of ordinary characters. There are many situations | |
589 where it is convenient to have a | |
590 \index{Preprocessor}\index{Token}\index{Token}preprocessor, or | |
591 \index{Lexical scanner}lexical scanner, which identifies basic tokens | |
592 and hands them over to your parser for further processing. Accepting | |
593 input from such preprocessors is discussed in the remainder of this | |
594 section. | |
595 | |
596 Sometimes preprocessors simply pass on text characters, acting as | |
597 filters to remove unwanted characters, such as white space or | |
598 comments, and to insert other text, such as macro expansions. In such | |
599 situations, there is no need to treat the preprocessor differently | |
600 from any other character source. The input methods described above | |
601 are sufficient to deal with the input provided by the preprocessor. | |
602 | |
603 In what follows, we deal with situations where the preprocessor passes | |
604 on \index{Token number}\index{Token}\index{Number}\agterm{token | |
605 numbers} rather than character codes. The preprocessor may also pass | |
606 on token \emph{values}, which also need accommodation of some sort. | |
607 % XXX also also? | |
608 | |
609 There are two principal interfacing problems to deal with. The first | |
610 has to do with identifying the tokens to your parser. The second has | |
611 to do with providing the semantic values of the tokens. | |
612 % | |
613 %If your preprocessor does not provide values with its tokens, your parser | |
614 %may use any of the input techniques described above for character input, | |
615 %the only difference being that instead of setting PCB.input{\us}code to a | |
616 %character value, you set it to the token identifier. | |
617 % | |
618 %If your preprocessor does provide token values, then you have to use either | |
619 %a GET{\us}INPUT macro, or configure your parser to be event driven. If you wish | |
620 %to use pointer input, you must provide an INPUT{\us}CODE macro. | |
621 % | |
622 | |
623 \subsection{Identifying Tokens using Predefined Token Numbers} | |
624 \index{Token}\index{Number}\index{Token number} | |
625 | |
626 If you have a pre-existing \index{Lexical scanner}lexical scanner, | |
627 written for use with some other parsing system, it probably outputs | |
628 its own set of token numbers. The most robust way of interfacing such | |
629 a lexical scanner is to include, in your syntax file, either an | |
630 \index{Enum statement}\agparam{enum} statement or a set of definition | |
631 statements | |
632 for the terminal tokens, equating | |
633 \index{Terminal token}\index{Token}terminal token names with the | |
634 numeric values output by the lexical scanner, so that AnaGram treats | |
635 them as character codes. In this situation, you simply set | |
636 \index{PCB}\index{input{\us}code}\agcode{PCB.input{\us}code} to the token | |
637 number determined by the lexical scanner. Generally, lexical scanners | |
638 written for other parsing systems expect to be called for each token. | |
639 Therefore, you would normally use a \agcode{GET{\us}INPUT} macro to call | |
640 the lexical scanner and provide input to your parser. | |
641 % XXX as far as I know, lex expects to call yacc, not vice versa. | |
642 | |
643 \subsection{Identifying Tokens using AnaGram's Token Numbers} | |
644 | |
645 If you are writing a new preprocessor, you have more freedom. You | |
646 could simply create a set of codes as above. On the other hand, you | |
647 can save a level of translation and make your system run faster by | |
648 providing your parser with internal token numbers directly. Here's | |
649 what you have to do. | |
650 | |
651 First, when you write your syntax file, leave all the terminal tokens | |
652 undefined. That means, of course, that you have to have a name for | |
653 each terminal token. You can't use a literal character or a number | |
654 for the token. AnaGram will generate a unique token number for each | |
655 token in your grammar. In the header file it generates, AnaGram | |
656 always provides a set of | |
657 \index{Enumeration constants}\index{Constants}enumeration constants | |
658 for all the named tokens in your grammar. The names for these | |
659 constants are controlled by the | |
660 \index{Configuration parameters}\index{Enum constant name} | |
661 \agparam{enum constant name} | |
662 parameter. (See Appendix A.) These constants normally have the form | |
663 \agcode{\textit{$<$parser name$>$}{\us}\textit{$<$token name$>$}{\us}token}. | |
664 Note that embedded space in the token name will be replaced with | |
665 underscore characters. Assume your parser is called \agcode{ana}, and | |
666 in your grammar you have a token called \agcode{integer constant}. | |
667 The enumeration constant identifying the token is then | |
668 \agcode{ana{\us}integer{\us}constant{\us}token}. Now, to hand off an integer | |
669 constant to your parser you write: | |
670 | |
671 \begin{indentingcode}{0.4in} | |
672 PCB.input{\us}code = ana{\us}integer{\us}constant{\us}token; | |
673 \end{indentingcode} | |
674 | |
675 \subsection{Providing Token Values} | |
676 | |
677 If your \index{Preprocessor}preprocessor provides \index{Semantic | |
678 value}\index{Token}\index{Value}semantic values for input tokens, you | |
679 must inform AnaGram by setting the | |
680 \index{Input values}\index{Configuration switches}\index{Value} | |
681 \agparam{input values} | |
682 configuration switch in your syntax file. Then, whenever you provide a | |
683 token, you must also store a value in | |
684 \index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}. | |
685 You can do this as part of your \agcode{GET{\us}INPUT} macro, or, if you | |
686 have an \agparam{event driven} parser, when you set | |
687 \index{input{\us}code}\index{PCB}\agcode{PCB.input{\us}code} prior to | |
688 calling the parser function. If you are using \index{Pointer | |
689 input}\index{Configuration switches}\agparam{pointer input}, the | |
690 pointer will presumably identify the token value. You must provide an | |
691 \index{INPUT{\us}CODE}\index{Macros}\agcode{INPUT{\us}CODE} macro to extract | |
692 the identification code from the token value. For example, if the | |
693 token value is a structure and the appropriate member field is called | |
694 \agcode{id}, you would write: | |
695 | |
696 \begin{indentingcode}{0.4in} | |
697 \#define INPUT{\us}CODE(t) (t).id | |
698 \end{indentingcode} | |
699 | |
700 Generally, the simplest way to interface the preprocessor and your | |
701 parser, when you are passing token values, is to use an event driven | |
702 parser. In this situation, the preprocessor, when it identifies a | |
703 token, simply loads the token identifier into | |
704 \agcode{PCB.input{\us}code}, loads the value into | |
705 \index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value}, and calls | |
706 the parser. | |
707 | |
708 \index{Token} | |
709 If the values of your input tokens are all of the same type, you must | |
710 set the | |
711 \index{Default input type}\index{Configuration parameters} | |
712 \index{Input type}\agparam{default input type} | |
713 configuration parameter so that AnaGram can declare | |
714 \index{input{\us}value}\index{PCB}\agcode{PCB.input{\us}value} | |
715 appropriately. \index{Token type}\agparam{Default input type} will | |
716 default to \agcode{int} if you do not set it either in your configuration file | |
717 or in your syntax file. | |
718 | |
719 Some \index{Lexical scanner}lexical scanners simply provide a pointer | |
720 to the text of the token they have identified. In this situation, you | |
721 would set \agparam{default input type} to \agcode{char *}. When you | |
722 provide a token to the parser you would set \agcode{PCB.input{\us}value} | |
723 to point to the text of the token. | |
724 | |
725 If different tokens have values of different types, the situation | |
726 becomes slightly more complex. First, you must tell AnaGram about the | |
727 types of your input tokens. You do this by including a | |
728 \index{Declaration}\index{Type declarations}\agterm{type declaration} | |
729 in your syntax file. A type declaration is a token declaration | |
730 preceded by a C data type\index{Data type}\index{Token} in | |
731 parentheses. Assume that your \index{Preprocessor}preprocessor | |
732 identifies, among others, the following tokens: \agcode{name}, | |
733 \agcode{string}, \agcode{real constant}, \agcode{integer constant}, | |
734 and \agcode{unsigned constant}. You might then include the following | |
735 in your syntax file: | |
736 | |
737 \begin{indentingcode}{0.4in} | |
738 {}[ | |
739 input values | |
740 ] | |
741 | |
742 (char *) name, string | |
743 (double) real constant | |
744 (long) integer constant, unsigned constant | |
745 \end{indentingcode} | |
746 | |
747 AnaGram will then create, in the parser control block, an input value | |
748 field which can accommodate any of these terminal tokens in your | |
749 grammar. | |
750 | |
751 To enable you to store data into the input value field of the parser | |
752 control block, AnaGram provides a convenient macro called | |
753 \index{INPUT{\us}VALUE}\index{Macros}\agcode{INPUT{\us}VALUE} to serve as | |
754 the destination of an assignment statement. \agcode{INPUT{\us}VALUE} | |
755 takes the type of the data as a parameter. Thus one could write: | |
756 | |
757 \begin{indentingcode}{0.4in} | |
758 INPUT{\us}VALUE(char *) = text{\us}pointer; | |
759 INPUT{\us}VALUE(long) = constant{\us}value; | |
760 \end{indentingcode} | |
761 | |
762 \section{Error Handling} | |
763 | |
764 There are two classes of errors your parser needs to be able to deal | |
765 with. The first consists of \agterm{implementation errors} and the second | |
766 consists of \agterm{syntax errors}. Syntax errors arise because the input to | |
767 the parser does not conform to the definition of the language it is | |
768 designed to parse. Implementation errors arise because the programs | |
769 we write are never perfect and because the environment in which our | |
770 programs run is often something less than ideal. | |
771 | |
772 \subsection{Implementation Errors} | |
773 \index{Implementation errors}\index{Errors} | |
774 | |
775 % XXX parser stack overflow is not really an ``implementation error'' | |
776 | |
777 There are two implementation errors which your parser needs to be able | |
778 to deal with. The first is \agterm{parser stack overflow}. The | |
779 second comes from a bad \agterm{reduction token}. | |
780 | |
781 \index{Stack} | |
782 \paragraph{Stack Overflow.} | |
783 Stack overflow is an error which your parser must be able to deal | |
784 with. In general, no matter how big you make your parser stack, it is | |
785 possible for legitimate input to cause it to overflow. The size of | |
786 the stack for your parser is controlled by the configuration parameter | |
787 \agparam{parser stack size}. This parameter defaults to a value of | |
788 32. This value has been found to be adequate for ordinary usage. | |
789 | |
790 If your parser has only left recursive constructs, then there is a | |
791 maximum depth beyond which the parser stack will never grow. If your | |
792 parser has center recursive or right recursive productions, then no | |
793 matter how much stack space you allocate, there will always be a | |
794 syntactically correct input file which causes the stack to overflow. | |
795 This can be illustrated by the following set of C statements: | |
796 | |
797 \begin{indentingcode}{0.4in} | |
798 x = y; | |
799 x = (y); | |
800 x = ((y)); | |
801 x = (((y))); | |
802 . | |
803 . | |
804 . | |
805 \end{indentingcode} | |
806 | |
807 Each set of parentheses requires another level on the parser stack. | |
808 When this set of statements was tried with Borland C++, it ran out of | |
809 stack space at 127 sets of parentheses and diagnosed the problem as | |
810 ``Expression is too complicated''. | |
811 | |
812 AnaGram calculates the actual size of the parser stack by calculating | |
813 the maximum depth for left recursive constructs and adding half the | |
814 value of | |
815 \index{Parser stack size}\index{Configuration parameters}\index{Stack} | |
816 \index{Parser state stack}\index{State stack} | |
817 \agparam{parser stack size}. It then uses the larger of the calculated | |
818 value and \agparam{parser stack size} to allocate stack storage. You | |
819 may check the value actually used in your parser by inspecting the | |
820 definition of | |
821 \index{AG{\us}PARSER{\us}STACK{\us}SIZE}\agcode{AG{\us}PARSER{\us}STACK{\us}SIZE}. | |
822 | |
823 If your parser runs out of stack space, it will set | |
824 \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to | |
825 \index{AG{\us}STACK{\us}ERROR{\us}CODE}\agcode{AG{\us}STACK{\us}ERROR{\us}CODE}, invoke | |
826 the | |
827 \index{Macros}\index{PARSER{\us}STACK{\us}OVERFLOW}\agcode{PARSER{\us}STACK{\us}OVERFLOW} | |
828 macro and return to the calling program. The default definition of | |
829 this macro is: | |
830 | |
831 \begin{indentingcode}{0.4in} | |
832 \#define PARSER{\us}STACK{\us}OVERFLOW \bra fprintf(stderr, {\bs} | |
833 "{\bs}nParser stack overflow{\bs}n"); \ket | |
834 \end{indentingcode} | |
835 | |
836 % XXX ``provide your own definition'', not ``redefine'' | |
837 | |
838 If this definition is not consistent with your needs, you may redefine | |
839 it in any block of embedded C in your syntax file. | |
840 | |
841 \index{Reduction token error} | |
842 \paragraph{Reduction Token Error.} | |
843 A properly functioning parser should never encounter a reduction token | |
844 error. Therefore, reduction token errors should be taken quite | |
845 seriously. The only way to cause a reduction token error in an | |
846 otherwise properly functioning parser is to set incorrectly the | |
847 reduction token for a semantically determined production. | |
848 % XXX ``to incorrectly set'' | |
849 | |
850 Before your parser calls a reduction procedure, it stores the token | |
851 number of the token to which the production would normally reduce in | |
852 \index{reduction{\us}token}\index{PCB}\agcode{PCB.reduction{\us}token}. If | |
853 the production is a semantically determined production, you may, in | |
854 your reduction procedure, change the value of | |
855 \agcode{PCB.reduction{\us}token} to one of the alternative tokens on | |
856 the left side of the production. When your reduction procedure | |
857 returns, your parser checks to verify that | |
858 \agcode{PCB.reduction{\us}token} is a valid token number for the | |
859 current state of the parser. If it is not, it sets | |
860 \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to | |
861 \index{AG{\us}REDUCTION{\us}ERROR{\us}CODE}\agcode{AG{\us}REDUCTION{\us}ERROR{\us}CODE} | |
862 and invokes | |
863 \index{REDUCTION{\us}TOKEN{\us}ERROR}\index{Macros}\agcode{REDUCTION{\us}TOKEN{\us}ERROR}. | |
864 The default definition of this macro is: | |
865 | |
866 \begin{indentingcode}{0.4in} | |
867 \#define REDUCTION{\us}TOKEN{\us}ERROR \bra fprintf(stderr,{\bs} | |
868 "{\bs}nReduction{\us}token error{\bs}n"); \ket | |
869 \end{indentingcode} | |
870 | |
871 \subsection{Syntax Errors} | |
872 \index{Syntax error}\index{Errors} | |
873 | |
874 If the input data to your parser does not conform to the rules you | |
875 have specified in your grammar, your parser will detect a syntax | |
876 error. There are two basic aspects of dealing with syntax errors: | |
877 \index{Error diagnosis}\agterm{diagnosing} the error and | |
878 \agterm{recovering} from the error, that is, restarting the parse, or | |
879 ``resynchronizing'' the parser. | |
880 | |
881 If you use the default settings for syntax error handling, then on | |
882 encountering a syntax error your parser will call a diagnostic | |
883 procedure which will create an error message and store a pointer to it | |
884 in | |
885 \index{Error messages}\index{error{\us}message}\index{PCB} | |
886 \agcode{PCB.error{\us}message}. | |
887 Then, it will set | |
888 \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to | |
889 \index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE} and | |
890 call a macro called | |
891 \index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR}. The | |
892 default definition of \agcode{SYNTAX{\us}ERROR} will print the error | |
893 message on \agcode{stderr}. Finally, in lieu of trying to continue | |
894 the parse, it will return to the calling program. | |
895 | |
896 AnaGram has several options which allow you to tailor diagnostic | |
897 messages to your requirements or help you to create your own. It also | |
898 provides several options for continuing the parse. | |
899 | |
900 The options available to help you diagnose errors are: | |
901 | |
902 \begin{itemize} | |
903 \item line and column tracking | |
904 \item creation of a diagnostic message | |
905 \item identification of the error frame | |
906 \end{itemize} | |
907 | |
908 \index{Numbers}\index{Lines and columns}\index{Configuration switches} | |
909 \paragraph{Line and Column Tracking.} | |
910 Your parser will automatically track lines and columns in its input if | |
911 the \agparam{lines and columns} configuration switch is on. Since | |
912 this is a common requirement, \agparam{lines and columns} defaults to | |
913 on. If you don't want your parser to spend time counting lines and | |
914 columns you should turn the switch off, thus: | |
915 | |
916 \begin{indentingcode}{0.4in} | |
917 \agcode{ | |
918 \~{}lines and columns | |
919 } | |
920 \end{indentingcode} | |
921 | |
922 Normally, if you are using a \index{Lexical scanner}lexical scanner, | |
923 you would turn lines and columns off. | |
924 % XXX: this should say *why*. | |
925 | |
926 The line and column counts are maintained in | |
927 \index{line}\index{PCB}\agcode{PCB.line} and | |
928 \index{column}\index{PCB}\agcode{PCB.column} respectively. | |
929 \agcode{PCB.line} and \agcode{PCB.column} are initialized with the | |
930 values of the \index{FIRST{\us}LINE}\index{Macros}\agcode{FIRST{\us}LINE} | |
931 and \index{Macros}\index{FIRST{\us}COLUMN}\agcode{FIRST{\us}COLUMN} macros | |
932 respectively. These macros provide default initial values of 1 for | |
933 both line and column numbers. To override these definitions, simply | |
934 include definitions for these macros in your syntax file. If tab | |
935 characters are encountered, they are expanded in accordance with the | |
936 \index{Tab spacing}\agparam{tab spacing} parameter. | |
937 | |
938 When your parser is executing a reduction procedure, \agcode{PCB.line} and | |
939 \agcode{PCB.column} refer to the first input character following the | |
940 rule that is being reduced. When your parser has encountered a syntax | |
941 error, and is executing your \agcode{SYNTAX{\us}ERROR} macro, | |
942 \agcode{PCB.line} and \agcode{PCB.column} refer to the erroneous input | |
943 character. | |
944 | |
945 \paragraph{Diagnostic Messages.} | |
946 If the \index{Diagnose errors}\index{Configuration switches} | |
947 \agparam{diagnose errors} switch is on, its default setting, AnaGram | |
948 will include an error diagnostic procedure in your parser. When your | |
949 parser encounters a syntax error, this procedure will create a simple | |
950 diagnostic message and store a pointer to it in | |
951 \index{error{\us}message}\index{PCB}\agcode{PCB.error{\us}message} before | |
952 your \agcode{SYNTAX{\us}ERROR} macro is executed. The default definition | |
953 of \agcode{SYNTAX{\us}ERROR} prints this message on \agcode{stderr}. | |
954 | |
955 If your parser was in a state where there was a single input character | |
956 expected or a simple named token expected, it will create a message of | |
957 the form: | |
958 | |
959 \begin{indentingcode}{0.4in} | |
960 Missing ';' | |
961 \end{indentingcode} | |
962 or | |
963 \begin{indentingcode}{0.4in} | |
964 Missing semicolon | |
965 \end{indentingcode} | |
966 | |
967 If there was more than one possible input your parser will check to | |
968 see if it can identify the erroneous input. If it can it will create | |
969 a message of the form: | |
970 | |
971 \begin{indentingcode}{0.4in} | |
972 Unexpected ';' | |
973 \end{indentingcode} | |
974 or | |
975 \begin{indentingcode}{0.4in} | |
976 Unexpected semicolon | |
977 \end{indentingcode} | |
978 | |
979 Otherwise, the diagnostic message will be simply: | |
980 | |
981 \begin{indentingcode}{0.4in} | |
982 Unexpected input | |
983 \end{indentingcode} | |
984 | |
985 If you do not need a diagnostic message, or choose to create your own, | |
986 you should turn \agparam{diagnose errors} off. | |
987 | |
988 % XXX Somewhere there should be a discussion of what ``creating your | |
989 % own'' would entail. | |
990 | |
991 \index{Error frame} | |
992 \paragraph{Error Frame.} | |
993 Often it is desirable to know the ``frame'' of an error, that is, what | |
994 the parser thought it was doing when it encountered the error. If, | |
995 for instance, you forget to terminate a comment in a C program, your C | |
996 compiler sees an unexpected end of file. When you look simply at the | |
997 alleged error, of course, you can't see any problem. In order to | |
998 understand the error, you need to know that the parser was trying to | |
999 find a complete comment. In this case, we can say that the comment is | |
1000 the ``frame'' of the error. | |
1001 | |
1002 AnaGram provides an optional facility in its error diagnostic | |
1003 procedure, controlled by the | |
1004 \index{Error frame}\index{Configuration switches}\agparam{error frame} | |
1005 switch, for identifying the frame of a syntax error. The | |
1006 \agparam{diagnose errors} switch must also be on to enable the | |
1007 diagnostic procedure. If you enable \agparam{error frame} in your | |
1008 syntax file, AnaGram will include a procedure which will scan | |
1009 backwards on the state stack looking for the frame of the error. When | |
1010 it finds what appears to be the error frame, it will store the stack | |
1011 index in | |
1012 \index{error{\us}frame{\us}ssx}\index{PCB}\agcode{PCB.error{\us}frame{\us}ssx} and | |
1013 the token number of the nonterminal token the parser was looking for | |
1014 in | |
1015 \index{error{\us}frame{\us}token}\index{PCB}\agcode{PCB.error{\us}frame{\us}token}. | |
1016 | |
1017 % | |
1018 % XXX. Why is the discussion of ``hidden'' inside the discussion of | |
1019 % ``error frame''? hidden applies to ordinary error diagnosis also. | |
1020 % | |
1021 % Furthermore, this discussion of error frame needs an example, or | |
1022 % nobody will ever figure out how to do it. | |
1023 % | |
1024 | |
1025 If, in your grammar, there are nonterminal tokens that are not | |
1026 suitable for diagnostic use, usually because they name an intermediate | |
1027 stage in the parse that means nothing to your user, you can make sure | |
1028 that AnaGram ignores them in doing its analysis by declaring them as | |
1029 \index{Declaration}\index{Hidden declaration}\agparam{hidden}. To | |
1030 declare tokens as hidden, include a \agparam{hidden} declaration in a | |
1031 configuration section. (See Chapter 8.) For instance, consider: | |
1032 | |
1033 \begin{indentingcode}{0.4in} | |
1034 comment | |
1035 -> comment head, "*/" | |
1036 comment head | |
1037 -> "/*" | |
1038 -> comment head, \~{}end of file | |
1039 {}[ hidden \bra comment head \ket ] | |
1040 \end{indentingcode} | |
1041 | |
1042 We mark comment head as hidden, because we only wish to talk about | |
1043 complete comments with our users. | |
1044 | |
1045 In order to use the error frame effectively in your diagnostics, you | |
1046 need to have an ASCII representation of the name of the token as well | |
1047 as its token number. If you turn the | |
1048 \index{Token names}\index{Configuration switches}\agparam{token names} | |
1049 configuration switch on in your syntax file, AnaGram will provide an | |
1050 array of ASCII strings, indexed by token number, which you may use in | |
1051 your diagnostics. The name of the array is created by appending | |
1052 \agcode{{\us}token{\us}names} to the name of your parser. If your parser is | |
1053 called \agcode{ana}, your token name array will have the name | |
1054 \agcode{ana{\us}token{\us}names}. As a convenience, AnaGram | |
1055 also defines a macro, | |
1056 \index{TOKEN{\us}NAMES}\index{Macros}\agcode{TOKEN{\us}NAMES}, which | |
1057 evaluates to the name of the token name array. Note that | |
1058 \agparam{token names} | |
1059 controls the generation of an array of ASCII strings and should not be | |
1060 confused with the \agcode{typedef enum} statement in the parser header | |
1061 file which provides you with a set of enumeration constants. | |
1062 % XXX maybe it means the *strings* should not be confused? | |
1063 | |
1064 If you are tracking context, using the techniques described below, you | |
1065 can use the macro | |
1066 \index{ERROR{\us}CONTEXT}\index{Macros}\agcode{ERROR{\us}CONTEXT} or | |
1067 \index{PERROR{\us}CONTEXT}\index{Macros}\agcode{PERROR{\us}CONTEXT} to | |
1068 determine the context of the error frame token. | |
1069 | |
1070 \index{SYNTAX{\us}ERROR}\index{Macros} | |
1071 \paragraph{SYNTAX{\us}ERROR Macro.} | |
1072 When your parser finds a syntax error, it first executes any of the | |
1073 diagnostic procedures described above that you have enabled, sets | |
1074 \index{exit{\us}flag}\index{PCB}\agcode{PCB.exit{\us}flag} to | |
1075 \index{AG{\us}SYNTAX{\us}ERROR{\us}CODE}\agcode{AG{\us}SYNTAX{\us}ERROR{\us}CODE}, | |
1076 and then invokes the \agcode{SYNTAX{\us}ERROR} macro. If you have not | |
1077 defined \agcode{SYNTAX{\us}ERROR} it will be defined thus if you have set | |
1078 \index{Lines and columns}\index{Configuration switches} | |
1079 \agparam{lines and columns}: | |
1080 | |
1081 \begin{indentingcode}{0.4in} | |
1082 \#define SYNTAX{\us}ERROR {\bs} | |
1083 fprintf(stderr,"\%s,line \%d,column \%d{\bs}n", {\bs} | |
1084 PCB.error{\us}message, PCB.line, PCB.column) | |
1085 \end{indentingcode} | |
1086 | |
1087 and thus if you have not: | |
1088 | |
1089 \begin{indentingcode}{0.4in} | |
1090 \#define SYNTAX{\us}ERROR {\bs} | |
1091 fprintf(stderr, "\%s{\bs}n", PCB.error{\us}message) | |
1092 \end{indentingcode} | |
1093 | |
1094 In most circumstances, you will probably want to write your own | |
1095 \agcode{SYNTAX{\us}ERROR} macro, since this diagnostic is one your users | |
1096 will see with some frequency. | |
1097 % XXX yes and why exactly? is there something we have in mind better | |
1098 % than just printing PCB.error_message? | |
1099 | |
1100 The default macro simply returns to the parser. Your macro doesn't | |
1101 have to. If you wish, you could call \agcode{abort} or \agcode{exit} | |
1102 directly from the macro. If the \agcode{SYNTAX{\us}ERROR} macro returns | |
1103 control to the parser, subsequent events depend on your choices for | |
1104 error recovery. | |
1105 | |
1106 \section{Error Recovery} | |
1107 \index{Error recovery}\index{Syntax error}\index{Errors} | |
1108 | |
1109 Syntax errors can be caused by any of a number of problems. Some come | |
1110 from simple typographic errors: the user skips a character or types | |
1111 the wrong one. Others come from true errors: he types something that | |
1112 might be correct in its place, but in context is totally wrong. | |
1113 Usually, if your parser is reading a file, you will want to continue | |
1114 parsing the input, checking for other syntax errors at the very least. | |
1115 The problem with doing this is getting the parser restarted, or | |
1116 ``resynchronized'', in some reasonable manner. | |
1117 | |
1118 AnaGram provides a number of ways for your parser to recover from a | |
1119 syntax error. The least graceful, of course, is simply to call | |
1120 \agcode{abort} or \agcode{exit} from the \agcode{SYNTAX{\us}ERROR} macro. | |
1121 If you don't do this you have several options: | |
1122 | |
1123 \begin{itemize} | |
1124 \item error token resynchronization | |
1125 \item auto resynchronization | |
1126 \item simple return to calling program | |
1127 \item ignore the error | |
1128 \end{itemize} | |
1129 | |
1130 \subsection{Error Token Resynchronization} | |
1131 \index{Resynchronization} | |
1132 | |
1133 When AnaGram builds your parser it checks to see if you have used a | |
1134 token called \agcode{error} in your grammar or if you have assigned a | |
1135 token name as the value of the configuration parameter | |
1136 \index{Error token}\index{token}\index{Configuration parameters} | |
1137 \agparam{error token}. If so, it includes a call to an error token | |
1138 resynchronization procedure immediately after the invocation of | |
1139 \index{SYNTAX{\us}ERROR}\agcode{SYNTAX{\us}ERROR}. The error token | |
1140 resynchronization procedure works in the following way: It scans the | |
1141 state stack backwards looking for the most recent state in which | |
1142 \agcode{error} or the token named by \agparam{error token} was valid | |
1143 input. It then truncates the stack to this level, and jumps to the | |
1144 state indicated by the error token. It then passes over any input it | |
1145 sees until it sees valid input for the state in which it finds itself. | |
1146 At this point, it returns to the parser which continues as though | |
1147 nothing had happened. Since this is substantially easier than it | |
1148 sounds, let's look at an example. Suppose we are writing a C | |
1149 compiler, and we wish to catch errors in ordinary statements. We add | |
1150 the following production to our grammar: | |
1151 | |
1152 \begin{indentingcode}{0.4in} | |
1153 statement | |
1154 -> error, ';' | |
1155 \end{indentingcode} | |
1156 | |
1157 Now, if the parser encounters a syntax error anytime while it is | |
1158 parsing any statement, it will pop back to the most recent state where | |
1159 it was looking for a statement, jump forward to the state indicated by | |
1160 the token \agcode{error} in the new production, and then skip input | |
1161 until it sees a semicolon. At this point it will continue a normal | |
1162 parse. The effect of continuing at this point is to recognize and | |
1163 reduce the above production, i.e., the parser will proceed as if it | |
1164 had found a complete, correct ``statement''. This production could | |
1165 even have a reduction procedure to do any clean-up that an error might | |
1166 require. | |
1167 | |
1168 If you use error token resynchronization, you must identify an end of | |
1169 file token to guarantee that the resynchronization procedure can | |
1170 always terminate. To do this, either name your end of file token | |
1171 \agcode{eof} or use the | |
1172 \index{Eof token}\index{Configuration parameters}\index{Token} | |
1173 \agparam{eof token} configuration parameter to specify it. | |
1174 | |
1175 For example, if your parser is reading conventional stream input, the | |
1176 end of file will be denoted by a $-1$ value. You can define the end | |
1177 of file token thus: | |
1178 | |
1179 \begin{indentingcode}{0.4in} | |
1180 eof = -1 | |
1181 \end{indentingcode} | |
1182 | |
1183 % XXX as ``finally'' means something in Java, let's change this to | |
1184 % ``at last'' | |
1185 On the other hand, if you have already defined a token named | |
1186 \agcode{finally}, you can add the following line to any configuration | |
1187 segment: | |
1188 | |
1189 \begin{indentingcode}{0.4in} | |
1190 eof token = finally | |
1191 \end{indentingcode} | |
1192 | |
1193 The end of file token, of course, must be a terminal token. | |
1194 % XXX this is not ``of course'' to a casual observer. | |
1195 | |
1196 \subsection{Automatic Resynchronization} | |
1197 \index{Resynchronization}\index{Automatic resynchronization} | |
1198 | |
1199 If you have not specified an \agcode{error} token in your syntax file, | |
1200 AnaGram checks to see if you have turned on the | |
1201 \index{Auto resynch}\index{Configuration switches} | |
1202 \agparam{auto resynch} configuration switch. | |
1203 If so, it includes a call to an automatic resynchronization procedure | |
1204 immediately after the call to \agcode{SYNTAX{\us}ERROR}. The automatic | |
1205 resynchronization procedure uses a heuristic based on your grammar to | |
1206 get back in step with the input. To use it you need do only two | |
1207 things: You need to turn on the \index{Auto resynch}\agparam{auto | |
1208 resynch} switch, and you need to specify an end of file token as for | |
1209 error token resynchronization, above. | |
1210 | |
1211 The primary advantage of the automatic resynchronization is that it is | |
1212 easy to use. The disadvantage is that it turns off all reduction | |
1213 procedures, so that your parser is reduced to being a syntax checker | |
1214 after it encounters an error. If your grammar uses semantically | |
1215 determined productions, your reduction procedures will not be invoked | |
1216 so the primary reduction token will be used in all cases. | |
1217 | |
1218 % XXX *why* does it do this? | |
1219 | |
1220 \subsection{Other Ways to Continue} | |
1221 | |
1222 % XXX the example of ``reading input from a keyboard'' should be | |
1223 % clarified to indicate that this means something like an application | |
1224 % where you press F10 for the menu, not typing at a command line. | |
1225 % | |
1226 If you do not wish to use either of the above resynchronization | |
1227 procedures, you still have a number of options. If your parser is | |
1228 reading input from a keyboard, for instance, it is probably sufficient | |
1229 to simply ignore bad input characters. You can do this by simply | |
1230 resetting \index{PCB}\index{exit{\us}flag}\agcode{PCB.exit{\us}flag} to | |
1231 zero in your | |
1232 \index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR} macro. | |
1233 % XXX XXX should say \agcode{AG_RUNNING_CODE}, not zero!! | |
1234 Your parser will then continue, passing over the bad input as though | |
1235 it had never occurred. If you do this, you should, of course, notify | |
1236 your user somehow that you're skipping a character. Issuing a beep on | |
1237 the computer's speaker from the \agcode{SYNTAX{\us}ERROR} macro is | |
1238 usually enough. | |
1239 | |
1240 If you do not wish to continue the parse, but want your main program | |
1241 to continue, you need do nothing special. \agcode{PCB.exit{\us}flag} is | |
1242 % XXX XXX should say \agcode{AG_SYNTAX_ERROR_CODE}, not 2!! | |
1243 set to 2 before the \agcode{SYNTAX{\us}ERROR} macro is called. If your | |
1244 macro does not change \agcode{PCB.exit{\us}flag}, when it relinquishes | |
1245 control to your parser, your parser will simply return to the calling | |
1246 program. The calling program can determine that the parse was | |
1247 unsuccessful by inspecting \agcode{PCB.exit{\us}flag} and take whatever | |
1248 action you deem appropriate. | |
1249 | |
1250 | |
1251 \section{Advanced Techniques} | |
1252 | |
1253 \subsection{Semantically Determined Productions} | |
1254 \index{Semantically determined production}\index{Production} | |
1255 | |
1256 A semantically determined production is one which has more than one | |
1257 token on the left side. The reduction procedure then determines which | |
1258 token has in fact been identified, using whatever criteria are | |
1259 necessary. In some cases where the purpose is simply to provide | |
1260 multiple syntactic options to be chosen at execution time, the | |
1261 determination is made simply by interrogating a switch. Other | |
1262 situations may require a more complex determination, such as a symbol | |
1263 table look-up, for instance. | |
1264 | |
1265 \index{Production} | |
1266 The tokens on the left side of the production can be used just like | |
1267 any other tokens in your grammar. Their semantic values, however, | |
1268 must all be of the same \index{Data type}\index{Token}data type. | |
1269 | |
1270 Depending on how you have defined your grammar, it may be that | |
1271 whenever any one of the tokens on the left side is syntactically | |
1272 acceptable input, all the tokens on the left are syntactically | |
1273 acceptable. That is, the production could reduce to any of the tokens | |
1274 on the left without causing an immediate error condition. In many | |
1275 circumstances, however, this is not the case. In a Pascal grammar, | |
1276 for example, a semantically determined production might be used to | |
1277 allow a reduction procedure to determine whether a particular | |
1278 identifier is a constant identifier, a type identifier, a variable | |
1279 identifier, or so on. In any particular context, only a subset of the | |
1280 tokens on the left may be syntactically acceptable. | |
1281 | |
1282 Before your reduction procedure is called, your parser will set the | |
1283 reduction token to the first token on the left side which is | |
1284 syntactically correct. If you need to change this assignment you have | |
1285 several options. From within your reduction procedure, you may simply | |
1286 set | |
1287 \index{reduction{\us}token}\index{PCB}\index{Token}\agcode{PCB.reduction{\us}token} | |
1288 to the semantically correct value. For this purpose, it is convenient | |
1289 to use the token name enumeration constants provided in the header | |
1290 file for your parser. Note that if you select a reduction token that | |
1291 is not syntactically correct, after your reduction procedure returns, | |
1292 your parser will encounter a \index{Reduction token | |
1293 error}\agterm{reduction token error}, described above. | |
1294 | |
1295 AnaGram provides several tools to help you set the reduction token | |
1296 correctly. First, it provides a \agterm{change reduction} function | |
1297 which will set the reduction token to a specified token only if the | |
1298 specified token is syntactically correct. It will return a flag to | |
1299 indicate the outcome: non-zero on success, zero on failure. The name | |
1300 of this function is given by appending \agcode{{\us}change{\us}reduction} to | |
1301 the name of your parser. Thus, if your parser is named \agcode{ana}, | |
1302 the name of the function would be \agcode{ana{\us}change{\us}reduction}. In | |
1303 those cases where the semantically correct reduction token is not | |
1304 syntactically correct, you will want to provide error diagnostics for | |
1305 your user. If you wish the parse to continue, so you can check | |
1306 errors, you may simply return from the reduction procedure. Since the | |
1307 default reduction is syntactically correct, the parse can continue as | |
1308 though there had been no error. | |
1309 | |
1310 To simplify use of the change reduction function, AnaGram provides a macro, | |
1311 \index{CHANGE{\us}REDUCTION}\index{Macros}\agcode{CHANGE{\us}REDUCTION}. | |
1312 Simply call the macro with the name of the desired token as the | |
1313 argument, replacing embedded blanks in the token name with | |
1314 underscores. | |
1315 | |
1316 For example, in writing a grammar for the C language, it is quite | |
1317 convenient to write the following production: | |
1318 | |
1319 \begin{indentingcode}{0.4in} | |
1320 identifier, typedef name | |
1321 -> name = check{\us}typedef(); | |
1322 \end{indentingcode} | |
1323 | |
1324 The reduction procedure can then check the symbol table to see if | |
1325 whether the name that has been found is a typedef name. If so, it can | |
1326 use the \agcode{CHANGE{\us}REDUCTION} macro to change the reduction token | |
1327 to \agcode{typedef name} and verify that this is acceptable: | |
1328 | |
1329 \begin{indentingcode}{0.4in} | |
1330 if (!CHANGE{\us}REDUCTION(typedef{\us}name)) diagnose{\us}error(); | |
1331 \end{indentingcode} | |
1332 | |
1333 Note that the embedded space in the token name must be replaced with | |
1334 an underscore character. | |
1335 | |
1336 Under some circumstances, in your reduction procedure, you might wish | |
1337 to know precisely which reduction tokens are syntactically correct. | |
1338 For instance, you might wish, in an error diagnostic, to tell your | |
1339 user what you expected to see. If you set the | |
1340 \index{Reduction choices}\index{Configuration switches} | |
1341 \agparam{reduction choices} switch, | |
1342 AnaGram will include in your parser file a function which will | |
1343 identify the acceptable choices for the reduction token in the current | |
1344 state. The prototype of this function is: | |
1345 | |
1346 \begin{indentingcode}{0.4in} | |
1347 int \${\us}reduction{\us}choices(int *); | |
1348 \end{indentingcode} | |
1349 | |
1350 where ``\agcode{\$}'' represents the name of your parser. You must provide an | |
1351 integer array whose length is at least as long as the maximum number | |
1352 of reduction choices you might have. The function will fill the array | |
1353 with the token numbers of those which are acceptable in the current | |
1354 state and return a count of the number of acceptable choices it found. | |
1355 You can call this function from any reduction procedure. AnaGram also | |
1356 provides a macro to invoke this procedure: | |
1357 \index{REDUCTION{\us}CHOICES}\index{Macros}\agcode{REDUCTION{\us}CHOICES}. | |
1358 For example, to provide a diagnostic which details the acceptable | |
1359 token, you might combine the use of the \agparam{reduction choices} | |
1360 switch with the | |
1361 \index{Token names}\index{Configuration switches}\agparam{token names} | |
1362 switch described above: | |
1363 | |
1364 \begin{indentingcode}{0.4in} | |
1365 int ok{\us}tokens[20], n{\us}ok{\us}tokens, i; | |
1366 n{\us}ok{\us}tokens = REDUCTION{\us}CHOICES(ok{\us}tokens); | |
1367 printf("Acceptable input comprises: {\bs}n"); | |
1368 for (i = 0; i $<$ n{\us}ok{\us}tokens; i++) \bra | |
1369 printf(" \%s{\bs}n", TOKEN{\us}NAMES[i]); | |
1370 \ket | |
1371 \end{indentingcode} | |
1372 | |
1373 A semantically determined production can even be a null production. | |
1374 You can use a semantically determined null production to interrogate | |
1375 the settings of parameters and control parsing accordingly: | |
1376 | |
1377 \begin{indentingcode}{0.4in} | |
1378 condition false, condition true | |
1379 -> = \bra if (condition) CHANGE{\us}REDUCTION(condition{\us}true); \ket | |
1380 \end{indentingcode} | |
1381 | |
1382 There are numerous examples of the use of semantically determined | |
1383 productions in the examples provided in the | |
1384 \index{examples}\agfile{examples} directory of your AnaGram | |
1385 distribution disk. | |
1386 % XXX too much anaphora | |
1387 % XXX s/disk// | |
1388 | |
1389 \subsection{Defining Parser Control Blocks} | |
1390 \index{Parser control block} | |
1391 | |
1392 All references to the parser control block in your parser are made | |
1393 using the macro \index{PCB}\agcode{PCB}. The only intrinsic | |
1394 requirement on PCB is that it evaluate to an \agterm{lvalue} (see | |
1395 Kernighan and Ritchie) that identifies a parser control block. The | |
1396 actual access may be direct, indirect through a pointer, subscripted, | |
1397 or even more complex, although if the access is too complex, the | |
1398 performance of your parser could suffer. Simple indirect or | |
1399 subscripted references are usually enough to enable you to build a | |
1400 system with multiple parallel parsing processes. If you wish to | |
1401 define \agcode{PCB} in some way other than a simple, direct access to | |
1402 a compiled-in control block, you will have to declare the control | |
1403 block yourself. | |
1404 | |
1405 When AnaGram builds a parser, it checks the status of the | |
1406 \index{Declare pcb}\index{Configuration switches}\agparam{declare pcb} | |
1407 configuration switch. If it is on, the default setting, AnaGram | |
1408 declares a parser control block for you. AnaGram creates the name of | |
1409 the parser control block variable by appending \agcode{{\us}pcb} to the | |
1410 name of your parser. Thus if the name of your parser is | |
1411 \agcode{ana}, the parser control block is \agcode{ana{\us}pcb}. | |
1412 | |
1413 In the header file AnaGram generates, a typedef statement defines the | |
1414 structure of the parser control block. The typedef name is given by | |
1415 appending \agcode{{\us}pcb{\us}type} to the name of your parser. Thus if | |
1416 the name of your parser is \agcode{ana}, the type of the parser | |
1417 control block is given by \agcode{ana{\us}pcb{\us}type}. Thus, when AnaGram | |
1418 defines the parser control block for \agcode{ana}, it does so by | |
1419 including the following two lines of code: | |
1420 | |
1421 \begin{indentingcode}{0.4in} | |
1422 ana{\us}pcb{\us}type ana{\us}pcb; | |
1423 \#define PCB ana{\us}pcb | |
1424 \end{indentingcode} | |
1425 | |
1426 If you wish to declare the parser control block yourself, you should | |
1427 turn off the \agparam{declare pcb} switch. To turn \agparam{declare | |
1428 pcb} off, include the following line in a configuration segment in | |
1429 your syntax file: | |
1430 | |
1431 \begin{indentingcode}{0.4in} | |
1432 \~{}declare pcb | |
1433 \end{indentingcode} | |
1434 | |
1435 Suppose your program needs to serve up to sixteen ``clients'', each | |
1436 with its own input stream. You might turn \agparam{declare pcb} off | |
1437 and declare the parser control block in the following manner: | |
1438 | |
1439 \begin{indentingcode}{0.4in} | |
1440 ana{\us}pcb{\us}type ana{\us}pcb[16]; /* declare control blocks */ | |
1441 int client; | |
1442 \#define PCB ana{\us}pcb[client] /* tell parser about it */ | |
1443 \end{indentingcode} | |
1444 | |
1445 Perhaps you need to parse a number of input streams, but you don't | |
1446 know exactly how many until run time. You might make the following | |
1447 declarations: | |
1448 | |
1449 \begin{indentingcode}{0.4in} | |
1450 ana{\us}pcb{\us}type *ana{\us}pcb; /* pointer to control block */ | |
1451 \#define PCB (*ana{\us}pcb) /* tell parser about it */ | |
1452 \end{indentingcode} | |
1453 | |
1454 Note that when you declare \agcode{PCB} as a pointer, you should put | |
1455 parentheses around the declaration so that your compiler codes the | |
1456 indirection properly. | |
1457 | |
1458 There are many situations where it is convenient for a parser to be | |
1459 reentrant. A parser used for evaluating formulas in a spreadsheet | |
1460 program, for instance, needs to be able to call itself recursively if | |
1461 it is to use natural order recalculation. A parser used to implement | |
1462 macro substitutions may need to be recursive to deal with embedded | |
1463 macros. | |
1464 | |
1465 Here is an example of an interface function which is designed for | |
1466 recursive calls to a parser, using the definitions above: | |
1467 | |
1468 % XXX can I please at least remove the nonstandard <alloc.h>? | |
1469 % And fix the misuse of assert, and check malloc for failure? | |
1470 % And use AG_SUCCESS_CODE instead of 1? | |
1471 \begin{indentingcode}{0.4in} | |
1472 \#include <assert.h> | |
1473 \#include <alloc.h> | |
1474 | |
1475 \#define PCB (*ana{\us}pcb) | |
1476 ana{\us}pcb{\us}type *ana{\us}pcb; | |
1477 | |
1478 void do{\us}ana(void) \bra | |
1479 ana{\us}pcb{\us}type *save{\us}ana = ana{\us}pcb; | |
1480 ana{\us}pcb = malloc(sizeof(ana{\us}pcb{\us}type)); | |
1481 ana(); | |
1482 assert(ana{\us}pcb.exit{\us}flag == 1); | |
1483 free(ana{\us}pcb); | |
1484 ana{\us}pcb = save{\us}ana; | |
1485 \ket | |
1486 \end{indentingcode} | |
1487 | |
1488 Here is another way to accomplish the same end, this time using stack | |
1489 storage rather than heap storage: | |
1490 | |
1491 % XXX ditto | |
1492 \begin{indentingcode}{0.4in} | |
1493 \#include <assert.h> | |
1494 \#include <alloc.h> | |
1495 | |
1496 \#define PCB (*ana{\us}pcb) | |
1497 ana{\us}pcb{\us}type *ana{\us}pcb; | |
1498 | |
1499 void do{\us}ana(void) \bra | |
1500 ana{\us}pcb{\us}type *save{\us}ana = ana{\us}pcb; | |
1501 ana{\us}pcb{\us}type local{\us}pcb; | |
1502 ana{\us}pcb = \&local{\us}pcb; | |
1503 ana();\\ | |
1504 assert(ana{\us}pcb.exit{\us}flag == 1); | |
1505 ana{\us}pcb = save{\us}ana; | |
1506 \ket | |
1507 \end{indentingcode} | |
1508 | |
1509 % XXX and here we should discuss \agparam{reentrant parser}, too. | |
1510 | |
1511 | |
1512 \subsection{Multi-stage Parsing} | |
1513 \index{Parsing}\index{Multi-stage parsing} | |
1514 | |
1515 Multi-stage parsing consists of chaining together a number of parsers | |
1516 in series so that each parser provides input to the following one. | |
1517 Users of \agfile{lex} and \agfile{yacc} are accustomed to using | |
1518 two-level parsing, since the ``\index{Lexical scanner}lexical | |
1519 scanner'', or ``lexer'' they write in \agfile{lex} is really a very | |
1520 simple parser whose output becomes the input to the parser written in | |
1521 \agfile{yacc}. AnaGram has been developed so that you may use as many | |
1522 levels as are appropriate to your problem, and so that, if you wish, | |
1523 you may write all of the parsers in AnaGram. | |
1524 | |
1525 Many problems that do not lend themselves conveniently to solution | |
1526 with a simple grammar can be neatly solved by using multi-stage | |
1527 parsing. In many cases this is because multi-stage parsing can be | |
1528 used to parse constructs that are not context-free. A first level | |
1529 parser can use semantic information to decide which tokens to pass on | |
1530 to the next level. Thus, a first level parser for a C compiler can | |
1531 use semantic information to distinguish typedef names from variable | |
1532 names. | |
1533 | |
1534 % XXX I believe this is referring to QPL. Nowadays there's Python... | |
1535 As another example, a proprietary programming language used indents to | |
1536 control its block structure. A first level parser looked only at | |
1537 lines and indents, passing the text through to the second level | |
1538 parser. When it encountered changes in indentation level, it inserted | |
1539 block start and block end tokens as necessary. | |
1540 | |
1541 Using AnaGram it is extremely easy to set up multi-stage parses. | |
1542 Simply configure the second level parser as an event-driven parser. | |
1543 The first level parser can then hand over tokens or characters to it | |
1544 as it develops them. | |
1545 | |
1546 The C macro preprocessor example, found in the | |
1547 \index{examples}\agfile{examples} directory of your AnaGram | |
1548 distribution disk, illustrates the use of multi-stage parsing. | |
1549 | |
1550 \subsection{Context Tracking} | |
1551 \index{Context tracking} | |
1552 | |
1553 When you are writing a reduction procedure for a particular grammar | |
1554 rule, you often need to know the value one or another of your program | |
1555 variables had at the time the first token in the rule was encountered. | |
1556 Examples of such variables are: | |
1557 | |
1558 \begin{itemize} | |
1559 \item Line or column number | |
1560 \item Index in an input file | |
1561 \item Index into an array | |
1562 \item Counters, as of symbols defined, etc. | |
1563 \end{itemize} | |
1564 | |
1565 Such variables can be thought of as representing the ``context'' of | |
1566 the rule you are reducing. Sometimes it is possible to incorporate | |
1567 the values of such variables into the values of reduction tokens, but | |
1568 this can become quite cumbersome. AnaGram provides an optional | |
1569 feature known as ``context tracking'' to deal with this problem. | |
1570 Here's how it works: | |
1571 | |
1572 First, you identify the variables which you want to track. Second, | |
1573 you write a typedef statement in the \index{C prologue}C prologue of | |
1574 your parser which defines a data structure with fields to accommodate | |
1575 values for all of these variables. Third, you tell AnaGram what the | |
1576 name of the type of your data structure is, using the | |
1577 \index{Context type}\index{Configuration parameters}\agparam{context type} | |
1578 configuration parameter. This causes AnaGram to add a field called | |
1579 \index{PCB}\index{input{\us}context}\agcode{input{\us}context} and a stack, | |
1580 the \index{Context stack}\index{Stack}\agterm{context stack}, called | |
1581 \index{PCB}\index{cs}\agcode{cs}, both of the type you have specified, | |
1582 to your parser control block. Fourth, you write code to gather the | |
1583 context information for each input character. | |
1584 | |
1585 There are several ways to provide the initial context information. | |
1586 You may write a | |
1587 \index{GET{\us}CONTEXT}\index{Macros}\agcode{GET{\us}CONTEXT} macro which | |
1588 sets the context stack variables directly. Using the | |
1589 \index{CONTEXT}\index{Macros}\agcode{CONTEXT} macro defined below, and | |
1590 assuming your context type has line, column and pointer fields, you | |
1591 could define \agcode{GET{\us}CONTEXT} as follows: | |
1592 | |
1593 \begin{indentingcode}{0.4in} | |
1594 \#define GET{\us}CONTEXT CONTEXT.pointer = PCB.pointer,{\bs} | |
1595 CONTEXT.line = PCB.line,{\bs} | |
1596 CONTEXT.column = PCB.column | |
1597 \end{indentingcode} | |
1598 | |
1599 If you are using \agparam{pointer input}, you must write a | |
1600 \agcode{GET{\us}CONTEXT} macro to save context information. If you use a | |
1601 \index{GET{\us}INPUT}\index{Macros}\agcode{GET{\us}INPUT} macro or have an | |
1602 event-driven parser, you may either store values directly into | |
1603 \index{input{\us}context}\index{PCB}\agcode{PCB.input{\us}context} when you | |
1604 develop the input token, or you may write a \agcode{GET{\us}CONTEXT} | |
1605 macro. The macro will provide a slight increment in performance. | |
1606 % XXX say why it's faster (I assume because it won't look up context | |
1607 % for inputs that don't need it?) | |
1608 | |
1609 AnaGram provides six macros to enable you to read values in a | |
1610 convenient manner from the context stack, | |
1611 \index{cs}\index{PCB}\agcode{PCB.cs}. Three of these macros are | |
1612 designed to be used from your parser itself, and three are available | |
1613 to use from other modules. These three macros are designed for use in | |
1614 your parser: | |
1615 | |
1616 \begin{itemize} | |
1617 \item \agcode{CONTEXT} | |
1618 \item \agcode{RULE{\us}CONTEXT} | |
1619 \item \agcode{ERROR{\us}CONTEXT} | |
1620 \end{itemize} | |
1621 | |
1622 These macros are defined at the beginning of your parser file, so they | |
1623 may be used anywhere within your parser. | |
1624 | |
1625 \index{CONTEXT}\index{Macros}\agcode{CONTEXT} | |
1626 can be used to read or write the current top of the context stack as | |
1627 indexed by \index{PCB}\agcode{PCB.ssx}. When your parser is executing | |
1628 a reduction procedure for a particular grammar rule, \agcode{CONTEXT} | |
1629 will evaluate to the value of the input context as it was just before | |
1630 the very first token in the rule. The definition of \agcode{CONTEXT} | |
1631 is: | |
1632 | |
1633 \begin{indentingcode}{0.4in} | |
1634 \#define CONTEXT (PCB.cs[PCB.ssx]) | |
1635 \end{indentingcode} | |
1636 | |
1637 \index{RULE{\us}CONTEXT}\index{Macros}\agcode{RULE{\us}CONTEXT} can be used | |
1638 within a reduction procedure to get the context for any element within | |
1639 the rule being reduced. For example, \agcode{RULE{\us}CONTEXT[0]} is the | |
1640 context of the first element in the rule, \agcode{RULE{\us}CONTEXT[1]} is | |
1641 the context of the second element in the rule, and so on. | |
1642 \agcode{RULE{\us}CONTEXT[0]} is exactly the same as \agcode{CONTEXT}. | |
1643 | |
1644 % XXX There should be a way to address the context of tokens in a | |
1645 % rule by the symbolic names we've bound to them. | |
1646 | |
1647 The definition of \agcode{RULE{\us}CONTEXT} is: | |
1648 | |
1649 \begin{indentingcode}{0.4in} | |
1650 \#define RULE{\us}CONTEXT (\&(PCB.cs[PCB.ssx])) | |
1651 \end{indentingcode} | |
1652 | |
1653 As an example, let us suppose that we are writing a parser to read a | |
1654 parameter file for a program. Let us imagine the following statements | |
1655 make up a part of our syntax file: | |
1656 | |
1657 \begin{indentingcode}{0.4in} | |
1658 \bra | |
1659 typedef struct \bra int line, column \ket location; | |
1660 \#define GET{\us}INPUT {\bs} | |
1661 PCB.input{\us}code = fgetc(input{\us}file); {\bs} | |
1662 PCB.input{\us}context.line = PCB.line; {\bs} | |
1663 PCB.input{\us}context.column = PCB.column; | |
1664 \ket | |
1665 {}[ context type = location ]\\ | |
1666 | |
1667 parameter assignment | |
1668 -> parameter name, '=', number | |
1669 \end{indentingcode} | |
1670 | |
1671 Let us suppose that for each parameter we have stored a range of | |
1672 admissible values. We have to diagnose an attempt to use an incorrect | |
1673 value. We could write our diagnostic message as follows: | |
1674 | |
1675 \begin{indentingcode}{0.4in} | |
1676 fprintf(stderr, "Bad value at line \%d, column \%d in " | |
1677 "parameter assignment at line \%d, column \%d", | |
1678 RULE{\us}CONTEXT[2].line, | |
1679 RULE{\us}CONTEXT[2].column, | |
1680 CONTEXT.line, | |
1681 CONTEXT.column); | |
1682 \end{indentingcode} | |
1683 | |
1684 This diagnostic message would give our user the exact location both of | |
1685 the bad value and of the beginning of the statement that contained the | |
1686 bad value. | |
1687 | |
1688 \index{ERROR{\us}CONTEXT}\index{Macros}\agcode{ERROR{\us}CONTEXT} can be | |
1689 used within a | |
1690 \index{SYNTAX{\us}ERROR}\index{Macros}\agcode{SYNTAX{\us}ERROR} macro to | |
1691 find the context of an error if you have turned on the | |
1692 \index{Error frame}\index{Configuration switches}\agparam{error frame} | |
1693 and | |
1694 \index{Diagnose errors}\index{Configuration switches} | |
1695 \agparam{diagnose errors} | |
1696 switches. AnaGram itself tracks context using a structure consisting | |
1697 of line and column numbers. In case of errors such as encountering an | |
1698 end of file in a comment, it uses the \agcode{ERROR{\us}CONTEXT} macro | |
1699 to determine the line and column number at which the comment began. | |
1700 % XXX that sounds like something AG does with your grammar, not | |
1701 % what AG does reading its own input, which is what it is. rephrase... | |
1702 The definition of \agcode{ERROR{\us}CONTEXT} is: | |
1703 | |
1704 \begin{indentingcode}{0.4in} | |
1705 \#define ERROR{\us}CONTEXT (PCB.cs[PCB.error{\us}frame{\us}ssx]) | |
1706 \end{indentingcode} | |
1707 | |
1708 Three similar macros are also available for more general use: | |
1709 | |
1710 \begin{itemize} | |
1711 \item \index{PCONTEXT}\index{Macros}\agcode{PCONTEXT(pcb)} | |
1712 \item \index{PRULE{\us}CONTEXT}\index{Macros}\agcode{PRULE{\us}CONTEXT(pcb)} | |
1713 \item \index{PERROR{\us}CONTEXT}\index{Macros}\agcode{PERROR{\us}CONTEXT(pcb)} | |
1714 \end{itemize} | |
1715 | |
1716 % XXX repeating ``modules other than'' is bad | |
1717 These macros are identical in function to the corresponding macros in | |
1718 the first class. The only difference is that they take the name of a | |
1719 parser control block, \agcode{pcb}, as an argument so they can be used | |
1720 in modules other than the parser module. AnaGram includes the | |
1721 definitions for these macros in the parser header file so that they | |
1722 can be used in modules other than the parser itself. Since these | |
1723 macros are not specific to any one parser, the definitions are | |
1724 conditional so that they will only be defined once in a given module, | |
1725 even if you include header files corresponding to several parsers. | |
1726 The definitions of these macros are as follows: | |
1727 | |
1728 \begin{indentingcode}{0.4in} | |
1729 \#define PCONTEXT(pcb) (pcb.cs[pcb.ssx]) | |
1730 \#define PRULE{\us}CONTEXT(pcb) (\&(pcb.cs[pcb.ssx])) | |
1731 \#define PERROR{\us}CONTEXT(pcb) (pcb.cs[pcb.error{\us}frame{\us}ssx]) | |
1732 \end{indentingcode} | |
1733 | |
1734 Note that since the context macros only make sense when called from a | |
1735 reduction procedure or an error procedure, there are not many | |
1736 occasions to use these macros. The most common situation would be | |
1737 when you have compiled the bulk of the code for your reduction | |
1738 procedures in a separate module. | |
1739 | |
1740 Remember that \agcode{PRULE{\us}CONTEXT}, because it identifies an array | |
1741 rather than a value, requires a subscript. For an example, let us | |
1742 rewrite the diagnostic message given above for | |
1743 \agcode{RULE{\us}CONTEXT} using \agcode{PRULE{\us}CONTEXT}, assuming | |
1744 that the name of our parser control block is \agcode{ana{\us}pcb}: | |
1745 | |
1746 \begin{indentingcode}{0.4in} | |
1747 fprintf(stderr, "Bad value at line \%d, column \%d in " | |
1748 "resource statement at line \%d, column \%d", | |
1749 PRULE{\us}CONTEXT(ana{\us}pcb)[2].line, | |
1750 PRULE{\us}CONTEXT(ana{\us}pcb)[2].column, | |
1751 PCONTEXT.line, | |
1752 PCONTEXT.column); | |
1753 \end{indentingcode} | |
1754 | |
1755 \subsection{Coverage Analysis} | |
1756 \index{Coverage analysis} | |
1757 | |
1758 AnaGram has simple facilities for helping you determine the adequacy | |
1759 of your test suites. The | |
1760 \index{Rule coverage}\index{Configuration switches} | |
1761 \agparam{rule coverage} configuration switch | |
1762 controls these facilities. When you set \agparam{rule coverage}, | |
1763 AnaGram includes code in your parser to count the number of times the | |
1764 parser identifies each rule in your grammar. AnaGram also provides | |
1765 procedures you can use to write these counts to a file and accumulate | |
1766 them over multiple executions of your parser. Finally, it provides a | |
1767 window where you may inspect the counts to see the extent to which | |
1768 your tests have covered the options in your grammar. | |
1769 | |
1770 To maintain the counts, AnaGram declares, at the beginning of your | |
1771 parser, an integer array, whose name is created by appending | |
1772 \agcode{{\us}nrc} to the name of your parser. The array contains one | |
1773 counter for each rule you have defined in your grammar. There are no | |
1774 entries for the auxiliary rules that AnaGram creates to deal with set | |
1775 overlaps or disregard statements. In order to identify positively all | |
1776 the rules that the parser reduces, AnaGram turns off certain | |
1777 optimization features in your parser. Therefore, a parser that has | |
1778 the \agparam{rule coverage} switch enabled will run slightly slower | |
1779 than one with the switch off. | |
1780 | |
1781 AnaGram also provides procedures to write the counts to a file and to | |
1782 initialize the counts from a file. The procedures are named by | |
1783 appending \agcode{{\us}write{\us}counts} and \agcode{{\us}read{\us}counts} | |
1784 respectively to the name of your parser. Thus, if your parser is | |
1785 called \agcode{ana}, the procedures are called | |
1786 \agcode{ana{\us}write{\us}counts} and \agcode{ana{\us}read{\us}counts}. | |
1787 Neither takes any arguments nor returns a value. To accumulate counts | |
1788 correctly, you should include calls to the | |
1789 \index{read{\us}counts}\agcode{read{\us}counts} and | |
1790 \index{write{\us}counts}\agcode{write{\us}counts} procedures in your | |
1791 program. A convenient way to do this is to include statements such as | |
1792 the following in your main program: | |
1793 | |
1794 % XXX perhaps this means ``atexit'' | |
1795 \begin{indentingcode}{0.4in} | |
1796 ana{\us}read{\us}counts(); /* before calling parser */ | |
1797 at{\us}exit(ana{\us}write{\us}counts); | |
1798 \end{indentingcode} | |
1799 | |
1800 For your convenience, AnaGram defines two macros, | |
1801 \index{READ{\us}COUNTS}\index{Macros}\agcode{READ{\us}COUNTS} and | |
1802 \index{WRITE{\us}COUNTS}\index{Macros}\agcode{WRITE{\us}COUNTS}, in your | |
1803 parser. They call the \agcode{read{\us}counts} and | |
1804 \agcode{write{\us}counts} procedures respectively when \agparam{rule | |
1805 coverage} is set. Otherwise they are null. Thus you may code them | |
1806 into your main program and it will work whether or not the | |
1807 \agparam{rule coverage} switch is set. For example, | |
1808 | |
1809 \begin{indentingcode}{0.4in} | |
1810 READ{\us}COUNTS; /* read counts if coverage enabled */ | |
1811 my{\us}parser(); /* call parser */ | |
1812 WRITE{\us}COUNTS; /* write updated counts */ | |
1813 \end{indentingcode} | |
1814 | |
1815 The \agcode{write{\us}counts} procedure writes an identifier code and the | |
1816 counts to a count file. The name of the count file is given by the | |
1817 \index{Coverage file name}\index{Configuration parameters} | |
1818 \agparam{coverage file name} parameter, which defaults to the same name as your | |
1819 syntax file but with the extension | |
1820 \index{File extension}\index{nrc}\agfile{.nrc}. The identifier code | |
1821 changes each time you modify your syntax file. The | |
1822 \agcode{read{\us}counts} procedure attempts to read the count file. If | |
1823 it cannot find it, or the identifier code is out of date, it simply | |
1824 initializes the counter array to zeroes. Otherwise, it initializes | |
1825 the counter arrays to the values found in the file. | |
1826 | |
1827 When you run AnaGram and analyze your syntax file, if | |
1828 \agparam{rule coverage} is set, AnaGram will enable the \agmenu{Rule | |
1829 Coverage} option on the \agmenu{Browse} menu. If you select | |
1830 \agmenu{Rule Coverage}, AnaGram will prepare a \agwindow{Rule | |
1831 Coverage} window from the rule count file you select. AnaGram will | |
1832 warn you if the file you selected is older than the syntax file, since | |
1833 under those conditions, the coverage file might be invalid. | |
1834 | |
1835 The \index{Rule Coverage}\index{Window}\agwindow{Rule Coverage} window | |
1836 shows the count for each rule, the rule number and the text of the | |
1837 rule. It is also synched to the syntax file so that you can see the | |
1838 rule in context. AnaGram also modifies the display of the | |
1839 \index{Reduction Procedures}\index{Window}\agwindow{Reduction | |
1840 Procedures} window so that each procedure descriptor is preceded by | |
1841 the number of times it has been called. You can use this display to | |
1842 verify that all your reduction procedures have been tried. | |
1843 | |
1844 % XXX having this paragraph here seems confusing | |
1845 The \index{Trace Coverage}\index{Window}\agwindow{Trace Coverage} | |
1846 window, created when you use the \agwindow{File Trace} or | |
1847 \agwindow{Grammar Trace} option, provides information similar to that | |
1848 provided by \agwindow{Rule Coverage}. The differences are these: | |
1849 Optimizations are not turned off for the \agwindow{Trace Coverage}, so | |
1850 that some rules of length zero or one will not be properly counted. | |
1851 Also, the \agwindow{Trace Coverage} does not tell you about the | |
1852 reduction procedures you have tested. | |
1853 | |
1854 \agwindow{File Trace} can become quite tedious to use if you have very | |
1855 many semantically determined productions, so in these cases the | |
1856 \agparam{rule coverage} approach can give you the information you need | |
1857 more quickly. | |
1858 | |
1859 \subsection{Using Precedence Operators} | |
1860 | |
1861 The conventional syntax for arithmetic expressions used in most | |
1862 programming languages can be parsed simply by reference to | |
1863 \index{Operator precedence}\index{Precedence operators} | |
1864 \agterm{operator precedence}. Operator precedence refers to | |
1865 the rules we use to determine the order in which arithmetic operations | |
1866 should be carried out. In normal usage, this means that | |
1867 multiplication and division take precedence over addition and | |
1868 subtraction, which in turn take precedence over comparison operations. | |
1869 One can formalize this usage by assigning a numeric \index{Precedence | |
1870 level}\agterm{precedence level} to each operator, so that the | |
1871 operations are carried out starting with those of highest precedence | |
1872 and continuing in order of declining precedence. When operators have | |
1873 the same precedence level, such as addition and subtraction operators, | |
1874 one can decide the order of operation to be left to right or right to | |
1875 left. Operators of equal precedence which are to be evaluated left to | |
1876 right are called \agterm{left associative}. Those which should be | |
1877 evaluated right to left are called \agterm{right associative}. If the | |
1878 nature of the operators is such that the question should never arise, | |
1879 they are called \agterm{non-associative}. | |
1880 | |
1881 AnaGram provides three declarations, | |
1882 \index{Precedence declarations}\index{Left}\index{Right}\index{Nonassoc} | |
1883 \agparam{left}, \agparam{right}, and \agparam{nonassoc}, which you can | |
1884 use to associate precedence levels and associativity with tokens in | |
1885 your grammar. The syntax of these statements is given in Chapter 8. | |
1886 | |
1887 When AnaGram encounters a shift-reduce \index{Conflicts}conflict in | |
1888 your grammar, it looks to see if the conflict can be resolved by using | |
1889 precedence and associativity rules. If so, it applies the rules to | |
1890 the conflict and records the resolution in the \index{Resolved | |
1891 Conflicts}\index{Window}\agwindow{Resolved Conflicts} table. | |
1892 | |
1893 There are two occasions where you should consider using precedence | |
1894 declarations in your grammar: Where rewriting the grammar to get rid | |
1895 of a conflict would obscure and complicate the grammar, and where you | |
1896 wish to try to get a more compact, slightly faster parser by using | |
1897 precedence rules for parsing arithmetic expressions. | |
1898 | |
1899 Here is an example of using precedence declarations to parse simple | |
1900 arithmetic expressions: | |
1901 | |
1902 \begin{indentingcode}{0.4in} | |
1903 unary minus = '-' | |
1904 {}[ | |
1905 left \bra '+', '-' \ket | |
1906 left \bra '*', '/' \ket | |
1907 right \bra unary minus \ket | |
1908 ] | |
1909 | |
1910 exp | |
1911 -> number | |
1912 -> unary minus, exp | |
1913 -> exp, '+', exp | |
1914 -> exp, '-', exp | |
1915 -> exp, '*', exp | |
1916 -> exp, '/', exp | |
1917 \end{indentingcode} | |
1918 | |
1919 A complete working calculator grammar using this syntax, | |
1920 \agfile{ffcalcx}, can be found in the | |
1921 \index{examples}\agfile{examples/ffcalc} directory of your | |
1922 AnaGram distribution disk. | |
1923 % XXX s/disk// | |
1924 | |
1925 \subsection{Parser Performance} | |
1926 | |
1927 The parsers AnaGram generates have been engineered to provide maximum | |
1928 performance subject to constraints of reliability and robustness. | |
1929 There are a number of steps you may take, however, to make optimize | |
1930 the performance of your parser. | |
1931 | |
1932 \paragraph{Standard Stack Frame.} If your compiler has a switch that | |
1933 allows you to turn \emph{off} the standard stack frame when you | |
1934 compile your parser, do so. Your parser uses a large number of very | |
1935 small functions which run fastest when your compiler does not use the | |
1936 standard stack frame. | |
1937 | |
1938 \paragraph{Error Diagnostic Features.} If your parser does not need | |
1939 to diagnose errors, turn off the | |
1940 \index{Diagnose errors}\index{Configuration switches} | |
1941 \agparam{diagnose errors} switch. | |
1942 Turn off the | |
1943 \index{Lines and columns}\index{Configuration switches} | |
1944 \agparam{lines and columns} switch if you don't need this information. | |
1945 If your parser doesn't need a diagnostic, and halts on syntax error, | |
1946 turn off the | |
1947 \index{Backtrack}\index{Configuration switches}\agparam{backtrack} switch. | |
1948 | |
1949 \paragraph{Anti-optimization Switches.} Certain switches de-optimize | |
1950 your parser for various reasons. These switches, | |
1951 \index{Traditional engine}\index{Configuration switches} | |
1952 \agparam{traditional engine} and | |
1953 \index{Rule coverage}\index{Configuration switches} | |
1954 \agparam{rule coverage}, | |
1955 should be turned off once you no longer need their effects. | |
1956 | |
1957 \paragraph{Other Switches.} For maximum performance you should use | |
1958 \index{Pointer input}\index{Configuration switches}\agparam{pointer | |
1959 input}. If you can guarantee that your input will not have | |
1960 out-of-range input, you can turn off | |
1961 \index{Test range}\index{Configuration switches}\index{Range} | |
1962 \agparam{test range}. | |
1963 % XXX s/out-of-range input/out-of-range characters or tokens/ | |
1964 |