Mercurial > ~dholland > hg > ag > index.cgi
comparison doc/misc/html/isdp.html @ 0:13d2b8934445
Import AnaGram (near-)release tree into Mercurial.
author | David A. Holland |
---|---|
date | Sat, 22 Dec 2007 17:52:45 -0500 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
-1:000000000000 | 0:13d2b8934445 |
---|---|
1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> | |
2 <HTML> | |
3 <HEAD> | |
4 <TITLE>Introduction to Syntax Directed Parsing</TITLE> | |
5 </HEAD> | |
6 | |
7 <BODY BGCOLOR="#ffffff" BACKGROUND="tilbl6h.gif" | |
8 TEXT="#000000" LINK="#0033CC" | |
9 VLINK="#CC0033" ALINK="#CC0099"> | |
10 | |
11 <P> | |
12 | |
13 <IMG ALIGN="right" SRC="images/agrsl6c.gif" ALT="AnaGram" | |
14 WIDTH=124 HEIGHT=30> | |
15 <BR CLEAR="all"> | |
16 Back to <A HREF="index.html">Index</A> | |
17 </P> | |
18 <IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" | |
19 WIDTH=1010 HEIGHT=2 > | |
20 | |
21 <BR CLEAR="all"> | |
22 | |
23 | |
24 <H1 ALIGN="LEFT">Introduction to Syntax Directed Parsing</H1> | |
25 <IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" | |
26 WIDTH=1010 HEIGHT=2 > | |
27 </P> | |
28 | |
29 <P> | |
30 (Adapted from Chapter 4, AnaGram User's Guide)</P> | |
31 </P> | |
32 <UL> | |
33 <LI SRC="#What"><A HREF="#What">What is Syntax Directed Parsing?</A></LI> | |
34 <LI><A HREF="#Describing">Describing an Input Sequence</A></LI> | |
35 <LI><A HREF="#How">How a Parser Works</A></LI> | |
36 <LI><A HREF="#Note">A Note on Notation</A></LI> | |
37 <LI><A HREF="#Reduction">Reduction Procedures</A></LI> | |
38 <LI><A HREF="#Building">Building a Parser</A> | |
39 <UL> | |
40 <LI SRC="#Invoking"><A HREF="#Invoking">Invoking a Parser</A></LI> | |
41 <LI><A HREF="#Communicating">Communicating with a Parser</A></LI> | |
42 <LI><A HREF="#Parser">Parser Input</A></LI> | |
43 <LI><A HREF="#Error">Error Handling</A></LI> | |
44 </UL> | |
45 </LI> | |
46 </UL> | |
47 <BR> | |
48 | |
49 <H2><A NAME="What">What is Syntax Directed Parsing?</A></H2> | |
50 <P> | |
51 Every programmer has to deal with input data. Usually processing input data | |
52 depends on what has preceded, and often even on what follows, the input under | |
53 consideration. Keeping track of these dependencies in order to know how to | |
54 process the data is called parsing. It is often relatively easy to keep track of | |
55 simple dependencies when first writing a program. As the program develops, as | |
56 new features are added, as bugs are fixed, the dependencies often stop being | |
57 simple. Input processing becomes a headache, since it is hard to keep track of | |
58 or even identify all the particular cases. Changes to the program cause | |
59 unexpected problems and program maintenance threatens to get out of control.</P> | |
60 <P> | |
61 Syntax directed parsing is a technique for addressing these difficulties. In | |
62 syntax directed parsing, the input part of a program is constructed | |
63 automatically, by means of a standard algorithm, from a high level description | |
64 of the input data structure. Code to perform any required processing of the data | |
65 is attached to the description in a convenient way.</P> | |
66 <P> | |
67 The description, being non-procedural, is generally easier to write and to | |
68 modify than the equivalent program code, and much less likely to harbor bugs. It | |
69 is easier to read, and easier to maintain. It is easy to re-use in other | |
70 programs requiring the same input, thus encouraging uniform interfaces. The | |
71 technique also simplifies the overall program by decoupling the input and | |
72 processing components and providing a natural, modular structure.</P> | |
73 <P> | |
74 To use syntax directed parsing you first write the input data description, | |
75 called a <A HREF="gloss.html#Grammar">grammar</A>. A file which contains a grammar is called a syntax | |
76 file. A parser generator, such as AnaGram, then can create, from the syntax | |
77 file, a function (or program) called a <A HREF="gloss.html#Parser">parser</A>, written in C or C++. The parser | |
78 keeps track of all the dependencies in your input, and calls certain functions, | |
79 <A HREF="gloss.html#ReductionProcedure">reduction procedures</A>, to deal with specific units or sequences of data as they | |
80 are encountered.</P> | |
81 <P> | |
82 Reduction procedures are the code you write to process your data. In your | |
83 grammar they are linked to the structures in your input, so that your parser | |
84 will call them at precisely the right times with precisely the right data. Note | |
85 that with this technique you need only provide a non-procedural <I>description</I> | |
86 of the input. The details of flow of control are handled entirely by the parser. | |
87 When you write reduction procedures, you can concentrate entirely on what you | |
88 have to do with the data. You don't have to encumber your code with switches and | |
89 tests to determine the structure of your input.</P> | |
90 <P> | |
91 The parsers you build using a parser generator such as AnaGram may be complete | |
92 stand-alone programs, or they may serve as input routines for a more extensive | |
93 program. Some programs may even use more than one parser.</P> | |
94 <BR> | |
95 | |
96 <H2><A NAME="Describing">Describing an Input Sequence</A></H2> | |
97 <P> | |
98 Writing a grammar consists of describing the acceptable input sequences for your | |
99 program. The vehicle for describing an input sequence is called a <A HREF="gloss.html#Production">production</A>. | |
100 Productions show how a logical component of the input can be made up of a | |
101 sequence of more elementary components. A production that describes a date might | |
102 be written:</P> | |
103 <PRE> | |
104 date | |
105 -> name of month, day, comma, year </PRE> | |
106 <P> | |
107 The components of the input are called <A HREF="gloss.html#Token">tokens</A>. The sequence of tokens on the | |
108 side of the production is called a <A HREF="gloss.html#GrammarRule">grammar rule</A>, or rule for short. The | |
109 individual tokens on the right side of the rule are also called <A HREF="gloss.html#RuleElement">rule elements</A>. | |
110 </P> | |
111 <P> | |
112 The token on the left side of the production is called the <A HREF="gloss.html#ReductionToken">reduction token</A> for | |
113 the rule. Tokens may have semantic values, as distinguished from syntactic | |
114 values, which you can use in your <A HREF="gloss.html#ReductionProcedure">reduction procedures</A>. For instance, the value | |
115 of name of month could be an integer in the range zero to eleven, or it could be | |
116 a pointer to an ascii string. The value of day could be an integer in the range | |
117 one to thirty-one.</P> | |
118 <P> | |
119 A <A HREF="gloss.html#Grammar">grammar</A> consists of a number of such <A HREF="gloss.html#Production">productions</A>, each of which describes some | |
120 component of the input in terms of other components. It does not take very many | |
121 productions to describe quite complex input streams. A grammar for the C | |
122 language, for instance, requires about two hundred productions.</P> | |
123 <P> | |
124 Many people find the term "production" quite confusing. It comes from theoretical | |
125 linguistics where it is used to describe how one may produce sequences which | |
126 correspond to a set of grammatical rules. Ironically, the major usage of the | |
127 idea has been in parsing where the interest is not so much in creating sequences | |
128 which satisfy the grammatical rules as in decoding and analyzing such sequences. | |
129 Nonetheless, it is convenient, in the above example, to say that the token date | |
130 produces a sequence of tokens consisting of name of month, day, comma, and year. | |
131 We also say that the sequence reduces to date.</P> | |
132 <P> | |
133 There may be more than one production to describe a given component, if there is | |
134 more than one way it may be represented. For instance,</P> | |
135 <PRE> | |
136 date | |
137 -> day, name of month, year </PRE> | |
138 <P> | |
139 describes another common way of writing a date. In other words, a <A HREF="gloss.html#ReductionToken">reduction | |
140 token</A> may produce a number of different <A HREF="gloss.html#GrammarRule">grammar rules</A>.</P> | |
141 <P> | |
142 Tokens which appear on the left side of one or more productions are called | |
143 <A HREF="gloss.html#Nonterminal">nonterminal tokens</A>. Those which appear <I>only</I> on the right sides of | |
144 productions are called <A HREF="gloss.html#Terminal">terminal tokens</A>. Terminal tokens are the units which | |
145 actually appear physically in the input. Nonterminal tokens are identified when | |
146 a sequence of tokens that matches the right side of a production is seen in the | |
147 input. When AnaGram analyzes a <A HREF="gloss.html#Grammar">grammar</A>, it assigns a unique token number to each | |
148 token it finds in the grammar.</P> | |
149 <P> | |
150 Nonterminal tokens, such as date in the example above, may appear in any grammar | |
151 rule just as though they were input tokens. The token on the left side of a | |
152 production can even appear on the right side as well. Such a production is | |
153 called a recursive production. When a nonterminal token appears on the right | |
154 side of a production, it may be represented in this context by <I>any</I> of | |
155 the <A HREF="gloss.html#GrammarRule">grammar rules</A> it produces. Grammars described in this manner are called | |
156 <A HREF="gloss.html#ContextFreeGrammar">context free grammars</A> since there is no contextual constraint on which of the | |
157 rules that a token produces can appear in any given context.</P> | |
158 <P> | |
159 Recursive productions may be either left recursive or right recursive. Left | |
160 recursive productions are those where the recursively defined <A HREF="gloss.html#Nonterminal">nonterminal</A> | |
161 appears as the first <A HREF="gloss.html#RuleElement">element</A> in the recursive rule. Right recursive productions | |
162 are those where it is the last element. If it appears anywhere between, the | |
163 production is said to be center recursive.</P> | |
164 <P> | |
165 Any nonterminal token which has a recursive production must also have at least | |
166 one simple, non-recursive production. Otherwise, it is not possible to create a | |
167 finite sequence of terminal tokens from the nonterminal token.</P> | |
168 <P> | |
169 Recursion may also occur implicitly in a <A HREF="gloss.html#Grammar">grammar</A> when one of the tokens on the | |
170 right side of a production itself has a production involving the token on the | |
171 left. Such implicit recursion sometimes may involve numerous levels of | |
172 productions. Implicit recursion occurs most commonly in describing constructs | |
173 such as arithmetic expressions or the block structure of programming languages.</P> | |
174 <P> | |
175 Clearly, grammars can accommodate multiple levels of structure in the input | |
176 sequences they describe. There must, at the top, be a single token which | |
177 encompasses the entire input. This special token is variously called the <A HREF="gloss.html#GrammarToken">grammar | |
178 token</A>, the goal token or the start token.</P> | |
179 <P> | |
180 AnaGram allows you to specify <A HREF="gloss.html#Terminal">terminal tokens</A> explicitly as ascii characters, or | |
181 even as <A HREF="gloss.html#CharacterSets">sets of ascii characters</A>, right in the grammar. Thus, you may write | |
182 '0-9' to represent the set of ascii digits, or 'A-Z' to represent the set of | |
183 upper case letters. The semantic value of such a <A HREF="gloss.html#Token">token</A> is the ascii character | |
184 code that actually appears in the input stream. If the various sets you use in | |
185 your grammar overlap, they may not properly represent terminal tokens. In this | |
186 case, AnaGram automatically extends your <A HREF="gloss.html#Grammar">grammar</A> appropriately. </P> | |
187 <BR> | |
188 | |
189 <H2><A NAME="How">How a Parser Works</A></H2> | |
190 <P> | |
191 The aim of a <A HREF="gloss.html#Parser">parser</A> is to match its input with the full syntactic structure | |
192 specified by the <A HREF="gloss.html#Production">productions</A> which make up the grammar. The primary component of | |
193 a parser is an input buffer, sometimes thought of as a stack, into which tokens | |
194 are shifted, or stored sequentially, as they are encountered in the input. At | |
195 the same time that a token is shifted into the input buffer, its semantic value | |
196 is pushed onto the value stack. A token is not shifted into the buffer unless it | |
197 "makes sense", that is, unless it is consistent with the rules of the | |
198 <A HREF="gloss.html#Grammar">grammar</A> and with the input that has preceded it. If a token does not make sense, | |
199 the parser signals a syntax error. In order to determine whether a token makes | |
200 sense, the parser has a sort of decision table which provides a list of | |
201 acceptable tokens for each of a number of states. The table also specifies what | |
202 the parser is to do with each acceptable token. When the table indicates that a | |
203 token is to be shifted into the buffer, it also specifies a new state. The | |
204 parser stacks the current state on a state stack and jumps to the new state. | |
205 Thus every time a token is shifted into the input buffer, a state number is | |
206 pushed onto the state stack. For each state of the parser, excepting only the | |
207 initial state, there is a unique token which will cause a jump to that state. | |
208 This token is called the <A HREF="gloss.html#CharacteristicToken">characteristic token</A> for the state.</P> | |
209 <P> | |
210 When the rightmost, or most recent, tokens in the input buffer match the right | |
211 side of a production precisely, the parser <I>may</I> replace the tokens that | |
212 match the rule with a single token, the token on the left side of the | |
213 production. This process of replacing a sequence of tokens with a single token | |
214 is called reduction. The token that replaces the sequence of tokens is called | |
215 the <A HREF="gloss.html#ReductionToken">reduction token</A>.</P> | |
216 <P> | |
217 The actual mechanism of the reduction is quite important. At the same time that | |
218 input tokens are removed from the input buffer, state numbers are popped from | |
219 the state stack, so that when all input tokens matching the rule have been | |
220 removed, the parser state has been restored to the value it had at the time the | |
221 first token in the rule was seen. As state numbers are popped from the state | |
222 stack, token values are popped from the value stack. If the rule has a <A HREF="gloss.html#ReductionProcedure">reduction | |
223 procedure</A>, temporary variables are loaded with the values popped from the stack | |
224 and the reduction procedure is called. The reduction token is now shifted into | |
225 the input buffer just as though it were an input token. If the reduction | |
226 procedure returned a result it is shifted into the value stack as the value of | |
227 the reduction token. The parser stacks the current state again and jumps to a | |
228 new state as determined by the parser tables.</P> | |
229 <P> | |
230 If there are no errors in your input, when the last token has been read from the | |
231 input, shifted into the input buffer, and reductions performed, there will be | |
232 precisely one token in the input buffer: the <A HREF="gloss.html#GrammarToken">grammar, or goal, token</A> which | |
233 describes your entire input. At this point your parser declares itself finished.</P> | |
234 <P> | |
235 Reductions do not necessarily occur every time a rule matches the tokens in the | |
236 input buffer. If the reduction token does not make sense, that is, if it is not | |
237 consistent with the rules of the <A HREF="gloss.html#GrammarToken">grammar</A> and with the input that has preceded | |
238 it, the parser will not perform the reduction. Suppose there are tokens in the | |
239 input which match one of the rules given above for date. A reduction will not | |
240 occur unless date is one of the tokens the parser is actually looking for at | |
241 that stage in the input.</P> | |
242 <P> | |
243 Even when the reduction token would make sense, there are still situations where | |
244 the reduction would not take place. Suppose a grammar includes the two | |
245 <A HREF="gloss.html#Production">productions</A> given above for date as well as the following:</P> | |
246 <PRE> | |
247 | |
248 date | |
249 -> name of month, day </PRE> | |
250 <P> | |
251 This production is the same as the first, but with no year specification. When a | |
252 parser directed by this grammar has encountered name of month and day, it can't | |
253 tell without looking further whether it has a short form or a long form date. In | |
254 such a circumstance, the parser looks at the next following token, which is | |
255 called a <A HREF="gloss.html#Lookahead">lookahead token</A>. If the lookahead token is a comma, then, in the | |
256 absence of other productions, the input is a long form date. If the lookahead | |
257 token is not a comma, then the input is certainly not a long form date, and it | |
258 is proper to reduce the short form production.</P> | |
259 <P> | |
260 Suppose the lookahead token were a comma and the grammar were also to contain | |
261 the following production:</P> | |
262 <PRE> | |
263 | |
264 appointment | |
265 -> date, comma, time </PRE> | |
266 <P> | |
267 Since a comma can follow date, according to this rule, and can also follow day | |
268 according to the first production, it is impossible to determine, simply by | |
269 looking at the <A HREF="gloss.html#Lookahead">lookahead token</A>, whether the date was being given in short form | |
270 or long form. One would have to look beyond the comma to see if what follows the | |
271 comma matches the rules for time or for year. Although it is possible to build | |
272 parsers which can do this, it is not generally feasible. Situations of this sort | |
273 are called <A HREF="gloss.html#Conflict">conflicts</A>. AnaGram warns you about the conflicts in your <A HREF="gloss.html#Grammar">grammar</A> when | |
274 it analyzes your grammar, and provides numerous facilities to help you correct | |
275 them.</P> | |
276 <BR> | |
277 | |
278 <H2><A NAME="Note">A Note on Notation</A></H2> | |
279 <P> | |
280 <A HREF="gloss.html#ContextFreeGrammar">Context free grammars</A> have been traditionally represented in the literature | |
281 using <A HREF="gloss.html#BNF">Backus-Naur Form</A>, or BNF. In Backus-Naur Form, certain characters, called | |
282 metacharacters, are used to punctuate <A HREF="gloss.html#Production">productions</A> and all other printable | |
283 characters are taken to represent themselves literally. Named tokens are denoted | |
284 by enclosing the name within angle brackets <CODE>< ></CODE> . The left side of a | |
285 production is distinguished from the right side by the | |
286 characters <CODE> ::= </CODE>. | |
287 If several productions have the same left side, the vertical | |
288 bar <CODE> | </CODE> is used to separate them. The <A HREF="gloss.html#RuleElement">elements</A> of a <A HREF="gloss | |
289 .html#GrammarRule">grammar rule</A> are simply juxtaposed | |
290 to indicate that one follows another. Blanks are ignored. Thus, in BNF, the | |
291 first production given for date, above, would be:</P> | |
292 <PRE> | |
293 <date> ::= <name of month> <day>, <year> | |
294 </PRE> | |
295 <P> | |
296 AnaGram uses a notation more consonant with ordinary programming usage. Thus | |
297 token names need not be bracketed and literal characters must be appropriately | |
298 quoted. The elements of rules are joined by commas. Using this approach, there | |
299 is no need for metacharacters and it becomes possible to make a number of useful | |
300 extensions to the notation.</P> | |
301 <BR> | |
302 | |
303 <H2><A NAME="Reduction">Reduction Procedures</A></H2> | |
304 <P> | |
305 Of course, the reason for parsing an input stream is to interpret the data in | |
306 the stream and to process it in some useful manner. The primary tool for doing | |
307 this is the <A HREF="gloss.html#ReductionProcedure">reduction procedure</A>. A reduction procedure is a piece of C code that | |
308 is executed when a particular <A HREF="gloss.html#GrammarRule">grammar rule</A> is reduced. Often, a reduction | |
309 procedure calculates a value which becomes the semantic value of the reduction | |
310 token. The input to the reduction procedure consists of the values of the tokens | |
311 that make up the grammar rule.</P> | |
312 <P> | |
313 AnaGram allows you to assign C variable names to the tokens in a grammar rule, | |
314 so that you can refer to them in the reduction procedure. To assign a C variable | |
315 name, simply follow the token in the rule with a colon and the C variable name. | |
316 Simple reduction procedures can be written as C or C++ expressions. At the end | |
317 of the rule, write an equal sign and an expression, terminated by a semicolon. | |
318 For example:</P> | |
319 <PRE> | |
320 (int) hex digit | |
321 -> '0-9':d =d-'0'; | |
322 -> 'a-f':d =d-'a'+10; | |
323 -> 'A-F':d =d-'A'+10; </PRE> | |
324 <P> | |
325 When any one of these rules is matched, the value of the token in the rule is | |
326 assigned to the temporary variable. The expression to the right of the equal | |
327 sign is evaluated and the value of the expression is stored as the value of hex | |
328 digit, which has been declared to be an int. These <A HREF="gloss.html#Production">productions</A> define | |
329 hexadecimal digits as ascii characters, and calculate the binary value of the | |
330 digit in terms of the ascii character code, d. The binary value becomes the | |
331 value of hex digit. Hexadecimal digits can be combined to make hexadecimal | |
332 numbers by writing the following productions:</P> | |
333 <PRE> | |
334 (int) hex number | |
335 -> hex digit | |
336 -> hex number:n, hex digit:d =16*n+d; </PRE> | |
337 <P> | |
338 There are several important points to notice in this example. First, <A HREF="gloss.html#ReductionProcedure">reduction | |
339 procedures</A> are executed "from the bottom up". That is, the reduction | |
340 procedure for hex digit is executed before any reduction procedure for hex | |
341 number. Second, if there is no reduction procedure for a production, the value | |
342 of the first token in the rule is assigned to the reduction token. Thus, it is | |
343 not necessary to provide a reduction procedure in the first production for hex | |
344 number. Third, the reduction procedures for recursive productions are always | |
345 executed <I>after</I> the reduction procedure, if any, for the non-recursive | |
346 production which begins the recursion. Finally, when an input sequence is | |
347 described using left recursion, as in this example, the <A HREF="gloss.html#RuleElement">elements</A> of the sequence | |
348 are processed left to right.</P> | |
349 <P> | |
350 If you wish to process the elements of a sequence right to left, you may use | |
351 right recursion. For example, it is sometimes convenient to define the fraction | |
352 part of a decimal number thus:</P> | |
353 <PRE> | |
354 (double) fraction part | |
355 -> '0-9':d =(d-'0')/10.; | |
356 -> '0-9':d,fraction part:f =(d-'0'+f)/10.;</PRE> | |
357 <P> | |
358 In this case the leading digits are stored temporarily in the parse stack, and | |
359 then the fraction part is evaluated right to left only when the last digit has | |
360 been found.</P> | |
361 <P> | |
362 Reduction procedures can be more complex than simple expressions. After the | |
363 equal sign you may include an arbitrary block of C or C++ code, enclosed in | |
364 braces, { }. To return a semantic value for the reduction token simply use a | |
365 return statement. Of course, reduction procedures have the full resources of C | |
366 or C++ at their disposal. They may set and interrogate global variables and may | |
367 call functions freely.</P> | |
368 <P> | |
369 Since the reduction procedures you write will probably need some support code, | |
370 such as #include statements and declarations, you may incorporate C or C++ code | |
371 into your syntax file at any point. You need only enclose it in braces ({}). | |
372 Such code is called embedded C. All embedded C code is also copied to the parser | |
373 file, and <I>precedes</I> all of your reduction procedures.</P> | |
374 <BR> | |
375 | |
376 <H2><A NAME="Building">Building a Parser</A></H2> | |
377 <P> | |
378 In order to round out a <A HREF="gloss.html#Parser">parser</A> into a functioning program it needs input | |
379 procedures, as well as error diagnosis and recovery capabilities. AnaGram has a | |
380 number of options available which give you a high degree of flexibility in | |
381 configuring a parser to suit your particular needs. All of the options are | |
382 provided with reasonable defaults, so that you can safely disregard any option | |
383 until you need the features it provides.</P> | |
384 <BR> | |
385 | |
386 <H4><A NAME="Invoking">Invoking a Parser</A></H4> | |
387 <P> | |
388 Normally, AnaGram configures <A HREF="gloss.html#Parser">parsers</A> as functions which you can call from | |
389 elsewhere in your program. In this situation, you call the parser, it processes | |
390 its input, and returns either when it has finished or cannot proceed because of | |
391 errors. Alternatively, if you set the event driven configuration switch, your | |
392 parser will be configured so that you have two procedures to call: an | |
393 initializer and a parser. In the event driven configuration you start the parse | |
394 by calling the initializer and then you call the parser once for each unit of | |
395 input. Using the event driven mode makes it quite easy to configure a parser as | |
396 a filter and to chain several parsers together so that the output from one | |
397 parser is the input to the next. Such multi-stage parsing is a convenient way to | |
398 deal with complex input that is not context free.</P> | |
399 <BR> | |
400 | |
401 <H4><A NAME="Communicating">Communicating with a Parser</A></H4> | |
402 <P>The complete status of your parser is contained in a single data structure | |
403 called a parser control block. All communications with a parser take place via | |
404 the parser control block. Input procedures must place input data in the | |
405 appropriate field in the parser control block. When the parse is complete or | |
406 encounters an error, the results of the parse may be found in the parser control | |
407 block. When AnaGram builds a parser it includes, in the header file it | |
408 generates, a typedef statement which defines the structure of the parser control | |
409 block.</P> | |
410 <BR> | |
411 | |
412 <H4><A NAME="Parser">Parser Input</A></H4> | |
413 <P>The input to your <A HREF="gloss.html#Parser">parser</A> may be either characters read directly from an | |
414 input stream, or tokens accumulated by a pre-processor or <A HREF="gloss.html#LexicalScanner">lexical scanner</A>. The | |
415 way you provide input to your parser depends on how your <A HREF="gloss.html#Grammar">grammar</A> defines input | |
416 tokens and also on whether or not you have requested an event driven parser.</P> | |
417 <P> | |
418 If your parser is event driven, you provide its input by storing the input code | |
419 and the input value, if any, into the parser control block and calling the | |
420 parser.</P> | |
421 <P> | |
422 If you have set the pointer input configuration switch in your syntax file, you | |
423 simply initialize the pointer field in your parser control block before you call | |
424 your parser. Your parser will then read its input directly from memory by simply | |
425 incrementing the pointer as necessary.</P> | |
426 <P> | |
427 Otherwise, your parser will invoke a macro called GET_INPUT every time it needs | |
428 more input. You may define GET_INPUT according to your needs. You can define it | |
429 so that it calls an input function, or you can define it so that it executes | |
430 in-line code each time it is invoked. Your GET_INPUT macro should store its | |
431 input code in the input_code field of the parser control block. If you do not | |
432 write a GET_ INPUT macro, AnaGram will provide one which will read characters | |
433 from stdin.</P> | |
434 <P> | |
435 If your <A HREF="gloss.html#Grammar">grammar</A> does not define <A HREF="gloss.html#Terminal">terminal tokens</A> in terms of ascii characters or | |
436 external token numbers, your GET_INPUT will have to determine the appropriate | |
437 internal token number for each input token. To assist you in determining these | |
438 token numbers AnaGram provides a typedef enum statement in the header file. You | |
439 can then use named constants to specify the internal token numbers for the input | |
440 tokens.</P> | |
441 <BR> | |
442 | |
443 <H4><A NAME="Error">Error Handling</A></H4> | |
444 <P>Your <A HREF="gloss.html#Parser">parser</A> must be prepared to deal with erroneous input. There are two | |
445 aspects to error handling: diagnosing the error, and recovering from the error. | |
446 On encountering an error, your parser will invoke a macro called SYNTAX_ ERROR. | |
447 If you do not provide a definition for SYNTAX_ERROR, AnaGram will provide a | |
448 simple error diagnostic. AnaGram can also provide automatic error diagnoses | |
449 which pinpoint the location of the error.</P> | |
450 <P> | |
451 AnaGram provides two options for error recovery: error | |
452 token <A HREF="gloss.html#Resynchronization">resynchronization</A> | |
453 and automatic resynchronization. These are techniques for getting your parser | |
454 back in synch with its input so it can proceed after encountering an error. | |
455 Normally, if you do not select one of these recovery techniques, your parser | |
456 will terminate when it encounters an error; however, you may override this | |
457 default if you wish and provide your own recovery technique.</P> | |
458 | |
459 | |
460 <IMG ALIGN="bottom" SRC="images/rbline6j.gif" ALT="----------------------" | |
461 WIDTH=1010 HEIGHT=2 > | |
462 <P> | |
463 <IMG ALIGN="right" SRC="images/pslrb6d.gif" ALT="Parsifal Software" | |
464 WIDTH=181 HEIGHT=25> | |
465 <BR CLEAR="right"> | |
466 | |
467 Back to <A HREF="index.html">Index</A> | |
468 <P> | |
469 <ADDRESS><FONT SIZE="-1"> | |
470 AnaGram parser generator - documentation<BR> | |
471 Introduction to Syntax Directed Parsing<BR> | |
472 Copyright © 1993-1999, Parsifal Software. <BR> | |
473 All Rights Reserved.<BR> | |
474 </FONT></ADDRESS> | |
475 | |
476 </BODY> | |
477 </HTML> | |
478 | |
479 |