Saturday, 15 October 2016

Using lex with yacc


If you work on a compiler project or develop a program to check the validity of an input language, you may want to use the UNIX system program tool yacc(CP).  lex can conveniently be used with yacc to develop compilers. Whether or not you plan to use lex with yacc, this part is useful because it covers information of interest to all lex programmers.

yylex()

No matter what kind of parser you are using, the lexical analyzer generated by lex is invoked through a call to the function yylex(). yylex is an integer-valued function. This name is convenient because the built-in function call that yacc-generated parsers use to invoke the lexical analyzer is also yylex().

Returning tokens

If your lexical analyzer is to be used along with a parser, each lex action should end with a return statement of the form:
   return token;
or
   return(token);
Here, token is an integer-valued variable, literal, or macro, or a character enclosed in single quotes. The value returned indicates to the parser what kind of token the lexical analyzer has found. The parser then resumes control and later makes another call to the lexical analyzer (via a call to yylex()) when it needs another token.
The different return values indicate to the parser whether the token is a reserved word, identifier, constant, arithmetic operator, relational operator, or other token type. In some cases, such as reserved words, and sometimes operators, the return value indicates which token was recognized: for example, a value of 16 might indicate the reserved word else in a C compiler. In the other cases, especially identifiers and constants, there needs to be another mechanism for telling the parser the exact value of the token. Consider the following portion of lex source for a lexical analyzer for a hypothetical programming language:
  1 %{
  2 include "token.defs"
  3 extern int tokval;
  4 %}
  5 %%
  6 begin                     return(BEGIN);
  7 end                       return(END);
  8 while                     return(WHILE);
  9 if                        return(IF);
 10 package                   return(PACKAGE);
 11 reverse                   return(REVERSE);
 12 loop                      return(LOOP);
 13 "("                       return('(');
 14 ")"                       return(')');
 15 [a-zA-Z][a-zA-Z0-9]*    { tokval = put_in_tabl();
 16                             return(IDENTIFIER); }
 17 [0-9]+                  { tokval = put_in_tabl();
 18                             return(INTEGER); }
 19 \+                       { tokval = PLUS;
 20                             return(ARITHOP); }
 21 -                      { tokval = MINUS;
 22                             return(ARITHOP); }
 23 >                       { tokval = GREATER;
 24                             return(RELOP); }
 25 >=                      { tokval = GREATEREQL;
 26                             return(RELOP); }
 27 %%
 28 put_in_tabl()
 29 {
 30 .
 31 .
 32 .
 33 }
The identifiers BEGINEND, and WHILE, used as return values would typically be integer-valued macros defined in some header file: in this case, token.defs (see line 2). The header file would look something like this:
   #define BEGIN   1
   #define END     2
   #define WHILE   3
   .
   .
   .
   #define RELOP   12
If it becomes necessary to change the value of some token type, then a #define statement in the header file is changed. If using yacc to generate the parser, you may insert the following statement into the definitions section of your lex source:
   #include "y.tab.h"


NOTE: yacc, with the -d option, generates y.tab.h on the basis of the yacc specification, which includes token declarations. If lex and yacc are used together, presumably the same token set is used with both. If the file y.tab.h is included in the lex source, then yacc must be run before the lex-generated file is compiled.

To indicate the reserved words in the example, the returned integer values suffice. If a token consists of a single character, like the ``('' and ``)'' tokens in lines 13 and 14 of the example, its literal value may be passed to the parser in the ``return'' statement. For the other token types (lines 15-26), the integer value identifying the token is stored in the programmer-defined variable tokval. This variable is globally defined so that the parser as well as the lexical analyzer can access it. If the parser has been generated by yacc, the variable yylval should be used for this purpose. Parsers generated by yacc look in that variable for the value of the token. yylval is defined in the file y.tab.h. It is an integer variable by default, but it can be redefined as a C union in the yacc specification.

Using a symbol table

This example illustrates the use of a symbol table. The symbol table is a data structure that is globally accessible (or at least accessible by the parser); it stores the text of most or all of the tokens found in a particular input and associates them with identifying ``indexes''. The function put_in_tabl() helps maintain the symbol table. This function may be defined in the subroutines section of the lex specification, or in an included file. The implementation details of this function are not important here, but it may be understood to carry out the following tasks:

  • When a token is recognized, put_in_tabl() takes the value in yytext and compares it with all the strings currently in the symbol table.
  • If it finds that value already in the table, put_in_tabl() returns the index of the value.The index is a unique identifying integer: no other identifier may have the same index. The index could be an array subscript, but the details of this are unimportant to us as long as the value is unique.
  • If the current token does not already appear in the symbol table, a new entry is created for it and an index is generated. (The function may store some other information in the symbol table such as whether the token is an identifier, constant, or other token type.) put_in_tabl() then returns the index.
Different parts of the program know about the text of an identifier by way of the index stored in tokval. For example, when the parser receives a value from the lexical analyzer indicating that an identifier (line 15) has been recognized, it can find the string in the symbol table that has an index equal to tokval.
Note that the example shows two ways to assign a value to tokvalput_in_tabl() may assign a symbol table index to tokval. In the last few actions of the example (line 15-26), tokval is assigned a specific integer (in the form of a defined macro) indicating which operator the analyzer recognized. For example, in the last rule in the specification (lines 25 and 26), the lexical analyzer indicates the general class of operator by returning the value RELOP to the parser. This enables the parser to check the input for syntactic correctness. Further, the lexical analyzer indicates the specific operator by assigning the value GREATEREQL to tokval. The parser has some way of associating this value with the string >= . Quite possibly the symbol table will have a set of base entries that includes >= . The definition of GREATEREQL, and the other predefined token indexes will not necessarily appear in the same set of definitions as the token types.

No comments:

Post a Comment