Saturday, 15 October 2016

Disambiguating rules



lex follows two rules to resolve ambiguities that may arise from the lex specification. These rules are:

  • when a string of input text can match two or more rules in a specification, the first rule in the specification is the one which is matched, and the one whose action is executed
  • if a string in the input text can match a rule in the specification, but a longer string which has the first string as its prefix will also match a rule, then the longer string matches
A situation of the type addressed by the first rule could arise if a specification had patterns to match keywords and identifiers:
   %{
   #include "y.tab.h"
   %}
   %%
   START                   { return(STARTTOK); }
   BREAK                   { return(BREAKTOK); }
   END                     { return(ENDTOK); }
   [a-zA-Z][a-zA-Z0-9]*    { return(yytext); }
The string "START" could be matched by both the first or the fourth rule: because START is a reserved word, you want only the action associated with the first rule to be executed. By placing the rule for START and the other reserved words before the rule for identifiers, you ensure that reserved words are recognized as such.
The second kind of ambiguity could arise, for example, if the input text was coded in a programming language that had operators that were similar. Part of a lex specification for the C language might look like this:
   "+"             {return(PLUS); }
   "++"            {return(INC); }
The lexical analyzer should recognize the increment operator "++", not the addition operator "+", when it reads the following statement:
   i++;

No comments:

Post a Comment