The patterns describing the classes of strings to be searched for are written using regular expressions in a notation similar to that used in awk(C) and sed(C). The terms ``pattern'' and ``regular expression'' are often used interchangeably. A regular expression is formed by concatenating characters and, usually, certain operators. The notation used with lex is summarized in the following list:
- A string of text characters with no operators at
all just matches the literal string. To match the word ``orange'', use:
· orange
- To match a literal string that contains spaces or
tabs, surround the expression with double quotes. To match the phrase
``red apple'', use the expression:
· "red apple"
- An expression, followed by the ``*'' operator, matches 0
or more occurrences of that expression. To match a string containing any
number of ``m'''s, or the null string, use the expression:
· m*
- An expression, followed by the ``+'' operator,
matches one or more occurrences of that expression. To match a string
containing one or more ``m'''s, but not the null string, use the
expression:
· m+
- An expression, followed by the ``?'' operator,
matches 0 or 1 occurrence(s) of that expression. This is equivalent to
saying that the expression is optional. To match one occurrence of the letter
``m'', or the null string, use the expression:
· m?
- The period character, (.), matches any single
character except a newline. To match any five-character string starting
with ``m'' and ending with ``y'', use the expression:
· m...y
- Alternation in regular expressions is supported
using the vertical bar, (|). To match either of the strings ``love'' and
``money'', use the expression:
· love|money
- Expressions may be grouped using parentheses, '('
and ')'. To match a string that consists of any number of a's and b's,
followed by a ``c'', use the expression:
· (a|b)*c
- The circumflex, (^), followed by a
pattern, signifies that the pattern must match at the beginning of a line.
The following rule matches the word ``First'' at the beginning of a line:
· ^First
- The dollar sign, ($), is appended to a
pattern to indicate that it must match at the end of a line. The following
rule matches the word ``cow'' at the end of a line:
· cow$
- To indicate that a regular expression should be
matched a specific number of times, follow that expression with a number
enclosed in curly braces, '{' and '}'. To match three repetitions of
``cd'', that is, ``cdcdcd'', use the expression:
· (cd){3}
- To specify a range of repetitions, follow the
expression by two numbers, separated by a comma and enclosed in curly
braces. To match three, four, or five repetitions of ``ab'', that is,
``ababab'', ``abababab'', or ``ababababab'', use the expression:
· (ab){3,5}
- A sequence of characters inside square brackets,
'[' and ']', matches any one character in the sequence. To match any one
of ``d'', ``g'', ``k'', and ``a'', use the expression:
· [dgka]
If the circumflex, (^), is the first character inside the square brackets, then
the pattern matches any one character that does not appear inside the brackets. In
this context, the circumflex does not signify the start of a line, as it does
when prepended to a pattern. To match any character other than ``a'', ``b'',
and ``c'', use the expression:
· [^abc]
- Ranges within a standard alphabetic or numeric
order are indicated with a hyphen, (-). The following expression
matches any digit, uppercase letter, or lowercase letter:
· [0-9A-Za-z]
- Regular expressions can be concatenated. The
resulting expression matches whatever the first expression matches
followed by whatever the second expression matches. The following regular
expression matches an identifier in many programming languages. An
identifier, thus defined, is a letter followed by zero or more letters or
digits:
· [a-zA-Z][0-9a-zA-Z]*
- To treat an otherwise special character as a
literal character, rather than as a special character, enclose the
character in quotation marks or precede it with a backslash (\). Either of
the following expressions could be used to match an asterisk followed by
one or more digits:
· \*[0-9]+
· "*"[0-9]+
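To recognize a backslash itself, either of these expressions could be used.
Backslash escapes are still processed inside double quotes, so the backslash
is doubled in the quoted form as well:
· \\
· "\\"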
- lex understands the standard C escape sequences, such as \n for the
end-of-line.
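
Most of the operators in the list above can be tried out interactively before they are written into a lex specification. The following is a minimal sketch that uses Python's re module as a stand-in, on the assumption that its syntax agrees with lex for these particular operators. It is not lex itself: the double-quoted forms such as "red apple" are specific to lex and are left out, and the names EXAMPLES, matches, and found_on_a_line are hypothetical helpers introduced only for this illustration.

```python
import re

# Patterns taken from the list above, paired with strings they should match.
# Assumption: Python's re module is only a stand-in for lex; lex-only syntax
# such as double-quoted strings is deliberately not included.
EXAMPLES = {
    "orange": ["orange"],
    "m*": ["", "mmm"],
    "m+": ["m", "mmm"],
    "m?": ["", "m"],
    "m...y": ["mercy", "money"],
    "love|money": ["love", "money"],
    "(a|b)*c": ["c", "abbac"],
    "(cd){3}": ["cdcdcd"],
    "(ab){3,5}": ["ababab", "ababababab"],
    "[dgka]": ["g"],
    "[^abc]": ["z"],
    "[0-9A-Za-z]": ["7", "Q"],
    "[a-zA-Z][0-9a-zA-Z]*": ["x", "count2"],
    r"\*[0-9]+": ["*42"],
    r"\\": ["\\"],
}


def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


def found_on_a_line(pattern: str, text: str) -> bool:
    """Return True when pattern matches somewhere in text.

    re.MULTILINE makes ^ and $ anchor at line boundaries, which is the
    behavior the list above describes for the circumflex and dollar sign.
    """
    return re.search(pattern, text, re.MULTILINE) is not None


if __name__ == "__main__":
    for pattern, samples in EXAMPLES.items():
        for sample in samples:
            print(f"{pattern!r:28} {sample!r:14} {matches(pattern, sample)}")
    print(found_on_a_line("^First", "Second\nFirst place"))
    print(found_on_a_line("cow$", "a brown cow\nmoo"))
```

Whole-string matching is used here so that each operator can be checked in isolation. lex itself scans its input and picks the longest match among all rules, so a pattern that passes this check may still lose to a competing rule in a real specification.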