Saturday, 15 October 2016

Regular expressions

The patterns describing the classes of strings to be searched for are written as regular expressions, in a notation similar to that used in awk(C) and sed(C). The terms ``pattern'' and ``regular expression'' are often used interchangeably. A regular expression is formed by concatenating characters and, usually, certain operators. The notation used with lex is summarized in the following list:

  • A string of text characters with no operators at all just matches the literal string. To match the word ``orange'', use:
·           orange
  • To match a literal string that contains spaces or tabs, surround the expression with double quotes. To match the phrase ``red apple'', use the expression:
·           "red apple"
  • An expression followed by the ``*'' operator matches 0 or more occurrences of that expression. To match a string consisting of any number of ``m'''s, or the null string, use the expression:
·           m*
  • An expression followed by the ``+'' operator matches one or more occurrences of that expression. To match a string consisting of one or more ``m'''s, but not the null string, use the expression:
·           m+
  • An expression, followed by the ``?'' operator, matches 0 or 1 occurrence(s) of that expression. This is equivalent to saying that the expression is optional. To match one occurrence of the letter ``m'', or the null string, use the expression:
·           m?
  • The period character, (.), matches any single character except a newline. To match any five-letter string starting with ``m'' and ending with ``y'', use the expression:
·           m...y
  • Alternation in regular expressions is supported using the vertical bar, (|). To match either of the strings ``love'' and ``money'', use the expression:
·           love|money
  • Expressions may be grouped using parentheses, '(' and ')'. To match a string that consists of any number of a's and b's, followed by a ``c'', use the expression:
·           (a|b)*c
  • The circumflex, (^), followed by a pattern, signifies that the pattern must match at the beginning of a line. The following rule matches the word ``First'' at the beginning of a line:
·           ^First
  • The dollar sign, ($), is appended to a pattern to indicate that it must match at the end of a line. The following rule matches the word ``cow'' at the end of a line:
·           cow$
  • To indicate that a regular expression should be matched a specific number of times, follow that expression with a number enclosed in curly braces, '{' and '}'. To match three repetitions of ``cd'', that is, ``cdcdcd'', use the expression:
·           (cd){3}
  • To specify a range of repetitions, follow the expression by two numbers, separated by a comma and enclosed in curly braces. To match three, four, or five repetitions of ``ab'', that is, ``ababab'', ``abababab'', or ``ababababab'', use the expression:
·           (ab){3,5}
  • A sequence of characters inside square brackets, '[' and ']', matches any one character in the sequence. To match any one of ``d'', ``g'', ``k'', and ``a'', use the expression:
·           [dgka]
    If the circumflex, (^), is the first character inside the square brackets, then the pattern matches any character that does not appear inside the brackets. In this context, the circumflex does not signify the start of a line, as it does when prepended to a pattern. To match any character other than ``a'', ``b'', and ``c'', use the expression:
·           [^abc]
  • Ranges within a standard alphabetic or numeric order are indicated with a hyphen, (-). The following expression matches any digit, uppercase letter, or lowercase letter:
·           [0-9A-Za-z]
  • Regular expressions can be concatenated. The resulting expression matches whatever the first expression matches followed by whatever the second expression matches. The following regular expression matches an identifier in many programming languages. An identifier, thus defined, is a letter followed by zero or more letters or digits:
·           [a-zA-Z][0-9a-zA-Z]*
  • To treat an otherwise special character as a literal character, enclose it in double quotes or precede it with a backslash (\). Either of the following expressions could be used to match an asterisk followed by one or more digits:
·           \*[0-9]+
·           "*"[0-9]+
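    To recognize a backslash itself, precede it with another backslash. The backslash remains an escape character inside double quotes, so it must be doubled there as well. Either of these expressions could be used:
·           \\
·           "\\"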
  • lex understands the standard C escape sequences, such as \n for a newline.
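
The operators in the list above can be tried out quickly without writing a lex specification. The following is a minimal sketch that uses Python's re module as a stand-in; this is an assumption of the example and not how lex itself is run. Most of the operators behave the same way, but Python has no double-quote quoting operator, so re.escape is used in place of a quoted lex string. The helper name matches is hypothetical.

```python
import re


def matches(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


# Repetition operators: *, + and ?
assert matches(r"m*", "")            # zero occurrences are allowed
assert matches(r"m+", "mmm")
assert not matches(r"m+", "")        # + needs at least one occurrence
assert matches(r"m?", "m")

# Any single character, alternation and grouping
assert matches(r"m...y", "money")
assert matches(r"love|money", "love")
assert matches(r"(a|b)*c", "abbac")

# Counted repetition
assert matches(r"(cd){3}", "cdcdcd")
assert matches(r"(ab){3,5}", "abababab")
assert not matches(r"(ab){3,5}", "abab")

# Character classes, negated classes and ranges
assert matches(r"[dgka]", "k")
assert matches(r"[^abc]", "d")
assert not matches(r"[^abc]", "a")
assert matches(r"[a-zA-Z][0-9a-zA-Z]*", "x25")
assert not matches(r"[a-zA-Z][0-9a-zA-Z]*", "9lives")

# Escaping: \* is a literal asterisk; re.escape stands in for lex's "..." quoting
assert matches(r"\*[0-9]+", "*42")
assert matches(re.escape("*") + r"[0-9]+", "*42")
assert matches(r"\\", "\\")          # two backslashes in the pattern match one

# Anchors: with re.MULTILINE, ^ and $ apply to each line of the input
text = "First things first\nhow now brown cow\n"
assert re.search(r"^First", text, re.MULTILINE)
assert re.search(r"cow$", text, re.MULTILINE)
assert not re.search(r"^cow", text, re.MULTILINE)
```

The sketch uses re.fullmatch so that each pattern has to account for the entire test string, which makes the effect of each operator easy to see. lex instead scans its input and picks the longest match among all rules, so results in a real scanner also depend on the other rules in the specification.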

