Monday, February 23, 2009

Regular Expression

from JAVA Programming Cookbook

  • Quantifiers
+ Match one or more
* Match 0 or more
? Match 0 or 1
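
As a rough illustration of the three basic quantifiers, here is a minimal sketch using Pattern.matches(), which tests whether the whole input matches. The class name BasicQuantifiers and the helper fullMatch are made up for this example and are not from the book.

```java
import java.util.regex.Pattern;

class BasicQuantifiers {
    // Hypothetical helper: true if the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("ab+c", "abbbc")); // true:  + accepts one or more b
        System.out.println(fullMatch("ab+c", "ac"));    // false: + needs at least one b
        System.out.println(fullMatch("ab*c", "ac"));    // true:  * accepts zero b
        System.out.println(fullMatch("ab?c", "abc"));   // true:  ? accepts one b
        System.out.println(fullMatch("ab?c", "abbc"));  // false: ? allows at most one b
    }
}
```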

- Greedy, Reluctant, and Possessive Quantifiers

A greedy quantifier matches the longest matching sequence.
A reluctant quantifier (also called a lazy quantifier) matches the shortest matching sequence.
To make a quantifier reluctant, follow it with a ? (for example, +? or *?).
A possessive quantifier matches the longest matching sequence and will not match a shorter sequence even if doing so would enable the entire expression to succeed.
To make a quantifier possessive, follow it with a + (for example, ++ or *+).

E.g., given the input string "simple sample":
The pattern s.+e will match the longest sequence, which is the entire string because the greedy quantifier will match all characters after the first s, up to the final e.
The pattern s.+?e will match 'simple', which is the shortest match. This is because the reluctant quantifier .+? consumes as few characters as possible and stops as soon as the rest of the pattern (the e) can match.
The pattern s.++e will fail, because the possessive quantifier .++ will match all characters after the initial s, including the final e. Because it is possessive, it will not release the final e to enable the overall pattern to match. Thus, the final e will not be found and the match fails.
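
The three cases above can be checked with a short program. This is only a sketch: the class name QuantifierModes and the helper firstMatch are invented for illustration, and the helper returns null when nothing matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModes {
    // Hypothetical helper: the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "simple sample";
        System.out.println(firstMatch("s.+e", input));  // greedy:     simple sample
        System.out.println(firstMatch("s.+?e", input)); // reluctant:  simple
        System.out.println(firstMatch("s.++e", input)); // possessive: null (no match)
    }
}
```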

  • JAVA's Regular Expression API
- Pattern
The Pattern class defines no constructors. Instead, a Pattern is created by calling the compile() factory method.
- Matcher
The Matcher class also defines no constructors. Instead, a Matcher is created by calling the matcher() factory method defined by Pattern.
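
A minimal sketch of how the two factory methods are typically used together follows. The class name PatternMatcherDemo and the helper findAll are assumptions made for this example; compile(), matcher(), find() and group() are the standard java.util.regex calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternMatcherDemo {
    // Hypothetical helper: collects every non-overlapping match of regex in input.
    static List<String> findAll(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);  // static factory method on Pattern
        Matcher matcher = pattern.matcher(input);  // factory method on the Pattern instance
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {                   // advance to the next match
            matches.add(matcher.group());          // text of the current match
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "a1 b22 c333")); // [1, 22, 333]
    }
}
```

A compiled Pattern can be reused to create any number of Matcher objects, so it is usually worth compiling once when the same expression is applied to many inputs.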

  • Character Classes
\d, \D -- digits 0-9, all non-digits
\s, \S -- whitespace characters, all non-whitespace characters
\w, \W -- characters that can be part of a word (letters, digits, underscore), all non-word characters

Java supplies a large number of other character classes which have the following general form:
\p{classname}
e.g. \p{Lower}, \p{Upper}, \p{Punct}, \p{Alpha}

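To create a class that contains the intersection of two or more sets of characters, use the && operator, e.g. [\w&&[^A-Z]] (word characters except uppercase letters). Write it without spaces: inside a character class a space is normally treated as a literal character.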
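
The character classes above could be exercised like this. It is a sketch with a made-up class name (CharacterClassDemo) and helper (fullMatch). Note that each backslash is doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Hypothetical helper: true if the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("\\d+", "2009"));                  // true:  all digits
        System.out.println(fullMatch("\\p{Upper}\\p{Lower}+", "Java")); // true:  one upper, then lowers
        System.out.println(fullMatch("[\\w&&[^A-Z]]+", "regex_1"));     // true:  word chars, no uppercase
        System.out.println(fullMatch("[\\w&&[^A-Z]]+", "Regex"));       // false: R is excluded
    }
}
```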

  • Boundary Matchers
^, $ Start of line, end of line (start and end of the whole input unless MULTILINE mode is set)
\A Start of string
\b Word boundary
\B Non-word boundary
\G End of prior match
\Z End of input, or just before a final line terminator if there is one
\z End of input (the very end, after any final line terminator)
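
To see a few of these boundary matchers in action, here is a small sketch. The class name BoundaryDemo and the helper firstIndex are invented for this example; the helper returns the start index of the first match, or -1 if there is none.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: start index of the first match, or -1 if none.
    static int firstIndex(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // \b requires a word boundary, so the "cat" inside "concat" is skipped.
        System.out.println(firstIndex("\\bcat\\b", "concat cat")); // 7
        // \B requires a non-boundary, so only the "cat" inside "concat" qualifies.
        System.out.println(firstIndex("\\Bcat", "concat cat"));    // 3
        // \Z tolerates a final line terminator; \z does not.
        System.out.println(firstIndex("end\\Z", "the end\n"));     // 4
        System.out.println(firstIndex("end\\z", "the end\n"));     // -1
    }
}
```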