3. Regex
• Regular expressions are a way to describe a
set of strings based on common
characteristics shared by each string in the set.
• They can be used to search, edit, or
manipulate text and data.
• They are created with a specific syntax.
4. Regex in Java
• Regex syntax in Java is similar to Perl's.
• The java.util.regex package primarily consists
of three classes: Pattern, Matcher,
and PatternSyntaxException.
5. Pattern & PatternSyntaxException
• You can think of a Pattern as the compiled regular
expression wrapper object.
• You get a Pattern by calling:
– Pattern.compile("RegularExpressionString");
• If "RegularExpressionString" is not a valid regex,
compile throws a PatternSyntaxException.
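As a minimal sketch of the two bullets above, the class below compiles a regex and reports whether compilation succeeded. The class name PatternCompileDemo and the helper isValidRegex are made up for this illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PatternCompileDemo {
    // Hypothetical helper: true if Pattern.compile accepts the regex.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false; // the regex string was malformed
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("foo[1-5]")); // well-formed character class
        System.out.println(isValidRegex("foo[1-5"));  // character class is never closed
    }
}
```

PatternSyntaxException is unchecked, so the try/catch is only needed when the regex comes from an untrusted source such as user input.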
6. Matcher
• You can think of a Matcher as the search result
object.
• You can get a Matcher object by calling:
– myPattern.matcher("StringToBeSearched");
• You use it by calling:
– myMatcher.find()
• Then call any number of methods on
myMatcher to see attributes of the result.
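The sequence above (get a matcher, call find(), then query the result) might look like the sketch below. MatcherDemo and firstMatch are illustrative names, not part of the slides.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherDemo {
    // Hypothetical helper: describes the first match of regex in input.
    static String firstMatch(String regex, String input) {
        Pattern myPattern = Pattern.compile(regex);
        Matcher myMatcher = myPattern.matcher(input);
        if (!myMatcher.find()) {
            return "no match";
        }
        // After a successful find(), the matcher exposes the matched text
        // and its start (inclusive) and end (exclusive) indices.
        return myMatcher.group() + " [" + myMatcher.start() + "," + myMatcher.end() + ")";
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("foo", "xxfoo")); // foo [2,5)
        System.out.println(firstMatch("foo", "bar"));   // no match
    }
}
```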
7. Regex Test Harness
• The tutorials give a test harness that uses the
Console class. System.console() typically returns null when a
program is launched from an IDE, so the harness fails there.
• So I rewrote it to use Basic I/O.
9. Regex
• Test harness output example.
• Input is given in bold.
Enter your regex: foo
Enter input string to search: foofoo
Found "foo" at index 0, ending at index 3.
Found "foo" at index 3, ending at index 6.
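The rewritten harness itself is not shown on the slides. One possible version, assuming a BufferedReader over System.in in place of the Console class, is sketched below. The class name RegexHarness and the search helper are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexHarness {
    // Builds one report line per match, or a single "No match found." line.
    static List<String> search(String regex, String input) {
        List<String> lines = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            lines.add(String.format("Found \"%s\" at index %d, ending at index %d.",
                    matcher.group(), matcher.start(), matcher.end()));
        }
        if (lines.isEmpty()) {
            lines.add("No match found.");
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        while (true) {
            System.out.print("Enter your regex: ");
            String regex = in.readLine();
            if (regex == null) {
                break; // end of input
            }
            System.out.print("Enter input string to search: ");
            String input = in.readLine();
            if (input == null) {
                break; // end of input
            }
            search(regex, input).forEach(System.out::println);
        }
    }
}
```

Keeping the matching logic in search, separate from the prompt loop, makes it possible to exercise the harness without typing at a console.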
13. Character Classes
Construct        Description
[abc]            a, b, or c (simple class)
[^abc]           Any character except a, b, or c (negation)
[a-zA-Z]         a through z, or A through Z, inclusive (range)
[a-d[m-p]]       a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]     d, e, or f (intersection)
[a-z&&[^bc]]     a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]    a through z, and not m through p: [a-lq-z] (subtraction)
14. Character Class
Enter your regex: [bcr]at
Enter input string to search: rat
Found "rat" at index 0, ending at index 3.
Enter input string to search: cat
Found "cat" at index 0, ending at index 3.
15. Character Class: Negation
Enter your regex: [^bcr]at
Enter input string to search: rat
No match found.
Enter input string to search: hat
Found "hat" at index 0, ending at index 3.
16. Character Class: Range
Enter your regex: foo[1-5]
Enter input string to search: foo5
Found "foo5" at index 0, ending at index 4.
Enter input string to search: foo6
No match found.
17. Character Class: Union
Enter your regex: [0-4[6-8]]
Enter input string to search: 0
Found "0" at index 0, ending at index 1.
Enter input string to search: 5
No match found.
Enter input string to search: 6
Found "6" at index 0, ending at index 1.
18. Character Class: Intersection
Enter your regex: [0-9&&[345]]
Enter input string to search: 5
Found "5" at index 0, ending at index 1.
Enter input string to search: 2
No match found.
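The character class slides have no example for subtraction. The sketch below runs union, intersection, and both subtraction constructs from the table through the same hypothetical helper, keepMatches, which glues together every character the class accepts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharClassDemo {
    // Hypothetical helper: concatenates every substring of input the regex finds.
    static String keepMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder kept = new StringBuilder();
        while (m.find()) {
            kept.append(m.group());
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepMatches("[a-d[m-p]]", "abxmz"));     // union: abm
        System.out.println(keepMatches("[a-z&&[def]]", "abcdefg")); // intersection: def
        System.out.println(keepMatches("[a-z&&[^bc]]", "abcd"));    // subtraction: ad
        System.out.println(keepMatches("[a-z&&[^m-p]]", "lmnopq")); // subtraction: lq
    }
}
```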
20. Predefined Character Classes
Construct   Description
.           Any character (may or may not match line terminators)
\d          A digit: [0-9]
\D          A non-digit: [^0-9]
\s          A whitespace character: [ \t\n\x0B\f\r]
\S          A non-whitespace character: [^\s]
\w          A word character: [a-zA-Z_0-9]
\W          A non-word character: [^\w]
21. Predefined Character Classes (cont.)
• To summarize:
– \d matches digits
– \s matches whitespace
– \w matches word characters
• The capitalized form matches the opposite:
– \D matches non-digits
– \S matches non-whitespace
– \W matches non-word characters
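One practical detail when these classes are written in Java source: the backslash must itself be escaped inside a string literal, so the regex \d is typed as "\\d". A small sketch, with countMatches as a made-up helper:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PredefinedClassDemo {
    // Hypothetical helper: counts how many times regex is found in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "\\d" in Java source is the two-character regex \d.
        System.out.println(countMatches("\\d", "a1b22"));  // 3 digits
        System.out.println(countMatches("\\s", "a b\tc")); // 2 whitespace characters
        System.out.println(countMatches("\\w", "a_1 !"));  // 3 word characters
        System.out.println(countMatches("\\W", "a_1 !"));  // 2 non-word characters
    }
}
```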
22. Quantifiers
Greedy    Reluctant   Possessive   Meaning
X?        X??         X?+          X, once or not at all
X*        X*?         X*+          X, zero or more times
X+        X+?         X++          X, one or more times
X{n}      X{n}?       X{n}+        X, exactly n times
X{n,}     X{n,}?      X{n,}+       X, at least n times
X{n,m}    X{n,m}?     X{n,m}+      X, at least n but not more than m times
24. Zero Length Match
• The regexes 'a?' and 'a*' both allow for zero
occurrences of the letter a, so they can match the empty string.
Enter your regex: a*
Enter input string to search: aa
Found "aa" at index 0, ending at index 2.
Found "" at index 2, ending at index 2.
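To see where zero-length matches land, the sketch below records each match as a start-end pair. A zero-length match has equal start and end. The helper name spans is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroLengthDemo {
    // Hypothetical helper: lists every match as "start-end".
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(spans("a*", "aa")); // [0-2, 2-2]
        System.out.println(spans("a?", "aa")); // [0-1, 1-2, 2-2]
        System.out.println(spans("a*", ""));   // [0-0]
    }
}
```

In each case the final entry is the empty match at the end of the input, which is why the harness prints one more "Found" line than a reader might expect.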
25. Quantifiers: Exact
Enter your regex: a{3}
Enter input string to search: aa
No match found.
Enter input string to search: aaaa
Found "aaa" at index 0, ending at index 3.
26. Quantifiers: At Least, No Greater
Enter your regex: a{3,}
Enter input string to search: aaaaaaaaa
Found "aaaaaaaaa" at index 0, ending at index 9.
Enter your regex: a{3,6}
Enter input string to search: aaaaaaaaa
Found "aaaaaa" at index 0, ending at index 6.
Found "aaa" at index 6, ending at index 9.
27. Quantifiers
• "abc+"
– Means "a, followed by b, followed by (c one or
more times)".
– "abcc" = match!, "abbc" = no match
• "[abc]+"
– Means "(a, b, or c) one or more times".
– "bba" = match!
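What a quantifier applies to depends on what stands immediately before it: a single character, a character class such as [abc], or a parenthesized group. The sketch below checks the cases from this slide plus a group, (abc)+, which is an extra case added for contrast. fullMatch is a hypothetical wrapper around Pattern.matches, which requires the whole input to match.

```java
import java.util.regex.Pattern;

public class QuantifierScopeDemo {
    // Hypothetical helper: true only if the regex matches the entire input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("abc+", "abcc"));     // true: + applies to c only
        System.out.println(fullMatch("abc+", "abbc"));     // false: b may not repeat
        System.out.println(fullMatch("[abc]+", "bba"));    // true: + applies to the class
        System.out.println(fullMatch("(abc)+", "abcabc")); // true: + applies to the group
    }
}
```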
28. Greedy, Reluctant, and Possessive
• Greedy
– The quantifier first consumes as much input as it can, then
gives back characters from the end one at a time as needed
• Reluctant
– The quantifier first consumes nothing, then takes
characters from the beginning one at a time as needed
• Possessive
– The quantifier consumes as much input as it can and never
gives any back, even if the overall match then fails
29. Greedy
Enter your regex: .*foo
Enter input string to search: xfooxxxxxxfoo
Found "xfooxxxxxxfoo" at index 0, ending at
index 13.
30. Reluctant
Enter your regex: .*?foo
Enter input string to search: xfooxxxxxxfoo
Found "xfoo" at index 0, ending at index 4.
Found "xxxxxxfoo" at index 4, ending at index
13.
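The three quantifier types can be compared side by side on the same input. This sketch reuses the input string from the slides, adds the possessive form .*+foo, and collects matches with a hypothetical helper allMatches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedinessDemo {
    // Hypothetical helper: returns the text of every match found in input.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(allMatches(".*foo", input));  // greedy: [xfooxxxxxxfoo]
        System.out.println(allMatches(".*?foo", input)); // reluctant: [xfoo, xxxxxxfoo]
        System.out.println(allMatches(".*+foo", input)); // possessive: []
    }
}
```

The possessive version finds nothing because .*+ swallows the whole input, including the final "foo", and never backs off to let the literal foo match.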
32. Capturing Group
• Capturing groups are a way to treat multiple
characters as a single unit.
• They are created by placing the characters to
be grouped inside a set of parentheses.
• "(dog)"
– Means a single group containing the letters "d",
"o", and "g".
34. Capturing Groups: Numbering
• ((A)(B(C)))
1. ((A)(B(C)))
2. (A)
3. (B(C))
4. (C)
• Groups are numbered by counting opening
parentheses from left to right.
35. Capturing Groups: Numbering Usage
• Some Matcher methods accept a group
number as a parameter:
– int start(int group)
– int end(int group)
– String group(int group)
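The numbering of ((A)(B(C))) can be checked by asking a Matcher for each group in turn. In the sketch below, describeGroups is a made-up helper. Group 0 is not one of the numbered capturing groups; it always denotes the entire match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumberDemo {
    // Hypothetical helper: returns "text@start-end" for group 0 and each capturing group.
    static String[] describeGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i) + "@" + m.start(i) + "-" + m.end(i);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describeGroups("((A)(B(C)))", "ABC")) {
            System.out.println(line);
        }
        // Prints:
        // ABC@0-3   (group 0, the whole match)
        // ABC@0-3   (group 1)
        // A@0-1     (group 2)
        // BC@1-3    (group 3)
        // C@2-3     (group 4)
    }
}
```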
36. Capturing Groups: Backreferences
• The section of input matching the capturing
group is saved for recall via backreference.
• Specify a backreference with a backslash (\) followed by
the group number.
• '(\d\d)'
– Can be recalled with the expression '\1'.
37. Capturing Groups: Backreferences
Enter your regex: (\d\d)\1
Enter input string to search: 1212
Found "1212" at index 0, ending at index 4.
Enter input string to search: 1234
No match found.
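In Java source the same backreference regex needs doubled backslashes. The sketch below wraps it in a hypothetical helper, containsRepeatedPair, that looks for any two digits immediately followed by the same two digits.

```java
import java.util.regex.Pattern;

public class BackreferenceDemo {
    // The regex (\d\d)\1 written as a Java string literal.
    private static final Pattern REPEATED_PAIR = Pattern.compile("(\\d\\d)\\1");

    // Hypothetical helper: true if input contains a digit pair repeated back to back.
    static boolean containsRepeatedPair(String input) {
        return REPEATED_PAIR.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsRepeatedPair("1212"));   // true: "12" then "12"
        System.out.println(containsRepeatedPair("1234"));   // false: "12" then "34"
        System.out.println(containsRepeatedPair("x4747y")); // true: "47" then "47"
    }
}
```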
38. Boundary Matchers
Boundary Construct   Description
^                    The beginning of a line
$                    The end of a line
\b                   A word boundary
\B                   A non-word boundary
\A                   The beginning of the input
\G                   The end of the previous match
\Z                   The end of the input but for the final terminator, if any
\z                   The end of the input
39. Boundary Matchers
Enter your regex: ^dog$
Enter input string to search: dog
Found "dog" at index 0, ending at index 3.
Enter your regex: ^dog\w*
Enter input string to search: dogblahblah
Found "dogblahblah" at index 0, ending at index
11.
40. Boundary Matchers (cont.)
Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.
Enter your regex: \Gdog
Enter input string to search: dog dog
Found "dog" at index 0, ending at index 3.
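The boundary examples can be replayed in code by counting matches. The sketch below uses the inputs from the slides plus two extra inputs ("The dog plays in the yard." and "dogdog dog") that are added here for contrast. countMatches is a hypothetical helper.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Hypothetical helper: counts how many times regex is found in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // \b requires a word boundary on both sides of "dog".
        System.out.println(countMatches("\\bdog\\b", "The dog plays in the yard."));    // 1
        System.out.println(countMatches("\\bdog\\b", "The doggie plays in the yard.")); // 0
        // \G anchors each match to the end of the previous one.
        System.out.println(countMatches("\\Gdog", "dog dog"));    // 1: the space breaks the chain
        System.out.println(countMatches("\\Gdog", "dogdog dog")); // 2: only the adjacent pair
    }
}
```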
41. Pattern Class (cont.)
• There are a number of flags that can be
passed to the ‘compile’ method.
• Embedded flag expressions, such as (?i), switch on
the same options from inside the regex itself.
• Check out the 'matches', 'split', and 'quote'
methods as well.
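A short sketch of these Pattern features follows. The helper names are invented for the example. The calls themselves (Pattern.CASE_INSENSITIVE, Pattern.matches, split, and Pattern.quote) are standard java.util.regex API.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PatternExtrasDemo {
    // Compile flag: match the word regardless of letter case.
    static boolean matchesIgnoringCase(String word, String input) {
        return Pattern.compile(word, Pattern.CASE_INSENSITIVE).matcher(input).matches();
    }

    // split: break the input around commas followed by optional whitespace.
    static String[] splitOnCommas(String input) {
        return Pattern.compile(",\\s*").split(input);
    }

    // quote: treat every character of the literal as plain text, not regex syntax.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        System.out.println(matchesIgnoringCase("dog", "DOG"));        // true
        System.out.println(Pattern.matches("(?i)dog", "DoG"));        // true, embedded flag
        System.out.println(Arrays.toString(splitOnCommas("a, b,c"))); // [a, b, c]
        System.out.println(Pattern.matches("1+1", "1+1"));            // false: + is a quantifier
        System.out.println(matchesLiterally("1+1", "1+1"));           // true
    }
}
```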
42. Matcher Class (cont.)
• The Matcher class can examine its input in
several ways:
– Index methods give the position of matches
– Study methods (matches, lookingAt, find) report whether the input matches
– Replacement methods let you edit input
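As a rough illustration of the study and replacement methods, the sketch below calls matches(), lookingAt(), find(), replaceAll(), and replaceFirst() on a Matcher. The wrapper names study and swap are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherExtrasDemo {
    // Returns { matches(), lookingAt(), find() } for the given regex and input.
    static boolean[] study(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean whole = m.matches();    // the entire input must match
        boolean prefix = m.lookingAt(); // the match must start at index 0
        m.reset();                      // start find() from the beginning again
        boolean anywhere = m.find();    // the match may start anywhere
        return new boolean[] { whole, prefix, anywhere };
    }

    // Replaces every occurrence (all = true) or only the first one (all = false).
    static String swap(String regex, String input, String replacement, boolean all) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return all ? m.replaceAll(replacement) : m.replaceFirst(replacement);
    }

    public static void main(String[] args) {
        boolean[] r = study("foo", "foobar");
        System.out.println(r[0] + " " + r[1] + " " + r[2]);         // false true true
        System.out.println(swap("dog", "dog and dog", "cat", true));  // cat and cat
        System.out.println(swap("dog", "dog and dog", "cat", false)); // cat and dog
    }
}
```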
43. PatternSyntaxException (cont.)
• You get a little more than just an error
message from the PatternSyntaxException.
• Check out the following methods:
– public String getDescription()
– public int getIndex()
– public String getPattern()
– public String getMessage()
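A sketch of how those accessors might be combined into a one-line diagnostic is shown below. The describe helper is hypothetical, and the exact description text and index depend on the JDK, so the example does not show them.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SyntaxErrorDemo {
    // Hypothetical helper: returns null for a valid regex, otherwise a short diagnostic.
    static String describe(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            // getIndex() is the approximate error position, or -1 if unknown.
            return e.getDescription() + " at index " + e.getIndex()
                    + " in " + e.getPattern();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("a*"));       // null, the regex is valid
        System.out.println(describe("foo[1-5"));  // a diagnostic ending in "in foo[1-5"
    }
}
```

getMessage() returns a multi-line string that bundles the description, the index, the pattern, and a visual pointer to the error position, which is handy for logging.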