3. What are Regular Expressions?
• Very small language for describing text.
• Not a programming language.
• Incredibly powerful tool for search/replace
operations.
• Old (1950s-60s)
• Arcane art.
• Ubiquitous.
4. Why Use Regular Expressions?
• Finding every instance of a string in a file
– e.g. every mention of “chickens” in a farm
diary
• How many times does “sing” appear in a
text in all tenses and conjugations?
• Reformatting dirty data
• Validating input.
• Command line work – listing files,
grepping log files
5. The Basics
• A regex is a pattern enclosed within
delimiters.
• Most characters match themselves.
• /rootstech/ is a regular expression that
matches “rootstech”.
– Slash is the delimiter enclosing the
expression.
– “rootstech” is the pattern.
9. Characters
• Matching is case sensitive.
• Special characters: ( ) ^ $ { } [ ] | . + ? *
• To match a special character in your text,
precede it with \ in your pattern:
– /snarky [sic]/ does not match “snarky [sic]”
– /snarky \[sic\]/ matches “snarky [sic]”
• Regular expressions can support Unicode.
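A minimal Python sketch of the escaping rule above. Python's re module has no slash delimiters, so each pattern is written as a raw string; the variable names are illustrative and not part of the slides.

```python
import re

text = "snarky [sic]"

# Unescaped: [sic] is a character class meaning one of 's', 'i' or 'c',
# so this looks for "snarky " followed by one of those letters.
unescaped = re.search(r"snarky [sic]", text)

# Escaped: \[ and \] match literal square brackets.
escaped = re.search(r"snarky \[sic\]", text)

# re.escape() inserts the backslashes when the literal text is not known in advance.
auto_escaped = re.search(re.escape(text), text)

print(unescaped)              # None
print(escaped.group(0))       # snarky [sic]
print(auto_escaped.group(0))  # snarky [sic]
```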
10. Character Classes
• Characters within [ ] are choices for a
single-character match.
• Think of a set operation, or a type of or.
• Order within the set is unimportant.
• /x[01]/ matches “x0” and “x1”.
• /[10][23]/ matches “02”, “03”, “12” and
“13”.
• An initial ^ negates the class:
– /[^45]/ matches all characters except 4 or 5.
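The character-class examples above can be tried in Python roughly as follows. This is a sketch: the sample strings are made up for illustration, and re.findall simply lists every non-overlapping match.

```python
import re

# [01] matches exactly one character: either '0' or '1'.
x_matches = re.findall(r"x[01]", "x0 x1 x2")

# Order inside the brackets does not matter: [10] is the same set as [01].
x_matches_reordered = re.findall(r"x[10]", "x0 x1 x2")

# Two classes in a row match two characters, one from each set.
two_chars = re.findall(r"[10][23]", "02 03 12 13 22 10")

# A leading ^ negates the class: any character except '4' or '5'.
not_4_or_5 = re.findall(r"[^45]", "3456")

print(x_matches)            # ['x0', 'x1']
print(x_matches_reordered)  # ['x0', 'x1']
print(two_chars)            # ['02', '03', '12', '13']
print(not_4_or_5)           # ['3', '6']
```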
13. Ranges
• Ranges define sets of characters within a
class.
– /[1-9]/ matches any non-zero digit.
– /[a-zA-Z]/ matches any unaccented (ASCII) letter.
– /[12][0-9]/ matches numbers between 10 and
29.
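A short Python sketch of the range examples above. The sample inputs are assumptions chosen for illustration. re.fullmatch is used for the last pattern because an unanchored /[12][0-9]/ would also match inside a longer number such as 129.

```python
import re

# [1-9] matches any single non-zero digit.
nonzero = re.findall(r"[1-9]", "2024")

# [a-zA-Z] matches one ASCII letter, upper or lower case.
letters = re.findall(r"[a-zA-Z]", "R2-D2")

# [12][0-9] matches a '1' or '2' followed by any digit, i.e. 10 through 29
# when the whole string has to match.
candidates = ["9", "10", "19", "29", "30"]
ten_to_29 = [s for s in candidates if re.fullmatch(r"[12][0-9]", s)]

print(nonzero)    # ['2', '2', '4']
print(letters)    # ['R', 'D']
print(ten_to_29)  # ['10', '19', '29']
```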
14. Shortcuts
Shortcut  Name        Equivalent Class
\d        digit       [0-9]
\D        not digit   [^0-9]
\w        word        [a-zA-Z0-9_]
\W        not word    [^a-zA-Z0-9_]
\s        space       [\t\n\r\f\v ]
\S        not space   [^\t\n\r\f\v ]
.         everything  [^\n] (depends on mode)
15. /\d\d\d[- ]\d\d\d\d/
• Matches strings with:
– Three digits
– Space or dash
– Four digits
• Test strings: 501-1234, 234 1252, 652.2648,
713-342-7452, PE6-5000, 653-6464x256
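One way to check the phone-number pattern against the six strings on this slide is the Python sketch below. The names phone, candidates and found are illustrative. Because the pattern is unanchored, re.search reports the first place it matches inside a longer string.

```python
import re

# Three digits, then a dash or a space, then four digits.
phone = re.compile(r"\d\d\d[- ]\d\d\d\d")

candidates = [
    "501-1234", "234 1252", "652.2648",
    "713-342-7452", "PE6-5000", "653-6464x256",
]

# Map each candidate to the text that matched, or None if nothing matched.
found = {}
for s in candidates:
    m = phone.search(s)
    found[s] = m.group(0) if m else None

for s, hit in found.items():
    print(s, "->", hit)
# 501-1234 -> 501-1234
# 234 1252 -> 234 1252
# 652.2648 -> None
# 713-342-7452 -> 342-7452
# PE6-5000 -> None
# 653-6464x256 -> 653-6464
```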
17. Repeaters
• Symbols indicating that the preceding
element of the pattern can repeat.
• /runs?/ matches “runs” or “run”.
• /1\d*/ matches any number beginning
with “1”.
Repeater  Count
?         zero or one
+         one or more
*         zero or more
{n}       exactly n
{n,m}     between n and m times
{,m}      no more than m times
{n,}      at least n times
18. Repeaters
Strings:
1: “at” 2: “art” 3: “arrrrt” 4: “aft”
Patterns:
A: /ar?t/ B: /a[fr]?t/ C: /ar*t/
D: /ar+t/ E: /a.*t/ F: /a.+t/
19. Repeaters
• /ar?t/ matches “at” and “art” but not “arrrrt”.
• /a[fr]?t/ matches “at”, “art”, and “aft”.
• /ar*t/ matches “at”, “art”, and “arrrrt”.
• /ar+t/ matches “art” and “arrrrt” but not “at”.
• /a.*t/ matches anything with an ‘a’
eventually followed by a ‘t’.
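The four strings and six patterns from the Repeaters slides can be checked against each other with a small Python loop. This is a sketch only; the dictionary layout and the labels A to F mirror the slide but are otherwise arbitrary.

```python
import re

strings = ["at", "art", "arrrrt", "aft"]

patterns = {
    "A": r"ar?t",     # zero or one 'r'
    "B": r"a[fr]?t",  # zero or one of 'f' or 'r'
    "C": r"ar*t",     # zero or more 'r'
    "D": r"ar+t",     # one or more 'r'
    "E": r"a.*t",     # zero or more of any character
    "F": r"a.+t",     # one or more of any character
}

# For each pattern, collect the strings in which it finds a match.
results = {
    label: [s for s in strings if re.search(p, s)]
    for label, p in patterns.items()
}

for label, matched in results.items():
    print(label, patterns[label], matched)
# A ar?t ['at', 'art']
# B a[fr]?t ['at', 'art', 'aft']
# C ar*t ['at', 'art', 'arrrrt']
# D ar+t ['art', 'arrrrt']
# E a.*t ['at', 'art', 'arrrrt', 'aft']
# F a.+t ['art', 'arrrrt', 'aft']
```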
21. Lab Session I
Match “Brumfield” and “Bromfield” in
1702 John Bromfield's estate had been
proved in Isle of Wight prior to 1702,
Anne Brumfield rec'd. more than her share
from her father's estate.
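One possible answer to this lab, sketched in Python: a character class covers the single letter that differs between the two spellings. Other patterns would work equally well; this is not necessarily the solution intended by the slides.

```python
import re

text = (
    "1702 John Bromfield's estate had been proved in Isle of Wight "
    "prior to 1702, Anne Brumfield rec'd. more than her share "
    "from her father's estate."
)

# [ou] accepts either vowel in the second-letter-after-"Br" position.
surnames = re.findall(r"Br[ou]mfield", text)

print(surnames)  # ['Bromfield', 'Brumfield']
```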
22. Lab Reference
Repeater  Count
?         zero or one
+         one or more
*         zero or more
{n}       exactly n times
{n,m}     between n and m times
{,m}      no more than m times
{n,}      at least n times
Shortcut  Name
\d        digit
\D        not digit
\w        word
\W        not word
\s        space
\S        not space
.         everything
23. Anchors
• Anchors match between characters.
• Used to assert that the characters you’re
matching must appear in a certain place.
• /\bat\b/ matches “at work” but not “batch”.
Anchor  Matches
^       start of line
$       end of line
\b      word boundary
\B      not boundary
\A      start of string
\Z      end of string
\z      raw end of string (rare)
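A brief Python sketch of the word-boundary and line anchors. One assumption to note: in Python, ^ and $ act as start-of-line and end-of-line anchors only when the re.MULTILINE flag is set; other tools may default differently. The sample text in lines is made up.

```python
import re

# \b asserts a position between a word character and a non-word character.
word_at = re.compile(r"\bat\b")
in_phrase = bool(word_at.search("at work"))
in_batch = bool(word_at.search("batch"))

# With re.MULTILINE, ^ and $ anchor to the start and end of each line.
lines = "cat\nat\nattic"
whole_line_at = re.findall(r"^at$", lines, flags=re.MULTILINE)

print(in_phrase)      # True
print(in_batch)       # False
print(whole_line_at)  # ['at']
```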
24. Alternation
• In Regex, | means “or”.
• You can put a full expression on the left
and another full expression on the right.
• Either can match.
• /seeks?|sought/ matches “seek”, “seeks”,
or “sought”.
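The alternation example can be run in Python as sketched below. The sample sentence is an assumption added for illustration.

```python
import re

# Left alternative: "seek" with an optional "s". Right alternative: "sought".
verb = re.compile(r"seeks?|sought")

sentence = "She seeks what he sought; they seek it too."
forms = verb.findall(sentence)

print(forms)  # ['seeks', 'sought', 'seek']
```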
25. Grouping
• Everything within ( … ) is grouped into a
single element for the purposes of
repetition and alternation.
• The expression /(la)+/ matches “la”, “lala”,
“lalalala” but not “all”.
• /schema(ta)?/ matches “schema” and
“schemata”; in “schematic” it matches
only the “schema” part.
27. Grouping Example
• What regular expression matches “eat”,
“eats”, “ate” and “eaten”?
• /eat(s|en)?|ate/
• Add word boundary anchors to exclude
“sate” and “eating”: /\b(eat(s|en)?|ate)\b/
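The effect of the word-boundary anchors can be seen by running both versions of the pattern over the same word list. A Python sketch, with an illustrative word list drawn from this slide:

```python
import re

words = ["eat", "eats", "ate", "eaten", "sate", "eating"]

# Without anchors the pattern can match inside a longer word.
loose = re.compile(r"eat(s|en)?|ate")
# With \b on both sides the match must be a whole word.
strict = re.compile(r"\b(eat(s|en)?|ate)\b")

loose_hits = [w for w in words if loose.search(w)]
strict_hits = [w for w in words if strict.search(w)]

print(loose_hits)   # ['eat', 'eats', 'ate', 'eaten', 'sate', 'eating']
print(strict_hits)  # ['eat', 'eats', 'ate', 'eaten']
```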
28. Lab Session II
Match “William” and “Wm.” in
1736 Robert Mosby and John Brumfield
processioned the lands of Wm. Brittain
1739 … Witnesses: Richard Echols, William
Brumfield, John Hendrick
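One possible answer to this lab, sketched in Python: make the middle letters of “William” an optional group and allow an optional trailing period. This is an assumption about a workable solution, not necessarily the one intended by the slides.

```python
import re

text = (
    "1736 Robert Mosby and John Brumfield processioned the lands of "
    "Wm. Brittain 1739 … Witnesses: Richard Echols, William Brumfield, "
    "John Hendrick"
)

# "W", optionally "illia", then "m" at a word boundary, then an optional literal period.
given_name = re.compile(r"\bW(illia)?m\b\.?")

# finditer + group(0) returns the whole match rather than just the captured group.
names = [m.group(0) for m in given_name.finditer(text)]

print(names)  # ['Wm.', 'William']
```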
29. Replacement
• Regex most often used for search/replace
• Syntax varies; most scripting languages
and CLI tools use s/pattern/replacement/ .
• s/dog/hound/ converts “slobbery dogs” to
“slobbery hounds”.
• s/\bsheeps\b/sheep/ converts
– “sheepskin is made from sheeps” to
– “sheepskin is made from sheep”
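Python has no s/pattern/replacement/ operator; the closest equivalent is re.sub(pattern, replacement, text). A minimal sketch of the two substitutions above:

```python
import re

# s/dog/hound/
hounds = re.sub(r"dog", "hound", "slobbery dogs")

# s/\bsheeps\b/sheep/ : the word boundaries leave "sheepskin" untouched.
sheep = re.sub(r"\bsheeps\b", "sheep", "sheepskin is made from sheeps")

print(hounds)  # slobbery hounds
print(sheep)   # sheepskin is made from sheep
```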
30. Capture
• During searches, ( … ) groups capture the
text they match for use in replacement.
• Special variables $1, $2, $3 etc. contain
the captures.
• /(\d\d\d)-(\d\d\d\d)/ applied to “123-4567”:
– $1 contains “123”
– $2 contains “4567”
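In Python the captures are read from the match object rather than from $1 and $2. A small sketch of the example above, with illustrative variable names:

```python
import re

m = re.search(r"(\d\d\d)-(\d\d\d\d)", "123-4567")

# group(1) and group(2) play the role of $1 and $2.
prefix = m.group(1)
line_number = m.group(2)

print(prefix)       # 123
print(line_number)  # 4567
```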
32. Capture
• How do you convert
– “Smith, James” and “Jones, Sally” to
– “James Smith” and “Sally Jones”?
• s/(\w+), (\w+)/$2 $1/
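The same name swap in Python, as a sketch. Note the syntax difference: Python's re.sub refers to captures in the replacement string as \1 and \2 rather than $1 and $2.

```python
import re

names = ["Smith, James", "Jones, Sally"]

# Capture surname and given name, then emit them in the opposite order.
swapped = [re.sub(r"(\w+), (\w+)", r"\2 \1", n) for n in names]

print(swapped)  # ['James Smith', 'Sally Jones']
```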
33. Caveats
• Check the language/application-specific
documentation: some common shortcuts
are not universal.