A wise hacker said: Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
Regular expressions are a powerful tool in our hands and a first-class citizen in Ruby, so it is tempting to overuse them. But knowing them and using them properly is a fundamental asset for every developer.
We’ll see hands-on examples of proper regexp usage in Ruby code, look at the bad and ugly cases, and learn how to approach writing, testing and debugging regular expressions.
5. Regexp syntax
literals: /cat/ matches any ‘cat’ substring
the dot: /./ matches any character except a newline
character classes: /[aeiou]/ /[a-z]/ /[01]/
negated character classes: /[^abc]/
@lmea
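A minimal Ruby sketch of the building blocks on this slide. The sample strings and constant names are illustrative assumptions, not part of the original talk.

```ruby
# =~ returns the index of the first match, or nil when nothing matches.
LITERAL_INDEX  = "concatenate" =~ /cat/    # literal 'cat' found inside a longer word
DOT_ON_NEWLINE = "\n" =~ /./               # by default '.' does not match a newline
VOWELS         = "regexp".scan(/[aeiou]/)  # character class: every vowel, in order
NOT_ABC        = "abcabd".scan(/[^abc]/)   # negated class: anything except a, b or c

p LITERAL_INDEX, DOT_ON_NEWLINE, VOWELS, NOT_ABC
```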
6. Regexp syntax
Modifiers
case insensitive: /./i
only interpolate #{} blocks once: /./o
multiline mode - '.' will match newline: /./m
extended mode - whitespace is ignored: /./x
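The modifiers can be tried out as follows. This is only a sketch: the strings and the YEAR_MONTH pattern are made up for illustration, and /o is left out because its effect only shows across repeated evaluations.

```ruby
CASE_INSENSITIVE = "My CAT" =~ /cat/i   # /i ignores case
DOT_DEFAULT      = "a\nb" =~ /a.b/      # without /m the dot stops at the newline
DOT_MULTILINE    = "a\nb" =~ /a.b/m     # with /m the dot also matches "\n"

# /x lets the pattern be laid out and commented; whitespace inside it is ignored.
YEAR_MONTH = /
  \d{4}   # year
  -
  \d{2}   # month
/x
EXTENDED_MATCH = "released 2013-03"[YEAR_MONTH]

p CASE_INSENSITIVE, DOT_DEFAULT, DOT_MULTILINE, EXTENDED_MATCH
```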
7. Regexp syntax
Shorthand classes
/\d/ digit /\D/ non digit
/\s/ whitespace /\S/ non whitespace
/\w/ word character /\W/ non word character
/\h/ hexdigit /\H/ non hexdigit
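A small sketch of the shorthand classes applied to one sample string. SAMPLE and the constant names are assumptions chosen for illustration.

```ruby
SAMPLE = "id 42 ff"

DIGIT_RUNS     = SAMPLE.scan(/\d+/)  # runs of digits
NON_DIGIT_RUNS = SAMPLE.scan(/\D+/)  # everything between the digits
WORDS          = SAMPLE.scan(/\w+/)  # runs of word characters
SPACES         = SAMPLE.scan(/\s/)   # each whitespace character
HEX_RUNS       = SAMPLE.scan(/\h+/)  # runs of hex digits: 'd' qualifies, 'i' does not

p DIGIT_RUNS, NON_DIGIT_RUNS, WORDS, SPACES, HEX_RUNS
```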
8. Regexp syntax
Anchors
/^/ beginning of line /$/ end of line
/\b/ word boundary /\B/ non word boundary
/\A/ beginning of string /\z/ end of string
/\Z/ end of string. If string ends with a newline, it matches just before newline
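The difference between line anchors and string anchors is easiest to see on a multi-line string. The following is a sketch with an invented two-line TEXT; the constant names are illustrative.

```ruby
TEXT = "first\nsecond\n"

LINE_START   = TEXT =~ /^second/    # ^ matches after any newline
STRING_START = TEXT =~ /\Asecond/   # \A matches only at the very start
LINE_END     = TEXT =~ /second$/    # $ matches before any newline
STRICT_END   = TEXT =~ /second\z/   # \z needs the absolute end, the trailing "\n" blocks it
LENIENT_END  = TEXT =~ /second\Z/   # \Z tolerates one trailing newline

WHOLE_WORDS  = "catalog cat".scan(/\bcat\b/)  # \b keeps 'catalog' out

p LINE_START, STRING_START, LINE_END, STRICT_END, LENIENT_END, WHOLE_WORDS
```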
9. Regexp syntax
alternation: /cat|dog/ matches the ‘cat’ and the ‘dog’ in ‘cats and dogs’
0-or-more: /ab*/ matches ‘a’ ‘ab’ ‘abb’...
1-or-more: /ab+/ matches ‘ab’ ‘abb’ ...
given-number: /ab{2}/ matches ‘abb’ but not
‘ab’ or the whole ‘abbb’ string
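The quantifiers above can be checked with String#[] and String#scan. This is a sketch only; the constant names are assumptions.

```ruby
ALTERNATION  = "cats and dogs".scan(/cat|dog/)             # both alternatives are found
ZERO_OR_MORE = %w[a ab abb].all? { |s| s[/ab*/] == s }     # each string is matched in full
ONE_OR_MORE  = "a"[/ab+/]                                  # at least one 'b' is required
EXACTLY_TWO  = "abbb"[/ab{2}/]                             # only the first two 'b's are taken
TOO_SHORT    = "ab"[/ab{2}/]                               # a single 'b' is not enough

p ALTERNATION, ZERO_OR_MORE, ONE_OR_MORE, EXACTLY_TWO, TOO_SHORT
```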
10. Regexp syntax
greedy matches: /.+cat/ matches ‘the cat is cat’ in
‘the cat is catching a mouse’
lazy matches: /.+?\scat/ matches ‘the cat’ in
‘the cat is catching a mouse’
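A short sketch that makes the matched portion visible, since the slide's highlighting does not survive as text. The lazy pattern is assumed to be /.+?\scat/, and the constant names are illustrative.

```ruby
SENTENCE = "the cat is catching a mouse"

# Greedy: .+ grabs as much as it can, so the match runs up to the LAST 'cat'.
GREEDY = SENTENCE[/.+cat/]

# Lazy: .+? grabs as little as it can, so the match stops at the FIRST ' cat'.
LAZY = SENTENCE[/.+?\scat/]

p GREEDY, LAZY
```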
11. Regexp syntax
grouping: /(\d{3}\.){3}\d{3}/ matches IP-like strings
capturing: /a (cat|dog)/ the match is
captured in $1 to be used later
non capturing: /a (?:cat|dog)/ no content
captured
atomic grouping: /(?>a+)/ doesn’t backtrack
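The four kinds of groups can be compared side by side. This is a sketch under assumptions: the input strings are invented, and the atomic example uses /(?>a+)ab/ rather than the slide's bare /(?>a+)/ so that the lack of backtracking has a visible effect.

```ruby
IP_LIKE       = /(\d{3}\.){3}\d{3}/
IP_LIKE_MATCH = IP_LIKE.match?("192.168.001.010")   # group repeated three times

CAPTURED     = /a (cat|dog)/.match("I have a dog")[1]            # same value as $1
NOT_CAPTURED = /a (?:cat|dog)/.match("I have a dog").captures    # nothing is stored

# A normal group gives back one 'a' so that the trailing 'ab' can match.
BACKTRACKING = "aaab" =~ /(a+)ab/
# An atomic group keeps every 'a' it took, so 'ab' can never match afterwards.
ATOMIC = "aaab" =~ /(?>a+)ab/

p IP_LIKE_MATCH, CAPTURED, NOT_CAPTURED, BACKTRACKING, ATOMIC
```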
12. String substitution
"My cat eats catfood".sub(/cat/, "dog")
# => My dog eats catfood
"My cat eats catfood".gsub(/cat/, "dog")
# => My dog eats dogfood
"My cat eats catfood".gsub(/\bcat(\w+)/, "dog\\1")
# => My cat eats dogfood
"My cat eats catfood".gsub(/\bcat(\w+)/) { |m| $1.reverse }
# => My cat eats doof
13. String parsing
"Codemotion Rome: Mar 20 to Mar 23".scan(/\w{3} \d{1,2}/)
# => ["Mar 20", "Mar 23"]
"Codemotion Rome: Mar 20 to Mar 23".scan(/(\w{3}) (\d{1,2})/)
# => [["Mar", "20"], ["Mar", "23"]]
"Codemotion Rome: Mar 20 to Mar 23".scan(/(\w{3}) (\d{1,2})/) { |a, b| puts b + "/" + a }
# 20/Mar
# 23/Mar
# => "Codemotion Rome: Mar 20 to Mar 23"
14. Regexp methods
if "what a wonderful world" =~ /(world)/
  puts "hello #{$1.upcase}"
end
# hello WORLD
if /(world)/.match("The world")
  puts "hello #{$1.upcase}"
end
# hello WORLD
match_data = /(world)/.match("The world")
puts "hello #{match_data[1].upcase}"
# hello WORLD
16. Rails examples
# in ActiveModel::Validations::NumericalityValidator
def parse_raw_value_as_an_integer(raw_value)
  raw_value.to_i if raw_value.to_s =~ /\A[+-]?\d+\Z/
end
# in ActionDispatch::RemoteIp::IpSpoofAttackError
# IP addresses that are "trusted proxies" that can be stripped from
# the comma-delimited list in the X-Forwarded-For header. See also:
# http://en.wikipedia.org/wiki/Private_network#Private_IPv4_address_spaces
TRUSTED_PROXIES = %r{
  ^127\.0\.0\.1$                | # localhost
  ^(10                          | # private IP 10.x.x.x
    172\.(1[6-9]|2[0-9]|3[0-1]) | # private IP in the range 172.16.0.0 .. 172.31.255.255
    192\.168                      # private IP 192.168.x.x
   )\.
}x
WILDCARD_PATH = %r{\*([^/\)]+)\)?$}
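To see what the string anchors buy in the numericality validator, here is a standalone sketch of the integer-parsing method shown above, assuming its pattern is /\A[+-]?\d+\Z/. It is copied out of its Rails class for illustration only, and the inputs are invented.

```ruby
# Standalone copy of the validator helper: returns an Integer or nil.
def parse_raw_value_as_an_integer(raw_value)
  raw_value.to_i if raw_value.to_s =~ /\A[+-]?\d+\Z/
end

p parse_raw_value_as_an_integer("42")        # plain digits are accepted
p parse_raw_value_as_an_integer("-7")        # an optional sign is accepted
p parse_raw_value_as_an_integer("4.2")       # a decimal point is rejected
p parse_raw_value_as_an_integer("42\n")      # \Z tolerates one trailing newline
p parse_raw_value_as_an_integer("42\nDROP")  # a second line is rejected
```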
17. Regexps are dangerous
"If I was going to place a bet on something
about Rails security, it'd be that there are more
regex vulnerabilities in the tree. I am
uncomfortable with how much Rails leans on
regex for policy decisions."
Thomas H. Ptacek (Founder @ Matasano, Feb 2013)
18. Tip #1
Beware of nested quantifiers
/(x+x+)+y/ =~ 'xxxxxxxxxy'
/(xx+)+y/ =~ 'xxxxxxxxxx'
/(?>x+x+)+y/ =~ 'xxxxxxxxx'
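One way to defuse a nested quantifier is to rewrite it without the nesting, or to make the group atomic. The sketch below only checks that the three forms accept and reject the same inputs; FLAT is a hypothetical rewrite not shown on the slide, and the strings are kept short on purpose because the cost of the nested form grows quickly with input length on engines that backtrack naively.

```ruby
NESTED = /(x+x+)+y/      # nested quantifiers: many ways to split the same run of x's
FLAT   = /xx+y/          # assumed equivalent rewrite: two or more x's, then y
ATOMIC = /(?>x+x+)+y/    # atomic group: once matched, it is never re-split

HIT  = "xxxxxxxxxy"      # ends in 'y', so every pattern matches
MISS = "xxxxxxxxxx"      # no 'y', so every pattern must fail

HIT_RESULTS  = [NESTED, FLAT, ATOMIC].map { |re| re.match?(HIT) }
MISS_RESULTS = [NESTED, FLAT, ATOMIC].map { |re| re.match?(MISS) }

p HIT_RESULTS, MISS_RESULTS
```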
19. Tip #2
Don’t make everything optional
/[-+]?[0-9]*\.?[0-9]*/ =~ '.'
/[-+]?([0-9]*\.?[0-9]+|[0-9]+)/
/[-+]?[0-9]*\.?[0-9]+/
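The effect of making everything optional shows up as soon as the pattern meets input that is not a number. A sketch, assuming the dots in the slide's patterns are escaped (\.); the constant names and inputs are illustrative.

```ruby
ALL_OPTIONAL   = /[-+]?[0-9]*\.?[0-9]*/   # every part may be absent
DIGIT_REQUIRED = /[-+]?[0-9]*\.?[0-9]+/   # at least one digit must be present

LONE_DOT_LOOSE  = "."[ALL_OPTIONAL]       # a bare dot is accepted as a "number"
NO_NUMBER_LOOSE = "abc"[ALL_OPTIONAL]     # even an empty match counts as success
LONE_DOT_STRICT = "."[DIGIT_REQUIRED]     # rejected once a digit is mandatory

# Real numbers are still matched in full by the stricter pattern.
FLOATS = ["-3.14", "42", ".5"].map { |s| s[DIGIT_REQUIRED] }

p LONE_DOT_LOOSE, NO_NUMBER_LOOSE, LONE_DOT_STRICT, FLOATS
```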
21. Tip #4
Capture repeated groups and don’t
repeat a captured group
/!(abc|123)+!/ =~ '!abc123!'
# $1 == '123'
/!((abc|123)+)!/ =~ '!abc123!'
# $1 == 'abc123'
22. Tip #5
Use interpolation with care
str = "cat"
/#{str}/ =~ "My cat eats catfood"
/#{Regexp.quote(str)}/ =~ "My cat eats catfood"
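With a plain word like "cat" both forms behave the same; the difference appears when the interpolated string contains a metacharacter. A sketch with a hypothetical USER_INPUT value:

```ruby
USER_INPUT = "a.c"   # hypothetical user-supplied text containing a regexp metacharacter

RAW    = /#{USER_INPUT}/                  # the dot is interpreted as "any character"
QUOTED = /#{Regexp.quote(USER_INPUT)}/    # the dot is escaped and matched literally

RAW_HITS_ABC        = RAW.match?("abc")      # unintended match
QUOTED_HITS_ABC     = QUOTED.match?("abc")   # no longer matches
QUOTED_HITS_LITERAL = QUOTED.match?("a.c")   # still matches the literal text

p RAW_HITS_ABC, QUOTED_HITS_ABC, QUOTED_HITS_LITERAL
```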
23. Tip #6
Don’t use ^ and $ to match the
string’s beginning and end
validates :url, :format => /^https?/
"http://example.com" =~ /^https?/
"javascript:alert('hello!');%0Ahttp://example.com"
"javascript:alert('hello!');\nhttp://example.com" =~ /^https?/
"javascript:alert('hello!');\nhttp://example.com" =~ /\Ahttps?/
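The URL check from this tip can be reproduced outside of Rails. The sketch below compares the line anchor with the string anchor on the same payload; the constant names are illustrative.

```ruby
SAFE_URL = "http://example.com"
PAYLOAD  = "javascript:alert('hello!');\nhttp://example.com"

LINE_ANCHOR   = /^https?/    # matches at the start of ANY line
STRING_ANCHOR = /\Ahttps?/   # matches only at the start of the string

SAFE_ACCEPTED      = SAFE_URL.match?(STRING_ANCHOR)  # a legitimate URL still passes
PAYLOAD_SLIPS_BY   = PAYLOAD.match?(LINE_ANCHOR)     # the second line satisfies ^
PAYLOAD_IS_BLOCKED = !PAYLOAD.match?(STRING_ANCHOR)  # \A looks at the real beginning

p SAFE_ACCEPTED, PAYLOAD_SLIPS_BY, PAYLOAD_IS_BLOCKED
```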