2. WHAT IS MEANT BY REGULAR EXPRESSION?
We have already seen string and file slicing, searching and parsing with
built-in methods such as split() and find().
This task of searching and extracting has applications in
email classification, web searching and similar problems.
Python has a very powerful regular-expression library
that handles many of these tasks quite elegantly.
Regular expressions are like a small but powerful programming
language for matching text patterns. They provide a
standardized way of searching, replacing, and parsing text
with complex patterns of characters.
A regular expression can be defined as a sequence of
characters that is used to search for a pattern in a string.
3. FEATURES OF REGEX
Hundreds of lines of code can often be reduced to a few lines with regular
expressions.
Used to construct compilers, interpreters and text editors.
Used to search and match text patterns.
The power of regular expressions comes when we add special
characters to the search string that allow us to do sophisticated
matching and extraction with very little code.
Used to validate text data formats, especially input data.
A Regular Expression (or Regex) is a pattern (or filter) that describes
the set of strings that match the pattern. A regex consists of a
sequence of characters, metacharacters (such as . , \d , ? , \W etc.) and
operators (such as + , * , ? , | , ^ ).
Popular programming languages like Python, Perl, JavaScript, Ruby,
Tcl, C# etc. have regex capabilities.
4. GENERAL USES OF REGULAR EXPRESSIONS
Search a string (search and match)
Replace parts of a string (sub)
Break a string into small pieces (split)
Find all occurrences of a pattern (findall)
The re module provides regex support in a
Python program. It raises an exception (re.error) if the
regular expression itself is invalid.
Before using regular expressions in a program, we have to
import the library using "import re".
5. REGEX FUNCTIONS
The re module offers a set of functions:
FUNCTION   DESCRIPTION
findall    Returns a list containing all matches of a pattern in the string.
search     Returns a Match object if there is a match anywhere in the string.
split      Returns a list where the string has been split at each match.
sub        Replaces one or more matches in a string (substitutes another string).
match      Checks for a match only at the beginning of the string (with optional
           flags). Returns a Match object if a match is found, otherwise None.
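The five functions in the table can be tried side by side on one in-memory string. This is only a minimal sketch: the string and the variable names are made up for illustration and do not come from the demo files.

```python
import re

# Illustrative text (an assumption, not one of the demo files)
text = "hello and welcome, hello again"

all_hits = re.findall("hello", text)    # list of every match
first_hit = re.search("welcome", text)  # Match object, or None if absent
pieces = re.split(",", text)            # split at each match of the pattern
replaced = re.sub("hello", "hi", text)  # replace every match
at_start = re.match("hello", text)      # only succeeds at the start of the string

print(all_hits)
print(first_hit.group())
print(pieces)
print(replaced)
print(at_start is not None)
```

Note that search() and match() return a Match object (or None), not True or False, although the result can be used directly in an if statement.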
6. EXAMPLE PROGRAM
• We open the file, loop through each line, and use the regular
expression function search() to print only the lines that contain the string
"hello". (The same can be done using line.find().)
# Search for lines that contain 'hello'
import re
fp = open('d:/18ec646/demo1.txt')
for line in fp:
    line = line.rstrip()
    if re.search('hello', line):
        print(line)
Output:
hello and welcome to python class
hello how are you?
# Search for lines that contain 'hello'
import re
fp = open('d:/18ec646/demo2.txt')
for line in fp:
    line = line.rstrip()
    if re.search('hello', line):
        print(line)
Output:
friends,hello and welcome
hello,goodmorning
7. EXAMPLE PROGRAM
• To get the most out of regex, we need to use special
characters called 'metacharacters'.
# Search for lines that start with 'hello'
import re
fp = open('d:/18ec646/demo1.txt')
for line in fp:
    line = line.rstrip()
    if re.search('^hello', line):  ## note the caret metacharacter before hello
        print(line)
Output:
hello and welcome to python class
hello how are you?
# Search for lines that start with 'hello'
import re
fp = open('d:/18ec646/demo2.txt')
for line in fp:
    line = line.rstrip()
    if re.search('^hello', line):  ## note the caret metacharacter before hello
        print(line)
Output:
hello, goodmorning
8. METACHARACTERS
Metacharacters are characters that are interpreted in a
special way by a RegEx engine.
Metacharacters are very helpful for parsing/extraction
from the given file/string
Metacharacters allow us to build more powerful regular
expressions.
Table-1 provides a summary of metacharacters and their
meaning in RegEx
Here's a list of metacharacters:
[ ] . ^ $ * + ? { } ( ) \ |
9. Table-1: Metacharacters
Metacharacter   Description                                                   Example
[ ]             Represents a set of characters.                               "[a-z]"
\               Signals a special sequence (can also be used to escape        "\r"
                special characters).
.               Matches any single character (except newline) at that         "Ja...v."
                position.
^               Matches the pattern at the beginning of the string            "^python"
                ("starts with").
$               Matches the pattern at the end of the string ("ends with").   "world$"
*               Zero or more occurrences of the preceding pattern.            "hello*"
+               One or more occurrences of the preceding pattern.             "hello+"
{}              A specified number of occurrences of the preceding pattern.   "hello{2}"
|               Either the pattern on the left or the one on the right.       "hello|hi"
()              Capture and group.
10. [ ] - SQUARE BRACKETS
• Square brackets specify a set of characters you wish to match.
• A set is a group of characters written inside a pair of square brackets; the set
as a whole has a special meaning.
Set          Description
[abc]        Returns a match if the string contains any of the specified characters.
[a-n]        Returns a match if the string contains any character from a to n.
[^arn]       Returns a match if the string contains any character other than a, r, or n.
[0123]       Returns a match if the string contains any of the specified digits.
[0-9]        Returns a match if the string contains any digit from 0 to 9.
[0-5][0-9]   Returns a match if the string contains any two-digit number from 00 to 59.
[a-zA-Z]     Returns a match if the string contains any letter (lower-case or upper-case).
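A few of the sets above can be checked with findall(). This is a small sketch; the test strings are invented for illustration.

```python
import re

# Illustrative strings; they are not taken from the demo files.
digits_00_59 = re.findall("[0-5][0-9]", "time 12:45:78")  # two-digit numbers 00..59
not_arn = re.findall("[^arn]", "banana")                  # anything except a, r, n
a_to_n = re.findall("[a-n]", "python")                    # letters in the range a..n

print(digits_00_59)  # ['12', '45']
print(not_arn)       # ['b']
print(a_to_n)        # ['h', 'n']
```

Here "78" is not reported because its first digit is outside the range 0 to 5.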
11. CONTD..
### illustrating square brackets
## search for all the lines where w is present and display them
import re
fh = open('d:/18ec646/demo5.txt')
for line in fh:
    line = line.rstrip()
    if re.search("[w]", line):
        print(line)
Output:
Hello and welcome
@abhishek,how are you
### illustrating square brackets
### search for lines containing g or e or both and display them
import re
fh = open('d:/18ec646/demo3.txt')
for line in fh:
    line = line.rstrip()
    if re.search("[ge]", line):
        print(line)
Output:
Hello and welcome
This is Bangalore
12. CONTD…
### illustrating square brackets
import re
fh = open('d:/18ec646/demo3.txt')
for line in fh:
    line = line.rstrip()
    if re.search("[th]", line):
        print(line)
Output:
This is Bangalore
This is Paris
This is London
### illustrating square brackets
import re
fh = open('d:/18ec646/demo7.txt')
for line in fh:
    line = line.rstrip()
    if re.search("[y]", line):
        print(line)
Output:
johny johny yes papa
open your mouth
### illustrating square brackets
import re
fh = open('d:/18ec646/demo5.txt')
for line in fh:
    line = line.rstrip()
    if re.search("[x-z]", line):
        print(line)
Output:
to:abhishek@yahoo.com
@abhishek,how are you
13. . PERIOD (DOT)
A period matches any single character (except the newline '\n').
Expression                String   Matched?
.. (any two characters)   a        No match
                          ac       1 match
                          acd      1 match
                          acde     2 matches (contains 4 characters)
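The rows of this table can be reproduced with findall(), which returns the non-overlapping matches from left to right. A minimal sketch using the same four strings:

```python
import re

# Apply the two-dot pattern to each string from the table above
results = {s: re.findall("..", s) for s in ["a", "ac", "acd", "acde"]}

for s, found in results.items():
    print(s, "->", found)
```

An empty list corresponds to "No match" in the table, and the length of each list is the number of matches.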
### illustrating dot metacharacter
import re
fh = open('d:/18ec646/demo5.txt')
for line in fh:
    line = line.rstrip()
    if re.search("y.", line):
        print(line)
Output:
to: abhishek@yahoo.com
@abhishek,how are you
14. CONTD..
### illustrating dot metacharacter
import re
fh = open('d:/18ec646/demo3.txt')
for line in fh:
    line = line.rstrip()
    if re.search("P.", line):
        print(line)
Output:
This is Paris
### illustrating dot metacharacter
## any two characters between T and s
import re
fh = open('d:/18ec646/demo6.txt')
for line in fh:
    line = line.rstrip()
    if re.search("T..s", line):
        print(line)
Output:
This is London
These are beautiful flowers
Thus we see the great London bridge
### illustrating dot metacharacter
import re
fh = open('d:/18ec646/demo6.txt')
for line in fh:
    line = line.rstrip()
    if re.search("L..d", line):
        print(line)
Output:
This is London
Thus we see the great London bridge
15. ^ - CARET
The caret symbol ^ is used to check if a string starts with a certain
character.
Expression   String   Matched?
^a           a        1 match
             abc      1 match
             bac      No match
^ab          abc      1 match
             acb      No match (starts with a but is not followed by b)
### illustrating caret
import re
fh = open('d:/18ec646/demo2.txt')
for line in fh:
    line = line.rstrip()
    if re.search("^h", line):
        print(line)
Output:
hello, goodmorning
### illustrating caret
import re
fh = open('d:/18ec646/demo5.txt')
for line in fh:
    line = line.rstrip()
    if re.search("^f", line):
        print(line)
Output:
from:krishna.sksj@gmail.com
16. $ - DOLLAR
The dollar symbol $ is used to check if a string ends with a certain
character.
Expression   String    Matched?
a$           a         1 match
             formula   1 match
             cab       No match
### illustrating metacharacters
import re
fh = open('d:/18ec646/demo5.txt')
for line in fh:
    line = line.rstrip()
    if re.search("m$", line):
        print(line)
Output:
from:krishna.sksj@gmail.com
to: abhishek@yahoo.com
### illustrating metacharacters
import re
fh = open('d:/18ec646/demo7.txt')
for line in fh:
    line = line.rstrip()
    if re.search("papa$", line):
        print(line)
Output:
johny johny yes papa
eating sugar no papa
17. * - STAR
The star symbol * matches zero or more occurrences of the pattern to its
left.
Expression   String   Matched?
ma*n         mn       1 match
             man      1 match
             maaan    1 match
             main     No match (a is not followed by n)
### illustrating metacharacters
import re
fh = open('d:/18ec646/demo6.txt')
for line in fh:
    line = line.rstrip()
    if re.search("London*", line):
        print(line)
Output:
This is London
Thus we see the great London bridge
18. + - PLUS
The plus symbol + matches one or more occurrences of the pattern to its
left.
Expression   String   Matched?
ma+n         mn       No match (no a character)
             man      1 match
             maaan    1 match
             main     No match (a is not followed by n)
### illustrating metacharacters
import re
fh = open('d:/18ec646/demo6.txt')
for line in fh:
    line = line.rstrip()
    if re.search("see+", line):
        print(line)
Output:
Thus we see the great London bridge
### illustrating metacharacters
import re
fh = open('d:/18ec646/demo6.txt')
for line in fh:
    line = line.rstrip()
    if re.search("ar+", line):
        print(line)
Output:
These are beautiful flowers
19. ? - QUESTION MARK
The question mark symbol ? matches zero or one occurrence of the pattern to its
left.
Expression   String   Matched?
ma?n         mn       1 match
             man      1 match
             maaan    No match (more than one a character)
### illustrating metacharacters
import re
fh = open('d:/18ec646/demo5.txt')
for line in fh:
    line = line.rstrip()
    if re.search("@gmail?", line):
        print(line)
Output:
from:krishna.sksj@gmail.com
### illustrating metacharacters
import re
fh = open('d:/18ec646/demo5.txt')
for line in fh:
    line = line.rstrip()
    if re.search("you?", line):
        print(line)
Output:
@abhishek,how are you
20. {} - BRACES
Finds the specified number of occurrences of a pattern. Consider {n,m} (written
without a space). This means at least n, and at most m, repetitions of the
pattern to its left.
If a{2} is given, a must be repeated exactly twice.
Expression   String        Matched?
a{2,3}       abc dat       No match
             abc daat      1 match (aa in daat)
             aabc daaat    2 matches (aa in aabc and aaa in daaat)
             aabc daaaat   2 matches (aa in aabc and aaa in daaaat)
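The braces table can be verified with findall(). A minimal sketch using the same strings as the table:

```python
import re

# Strings taken from the table above
samples = ["abc dat", "abc daat", "aabc daaat", "aabc daaaat"]
brace_results = {s: re.findall("a{2,3}", s) for s in samples}

for s, found in brace_results.items():
    print(s, "->", found)
```

In the last string the engine is greedy: it takes three of the four a characters, and the single a that remains is too short to match again.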
21. | - ALTERNATION
The vertical bar | is used for alternation (the "or" operator).
Expression   String   Matched?
a|b          cde      No match
             ade      1 match (a in ade)
             acdbea   3 matches (a, b and a in acdbea)
### illustrating metacharacters
import re
fh = open('d:/18ec646/demo7.txt')
for line in fh:
    line = line.rstrip()
    if re.search("yes|no", line):
        print(line)
Output:
johny johny yes papa
eating sugar no papa
### illustrating metacharacters
import re
fh = open('d:/18ec646/demo2.txt')
for line in fh:
    line = line.rstrip()
    if re.search("hello|how", line):
        print(line)
Output:
friends,hello and welcome
hello,goodmorning
22. () - GROUP
Parentheses () are used to group sub-patterns.
For example, (a|b|c)xz matches any string that contains
either a or b or c followed by xz.
Expression   String      Matched?
(a|b|c)xz    ab xz       No match
             abxz        1 match (bxz in abxz)
             axz cabxz   2 matches (axz, and bxz in cabxz)
### illustrating metacharacters
import re
fh = open('d:/18ec646/demo5.txt')
for line in fh:
    line = line.rstrip()
    if re.search("(hello|how) are", line):
        print(line)
Output:
@abhishek,how are you
### illustrating metacharacters
import re
fh = open('d:/18ec646/demo2.txt')
for line in fh:
    line = line.rstrip()
    if re.search("(hello and)", line):
        print(line)
Output:
friends,hello and welcome
23. \ - BACKSLASH
The backslash \ is used to escape various characters, including all
metacharacters.
For example, \$a matches if a string contains $ followed by a.
Here, $ is not interpreted by the RegEx engine in a special way.
If you are unsure whether a punctuation character has a special meaning or not,
you can put \ in front of it. This makes sure the character is not treated
in a special way. (A backslash in front of a letter or digit may instead form a
special sequence such as \d.)
NOTE: Another way of doing this is to put the special
character inside square brackets [ ].
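The two ways of matching a literal $ can be compared with a short sketch. The sample string is invented for illustration; re.escape() is a standard helper in the re module that adds the backslashes automatically.

```python
import re

text = "price: $a"  # illustrative string, not from the demo files

escaped = re.search(r"\$a", text)  # backslash removes the special meaning of $
in_set = re.search("[$]a", text)   # inside [ ] the $ is an ordinary character
unescaped = re.search("$a", text)  # here $ means "end of string", so nothing can follow it
auto = re.escape("$a")             # builds the escaped pattern for us

print(escaped.group())
print(in_set.group())
print(unescaped)
print(auto)
```

Writing the pattern as a raw string (r"...") keeps Python itself from interpreting the backslash before the regex engine sees it.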
24. SPECIAL SEQUENCES
A special sequence is a \ followed by one of the characters
(see Table) and has a special meaning.
Special sequences make commonly used patterns easier to
write.
25. SPECIAL SEQUENCES
Character   Description                                                    Example
\A          Matches if the specified characters are at the beginning       r"\AThe"
            of the string.
\b          Matches if the specified characters are at the beginning       r"\bain"
            or the end of a word.                                          r"ain\b"
\B          Matches if the specified characters are present, but NOT       r"\Bain"
            at the beginning (or the end) of a word.                       r"ain\B"
\d          Matches if the string contains a digit [0-9].                  r"\d"
\D          Matches if the string contains a non-digit character.          r"\D"
\s          Matches if the string contains a whitespace character.         r"\s"
\S          Matches if the string contains a non-whitespace character.     r"\S"
\w          Matches if the string contains a word character                r"\w"
            (A to Z, a to z, 0 to 9 and underscore).
\W          Matches if the string contains a non-word character.           r"\W"
26. \A - Matches if the specified characters are at the start of a string.
Expression   String       Matched?
\Athe        the sun      Match
             In the sun   No match
\b - Matches if the specified characters are at the beginning or end of a word.
Expression   String          Matched?
\bfoo        football        Match
             a football      Match
             afootball       No match
foo\b        football        No match
             the afoo test   Match
             the afootest    No match
27. \B - Opposite of \b. Matches if the specified characters
are not at the beginning or end of a word.
Expression   String          Matched?
\Bfoo        football        No match
             a football      No match
             afootball       Match
foo\B        the foo         No match
             the afoo test   No match
             the afootest    Match
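The \b and \B rows can be checked with search(). This is a minimal sketch using strings from the tables above; the variable names are illustrative. It also shows why raw strings matter here: in an ordinary Python string "\b" is the backspace character, not a word boundary.

```python
import re

word_start = re.search(r"\bfoo", "a football")      # foo begins a word
no_word_start = re.search(r"\bfoo", "afootball")    # foo is inside a word
inside_word = re.search(r"\Bfoo", "afootball")      # \B is the opposite of \b
word_end = re.search(r"foo\b", "the afoo test")     # foo ends a word
not_word_end = re.search(r"foo\B", "the afootest")  # foo is followed by more letters

# Without the r prefix, "\b" is a backspace character, so this does not match.
backspace = re.search("\bfoo", "a football")

print(word_start is not None, no_word_start is None)
print(inside_word is not None, word_end is not None, not_word_end is not None)
print(backspace)
```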
28. \d - Matches any decimal digit. Equivalent to [0-9]
\D - Matches any character that is not a decimal digit. Equivalent to [^0-9]
Expression   String     Matched?
\d           12abc3     3 matches (1, 2 and 3)
             Python     No match
Expression   String     Matched?
\D           1ab34"50   3 matches (a, b and ")
             1345       No match
29. \s - Matches where a string contains any whitespace
character. Equivalent to [ \t\n\r\f\v].
\S - Matches where a string contains any non-whitespace
character. Equivalent to [^ \t\n\r\f\v].
Expression   String           Matched?
\s           Python RegEx     1 match
             PythonRegEx      No match
Expression   String           Matched?
\S           a b              2 matches (a and b)
             (blank string)   No match
30. \w - Matches any alphanumeric character. Equivalent to [a-zA-Z0-9_].
Underscore is also considered an alphanumeric character.
\W - Matches any non-alphanumeric character. Equivalent
to [^a-zA-Z0-9_]
Expression   String    Matched?
\w           12&":;c   3 matches (1, 2 and c)
             %"> !     No match
Expression   String    Matched?
\W           1a2%c     1 match (%)
             Python    No match
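The character-class sequences can be tried with findall() on the strings from the tables above. A small sketch; the variable names are illustrative.

```python
import re

digits = re.findall(r"\d", "12abc3")        # decimal digits
non_digits = re.findall(r"\D", '1ab34"50')  # everything that is not a digit
spaces = re.findall(r"\s", "Python RegEx")  # whitespace characters
non_spaces = re.findall(r"\S", "a b")       # non-whitespace characters
word_chars = re.findall(r"\w", '12&":;c')   # letters, digits and underscore
non_word = re.findall(r"\W", "1a2%c")       # everything else

print(digits, non_digits)
print(spaces, non_spaces)
print(word_chars, non_word)
```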
31. \Z - Matches if the specified characters are at the end of a
string.
Expression   String                      Matched?
Python\Z     I like Python               1 match
             I like Python Programming   No match
             Python is fun.              No match
# check whether the specified
# characters are at the end of the string
import re
fp = open('d:/18ec646/demo5.txt')
for x in fp:
    x = x.rstrip()
    if re.findall(r"com\Z", x):
        print(x)
Output:
from:krishna.sksj@gmail.com
to: abhishek@yahoo.com
33. THE FINDALL() FUNCTION
The findall() function returns a list containing all matches.
The list contains the matches in the order they are found.
If no matches are found, an empty list is returned.
Here is the syntax for this function −
re.findall(pattern, string, flags=0)
import re
text = "How are you. How is everything?"
matches = re.findall("How", text)
print(matches)
Output:
['How', 'How']
35. CONTD..
# check whether the string starts with How
import re
text = "How are you. How is everything?"
x = re.findall("^How", text)
print(text)
print(x)
if x:
    print("string starts with 'How'")
else:
    print("string does not start with 'How'")
Output:
How are you. How is everything?
['How']
string starts with 'How'
36. CONTD…
# match all lines that start with 'hello'
import re
fp = open('d:/18ec646/demo1.txt')
for x in fp:
    x = x.rstrip()
    if re.findall('^hello', x):  ## note the caret metacharacter
        print(x)
Output:
hello and welcome to python class
hello how are you?
# match all lines that start with '@'
import re
fp = open('d:/18ec646/demo5.txt')
for x in fp:
    x = x.rstrip()
    if re.findall('^@', x):  ## note the caret metacharacter
        print(x)
Output:
@abhishek,how are you
# check whether the string contains
## non-digit characters
import re
fp = open('d:/18ec646/demo5.txt')
for x in fp:
    x = x.rstrip()
    if re.findall(r"\D", x):  ## special sequence
        print(x)
Output:
from:krishna.sksj@gmail.com
to:abhishek@yahoo.com
Hello and welcome
@abhishek,how are you
37. THE SEARCH() FUNCTION
The search() function searches the string for a match, and
returns a Match object if there is a match.
If there is more than one match, only the first occurrence
of the match is returned.
If no matches are found, the value None is returned.
Here is the syntax for this function −
re.search(pattern, string, flags=0)
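The behaviour described above can be seen in a short sketch. The sample string is the one used in the findall() examples; the pattern "Why" is just an illustrative pattern that does not occur in it.

```python
import re

text = "How are you. How is everything?"

found = re.search("How", text)    # two occurrences exist, only the first is reported
missing = re.search("Why", text)  # no occurrence, so None is returned

print(found.group(), found.start())
print(missing)
```

Because None is falsy, the result can be tested directly, as in "if re.search(pattern, line):".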
39. THE SPLIT() FUNCTION
The re.split() function splits the string wherever there is a match
and returns the resulting pieces as a list of strings.
You can pass the maxsplit argument to re.split(). It is
the maximum number of splits that will occur.
If the pattern is not found, re.split() returns a list containing
the original string.
Here is the syntax for this function −
re.split(pattern, string, maxsplit=0, flags=0)
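A minimal sketch of maxsplit and of the "pattern not found" case, using an invented comma-separated string:

```python
import re

text = "a,b,c,d"  # illustrative string

all_parts = re.split(",", text)               # split at every comma
two_splits = re.split(",", text, maxsplit=2)  # stop after two splits
not_found = re.split(";", text)               # pattern absent: original string in a list

print(all_parts)
print(two_splits)
print(not_found)
```

With maxsplit=2 the rest of the string is returned unsplit as the last element.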
40. EXAMPLES on split() function:-
# split function
import re
fp = open('d:/18ec646/demo5.txt')
for x in fp:
    x = x.rstrip()
    x = re.split("@", x)
    print(x)
Output:
['from:krishna.sksj', 'gmail.com']
['to: abhishek', 'yahoo.com']
['Hello and welcome']
['', 'abhishek,how are you']
41. CONTD..
# split function
import re
fp = open('d:/18ec646/demo7.txt')
for x in fp:
    x = x.rstrip()
    x = re.split("e", x)
    print(x)
Output:
['johny johny y', 's papa']
['', 'ating sugar no papa']
['t', 'lling li', 's']
['op', 'n your mouth']
# split function
import re
fp = open('d:/18ec646/demo7.txt')
for x in fp:
    x = x.rstrip()
    x = re.split("papa", x)
    print(x)
Output:
['johny johny yes ', '']
['eating sugar no ', '']
['telling lies']
['open your mouth']
# split function
import re
fp = open('d:/18ec646/demo3.txt')
for x in fp:
    x = x.rstrip()
    x = re.split("is", x)
    print(x)
Output:
['Hello and welcome']
['Th', ' ', ' Bangalore']
['Th', ' ', ' Par', '']
['Th', ' ', ' London']
42. THE SUB() FUNCTION
The sub() function replaces the matches with the text of your
choice.
You can control the number of replacements by specifying
the count parameter.
If the pattern is not found, re.sub() returns the original string.
Here is the syntax for this function −
re.sub(pattern, repl, string, count=0, flags=0)
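The count parameter and the "pattern not found" case can be sketched as follows. The sample string is the one from the findall() examples; "Why" is an illustrative pattern that does not occur in it.

```python
import re

text = "How are you. How is everything?"

every = re.sub("How", "Where", text)              # count=0 replaces all matches
first_only = re.sub("How", "Where", text, count=1)  # replace only the first match
unchanged = re.sub("Why", "Where", text)          # no match: string returned as is

print(every)
print(first_only)
print(unchanged)
```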
43. EXAMPLES on sub() function:-
### illustration of substitute (replace)
import re
text = "How are you.How is everything?"
x = re.sub("How", "where", text)
print(x)
Output:
where are you.where is everything?
# sub function
import re
fp = open('d:/18ec646/demo3.txt')
for x in fp:
    x = x.rstrip()
    x = re.sub("This", "Where", x)
    print(x)
Output:
Hello and welcome
Where is Bangalore
Where is Paris
Where is London
44. THE MATCH() FUNCTION
If zero or more characters at the beginning of the string match
the regular expression, match() returns a corresponding Match object.
It returns None if the string does not match the pattern.
Here is the syntax for this function −
re.match(pattern, string, flags=0)
A compiled pattern object offers the same operation as
Pattern.match(string[, pos[, endpos]]); its optional pos and endpos
parameters have the same meaning as for the search() method.
45. search() Vs match()
Python offers two different primitive operations based on
regular expressions:
re.match() checks for a match only at the beginning of the string,
while re.search() checks for a match anywhere in the string.
E.g.:
# match function
import re
fp = open('d:/18ec646/demo3.txt')
for x in fp:
    x = x.rstrip()
    if re.match("This", x):
        print(x)
Output:
This is Bangalore
This is Paris
This is London
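The difference between the two functions shows up when the pattern occurs in the string but not at its start. A minimal sketch on one line similar to those in demo3.txt:

```python
import re

line = "This is Paris"

by_match = re.match("is", line)    # "is" is not at the beginning, so None
by_search = re.search("is", line)  # finds the "is" inside "This"

print(by_match)
print(by_search.start())
```

re.match("is", line) behaves like re.search("^is", line) on a single-line string.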
46. MATCH OBJECT
A Match Object is an object containing information about the
search and the result
If there is no match, the value None will be returned, instead
of the Match Object
Some of the commonly used methods and attributes of match
objects are:
match.group(), match.start(), match.end(), match.span(),
match.string
47. match.group()
The group() method returns the part of the string where
there is a match.
match.start(), match.end()
The start() method returns the index of the start of the
matched substring.
Similarly, end() returns the index just after the last character of the
matched substring.
match.span()
The span() method returns a tuple containing the start and end
index of the matched substring.
match.string
The string attribute returns the string that was passed to the function.
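These methods and attributes can be inspected on a single Match object. A small sketch on an illustrative line:

```python
import re

m = re.search("Paris", "This is Paris")

print(m.group())   # the matched text
print(m.start())   # index where the match begins
print(m.end())     # index just after the match
print(m.span())    # (start, end) as a tuple
print(m.string)    # the string that was searched
```

Slicing the original string with the span gives back the matched text: m.string[m.start():m.end()] equals m.group().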