1. MODULE 3 – PART 4
REGULAR EXPRESSIONS
By,
Ravi Kumar B N
Assistant professor, Dept. of CSE
BMSIT & M
2. INTRODUCTION
➢ A regular expression is a sequence of characters that defines a search pattern.
➢ Such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.
➢ The regular expression library “re” must be imported into our program before we can use it.
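A minimal sketch of the three uses named above (find, find and replace, input validation). The sample strings and variable names are made up for illustration, and re.sub() and re.fullmatch() are standard “re” functions that these slides do not otherwise cover.

```python
import re

text = "Order 66 shipped in 2008"   # made-up sample string

# Find: locate the first run of digits in the text.
first_number = re.search("[0-9]+", text).group()

# Find and replace: mask every digit with '#'.
masked = re.sub("[0-9]", "#", text)

# Input validation: the whole input must consist of digits only.
is_number = re.fullmatch("[0-9]+", "12345") is not None
is_number_bad = re.fullmatch("[0-9]+", "12a45") is not None

print(first_number)   # 66
print(masked)         # Order ## shipped in ####
print(is_number, is_number_bad)   # True False
```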
3. SEARCH() FUNCTION
➢ search() function: searches a string for a pattern and returns only the first occurrence that matches. This function is available in the “re” library.
➢ The caret character (^): used in regular expressions to match the beginning of a line.
➢ The dollar character ($): used in regular expressions to match the end of a line.
Example: program to match only lines where “From:” is at the beginning of the line
import re
hand = open('mbox1.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)
#Output
From:stephen Sat Jan 5 09:14:16 2008
From: louis@media.berkeley.edu Mon Jan 4 16:10:39 2008
From:zqian@umich.edu Fri Jan 4 16:10:39 2008
Contents of mbox1.txt:
From:stephen Sat Jan 5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
From: louis@media.berkeley.edu Mon Jan 4 16:10:39 2008
Subject: [sakai] svn commit:
From:zqian@umich.edu Fri Jan 4 16:10:39 2008
Return-Path: <postmaster@collab.sakaiproject.org>
✓ The call re.search('^From:', line) is equivalent to line.startswith('From:'), which uses the startswith() string method.
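The anchors can be tried on a single string without opening a file. This is a small sketch: the first line is taken from mbox1.txt, while the "Reply-From:" line is a hypothetical example added to show what the caret rules out.

```python
import re

line = "From: louis@media.berkeley.edu Mon Jan 4 16:10:39 2008"

# '^From:' matches only at the start of the line, like startswith().
starts_regex = re.search("^From:", line) is not None
starts_method = line.startswith("From:")

# '2008$' matches only at the end of the line, like endswith().
ends_regex = re.search("2008$", line) is not None
ends_method = line.endswith("2008")

# Without the caret, 'From:' may match anywhere in the line.
other = "Reply-From: someone"                         # hypothetical line
anywhere = re.search("From:", other) is not None      # matches mid-line
anchored = re.search("^From:", other) is not None     # no match at the start

print(starts_regex, starts_method)   # True True
print(ends_regex, ends_method)       # True True
print(anywhere, anchored)            # True False
```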
4. CHARACTER MATCHING IN REGULAR EXPRESSIONS
➢ The dot character (.): the most commonly used special character is the period (“dot”, or full stop), which matches any single character except a newline.
The regular expression “F..m:” would match any of the following strings, since each period in the regular expression matches any character:
“From:”, “Fxxm:”, “F12m:”, or “F!@m:”
➢ The program from the previous slide is rewritten here using the dot character; it gives the same output.
import re
hand = open('mbox1.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line):
        print(line)
#Output
From:stephen Sat Jan 5 09:14:16 2008
From: louis@media.berkeley.edu Mon Jan 4 16:10:39 2008
From:zqian@umich.edu Fri Jan 4 16:10:39 2008
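As a quick check of the dot wildcard, the sketch below tests “^F..m:” against the four strings listed on this slide plus two made-up strings ("Fm:" and "Frooom:") that have the wrong number of characters between “F” and “m”.

```python
import re

candidates = ["From:", "Fxxm:", "F12m:", "F!@m:", "Fm:", "Frooom:"]

# Each dot must match exactly one character, so only strings with
# exactly two characters between 'F' and 'm:' are accepted.
matches = [s for s in candidates if re.search("^F..m:", s)]

print(matches)   # ['From:', 'Fxxm:', 'F12m:', 'F!@m:']
```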
5. A character can be repeated any number of times using the “*” or “+” characters in a regular expression.
➢ The asterisk character (*): matches zero or more repetitions of the preceding character
➢ The plus character (+): matches one or more repetitions of the preceding character
Example: Program to match lines that start with “From:”, followed by a mail ID
import re
hand = open('mbox1.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:.+@', line):
        print(line)
#Output
From: louis@media.berkeley.edu Mon Jan 4 16:10:39 2008
From:zqian@umich.edu Fri Jan 4 16:10:39 2008
✓ The search string “^From:.+@” will successfully match lines that start with “From:”, followed by one or more characters (“.+”), followed by an at-sign. The “.+” wildcard matches all the characters between the colon character and the at-sign.
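The difference between “+” and “*” shows up when nothing stands between the colon and the at-sign. A small sketch follows; the line "From:@umich.edu" is a made-up edge case, not taken from mbox1.txt.

```python
import re

lines = [
    "From:zqian@umich.edu",      # characters between ':' and '@'
    "From:@umich.edu",           # made-up line: nothing between ':' and '@'
    "From:stephen Sat Jan 5",    # no at-sign at all
]

# '.+' needs at least one character before the at-sign.
plus_matches = [l for l in lines if re.search("^From:.+@", l)]

# '.*' also accepts zero characters before the at-sign.
star_matches = [l for l in lines if re.search("^From:.*@", l)]

print(plus_matches)   # ['From:zqian@umich.edu']
print(star_matches)   # ['From:zqian@umich.edu', 'From:@umich.edu']
```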
6. ➢ Non-whitespace character (\S): matches one non-whitespace character
➢ findall() function: searches for all occurrences that match a given pattern and returns them as a list of strings.
In contrast, the search() function returns only the first occurrence that matches the specified pattern.
Example1: Program returns a list of all of the strings that look like email addresses in a given line.
import re
s = 'Hello from csev@umich.edu to cwen@iupui.edu about the meeting @2PM'
lst = re.findall(r'\S+@\S+', s)
print(lst)
#output
['csev@umich.edu', 'cwen@iupui.edu']
# The same program using search() displays only the first mail ID, i.e. the first matching string
import re
s = 'Hello from csev@umich.edu to cwen@iupui.edu about the meeting @2PM'
lst = re.search(r'\S+@\S+', s)
print(lst)
#output
<re.Match object; span=(11, 25), match='csev@umich.edu'>
The regular expression '\S+@\S+' matches substrings that have at least one non-whitespace character, followed by an at-sign, followed by at least one more non-whitespace character.
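search() returns a match object rather than a string (or None when nothing matches). A short sketch of pulling the text and position out of that object with its group() and span() methods; the variable names are illustrative.

```python
import re

s = 'Hello from csev@umich.edu to cwen@iupui.edu about the meeting @2PM'

m = re.search(r'\S+@\S+', s)

# group() gives the matched text, span() its start and end positions.
first = m.group() if m else None
span = m.span() if m else None

# When nothing matches, search() returns None instead of a match object.
missing = re.search(r'\S+@\S+', 'no address here')

print(first)     # csev@umich.edu
print(span)      # (11, 25)
print(missing)   # None
```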
7. Example2: Program returns a list of all of the strings that look like email addresses in a given file.
import re
hand = open('mbox1.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'\S+@\S+', line)
    if len(x) > 0:
        print(x)
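#Output
['<postmaster@collab.sakaiproject.org>']
['louis@media.berkeley.edu']
['From:zqian@umich.edu']
['<postmaster@collab.sakaiproject.org>']
(The third result includes “From:” because that line of mbox1.txt has no space after the colon, so \S+ also matches “From:zqian”.)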
Some of our email addresses have incorrect characters like “<” or “;” at the beginning or end. We are only interested in the portion of the string that starts and ends with a letter or a number. To get the proper output we use square brackets in the pattern.
➢ Square brackets “[]”: indicate a set of acceptable characters, any one of which may match at that position.
Example: [a-z] matches a single lowercase letter
[A-Z] matches a single uppercase letter
[a-zA-Z] matches a single lowercase or uppercase letter
[a-zA-Z0-9] matches a single lowercase letter, uppercase letter, or digit
8. [amk] matches 'a', 'm', or 'k'
[(+*)] matches any of the literal characters '(', '+', '*', or ')'
[0-5][0-9] matches all two-digit numbers from 00 to 59
➢ Characters that are not within a range can be matched by complementing the set.
If the first character of the set is '^', any character that is not in the set will be matched.
For example,
[^5] will match any character except '5'
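The sets listed above can be tried directly with findall(). This is only a sketch; the test strings are made up to exercise each set.

```python
import re

text = "kayak+map (x*y) 42"    # made-up sample string

# [amk]: any one of the letters a, m or k.
letters = re.findall("[amk]", text)

# [(+*)]: inside brackets these characters are literal, not special.
symbols = re.findall("[(+*)]", text)

# [0-5][0-9]: two-digit numbers from 00 to 59 (60 and 99 are rejected).
minutes = re.findall("[0-5][0-9]", "07 45 59 60 99")

# [^5]: any character except '5'.
not_five = re.findall("[^5]", "1525")

print(letters)    # ['k', 'a', 'a', 'k', 'm', 'a']
print(symbols)    # ['+', '(', '*', ')']
print(minutes)    # ['07', '45', '59']
print(not_five)   # ['1', '2']
```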
Ex: Program returns a list of all email addresses in proper format.
import re
hand = open('mbox.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0:
        print(x)
#output
['postmaster@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['zqian@umich.edu']
['postmaster@collab.sakaiproject.org']
[a-zA-Z0-9]\S*@\S*[a-zA-Z] : matches substrings that start with a single lowercase letter, uppercase letter, or number “[a-zA-Z0-9]”, followed by zero or more non-blank characters “\S*”, followed by an at-sign, followed by zero or more non-blank characters “\S*”, followed by an uppercase or lowercase letter “[a-zA-Z]”.
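To see what the stricter pattern buys, the sketch below runs both patterns on one line. The line is modelled on the Return-Path line in mbox1.txt, with a trailing “;” added as an assumption to show trimming at both ends.

```python
import re

# Sample line; the trailing ';' is added here for illustration.
line = 'Return-Path: <postmaster@collab.sakaiproject.org>;'

# Loose pattern: keeps the surrounding '<' and '>;'.
loose = re.findall(r'\S+@\S+', line)

# Strict pattern: must start with a letter or digit and end with a letter.
strict = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)

print(loose)    # ['<postmaster@collab.sakaiproject.org>;']
print(strict)   # ['postmaster@collab.sakaiproject.org']
```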
9. SEARCH AND EXTRACT
Example1: Find numbers on lines that start with the string “X-”, such as: X-DSPAM-Confidence: 0.8475
import re
hand = open('mbox2.txt')
for line in hand:
    line = line.rstrip()
    if re.search(r'^X\S*: [0-9.]+', line):
        print(line)
#Output
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.9245
➢ Parentheses “()” in a regular expression: used to extract only a portion of the substring that matches the regular expression.
import re
hand = open('mbox2.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X\S*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)
#Output
['0.8475']
['0.9245']
mbox2.txt
From: stephen.marquard@uct.ac.za
Subject: [sakai] svn commit: r39772 - content/branches/sakai_2-5-x/conten
impl/impl/src/java/org
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
Content-Type: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Sat Jan 5 09:14:16 2008
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.9245
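The output of the search() program contains the entire line, but we only want to extract the numbers from lines that have the above syntax. The findall() program does this by placing parentheses around the number part of the pattern.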
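A sketch of search versus extract on in-memory lines, so it runs without mbox2.txt. The three lines are copied from the file listing; converting the extracted text with float() is an added illustration, not part of the original programs.

```python
import re

lines = [
    'X-DSPAM-Result: Innocent',
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.9245',
]

whole = []    # full matched text (no parentheses in the pattern)
values = []   # only the part inside the parentheses, as numbers

for line in lines:
    whole.extend(re.findall(r'^X\S*: [0-9.]+', line))
    found = re.findall(r'^X\S*: ([0-9.]+)', line)
    if found:
        # findall() returned just the captured digits as a string.
        values.append(float(found[0]))

print(whole)    # ['X-DSPAM-Confidence: 0.8475', 'X-DSPAM-Probability: 0.9245']
print(values)   # [0.8475, 0.9245]
```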
10. Example2: Program to print the hour of the day at which each mail was received
import re
hand = open('mbox1.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('^From.* ([0-3][0-9]):', line)
    if len(x) > 0:
        print(x)
#Output
['09']
['16']
['16']
12. ESCAPE CHARACTER
➢ The escape character (backslash “\”) is a metacharacter in regular expressions. It allows special characters to be used without invoking their special meaning.
If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign has its special meaning.
In Python, patterns that contain a backslash are best written as raw strings (r'...') so that the backslash reaches the “re” library unchanged.
For example, we can find money amounts with the following regular expression.
>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall(r'\$[0-9.]+', x)
>>> y
['$10.00']
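The effect of the backslash can be checked by running the same search with and without it. A sketch on a made-up sentence follows; re.escape() is a standard “re” helper, not covered in these slides, that adds the backslashes automatically.

```python
import re

text = 'The sum 1+1=2 costs $10.00'    # made-up sample string

# Unescaped '+' means "one or more 1s", so the literal text is not found.
unescaped_plus = re.findall('1+1=2', text)

# Escaped '\+' matches a literal plus sign.
escaped_plus = re.findall(r'1\+1=2', text)

# Unescaped '$' means end of line, so nothing can follow it.
unescaped_dollar = re.findall('$[0-9.]+', text)

# Escaped '\$' matches a literal dollar sign.
escaped_dollar = re.findall(r'\$[0-9.]+', text)

# re.escape() builds the escaped pattern from the literal text.
auto_escaped = re.findall(re.escape('1+1=2'), text)

print(unescaped_plus, escaped_plus)       # [] ['1+1=2']
print(unescaped_dollar, escaped_dollar)   # [] ['$10.00']
print(auto_escaped)                       # ['1+1=2']
```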
13. SUMMARY
Character       Meaning
^               Matches the beginning of the line
$               Matches the end of the line
.               Matches any character (a wildcard)
\s              Matches a whitespace character
\S              Matches a non-whitespace character (opposite of \s)
*               Applies to the immediately preceding character and indicates to match zero or more of the preceding character(s)
*?              Applies to the immediately preceding character and indicates to match zero or more of the preceding character(s) in “non-greedy mode”
+               Applies to the immediately preceding character and indicates to match one or more of the preceding character(s)
+?              Applies to the immediately preceding character and indicates to match one or more of the preceding character(s) in “non-greedy mode”
[aeiou]         Matches a single character as long as that character is in the specified set. In this example, it would match “a”, “e”, “i”, “o”, or “u”, but no other characters.
[a-z0-9]        You can specify ranges of characters using the minus sign. This example is a single character that must be a lowercase letter or a digit.
14. Character   Meaning
[^A-Za-z]       When the first character in the set notation is a caret, it inverts the logic. This example matches a single character that is anything other than an uppercase or lowercase letter.
( )             When parentheses are added to a regular expression, they are ignored for the purpose of matching, but allow you to extract a particular subset of the matched string rather than the whole string when using findall()
\b              Matches the empty string, but only at the start or end of a word
\B              Matches the empty string, but not at the start or end of a word
\d              Matches any decimal digit; equivalent to the set [0-9]
\D              Matches any non-digit character; equivalent to the set [^0-9]
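Several entries in the summary (non-greedy mode, \d, \b) have no example in the earlier slides. A minimal sketch on made-up strings follows.

```python
import re

s = 'From: csev@umich.edu: sent 3 files at 16:10'    # made-up sample line

# Greedy '.+' expands as far as possible: up to the LAST colon.
greedy = re.findall('^F.+:', s)

# Non-greedy '.+?' stops as early as possible: at the FIRST colon.
lazy = re.findall('^F.+?:', s)

# \d+ matches runs of decimal digits.
digits = re.findall(r'\d+', s)

# \b marks a word boundary, so 'at' inside 'cat' or 'bat' is skipped.
whole_word = re.findall(r'\bat\b', 'cat at bat')

print(greedy)       # ['From: csev@umich.edu: sent 3 files at 16:']
print(lazy)         # ['From:']
print(digits)       # ['3', '16', '10']
print(whole_word)   # ['at']
```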
15. ASSIGNMENT
1) Write a Python program to check the validity of a password. In this program, we take a password made up of alphanumeric characters along with special characters, and check whether the password is valid with the help of a few conditions.
Primary conditions for password validation:
1. Minimum 8 characters.
2. The alphabets must be between [a-z].
3. At least one alphabet should be upper case [A-Z].
4. At least 1 number or digit between [0-9].
5. At least 1 character from [ _ or @ or $ ].
2) Write a pattern for each of the following:
Pattern to extract lines starting with the word From (or from) and ending with edu.
Pattern to extract lines ending with any digit.
Pattern to match lines that start with upper case letters and end with digits.
Search for the first white-space character in the string and display its position.
Replace every white-space character with the number 9; consider the sample text txt = "The rain in Spain".