The document discusses file handling and regular expressions in Python programming. It covers opening, reading, and writing files in both text and binary modes. It also describes parsing text files using built-in functions and regular expressions. Regular expressions topics covered include characters, character classes, quantifiers, grouping, capturing, assertions, and flags. The document provides examples of using the re module to search and manipulate strings using regular expression patterns.
File handling & regular expressions in python programming
1. File Handling & Regular
Expressions in Python
Programming
Presented By
Dr. Srinivas Narasegouda,
Assistant Professor,
Jyoti Nivas College Autonomous,
Bangalore – 95.
2. File Handling & Regular Expressions in Python
Programming
File Handling:
Writing and Reading Binary Data, Writing and Parsing Text Files,
Random Access Binary Files.
Regular Expressions:
Python‘s Regular Expression Language: Characters and Character
Classes, Quantifiers, Grouping and Capturing, Assertions and Flags,
The Regular Expression Module.
3. File Handling
Python: How to read and write files?
Working with files consists of the following three steps:
1. Open a file
2. Perform read or write operation
3. Close the file
Types of files
1. Text files
2. Binary files
A text file is simply a file that stores sequences of characters using an encoding such as
UTF-8 or Latin-1, whereas in a binary file the data is stored in the same format as in
computer memory.
Text files: Python source code, HTML file, text file, markdown file etc.
Binary files: executable files, images, audio etc.
It is important to note that on disk both types of files are stored as a sequence of 1s
and 0s. The only difference is that when a text file is opened, its data is decoded using
the same encoding scheme it was written with. In the case of binary files no
such decoding happens.
4. File Handling
Python: How to read and write files?
Binary files
Binary formats are usually very fast to save and load and they can be very compact.
Binary data doesn’t need parsing since each data type is stored using its natural
representation.
Binary data is not human readable or editable, and without knowing the format in detail it is
not possible to create separate tools to work with binary data.
Text files
Text formats are human readable and editable, and this can make text files easier to process
with separate tools or to change using a text editor.
Text formats can be tricky to parse and it is not always easy to give good error messages if a
text file’s format is broken (e.g., by careless editing).
5. File Handling
Opening Files in Python:
Python has a built-in open( ) function to open a file. This
function returns a file object, also called a handle, as it is used to
read or modify the file accordingly.
We can specify the mode while opening a file.
In mode, we specify whether we want to read r, write w or
append a to the file. We can also specify if we want to open the
file in text mode or binary mode.
The default is reading in text mode. In this mode, we get strings
when reading from the file.
On the other hand, binary mode returns bytes and this is the mode
to be used when dealing with non-text files like images or
executable files.
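The difference between the two modes can be sketched as follows. This is a minimal illustration, not part of the original slides; the file name is hypothetical and the file is created in a temporary directory.

```python
import os
import tempfile

# Hypothetical file name, created in a temporary directory for the demo.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Write text with an explicit encoding (the with statement closes the file).
with open(path, "w", encoding="utf-8") as f:
    f.write("café")

# Text mode (the default): the bytes are decoded and we get a str.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: no decoding happens and we get the raw bytes.
with open(path, "rb") as f:
    raw = f.read()

print(type(text), len(text))  # a str of 4 characters
print(type(raw), len(raw))    # 5 bytes, because "é" takes 2 bytes in UTF-8
```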
6. File Handling
Opening Files in Python:
file_object = open("path_of_file_with_extension", "mode")
Mode  Description
r     Opens a file for reading. (default)
w     Opens a file for writing. Creates a new file if it does not exist or truncates the file if it exists.
x     Opens a file for exclusive creation. If the file already exists, the operation fails.
a     Opens a file for appending at the end of the file without truncating it. Creates a new file if it does not exist.
t     Opens in text mode. (default)
b     Opens in binary mode.
+     Opens a file for updating (reading and writing).
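A short sketch of how the x, a and w modes behave in practice. The file name is an assumption made for the demo, and the file lives in a temporary directory.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical file name

with open(path, "x") as f:      # exclusive creation: succeeds, the file is new
    f.write("first\n")

try:
    open(path, "x")             # fails, because the file now exists
    second_create_failed = False
except FileExistsError:
    second_create_failed = True

with open(path, "a") as f:      # append keeps the existing content
    f.write("second\n")

with open(path, "r") as f:
    after_append = f.read()

with open(path, "w") as f:      # write mode truncates before writing
    f.write("third\n")

with open(path) as f:           # the mode defaults to "rt"
    after_write = f.read()
```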
7. File Handling
Opening Files in Python:
Writing to Files in Python
file_obj = open("fileoperations.txt", "w")  # Write
file_obj = open("fileoperations.txt", "a")  # Append at the end of the file
Reading from Files in Python
file_obj = open("fileoperations.txt", "r")
Closing the file
file_obj.close()
9. File Handling
Writing and Reading Binary Data:
Binary formats, even without compression, usually take up
the least amount of disk space and are usually the fastest to
save and load.
Easiest of all is to use pickle.
10. File Handling
Writing and Reading Binary Data:
Pickle with Optional Compression
The standard pickle module provides an excellent default tool for
serializing most Python objects and storing them to disk.
The Python standard library also includes a broad set of data compression
modules.
The third-party compress_pickle package provides an interface to the standard
pickle.dump, pickle.load, pickle.dumps and pickle.loads functions, but wraps
them in order to direct the serialized data through one of the standard
compression packages.
This way you can seamlessly serialize data to disk or to any file-like
object in a compressed way.
11. File Handling
Writing and Reading Binary Data:
Pickles with Optional Compression
pickle.dump: Write a pickled representation of an object to an open file object (opened in binary mode).
pickle.load: Read bytes from an open file object and interpret them as a pickle
data stream, reconstructing and returning the original object hierarchy.
pickle.dumps: Return the pickled representation of the object as a bytes object,
instead of writing it to a file.
pickle.loads: Read a pickled object hierarchy from a bytes object.
Program: binary_pickle.py
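The program binary_pickle.py is not reproduced in the slides. The following is only a sketch of the four functions above, using the standard gzip module for the optional compression instead of compress_pickle; the sample data and file name are assumptions.

```python
import gzip
import os
import pickle
import tempfile

data = {"name": "bike", "quantity": 3, "prices": [199.5, 249.0]}  # sample object
path = os.path.join(tempfile.mkdtemp(), "data.pkl.gz")            # hypothetical name

# pickle.dump writes to any binary file-like object; gzip.open supplies one
# that compresses on the fly.
with gzip.open(path, "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

# pickle.load reads the stream back and rebuilds the object.
with gzip.open(path, "rb") as f:
    restored = pickle.load(f)

# dumps/loads do the same in memory, using bytes instead of a file.
blob = pickle.dumps(data)
same = pickle.loads(blob)
```

Unpickling can execute arbitrary code, so pickle.load and pickle.loads should only be used on data from a trusted source.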
12. File Handling
Writing and Reading Binary Data:
Raw Binary Data with Optional Compression
f = open("binfile.bin", "wb")
num = [5, 10, 15, 20, 25]
arr = bytearray(num)  # each integer (0-255) becomes one byte
f.write(arr)
f.close()
When creating custom binary file formats it is wise to create a magic number to
identify your file type, and a version number to identify the version of the file
format in use.
MAGIC = b"AIB\x00"
FORMAT_VERSION = b"\x00\x01"
To write and read raw binary data we must have some means of converting
Python objects to and from suitable binary representations. Most of the
functionality we need is provided by the struct module.
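As a rough sketch of how the struct module fits with a magic number and a format version, the example below packs one record after a header and reads it back. The record layout (id, price, quantity) is a hypothetical one chosen for illustration.

```python
import struct

MAGIC = b"AIB\x00"            # magic number identifying the file type
FORMAT_VERSION = b"\x00\x01"  # version of the file format

# Hypothetical record layout: little-endian unsigned 32-bit id,
# 64-bit float price, unsigned 16-bit quantity.
RECORD = struct.Struct("<IdH")

header = MAGIC + FORMAT_VERSION
payload = RECORD.pack(42, 199.5, 7)
blob = header + payload

# Reading back: check the magic number and version before unpacking.
if blob[:4] != MAGIC or blob[4:6] != FORMAT_VERSION:
    raise ValueError("unrecognized file type or version")
record_id, price, quantity = RECORD.unpack(blob[6:])
```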
13. File Handling
Writing and Parsing Text Files:
Writing Text
file_obj = open("fileoperations.txt","w")
str_data = "Text to be written into the file"
file_obj.write(str_data)
file_obj.close( )
file_obj = open("fileoperations.txt","r")
print(file_obj.read()) # Reads till the end of the file
file_obj.close( )
14. File Handling
Writing and Parsing Text Files:
The main methods of a file object are given below:
Method                       Description
file.close()                 Closes the file.
file.flush()                 Flushes the internal buffer.
next(file)                   Returns the next line from the file each time it is called.
file.read([size])            Reads at most size characters (bytes in binary mode); reads to EOF if size is omitted.
file.readline()              Reads one entire line from the file.
file.readlines()             Reads until EOF and returns a list containing the lines.
file.seek(offset[, whence])  Sets the file's current position.
file.tell()                  Returns the file's current position.
file.write(str)              Writes a string to the file and returns the number of characters written.
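A few of these methods in action, as a small sketch on a temporary file (the file name and contents are assumptions):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines.txt")  # hypothetical file name

with open(path, "w") as f:
    written = f.write("alpha\nbeta\ngamma\n")  # write returns a character count

with open(path) as f:
    first = f.readline()     # one line, including its newline
    rest = f.readlines()     # the remaining lines as a list
    f.seek(0)                # move back to the start of the file
    head = f.read(5)         # at most 5 characters
    position = f.tell()      # current position after the read
```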
15. File Handling
Writing and Parsing Text Files:
Parsing Text
If your data is in a standard format or close enough, then there is
probably an existing package that you can use to read your data with
minimal effort.
split( )
pandas?
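For simple line-oriented formats, str.split( ) is often enough. The sketch below assumes a hypothetical "key = value" format with # comments; neither the format nor the field names come from the slides.

```python
# Hypothetical input: one "key = value" pair per line, "#" starts a comment.
raw = """\
# bike stock
name = Roadster
quantity = 12
price = 249.50
"""

record = {}
for line_number, line in enumerate(raw.splitlines(), start=1):
    line = line.strip()
    if not line or line.startswith("#"):
        continue                          # skip blank lines and comments
    parts = line.split("=", 1)            # split on the first "=" only
    if len(parts) != 2:
        raise ValueError(f"line {line_number}: expected 'key = value'")
    key, value = parts
    record[key.strip()] = value.strip()

# Convert the fields that should be numbers.
record["quantity"] = int(record["quantity"])
record["price"] = float(record["price"])
```

Reporting the line number in the error is one way to give a useful message when a text file has been damaged by careless editing.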
16. File Handling
Writing and Parsing Text Files:
Parsing Text Using Regular
Expressions
Program text_parse_re.py
BeautifulSoup?
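The program text_parse_re.py is not shown in the slides. As an illustrative sketch only, the following parses a hypothetical log-like format with a compiled regex and named groups, keeping the lines that do not fit the format separately.

```python
import re

# Hypothetical line format: "<date> <LEVEL> <message>".
LINE_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<level>[A-Z]+)\s+(?P<message>.*)"
)

text = """\
2020-01-15 INFO started
2020-01-15 ERROR disk full
not a valid line
"""

parsed, rejected = [], []
for line in text.splitlines():
    m = LINE_RE.fullmatch(line)       # the whole line must fit the format
    if m:
        parsed.append(m.groupdict())  # dict keyed by the group names
    else:
        rejected.append(line)
```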
17. File Handling
Random Access Binary Files.
Earlier we worked on the basis that all of a program’s data was read into
memory in one go, processed, and then all written out in one go.
Modern computers have so much RAM that this is a perfectly viable
approach, even for large data sets. However, in some situations holding
the data on disk and just reading the bits we need and writing back
changes might be a better solution.
The disk-based random access approach is most easily done using a
key–value database (a “DBM”), or a full SQL database.
18. File Handling
Random Access Binary Files.
A Generic BinaryRecordFile Class
Instances of this class represent a generic readable/writable binary
file, structured as a sequence of fixed length records.
BikeStock class which holds a collection of BikeStock.Bike objects
as records in a BinaryRecordFile.
http://www.cs.utsa.edu/~wagner/python/summerfield/py3/BinaryRecordFile.py
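The idea behind fixed-length records can be sketched without the full BinaryRecordFile class: because every record has the same size, record i starts at byte i * record_size and can be read or rewritten with seek( ). The record layout and helper names below are hypothetical and are not taken from the linked file.

```python
import os
import struct
import tempfile

# Hypothetical fixed-length record: 8-byte name, unsigned 32-bit quantity.
RECORD = struct.Struct("<8sI")
path = os.path.join(tempfile.mkdtemp(), "stock.bin")

def write_record(f, index, name, quantity):
    f.seek(index * RECORD.size)               # jump straight to the record
    f.write(RECORD.pack(name.encode("ascii"), quantity))

def read_record(f, index):
    f.seek(index * RECORD.size)
    name, quantity = RECORD.unpack(f.read(RECORD.size))
    return name.rstrip(b"\x00").decode("ascii"), quantity

with open(path, "w+b") as f:                  # binary mode, read and write
    for i, (name, qty) in enumerate([("road", 5), ("city", 9), ("trail", 2)]):
        write_record(f, i, name, qty)
    write_record(f, 1, "city", 10)            # update one record in place
    second = read_record(f, 1)
    third = read_record(f, 2)
```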
19. Regular Expressions:
A regular expression is a compact notation for representing a
collection of strings.
What makes regular expressions so powerful is that a single regular
expression can represent an unlimited number of strings—providing
they meet the regular expression’s requirements.
Regular expressions (which we will mostly call “regexes” from now
on) are defined using a mini-language that is completely different
from Python—but Python includes the re module through which we
can seamlessly create and use regexes.
20. Regular Expressions:
Regexes are used for five main purposes:
1. Parsing: identifying and extracting pieces of text that match certain
criteria—regexes are used for creating ad hoc parsers and also by traditional
parsing tools.
2. Searching: locating substrings that can have more than one form, for
example, finding any of “pet.png”, “pet.jpg”, “pet.jpeg”, or “pet.svg” while
avoiding “carpet.png” and similar.
3. Searching and replacing: replacing everywhere the regex matches with a
string, for example, finding “bicycle” or “human powered vehicle” and
replacing either with “bike”.
4. Splitting strings: splitting a string at each place the regex matches, for
example, splitting everywhere colon-space or equals (“: ” or “=”) occurs.
5. Validation: checking whether a piece of text meets some criteria, for
example, contains a currency symbol followed by digits.
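The examples in this list can be tried with the re module roughly as follows. The sample strings and the set of currency symbols are assumptions made for the sketch.

```python
import re

# Parsing/searching: \b keeps "carpet.png" from matching.
names = "pet.png carpet.png pet.jpeg pet.svg"
found = re.findall(r"\bpet\.(?:png|jpe?g|svg)\b", names)

# Searching and replacing.
replaced = re.sub(r"bicycle|human powered vehicle", "bike",
                  "a bicycle is a human powered vehicle")

# Splitting at colon-space or equals.
parts = re.split(r": |=", "name: Roadster=red")

# Validation: a currency symbol followed by digits (symbols are assumed).
valid = bool(re.fullmatch(r"[$€£]\d+", "$250"))
invalid = bool(re.fullmatch(r"[$€£]\d+", "250$"))

print(found)     # ['pet.png', 'pet.jpeg', 'pet.svg']
print(replaced)  # a bike is a bike
print(parts)     # ['name', 'Roadster', 'red']
```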
21. Regular Expressions: Python‘s Regular Expression
Language
1. Characters and Character Classes: match individual characters or
groups of characters, for example, match a, or match b, or match
either a or b.
2. Quantifiers: match once, or match at least once, or match as many
times as possible.
3. Grouping and Capturing: shows how to group subexpressions and
how to capture matching text
4. Assertions and Flags: shows how to use the language’s assertions
and flags to affect how regular expressions work.
22. Regular Expressions: Python‘s Regular Expression
Language
Characters and Character Classes:
The simplest expressions are just literal characters, such as a or 5, and if no
quantifier is explicitly given it is taken to be “match one occurrence”.
For example, the regex tune consists of four expressions, each implicitly
quantified to match once, so it matches one t followed by one u followed
by one n followed by one e, and hence matches the substring tune in the
string attuned.
import re
s = 'attuned'
if re.search('tune', s):
    print("Match found")
else:
    print("Match not found")
23. Regular Expressions: Python‘s Regular Expression
Language
Characters and Character Classes:
Although most characters can be used as literals, some are “special
characters”: these are symbols in the regex language and so must be
escaped by preceding them with a backslash (\) to use them as literals.
The special characters are \.^$?+*{}[]()|.
Most of Python’s standard string escapes can also be used
within regexes, for example, \n for newline and \t for tab.
24. Regular Expressions: Python‘s Regular Expression Language
Characters and Character Classes:
In many cases, rather than matching one particular character we want to
match any one of a set of characters. This can be achieved by using a
character class—one or more characters enclosed in square brackets.
A character class is an expression, and like any other expression, if not
explicitly quantified it matches exactly one character (which can be any of
the characters in the character class). For example, the regex r[ea]d
matches both red and radar, but not read.
import re
print(re.search('r[ea]d', 'red')) # <re.Match object; span=(0, 3), match='red'>
print(re.search('r[ea]d', 'radar')) # <re.Match object; span=(0, 3), match='rad'>
print(re.search('r[ea]d', 'read')) # None
print(re.search('[0123456789]', '1')) # <re.Match object; span=(0, 1), match='1'>
print(re.search('[0-9]', '2')) # <re.Match object; span=(0, 1), match='2'>
25. Regular Expressions: Python‘s Regular Expression Language
Characters and Character Classes:
Character Set Range: It is possible to also use the range of a
character. This is done by leveraging the hyphen symbol (-) between
two related characters; for example, to match any lowercase letter we
can use [a-z]. Likewise, to match any single digit we can define the
character set [0-9].
import re
txt = """
The first season of Indian Premier League (IPL) was played in 2008. The second season was played in 2009 in
South Africa. Last season was played in 2018 and won by Chennai Super Kings (CSK).CSK won the title in
2010 and 2011 as well. Mumbai Indians (MI) has also won the title 3 times in 2013, 2015 and 2017 and RCB
will win IPL in the year 3000 because ee sala cup namde."""
pattern = re.compile("[0-9][0-9][0-9][0-9]")
print(pattern.findall(txt)) # OUTPUT ['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017', '3000']
pattern = re.compile("[^aeiou]")
print(pattern.findall(txt)) # OUTPUT: every character other than the lowercase vowels a, e, i, o, u (including spaces, digits, punctuation and newlines)
26. Regular Expressions: Python‘s Regular Expression
Language
Quantifiers:
Quantifiers are the mechanisms to define how a character, metacharacter, or
character set can be repeated.
Symbol Quantification of previous character
? Optional (0 or 1 repetition)
* Zero or more times
+ One or more times
{m,n} Between m and n times
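A quick sketch of each quantifier in the table. The last two lines also show the non-greedy form (a quantifier followed by ?), which is not listed in the table above but is part of Python's regex syntax.

```python
import re

optional = re.findall(r"colou?r", "color colour")      # ? : "u" is optional
zero_or_more = re.findall(r"go*d", "gd god good")      # * : zero or more "o"
one_or_more = re.findall(r"go+d", "gd god good")       # + : at least one "o"
between = re.findall(r"\b\d{2,3}\b", "7 42 365 2024")  # {2,3} : 2 to 3 digits

# Quantifiers are greedy by default; a trailing ? makes them match minimally.
greedy = re.findall(r"<.+>", "<a><b>")
minimal = re.findall(r"<.+?>", "<a><b>")

print(optional)      # ['color', 'colour']
print(zero_or_more)  # ['gd', 'god', 'good']
print(one_or_more)   # ['god', 'good']
print(between)       # ['42', '365']
print(greedy)        # ['<a><b>']
print(minimal)       # ['<a>', '<b>']
```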
27. Regular Expressions: Python‘s Regular Expression
Language
Grouping and Capturing:
In practical applications, we often need regexes that can match any one of
two or more alternatives, and we often need to capture the match or some
part of the match for further processing.
Also, we sometimes want a quantifier to apply to several expressions. All of
these goals can be achieved by grouping with parentheses ( ); and, in the
case of alternatives, using alternation with the vertical bar (|).
For example, the regex aircraft|airplane|jet will match any text that
contains aircraft or airplane or jet.
The same objective can be achieved using the regex air(craft|plane)|jet.
Here, the parentheses are used to group expressions, so we have two outer
expressions, air(craft|plane) and jet. The first of these has an inner
expression, craft|plane, and because this is preceded by air, the first outer
expression can match only aircraft or airplane.
28. Regular Expressions: Python‘s Regular Expression Language
Grouping and Capturing:
import re
print(re.search('aircraft|airplane|jet', 'airplane'))
print(re.search('air(craft|plane)|jet', 'airplane'))
print(re.search('air(craft|plane)|jet', 'aircraft'))
print(re.search('air(craft|plane)|jet', 'jet'))
29. Regular Expressions: Python‘s Regular Expression Language
Grouping and Capturing:
import re
print(re.search('aircraft|airplane|jet', 'airplane'))
print(re.search('air(craft|plane)|jet', 'jet'))
Parentheses in air(craft|plane)|jet serve two different purposes: grouping expressions
and capturing the text that matches an expression.
We'll use the term group to refer to a grouped expression whether it captures or not, and
capture and capture group to refer to a captured group.
If we used the regex (aircraft|airplane|jet), it not only would match any of the three
expressions, but would capture whichever one was matched for later reference.
Compare this with the regex (air(craft|plane)|jet), which has two captures if the first
expression matches (aircraft or airplane as the first capture and craft or plane as the
second capture), and one capture if the second expression matches (jet). We can switch
off the capturing effect by following an opening parenthesis with ?: like this:
print(re.search('(air(?:craft|plane)|jet)', 'jet'))
This will have only one capture if it matches (aircraft or airplane or jet).
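The number of captures in each variant can be checked with the match object's groups( ) method. The sample strings below are assumptions; the patterns are the ones discussed on this slide.

```python
import re

# Two capture groups: the outer alternative and the inner craft|plane.
m1 = re.search(r"(air(craft|plane)|jet)", "an airplane landed")
# Same regex, second alternative: the inner group does not take part.
m2 = re.search(r"(air(craft|plane)|jet)", "a jet landed")
# (?: ... ) groups without capturing, leaving a single capture.
m3 = re.search(r"(air(?:craft|plane)|jet)", "an aircraft landed")

print(m1.groups())  # ('airplane', 'plane')
print(m2.groups())  # ('jet', None)
print(m3.groups())  # ('aircraft',)
```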
30. Regular Expressions: Python‘s Regular Expression
Language
Assertions and Flags:
One problem that affects many of the regexes we've examined so far is that they
can match more or different text than we intended.
For example, regex aircraft|airplane|jet will match waterjet and jetski as well
as jet.
This kind of problem can be solved by using assertions.
An assertion doesn't match any text, but instead says something about the
text at the point where the assertion occurs.
31. Regular Expressions: Python‘s Regular Expression
Language
Assertions and Flags:
One assertion is \b (word boundary), which asserts that the character that
precedes it must be a "word" character (\w) and the character that follows it must be a
non-"word" character (\W), or vice versa. For example, although the regex jet can match
twice in the text “the jet and jetski are noisy”, the regex \bjet\b matches only once.
import re
print(re.search('aircraft|airplane|jet', 'the waterjet'))
#OUTPUT <re.Match object; span=(9, 12), match='jet'>
print(re.search(r'\baircraft\b|\bairplane\b|\bjet\b', 'waterjet'))
#OUTPUT None
print(re.search(r'\baircraft\b|\bairplane\b|\bjet\b', 'waterjet and jet'))
#OUTPUT <re.Match object; span=(13, 16), match='jet'>
print(re.search(r'\b(?:aircraft|airplane|jet)\b', 'waterjet and jet'))
#OUTPUT <re.Match object; span=(13, 16), match='jet'>
32. Regular Expressions: Python‘s Regular Expression Language
Symbol   Meaning
^        Matches at the start; also matches after each newline with the re.MULTILINE flag.
$        Matches at the end; also matches before each newline with the re.MULTILINE flag.
\A       Matches at the start.
\b       Matches at a "word" boundary, influenced by the re.ASCII flag. Inside a character class, this is the escape for the backspace character.
\B       Matches at a non-"word" boundary, influenced by the re.ASCII flag.
\Z       Matches at the end.
(?=e)    Matches if the expression e matches at this assertion but doesn't advance over it; called lookahead or positive lookahead.
(?!e)    Matches if the expression e doesn't match at this assertion and doesn't advance over it; called negative lookahead.
(?<=e)   Matches if the expression e matches immediately before this assertion; called positive lookbehind.
(?<!e)   Matches if the expression e doesn't match immediately before this assertion; called negative lookbehind.
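The four lookaround assertions can be sketched as follows. The sample strings are assumptions chosen so that each assertion selects a different match.

```python
import re

text = "the jet and jetski are noisy"

# Positive lookahead: "jet" only where "ski" follows.
with_ski = [m.start() for m in re.finditer(r"jet(?=ski)", text)]
# Negative lookahead: "jet" only where "ski" does not follow.
without_ski = [m.start() for m in re.finditer(r"jet(?!ski)", text)]

prices = "cost $250, weight 75kg"

# Positive lookbehind: digits immediately after a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", prices)
# Negative lookbehind: numbers that do not follow a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

print(with_ski, without_ski)           # [12] [4]
print(after_dollar, not_after_dollar)  # ['250'] ['75']
```

In every case the assertion itself consumes no text: the matches contain only jet or the digits.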
33. Regular Expressions: Assertion and flags:
Sr. No.  Modifier  Description
1        re.I      Performs case-insensitive matching.
2        re.L      Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B). Use locale settings for byte patterns and 8-bit locales.
3        re.M      Treats the input as multiline: makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).
4        re.S      Makes a period (dot) match any character, including a newline.
5        re.U      Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B.
6        re.X      Permits more readable ("verbose") regular expression syntax. It ignores whitespace (except inside a set [ ] or when escaped by a backslash) and treats unescaped # as a comment marker.
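A small sketch of the most commonly used flags, on an assumed two-line sample string. Flags can be combined with the | operator.

```python
import re

text = "First line\nsecond LINE"

ignore_case = re.findall(r"line", text, re.I)        # re.I: case-insensitive
first_only = re.findall(r"^\w+", text)               # ^ matches at the string start only
every_line = re.findall(r"^\w+", text, re.M)         # re.M: ^ matches at each line start
no_dotall = re.search(r"line.second", text)          # . does not match "\n" by default
dotall = re.search(r"line.second", text, re.S)       # re.S: . also matches "\n"
combined = re.findall(r"^\w+ line$", text, re.I | re.M)

# re.X: whitespace and comments inside the pattern are ignored.
number = re.compile(r"""
    \d+          # integer part
    (?:\.\d+)?   # optional fractional part
    """, re.X)
numbers = number.findall("3 and 4.25")
```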
34. Regular Expressions: Regular Expression Module
One way of using regular expressions is to use the functions listed in the table on the next
slide, in which each function is given a regex as its first argument.
Each function converts the regex into an internal format—a process called
compiling—and then does its work.
This technique is very convenient for one-off uses, but if we need to use the
same regex repeatedly we can avoid the cost of compiling it at each use by
compiling it once using the re.compile( ) function.
Then we can call methods on the compiled regex object as many times as we like.
35. Regular Expressions: Regular Expression Module
Syntax Description
re.compile(r, f) Returns compiled regex r with its flags set to f if specified.
re.escape(s) Returns string s with all non-alphanumeric characters backslash-escaped; therefore, the returned
string has no special regex characters.
re.findall(r, s, f) Returns all non-overlapping matches of regex r in string s (influenced by the flags f if given). If
the regex has captures, each match is returned as a tuple of captures.
re.finditer(r, s, f) Returns a match object for each non-overlapping match of regex r in string s (influenced by the
flags f if given).
re.match(r, s, f) Returns a match object if the regex r matches at the start of string s (influenced by the flags f if
given); otherwise, returns None.
re.search(r, s, f) Returns a match object if the regex r matches anywhere in string s (influenced by the flags f if
given); otherwise, returns None.
re.split(r, s, m) Returns the list of strings that results from splitting string s on every occurrence of regex r doing
up to m splits (or as many as possible if no m is given). If the regex has captures, these are
included in the list between the parts they split.
re.sub(r, x, s, m) Returns a copy of string s with every (or up to m if given) match of regex r replaced with x—this
can be a string or a function.
re.subn(r, x, s, m) Same as re.sub() except that it returns a 2-tuple of the resultant string and the number of
substitutions that were made.
36. Regular Expressions: Python‘s Regular Expression Language
compile(pattern, flags=0)
Regular expressions are handled as strings by Python.
However, with compile( ), you can compile a regular expression pattern into a
regular expression object.
When you need to use an expression several times in a single program, using
compile( ) to save the resulting regular expression object for reuse is more
efficient than passing the pattern as a string each time.
Note that the compiled versions of the most recent patterns passed to
compile( ) and the module-level matching functions are cached, so programs that
use only a few regular expressions at a time need not worry much about compiling them.
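A minimal sketch of compiling once and reusing the result. The pattern and the sample strings are assumptions.

```python
import re

# Compile once, then call methods on the compiled object as often as needed.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

seasons = ["played in 2008", "won in 2010 and 2011", "no year here"]
years = [YEAR.findall(s) for s in seasons]

print(years)         # [['2008'], ['2010', '2011'], []]
print(YEAR.pattern)  # the original pattern string is kept on the object
```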
37. Regular Expressions: Python‘s Regular Expression Language
search(pattern, string, flags=0)
With this function, you scan through the given string/sequence, looking for the
first location where the regular expression produces a match.
It returns a corresponding match object if found, else returns None if no position
in the string matches the pattern.
match(pattern, string, flags=0)
Returns a corresponding match object if zero or more characters at the
beginning of string match the pattern. Else it returns None, if the string does not
match the given pattern.
search( ) versus match( )
The match( ) function checks for a match only at the beginning of the string (by
default), whereas the search( ) function checks for a match anywhere in the
string.
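The difference shows up as soon as the pattern does not occur at the start of the string. The sketch below also uses re.fullmatch( ), which is not covered on this slide and requires the whole string to match.

```python
import re

text = "the jet landed"

at_start = re.match(r"jet", text)        # None: "jet" is not at the start
anywhere = re.search(r"jet", text)       # found further along the string
prefix = re.match(r"the", text)          # matches, "the" is at the start
whole = re.fullmatch(r"the jet", text)   # None: the entire string must match

print(at_start)         # None
print(anywhere.span())  # (4, 7)
print(prefix.span())    # (0, 3)
print(whole)            # None
```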
38. Regular Expressions: Python‘s Regular Expression Language
findall(pattern, string, flags=0)
Finds all the possible matches in the entire sequence and returns them as a list of
strings. Each returned string represents one match.
finditer(pattern, string, flags=0)
Similar to findall( ): it finds all the possible matches in the entire sequence but
returns regex match objects as an iterator.
import re
statement = "Please contact us at: support@datacamp.com, xyz@datacamp.com"
# 'addresses' is an iterator that yields a match object for each address
addresses = re.finditer(r'[\w.-]+@[\w.-]+', statement)
for address in addresses:
    print(address)
39. Regular Expressions: Python‘s Regular Expression Language
sub(pattern, repl, string, count=0, flags=0)
subn(pattern, repl, string, count=0, flags=0)
sub( ) is the substitute function. It returns the string obtained by replacing or
substituting the leftmost non-overlapping occurrences of pattern in string by the
replacement repl. If the pattern is not found, then the string is returned unchanged.
The subn( ) is similar to sub( ). However, it returns a tuple containing the new
string value and the number of replacements that were performed in the
statement.
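A short sketch of sub( ) and subn( ), including the count argument, a function as the replacement, and a backreference in the replacement string. The sample strings are assumptions.

```python
import re

messy = "too    many   spaces"

collapsed = re.sub(r"\s+", " ", messy)            # replace every match
once = re.sub(r"\s+", " ", messy, count=1)        # replace only the first match
with_count = re.subn(r"\s+", " ", messy)          # (new string, replacements)
unchanged = re.sub(r"\d+", "#", messy)            # no match: string unchanged

# repl can be a function that receives the match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 bikes, 10 helmets")
# repl can refer to captured groups with \1, \2, ...
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "Isaac Newton")
```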
40. Regular Expressions: Python‘s Regular Expression Language
split(pattern, string, maxsplit=0, flags=0)
This splits the string wherever the pattern matches and returns a list. If the
optional argument maxsplit is nonzero, then at most maxsplit splits are
performed.
Match objects provide the following methods:
start( ) - Returns the starting index of the match.
end( ) - Returns the index just past the end of the match.
span( ) - Returns a tuple containing the (start, end) positions of the match.
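The following sketch exercises split( ) with and without maxsplit, and the three position methods of a match object. The sample line is an assumption.

```python
import re

line = "name: Roadster=red: large"

all_parts = re.split(r": |=", line)               # split at every match
one_split = re.split(r": |=", line, maxsplit=1)   # at most one split
with_seps = re.split(r"(: |=)", "a=b")            # captures keep the separators

m = re.search(r"Road\w+", line)
start, end, span = m.start(), m.end(), m.span()

print(all_parts)         # ['name', 'Roadster', 'red', 'large']
print(one_split)         # ['name', 'Roadster=red: large']
print(with_seps)         # ['a', '=', 'b']
print(start, end, span)  # 6 14 (6, 14)
```

Slicing the original string with the span, line[start:end], gives back the matched text.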
41. Regular Expressions: Python‘s Regular Expression Language
Match Objects
Match objects always have a boolean value of True. Since match( ) and search( )
return None when there is no match, we can test whether there was a match with
a simple if statement.
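Such a test might look like the sketch below; the helper name describe and the sample strings are hypothetical.

```python
import re

def describe(text):
    """Report the first run of digits in text, if there is one."""
    match = re.search(r"\d+", text)
    if match:                      # a match object is always truthy
        return f"found {match.group()} at {match.span()}"
    return "no match"              # search() returned None

print(describe("room 42"))    # found 42 at (5, 7)
print(describe("no digits"))  # no match
```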
42. Regular Expressions: Python‘s Regular Expression Language
Match Objects : Match Object Attributes and Methods
match.group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the
result is a single string; if there are multiple arguments, the result is a tuple
with one item per argument.
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
print(m.group(0)) # Isaac Newton
match.groupdict(default=None)
Return a dictionary containing all the named subgroups of the match, keyed
by the subgroup name. The default argument is used for groups that did not
participate in the match; it defaults to None.
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
print(m.groupdict( )) # {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
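A few more ways of calling group( ) and groupdict( ), as a sketch. The group names and the second pattern are assumptions made for illustration.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton, physicist")

whole = m.group(0)        # the entire match
first = m.group(1)        # first group, by number
both = m.group(1, 2)      # several groups at once give a tuple
last = m.group("last")    # a named group, by name

# A group that does not participate gets the default value in groupdict().
opt = re.match(r"(?P<name>[a-z]+)(?: (?P<age>\d+))?", "alice")
filled = opt.groupdict(default="n/a")

print(whole, first, both, last)  # Isaac Newton Isaac ('Isaac', 'Newton') Newton
print(filled)                    # {'name': 'alice', 'age': 'n/a'}
```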
43. Regular Expressions: Python‘s Regular Expression Language
Match Objects : Match Object Attributes and Methods
match.start([group])
match.end([group])
Return the indices of the start and end of the substring matched by group;
group defaults to zero (meaning the whole matched substring). Return -1 if group
exists but did not contribute to the match.
match.span([group])
For a match m, return the 2-tuple (m.start(group), m.end(group)). Note that if
group did not contribute to the match, this is (-1, -1). group defaults to zero, the
entire match.
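The -1 case arises with optional groups, as in this sketch (the pattern and string are assumptions):

```python
import re

# The second group is optional and does not take part in this match.
m = re.match(r"(\d+)(px)?", "120 wide")

print(m.span())    # (0, 3): the whole match
print(m.span(1))   # (0, 3): the digits
print(m.start(2))  # -1: group 2 exists but did not contribute
print(m.span(2))   # (-1, -1)
```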
44. Regular Expressions: Python‘s Regular Expression Language
Match Objects : Match Object Attributes and Methods
match.pos
The value of pos which was passed to the search( ) or match( ) method of a
regex object. This is the index into the string at which the RE engine started
looking for a match.
match.endpos
The value of endpos which was passed to the search( ) or match( ) method of a
regex object. This is the index into the string beyond which the RE engine will
not go.
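As a sketch, pos and endpos are passed as optional arguments to the methods of a compiled regex and are then available on the resulting match object. The sample string is an assumption.

```python
import re

pattern = re.compile(r"\d+")
text = "12 apples, 34 pears"

# Start looking at index 2, so the leading "12" is skipped.
m = pattern.search(text, 2)
# Look only at text[0:1], so just the first digit can match.
limited = pattern.search(text, 0, 1)

print(m.group(), m.start(), m.pos, m.endpos)  # 34 11 2 19
print(limited.group(), limited.endpos)        # 1 1
```

When endpos is not given, it defaults to the length of the string.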
45. Regular Expressions: Python‘s Regular Expression Language
Match Objects : Match Object Attributes and Methods
match.lastindex
The integer index of the last matched capturing group, or None if no group was
matched at all. For example, the expressions (a)b, ((a)(b)), and ((ab)) will have
lastindex == 1 if applied to the string 'ab', while the expression (a)(b) will have
lastindex == 2, if applied to the same string.
match.lastgroup
The name of the last matched capturing group, or None if the group didn’t
have a name, or if no group was matched at all.
match.re
The regular expression object whose match( ) or search( ) method produced this
match instance.
match.string
The string passed to match( ) or search( ).
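These attributes can be inspected directly, as in the sketch below. The lastindex expressions are the ones quoted above; the named-group pattern is an assumption.

```python
import re

# lastindex for the four expressions applied to 'ab'.
indexes = [re.match(p, "ab").lastindex
           for p in (r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)")]

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Malcolm Reynolds")

print(indexes)       # [1, 1, 1, 2]
print(m.lastgroup)   # last
print(m.re.pattern)  # the pattern of the regex object that produced the match
print(m.string)      # Malcolm Reynolds
```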