An introduction to crawling sites and extracting content from the unstructured data on the web, using the Python programming language and some existing Python modules.
3. BeautifulSoup
An HTML/XML parser for Python that can turn even invalid
markup into a parse tree. It provides simple, idiomatic ways
of navigating, searching, and modifying the parse tree. It
commonly saves programmers hours or days of work.
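A rough illustration (the markup below is made up): even a fragment with unclosed tags is turned into a tree that can be re-serialized cleanly:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>An <b>unclosed bold and <i>nested italic")
>>> print soup.prettify()   # the missing </i>, </b> and </p> are filled in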
● In Summary:
○ Navigate the "soup" of HTML/XML tags programmatically
○ Access a tag's properties and values
○ Search for tags and their attributes
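A minimal sketch of those three operations on a made-up snippet (tag and attribute names are purely illustrative):
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<div id="menu"><a href="/music">Music</a></div>')
>>> soup.div.a                     # navigate by tag name
<a href="/music">Music</a>
>>> soup.div.a['href']             # access a tag's attributes
u'/music'
>>> soup.find('a', href='/music')  # search by tag name and attribute
<a href="/music">Music</a>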
5. BeautifulSoup
○ Searching / Looking for things
■ 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild',
'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings',
'findParent', 'findParents', 'findPrevious', 'findPreviousSibling',
'findPreviousSiblings'
■ findAll
● findAll(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
● Extracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have.
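For instance, the limit and text parameters from that signature can narrow a search without any extra looping (the markup here is made up):
>>> from BeautifulSoup import BeautifulSoup
>>> import re
>>> docSoup = BeautifulSoup('<tr><td>one</td><td>two</td></tr>')
>>> docSoup.findAll('td', limit=1)             # stop after the first match
[<td>one</td>]
>>> docSoup.findAll(text=re.compile('^two$'))  # match text nodes rather than tags
[u'two']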
7. BeautifulSoup
● findAll (cont'd.):
>>> for t in docSoup.findAll('td'):
...     print t
<td>one</td>
<td>two</td>
>>> for t in docSoup.findAll('td'):
...     print t.getText()
one
two
8. BeautifulSoup
● findAll using attributes to qualify:
>>> soup.findAll('div', attrs={'class': 'Menus'})
[<div class="Menus">musicMenu</div>, <div class="Menus">videoMenu</div>]
● For more options:
○ dir(BeautifulSoup)
○ help(yourSoup.<command>)
● Use BeautifulSoup rather than regexp patterns; replace:
patFinderTitle = re.compile(r'<a[^>]*\stitle="(.*?)"')
re.findall(patFinderTitle, html)
○ with:
soup = BeautifulSoup(html)
for tag in soup.findAll('a', title=True):
    print tag['title']
9. Mechanize
● Stateful programmatic web browsing in Python, after
Andy Lester’s Perl module.
● mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:
○ any URL can be opened, not just http:
○ mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().
● Easy HTML form filling (see the sketch after this list).
● Convenient link parsing and following.
● Browser history (.back() and .reload() methods).
● The Referer HTTP header is added properly (optional).
● Automatic observance of robots.txt.
● Automatic handling of HTTP-Equiv and Refresh.
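A minimal sketch of the form-filling and configuration points above (the URL and the control name "q" are made up):
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)             # robots.txt handling is on by default and can be toggled
br.addheaders = [('User-Agent', 'my-crawler/0.1')]
br.open("http://www.example.com/search")
br.select_form(nr=0)                    # pick the first form on the page
br["q"] = "web scraping"                # fill the text control named "q"
response = br.submit()                  # submit the form and read the result
print response.geturl()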
10. Mechanize
● Navigation commands:
○ open(url)
○ follow_link(link)
○ back()
○ submit()
○ reload()
● Examples
import mechanize
br = mechanize.Browser()
br.open("http://www.python.org/")
gothtml = br.response().read()
for link in br.links(url_regex="python.org"):
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    br.back()
11. Mechanize
● Example:
import re
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
# follow second link with element text matching
# regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info() # headers
print response1.read() # body
12. Mechanize
● Example: Combining Mechanize and BeautifulSoup
import re
import mechanize
from BeautifulSoup import BeautifulSoup
url = "http://www.hp.com"
br = mechanize.Browser()
br.open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)
found_divs = result_soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    print d
13. Mechanize
● Example: Combining Mechanize and BeautifulSoup
import re
import mechanize
url = "http://www.hp.com"
br = mechanize.Browser()
br..open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)
found_divs = soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
if d.has_key('class'):
print d['class']