An introduction to crawling sites and extracting content from the unstructured data on the web, using the Python programming language and some existing Python modules.
3. BeautifulSoup
An HTML/XML parser for Python that can turn even invalid
markup into a parse tree. It provides simple, idiomatic ways
of navigating, searching, and modifying the parse tree. It
commonly saves programmers hours or days of work.
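A rough illustration (the markup below is made up): even a fragment with unclosed tags is turned into a tree that can be re-serialized cleanly:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>An <b>unclosed bold and <i>nested italic")
>>> print soup.prettify()   # the missing </i>, </b> and </p> are filled in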
● In Summary:
○ Navigate the "soup" of HTML/XML tags programmatically
○ Access a tag's properties and values
○ Search for tags and their attributes
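A minimal sketch of those three operations on a made-up snippet (tag and attribute names are purely illustrative):
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<div id="menu"><a href="/music">Music</a></div>')
>>> soup.div.a                     # navigate by tag name
<a href="/music">Music</a>
>>> soup.div.a['href']             # access a tag's attributes
u'/music'
>>> soup.find('a', href='/music')  # search by tag name and attribute
<a href="/music">Music</a>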
5. BeautifulSoup
○ Searching / Looking for things
■ 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild',
'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings',
'findParent', 'findParents', 'findPrevious', 'findPreviousSibling',
'findPreviousSiblings'
■ findAll
● findAll(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
● Extracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have.
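For instance, the limit and text parameters from that signature can narrow a search without any extra looping (the markup here is made up):
>>> from BeautifulSoup import BeautifulSoup
>>> import re
>>> docSoup = BeautifulSoup('<tr><td>one</td><td>two</td></tr>')
>>> docSoup.findAll('td', limit=1)             # stop after the first match
[<td>one</td>]
>>> docSoup.findAll(text=re.compile('^two$'))  # match text nodes rather than tags
[u'two']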
7. BeautifulSoup
● findAll (cont'd.):
>>> for t in docSoup.findAll('td'):
...     print t
<td>one</td>
<td>two</td>
>>> for t in docSoup.findAll('td'):
...     print t.getText()
one
two
8. BeautifulSoup
● findAll using attributes to qualify:
>>> soup.findAll('div', attrs={'class': 'Menus'})
[<div class="Menus">musicMenu</div>, <div class="Menus">videoMenu</div>]
● For more options:
○ dir(BeautifulSoup)
○ help(yourSoup.<command>)
● Use BeautifulSoup rather than regexp patterns; replace:
patFinderTitle = re.compile(r'<a[^>]*\stitle="(.*?)"')
re.findall(patFinderTitle, html)
○ with:
soup = BeautifulSoup(html)
for tag in soup.findAll('a', title=True):
    print tag['title']
9. Mechanize
● Stateful programmatic web browsing in Python, after
Andy Lester’s Perl module.
● mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:
○ any URL can be opened, not just http:
○ mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().
● Easy HTML form filling (see the sketch after this list).
● Convenient link parsing and following.
● Browser history (.back() and .reload() methods).
● The Referer HTTP header is added properly (optional).
● Automatic observance of robots.txt.
● Automatic handling of HTTP-Equiv and Refresh.
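A minimal sketch of the form-filling and configuration points above (the URL and the control name "q" are made up):
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)             # robots.txt handling is on by default and can be toggled
br.addheaders = [('User-Agent', 'my-crawler/0.1')]
br.open("http://www.example.com/search")
br.select_form(nr=0)                    # pick the first form on the page
br["q"] = "web scraping"                # fill the text control named "q"
response = br.submit()                  # submit the form and read the result
print response.geturl()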
10. Mechanize
● Navigation commands:
○ open(url)
○ follow_link(link)
○ back()
○ submit()
○ reload()
● Examples
import mechanize
br = mechanize.Browser()
br.open("http://www.python.org/")
gothtml = br.response().read()
for link in br.links(url_regex="python.org"):
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    br.back()
11. Mechanize
● Example:
import re
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
# follow second link with element text matching
# regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info() # headers
print response1.read() # body
12. Mechanize
● Example: Combining Mechanize and BeautifulSoup
import re
import mechanize
from BeautifulSoup import BeautifulSoup
url = "http://www.hp.com"
br = mechanize.Browser()
br.open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)
found_divs = result_soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    print d
13. Mechanize
● Example: Combining Mechanize and BeautifulSoup
import re
import mechanize
url = "http://www.hp.com"
br = mechanize.Browser()
br..open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)
found_divs = soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
if d.has_key('class'):
print d['class']