SlideShare a Scribd company logo
1 of 13
Download to read offline
Web Scrapping with Python




               Miguel Miranda de Mattos
                      :@mmmattos - mmmattos.net
                            Porto Alegre, Brazil.
                                          2012
Web Scrapping with Python

     ● Tools:

       ○ BeautifulSoup

       ○ Mechanize
BeautifulSoup
 An HTML/XML parser for Python that can turn even invalid
markup into a parse tree. It provides simple, idiomatic ways
of navigating, searching, and modifying the parse tree. It
commonly saves programmers hours or days of work.

● In Summary:
   ○ Navigate the "soup" of HTML/XML tags,
     programatically

   ○ Access tag´s properties and values

   ○ Search for tags and their attributes.
BeautifulSoup
  ○ Example:

      from BeautifulSoup import BeautifulSoup
      doc = "<html><h1>Heading</h1><p>Text"
      soup = BeautifulSoup(doc)
      print soup.prettify()

      # <html>
      # <h1>
      # Heading
      # </h1>
      # <p>
      # Text
      # </p>
      # </html>
  ○
BeautifulSoup

  ○ Searching / Looking for things
    ■ 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild',
           'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings',
           'findParent', 'findParents', 'findPrevious', 'findPreviousSibling',
           'findPreviousSiblings'

      ■ findAll
        ● findAll(self, name=None, attrs={}, recursive=True,
                text=None, limit=None, **kwargs)

           ●       Extracts a list of Tag objects that match the given
           ●       criteria. You can specify the name of the Tag and any
           ●       attributes you want the Tag to have.
BeautifulSoup
● Example:
  >>> from BeautifulSoup import BeautifulSoup
  >>> doc = "<table><tr><td>one</td><td>two</td></tr></table>"
  >>> docSoup = BeautifulSoup(doc)

  >>> print docSoup.findAll('tr')
  [<tr><td>one</td><td>two</td></tr>]

  >>> print docSoup.findAll('td')
  [<td>one</td>, <td>two</td>]
BeautifulSoup
● findAll (cont´d.):
   >>> for t in docSoup.findAll('td'):
   >>>     print t

   <td>one</td>
   <td>two</td>

   >>> for t in docSoup.findAll('td'):
   >>>     print t.getText()

   one
   two
BeautifulSoup
●   findAll using attributes to qualify:
    >>> soup.findAll('div',attrs = {'class': 'Menus'})
    [<div>musicMenu</div>,<div>videoMenu</div>]

●   For more options:
    ○   dir (BeautifulSoup)
    ○   help (yourSoup.<command>)

●   Use BeautifulSoup rather than regexp patterns:
        patFinderTitle = re.compile(r'<a[^>]*stitle="(.*?)"')
        re.findAll(patFinderTitle, html)
    ○   by
        soup = BeautifulSoup(html)
        for tag in brand_row_soup.findAll('a'):
        print tag['title']
Mechanize
● Stateful programmatic web browsing in Python, after
   Andy Lester’s Perl module.
    ●   mechanize.Browser and mechanize.UserAgentBase, so:
          ○ any URL can be opened, not just http:
          ○ mechanize.UserAgentBase offers easy dynamic configuration of
            user-agent features like protocol, cookie, redirection and robots.
            txt handling, without having to make a new OpenerDirector each
            time, e.g. by callingbuild_opener().
    ●   Easy HTML form filling.
    ●   Convenient link parsing and following.
    ●   Browser history (.back() and .reload() methods).
    ●   The Referer HTTP header is added properly (optional).
    ●   Automatic observance of robots.txt.
    ●   Automatic handling of HTTP-Equiv and Refresh.
Mechanize
● Navigation commands:
  ○ open(url)
  ○ follow_link(link)
  ○ back()
  ○ submit()
  ○ reload()
● Examples
   br = mechanize.Browser()
   br.open("python.org")
   gothtml = br.response().read()
   for link in br.links(url_regex="python.org"):
      print link
      br.follow_link(link) # takes EITHER Link instance OR keyword args
      br.back()
Mechanize
● Example:
     import re
     import mechanize

     br = mechanize.Browser()
     br.open("http://www.example.com/")

     # follow second link with element text matching
     # regular expression
     response1 = br.follow_link(text_regex=r"cheeses*shop")

     assert br.viewing_html()
     print br.title()
     print response1.geturl()
     print response1.info() # headers
     print response1.read() # body
Mechanize
● Example: Combining Mechanize and BeautifulSoup
     import re
     import mechanize
     from BeautifulSoup import BeutifulSoup

     url = "http://www.hp.com"
     br = mechanize.Browser()
     br..open(url)
     assert br.viewing_html()
     html = br.response().read()
     result_soup = BeautifulSoup(html)

     found_divs = soup.findAll('div')
     print "Found " + str(len(found_divs))
     for d in found_divs:
           print d
Mechanize
● Example: Combining Mechanize and BeautifulSoup
     import re
     import mechanize

     url = "http://www.hp.com"
     br = mechanize.Browser()
     br..open(url)
     assert br.viewing_html()
     html = br.response().read()
     result_soup = BeautifulSoup(html)

     found_divs = soup.findAll('div')
     print "Found " + str(len(found_divs))
     for d in found_divs:
           if d.has_key('class'):
                 print d['class']

More Related Content

What's hot

How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.Diep Nguyen
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Sammy Fung
 
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0IBM Cloud Data Services
 
Do something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseDo something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseBruce McPherson
 
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRealtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRick Copeland
 
Assumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfAssumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfErin Shellman
 
Do something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing appDo something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing appBruce McPherson
 
regular expressions and the world wide web
regular expressions and the world wide webregular expressions and the world wide web
regular expressions and the world wide webSergio Burdisso
 
Analyse your SEO Data with R and Kibana
Analyse your SEO Data with R and KibanaAnalyse your SEO Data with R and Kibana
Analyse your SEO Data with R and KibanaVincent Terrasi
 
CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourPeter Friese
 

What's hot (20)

Fun with Python
Fun with PythonFun with Python
Fun with Python
 
Pydata-Python tools for webscraping
Pydata-Python tools for webscrapingPydata-Python tools for webscraping
Pydata-Python tools for webscraping
 
How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
 
Selenium&amp;scrapy
Selenium&amp;scrapySelenium&amp;scrapy
Selenium&amp;scrapy
 
Scrapy-101
Scrapy-101Scrapy-101
Scrapy-101
 
Scrapy.for.dummies
Scrapy.for.dummiesScrapy.for.dummies
Scrapy.for.dummies
 
CouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce ViewsCouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce Views
 
CouchDB Day NYC 2017: Full Text Search
CouchDB Day NYC 2017: Full Text SearchCouchDB Day NYC 2017: Full Text Search
CouchDB Day NYC 2017: Full Text Search
 
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
 
CouchDB Day NYC 2017: Mango
CouchDB Day NYC 2017: MangoCouchDB Day NYC 2017: Mango
CouchDB Day NYC 2017: Mango
 
Do something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseDo something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as database
 
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRealtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
 
Assumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfAssumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourself
 
Do something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing appDo something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing app
 
Routing @ Scuk.cz
Routing @ Scuk.czRouting @ Scuk.cz
Routing @ Scuk.cz
 
regular expressions and the world wide web
regular expressions and the world wide webregular expressions and the world wide web
regular expressions and the world wide web
 
Analyse your SEO Data with R and Kibana
Analyse your SEO Data with R and KibanaAnalyse your SEO Data with R and Kibana
Analyse your SEO Data with R and Kibana
 
Introducing CouchDB
Introducing CouchDBIntroducing CouchDB
Introducing CouchDB
 
CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 Hour
 

Similar to Web Scrapping with Python

The Ring programming language version 1.2 book - Part 33 of 84
The Ring programming language version 1.2 book - Part 33 of 84The Ring programming language version 1.2 book - Part 33 of 84
The Ring programming language version 1.2 book - Part 33 of 84Mahmoud Samir Fayed
 
The Ring programming language version 1.6 book - Part 47 of 189
The Ring programming language version 1.6 book - Part 47 of 189The Ring programming language version 1.6 book - Part 47 of 189
The Ring programming language version 1.6 book - Part 47 of 189Mahmoud Samir Fayed
 
PyGrunn 2017 - Django Performance Unchained - slides
PyGrunn 2017 - Django Performance Unchained - slidesPyGrunn 2017 - Django Performance Unchained - slides
PyGrunn 2017 - Django Performance Unchained - slidesArtur Barseghyan
 
JavaScriptL18 [Autosaved].pptx
JavaScriptL18 [Autosaved].pptxJavaScriptL18 [Autosaved].pptx
JavaScriptL18 [Autosaved].pptxVivekBaghel30
 
The Ring programming language version 1.8 book - Part 50 of 202
The Ring programming language version 1.8 book - Part 50 of 202The Ring programming language version 1.8 book - Part 50 of 202
The Ring programming language version 1.8 book - Part 50 of 202Mahmoud Samir Fayed
 
Django tech-talk
Django tech-talkDjango tech-talk
Django tech-talkdtdannen
 
Introduction to Django
Introduction to DjangoIntroduction to Django
Introduction to DjangoJames Casey
 
Web performance essentials - Goodies
Web performance essentials - GoodiesWeb performance essentials - Goodies
Web performance essentials - GoodiesJerry Emmanuel
 
Rapid and Scalable Development with MongoDB, PyMongo, and Ming
Rapid and Scalable Development with MongoDB, PyMongo, and MingRapid and Scalable Development with MongoDB, PyMongo, and Ming
Rapid and Scalable Development with MongoDB, PyMongo, and MingRick Copeland
 
FYBSC IT Web Programming Unit III Core Javascript
FYBSC IT Web Programming Unit III  Core JavascriptFYBSC IT Web Programming Unit III  Core Javascript
FYBSC IT Web Programming Unit III Core JavascriptArti Parab Academics
 
Introduction to html5 and css3
Introduction to html5 and css3Introduction to html5 and css3
Introduction to html5 and css3Sunny Batabyal
 
Django - Framework web para perfeccionistas com prazos
Django - Framework web para perfeccionistas com prazosDjango - Framework web para perfeccionistas com prazos
Django - Framework web para perfeccionistas com prazosIgor Sobreira
 
Jquery presentation
Jquery presentationJquery presentation
Jquery presentationguest5d87aa6
 
Introduction to HTML-CSS-Javascript.pdf
Introduction to HTML-CSS-Javascript.pdfIntroduction to HTML-CSS-Javascript.pdf
Introduction to HTML-CSS-Javascript.pdfDakshPratapSingh1
 
Neoito — How modern browsers work
Neoito — How modern browsers workNeoito — How modern browsers work
Neoito — How modern browsers workNeoito
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2fishwarter
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2fishwarter
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2fishwarter
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2fishwarter
 

Similar to Web Scrapping with Python (20)

The Ring programming language version 1.2 book - Part 33 of 84
The Ring programming language version 1.2 book - Part 33 of 84The Ring programming language version 1.2 book - Part 33 of 84
The Ring programming language version 1.2 book - Part 33 of 84
 
The Ring programming language version 1.6 book - Part 47 of 189
The Ring programming language version 1.6 book - Part 47 of 189The Ring programming language version 1.6 book - Part 47 of 189
The Ring programming language version 1.6 book - Part 47 of 189
 
PyGrunn 2017 - Django Performance Unchained - slides
PyGrunn 2017 - Django Performance Unchained - slidesPyGrunn 2017 - Django Performance Unchained - slides
PyGrunn 2017 - Django Performance Unchained - slides
 
JavaScriptL18 [Autosaved].pptx
JavaScriptL18 [Autosaved].pptxJavaScriptL18 [Autosaved].pptx
JavaScriptL18 [Autosaved].pptx
 
The Ring programming language version 1.8 book - Part 50 of 202
The Ring programming language version 1.8 book - Part 50 of 202The Ring programming language version 1.8 book - Part 50 of 202
The Ring programming language version 1.8 book - Part 50 of 202
 
Django tech-talk
Django tech-talkDjango tech-talk
Django tech-talk
 
Introduction to Django
Introduction to DjangoIntroduction to Django
Introduction to Django
 
Web performance essentials - Goodies
Web performance essentials - GoodiesWeb performance essentials - Goodies
Web performance essentials - Goodies
 
Rapid and Scalable Development with MongoDB, PyMongo, and Ming
Rapid and Scalable Development with MongoDB, PyMongo, and MingRapid and Scalable Development with MongoDB, PyMongo, and Ming
Rapid and Scalable Development with MongoDB, PyMongo, and Ming
 
FYBSC IT Web Programming Unit III Core Javascript
FYBSC IT Web Programming Unit III  Core JavascriptFYBSC IT Web Programming Unit III  Core Javascript
FYBSC IT Web Programming Unit III Core Javascript
 
Introduction to html5 and css3
Introduction to html5 and css3Introduction to html5 and css3
Introduction to html5 and css3
 
Django - Framework web para perfeccionistas com prazos
Django - Framework web para perfeccionistas com prazosDjango - Framework web para perfeccionistas com prazos
Django - Framework web para perfeccionistas com prazos
 
Code Management
Code ManagementCode Management
Code Management
 
Jquery presentation
Jquery presentationJquery presentation
Jquery presentation
 
Introduction to HTML-CSS-Javascript.pdf
Introduction to HTML-CSS-Javascript.pdfIntroduction to HTML-CSS-Javascript.pdf
Introduction to HTML-CSS-Javascript.pdf
 
Neoito — How modern browsers work
Neoito — How modern browsers workNeoito — How modern browsers work
Neoito — How modern browsers work
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2
 

Recently uploaded

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Recently uploaded (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Web Scrapping with Python

  • 1. Web Scrapping with Python Miguel Miranda de Mattos :@mmmattos - mmmattos.net Porto Alegre, Brazil. 2012
  • 2. Web Scrapping with Python ● Tools: ○ BeautifulSoup ○ Mechanize
  • 3. BeautifulSoup An HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. ● In Summary: ○ Navigate the "soup" of HTML/XML tags, programatically ○ Access tag´s properties and values ○ Search for tags and their attributes.
  • 4. BeautifulSoup ○ Example: from BeautifulSoup import BeautifulSoup doc = "<html><h1>Heading</h1><p>Text" soup = BeautifulSoup(doc) print soup.prettify() # <html> # <h1> # Heading # </h1> # <p> # Text # </p> # </html> ○
  • 5. BeautifulSoup ○ Searching / Looking for things ■ 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling', 'findPreviousSiblings' ■ findAll ● findAll(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) ● Extracts a list of Tag objects that match the given ● criteria. You can specify the name of the Tag and any ● attributes you want the Tag to have.
  • 6. BeautifulSoup ● Example: >>> from BeautifulSoup import BeautifulSoup >>> doc = "<table><tr><td>one</td><td>two</td></tr></table>" >>> docSoup = BeautifulSoup(doc) >>> print docSoup.findAll('tr') [<tr><td>one</td><td>two</td></tr>] >>> print docSoup.findAll('td') [<td>one</td>, <td>two</td>]
  • 7. BeautifulSoup ● findAll (cont´d.): >>> for t in docSoup.findAll('td'): >>> print t <td>one</td> <td>two</td> >>> for t in docSoup.findAll('td'): >>> print t.getText() one two
  • 8. BeautifulSoup ● findAll using attributes to qualify: >>> soup.findAll('div',attrs = {'class': 'Menus'}) [<div>musicMenu</div>,<div>videoMenu</div>] ● For more options: ○ dir (BeautifulSoup) ○ help (yourSoup.<command>) ● Use BeautifulSoup rather than regexp patterns: patFinderTitle = re.compile(r'<a[^>]*stitle="(.*?)"') re.findAll(patFinderTitle, html) ○ by soup = BeautifulSoup(html) for tag in brand_row_soup.findAll('a'): print tag['title']
  • 9. Mechanize ● Stateful programmatic web browsing in Python, after Andy Lester’s Perl module. ● mechanize.Browser and mechanize.UserAgentBase, so: ○ any URL can be opened, not just http: ○ mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots. txt handling, without having to make a new OpenerDirector each time, e.g. by callingbuild_opener(). ● Easy HTML form filling. ● Convenient link parsing and following. ● Browser history (.back() and .reload() methods). ● The Referer HTTP header is added properly (optional). ● Automatic observance of robots.txt. ● Automatic handling of HTTP-Equiv and Refresh.
  • 10. Mechanize ● Navigation commands: ○ open(url) ○ follow_link(link) ○ back() ○ submit() ○ reload() ● Examples br = mechanize.Browser() br.open("python.org") gothtml = br.response().read() for link in br.links(url_regex="python.org"): print link br.follow_link(link) # takes EITHER Link instance OR keyword args br.back()
  • 11. Mechanize ● Example: import re import mechanize br = mechanize.Browser() br.open("http://www.example.com/") # follow second link with element text matching # regular expression response1 = br.follow_link(text_regex=r"cheeses*shop") assert br.viewing_html() print br.title() print response1.geturl() print response1.info() # headers print response1.read() # body
  • 12. Mechanize ● Example: Combining Mechanize and BeautifulSoup import re import mechanize from BeautifulSoup import BeutifulSoup url = "http://www.hp.com" br = mechanize.Browser() br..open(url) assert br.viewing_html() html = br.response().read() result_soup = BeautifulSoup(html) found_divs = soup.findAll('div') print "Found " + str(len(found_divs)) for d in found_divs: print d
  • 13. Mechanize ● Example: Combining Mechanize and BeautifulSoup import re import mechanize url = "http://www.hp.com" br = mechanize.Browser() br..open(url) assert br.viewing_html() html = br.response().read() result_soup = BeautifulSoup(html) found_divs = soup.findAll('div') print "Found " + str(len(found_divs)) for d in found_divs: if d.has_key('class'): print d['class']