SlideShare une entreprise Scribd logo
1  sur  25
Télécharger pour lire hors ligne
Intro to Web Scraping with Python
Vancouver PyLadies Meetup
March 7, 2016
Maris Lemba
2
Before we begin
● Do consider whether the site you are interested in allows scraping by
examining its robots.txt file
– This robots exclusion standard is used by web servers to communicate with web
crawlers and other robots
– Located in the top-level directory of web server
– Can specify which directories the site owner does not want robots to access
– Can have different access rules for different robots
● If you intend to publish the output from your scraping, consider whether it
is legal to do so
– It is generally ok to re-publish “factual data”
#Sample robots.txt file
User-agent: Googlebot
Disallow: /folder/
User-agent: *
Disallow: /photos/
3
Basic components of a web page
● HTML – a set of markup tags for describing web documents
● CSS – style sheet language for describing the presentation of an HTML document
● JavaScript – popular client-side programming language used for web applications
Sample html source
This is what it would look like in the
browser
If we wanted to scrape information from the “Day of the week” table, we can locate it in the
html inside the <table> ... </table> element.
If there are multiple tables, we can use attributes such as id to identify the right table.
4
How the web works
user interface
web server
web browser
database, data
handling, etc
server-based system
user interface
Ajax “engine”
(javascript)
web browser
HTTP request
HTTP request
HTML +
CSS data
web server
database, data
handling, etc
server-based system
XML/JSON or
HTML/CSS or
Javascript data
Schematic adapted from https://en.wikipedia.org/wiki/Ajax_%28programming%29
ClientServer
Javascript
call
HTML+CSS
data
Conventional model of web application Ajax model of web application
5
Scraping ad hoc vs at scale
● Use case 1 (basic scraping): I need to scrape a few pages,
maybe crawl a small site. I can wait to have all data retrievals
done sequentially.
– This is what we will focus on in this talk
– Two scenarios:
● All the data you need is in the HTML files you get from the server
● The page is dynamically generated using Ajax
● Use case 2: I am crawling large sites periodically. I need task
scheduling, asynchronous retrieval, pipeline automation, and
lots of other tools to make this easier.
– You may want to look into a crawling/scraping framework like Scrapy
6
Agenda
● We'll walk through examples for the two
scenarios under “use case 1”
– We will use two web sites,
pacificroadrunners.ca/firsthalf/ and
ultrasignup.com, to scrape the results for all
years of two running races
● Pacific Road Runners First Half Marathon
● Chuckanut 50K
7
Basic scraping: static HTML
● Tasks:
– Inspect the HTML source to locate where in the Document Object Model (DOM) the data of
interest resides
● Right-click on the page in browser window and select “View Page Source”
– Retrieve the HTML file using built-in urllib library (note that the interface to this library has
changed between Python 2 and 3)
– Parse the HTML file and extract the data of interest, using either a built-in parser like html.parser,
or third-party libraries such as lxml, html5lib or BeautifulSoup
● Which library you choose for parsing depends on
– how sloppy your html is: if your html is “incorrect”, the parsers may handle it differently
– whether you want to maximize speed (choose lxml)
– what external dependencies you can live with (lxml is a binding to popular C libraries libxml2
and libxslt)
– how “easy” you want the interface to be (BeautifulSoup has an easy-to-get-started API and great
documentation with examples)
8
urllib library
● High level standard library for retrieving data across the Web
(not just html files)
● urllib library was split into several submodules in Python
3 for working with URLs:
– urllib.request, urllib.parse, urllib.error, 
urllib.robotparser
● The urllib.request provides methods for opening URLs
● The urllib.request.urlopen() method is similar to
the built-in function open() for opening files
9
BeautifulSoup library
● Python third-party library for extracting data from html and
xml files
● Works with html.parser, lxml, html5lib
● Provides ways to navigate, search and modify the parse
tree based on the position in the parse tree, tag name, tag
attributes, css classes using regular expressions, user-
defined functions etc
● Excellent tutorial with examples at
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
● Supports Python 2.7 and 3
10
Quick intro to BeautifulSoup
● BeautifulSoup constructor creates an object that
represents original documents as a nested data
structure
●
The simplest way to address a tag is to use its name
●
find() and find_all() methods allow to search for an
element; find() returns the first element found, and
find_all() a list of elements
●
You can access a tag's attributes by treating the tag
like a dictionary
●
You can extract text from the element
Parsing samplepage.html with
BeautifulSoup
11
Example: Results tab of Vancouver First Half race
The main Results page
has links to yearly results
12
HTML source inspection: front page of results
Find the links in the HTML source file corresponding to the rows of yearly results you
see in the browser
Note that the links we are interested
in are all inside
<div class=”entry-content”> ... </div>
The links to each year's page
are inside an <a> tag, as a
value to attribute named href
The text field of the <a> ... </a> element is a
string containing the year of the race
13
Extract all links to individual year's results
Note that the collected links are relative; we can append
them to the main url to get absolute links
14
HTML source inspection: individual year
Find the table of results in the DOM This is the table we want, each
runner's info in <tr> ... </tr>
Each participant's entry is a table
row <tr> ... </tr>.
The text (participant's placement,
name, finishing time, etc) are
inside <td> ... </td> elements
nested in the table row.
15
Scrape results for all years
First five finishers of the 2016 race on
the web page
Let's print out the first five of the
2016 entry in our dictionary
16
What is Ajax
● Name comes from Asynchronous JavaScript and XML, even
though XML is no longer the only format for data exchange
● Refers to a group of web technologies used to implement web
applications that communicate with the server in the background
without interfering with the display of the page or reloading the
entire page
● For web scraping, this means that the data you are interested in
may not be in the source HTML file you receive from the server
but is generated on the client side
● You can use browser development tools (“inspect element”) to
inspect the Ajax-generated HTML that your browser displays
17
Second example: ultrasignup.com
So far, web site looks similar to previous example. We are interested in
scraping results for all years of the race.
18
Results page for an individual year
This is a sample page for 2014 results. It looks like the results
are in a neat table.
19
Inspecting the DOM: the result table is not
filled out in the source HTML
However, the <table id=”list”> element is empty
when we view source html file.
This means we can't simply extract the result table
from the HTML file that the server sent to us like we
did in the previous example.
We see all result rows in
<table id=”list”> </table>
when inspecting the element
in browser developer tools
such as Mozilla Firebug
(which shows the generated
HTML)
Again, the details of the
results are within the
<td> ... </td> element inside
each <tr> ... </tr> element
20
Scraping dynamically generated HTML
● Your scraping system needs to be able to execute JavaScript and
either retrieve the results directly, or generate the HTML as seen in
the browser and scrape data from that
● A popular choice for the latter is to use Selenium, a suite of tools for
browser automation
– Python's binding to Selenium suite is the library selenium
– Selenium supports several browsers (Firefox, Chrome, etc), but for scraping
a headless non-graphical browser like PhantomJS is a good choice
– Once Selenium has generated the HTML after running JavaScript, you can
either access it as a string and use this as input to BeautifulSoup for data
extraction, or you can navigate the DOM and find elements within Selenium
21
Quick intro to selenium
● Selenium WebDriver accepts commands, sends them to the browser, and retrieves results
● You can navigate and search the DOM
– find_element_by_id('my_id')
– find_elements_by_tag_name('table')
– find_elements_by_class_name('my_class_name')
– find_elements_by_xpath()
● For example, find_elements_by_xpath(“//table[@id='list']”) will find a table with the id = “list”
● Tutorial on xpath syntax: www.w3schools.com/xsl/xpath_intro.asp
– and other methods (using css selectors, link text, etc)
● You can click on links using WebDriver's click() method
● You can retrieve the generated html source by calling .page_source on webdriver
instance and parse it outside of selenium
● More on Selenium with Python:
http://selenium-python.readthedocs.org/index.html
22
Scraping ultrasignup race results using
Selenium and PhantomJS
23
And it works!
Note that the first two lists (rows in web page table) in our dictionary corresponding to
a year do not actually contain finisher results.
Let's ignore those, and print out the subsequent results for the first 12 finishers.
This table is from the 2014 results page for
comparison: the results match
24
Additional resources for scraping projects
Additional third-party libraries you may want to look at:
● requests library
– Filling out forms/login information before accessing data and managing cookies
● mechanize library
– Navigating through forms, session and cookie handling
– Does not yet support Python 3
– Does not run JavaScript
● scrapy library
– Framework for web crawling and scraping: managing crawling, asynchronous
scheduling and processing of requests, cookie and session handling, built-in
support for data extraction etc
– Does not yet support Python 3
25
References for further reading
● BeautifulSoup
– http://www.crummy.com/software/Bea
utifulSoup/bs4/doc/
● Selenium with Python
– http://selenium-python.readthedocs.org/
● General web scraping
– Web Scraping with Python by Ryan Mitchell,
O'Reilly 2015

Contenu connexe

Tendances

Web Scraping and Data Extraction Service
Web Scraping and Data Extraction ServiceWeb Scraping and Data Extraction Service
Web Scraping and Data Extraction ServicePromptCloud
 
Introduction to ajax
Introduction to ajaxIntroduction to ajax
Introduction to ajaxNir Elbaz
 
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documentsTommy Tavenner
 
Web development | Derin Dolen
Web development | Derin Dolen Web development | Derin Dolen
Web development | Derin Dolen Derin Dolen
 
Web Development with HTML5, CSS3 & JavaScript
Web Development with HTML5, CSS3 & JavaScriptWeb Development with HTML5, CSS3 & JavaScript
Web Development with HTML5, CSS3 & JavaScriptEdureka!
 
Web application framework
Web application frameworkWeb application framework
Web application frameworkPankaj Chand
 
What is Ajax technology?
What is Ajax technology?What is Ajax technology?
What is Ajax technology?JavaTpoint.Com
 
Web Design & Development - Session 1
Web Design & Development - Session 1Web Design & Development - Session 1
Web Design & Development - Session 1Shahrzad Peyman
 
Introduction to ajax
Introduction  to  ajaxIntroduction  to  ajax
Introduction to ajaxPihu Goel
 
Web Development Ppt
Web Development PptWeb Development Ppt
Web Development PptBruce Tucker
 
FULL stack -> MEAN stack
FULL stack -> MEAN stackFULL stack -> MEAN stack
FULL stack -> MEAN stackAshok Raj
 
Full Stack Web Development
Full Stack Web DevelopmentFull Stack Web Development
Full Stack Web DevelopmentSWAGATHCHOWDARY1
 

Tendances (20)

Web Scraping
Web ScrapingWeb Scraping
Web Scraping
 
What is web scraping?
What is web scraping?What is web scraping?
What is web scraping?
 
Web Scraping Basics
Web Scraping BasicsWeb Scraping Basics
Web Scraping Basics
 
Web Scraping and Data Extraction Service
Web Scraping and Data Extraction ServiceWeb Scraping and Data Extraction Service
Web Scraping and Data Extraction Service
 
Introduction to ajax
Introduction to ajaxIntroduction to ajax
Introduction to ajax
 
WEB DEVELOPMENT USING REACT JS
 WEB DEVELOPMENT USING REACT JS WEB DEVELOPMENT USING REACT JS
WEB DEVELOPMENT USING REACT JS
 
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documents
 
Web development | Derin Dolen
Web development | Derin Dolen Web development | Derin Dolen
Web development | Derin Dolen
 
Web Development with HTML5, CSS3 & JavaScript
Web Development with HTML5, CSS3 & JavaScriptWeb Development with HTML5, CSS3 & JavaScript
Web Development with HTML5, CSS3 & JavaScript
 
Ajax ppt
Ajax pptAjax ppt
Ajax ppt
 
Web application framework
Web application frameworkWeb application framework
Web application framework
 
Basics of JavaScript
Basics of JavaScriptBasics of JavaScript
Basics of JavaScript
 
What is Ajax technology?
What is Ajax technology?What is Ajax technology?
What is Ajax technology?
 
Web Design & Development - Session 1
Web Design & Development - Session 1Web Design & Development - Session 1
Web Design & Development - Session 1
 
jQuery for beginners
jQuery for beginnersjQuery for beginners
jQuery for beginners
 
Introduction to ajax
Introduction  to  ajaxIntroduction  to  ajax
Introduction to ajax
 
Web Development Ppt
Web Development PptWeb Development Ppt
Web Development Ppt
 
FULL stack -> MEAN stack
FULL stack -> MEAN stackFULL stack -> MEAN stack
FULL stack -> MEAN stack
 
Full Stack Web Development
Full Stack Web DevelopmentFull Stack Web Development
Full Stack Web Development
 
Rest API
Rest APIRest API
Rest API
 

En vedette

Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Abhishek Mishra
 
Developing effective data scientists
Developing effective data scientistsDeveloping effective data scientists
Developing effective data scientistsErin Shellman
 
Web Scraping is BS
Web Scraping is BSWeb Scraping is BS
Web Scraping is BSJohn D
 
Python beautiful soup - bs4
Python beautiful soup - bs4Python beautiful soup - bs4
Python beautiful soup - bs4Eueung Mulyana
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoSammy Fung
 
Parse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful SoupParse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful SoupJim Chang
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with PythonPaul Schreiber
 

En vedette (10)

Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010
 
Developing effective data scientists
Developing effective data scientistsDeveloping effective data scientists
Developing effective data scientists
 
Web Scraping is BS
Web Scraping is BSWeb Scraping is BS
Web Scraping is BS
 
Bot or Not
Bot or NotBot or Not
Bot or Not
 
Python beautiful soup - bs4
Python beautiful soup - bs4Python beautiful soup - bs4
Python beautiful soup - bs4
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and Django
 
Parse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful SoupParse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful Soup
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with Python
 
Scraping the web with python
Scraping the web with pythonScraping the web with python
Scraping the web with python
 
Beautiful soup
Beautiful soupBeautiful soup
Beautiful soup
 

Similaire à Intro to web scraping with Python

Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Esteve Castells
 
Tool it Up! - Session #2 - NetPanel
Tool it Up! - Session #2 - NetPanelTool it Up! - Session #2 - NetPanel
Tool it Up! - Session #2 - NetPaneltoolitup
 
How Browsers Work -By Tali Garsiel and Paul Irish
How Browsers Work -By Tali Garsiel and Paul IrishHow Browsers Work -By Tali Garsiel and Paul Irish
How Browsers Work -By Tali Garsiel and Paul IrishNagamurali Reddy
 
Company Visitor Management System Report.docx
Company Visitor Management System Report.docxCompany Visitor Management System Report.docx
Company Visitor Management System Report.docxfantabulous2024
 
Web scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabsWeb scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabszekeLabs Technologies
 
[PyConZA 2017] Web Scraping: Unleash your Internet Viking
[PyConZA 2017] Web Scraping: Unleash your Internet Viking[PyConZA 2017] Web Scraping: Unleash your Internet Viking
[PyConZA 2017] Web Scraping: Unleash your Internet VikingAndrew Collier
 
1 Introduction to Drupal Web Development
1 Introduction to Drupal Web Development1 Introduction to Drupal Web Development
1 Introduction to Drupal Web DevelopmentWingston
 
Ruby On Rails Siddhesh
Ruby On Rails SiddheshRuby On Rails Siddhesh
Ruby On Rails SiddheshSiddhesh Bhobe
 
How to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pptx
How to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pptxHow to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pptx
How to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pptxProductdata Scrape
 
How to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pdf
How to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pdfHow to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pdf
How to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pdfProductdata Scrape
 
The ultimate guide to web scraping 2018
The ultimate guide to web scraping 2018The ultimate guide to web scraping 2018
The ultimate guide to web scraping 2018STELIANCREANGA
 
Html intake 38 lect1
Html intake 38 lect1Html intake 38 lect1
Html intake 38 lect1ghkadous
 
OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve content
OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve contentOpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve content
OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve contentAlkacon Software GmbH & Co. KG
 
Semantic framework for web scraping.
Semantic framework for web scraping.Semantic framework for web scraping.
Semantic framework for web scraping.Shyjal Raazi
 
Scraping the web with Laravel, Dusk, Docker, and PHP
Scraping the web with Laravel, Dusk, Docker, and PHPScraping the web with Laravel, Dusk, Docker, and PHP
Scraping the web with Laravel, Dusk, Docker, and PHPPaul Redmond
 
HES2011 - joernchen - Ruby on Rails from a Code Auditor Perspective
HES2011 - joernchen - Ruby on Rails from a Code Auditor PerspectiveHES2011 - joernchen - Ruby on Rails from a Code Auditor Perspective
HES2011 - joernchen - Ruby on Rails from a Code Auditor PerspectiveHackito Ergo Sum
 

Similaire à Intro to web scraping with Python (20)

Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
 
Tool it Up! - Session #2 - NetPanel
Tool it Up! - Session #2 - NetPanelTool it Up! - Session #2 - NetPanel
Tool it Up! - Session #2 - NetPanel
 
How Browsers Work -By Tali Garsiel and Paul Irish
How Browsers Work -By Tali Garsiel and Paul IrishHow Browsers Work -By Tali Garsiel and Paul Irish
How Browsers Work -By Tali Garsiel and Paul Irish
 
Company Visitor Management System Report.docx
Company Visitor Management System Report.docxCompany Visitor Management System Report.docx
Company Visitor Management System Report.docx
 
Web scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabsWeb scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabs
 
JavaScripts & jQuery
JavaScripts & jQueryJavaScripts & jQuery
JavaScripts & jQuery
 
[PyConZA 2017] Web Scraping: Unleash your Internet Viking
[PyConZA 2017] Web Scraping: Unleash your Internet Viking[PyConZA 2017] Web Scraping: Unleash your Internet Viking
[PyConZA 2017] Web Scraping: Unleash your Internet Viking
 
1 Introduction to Drupal Web Development
1 Introduction to Drupal Web Development1 Introduction to Drupal Web Development
1 Introduction to Drupal Web Development
 
Performance (browser)
Performance (browser)Performance (browser)
Performance (browser)
 
Ruby On Rails Siddhesh
Ruby On Rails SiddheshRuby On Rails Siddhesh
Ruby On Rails Siddhesh
 
How to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pptx
How to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pptxHow to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pptx
How to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pptx
 
How to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pdf
How to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pdfHow to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pdf
How to Scrape Amazon Best Seller Lists with Python and BeautifulSoup.pdf
 
The ultimate guide to web scraping 2018
The ultimate guide to web scraping 2018The ultimate guide to web scraping 2018
The ultimate guide to web scraping 2018
 
Lecture 1 (2)
Lecture 1 (2)Lecture 1 (2)
Lecture 1 (2)
 
Html intake 38 lect1
Html intake 38 lect1Html intake 38 lect1
Html intake 38 lect1
 
OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve content
OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve contentOpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve content
OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve content
 
Web technologies part-2
Web technologies part-2Web technologies part-2
Web technologies part-2
 
Semantic framework for web scraping.
Semantic framework for web scraping.Semantic framework for web scraping.
Semantic framework for web scraping.
 
Scraping the web with Laravel, Dusk, Docker, and PHP
Scraping the web with Laravel, Dusk, Docker, and PHPScraping the web with Laravel, Dusk, Docker, and PHP
Scraping the web with Laravel, Dusk, Docker, and PHP
 
HES2011 - joernchen - Ruby on Rails from a Code Auditor Perspective
HES2011 - joernchen - Ruby on Rails from a Code Auditor PerspectiveHES2011 - joernchen - Ruby on Rails from a Code Auditor Perspective
HES2011 - joernchen - Ruby on Rails from a Code Auditor Perspective
 

Dernier

Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...gajnagarg
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...gajnagarg
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...amitlee9823
 

Dernier (20)

Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
 

Intro to web scraping with Python

  • 1. Intro to Web Scraping with Python Vancouver PyLadies Meetup March 7, 2016 Maris Lemba
  • 2. 2 Before we begin ● Do consider whether the site you are interested in allows scraping by examining its robots.txt file – This robots exclusion standard is used by web servers to communicate with web crawlers and other robots – Located in the top-level directory of web server – Can specify which directories the site owner does not want robots to access – Can have different access rules for different robots ● If you intend to publish the output from your scraping, consider whether it is legal to do so – It is generally ok to re-publish “factual data” #Sample robots.txt file User-agent: Googlebot Disallow: /folder/ User-agent: * Disallow: /photos/
  • 3. 3 Basic components of a web page ● HTML – a set of markup tags for describing web documents ● CSS – style sheet language for describing the presentation of an HTML document ● JavaScript – popular client-side programming language used for web applications Sample html source This is what it would look like in the browser If we wanted to scrape information from the “Day of the week” table, we can locate it in the html inside the <table> ... </table> element. If there are multiple tables, we can use attributes such as id to identify the right table.
  • 4. 4 How the web works user interface web server web browser database, data handling, etc server-based system user interface Ajax “engine” (javascript) web browser HTTP request HTTP request HTML + CSS data web server database, data handling, etc server-based system XML/JSON or HTML/CSS or Javascript data Schematic adapted from https://en.wikipedia.org/wiki/Ajax_%28programming%29 ClientServer Javascript call HTML+CSS data Conventional model of web application Ajax model of web application
  • 5. 5 Scraping ad hoc vs at scale ● Use case 1 (basic scraping): I need to scrape a few pages, maybe crawl a small site. I can wait to have all data retrievals done sequentially. – This is what we will focus on in this talk – Two scenarios: ● All the data you need is in the HTML files you get from the server ● The page is dynamically generated using Ajax ● Use case 2: I am crawling large sites periodically. I need task scheduling, asynchronous retrieval, pipeline automation, and lots of other tools to make this easier. – You may want to look into a crawling/scraping framework like Scrapy
  • 6. 6 Agenda ● We'll walk through examples for the two scenarios under “use case 1” – We will use two web sites, pacificroadrunners.ca/firsthalf/ and ultrasignup.com, to scrape the results for all years of two running races ● Pacific Road Runners First Half Marathon ● Chuckanut 50K
  • 7. 7 Basic scraping: static HTML ● Tasks: – Inspect the HTML source to locate where in the Document Object Model (DOM) the data of interest resides ● Right-click on the page in browser window and select “View Page Source” – Retrieve the HTML file using built-in urllib library (note that the interface to this library has changed between Python 2 and 3) – Parse the HTML file and extract the data of interest, using either a built-in parser like html.parser, or third-party libraries such as lxml, html5lib or BeautifulSoup ● Which library you choose for parsing depends on – how sloppy your html is: if your html is “incorrect”, the parsers may handle it differently – whether you want to maximize speed (choose lxml) – what external dependencies you can live with (lxml is a binding to popular C libraries libxml2 and libxslt) – how “easy” you want the interface to be (BeautifulSoup has an easy-to-get-started API and great documentation with examples)
  • 8. 8 urllib library ● High level standard library for retrieving data across the Web (not just html files) ● urllib library was split into several submodules in Python 3 for working with URLs: – urllib.request, urllib.parse, urllib.error,  urllib.robotparser ● The urllib.request provides methods for opening URLs ● The urllib.request.urlopen() method is similar to the built-in function open() for opening files
  • 9. 9 BeautifulSoup library ● Python third-party library for extracting data from html and xml files ● Works with html.parser, lxml, html5lib ● Provides ways to navigate, search and modify the parse tree based on the position in the parse tree, tag name, tag attributes, css classes using regular expressions, user- defined functions etc ● Excellent tutorial with examples at http://www.crummy.com/software/BeautifulSoup/bs4/doc/ ● Supports Python 2.7 and 3
  • 10. 10 Quick intro to BeautifulSoup ● BeautifulSoup constructor creates an object that represents original documents as a nested data structure ● The simplest way to address a tag is to use its name ● find() and find_all() methods allow to search for an element; find() returns the first element found, and find_all() a list of elements ● You can access a tag's attributes by treating the tag like a dictionary ● You can extract text from the element Parsing samplepage.html with BeautifulSoup
  • 11. 11 Example: Results tab of Vancouver First Half race The main Results page has links to yearly results
  • 12. 12 HTML source inspection: front page of results Find the links in the HTML source file corresponding to the rows of yearly results you see in the browser Note that the links we are interested in are all inside <div class=”entry-content”> ... </div> The links to each year's page are inside an <a> tag, as a value to attribute named href The text field of the <a> ... </a> element is a string containing the year of the race
  • 13. 13 Extract all links to individual year's results Note that the collected links are relative; we can append them to the main url to get absolute links
  • 14. 14 HTML source inspection: individual year Find the table of results in the DOM This is the table we want, each runner's info in <tr> ... </tr> Each participant's entry is a table row <tr> ... </tr>. The text (participant's placement, name, finishing time, etc) are inside <td> ... </td> elements nested in the table row.
  • 15. 15 Scrape results for all years First five finishers of the 2016 race on the web page Let's print out the first five of the 2016 entry in our dictionary
  • 16. 16 What is Ajax ● Name comes from Asynchronous JavaScript and XML, even though XML is no longer the only format for data exchange ● Refers to a group of web technologies used to implement web applications that communicate with the server in the background without interfering with the display of the page or reloading the entire page ● For web scraping, this means that the data you are interested in may not be in the source HTML file you receive from the server but is generated on the client side ● You can use browser development tools (“inspect element”) to inspect the Ajax-generated HTML that your browser displays
  • 17. 17 Second example: ultrasignup.com So far, web site looks similar to previous example. We are interested in scraping results for all years of the race.
  • 18. 18 Results page for an individual year This is a sample page for 2014 results. It looks like the results are in a neat table.
  • 19. 19 Inspecting the DOM: the result table is not filled out in the source HTML However, the <table id=”list”> element is empty when we view source html file. This means we can't simply extract the result table from the HTML file that the server sent to us like we did in the previous example. We see all result rows in <table id=”list”> </table> when inspecting the element in browser developer tools such as Mozilla Firebug (which shows the generated HTML) Again, the details of the results are within the <td> ... </td> element inside each <tr> ... </tr> element
  • 20. 20 Scraping dynamically generated HTML ● Your scraping system needs to be able to execute JavaScript and either retrieve the results directly, or generate the HTML as seen in the browser and scrape data from that ● A popular choice for the latter is to use Selenium, a suite of tools for browser automation – Python's binding to Selenium suite is the library selenium – Selenium supports several browsers (Firefox, Chrome, etc), but for scraping a headless non-graphical browser like PhantomJS is a good choice – Once Selenium has generated the HTML after running JavaScript, you can either access it as a string and use this as input to BeautifulSoup for data extraction, or you can navigate the DOM and find elements within Selenium
  • 21. 21 Quick intro to selenium ● Selenium WebDriver accepts commands, sends them to the browser, and retrieves results ● You can navigate and search the DOM – find_element_by_id('my_id') – find_elements_by_tag_name('table') – find_elements_by_class_name('my_class_name') – find_elements_by_xpath() ● For example, find_elements_by_xpath(“//table[@id='list']”) will find a table with the id = “list” ● Tutorial on xpath syntax: www.w3schools.com/xsl/xpath_intro.asp – and other methods (using css selectors, link text, etc) ● You can click on links using WebDriver's click() method ● You can retrieve the generated html source by calling .page_source on webdriver instance and parse it outside of selenium ● More on Selenium with Python: http://selenium-python.readthedocs.org/index.html
  • 22. 22 Scraping ultrasignup race results using Selenium and PhantomJS
  • 23. 23 And it works! Note that the first two lists (rows in web page table) in our dictionary corresponding to a year do not actually contain finisher results. Let's ignore those, and print out the subsequent results for the first 12 finishers. This table is from the 2014 results page for comparison: the results match
  • 24. 24 Additional resources for scraping projects Additional third-party libraries you may want to look at: ● requests library – Filling out forms/login information before accessing data and managing cookies ● mechanize library – Navigating through forms, session and cookie handling – Does not yet support Python 3 – Does not run JavaScript ● scrapy library – Framework for web crawling and scraping: managing crawling, asynchronous scheduling and processing of requests, cookie and session handling, built-in support for data extraction etc – Does not yet support Python 3
  • 25. 25 References for further reading ● BeautifulSoup – http://www.crummy.com/software/Bea utifulSoup/bs4/doc/ ● Selenium with Python – http://selenium-python.readthedocs.org/ ● General web scraping – Web Scraping with Python by Ryan Mitchell, O'Reilly 2015