SCRAPING FROM THE WEB
An Overview That Does Not Contain Too Much Cussing

Feihong Hsu
ChiPy
February 14, 2013
Organization


Definition of scraper

Common types of scrapers

Components of a scraping system

Pro tips
What I mean when I say scraper


Any program that retrieves structured data from the web, and
then transforms it to conform with a different structure.

Wait, isn’t that just ETL? (extract, transform, load)

Well, sort of, but I don’t want to call it that...
Notes
Some people would say that “scraping” only applies to web
pages. I would argue that getting data from a CSV or JSON file is
qualitatively not all that different. So I lump them all together.

Why not ETL? Because ETL implies that there are rules and
expectations, and these two things don’t exist in the world of
open government data. They can change the structure of their
dataset without telling you, or even take the dataset down on a
whim. A program that pulls down government data is often going
to be a bit hacky by necessity, so “scraper” seems like a good
term for that.
Main types of scrapers

CSV

RSS/Atom

JSON

XML

HTML crawler

Web browser

PDF

Database dump

GIS

Mixed
CSV

import csv

You should usually use csv.DictReader.

If the column names are all caps, consider making them
lowercase.

Watch out for CSV datasets that don’t have the same number of
elements on each row.
import csv

def get_rows(csv_file):
    with open(csv_file) as f:
        reader = csv.reader(f)
        # Get the column names, lowercased.
        column_names = tuple(k.lower() for k in next(reader))
        for row in reader:
            yield dict(zip(column_names, row))
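Since the slide above recommends csv.DictReader, here is a minimal sketch of the same generator using it (assuming the file has a header row):

import csv

def get_dict_rows(csv_file):
    with open(csv_file) as f:
        # DictReader builds the dicts for you from the header row.
        for row in csv.DictReader(f):
            yield {k.lower(): v for k, v in row.items()}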
JSON



import json
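A minimal sketch, assuming the dataset is either a top-level list of records or a dict with the records under a key (here, hypothetically, 'results'):

import json

with open('data.json') as f:  # hypothetical file name
    data = json.load(f)

# Handle both common shapes: a bare list, or a wrapper dict.
records = data['results'] if isinstance(data, dict) else data
for record in records:
    print(record)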
XML

import lxml.etree

Get rid of namespaces in the input document. http://bit.ly/LO5x7H

A lot of XML datasets have a fairly flat structure. In these cases,
convert the elements to dictionaries.
<root>
   <items>
      <item>
          <id>3930277-ac</id>
          <name>Frodo Samwise</name>
          <age>56</age>
          <occupation>Tolkien scholar</occupation>
          <description>Short, with hairy feet</description>
      </item>
      ...
   </items>
</root>
import lxml.etree

def get_items(xml_string):
    tree = lxml.etree.fromstring(xml_string)
    for el in tree.findall('items/item'):
        # Keys are element names.
        keys = (c.tag for c in el)
        # Values are element text contents.
        values = (c.text for c in el)
        yield dict(zip(keys, values))
HTML

import requests

import lxml.html

I generally use XPath, but pyquery seems fine too.

If the HTML is very funky, use html5lib as the parser.

Sometimes data can be scraped from a chunk of JavaScript
embedded in the page.
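A minimal sketch of the requests + lxml.html + XPath combination (the URL and XPath expression are hypothetical):

import requests
import lxml.html

doc = lxml.html.fromstring(requests.get('http://example.gov/permits').text)
for row in doc.xpath('//table[@id="results"]//tr'):
    # text_content() flattens any nested markup inside each cell.
    cells = [td.text_content().strip() for td in row.findall('td')]
    print(cells)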
Notes


Please don’t use urllib2.

If you do use html5lib for parsing, remember that you can do so
from within lxml itself. http://lxml.de/html5parser.html
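For example (a sketch; note that html5parser puts elements in the XHTML namespace, so XPath expressions may need namespace handling):

from lxml.html import html5parser

# Parse broken HTML with html5lib's error-tolerant algorithm,
# while still getting lxml elements back.
doc = html5parser.document_fromstring(SOME_BROKEN_HTML)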
Web browser

If you need a real browser to scrape the data, it’s often not worth
it.

But there are tools out there.

I wrote PunkyBrowster, but I can’t really recommend it over
ghost.py, which seems to have a better API, supports both PySide
and PyQt, and has a more permissive license (MIT).
PDF
Not as hard as it looks.

There are no Python libraries that handle all kinds of PDF
documents in the wild.

Use the pdftohtml command to convert the PDF to XML.

When debugging, use pdftohtml to generate HTML that you can
inspect in the browser.

If the text in the PDF is in tabular format, you can group text cells
by proximity.
Notes
The “group by proximity” strategy works like this:

1. Find a text cell that has a very distinct pattern (probably a date
cell). This is your “anchor”.

2. Find all cells that have the same row position as the anchor
(possibly off by a few pixels).

3. Figure out which grouped cells belong to which fields based
on column position.
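A rough sketch of step 2, assuming XML produced by pdftohtml -xml (whose <text> elements carry top/left pixel positions):

import lxml.etree

def group_cells_into_rows(xml_path, tolerance=3):
    tree = lxml.etree.parse(xml_path)
    rows = {}
    for cell in tree.findall('.//text'):
        top = int(cell.get('top'))
        # Snap cells whose vertical positions differ by only a few
        # pixels onto the same row.
        key = next((k for k in rows if abs(k - top) <= tolerance), top)
        rows.setdefault(key, []).append(cell)
    for top in sorted(rows):
        # Order each row's cells left to right to match the columns.
        yield sorted(rows[top], key=lambda c: int(c.get('left')))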
RSS/Atom

import feedparser

Sometimes feedparser can’t handle custom fields, and you’ll have
to fall back to lxml.etree.

Unfortunately, plenty of RSS feeds are not valid XML.
Either do some custom munging or try html5lib.
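A minimal feedparser sketch (the feed URL is hypothetical):

import feedparser

feed = feedparser.parse('http://example.gov/press-releases.rss')
for entry in feed.entries:
    print(entry.title, entry.link, entry.get('published'))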
Database dump


If it’s a Microsoft Access file, use mdbtools to dump the data.

Sometimes it’s a ZIP file containing CSV files, each of which
corresponds to a separate table dump.

Just load it all into a SQLite database and run queries on it.
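A sketch of loading one CSV table dump into SQLite (it assumes a header row whose names are safe to use as column names, and rows of uniform length):

import csv
import sqlite3

def load_table(conn, table, csv_path):
    with open(csv_path) as f:
        reader = csv.reader(f)
        columns = [c.lower() for c in next(reader)]
        # Untyped columns are fine in SQLite.
        conn.execute('CREATE TABLE %s (%s)' % (table, ', '.join(columns)))
        placeholders = ', '.join('?' for c in columns)
        conn.executemany(
            'INSERT INTO %s VALUES (%s)' % (table, placeholders), reader)

conn = sqlite3.connect(':memory:')
load_table(conn, 'permits', 'permits.csv')  # hypothetical table dump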
Notes


We wrote code that simulated joins using lists of dictionaries.
This was painful to write and not so much fun to read. Don’t do
this.
GIS


I haven’t worked much with KML or SHP files.

If an organization provides GIS files for download, they usually
offer other options as well. Look for those instead.
Mixed


This is very common.

For example: an organization offers a CSV download, but you
have to scrape their web page to find the link for it.
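A sketch of that example: scrape the page for the CSV link, then download it (the URL and link pattern are hypothetical):

import requests
import lxml.html

INDEX_URL = 'http://example.gov/downloads'  # hypothetical index page
doc = lxml.html.fromstring(requests.get(INDEX_URL).text)
# Resolve relative hrefs against the page URL.
doc.make_links_absolute(INDEX_URL)
csv_url = doc.xpath('//a[contains(@href, ".csv")]/@href')[0]
csv_text = requests.get(csv_url).text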
Components of a scraping system
Downloader

Cacher

Raw item retriever

Existing item detector

Item transformer

Status reporter
Notes
Caching is essential when scraping a dataset that involves a large
number of HTML pages. Test runs can take hours if you’re
making requests over the network. A good caching system pretty
prints the files it downloads so you can more easily inspect them.

Reporting is essential if you’re managing a group of scrapers.
Since you KNOW that at least one of your scrapers will be
broken at any time, you might as well know which ones are
broken. A good reporting mechanism shows when your scrapers
break, as well as when the dataset itself has issues (determined
heuristically).
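A minimal sketch of the caching downloader described above (the cache location and keying scheme are just one possible choice):

import hashlib
import os
import requests

CACHE_DIR = 'cache'

def get(url):
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    # Key each response by a hash of its URL so repeated test runs
    # read from disk instead of hitting the network.
    path = os.path.join(CACHE_DIR, hashlib.md5(url.encode('utf-8')).hexdigest())
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    text = requests.get(url).text
    with open(path, 'w') as f:
        f.write(text)
    return text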
Steps to writing a scraper
Find the data source

Find the metadata

Analysis (verify the primary key)

Develop

Test

Fix (repeat ∞ times)
Notes

The Analysis step should also include noting which fields should
be lookup fields (see design pattern slide).

The Testing step is always done on real data and has three
phases: dry run (nothing added or updated), dry run with
lookups (only lookups are added), and production run. I run all
three phases on my local instance before deploying to
production.
A very useful tool for HTML scraping

Firefinder (http://bit.ly/kr0UOY)

Extension for Firebug

Allows you to test CSS and XPath expressions on any page, and
visually inspect the results.
Look, it’s Firefinder!
Storing scraped data

Don’t create tables before you understand how you want to use
the data.

Consider using ZODB (or another nonrelational DB).

Adrian Holovaty’s talk on how EveryBlock avoided creating new
tables for each dataset: http://bit.ly/Yl6VAZ (relevant part
starts at 7:10)
Design patterns


If a field contains a finite number of possible values, use a lookup
table instead of storing each value.

Make a scraper superclass that incorporates common scraper
logic.
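A sketch of the lookup pattern as a get-or-create helper (SQLite-flavored; the table layout is hypothetical):

def lookup_id(conn, table, value):
    # Return the id for value in a lookup table, inserting it the
    # first time the value is seen.
    row = conn.execute(
        'SELECT id FROM %s WHERE value = ?' % table, (value,)).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute('INSERT INTO %s (value) VALUES (?)' % table, (value,))
    return cur.lastrowid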
Notes


The scraper superclass will probably have convenience methods
for converting dates/times, cleaning HTML, looking for existing
items, etc. It should also incorporate the caching and reporting
logic.
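A skeletal sketch of what such a superclass might look like (the method names are invented for illustration):

class BaseScraper(object):
    def run(self):
        for raw_item in self.get_raw_items():
            if self.item_exists(raw_item):
                continue
            self.save(self.transform(raw_item))
        self.report_status()

    def get_raw_items(self):
        # Subclasses yield raw records (CSV rows, XML elements, etc.),
        # typically via the caching downloader.
        raise NotImplementedError

    def transform(self, raw_item):
        # Convenience methods for cleaning dates, HTML, etc. live here.
        raise NotImplementedError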
Working with government data

Some data sources are only available at certain times of day.

Be careful about rate limiting and IP blocking.

Data scraped from a web page shouldn’t be used for analyzing
trends.

When you’re stuck, give them a phone call.
Notes



If you do manage to find an actual person to talk to you, keep a
record of their contact information and do NOT lose it! They are
your first line of defense when a dataset you rely on goes down.
Pro tips

When you don’t know what encoding the content is in, use
charade, not chardet.

Remember to clean any HTML you intend to display.

If the dataset doesn’t allow filtering by date, it’s a lost cause
(unless you just care about historical data).

When your scraper fails, do NOT fix it. If a user complains,
consider fixing it.
I am done



Questions?
