Scraping data from the web and documents

Scraping Data from
Documents and the Web
Tommy Tavenner
National Wildlife Federation

What is it?
© 2014 Tommy Tavenner

What is Scraping?
• Converting data from human readable into machine readable
• This data is sometimes referred to as ‘unstructured’ but is really
just not structured properly for systematic parsing
• The data is often embedded in layers of formatting metadata.
Think of HTML or PDF formatting such as font colors and tables.
• The job of the scraper is to separate the data from the
formatting. In some cases even using the formatting to interpret
the data.
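A minimal sketch of separating data from formatting, using only Python's standard-library html.parser. The TextExtractor class and the HTML snippet are hypothetical, made up for illustration; real pages usually need more careful handling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a page and drop the markup around it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty runs of text found between tags
        text = data.strip()
        if text:
            self.chunks.append(text)


# Hypothetical snippet: the data is wrapped in font, bold, and italic formatting
html = '<p><font color="red"><b>Monarch butterfly</b></font>: <i>1,204</i> sightings</p>'

extractor = TextExtractor()
extractor.feed(html)
extractor.close()
print(extractor.chunks)
```

The formatting tags are discarded here, but a scraper could just as well record them, for example treating bold text as a label for the value that follows.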

Is it Legal?

Is Scraping Legal?
• It depends
• Most publicly available data in the US falls within the sphere
of copyright protection.
> Creativity in producing the source data
> The manner in which the data is presented
> Fair Use on the web
• What is the purpose of the scraping?

Is Scraping Legal?
• Terms of Service
> Does it explicitly prohibit scraping?
> Does it prohibit storing information privately?

Is Scraping Legal?
• Feist v. Rural Telephone (1991)
> Feist, a phone book compiler in Kansas, copied the contents of
Rural Telephone’s directory after Rural refused to license the
information.
> Rural sued Feist for copyright infringement. Because of the nature
of the information, the case eventually made it to the Supreme
Court.
> The case centered on originality and whether compiling facts
constitutes an original work.
> The court ruled that the phone directory did not constitute an
original compilation because no discretion was exercised in
deciding on contents.

Is Scraping Legal?
• LinkedIn case (2014)
> Suing a group of unknown defendants in California.
> LinkedIn alleges that this group used a series of bots and fake
profiles on the site to scrape content from other member profiles
> The case is based on the Digital Millennium Copyright Act.

Jargon
• Spider – Searches for links within content and follows them,
building up a site map or web of content.
• Crawler – Synonym for Spider
• Training Data – Like in supervised machine learning, training
data is used to teach a spider how to interpret the content it
will be processing.
• IP Proxy/Switching – Regular switching of IP address used to
bypass restrictions on the number of connections per client set
by web servers. May be a sign of less than legal or honorable
intent in scraping.
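The spider idea can be sketched in a few lines of standard-library Python. Everything named here (LinkCollector, crawl, the PAGES dictionary, and the example.org URLs) is a hypothetical stand-in; the pages are held in memory so the sketch runs offline, and a real spider would fetch pages over HTTP and restrict itself to sites it is allowed to crawl.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Record the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch):
    """Breadth-first crawl; returns a site map of url -> list of linked urls."""
    site_map = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in site_map:
            continue  # already visited
        html = fetch(url) or ""  # unknown pages are treated as empty
        collector = LinkCollector()
        collector.feed(html)
        # Resolve relative links against the page they were found on
        links = [urljoin(url, href) for href in collector.links]
        site_map[url] = links
        queue.extend(links)
    return site_map


# Hypothetical three-page site, keyed by URL
PAGES = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a>',
    "http://example.org/b.html": '<a href="/">Home</a>',
}

site_map = crawl("http://example.org/", PAGES.get)
for page, links in site_map.items():
    print(page, "->", links)
```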

Anatomy of a Scraper
Document Load
• Pull in the complete web page, PDF, XML, etc.
Parsing
• Parse the HTML, XML, or PDF metadata into something the script can understand
Extraction
• Use the results of parsing to extract the data we are looking for
Transformation
• Convert the data into useful formats, e.g. currency or dates

Document Load
• Load the entire document or HTML page, generally as a string of
characters.
• For larger documents this may involve
splitting it into multiple pages
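A minimal sketch of the load step, assuming the source is either an http(s) URL or a local text file. The function names load_document and split_pages are hypothetical, and the form-feed page separator is an assumption that holds for the output of some PDF-to-text converters but not for every document.

```python
from pathlib import Path
from urllib.request import urlopen


def load_document(source, encoding="utf-8"):
    """Return the whole document as one string, from a URL or a local path."""
    if source.startswith(("http://", "https://")):
        with urlopen(source, timeout=30) as response:
            return response.read().decode(encoding, errors="replace")
    return Path(source).read_text(encoding=encoding)


def split_pages(text, separator="\f"):
    """Split a large document into pages.

    Assumes page breaks are marked by form-feed characters; empty pages
    are dropped.
    """
    return [page for page in text.split(separator) if page.strip()]
```

Usage might look like `pages = split_pages(load_document("report.txt"))`, where report.txt is a placeholder file name.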

Parsing
• Interpret the document to make searching
possible.
• Biggest potential failure point
• Specific to the source data.
• HTML Document Object Model
• PDF Grid Model
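To make the HTML Document Object Model idea concrete, here is a rough sketch that turns an HTML string into a small tree of nodes using the standard-library html.parser. Node, TreeBuilder, and parse_html are hypothetical names, and this is far less forgiving than a real DOM parser; it only illustrates what "parsing into something searchable" means.

```python
from html.parser import HTMLParser


class Node:
    """One element in a simplified document tree."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def find_all(self, tag):
        """Yield every descendant with the given tag name, in document order."""
        for child in self.children:
            if child.tag == tag:
                yield child
            yield from child.find_all(tag)


class TreeBuilder(HTMLParser):
    # Elements that never have a closing tag (partial list)
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Hypothetical table of counts
html = "<table><tr><td>Elk</td><td>12</td></tr><tr><td>Bison</td><td>7</td></tr></table>"
root = parse_html(html)
rows = [[cell.text for cell in row.find_all("td")] for row in root.find_all("tr")]
print(rows)
```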

Extraction
• Search parsed data for particular
pieces of information
• e.g. file name, link, or table
• Separate data into individual pieces for
later processing
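A short sketch of the extraction step, pulling links, file names, and a table out of an already-parsed document. It uses the standard-library xml.etree.ElementTree, which assumes the markup is well formed; the page content below is invented for illustration.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical, well-formed snippet standing in for a parsed page
page = ET.fromstring(
    """
    <div>
      <a href="/reports/2013-counts.pdf">2013 counts</a>
      <a href="/about.html">About</a>
      <table>
        <tr><th>Species</th><th>Count</th></tr>
        <tr><td>Elk</td><td>12</td></tr>
        <tr><td>Bison</td><td>7</td></tr>
      </table>
    </div>
    """.strip()
)

# Links: the href of every <a> element anywhere in the tree
links = [a.get("href") for a in page.iter("a")]

# File names: last path segment of the links that point at PDFs
pdf_names = [
    posixpath.basename(urlparse(href).path)
    for href in links
    if href.lower().endswith(".pdf")
]

# Table: one dict per data row, keyed by the header cells
rows = page.findall(".//tr")
header = [th.text for th in rows[0].findall("th")]
records = [
    dict(zip(header, (td.text for td in row.findall("td"))))
    for row in rows[1:]
]

print(links)
print(pdf_names)
print(records)
```

Note that every extracted value is still a string at this point; converting "12" into a number belongs to the transformation step.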

Transformation
• Convert data into proper output
• Apply standards
• Change type
• e.g. date string → date
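A minimal sketch of the transformation step, converting scraped strings into typed values. The helper names and the sample record are hypothetical, and the formats are assumptions: month-first dates and US-style dollar amounts with comma thousands separators.

```python
from datetime import datetime
from decimal import Decimal


def to_date(text, fmt="%m/%d/%Y"):
    """Date string -> date object (assumes month/day/year order by default)."""
    return datetime.strptime(text.strip(), fmt).date()


def to_amount(text):
    """Currency string such as '$1,250.50' -> Decimal."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


def to_count(text):
    """Integer with thousands separators such as '1,204' -> int."""
    return int(text.strip().replace(",", ""))


# Hypothetical record as it might come out of the extraction step
raw = {"date": "03/04/2014", "grant": "$1,250.50", "sightings": "1,204"}

clean = {
    "date": to_date(raw["date"]).isoformat(),  # apply a standard: ISO 8601
    "grant": to_amount(raw["grant"]),
    "sightings": to_count(raw["sightings"]),
}
print(clean)
```

Decimal is used for the currency value rather than float so that amounts are not subject to binary rounding.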

Visual Scraping tools
• Require no programming knowledge
• Primarily web-based
• Allow quick access to data
• Because they are not bespoke, they may require more scrubbing of
the data after scraping

ScraperWiki
• Paid Service with very basic free plan
• Focused on table extraction and Twitter data
• Takes a single page or document as its source

ScraperWiki
• Allows you to quickly access the data or summarize it.
• Works well with PDFs of tables but struggles with mixed data.

Import.io
• In early stages, currently free with professional accounts
• Downloadable Java app – multi-platform
• Focused more on crawling sites to build up data sources
• Offers limited training or refining abilities to make sure it
extracts data correctly.
• Enables access to the data source either as a downloadable
file or as an API.

Import.io
• Data can be extracted either for a single page or a full site

Scrapinghub
• Designed for much larger scraping jobs, including multi-site

Scrapinghub
• Sits somewhere between a visual scraper and a scraping
library.
• Custom scrapers may be developed in Python and hosted by
Scrapinghub
• The autoscraper allows annotating pages and training the
scraper
• The crawler starts with a single page and works out from there
following links on the pages it finds and quickly building large
databases.

Scraping with a scripting language
• Libraries are available in most languages.
• Primarily make it easier to understand a certain format, e.g.
HTML or PDF.
• Require strong knowledge of the language
• Require more fine tuning but result in much higher quality data

R
• scrapeR – for parsing HTML/XML
• XML package – for parsing HTML/XML
• tm – for parsing PDFs using Xpdf or Poppler engines

Python
• ScraperWiki
• Scrapy
• BeautifulSoup – for parsing HTML
• XPath
• PDFMiner – for parsing PDFs

PHP
• Simple HTML DOM
• PDF Parser

JavaScript
• Node.js using Request and Cheerio
• jsPDF
• pdf2json
