Web Scraping Basics
Kyle Banerjee
banerjek@ohsu.edu
The truth of the matter is...
Web scraping is one of the
worst ways to get data!
What’s wrong with scraping?
1. Slow, resource intensive, not scalable
2. Unreliable -- breaks when website
changes and works poorly with
responsive design techniques
3. Difficult to parse data
4. Harvesting looks like an attack
5. Often prohibited by TOS
Before writing a scraper
Call!
● Explore better options
● Check terms of service
● Ask permission
● Can you afford scrape
errors?
Alternatives to scraping
1. Data dumps
2. API
3. Direct database connections
4. Shipping drives
5. Shared infrastructure
Many datasets are easy to retrieve
You can often export search results
Why scrape the Web?
1. Might be the only method available
2. Sometimes can get precombined or
preprocessed info that would otherwise
be hard to generate
Things to know
1. Web scraping is about parsing and
cleaning.
2. You don’t need to be a programmer, but
scripting experience is very helpful.
Don’t use Excel. Seriously.
Excel
● Mangles your data
○ Identifiers and numeric data at risk
● Cannot handle carriage returns in data
● Crashes with large files
● OpenRefine is a better tool for situations
where you think you need Excel
http://openrefine.org
Harvesting options
● Free utilities
● Purchased software
● DaaS (Data as a Service) -- hosted web
spidering
● Write your own
Watch out for spider traps!
● Web pages that intentionally or
unintentionally cause a crawler to make
an infinite number of requests
● No algorithm can detect all spider traps
Ask for help!
1. Methods described here are familiar to
almost all systems people
2. Domain experts can help you identify tools
and shortcuts that are especially relevant
to you
3. Bouncing ideas off *anyone* usually results
in a superior outcome
Handy skills
Skill                Benefit
DOM                  Identify and extract data
Regular expressions  Identify and extract data
Command line         Process large files
Scripting            Automate repetitive tasks;
                     perform complex operations
Handy basic tools
Tool                  Benefit
Web scraping service  Simplify data acquisition
cURL (command line)   Easily retrieve data using APIs
wget (command line)   Recursively retrieve web pages
OpenRefine            Process and clean data
Power tools
Tool                       Benefit
grep, sed, awk, tr, paste  Select and transform data in VERY large files quickly
jq                         Easily manipulate JSON
xml2json                   Convert XML to JSON
csvkit                     Utilities to convert to and work with CSV
scrape                     HTML extraction using XPath and CSS selectors
Web scraping, the easy way
● Hosted services allow you to easily target
specific structures and pages
● Programming experience unnecessary, but
helpful
● For unfamiliar problems, ask for help
Hosted example, Scrapinghub
Scrapinghub data output
Document Object Model (DOM)
● Programming interface for HTML and XML
documents
● Supported by many languages/environments
● Represents documents in a tree structure
● Used to directly access content
Document Object Model (DOM) Tree
/document/html/body/div/p = “text node”
XPath is a syntax for defining
parts of an XML document
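A minimal sketch using xmllint (part of libxml2) to pull text
nodes out of a page with an XPath expression; the URL and the
path are placeholders:
curl -s https://example.com/page.html | \
xmllint --html --xpath '//div/p/text()' - 2>/dev/null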
The Swiss Army Knife of data
Regular Expressions
● Special strings that allow you to search
and replace based on patterns
● Supported in a wide variety of software
and all operating systems
Regular expressions can...
● Use logic, capitalization, edges of
words/lines, express ranges, use bits (or
all) of what you matched in replacements
● Convert free text to XML, XML to delimited
text or codes, and vice versa
● Find complex patterns using proximity
indicators and/or involving multiple lines
● Select preferred versions of fields
Quick Regular Expression Guide
^             Match the start of the line
$             Match the end of the line
.             Match any single character
*             Match zero or more of the previous character
[A-DG-J0-5]*  Match zero or more of ABCDGHIJ012345
[^A-C]        Match any one character that is NOT A, B, or C
(dog)         Match the word "dog", including case, and remember that text
              to be used later in the match or replacement
\1            Insert the first remembered text as if it were typed here (\2 for
              second, \3 for third, etc.)
\             Use to match special characters. \\ matches a backslash, \*
              matches an asterisk, etc.
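To see capturing and backreferences in action, a quick sketch
with GNU sed (-E enables extended syntax):
echo "dog dog cat" | sed -E 's/(dog) \1/\1/'
dog cat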
Data can contain weird problems
● XML metadata contained errors in every
field with an HTML entity (&amp; &lt;
&gt; &quot; &apos; etc.)
<b>Oregon Health &amp</b>
<b> Science University</b>
● Error occurs in many fields scattered across
thousands of records
● But this can be fixed in seconds!
Regular expressions to the rescue!
● “Whenever a field ends in an HTML entity
minus the semicolon and is followed by an
identical field, join those into a single field and
fix the entity. Any line can begin with an
unknown number of tabs or spaces”
/^\s*<([^>]+>)(.*)(&[a-z]+)<\/\1\n\s*<\1/<\1\2\3;/
Confusing at first, but easier than you think!
● Works on all platforms and is built into a
lot of software (including Office)
● Ask for help! Programmers can help you
with syntax
● Let’s walk through our example which
involves matching and joining unknown
fields across multiple lines...
Regular Expression Analysis
/^\s*<([^>]+>)(.*)(&[a-z]+)<\/\1\n\s*<\1/<\1\2\3;/
^            Beginning of line
\s*<         Zero or more whitespace characters followed by "<"
([^>]+>)     One or more characters that are not ">" followed by ">" (i.e.
             a tag). Store in \1
(.*)         Any characters up to the next part of the pattern. Store in \2
(&[a-z]+)    Ampersand followed by letters (an HTML entity). Store in \3
<\/\1\n      "</" followed by \1 (i.e. the closing tag) followed by a newline
\s*<\1       Any number of whitespace characters followed by tag \1
/<\1\2\3;/   Replace everything up to this point with "<" followed by \1
             (opening tag), \2 (field contents), \3, and ";" (fixing the HTML
             entity). This effectively joins the fields
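Because the pattern spans a line break, it needs a tool that can
match across lines. A sketch using perl, which slurps the whole
file with -0777; the filename is a placeholder:
perl -0777 -pe 's/^\s*<([^>]+>)(.*)(&[a-z]+)<\/\1\n\s*<\1/<$1$2$3;/mg' metadata.xml
This turns the broken example above into
<b>Oregon Health &amp; Science University</b>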
The command line
● Often the easiest way by far
● Process files of any size
● Combine the power of individual programs
in a single command (pipes)
● Supported by all major platforms
Getting started with the command line
● MacOS (use Terminal)
○ Install Homebrew
○ ‘brew install [package name]’
● Windows 10
○ Enable linux subsystem and go to bash terminal
○ ‘sudo apt-get install [package name]’
● Or install VirtualBox with linux
○ ‘sudo apt-get install [package name]’ from terminal
Learning the command line
● The power of pipes -- combine programs!
● Google solutions for specific problems --
there are many online examples
● Learn one command at a time. Don’t worry
about what you don’t need.
● Try, but give up fast. Ask linux geeks for
help.
Scripting is the command line!
● Simple text files that allow you to combine
utilities and programs written in any language
● No programming experience necessary
● Great for automating processes
● For unfamiliar problems, ask for help
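For instance, a script is nothing more than commands in a file;
a sketch (the script name and URL are hypothetical):
#!/bin/bash
# Fetch a page and strip carriage returns in one step
curl -s "$1" | tr -d '\r' > page.txt
Save it as fetchclean.sh, make it executable, and run
./fetchclean.sh https://example.com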
wget
● A command line tool to retrieve data from web
servers
● Works on all operating systems
● Works with unstable connections
● Great for recursive downloads of data files
● Flexible. Can use patterns, specify depth, etc.
wget example
wget --recursive ftp://157.98.192.110/ntp-cebs/datatype/microarray/HESI/
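To crawl politely and keep only what you need, flags like these
help; a sketch with an illustrative URL and file filter:
wget --recursive --level=2 --wait=1 --no-parent --accept '*.csv' https://example.org/data/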
Filezilla is good for FTP using a GUI
cURL
● A tool to transfer data from or to a server
● Works with many protocols, can deal with
authentication
● Especially useful for APIs -- the preferred way
to download data using multiple transactions
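A sketch of an authenticated API call; the endpoint and token
variable are placeholders:
curl -s -H "Authorization: Bearer $API_TOKEN" \
"https://api.example.org/v1/records?page=1" -o page1.json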
Things that make life easier
1. JSON (JavaScript Object Notation)
2. XML (eXtensible Markup Language)
3. API (Application Programming Interface)
4. Specialized protocols
5. Using request headers to retrieve pages
that are easier to parse
There are only two kinds of data
1. Parseable
2. Unparseable
BUT
● Some structures are much easier to work
with than others
● Convert to whatever is easiest for the task
at hand
Generally speaking
● Strings
Easiest to work with, fastest, requires fewest resources,
greatest number of tools available.
● XML
Powerful but hardest to work with, slowest, requires
greatest number of resources, very inefficient for large files.
● JSON
Much more sophisticated access than strings, much easier
to work with than XML and requires fewer resources.
Awkward with certain data.
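When you need to hop between formats, csvkit (from the power
tools list) covers many conversions; a sketch assuming files with
these names exist:
in2csv data.xlsx > data.csv
csvjson data.csv > data.json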
JSON example
curl https://accessgudid.nlm.nih.gov/api/v1/devices/lookup.json?di=04041346001043
XML example
curl https://accessgudid.nlm.nih.gov/api/v1/devices/lookup.xml?di=04041346001043
When processing large XML files
● Convert to JSON if possible, use string
based tools, or at least break the file into
smaller XML documents.
● DOM based tools such as XSLT must load
entire file into memory where it can take 10
times more space for processing
● If you need DOM based tools such as XSLT,
break file into many chunks where each
record is its own document
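One way to chunk a big file is csplit, splitting at every record
boundary; a sketch assuming each record begins with a
<record> tag on its own line (the element name is hypothetical):
csplit -z big.xml '/<record>/' '{*}'
Each piece then needs a root element wrapped around it to be a
valid standalone document.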
Using APIs
● Most common type is REST (REpresentational
State Transfer) -- a fancy way of saying they
work like a Web form
● Normally have to transmit credentials or other
information. cURL is very good for this
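Because a REST call works like a Web form, parameters can be
sent exactly as a form would send them; a sketch with a
placeholder endpoint:
curl -s -d 'q=microarray' -d 'format=json' https://example.org/api/search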
How about Linked Data?
● Uses relationships to connect data
● Great for certain types of complex data
● You must have programming skills to download
and use these
● Often can be interacted with via API
● Can be flattened and manipulated using
traditional tools
grep
● Command line utility to select lines
matching a regular expression
● Very good for extracting just the data
you’re interested in
● Use with small or very large (terabytes)
files
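A sketch: keep only the matching lines from a huge export (the
pattern and filenames are placeholders):
grep -E '^OHSU' huge_export.tsv > ohsu_only.tsv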
sed
● Command line utility to select, parse, and
transform lines
● Great for “fixing” data so that it can be
used with other programs
● Extremely powerful and works great with
very large (terabytes) files
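A typical "fixing" pass with GNU sed: squeeze runs of spaces
into single tabs so another program can read the file (filenames
are placeholders):
sed -E 's/ +/\t/g' messy.txt > clean.tsv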
tr
● Command line utility to translate individual
characters from one to another
● Great for prepping data in files too large
to load into any program
● Particularly useful in combination with sed
for fixing large delimited files containing
line breaks within the data itself
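A common prep step: delete Windows carriage returns from a
file too big for any editor (filenames are placeholders):
tr -d '\r' < windows_export.csv > unix_clean.csv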
paste
● Command line utility that prints
corresponding lines of files side by side
● Great for combining data from large files
● Also very handy for fixing data
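A sketch joining corresponding lines of two files into one
delimited file (filenames are placeholders):
paste -d',' names.txt emails.txt > combined.csv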
Delimited file with bad line feeds
{myfile.txt}
a1,a2,a3,a4,a5
,a6
b1,b2,b3,b4
,b5,b6
c1,c2,c3,c4,c5,c6
d1
,d2,d3,d4,
d5,d6
Fixed in seconds!
tr "n" "," < myfile.txt | 
sed 's/,+/,/g' | tr "," "n" | paste -s -d",,,,,n"
a1,a2,a3,a4,a5,a6
b1,b2,b3,b4,b5,b6
c1,c2,c3,c4,c5,c6
d1,d2,d3,d4,d5,d6
The power of pipes!
Command Analysis
tr "n" "," < myfile.txt | sed 's/,+/,/g' | tr "," "n" |paste -s -d",,,,,n"
tr “n” “,” < myfile.txt Convert all newlines to commas
| sed ‘/s,+/,/g’ Pipe to sed, convert all multiple instances of
commas to a single comma. Sed step is
necessary because you don’t know how
many newlines are bogus or where they are
| tr “,” “n” Pipe to tr which converts all commas into
newlines
| paste -s -d “,,,,,”n” Pipe to paste command which converts
single column file to output 6 columns wide
using a comma as a delimiter terminated by
a newline
awk
● Outstanding for reading, transforming,
and creating data in rows and columns
● Complete pattern scanning language for
text, but typically used to transform the
output of other commands
Extract 2nd and 5th fields
a1 a2 a3 a4 a5 a6
b1 b2 b3 b4 b5 b6
c1 c2 c3 c4 c5 c6
d1 d2 d3 d4 d5 d6
awk '{print $2,$5}' myfile
a2 a5
b2 b5
c2 c5
d2 d5
{myfile}
jq
● Like sed, but optimized for JSON
● Includes logical and conditional operators,
variables, functions, and other powerful features
● Very good for selecting, filtering, and
formatting more complex data
JSON example
curl https://accessgudid.nlm.nih.gov/api/v1/devices/lookup.json?di=04041346001043
Extract deviceID if cuff detected
curl https://accessgudid.nlm.nih.gov/api/v1/devices/lookup.json?di=04041346001043 |
jq '.gudid.device | select(.brandName | test("cuff")) |
.identifiers.identifier.deviceId'
"04041346001043"
The power of pipes!
Don’t try to remember all this!
● Ask for help -- this stuff is easy
for linux geeks
● Google can help you with
commands/syntax
● Online forums are also helpful,
but don’t mind the trolls
If you want a GUI, use OpenRefine
http://openrefine.org
● Sophisticated, including regular
expression support
● Convert between different formats
● Up to a couple hundred thousand rows
● Even has clustering capabilities!
Normalization is more conceptual than technical
● Every situation is unique and depends on the
data you have and what you need
● Don’t fob off data analysis on technical
people who don’t understand your data
● It’s sometimes not possible to fix everything
Solutions are often domain specific!
● Data sources
● Challenges
● Tools
● Tricks
Questions?
Kyle Banerjee
banerjek@ohsu.edu