Scraping Scripting Hacking

scraping, scripting and
hacking your way to
API-less data
[AKA: if you don’t have data
feeds, we’ll get it anyway]
(photo: http://www.flickr.com/photos/juan23/82888194/)

overview

• “getting data out”
• non-exhaustive (and rapid!)
• slightly random
• live examples (hopefully)
• mainly non-technical(ish)
• mainly non-illegal. I think.

anything goes

• have no fear!
• feel no remorse!
• be shameless!
• long live the open data revolution!

you

• half newbie, half “done some”

me

• not really a developer
• ..but code enough ASP (stop giggling)
to do what I want to do
• slides will be at slideshare.net/dmje
• www.electronicmuseum.org.uk
• mike.ellis@eduserv.org.uk

we <3 data

• we want programmatic access...
• ...but sites are often lacking
• ...and APIs are usually a pipe dream

http://www.ucas.com/instit/i/h60.html

http://unicorn.lib.ic.ac.uk/uhtbin/opac/webcentral

scraping

• copy & paste, without having to copy &
paste...
• an inexact but really rather beautiful
science

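' Fetch the raw source of a page (classic ASP / VBScript); url is set elsewhere
Set xmlhttp = Server.CreateObject("MSXML2.ServerXMLHTTP.4.0")

' False = synchronous: wait for the response before carrying on
Call xmlhttp.Open("GET", url, False)
Call xmlhttp.send

' The body comes back as plain text, even when the page is HTML rather than XML
ReturnedXML = xmlhttp.responseText
Set xmlhttp = Nothing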
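
For anyone who doesn't speak ASP, here is a rough Python equivalent of the same fetch. It is only a sketch: the function name `fetch` and the UTF-8 default are my assumptions, not part of the original script.

```python
from urllib.request import urlopen


def fetch(url, encoding="utf-8"):
    """Return the body of `url` as text (the responseText equivalent)."""
    with urlopen(url) as response:
        raw = response.read()
    # Assumption: the page is UTF-8; undecodable bytes are replaced, not fatal
    return raw.decode(encoding, errors="replace")
```

Call it as `page = fetch(some_url)` and you have the page source in a string, ready for extraction.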

scraping (cont)

• frowned on by purists...
• but really rather powerful
• http://hoard.it

extraction #1: Y!Pipes

• find your data on page
• view source
• determine the delimiters
• put it into Pipes
• extract the output
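
The "determine the delimiters" step is the heart of it, and it works the same way outside Pipes. A minimal sketch in Python, assuming a made-up page where each item sits between `<li class="course">` and `</li>` (the markup and the helper name `between` are hypothetical):

```python
def between(text, start, end):
    """Return every chunk of `text` found between `start` and `end`."""
    chunks = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break  # no more opening delimiters
        begin += len(start)
        finish = text.find(end, begin)
        if finish == -1:
            break  # opening delimiter without a closing one
        chunks.append(text[begin:finish].strip())
        pos = finish + len(end)
    return chunks


# Hypothetical page source, standing in for whatever "view source" shows you
page = '<ul><li class="course">History</li><li class="course">Physics</li></ul>'
courses = between(page, '<li class="course">', "</li>")
```

It is inexact (change the markup and it breaks), but that is scraping for you.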

originating page | output

extraction #2: Google Docs

• create a new google spreadsheet
• find the URL of the data you want
• identify how it is encapsulated (list/table)
• use the importHTML() function (others for
feeds, xml, data, etc)
• dump out data as...CSV/XML/RSS/etc
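
If you'd rather do the table-to-CSV step in a script than in a spreadsheet, something like this sketch does a similar job with the Python standard library. It is not how importHTML() works internally; the class and function names are mine, the sample table is invented, and unlike importHTML() it lumps every table on the page together rather than picking one by index.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Turn the table rows found in `html` into CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Invented sample data
sample = ("<table><tr><th>Name</th><th>Code</th></tr>"
          "<tr><td>Example College</td><td>X00</td></tr></table>")
csv_text = table_to_csv(sample)
```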


extraction #3: dapper.net

• go to dapper.net/open
• identify several of the urls with the same
“shapes” that you want to scrape
• use the dapper dashboard to identify
content areas
• build the “dapp”
• pass in URLs of pages you want to extract
data from
• extract results from the output (xml,
flash, csv, etc)


extraction #4: YQL

• view source on the page you want to grab
• go to http://developer.yahoo.com/yql/console/
• get your XPath hat on and build a query
• grab the data from a RESTful query

http://developer.yahoo.com/yql/console/?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fopenlibrary.org%2Fsearch%3Fq%3Dkeri%2Bhulme%22%20and%20xpath%3D%27%2F%2Fa%5B%40class%3D%22result%22%5D%27
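
That URL is easier to read once decoded. The sketch below only does string handling (it never calls Yahoo!): it pulls the query out of the console URL above and shows how the same URL can be rebuilt for a different page or XPath. The helper name `build_console_url` is hypothetical.

```python
from urllib.parse import parse_qs, quote, urlsplit

CONSOLE = "http://developer.yahoo.com/yql/console/?q="

console_url = (
    CONSOLE
    + "select%20*%20from%20html%20where%20url%3D"
    + "%22http%3A%2F%2Fopenlibrary.org%2Fsearch%3Fq%3Dkeri%2Bhulme%22"
    + "%20and%20xpath%3D%27%2F%2Fa%5B%40class%3D%22result%22%5D%27"
)

# Decode the q= parameter back into the readable query
query = parse_qs(urlsplit(console_url).query)["q"][0]


def build_console_url(page_url, xpath):
    """Percent-encode a 'select * from html' query for the console URL."""
    yql = f"select * from html where url=\"{page_url}\" and xpath='{xpath}'"
    return CONSOLE + quote(yql, safe="*")
```

Decoded, the query reads: `select * from html where url="http://openlibrary.org/search?q=keri+hulme" and xpath='//a[@class="result"]'`.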


extraction #5: httrack

• grab a copy of httrack (or similar) from
http://www.httrack.com/
• point it at the bit of the site you want,
make sure the filters are correct, and push
go...
• you now have a local copy of the site, to
munge as you see fit

extraction #6: hacked search

• get an API key from Yahoo!
• use it to search within a domain
• script a standard download script to pick
out each page and download it
• hack that mumma
• (variation on a theme: build a simple
spider...)
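
The "simple spider" variation might look something like this sketch. It stays on one host, follows `<a href>` links breadth-first and stops at a page limit. The `fetch` argument is whatever function you use to download a page; here a dictionary of made-up pages stands in for a real site so nothing touches the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkFinder(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def spider(start_url, fetch, limit=50):
    """Crawl pages on the same host as start_url; return {url: html}."""
    host = urlsplit(start_url).netloc
    queue, pages = [start_url], {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in pages:
            continue  # already downloaded
        html = fetch(url)
        pages[url] = html
        finder = LinkFinder()
        finder.feed(html)
        for href in finder.links:
            link = urljoin(url, href).split("#")[0]  # absolute, no fragment
            if urlsplit(link).netloc == host and link not in pages:
                queue.append(link)
    return pages


# Made-up site used instead of real downloads
fake_site = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="http://other.example/">out</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="b.html">B</a>',
    "http://example.org/b.html": "no links here",
}
pages = spider("http://example.org/", fake_site.__getitem__)
```

A real spider should also pause between requests and respect robots.txt; this one does neither.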

now you’ve got your data..

• once you’ve got your data, you usually
need to munge it...

munging #1: regex!

• I’m terrible at regex
• ([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)
• but it’s incredibly powerful...
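
Here is that UK postcode pattern put to work in Python, as a sketch. The pattern is the one from the slide, written on one line; the function name and the sample sentence are mine. Note that it only matches upper-case postcodes.

```python
import re

POSTCODE = re.compile(
    r"([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}"
    r"[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)"
)


def find_postcodes(text):
    """Return every substring of `text` that matches the postcode pattern."""
    return POSTCODE.findall(text)


found = find_postcodes("Send it to SW7 2DD, or to GIR 0AA.")
```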

output

munging #2: find/replace

• use whatever scripting language you work
best with
• (even Word...)
• the most common needs: replacing double
spaces, weird characters and paragraph
marks
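
A sketch of those common replacements in Python. Which characters count as "weird" depends on your source, so the table below is only an assumption (curly quotes, non-breaking spaces, pilcrows), and collapsing blank lines is my reading of "replace paragraph marks".

```python
import re

# Assumed list of troublemakers; extend it for your own data
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u00a0": " ",                  # non-breaking space
    "\u00b6": "",                   # pilcrow (paragraph sign)
}


def clean(text):
    """Apply the usual find/replace passes to scraped text."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # one newline style
    text = re.sub(r"[ \t]{2,}", " ", text)                 # double spaces
    text = re.sub(r"\n{2,}", "\n", text)                   # stacked paragraph marks
    return text.strip()


tidy = clean("It\u2019s  a\u00a0 \u201ctest\u201d\r\n\r\n\r\nnext")
```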

munging #3: mail merge!

• for rapid builds of html, javascript or
xml
• have a source document (often extracted or
munged from other sites) in Excel
• you can use filters to effectively grab
the data you need
• build the merge in Word, using the
“directory” option
• copy and paste the result out
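
The same mail-merge trick works in a script if Word isn't to hand: one template, one row of data per record. A sketch, assuming a CSV source with invented `title` and `url` columns:

```python
import csv
import io
from html import escape
from string import Template

# One line of output per record; $title and $url are the CSV column names
ROW = Template('<li><a href="$url">$title</a></li>')


def merge(csv_text, template=ROW):
    """Fill the template once per CSV row, HTML-escaping each value."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(
        template.substitute({key: escape(value) for key, value in row.items()})
        for row in rows
    )


# Invented source data
source = "title,url\nFirst & last,http://example.org/1\nSecond,http://example.org/2\n"
merged = merge(source)
```

Swap the template for a line of JavaScript or XML and the rest stays the same.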

munging #4: html removal

• have a function handy that you can pass a
block of html
• it is handy to have a script where you can
define which particular tags to remove or
leave in place
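
One way such a function could look in Python, as a sketch: pass it a block of HTML and an optional list of tags to leave in place. The names are mine. Two simplifications to be aware of: kept tags lose their attributes, and entities such as `&amp;` come out decoded.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Drop every tag except those listed in `keep`; text is always kept."""

    def __init__(self, keep=()):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            self.out.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def strip_html(html, keep=()):
    """Return `html` with all tags removed except those named in `keep`."""
    stripper = TagStripper(keep)
    stripper.feed(html)
    stripper.close()  # flush any buffered text
    return "".join(stripper.out)


block = '<p>Hello <b>big</b> <a href="x">world</a></p>'
plain = strip_html(block)
bold_only = strip_html(block, keep=["b"])
```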

munging #5: html tidy

• grab a copy of html tidy from
http://tidy.sourceforge.net/
• tidy is available as a downloadable .exe
or a component that you can pass data to in
your code

processing #1: Open Calais

• a service from Reuters for analysing
blocks of text for semantic “meaning”
• get an API key from Open Calais
• send data via a POST to the REST service
• retrieve results from the RDF
• OR...just paste your text into
http://sws.clearforest.com/calaisviewer/

output

processing #2: Yahoo! TE

• a webservice for grabbing tags/terms from
blocks of text
• sign up for a Yahoo! API key
• pass your block of text using POST
• grab the results..

output

processing #3: geo!

• go to http://developer.yahoo.com/geo !

the ugly sisters

• Access
• Excel (!)

the last resorts

• FOI (frankie!)
• OCR (me)

the very last resort..

• re-type it...
• (or use Amazon Mechanical Turk)
