scraping, scripting and hacking your way to API-less data
[AKA: if you don’t have data feeds, we’ll get it anyway]

(image: http://www.flickr.com/photos/juan23/82888194/)
overview

•   “getting data out”
•   non-exhaustive (and rapid!)
•   slightly random
•   live examples (hopefully)
•   mainly non-technical(ish)
•   mainly non-illegal. I think.
anything goes

•   have no fear!
•   feel no remorse!
•   be shameless!
•   long live the open data revolution!
you

• half newbie, half “done some”
me

• not really a developer
• ..but code enough ASP (stop giggling)
to do what I want to do
• slides will be at slideshare.net/dmje
• www.electronicmuseum.org.uk
• mike.ellis@eduserv.org.uk
we <3 data

• we want programmatic access...
• ...but sites are often lacking
• ...and APIs are usually a pipe dream

 http://www.ucas.com/instit/i/h60.html




                                         http://unicorn.lib.ic.ac.uk/uhtbin/opac/webcentral
scraping

 • copy & paste, without having to copy &
 paste...
 • an inexact but really rather beautiful
 science




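' classic ASP (VBScript): url holds the address of the page to fetch
Set xmlhttp = Server.CreateObject("MSXML2.ServerXMLHTTP.4.0")

' third argument False = synchronous, so send waits for the response
Call xmlhttp.Open("GET", url, False)
Call xmlhttp.send

' the whole page comes back as one string, ready to pick apart
ReturnedXML = xmlhttp.responseText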
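
For anyone not working in ASP, roughly the same fetch can be sketched with Python's standard library. This is a swapped-in equivalent rather than the code used in the talk; the helper name `fetch`, the 30-second timeout and the UTF-8 decoding are all assumptions.

```python
import urllib.request


def fetch(url, timeout=30.0):
    """Return the body of `url` as text (hypothetical helper, not from the slides)."""
    # urlopen sends a GET by default and blocks until the response arrives,
    # much like the synchronous ServerXMLHTTP call in the ASP snippet.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
    # Assumption: treat the page as UTF-8 and replace undecodable bytes
    # rather than failing on a badly encoded page.
    return raw.decode("utf-8", errors="replace")
```

Calling `fetch("http://www.ucas.com/instit/i/h60.html")` would return that page's markup as a single string, which is the raw material for everything that follows.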
scraping (cont)

• frowned on by purists...
• but really rather powerful
• http://hoard.it
extraction #1: Y!Pipes

•   find your data on page
•   view source
•   determine the delimiters
•   put it into Pipes
•   extract the output




                               originating page | output
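
The "determine the delimiters" step can also be done by hand. The sketch below is plain Python, not Pipes: it pulls out every chunk of text sitting between a start marker and an end marker. The function name `between` and the sample markup are invented for illustration.

```python
def between(text, start, end):
    """Return every chunk of `text` found between `start` and `end`."""
    chunks = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break                      # no more start markers
        begin += len(start)
        finish = text.find(end, begin)
        if finish == -1:
            break                      # start marker with no matching end
        chunks.append(text[begin:finish].strip())
        pos = finish + len(end)
    return chunks


# Invented sample markup standing in for a "view source" result.
page = '<ul><li class="course">History</li><li class="course">Physics</li></ul>'
courses = between(page, '<li class="course">', "</li>")
print(courses)
```

This is as inexact as the slide warns: it breaks as soon as the site changes its markup, so the delimiters are best kept in one obvious place.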
extraction #2: Google Docs

• create a new google spreadsheet
• find the URL of the data you want
• identify how it is encapsulated (list/
table)
• use the importHTML() function (others for
feeds, xml, data, etc)
• dump out data as...CSV/XML/RSS/etc




                           originating page | output
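
If a spreadsheet is not to hand, the same idea (find the table, dump it out as CSV) can be roughed out in Python. This is a loose analogue of the spreadsheet approach, not how importHTML itself works; the class and function names and the sample table are assumptions, and nested tables are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of the first <table> in a page, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._tables_seen = 0
        self._active = False   # True only while inside the first table
        self._row = None       # cells of the row being read
        self._cell = None      # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._tables_seen += 1
            self._active = self._tables_seen == 1
        elif not self._active:
            return
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table":
            self._active = False
        elif tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_csv(html):
    """Return the first table in `html` as CSV text."""
    parser = TableRows()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Invented sample table.
page = (
    "<table>"
    "<tr><th>Course</th><th>Code</th></tr>"
    "<tr><td>History</td><td>V100</td></tr>"
    "</table>"
)
print(table_to_csv(page))
```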
extraction #3: dapper.net

• go to dapper.net/open
• identify several of the URLs with the same
“shapes” that you want to scrape
• use the dapper dashboard to identify
content areas
• build the “dapp”
• pass in the URLs of pages you want to extract
data from
• extract results from the output (xml,
flash, csv, etc)




                          originating page | output
extraction #4: YQL

•   view source on the page you want to grab
•   go to http://developer.yahoo.com/yql/console/
•   get your XPath hat on and build a query
•   grab the data from a RESTful query




      http://developer.yahoo.com/yql/console/?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fopenlibrary.org%2Fsearch%3Fq%3Dkeri%2Bhulme%22%20and%20xpath%3D%27%2F%2Fa%5B%40class%3D%22result%22%5D%27




                                   originating page | output
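
The encoded URL on this slide is easier to read once it is built from its parts. The sketch below assembles the same `select * from html` statement and percent-encodes it; the function name `yql_console_url` is made up, and only the console address, the Open Library search URL and the XPath come from the slide.

```python
from urllib.parse import quote

CONSOLE = "http://developer.yahoo.com/yql/console/"  # address from the slide


def yql_console_url(page_url, xpath):
    """Build a console URL for a `select * from html` query (hypothetical helper)."""
    statement = (
        f'select * from html where url="{page_url}" '
        f"and xpath='{xpath}'"
    )
    # Percent-encode everything except "*", matching the URL on the slide.
    return CONSOLE + "?q=" + quote(statement, safe="*")


url = yql_console_url(
    "http://openlibrary.org/search?q=keri+hulme",
    '//a[@class="result"]',
)
print(url)
```

Swapping in a different page URL or XPath is then a one-line change rather than an exercise in counting `%22`s.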
extraction #5: httrack

• grab a copy of httrack (or similar) from
  http://www.httrack.com/
• point it at the bit of the site you want,
make sure the filters are correct, and push
go...
• you now have a local copy of the site, to
munge as you see fit
extraction #6: hacked search

• get an API key from Yahoo!
• use it to search within a domain
• script a standard download script to pick
out each page and download it
• hack that mumma
• (variation on a theme: build a simple
spider...)
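
The "simple spider" variation might look something like the Python sketch below. It skips the Yahoo! search step entirely and just follows links that stay on the same host. All names here are hypothetical, and the `fetch` argument is whatever function you use to download a page; the demo feeds it an invented in-memory site so nothing touches the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Gather the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(html, base_url):
    """Return absolute links in `html` that stay on the host of `base_url`."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    found = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href).split("#")[0]  # drop fragments
        if urlparse(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found


def crawl(start_url, fetch, limit=50):
    """Breadth-first walk from `start_url`; `fetch(url)` must return HTML text."""
    queue, pages = [start_url], {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in pages:
            continue                   # already downloaded
        pages[url] = fetch(url)
        queue.extend(same_site_links(pages[url], url))
    return pages


# Invented three-page site used in place of real downloads.
site = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="http://other.example/x">X</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="b.html#top">B</a>',
    "http://example.org/b.html": "no links",
}
pages = crawl("http://example.org/", site.__getitem__)
print(sorted(pages))
```

The `limit` is there so a typo in the start URL does not end up mirroring half the web.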
now you’ve got your data..

• once you’ve got your data, you usually
need to munge it...
munging #1: regex!

• I’m terrible at regex
• ([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)
• but it’s incredibly powerful...




                                            output
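
Here is the pattern from the slide put to work in Python. It appears to be the widely circulated UK postcode pattern, though the slide does not say so; the function name and the sample sentence are invented.

```python
import re

# The pattern from the slide, joined onto one line (split here only for width).
POSTCODE = re.compile(
    r"([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}"
    r"[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)"
)


def find_postcodes(text):
    """Return every substring of `text` that the pattern matches."""
    return [match.group(0) for match in POSTCODE.finditer(text)]


sample = "Write to us at SW1A 1AA or BA2 7AY."
print(find_postcodes(sample))
```

The pattern is case-sensitive and unanchored, so it is worth running it over a page of known addresses before trusting it on a scraped one.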
munging #2: find/replace

• use whatever scripting language you work
best with
• (even Word...)
• you’ll find that replacing double spaces,
weird characters and paragraph marks are
about the most common needs
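
Those three replacements fit in a few lines of Python. Which characters count as "weird" is an assumption here (a non-breaking space and curly quotes); extend the table to suit your own data.

```python
import re

# Assumed list of "weird" characters and their plain replacements.
REPLACEMENTS = {
    "\u00a0": " ",   # non-breaking space
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
}


def tidy_text(text):
    """Swap odd characters, flatten paragraph marks, collapse repeated spaces."""
    for odd, plain in REPLACEMENTS.items():
        text = text.replace(odd, plain)
    text = re.sub(r"[\r\n]+", " ", text)   # paragraph marks become a space
    text = re.sub(r" {2,}", " ", text)     # runs of spaces become one
    return text.strip()


print(tidy_text("Hello\u00a0 world\r\n\r\nIt\u2019s  fine "))
```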
munging #3: mail merge!

• for rapid builds of html, javascript or
xml
• have a source document (often extracted or
munged from other sites) in Excel
• you can use filters to effectively grab
the data you need
• build the merge in Word, using the
“directory” option
• copy and paste the result out
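
The same merge can be done without Word. The sketch below is a Python stand-in, not the Excel-and-Word route described above: it reads rows from CSV text and pours each one into a small HTML template. The column names, the template and the sample rows are all invented.

```python
import csv
import io
from html import escape
from string import Template

# One output line per source row; $title and $url are assumed column names.
ROW = Template('<li><a href="$url">$title</a></li>')


def merge(csv_text, template=ROW):
    """Fill `template` once per CSV row and join the results with newlines."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(
        # Escape each value so stray & or < characters cannot break the markup.
        template.substitute({key: escape(value) for key, value in row.items()})
        for row in rows
    )


source = (
    "title,url\n"
    "Science Museum,http://example.org/science\n"
    "Tate & Co,http://example.org/tate\n"
)
print(merge(source))
```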
munging #4: html removal

• have a function handy that you can pass a
block of html
• it is handy to have a script where you can
define which particular tags to remove or
leave in place
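
A function along those lines, sketched in Python with the standard-library parser. The names are hypothetical and the behaviour is deliberately crude: kept tags lose their attributes, character references come out decoded, and anything inside script or style tags is treated as ordinary text.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Drop every tag except those named in `keep`."""

    def __init__(self, keep=()):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            self.parts.append(f"<{tag}>")   # attributes are dropped

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html, keep=()):
    """Return `html` with all tags removed except those listed in `keep`."""
    stripper = TagStripper(keep)
    stripper.feed(html)
    stripper.close()   # flush any text the parser is still holding
    return "".join(stripper.parts)


block = '<div><p class="x">Hello <b>world</b></p></div>'
print(strip_tags(block))
print(strip_tags(block, keep=("p",)))
```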
munging #5: html tidy

• grab a copy of html tidy from
 http://tidy.sourceforge.net/
• tidy is available as a downloadable .exe
or a component that you can pass data to in
your code
processing #1: Open Calais

• a service from Reuters for analysing
blocks of text for semantic “meaning”
• get an API key from Open Calais
• send data via a POST to the REST service
• retrieve results from the RDF
• OR...just paste your text into
http://sws.clearforest.com/calaisviewer/




                                             output
processing #2: Yahoo! TE

• a webservice for grabbing tags/terms from
blocks of text
• sign up for a Yahoo! API key
• pass your block of text using POST
• grab the results..




                                          output
processing #3: geo!

• go to http://developer.yahoo.com/geo !
the ugly sisters

• Access
• Excel (!)
the last resorts

• FOI (frankie!)
• OCR (me)
the very last resort..

• re-type it...
• (or use Amazon Mechanical Turk)
...any more?

Contenu connexe

En vedette

CLV e Mídia Programática
CLV e Mídia ProgramáticaCLV e Mídia Programática
CLV e Mídia ProgramáticaSociomantic Labs
 
Top Mobile App Monetization Tactics You Ought to Know
Top Mobile App Monetization Tactics You Ought to KnowTop Mobile App Monetization Tactics You Ought to Know
Top Mobile App Monetization Tactics You Ought to KnowInMobi
 
Calculating LTV Using Flurry
Calculating LTV Using FlurryCalculating LTV Using Flurry
Calculating LTV Using FlurryYaniv Nizan
 
Calculating LTV Using Google Analytics
Calculating LTV Using Google AnalyticsCalculating LTV Using Google Analytics
Calculating LTV Using Google AnalyticsYaniv Nizan
 
Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...
Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...
Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...Eric Seufert
 
Two Methods for Modeling LTV with a Spreadsheet
Two Methods for Modeling LTV with a SpreadsheetTwo Methods for Modeling LTV with a Spreadsheet
Two Methods for Modeling LTV with a SpreadsheetEric Seufert
 
Everything You Need to Know About Customer Lifetime Value (CLV)
Everything You Need to Know About Customer Lifetime Value (CLV)Everything You Need to Know About Customer Lifetime Value (CLV)
Everything You Need to Know About Customer Lifetime Value (CLV)Demac Media
 
11 mobile growth hacks. Presentation at LTV>CPI, Wooga, Berlin 27/02/2014
11 mobile growth hacks.  Presentation at LTV>CPI, Wooga, Berlin 27/02/201411 mobile growth hacks.  Presentation at LTV>CPI, Wooga, Berlin 27/02/2014
11 mobile growth hacks. Presentation at LTV>CPI, Wooga, Berlin 27/02/2014Rob Moffat
 
A step by-step guide to calculating customer lifetime value
A step by-step guide to calculating customer lifetime valueA step by-step guide to calculating customer lifetime value
A step by-step guide to calculating customer lifetime valueGeoff Fripp
 

En vedette (9)

CLV e Mídia Programática
CLV e Mídia ProgramáticaCLV e Mídia Programática
CLV e Mídia Programática
 
Top Mobile App Monetization Tactics You Ought to Know
Top Mobile App Monetization Tactics You Ought to KnowTop Mobile App Monetization Tactics You Ought to Know
Top Mobile App Monetization Tactics You Ought to Know
 
Calculating LTV Using Flurry
Calculating LTV Using FlurryCalculating LTV Using Flurry
Calculating LTV Using Flurry
 
Calculating LTV Using Google Analytics
Calculating LTV Using Google AnalyticsCalculating LTV Using Google Analytics
Calculating LTV Using Google Analytics
 
Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...
Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...
Eric Seufert, GDC 2014: Profitably launching Jelly Splash to #1, a marketing ...
 
Two Methods for Modeling LTV with a Spreadsheet
Two Methods for Modeling LTV with a SpreadsheetTwo Methods for Modeling LTV with a Spreadsheet
Two Methods for Modeling LTV with a Spreadsheet
 
Everything You Need to Know About Customer Lifetime Value (CLV)
Everything You Need to Know About Customer Lifetime Value (CLV)Everything You Need to Know About Customer Lifetime Value (CLV)
Everything You Need to Know About Customer Lifetime Value (CLV)
 
11 mobile growth hacks. Presentation at LTV>CPI, Wooga, Berlin 27/02/2014
11 mobile growth hacks.  Presentation at LTV>CPI, Wooga, Berlin 27/02/201411 mobile growth hacks.  Presentation at LTV>CPI, Wooga, Berlin 27/02/2014
11 mobile growth hacks. Presentation at LTV>CPI, Wooga, Berlin 27/02/2014
 
A step by-step guide to calculating customer lifetime value
A step by-step guide to calculating customer lifetime valueA step by-step guide to calculating customer lifetime value
A step by-step guide to calculating customer lifetime value
 

Similaire à Scraping Scripting Hacking

The Web Application Hackers Toolchain
The Web Application Hackers ToolchainThe Web Application Hackers Toolchain
The Web Application Hackers Toolchainjasonhaddix
 
YQL: Select * from Internet
YQL: Select * from InternetYQL: Select * from Internet
YQL: Select * from Internetdrgath
 
Html5: Something wicked this way comes (Hack in Paris)
Html5: Something wicked this way comes (Hack in Paris)Html5: Something wicked this way comes (Hack in Paris)
Html5: Something wicked this way comes (Hack in Paris)Krzysztof Kotowicz
 
Rapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxRapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxMichael Hackstein
 
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Esteve Castells
 
Protect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesProtect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesLeo Loobeek
 
YQL:: Select * from Internet
YQL:: Select * from InternetYQL:: Select * from Internet
YQL:: Select * from Internetdrgath
 
Jinx - Malware 2.0
Jinx - Malware 2.0Jinx - Malware 2.0
Jinx - Malware 2.0Itzik Kotler
 
[2010]我有一个梦想
[2010]我有一个梦想[2010]我有一个梦想
[2010]我有一个梦想Twinsen Liang
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring databodaceacat
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring dataSara-Jayne Terp
 
Yahoo! Search monkey API - CEBIT 2008
Yahoo! Search monkey API - CEBIT 2008Yahoo! Search monkey API - CEBIT 2008
Yahoo! Search monkey API - CEBIT 2008Eric D.
 
Basic PowerShell Toolmaking - Spiceworld 2016 session
Basic PowerShell Toolmaking - Spiceworld 2016 sessionBasic PowerShell Toolmaking - Spiceworld 2016 session
Basic PowerShell Toolmaking - Spiceworld 2016 sessionRob Dunn
 
On the Edge Systems Administration with Golang
On the Edge Systems Administration with GolangOn the Edge Systems Administration with Golang
On the Edge Systems Administration with GolangChris McEniry
 
Invoke-CradleCrafter: Moar PowerShell obFUsk8tion & Detection (@('Tech','niqu...
Invoke-CradleCrafter: Moar PowerShell obFUsk8tion & Detection (@('Tech','niqu...Invoke-CradleCrafter: Moar PowerShell obFUsk8tion & Detection (@('Tech','niqu...
Invoke-CradleCrafter: Moar PowerShell obFUsk8tion & Detection (@('Tech','niqu...Daniel Bohannon
 

Similaire à Scraping Scripting Hacking (20)

The Web Application Hackers Toolchain
The Web Application Hackers ToolchainThe Web Application Hackers Toolchain
The Web Application Hackers Toolchain
 
Scrapy
ScrapyScrapy
Scrapy
 
Learning to code
Learning to codeLearning to code
Learning to code
 
Google Hacking 101
Google Hacking 101Google Hacking 101
Google Hacking 101
 
YQL: Select * from Internet
YQL: Select * from InternetYQL: Select * from Internet
YQL: Select * from Internet
 
Html5: Something wicked this way comes (Hack in Paris)
Html5: Something wicked this way comes (Hack in Paris)Html5: Something wicked this way comes (Hack in Paris)
Html5: Something wicked this way comes (Hack in Paris)
 
Rapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxRapid API Development ArangoDB Foxx
Rapid API Development ArangoDB Foxx
 
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
 
Web Scrapping Using Python
Web Scrapping Using PythonWeb Scrapping Using Python
Web Scrapping Using Python
 
Protect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying TechniquesProtect Your Payloads: Modern Keying Techniques
Protect Your Payloads: Modern Keying Techniques
 
YQL:: Select * from Internet
YQL:: Select * from InternetYQL:: Select * from Internet
YQL:: Select * from Internet
 
Jinx - Malware 2.0
Jinx - Malware 2.0Jinx - Malware 2.0
Jinx - Malware 2.0
 
[2010]我有一个梦想
[2010]我有一个梦想[2010]我有一个梦想
[2010]我有一个梦想
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
 
Splunk bsides
Splunk bsidesSplunk bsides
Splunk bsides
 
Yahoo! Search monkey API - CEBIT 2008
Yahoo! Search monkey API - CEBIT 2008Yahoo! Search monkey API - CEBIT 2008
Yahoo! Search monkey API - CEBIT 2008
 
Basic PowerShell Toolmaking - Spiceworld 2016 session
Basic PowerShell Toolmaking - Spiceworld 2016 sessionBasic PowerShell Toolmaking - Spiceworld 2016 session
Basic PowerShell Toolmaking - Spiceworld 2016 session
 
On the Edge Systems Administration with Golang
On the Edge Systems Administration with GolangOn the Edge Systems Administration with Golang
On the Edge Systems Administration with Golang
 
Invoke-CradleCrafter: Moar PowerShell obFUsk8tion & Detection (@('Tech','niqu...
Invoke-CradleCrafter: Moar PowerShell obFUsk8tion & Detection (@('Tech','niqu...Invoke-CradleCrafter: Moar PowerShell obFUsk8tion & Detection (@('Tech','niqu...
Invoke-CradleCrafter: Moar PowerShell obFUsk8tion & Detection (@('Tech','niqu...
 

Plus de Mike Ellis

5 digital habits of highly effective museums
5 digital habits of highly effective museums5 digital habits of highly effective museums
5 digital habits of highly effective museumsMike Ellis
 
How to stop freelance from killing you
How to stop freelance from killing youHow to stop freelance from killing you
How to stop freelance from killing youMike Ellis
 
Getting collections online
Getting collections onlineGetting collections online
Getting collections onlineMike Ellis
 
Why Wordpress is better than your cms
Why Wordpress is better than your cmsWhy Wordpress is better than your cms
Why Wordpress is better than your cmsMike Ellis
 
Forget the objects, tell the stories
Forget the objects, tell the storiesForget the objects, tell the stories
Forget the objects, tell the storiesMike Ellis
 
Bath Digital general introduction
Bath Digital general introductionBath Digital general introduction
Bath Digital general introductionMike Ellis
 
Stop the noise - ten digital marketing tips
Stop the noise - ten digital marketing tipsStop the noise - ten digital marketing tips
Stop the noise - ten digital marketing tipsMike Ellis
 
Bathcamp 2010 zeitgeist
Bathcamp 2010 zeitgeistBathcamp 2010 zeitgeist
Bathcamp 2010 zeitgeistMike Ellis
 
Strategic digital marketing: some ideas for joining things up
Strategic digital marketing: some ideas for joining things upStrategic digital marketing: some ideas for joining things up
Strategic digital marketing: some ideas for joining things upMike Ellis
 
If you love your content, set it free (v3.0)
If you love your content, set it free (v3.0) If you love your content, set it free (v3.0)
If you love your content, set it free (v3.0) Mike Ellis
 
Mobile: the next frontier
Mobile: the next frontierMobile: the next frontier
Mobile: the next frontierMike Ellis
 
Niche or Platform - what next for our institutions online?
Niche or Platform - what next for our institutions online?Niche or Platform - what next for our institutions online?
Niche or Platform - what next for our institutions online?Mike Ellis
 
The Intertubes Everywhere
The Intertubes EverywhereThe Intertubes Everywhere
The Intertubes EverywhereMike Ellis
 
Bathcamp #8: Quiz Of The Year
Bathcamp #8: Quiz Of The YearBathcamp #8: Quiz Of The Year
Bathcamp #8: Quiz Of The YearMike Ellis
 
The Benefits Of Doing Things Differently
The Benefits Of Doing Things DifferentlyThe Benefits Of Doing Things Differently
The Benefits Of Doing Things DifferentlyMike Ellis
 
Collaboration 2.0
Collaboration 2.0Collaboration 2.0
Collaboration 2.0Mike Ellis
 
Getting people together
Getting people togetherGetting people together
Getting people togetherMike Ellis
 
3 minutes, one technology: the piano
3 minutes, one technology: the piano3 minutes, one technology: the piano
3 minutes, one technology: the pianoMike Ellis
 
Don't Think Websites, think data
Don't Think Websites, think dataDon't Think Websites, think data
Don't Think Websites, think dataMike Ellis
 
Everyware - "the future is already here, it's just not well distributed yet"
Everyware - "the future is already here, it's just not well distributed yet"Everyware - "the future is already here, it's just not well distributed yet"
Everyware - "the future is already here, it's just not well distributed yet"Mike Ellis
 

Plus de Mike Ellis (20)

5 digital habits of highly effective museums
5 digital habits of highly effective museums5 digital habits of highly effective museums
5 digital habits of highly effective museums
 
How to stop freelance from killing you
How to stop freelance from killing youHow to stop freelance from killing you
How to stop freelance from killing you
 
Getting collections online
Getting collections onlineGetting collections online
Getting collections online
 
Why Wordpress is better than your cms
Why Wordpress is better than your cmsWhy Wordpress is better than your cms
Why Wordpress is better than your cms
 
Forget the objects, tell the stories
Forget the objects, tell the storiesForget the objects, tell the stories
Forget the objects, tell the stories
 
Bath Digital general introduction
Bath Digital general introductionBath Digital general introduction
Bath Digital general introduction
 
Stop the noise - ten digital marketing tips
Stop the noise - ten digital marketing tipsStop the noise - ten digital marketing tips
Stop the noise - ten digital marketing tips
 
Bathcamp 2010 zeitgeist
Bathcamp 2010 zeitgeistBathcamp 2010 zeitgeist
Bathcamp 2010 zeitgeist
 
Strategic digital marketing: some ideas for joining things up
Strategic digital marketing: some ideas for joining things upStrategic digital marketing: some ideas for joining things up
Strategic digital marketing: some ideas for joining things up
 
If you love your content, set it free (v3.0)
If you love your content, set it free (v3.0) If you love your content, set it free (v3.0)
If you love your content, set it free (v3.0)
 
Mobile: the next frontier
Mobile: the next frontierMobile: the next frontier
Mobile: the next frontier
 
Niche or Platform - what next for our institutions online?
Niche or Platform - what next for our institutions online?Niche or Platform - what next for our institutions online?
Niche or Platform - what next for our institutions online?
 
The Intertubes Everywhere
The Intertubes EverywhereThe Intertubes Everywhere
The Intertubes Everywhere
 
Bathcamp #8: Quiz Of The Year
Bathcamp #8: Quiz Of The YearBathcamp #8: Quiz Of The Year
Bathcamp #8: Quiz Of The Year
 
The Benefits Of Doing Things Differently
The Benefits Of Doing Things DifferentlyThe Benefits Of Doing Things Differently
The Benefits Of Doing Things Differently
 
Collaboration 2.0
Collaboration 2.0Collaboration 2.0
Collaboration 2.0
 
Getting people together
Getting people togetherGetting people together
Getting people together
 
3 minutes, one technology: the piano
3 minutes, one technology: the piano3 minutes, one technology: the piano
3 minutes, one technology: the piano
 
Don't Think Websites, think data
Don't Think Websites, think dataDon't Think Websites, think data
Don't Think Websites, think data
 
Everyware - "the future is already here, it's just not well distributed yet"
Everyware - "the future is already here, it's just not well distributed yet"Everyware - "the future is already here, it's just not well distributed yet"
Everyware - "the future is already here, it's just not well distributed yet"
 

Dernier

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 

Dernier (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Scraping Scripting Hacking

  • 1. scraping, http://www.flickr.com/photos/juan23/82888194/ scripting and hacking your way to API-less data [AKA: if you don’t have data feeds, we’ll get it anyway]
  • 2. overview • “getting data out” • non-exhaustive (and rapid!) • slightly random • live examples (hopefully) • mainly non-technical(ish) • mainly non-illegal. I think.
  • 3. anything goes • have no fear! • feel no remorse! • be shameless! • long live the open data revolution!
  • 4. you • half newbie, half “done some”
  • 5. me • not really a developer • ..but code enough ASP (stop giggling) to do what I want to do • slides will be at slideshare.net/dmje • www.electronicmuseum.org.uk • mike.ellis@eduserv.org.uk
  • 6. we <3 data • we want programmatic access... • ...but sites are often lacking • ...and APIs are usually a pipe dream http://www.ucas.com/instit/i/h60.html http://unicorn.lib.ic.ac.uk/uhtbin/opac/webcentral
  • 7. scraping • copy & paste, without having to copy & paste... • an inexact but really rather beautiful science
        Set xmlhttp = Server.CreateObject("MSXML2.ServerXMLHTTP.4.0")
        Call xmlhttp.Open("GET",url,False)
        Call xmlhttp.send
        ReturnedXML = xmlhttp.responsetext
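
For anyone not on ASP, the fetch on slide 7 might look roughly like this in Python (used here purely for illustration; the deck itself uses ASP). The function name `fetch` and the `timeout` default are assumptions, not anything from the slides.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the body of `url` as text, much like xmlhttp.responsetext."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset the server declares; fall back to UTF-8 (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch(url)` with the address of any page hands back its raw HTML as a string, ready for the extraction and munging steps later in the deck.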
  • 8. scraping (cont) • frowned on by purists... • but really rather powerful • http://hoard.it
  • 9. extraction #1: Y!Pipes • find your data on page • view source • determine the delimiters • put it into Pipes • extract the output originating page | output
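
The "determine the delimiters" step on slide 9 can also be done locally. Below is a minimal Python sketch of cutting out everything between a start and an end marker; `extract_between` and the sample markup are made up for illustration and have nothing to do with how Pipes itself worked.

```python
def extract_between(text, start, end):
    """Return every chunk of `text` found between `start` and `end`."""
    chunks = []
    position = 0
    while True:
        begin = text.find(start, position)
        if begin == -1:
            break
        begin += len(start)
        finish = text.find(end, begin)
        if finish == -1:
            break
        chunks.append(text[begin:finish].strip())
        position = finish + len(end)
    return chunks


# Hypothetical page fragment; the delimiters come from "view source".
page = '<ul><li class="course">Art</li><li class="course">Biology</li></ul>'
print(extract_between(page, '<li class="course">', "</li>"))  # ['Art', 'Biology']
```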
  • 10. extraction #2: Google Docs • create a new google spreadsheet • find the URL of the data you want • identify how it is encapsulated (list/ table) • use the importHTML() function (others for feeds, xml, data, etc) • dump out data as...CSV/XML/RSS/etc originating page | output
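
In the spreadsheet itself, slide 10 boils down to a cell formula along the lines of `=importHTML(url, "table", 1)`. As a rough local analogue, the Python sketch below pulls the Nth table out of a page and dumps it as CSV using only the standard library. `TableExtractor` and `table_to_csv` are hypothetical names, and nested tables are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of each row in the index-th <table> (1-based)."""

    def __init__(self, index=1):
        super().__init__()
        self.index = index
        self.tables_seen = 0
        self.in_target = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables_seen += 1
            self.in_target = self.tables_seen == self.index
        elif self.in_target and tag == "tr":
            self.rows.append([])
        elif self.in_target and tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_target = False
            self.in_cell = False
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def table_to_csv(html, index=1):
    """Return the chosen table as CSV text, one line per table row."""
    parser = TableExtractor(index)
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows([cell.strip() for cell in row] for row in parser.rows)
    return buffer.getvalue()
```

Feed it the HTML of the originating page and write the returned string to a `.csv` file.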
  • 11. extraction #3: dapper.net • go to dapper.net/open • identify several of the urls with the same “shapes” that you want to scrape • use the dapper dashboard to identify content areas • build the “dapp” • pass in urls of pages you want to extract data from • extract results from the output (xml, flash, csv, etc) originating page | output
  • 12. extraction #4: YQL • view source on the page you want to grab • go to http://developer.yahoo.com/yql/console/ • get your XPath hat on and build a query • grab the data from a RESTful query http://developer.yahoo.com/yql/console/?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fopenlibrary.org%2Fsearch%3Fq%3Dkeri%2Bhulme%22%20and%20xpath%3D%27%2F%2Fa%5B%40class%3D%22result%22%5D%27 originating page | output
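
The XPath in the slide 12 query decodes to `//a[@class="result"]`. The same selection can be sketched locally in Python with the standard library's ElementTree, which supports a small XPath subset. This assumes the markup is already well-formed XML (real-world HTML usually needs tidying first); the function name and sample markup are invented for illustration.

```python
import xml.etree.ElementTree as ET


def select_results(xhtml):
    """Return (href, link text) for every <a class="result"> in the document."""
    root = ET.fromstring(xhtml)
    return [
        (link.get("href"), "".join(link.itertext()).strip())
        for link in root.findall(".//a[@class='result']")
    ]


# Made-up, well-formed sample standing in for a search results page.
sample = (
    "<html><body>"
    '<a class="result" href="/b/1">The Bone People</a>'
    '<a class="nav" href="/">Home</a>'
    "</body></html>"
)
print(select_results(sample))  # [('/b/1', 'The Bone People')]
```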
  • 13. extraction #5: httrack • grab a copy of httrack (or similar) from http://www.httrack.com/ • point it at the bit of the site you want, make sure the filters are correct, and push go... • you now have a local copy of the site, to munge as you see fit
  • 14. extraction #6: hacked search • get an API key from Yahoo! • use it to search within a domain • script a standard download script to pick out each page and download it • hack that mumma • (variation on a theme: build a simple spider...)
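
The "build a simple spider" variation on slide 14 could be sketched as below: a breadth-first crawl that stays on one host. Everything here is an assumption for illustration (the names `LinkCollector` and `spider`, the `limit` default, and the idea of passing in your own `fetch` function so the download step can be swapped or faked).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Gather the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def spider(start_url, fetch, limit=50):
    """Crawl pages on start_url's host breadth-first; return {url: html}."""
    host = urlparse(start_url).netloc
    queue, pages = [start_url], {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in pages:
            continue
        pages[url] = fetch(url)  # `fetch` is any callable: url -> html text
        collector = LinkCollector()
        collector.feed(pages[url])
        for href in collector.links:
            link = urljoin(url, href).split("#")[0]  # resolve and drop fragments
            if urlparse(link).netloc == host and link not in pages:
                queue.append(link)
    return pages
```

Passing `fetch` in as an argument means the same loop works with a real downloader or with a dictionary of canned pages while you are testing.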
  • 15. now you’ve got your data.. • once you’ve got your data, you usually need to munge it...
  • 16. munging #1: regex! • I’m terrible at regex • ([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA) • but it’s incredibly powerful... output
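
The pattern on slide 16 looks like a UK postcode matcher. Here is a hedged Python sketch of putting it to work, assuming the stray spaces in the transcript are line-wrap artifacts and only the ` {1,2}` (one or two spaces between the two halves) is intentional. The sample sentence is made up.

```python
import re

# Slide 16's pattern, split over two lines for readability only.
POSTCODE = re.compile(
    r"([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}"
    r"[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)"
)


def find_postcodes(text):
    """Return every substring of `text` that the slide's pattern matches."""
    return POSTCODE.findall(text)


sample = "Offices at SW1A 2AA and M1 1AE; old Girobank code GIR 0AA."
print(find_postcodes(sample))  # ['SW1A 2AA', 'M1 1AE', 'GIR 0AA']
```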
  • 17. munging #2: find/replace • use whatever scripting language you work best with • (even Word...) • you’ll find that replace double space, replace weird characters, replace paragraph marks are about the most common needs
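
The three common find/replace jobs named on slide 17 (double spaces, weird characters, paragraph marks) might be sketched in Python like this. Which characters count as "weird" is an assumption; the table below simply lists a few usual suspects from copied Word or web text.

```python
import re

# Assumed list of troublesome characters; extend to suit your own data.
REPLACEMENTS = {
    "\u00a0": " ",   # non-breaking space
    "\u2018": "'",   # left curly single quote
    "\u2019": "'",   # right curly single quote
    "\u201c": '"',   # left curly double quote
    "\u201d": '"',   # right curly double quote
    "\r\n": "\n",    # Windows paragraph marks
    "\r": "\n",      # old Mac paragraph marks
}


def clean_text(text):
    """Normalise weird characters, paragraph marks and runs of spaces."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    text = re.sub(r"[ \t]{2,}", " ", text)   # double (or more) spaces -> one
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()
```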
  • 18. munging #3: mail merge! • for rapid builds of html, javascript or xml • have a source document (often extracted or munged from other sites) in Excel • you can use filters to effectively grab the data you need • build the merge in Word, using the “directory” option • copy and paste the result out
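
The Excel-plus-Word merge on slide 18 has a close scripted cousin: read rows from a CSV, optionally filter them, and pour each one into a template. This is a minimal Python sketch under that assumption; `merge`, the `keep` filter and the sample data are all hypothetical, and values are inserted as-is with no escaping.

```python
import csv
import io
from string import Template


def merge(template_text, csv_text, keep=lambda row: True):
    """Fill the template once per CSV row that passes the `keep` filter."""
    template = Template(template_text)
    rows = csv.DictReader(io.StringIO(csv_text))
    return "".join(template.substitute(row) for row in rows if keep(row))


# Made-up source data; the header row supplies the $placeholders.
source = "name,town\nScience Museum,London\nTate,Liverpool\n"
html = merge("<li>$name ($town)</li>\n", source, keep=lambda r: r["town"] == "London")
print(html)
```

The `keep` function plays the role of the Excel filter, and the template stands in for the Word "directory" merge document.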
  • 19. munging #4: html removal • have a function handy that you can pass a block of html to • it is handy to have a script where you can define which particular tags to remove or leave in place
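
A function of the kind slide 19 describes could look something like the Python sketch below: pass in a block of html plus the tags you want left in place, and everything else is stripped. `TagStripper` and `strip_tags` are hypothetical names. As simplifications, attributes on kept tags are dropped, character references are decoded to plain characters, and the contents of script and style blocks are treated as ordinary text.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Rebuild a document keeping its text and only the tags named in `keep`."""

    def __init__(self, keep=()):
        super().__init__(convert_charrefs=True)
        self.keep = {tag.lower() for tag in keep}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            self.parts.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        if tag in self.keep:
            self.parts.append(f"<{tag}/>")  # self-closing tags such as <br/>

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html, keep=()):
    """Remove every tag from `html` except those listed in `keep`."""
    stripper = TagStripper(keep)
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.parts)
```

For example, `strip_tags(block, keep=("p", "br"))` leaves paragraph structure in place while throwing away layout and styling tags.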
  • 20. munging #5: html tidy • grab a copy of html tidy from http://tidy.sourceforge.net/ • tidy is available as a downloadable .exe or a component that you can pass data to in your code
  • 21. processing #1: Open Calais • a service from Reuters for analysing blocks of text for semantic “meaning” • get an API key from Open Calais • send data via a POST to the REST service • retrieve results from the RDF • OR...just paste your text into http://sws.clearforest.com/calaisviewer/ output
  • 22. processing #2: Yahoo! TE • a webservice for grabbing tags/terms from blocks of text • sign up for a Yahoo! API key • pass your block of text using POST • grab the results.. output
  • 23. processing #3: geo! • go to http://developer.yahoo.com/geo !
  • 24. the ugly sisters • Access • Excel (!)
  • 25. the last resorts • FOI (frankie!) • OCR (me)
  • 26. the very last resort.. • re-type it... • (or use Amazon Mechanical Turk)