SlideShare une entreprise Scribd logo
1  sur  35
Don’t Scrape, Glean . . ,[object Object]
Scraping sucks.
def  lastlogin   (@hmodel/ &quot;//td[@class='text'][@width='193']&quot; ).first.innerHTML.split(&quot;<br />&quot;[ 9 ].strip[ -10 .. -1 ]   return  date[ -4 .. -1 ] + &quot;-&quot; + date[ -7 .. -6 ] + &quot;-&quot; + date[ -10 .. -9 ] end end end end
Hpricot for ‘Last login’ date on MySpace.
try :   lastlogin = self.soup.findAll( True , { &quot;width&quot; :  &quot;193&quot; })[ 0 ].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string   loginregex  =  re.compile(  r &quot; [0-9] / [0-9] +/ [0-9]* &quot;)   loginregex_inst  =  loginregex.search(lastlogin)   if  loginregex_inst  is   not None :   self.lastlogin  =  loginregex_inst.group()   except :   pass pass pass pass pass pass pass pass
Taken from a Python/BeautifulSoup library.
(The Ruby is prettier, but who’s counting?)
getElementsByClassName(“foo”)[0].children
It’s an edge case. MySpace’s HTML  is  worse than average.
But it is an ugly recipe for mental turmoil.
The alternative?
flickr.getPhotos()
And you get back nice XML or JSON (or even SOAP!) (or even SOAP!)
But ‘D.R.Y.’! APIs break that principle. APIs break that principle.
This is the data equivalent of the ‘accessible version’.
Enter GRDDL.
GRDDL defines a transformation process for XHTML » RDF.
XHTML ? That’s what the spec says. That’s what the spec says.
HTML 4 works too. Tidy ! !
RDF? Yes. Trust me. It’s not evil. It’s not evil. It’s not evil.
GRDDL can work like a data stylesheet on top of your HTML. on top of your HTML. on top of your HTML.
You simply use HTML (or XML) in the normal way...
...and define how the data transformation.
You can even use it as a bridge for exisiting APIs and services.
Could even be used for other formats than RDF. Atom? than RDF. Atom? than RDF. Atom?
Simple example: ‘Not Safe For Work’ ‘Not Safe For Work’
<a href=&quot; http://tubgirl.com &quot; class=&quot;nsfw&quot;>
I can write that. I can’t write xFolk by hand. I can’t write xFolk by hand.
Is ‘nsfw’ a good class name? No.
Do I care? No.
The data layer becomes separated like CSS is from HTML.
That’s the theory. Now for the demo. Now for the demo.
irc.freenode.net #swig #swhack #swhack #swhack
getsemantic.com [email_address] [email_address]
[email_address] http://tommorris.org http://tommorris.org

Contenu connexe

Similaire à Don't scrape, Glean!

XML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARXML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARStephan Schmidt
 
Ods Markup And Tagsets: A Tutorial
Ods Markup And Tagsets: A TutorialOds Markup And Tagsets: A Tutorial
Ods Markup And Tagsets: A Tutorialsimienc
 
Csphtp1 18
Csphtp1 18Csphtp1 18
Csphtp1 18HUST
 
Douglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation JsonsagaDouglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation JsonsagaAjax Experience 2009
 
Jsonsaga
JsonsagaJsonsaga
Jsonsaganohmad
 
The JSON Saga
The JSON SagaThe JSON Saga
The JSON Sagakaven yan
 
XML processing with perl
XML processing with perlXML processing with perl
XML processing with perlJoe Jiang
 
Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007Guillaume Laforge
 
Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP yucefmerhi
 
A Toda Maquina Con Ruby on Rails
A Toda Maquina Con Ruby on RailsA Toda Maquina Con Ruby on Rails
A Toda Maquina Con Ruby on RailsRafael García
 
How Xslate Works
How Xslate WorksHow Xslate Works
How Xslate WorksGoro Fuji
 
Debugging and Error handling
Debugging and Error handlingDebugging and Error handling
Debugging and Error handlingSuite Solutions
 
Система рендеринга в Magento
Система рендеринга в MagentoСистема рендеринга в Magento
Система рендеринга в MagentoMagecom Ukraine
 
WordPress Development Confoo 2010
WordPress Development Confoo 2010WordPress Development Confoo 2010
WordPress Development Confoo 2010Brendan Sera-Shriar
 
Lecture 5 - Comm Lab: Web @ ITP
Lecture 5 - Comm Lab: Web @ ITPLecture 5 - Comm Lab: Web @ ITP
Lecture 5 - Comm Lab: Web @ ITPyucefmerhi
 

Similaire à Don't scrape, Glean! (20)

XML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARXML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEAR
 
Ods Markup And Tagsets: A Tutorial
Ods Markup And Tagsets: A TutorialOds Markup And Tagsets: A Tutorial
Ods Markup And Tagsets: A Tutorial
 
lf-2003_01-0269
lf-2003_01-0269lf-2003_01-0269
lf-2003_01-0269
 
lf-2003_01-0269
lf-2003_01-0269lf-2003_01-0269
lf-2003_01-0269
 
Csphtp1 18
Csphtp1 18Csphtp1 18
Csphtp1 18
 
Douglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation JsonsagaDouglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation Jsonsaga
 
Jsonsaga
JsonsagaJsonsaga
Jsonsaga
 
The JSON Saga
The JSON SagaThe JSON Saga
The JSON Saga
 
XML processing with perl
XML processing with perlXML processing with perl
XML processing with perl
 
Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007
 
Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP
 
Grails and Dojo
Grails and DojoGrails and Dojo
Grails and Dojo
 
A Toda Maquina Con Ruby on Rails
A Toda Maquina Con Ruby on RailsA Toda Maquina Con Ruby on Rails
A Toda Maquina Con Ruby on Rails
 
How Xslate Works
How Xslate WorksHow Xslate Works
How Xslate Works
 
Debugging and Error handling
Debugging and Error handlingDebugging and Error handling
Debugging and Error handling
 
Система рендеринга в Magento
Система рендеринга в MagentoСистема рендеринга в Magento
Система рендеринга в Magento
 
WordPress Development Confoo 2010
WordPress Development Confoo 2010WordPress Development Confoo 2010
WordPress Development Confoo 2010
 
Lecture 5 - Comm Lab: Web @ ITP
Lecture 5 - Comm Lab: Web @ ITPLecture 5 - Comm Lab: Web @ ITP
Lecture 5 - Comm Lab: Web @ ITP
 
JavaScript
JavaScriptJavaScript
JavaScript
 
Orm hero
Orm heroOrm hero
Orm hero
 

Dernier

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 

Dernier (20)

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 

Don't scrape, Glean!