SlideShare une entreprise Scribd logo
1  sur  35
Don’t Scrape, Glean . . ,[object Object]
Scraping sucks.
def  lastlogin   (@hmodel/ &quot;//td[@class='text'][@width='193']&quot; ).first.innerHTML.split(&quot;<br />&quot;[ 9 ].strip[ -10 .. -1 ]   return  date[ -4 .. -1 ] + &quot;-&quot; + date[ -7 .. -6 ] + &quot;-&quot; + date[ -10 .. -9 ] end end end end
Hpricot for ‘Last login’ date on MySpace.
try :   lastlogin = self.soup.findAll( True , { &quot;width&quot; :  &quot;193&quot; })[ 0 ].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string   loginregex  =  re.compile(  r &quot; [0-9] / [0-9] +/ [0-9]* &quot;)   loginregex_inst  =  loginregex.search(lastlogin)   if  loginregex_inst  is   not None :   self.lastlogin  =  loginregex_inst.group()   except :   pass pass pass pass pass pass pass pass
Taken from a Python/BeautifulSoup library.
(The Ruby is prettier, but who’s counting?)
getElementsByClassName(“foo”)[0].children
It’s an edge case. MySpace’s HTML  is  worse than average.
But it is an ugly recipe for mental turmoil.
The alternative?
flickr.getPhotos()
And you get back nice XML or JSON (or even SOAP!) (or even SOAP!)
But ‘D.R.Y.’! APIs break that principle. APIs break that principle.
This is the data equivalent of the ‘accessible version’.
Enter GRDDL.
GRDDL defines a transformation process for XHTML » RDF.
XHTML ? That’s what the spec says. That’s what the spec says.
HTML 4 works too. Tidy ! !
RDF? Yes. Trust me. It’s not evil. It’s not evil. It’s not evil.
GRDDL can work like a data stylesheet on top of your HTML. on top of your HTML. on top of your HTML.
You simply use HTML (or XML) in the normal way...
...and define how the data transformation.
You can even use it as a bridge for exisiting APIs and services.
Could even be used for other formats than RDF. Atom? than RDF. Atom? than RDF. Atom?
Simple example: ‘Not Safe For Work’ ‘Not Safe For Work’
<a href=&quot; http://tubgirl.com &quot; class=&quot;nsfw&quot;>
I can write that. I can’t write xFolk by hand. I can’t write xFolk by hand.
Is ‘nsfw’ a good class name? No.
Do I care? No.
The data layer becomes separated like CSS is from HTML.
That’s the theory. Now for the demo. Now for the demo.
irc.freenode.net #swig #swhack #swhack #swhack
getsemantic.com [email_address] [email_address]
[email_address] http://tommorris.org http://tommorris.org

Contenu connexe

Similaire à Don't scrape, Glean!

XML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARXML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARStephan Schmidt
 
Ods Markup And Tagsets: A Tutorial
Ods Markup And Tagsets: A TutorialOds Markup And Tagsets: A Tutorial
Ods Markup And Tagsets: A Tutorialsimienc
 
Csphtp1 18
Csphtp1 18Csphtp1 18
Csphtp1 18HUST
 
Douglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation JsonsagaDouglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation JsonsagaAjax Experience 2009
 
Jsonsaga
JsonsagaJsonsaga
Jsonsaganohmad
 
The JSON Saga
The JSON SagaThe JSON Saga
The JSON Sagakaven yan
 
XML processing with perl
XML processing with perlXML processing with perl
XML processing with perlJoe Jiang
 
Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007Guillaume Laforge
 
Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP yucefmerhi
 
A Toda Maquina Con Ruby on Rails
A Toda Maquina Con Ruby on RailsA Toda Maquina Con Ruby on Rails
A Toda Maquina Con Ruby on RailsRafael García
 
How Xslate Works
How Xslate WorksHow Xslate Works
How Xslate WorksGoro Fuji
 
Debugging and Error handling
Debugging and Error handlingDebugging and Error handling
Debugging and Error handlingSuite Solutions
 
Система рендеринга в Magento
Система рендеринга в MagentoСистема рендеринга в Magento
Система рендеринга в MagentoMagecom Ukraine
 
WordPress Development Confoo 2010
WordPress Development Confoo 2010WordPress Development Confoo 2010
WordPress Development Confoo 2010Brendan Sera-Shriar
 
Lecture 5 - Comm Lab: Web @ ITP
Lecture 5 - Comm Lab: Web @ ITPLecture 5 - Comm Lab: Web @ ITP
Lecture 5 - Comm Lab: Web @ ITPyucefmerhi
 

Similaire à Don't scrape, Glean! (20)

XML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARXML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEAR
 
Ods Markup And Tagsets: A Tutorial
Ods Markup And Tagsets: A TutorialOds Markup And Tagsets: A Tutorial
Ods Markup And Tagsets: A Tutorial
 
lf-2003_01-0269
lf-2003_01-0269lf-2003_01-0269
lf-2003_01-0269
 
lf-2003_01-0269
lf-2003_01-0269lf-2003_01-0269
lf-2003_01-0269
 
Csphtp1 18
Csphtp1 18Csphtp1 18
Csphtp1 18
 
Douglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation JsonsagaDouglas Crockford Presentation Jsonsaga
Douglas Crockford Presentation Jsonsaga
 
Jsonsaga
JsonsagaJsonsaga
Jsonsaga
 
The JSON Saga
The JSON SagaThe JSON Saga
The JSON Saga
 
XML processing with perl
XML processing with perlXML processing with perl
XML processing with perl
 
Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007Grails Introduction - IJTC 2007
Grails Introduction - IJTC 2007
 
Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP Lecture 3 - Comm Lab: Web @ ITP
Lecture 3 - Comm Lab: Web @ ITP
 
Grails and Dojo
Grails and DojoGrails and Dojo
Grails and Dojo
 
A Toda Maquina Con Ruby on Rails
A Toda Maquina Con Ruby on RailsA Toda Maquina Con Ruby on Rails
A Toda Maquina Con Ruby on Rails
 
How Xslate Works
How Xslate WorksHow Xslate Works
How Xslate Works
 
Debugging and Error handling
Debugging and Error handlingDebugging and Error handling
Debugging and Error handling
 
Система рендеринга в Magento
Система рендеринга в MagentoСистема рендеринга в Magento
Система рендеринга в Magento
 
WordPress Development Confoo 2010
WordPress Development Confoo 2010WordPress Development Confoo 2010
WordPress Development Confoo 2010
 
Lecture 5 - Comm Lab: Web @ ITP
Lecture 5 - Comm Lab: Web @ ITPLecture 5 - Comm Lab: Web @ ITP
Lecture 5 - Comm Lab: Web @ ITP
 
JavaScript
JavaScriptJavaScript
JavaScript
 
Orm hero
Orm heroOrm hero
Orm hero
 

Dernier

Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

Dernier (20)

Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Don't scrape, Glean!