Don't scrape, Glean!
tommorris
Lacks the demo part, alas, but it's the slides I used
Don't scrape, Glean!
1.
2.
Scraping sucks.
3.
def lastlogin
  # Take the tenth <br />-separated chunk of the 193px-wide cell and keep its last ten characters.
  date = (@hmodel / "//td[@class='text'][@width='193']").first.innerHTML.split("<br />")[9].strip[-10..-1]
  return date[-4..-1] + "-" + date[-7..-6] + "-" + date[-10..-9]
end
4.
Hpricot for ‘Last login’ date on MySpace.
5.
try :
lastlogin = self.soup.findAll( True , { "width" : "193" })[ 0 ].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string loginregex = re.compile( r " [0-9] / [0-9] +/ [0-9]* ") loginregex_inst = loginregex.search(lastlogin) if loginregex_inst is not None : self.lastlogin = loginregex_inst.group() except : pass pass pass pass pass pass pass pass
6.
Taken from a Python/BeautifulSoup library.
7.
(The Ruby is prettier, but who’s counting?)
8.
getElementsByClassName("foo")[0].children
9.
It’s an edge case. MySpace’s HTML is worse than average.
10.
But it is an ugly recipe for mental turmoil.
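To make the fragility concrete, here is a minimal sketch of position-based scraping. The markup strings and the `last_login_by_position` helper are invented for illustration (they are not the real MySpace HTML or either library's code); the point is only that one extra `<br />` shifts every index.

```python
def last_login_by_position(cell_html: str, index: int = 2) -> str:
    """Return the chunk at a fixed position between <br /> tags."""
    return cell_html.split("<br />")[index].strip()


# Hypothetical profile cell before and after a small site redesign.
original = "Tom<br />London<br />Last Login: 01/02/2008"
redesigned = "Tom<br /><b>Online now!</b><br />London<br />Last Login: 01/02/2008"

print(last_login_by_position(original))    # the chunk the scraper was written for
print(last_login_by_position(redesigned))  # a different chunk, with no error raised
```

Nothing fails loudly after the redesign: the scraper just starts returning the wrong field.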
11.
The alternative?
12.
flickr.getPhotos()
13.
And you get back nice XML or JSON (or even SOAP!)
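For contrast with the scraping code, a rough sketch of consuming a JSON response. The payload and its field names are assumptions made up for this example, not Flickr's actual response format.

```python
import json

# Hypothetical response body; the field names are illustrative only.
response_body = '{"photos": [{"id": "1", "title": "Sunset"}, {"id": "2", "title": "Harbour"}]}'


def photo_titles(body: str) -> list[str]:
    """Pull the titles out of the structured response by name, not by position."""
    data = json.loads(body)
    return [photo["title"] for photo in data["photos"]]


print(photo_titles(response_body))
```

Fields are addressed by name, so adding a new field to the response does not disturb existing consumers.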
14.
But ‘D.R.Y.’! APIs break that principle.
15.
This is the data equivalent of the ‘accessible version’.
16.
Enter GRDDL.
17.
GRDDL defines a transformation process for XHTML » RDF.
18.
XHTML? That’s what the spec says.
19.
HTML 4 works too. Tidy!
20.
RDF? Yes. Trust me. It’s not evil.
21.
GRDDL can work like a data stylesheet on top of your HTML.
22.
You simply use HTML (or XML) in the normal way...
23.
...and define how the data is transformed.
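As a sketch of how a page declares its transformation: in GRDDL an XHTML document opts in with the `http://www.w3.org/2003/g/data-view` profile on `head` and points at a transform with `rel="transformation"`. The helper below only discovers those links in the `head`; it does not run the XSLT, and the `example.org` stylesheet URL is a placeholder.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def grddl_transformations(xhtml: str) -> list[str]:
    """List transformation URIs declared via <link> in a GRDDL-profiled head."""
    root = ET.fromstring(xhtml)
    head = root.find(XHTML + "head")
    if head is None or GRDDL_PROFILE not in head.get("profile", "").split():
        return []  # the page has not opted in to GRDDL
    return [
        link.get("href")
        for link in head.iter(XHTML + "link")
        if "transformation" in link.get("rel", "").split() and link.get("href")
    ]


# Hypothetical page; the stylesheet URL is a placeholder.
page = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Links</title>
    <link rel="transformation" href="http://example.org/glean-nsfw.xsl" />
  </head>
  <body><p>Ordinary content.</p></body>
</html>"""

print(grddl_transformations(page))
```

A GRDDL-aware agent would then fetch each listed transform and apply it to the page to obtain RDF.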
24.
You can even use it as a bridge for existing APIs and services.
25.
Could even be used for other formats than RDF. Atom?
26.
Simple example: ‘Not Safe For Work’
27.
<a href="http://tubgirl.com" class="nsfw">
28.
I can write that. I can’t write xFolk by hand.
29.
Is ‘nsfw’ a good class name? No.
30.
Do I care? No.
31.
The data layer becomes separated, like CSS is from HTML.
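A rough stand-in for what the gleaning step might produce for the ‘nsfw’ example. A real GRDDL transform would normally be XSLT; this sketch uses plain Python instead, and the predicate and object URIs under `example.org` are invented for illustration rather than taken from any vocabulary.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
# Hypothetical vocabulary URIs, made up for this sketch.
RATING = "http://example.org/vocab#rating"
NSFW = "http://example.org/vocab#NotSafeForWork"


def glean_nsfw(xhtml: str) -> list[tuple[str, str, str]]:
    """Turn every <a class="nsfw"> into a (subject, predicate, object) triple."""
    root = ET.fromstring(xhtml)
    return [
        (a.get("href"), RATING, NSFW)
        for a in root.iter(XHTML + "a")
        if "nsfw" in a.get("class", "").split() and a.get("href")
    ]


page = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <p><a href="http://example.org/safe">safe</a>
     <a href="http://example.org/risky" class="external nsfw">risky</a></p>
</body></html>"""

# Print the gleaned triples in an N-Triples-like form.
for s, p, o in glean_nsfw(page):
    print(f"<{s}> <{p}> <{o}> .")
```

The author only writes the class name; the mapping from class to meaning lives in the transform, which is the separation this slide describes.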
32.
That’s the theory. Now for the demo.
33.
irc.freenode.net #swig #swhack
34.
getsemantic.com [email_address]
35.
[email_address] http://tommorris.org