SlideShare une entreprise Scribd logo
1  sur  76
Télécharger pour lire hors ligne
DATA LIBERATION
Opening Up Data by Hook
or by Crook - Data
Scraping, Linkage and the
Value of a Good Identifier
                               Tony Hirst
                      Department of Communication
                             and Systems
                          The Open University
data NOT
information
              by Vick
[Disruptive
Innovation?]
Onlineinfo2012 - Scraping
“First” generation:
 data catalogues
Breathing life
 into data…
=importData(“CSV_URL”)
the spreadsheet becomes

A DATABASE
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
“Second” generation:
 data management
      systems
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
There’s lots more
data that’s locked
up in web pages…
Scraping…
Onlineinfo2012 - Scraping
“grabbing web content
in a machine readable
   format and then
 processing it for your
    own purposes”
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Original      Extract
                          Accessible
HTML web    Information
                          web page
  page         -> data
Recreating the
database that was used
     to populate a
   (templated) page
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
…quick’n’dirty
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Scrapers
                  SQLite
    Scraper      database




Views
   SQLitedatab
       ase
                 Scraper
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Sometimes the
 data is spread
across different
     files…
Onlineinfo2012 - Scraping
Row based
aggregation
Sometimes the
 data is spread
across different
  websites…
…   Normalisation…
Onlineinfo2012 - Scraping
Data
Enrichment
Column
Additions/An
 notations
Onlineinfo2012 - Scraping
Sometimes the
  data is split
across different
     files…
Column
based merge
Onlineinfo2012 - Scraping
-> Data
cleansing
Clustering…
http://mashe.hawksey.info/2012/11/mining-and-openrefineing-jiscmail-a-look-at-oer-discuss/

/via Martin Hawksey/@mhawksey
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
“Finessing” a
  common
  identifer
Common identifiers
 (common KEYS) make
it MUCH easier to JOIN
   datasets by column
Book Title
-> ISBN
I am “psychemedia”
on Twitter, delicious,
slideshare, flickr, etc
         etc
Onlineinfo2012 - Scraping
Reconciliation…
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Onlineinfo2012 - Scraping
Linked
Data™
Onlineinfo2012 - Scraping
So who speaks SPARQL?




     Diners - Journal Canteen
     by avlxyz
You DON’T have to….
Just think about how one piece of
 data might be related to another
   through a common means of
        addressing them…
http://ouseful.info

 @psychemedia

Contenu connexe

Tendances

Soton2013 opendata
Soton2013 opendataSoton2013 opendata
Soton2013 opendataTony Hirst
 
non-slides-Thatcamp
non-slides-Thatcampnon-slides-Thatcamp
non-slides-ThatcampTrevor Owens
 
Data(base) taxonomy
Data(base) taxonomyData(base) taxonomy
Data(base) taxonomyDejan Radic
 
Promises and Pitfalls: Linked Data, Privacy, and Library Catalogs
Promises and Pitfalls: Linked Data, Privacy, and Library CatalogsPromises and Pitfalls: Linked Data, Privacy, and Library Catalogs
Promises and Pitfalls: Linked Data, Privacy, and Library CatalogsEmily Nimsakont
 
IASSIT Kansa Presentation
IASSIT Kansa PresentationIASSIT Kansa Presentation
IASSIT Kansa Presentationekansa
 
Relevance of clasification and indexing
Relevance of clasification and indexingRelevance of clasification and indexing
Relevance of clasification and indexingVaralakshmiRSR
 
A distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamA distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamEnno Meijers
 
What is Linked Data, and What Does It Mean for Libraries?
What is Linked Data, and What Does It Mean for Libraries?What is Linked Data, and What Does It Mean for Libraries?
What is Linked Data, and What Does It Mean for Libraries?Emily Nimsakont
 
LODLAM Landscape NOTES
LODLAM Landscape NOTESLODLAM Landscape NOTES
LODLAM Landscape NOTESShana McDanold
 
Lodlam.slideshare
Lodlam.slideshareLodlam.slideshare
Lodlam.slideshareHafabe
 
The network reconfigures the catalog
The network reconfigures the catalogThe network reconfigures the catalog
The network reconfigures the cataloglisld
 
Towards collaboration at scale: Libraries, the social and the technical
Towards collaboration at scale:  Libraries, the social and the technicalTowards collaboration at scale:  Libraries, the social and the technical
Towards collaboration at scale: Libraries, the social and the technicallisld
 
Linked Data for Law Libraries: An Introduction
Linked Data for Law Libraries: An IntroductionLinked Data for Law Libraries: An Introduction
Linked Data for Law Libraries: An IntroductionEmily Nimsakont
 
ECS2019 - Managing Content Types in the Modern World
ECS2019 - Managing Content Types in the Modern WorldECS2019 - Managing Content Types in the Modern World
ECS2019 - Managing Content Types in the Modern WorldMarc D Anderson
 

Tendances (18)

Soton2013 opendata
Soton2013 opendataSoton2013 opendata
Soton2013 opendata
 
non-slides-Thatcamp
non-slides-Thatcampnon-slides-Thatcamp
non-slides-Thatcamp
 
I say NoSQL you say what
I say NoSQL you say whatI say NoSQL you say what
I say NoSQL you say what
 
Data(base) taxonomy
Data(base) taxonomyData(base) taxonomy
Data(base) taxonomy
 
Promises and Pitfalls: Linked Data, Privacy, and Library Catalogs
Promises and Pitfalls: Linked Data, Privacy, and Library CatalogsPromises and Pitfalls: Linked Data, Privacy, and Library Catalogs
Promises and Pitfalls: Linked Data, Privacy, and Library Catalogs
 
IASSIT Kansa Presentation
IASSIT Kansa PresentationIASSIT Kansa Presentation
IASSIT Kansa Presentation
 
Relevance of clasification and indexing
Relevance of clasification and indexingRelevance of clasification and indexing
Relevance of clasification and indexing
 
A distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamA distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics Amsterdam
 
What is Linked Data, and What Does It Mean for Libraries?
What is Linked Data, and What Does It Mean for Libraries?What is Linked Data, and What Does It Mean for Libraries?
What is Linked Data, and What Does It Mean for Libraries?
 
Databases and types of databases
Databases and types of databasesDatabases and types of databases
Databases and types of databases
 
LODLAM Landscape NOTES
LODLAM Landscape NOTESLODLAM Landscape NOTES
LODLAM Landscape NOTES
 
Lodlam.slideshare
Lodlam.slideshareLodlam.slideshare
Lodlam.slideshare
 
LODLAM Landscape
LODLAM LandscapeLODLAM Landscape
LODLAM Landscape
 
Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
 
The network reconfigures the catalog
The network reconfigures the catalogThe network reconfigures the catalog
The network reconfigures the catalog
 
Towards collaboration at scale: Libraries, the social and the technical
Towards collaboration at scale:  Libraries, the social and the technicalTowards collaboration at scale:  Libraries, the social and the technical
Towards collaboration at scale: Libraries, the social and the technical
 
Linked Data for Law Libraries: An Introduction
Linked Data for Law Libraries: An IntroductionLinked Data for Law Libraries: An Introduction
Linked Data for Law Libraries: An Introduction
 
ECS2019 - Managing Content Types in the Modern World
ECS2019 - Managing Content Types in the Modern WorldECS2019 - Managing Content Types in the Modern World
ECS2019 - Managing Content Types in the Modern World
 

En vedette

Mining the web, no experience required
Mining the web, no experience requiredMining the web, no experience required
Mining the web, no experience requiredScrapinghub
 
XPath for web scraping
XPath for web scrapingXPath for web scraping
XPath for web scrapingScrapinghub
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkScrapinghub
 
Python 101: Python for Absolute Beginners (PyTexas 2014)
Python 101: Python for Absolute Beginners (PyTexas 2014)Python 101: Python for Absolute Beginners (PyTexas 2014)
Python 101: Python for Absolute Beginners (PyTexas 2014)Paige Bailey
 

En vedette (6)

Search34
Search34Search34
Search34
 
Mining the web, no experience required
Mining the web, no experience requiredMining the web, no experience required
Mining the web, no experience required
 
XPath for web scraping
XPath for web scrapingXPath for web scraping
XPath for web scraping
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling framework
 
chapter22.ppt
chapter22.pptchapter22.ppt
chapter22.ppt
 
Python 101: Python for Absolute Beginners (PyTexas 2014)
Python 101: Python for Absolute Beginners (PyTexas 2014)Python 101: Python for Absolute Beginners (PyTexas 2014)
Python 101: Python for Absolute Beginners (PyTexas 2014)
 

Similaire à Onlineinfo2012 - Scraping

What Is Linked Data, and What Does it Mean for Libraries? ALAO TEDSIG Spring ...
What Is Linked Data, and What Does it Mean for Libraries? ALAO TEDSIG Spring ...What Is Linked Data, and What Does it Mean for Libraries? ALAO TEDSIG Spring ...
What Is Linked Data, and What Does it Mean for Libraries? ALAO TEDSIG Spring ...Emily Nimsakont
 
What is the Semantic Web
What is the Semantic WebWhat is the Semantic Web
What is the Semantic WebJuan Sequeda
 
Linked Data: so what?
Linked Data: so what?Linked Data: so what?
Linked Data: so what?MIUR
 
Linked Data and Libraries: What? Why? How?
Linked Data and Libraries: What? Why? How?Linked Data and Libraries: What? Why? How?
Linked Data and Libraries: What? Why? How?Emily Nimsakont
 
Linked data for Libraries, Archives, Museums
Linked data for Libraries, Archives, MuseumsLinked data for Libraries, Archives, Museums
Linked data for Libraries, Archives, Museumsljsmart
 
Library discovery: past, present and some futures
Library discovery: past, present and some futuresLibrary discovery: past, present and some futures
Library discovery: past, present and some futureslisld
 
Lodlam saa 2011_jenelfarrell_2
Lodlam saa 2011_jenelfarrell_2Lodlam saa 2011_jenelfarrell_2
Lodlam saa 2011_jenelfarrell_2Jenel Farrell
 
FAIR data: LOUD for all audiences
FAIR data: LOUD for all audiencesFAIR data: LOUD for all audiences
FAIR data: LOUD for all audiencesAlessandro Adamou
 
Metadata in the age of data curation and linked data
Metadata in the age of data curation and linked dataMetadata in the age of data curation and linked data
Metadata in the age of data curation and linked dataRyan Johnson
 
Madrid Building blocks of Linked Data
Madrid Building blocks of Linked DataMadrid Building blocks of Linked Data
Madrid Building blocks of Linked DataVictor de Boer
 
LIS 653 fall 2013 final project posters
LIS 653 fall 2013 final project postersLIS 653 fall 2013 final project posters
LIS 653 fall 2013 final project postersPrattSILS
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked dataLaura Po
 
Libraries in a data-centered environment
Libraries in a data-centered environmentLibraries in a data-centered environment
Libraries in a data-centered environmentJakob .
 
What flavor of linked data is best for your collection?
What flavor of linked data is best for your collection? What flavor of linked data is best for your collection?
What flavor of linked data is best for your collection? Debra Shapiro
 
Semantic Mapping and LOD prez
Semantic Mapping and LOD prezSemantic Mapping and LOD prez
Semantic Mapping and LOD prezCarol Chiodo
 

Similaire à Onlineinfo2012 - Scraping (20)

What Is Linked Data, and What Does it Mean for Libraries? ALAO TEDSIG Spring ...
What Is Linked Data, and What Does it Mean for Libraries? ALAO TEDSIG Spring ...What Is Linked Data, and What Does it Mean for Libraries? ALAO TEDSIG Spring ...
What Is Linked Data, and What Does it Mean for Libraries? ALAO TEDSIG Spring ...
 
What is the Semantic Web
What is the Semantic WebWhat is the Semantic Web
What is the Semantic Web
 
Linked Data: so what?
Linked Data: so what?Linked Data: so what?
Linked Data: so what?
 
Linked Data and Libraries: What? Why? How?
Linked Data and Libraries: What? Why? How?Linked Data and Libraries: What? Why? How?
Linked Data and Libraries: What? Why? How?
 
Linked data for Libraries, Archives, Museums
Linked data for Libraries, Archives, MuseumsLinked data for Libraries, Archives, Museums
Linked data for Libraries, Archives, Museums
 
Library discovery: past, present and some futures
Library discovery: past, present and some futuresLibrary discovery: past, present and some futures
Library discovery: past, present and some futures
 
Databases Essay
Databases EssayDatabases Essay
Databases Essay
 
Lodlam saa 2011_jenelfarrell_2
Lodlam saa 2011_jenelfarrell_2Lodlam saa 2011_jenelfarrell_2
Lodlam saa 2011_jenelfarrell_2
 
Sailing on the ocean of 1s and 0s
Sailing on the ocean of 1s and 0sSailing on the ocean of 1s and 0s
Sailing on the ocean of 1s and 0s
 
FAIR data: LOUD for all audiences
FAIR data: LOUD for all audiencesFAIR data: LOUD for all audiences
FAIR data: LOUD for all audiences
 
Metadata in the age of data curation and linked data
Metadata in the age of data curation and linked dataMetadata in the age of data curation and linked data
Metadata in the age of data curation and linked data
 
Madrid Building blocks of Linked Data
Madrid Building blocks of Linked DataMadrid Building blocks of Linked Data
Madrid Building blocks of Linked Data
 
LIS 653 fall 2013 final project posters
LIS 653 fall 2013 final project postersLIS 653 fall 2013 final project posters
LIS 653 fall 2013 final project posters
 
Engineering a Semantic Web (Spring 2018)
Engineering a Semantic Web (Spring 2018)Engineering a Semantic Web (Spring 2018)
Engineering a Semantic Web (Spring 2018)
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked data
 
Libraries in a data-centered environment
Libraries in a data-centered environmentLibraries in a data-centered environment
Libraries in a data-centered environment
 
What flavor of linked data is best for your collection?
What flavor of linked data is best for your collection? What flavor of linked data is best for your collection?
What flavor of linked data is best for your collection?
 
Linked library data
Linked library dataLinked library data
Linked library data
 
Semantic Mapping and LOD prez
Semantic Mapping and LOD prezSemantic Mapping and LOD prez
Semantic Mapping and LOD prez
 
Semantic web Santhosh N Basavarajappa
Semantic web   Santhosh N BasavarajappaSemantic web   Santhosh N Basavarajappa
Semantic web Santhosh N Basavarajappa
 

Plus de Tony Hirst

15 in 20 research fiesta
15 in 20 research fiesta15 in 20 research fiesta
15 in 20 research fiestaTony Hirst
 
Jupyternotebooks ou.pptx
Jupyternotebooks ou.pptxJupyternotebooks ou.pptx
Jupyternotebooks ou.pptxTony Hirst
 
Virtual computing.pptx
Virtual computing.pptxVirtual computing.pptx
Virtual computing.pptxTony Hirst
 
ouseful-parlihacks
ouseful-parlihacksouseful-parlihacks
ouseful-parlihacksTony Hirst
 
Gors appropriate
Gors appropriateGors appropriate
Gors appropriateTony Hirst
 
Gors appropriate
Gors appropriateGors appropriate
Gors appropriateTony Hirst
 
Robotlab jupyter
Robotlab   jupyterRobotlab   jupyter
Robotlab jupyterTony Hirst
 
Fco open data in half day th-v2
Fco open data in half day  th-v2Fco open data in half day  th-v2
Fco open data in half day th-v2Tony Hirst
 
Notes on the Future - ILI2015 Workshop
Notes on the Future - ILI2015 WorkshopNotes on the Future - ILI2015 Workshop
Notes on the Future - ILI2015 WorkshopTony Hirst
 
Community Journalism Conf - hyperlocal data wire
Community Journalism Conf - hyperlocal data wireCommunity Journalism Conf - hyperlocal data wire
Community Journalism Conf - hyperlocal data wireTony Hirst
 
Residential school 2015_robotics_interest
Residential school 2015_robotics_interestResidential school 2015_robotics_interest
Residential school 2015_robotics_interestTony Hirst
 
Data Mining - Separating Fact From Fiction - NetIKX
Data Mining - Separating Fact From Fiction - NetIKXData Mining - Separating Fact From Fiction - NetIKX
Data Mining - Separating Fact From Fiction - NetIKXTony Hirst
 
A Quick Tour of OpenRefine
A Quick Tour of OpenRefineA Quick Tour of OpenRefine
A Quick Tour of OpenRefineTony Hirst
 
Conversations with data
Conversations with dataConversations with data
Conversations with dataTony Hirst
 
Data reuse OU workshop bingo
Data reuse OU workshop bingoData reuse OU workshop bingo
Data reuse OU workshop bingoTony Hirst
 
Inspiring content - You Don't Need Big Data to Tell Good Data Stories
Inspiring content - You Don't Need Big Data to Tell Good Data Stories Inspiring content - You Don't Need Big Data to Tell Good Data Stories
Inspiring content - You Don't Need Big Data to Tell Good Data Stories Tony Hirst
 
Lincoln jun14datajournalism
Lincoln jun14datajournalismLincoln jun14datajournalism
Lincoln jun14datajournalismTony Hirst
 

Plus de Tony Hirst (20)

15 in 20 research fiesta
15 in 20 research fiesta15 in 20 research fiesta
15 in 20 research fiesta
 
Dev8d jupyter
Dev8d jupyterDev8d jupyter
Dev8d jupyter
 
Ili 16 robot
Ili 16 robotIli 16 robot
Ili 16 robot
 
Jupyternotebooks ou.pptx
Jupyternotebooks ou.pptxJupyternotebooks ou.pptx
Jupyternotebooks ou.pptx
 
Virtual computing.pptx
Virtual computing.pptxVirtual computing.pptx
Virtual computing.pptx
 
ouseful-parlihacks
ouseful-parlihacksouseful-parlihacks
ouseful-parlihacks
 
Gors appropriate
Gors appropriateGors appropriate
Gors appropriate
 
Gors appropriate
Gors appropriateGors appropriate
Gors appropriate
 
Robotlab jupyter
Robotlab   jupyterRobotlab   jupyter
Robotlab jupyter
 
Fco open data in half day th-v2
Fco open data in half day  th-v2Fco open data in half day  th-v2
Fco open data in half day th-v2
 
Notes on the Future - ILI2015 Workshop
Notes on the Future - ILI2015 WorkshopNotes on the Future - ILI2015 Workshop
Notes on the Future - ILI2015 Workshop
 
Community Journalism Conf - hyperlocal data wire
Community Journalism Conf - hyperlocal data wireCommunity Journalism Conf - hyperlocal data wire
Community Journalism Conf - hyperlocal data wire
 
Residential school 2015_robotics_interest
Residential school 2015_robotics_interestResidential school 2015_robotics_interest
Residential school 2015_robotics_interest
 
Data Mining - Separating Fact From Fiction - NetIKX
Data Mining - Separating Fact From Fiction - NetIKXData Mining - Separating Fact From Fiction - NetIKX
Data Mining - Separating Fact From Fiction - NetIKX
 
Week4
Week4Week4
Week4
 
A Quick Tour of OpenRefine
A Quick Tour of OpenRefineA Quick Tour of OpenRefine
A Quick Tour of OpenRefine
 
Conversations with data
Conversations with dataConversations with data
Conversations with data
 
Data reuse OU workshop bingo
Data reuse OU workshop bingoData reuse OU workshop bingo
Data reuse OU workshop bingo
 
Inspiring content - You Don't Need Big Data to Tell Good Data Stories
Inspiring content - You Don't Need Big Data to Tell Good Data Stories Inspiring content - You Don't Need Big Data to Tell Good Data Stories
Inspiring content - You Don't Need Big Data to Tell Good Data Stories
 
Lincoln jun14datajournalism
Lincoln jun14datajournalismLincoln jun14datajournalism
Lincoln jun14datajournalism
 

Dernier

Plano de marketing- inglês em formato ppt
Plano de marketing- inglês  em formato pptPlano de marketing- inglês  em formato ppt
Plano de marketing- inglês em formato pptElizangelaSoaresdaCo
 
Michael Vidyakin: Introduction to PMO (UA)
Michael Vidyakin: Introduction to PMO (UA)Michael Vidyakin: Introduction to PMO (UA)
Michael Vidyakin: Introduction to PMO (UA)Lviv Startup Club
 
IIBA® Melbourne - Navigating Business Analysis - Excellence for Career Growth...
IIBA® Melbourne - Navigating Business Analysis - Excellence for Career Growth...IIBA® Melbourne - Navigating Business Analysis - Excellence for Career Growth...
IIBA® Melbourne - Navigating Business Analysis - Excellence for Career Growth...AustraliaChapterIIBA
 
Ethical stalking by Mark Williams. UpliftLive 2024
Ethical stalking by Mark Williams. UpliftLive 2024Ethical stalking by Mark Williams. UpliftLive 2024
Ethical stalking by Mark Williams. UpliftLive 2024Winbusinessin
 
The End of Business as Usual: Rewire the Way You Work to Succeed in the Consu...
The End of Business as Usual: Rewire the Way You Work to Succeed in the Consu...The End of Business as Usual: Rewire the Way You Work to Succeed in the Consu...
The End of Business as Usual: Rewire the Way You Work to Succeed in the Consu...Brian Solis
 
PDT 88 - 4 million seed - Seed - Protecto.pdf
PDT 88 - 4 million seed - Seed - Protecto.pdfPDT 88 - 4 million seed - Seed - Protecto.pdf
PDT 88 - 4 million seed - Seed - Protecto.pdfHajeJanKamps
 
Cracking the ‘Business Process Outsourcing’ Code Main.pptx
Cracking the ‘Business Process Outsourcing’ Code Main.pptxCracking the ‘Business Process Outsourcing’ Code Main.pptx
Cracking the ‘Business Process Outsourcing’ Code Main.pptxWorkforce Group
 
Personal Brand Exploration Presentation Eric Bonilla
Personal Brand Exploration Presentation Eric BonillaPersonal Brand Exploration Presentation Eric Bonilla
Personal Brand Exploration Presentation Eric BonillaEricBonilla13
 
ISONIKE Ltd Accreditation for the Conformity Assessment and Certification of ...
ISONIKE Ltd Accreditation for the Conformity Assessment and Certification of ...ISONIKE Ltd Accreditation for the Conformity Assessment and Certification of ...
ISONIKE Ltd Accreditation for the Conformity Assessment and Certification of ...ISONIKELtd
 
Developing Coaching Skills: Mine, Yours, Ours
Developing Coaching Skills: Mine, Yours, OursDeveloping Coaching Skills: Mine, Yours, Ours
Developing Coaching Skills: Mine, Yours, OursKaiNexus
 
Anyhr.io | Presentation HR&Recruiting agency
Anyhr.io | Presentation HR&Recruiting agencyAnyhr.io | Presentation HR&Recruiting agency
Anyhr.io | Presentation HR&Recruiting agencyHanna Klim
 
Boat Trailers Market PPT: Growth, Outlook, Demand, Keyplayer Analysis and Opp...
Boat Trailers Market PPT: Growth, Outlook, Demand, Keyplayer Analysis and Opp...Boat Trailers Market PPT: Growth, Outlook, Demand, Keyplayer Analysis and Opp...
Boat Trailers Market PPT: Growth, Outlook, Demand, Keyplayer Analysis and Opp...IMARC Group
 
Borderless Access - Global Panel book-unlock 2024
Borderless Access - Global Panel book-unlock 2024Borderless Access - Global Panel book-unlock 2024
Borderless Access - Global Panel book-unlock 2024Borderless Access
 
Q2 2024 APCO Geopolitical Radar - The Global Operating Environment for Business
Q2 2024 APCO Geopolitical Radar - The Global Operating Environment for BusinessQ2 2024 APCO Geopolitical Radar - The Global Operating Environment for Business
Q2 2024 APCO Geopolitical Radar - The Global Operating Environment for BusinessAPCO
 
Scrum Events & How to run them effectively
Scrum Events & How to run them effectivelyScrum Events & How to run them effectively
Scrum Events & How to run them effectivelyMarianna Nakou
 
AMAZON SELLER VIRTUAL ASSISTANT PRODUCT RESEARCH .pdf
AMAZON SELLER VIRTUAL ASSISTANT PRODUCT RESEARCH .pdfAMAZON SELLER VIRTUAL ASSISTANT PRODUCT RESEARCH .pdf
AMAZON SELLER VIRTUAL ASSISTANT PRODUCT RESEARCH .pdfJohnCarloValencia4
 
BCE24 | Virtual Brand Ambassadors: Making Brands Personal - John Meulemans
BCE24 | Virtual Brand Ambassadors: Making Brands Personal - John MeulemansBCE24 | Virtual Brand Ambassadors: Making Brands Personal - John Meulemans
BCE24 | Virtual Brand Ambassadors: Making Brands Personal - John MeulemansBBPMedia1
 
Upgrade Your Banking Experience with Advanced Core Banking Applications
Upgrade Your Banking Experience with Advanced Core Banking ApplicationsUpgrade Your Banking Experience with Advanced Core Banking Applications
Upgrade Your Banking Experience with Advanced Core Banking ApplicationsIntellect Design Arena Ltd
 
NewBase 25 March 2024 Energy News issue - 1710 by Khaled Al Awadi_compress...
NewBase  25 March  2024  Energy News issue - 1710 by Khaled Al Awadi_compress...NewBase  25 March  2024  Energy News issue - 1710 by Khaled Al Awadi_compress...
NewBase 25 March 2024 Energy News issue - 1710 by Khaled Al Awadi_compress...Khaled Al Awadi
 

Dernier (20)

Plano de marketing- inglês em formato ppt
Plano de marketing- inglês  em formato pptPlano de marketing- inglês  em formato ppt
Plano de marketing- inglês em formato ppt
 
Michael Vidyakin: Introduction to PMO (UA)
Michael Vidyakin: Introduction to PMO (UA)Michael Vidyakin: Introduction to PMO (UA)
Michael Vidyakin: Introduction to PMO (UA)
 
IIBA® Melbourne - Navigating Business Analysis - Excellence for Career Growth...
IIBA® Melbourne - Navigating Business Analysis - Excellence for Career Growth...IIBA® Melbourne - Navigating Business Analysis - Excellence for Career Growth...
IIBA® Melbourne - Navigating Business Analysis - Excellence for Career Growth...
 
Ethical stalking by Mark Williams. UpliftLive 2024
Ethical stalking by Mark Williams. UpliftLive 2024Ethical stalking by Mark Williams. UpliftLive 2024
Ethical stalking by Mark Williams. UpliftLive 2024
 
The End of Business as Usual: Rewire the Way You Work to Succeed in the Consu...
The End of Business as Usual: Rewire the Way You Work to Succeed in the Consu...The End of Business as Usual: Rewire the Way You Work to Succeed in the Consu...
The End of Business as Usual: Rewire the Way You Work to Succeed in the Consu...
 
PDT 88 - 4 million seed - Seed - Protecto.pdf
PDT 88 - 4 million seed - Seed - Protecto.pdfPDT 88 - 4 million seed - Seed - Protecto.pdf
PDT 88 - 4 million seed - Seed - Protecto.pdf
 
Cracking the ‘Business Process Outsourcing’ Code Main.pptx
Cracking the ‘Business Process Outsourcing’ Code Main.pptxCracking the ‘Business Process Outsourcing’ Code Main.pptx
Cracking the ‘Business Process Outsourcing’ Code Main.pptx
 
Personal Brand Exploration Presentation Eric Bonilla
Personal Brand Exploration Presentation Eric BonillaPersonal Brand Exploration Presentation Eric Bonilla
Personal Brand Exploration Presentation Eric Bonilla
 
WAM Corporate Presentation Mar 25 2024.pdf
WAM Corporate Presentation Mar 25 2024.pdfWAM Corporate Presentation Mar 25 2024.pdf
WAM Corporate Presentation Mar 25 2024.pdf
 
ISONIKE Ltd Accreditation for the Conformity Assessment and Certification of ...
ISONIKE Ltd Accreditation for the Conformity Assessment and Certification of ...ISONIKE Ltd Accreditation for the Conformity Assessment and Certification of ...
ISONIKE Ltd Accreditation for the Conformity Assessment and Certification of ...
 
Developing Coaching Skills: Mine, Yours, Ours
Developing Coaching Skills: Mine, Yours, OursDeveloping Coaching Skills: Mine, Yours, Ours
Developing Coaching Skills: Mine, Yours, Ours
 
Anyhr.io | Presentation HR&Recruiting agency
Anyhr.io | Presentation HR&Recruiting agencyAnyhr.io | Presentation HR&Recruiting agency
Anyhr.io | Presentation HR&Recruiting agency
 
Boat Trailers Market PPT: Growth, Outlook, Demand, Keyplayer Analysis and Opp...
Boat Trailers Market PPT: Growth, Outlook, Demand, Keyplayer Analysis and Opp...Boat Trailers Market PPT: Growth, Outlook, Demand, Keyplayer Analysis and Opp...
Boat Trailers Market PPT: Growth, Outlook, Demand, Keyplayer Analysis and Opp...
 
Borderless Access - Global Panel book-unlock 2024
Borderless Access - Global Panel book-unlock 2024Borderless Access - Global Panel book-unlock 2024
Borderless Access - Global Panel book-unlock 2024
 
Q2 2024 APCO Geopolitical Radar - The Global Operating Environment for Business
Q2 2024 APCO Geopolitical Radar - The Global Operating Environment for BusinessQ2 2024 APCO Geopolitical Radar - The Global Operating Environment for Business
Q2 2024 APCO Geopolitical Radar - The Global Operating Environment for Business
 
Scrum Events & How to run them effectively
Scrum Events & How to run them effectivelyScrum Events & How to run them effectively
Scrum Events & How to run them effectively
 
AMAZON SELLER VIRTUAL ASSISTANT PRODUCT RESEARCH .pdf
AMAZON SELLER VIRTUAL ASSISTANT PRODUCT RESEARCH .pdfAMAZON SELLER VIRTUAL ASSISTANT PRODUCT RESEARCH .pdf
AMAZON SELLER VIRTUAL ASSISTANT PRODUCT RESEARCH .pdf
 
BCE24 | Virtual Brand Ambassadors: Making Brands Personal - John Meulemans
BCE24 | Virtual Brand Ambassadors: Making Brands Personal - John MeulemansBCE24 | Virtual Brand Ambassadors: Making Brands Personal - John Meulemans
BCE24 | Virtual Brand Ambassadors: Making Brands Personal - John Meulemans
 
Upgrade Your Banking Experience with Advanced Core Banking Applications
Upgrade Your Banking Experience with Advanced Core Banking ApplicationsUpgrade Your Banking Experience with Advanced Core Banking Applications
Upgrade Your Banking Experience with Advanced Core Banking Applications
 
NewBase 25 March 2024 Energy News issue - 1710 by Khaled Al Awadi_compress...
NewBase  25 March  2024  Energy News issue - 1710 by Khaled Al Awadi_compress...NewBase  25 March  2024  Energy News issue - 1710 by Khaled Al Awadi_compress...
NewBase 25 March 2024 Energy News issue - 1710 by Khaled Al Awadi_compress...
 

Onlineinfo2012 - Scraping

Notes de l'éditeur

  1. Tony HirstTwitter:@psychemediaBlog: http://blog.ouseful.infoPresentation prepared for: Online Info 12/11/2012DATA LIBERATION: OPENING UP DATA BY HOOK OR BY CROOK - DATA SCRAPING, LINKAGE AND THE VALUE OF A GOOD IDENTIFIERThe 1/9/90 rule is often used to characterise the way in which a small number of creators generate content that a larger number (but still small percentage in the greater scheme of things) comment on or amplify, whilst the majority just passively consume. In this presentation, I will explore the extent to which a similar view applies to the world of "data liberation". After reviewing the idea of data scraping, and some of the techniques surrounding it, I will describe how online tools such as Scraperwiki provide a platform for concentrating data scraping activity and expertise, as well as supporting the publication of data /as data/ in a variety of formats, in addition to 'end user' views in the form of graphical charts and interactive visualisations.One of the major motivations for data scraping is the aggregation of data from a variety of data sources into a larger, integrated whole. For example, the aggregation of research council funding data from separate research councils allows us to view a large proportion of the publicly funded research grants received by a single institution; or the collection of local council spending data across all UK councils allows us to see how councils spend money with each other across a range of transaction areas. But how do we actually create such aggregations when the data is sourced from different areas? In order to do this, we need to know when different datasets are actually talking about the same thing, which is where common identifiers come in. For it is surely the case that when we have common identifiers, we can have linkage, and as a result start to realise some of the benefits of Linked Data (as well as developing a wider appreciation of what those benefits might actually be...) (As an aside, I'll describe how we might go about deriving such identifiers when they are missing from a data set that might otherwise, or more conveniently, be expected to publish them.)Throughout the presentation, I will draw on practical examples of how aggregated "liberated" data has been used as the basis of wider interest, and even status quo disrupting, services, as well as reflecting on what other sources of data we might see the data liberators turning their attention to next...Key learning points:1 - What is "data scraping", how can I do it and is my website at risk of it?2 - Why the secret to understanding "Linked Data" is the very idea of it, not just (or not even) the technology.3 - How has data scraping been used to "open up" data in actual practice?
  2. The focus on this presentation is not the release of “information”, but the release of data in raw form so that it can be interpreted and presented in informative ways by other parties.
  3. The London Datastore is an early example of a council-centric open data website. Early signs suggest it is natural to locate data websites at addresses of the form data.COUNCILNAME.gov.uk or www.COUNCILNAME.gov.uk/data
  4. Another example that demonstrates how CSV can be used to help data flow is demonstrated by Google Spreadsheets. The =importData formula allows a user to specify a source data URL, and pull the CSV data found at that location in to the spreadsheet. Unlike Many Eyes Wikified, if the source data at the URL is updated, the updated will (eventually) be pulled into the spreadsheet automatically.
  5. One of the really good reasons for getting data into a data processing environment such as a spreadsheet is that you can start to work it. In the case of Google Spreadsheets, the spreadsheet environment can also be used as a database environment. That is, we can treat one or more data containing sheets in a spreadsheet as a database, and generate new views over the data, as well as running queries over that data.
  6. Another way of using a Google Spreadsheet as a database is via the Google Spreadsheets API. The GoogleVisualisation API (?) provides a way of passing queries written using the Google ???viz query language from an arbitrary web page or web application, and receiving the resulting data in a standard JSON based format, which also happens to play nicely with the Google Visualisation API???The Guardian Datastore explorer is a crude demonstration for 2009(??) demonstrating how data from the Guardian datastore, data that is stored across a range of Google spreadsheets, can be explored , queried and visualised via these APIs. Users can select a dataset from a drop down menu, fed from a delicious account to which various datastore spreadsheets have been bookmarked using a particular set of tags, or by pasting in the URL of an arbitrary (public) Google spreadsheet. The first row/headings of the data can then be previewed (a simple spreadsheet is assumed, in which column headings appear In the first row of the spreadsheet).
  7. A series of list boxes are then populated with the column labels and there names, and provide a certain amount of help for the creation of a query over the spreadsheet data. A range of output formats can also be selected, from simple HTML data tables, to a range of charts. URLs are also generated for HTML and CSV representations of the data returned from the query.
  8. One of the nice things about the data table widget (a standard GoogleVisualisation API component in this case, though similar examples exist for YUI, the Yahoo User Interface Libraries, or frameworks such as JQuery), is that is supports things like row sorting by column, (for free – no programming required!), allowing even further manipulation of the data, albeit at a simplistic level.(It’s probably worth pointing out here that it may be worth providing a preview of the column headings and first few rows (or a sample of random rows) of data when datasets are published, just so that users can see what sort of data is on offer without having to download the whole data set?)
  9. If you’re in the business of selling information as data, you are under threat where that information is published in an openly licensed way.
  10. Linked Data – the TM is something of a joke and refers to the particular style of publishing data according to set of principles first outlined by the inventor of the World Wide Web, Sir Tim Berners Lee – is one of the data formats that the Government’s data task force favour for the publication of data.
  11. There is a problem though – at the moment, there are barriers to entry to Linked Data world from both the query side (not many people speak SPARQL, or know how to construct a SPARQL query to an endpoint) and the results side (data is returned as RDF).
  12. So – do you speak SPARQL?