May 2011




MAKING THE GOV DATA OPEN
           MAREK SOTAK | ATOMIC ANT 

                     www.atomicant.co.uk
OH HAI!
ABOUT ME & ATOMIC ANT




Marek Sotak
 •   Web designer, developer
 •   From Prague, Czech Republic
 •   Over 5 years with Drupal - since v4.6
 •   Rootcandy admin theme
 •   Organising events - Drupal Design Camp, Local Meet-ups


 • @sotak on twitter
 • http://sotak.co.uk - personal blog/experiments


                        6 : 0                 2 : 1

atomicant.co.uk                                     #justsaying ;)
Atomic Ant



•   Based in London & Prague
•   Human interface design, training, branding, development
•   Clients all over the world
•   http://atomicant.co.uk
OPEN DATA?
HUH?




                  What is OPEN DATA?




Wikileaks Iraq war logs: every death mapped   http://bit.ly/iraqwarlogs




Don't eat at ____ http://donteat.at




DATA MINING - SCRAPING
LET'S GET DIRTY



BigClean.org – Prague




There's a lot of data lying around on the internet that can be
useful → crime reports, government reports, statistics,
missing pets registers, current affairs

However, it is sometimes published in a format such as PDF or
HTML, which you can't readily use for calculations,
visualizations or filtering.

Is it really that hard to publish something as CSV or XML?
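
To illustrate why a machine-readable format matters, here is a minimal Python sketch using only the standard library. The sample data, column names and the `total_reports` helper are invented for illustration. It suggests that once figures are available as CSV, filtering and summing them takes only a few lines.

```python
import csv
import io

# Hypothetical sample data: column names and values are made up for this sketch.
SAMPLE_CSV = """district,year,reports
Prague 1,2010,120
Prague 2,2010,80
Prague 1,2011,95
"""


def total_reports(csv_text, year):
    """Sum the 'reports' column over all rows that match the given year."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row["reports"]) for row in reader if int(row["year"]) == year)
```

Calling `total_reports(SAMPLE_CSV, 2010)` adds up the two 2010 rows. The same question asked of a PDF would first require extracting the table by hand or with a scraper.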




Ministry of the Interior – Czech Republic
Public Collections
- open what?




Request a site/content
→ Run through the HTML (DOM, selectors)
→ Do whatever you want with the data
→ Save the data
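
The four steps above can be sketched in Python with the standard library alone. This is only one possible shape, not the code used in the talk: it assumes the data sits in an HTML table, and the names `fetch`, `TableRows`, `scrape_table` and `to_csv` are hypothetical helpers made up for this example.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Step 1: request the page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class TableRows(HTMLParser):
    """Step 2: walk the HTML and collect the text of each <td>/<th>, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Steps 2 and 3: parse the HTML, then drop rows that contain no text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [row for row in parser.rows if any(row)]


def to_csv(rows):
    """Step 4: save the rows, here as CSV text that can be written to a file."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

A run would chain the steps as `to_csv(scrape_table(fetch(url)))`. Real pages usually need more specific selection than "every table row", which is where selector libraries such as the ones listed under Tools come in.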
SCRAPERWIKI
REFINE AND SCRAPE DATA




SCRAPERWIKI
WHAT IS IT? HOW TO USE IT



Scrape and link data using Ruby, Python and PHP scripts
that run maintenance-free in the cloud. Request data for
scoops and better decisions.




     Why would you want to use SCRAPERWIKI rather than
     other scraping tools or custom code?




 • The dataset is available to everyone
 • Anyone can access the data through an API
 • If the source changes and the scraper breaks, anyone can
   fix the scraper
 • Anyone can fork the scraper
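
A scraper can only be fixed quickly if it fails loudly when the source changes. The Python sketch below is one way to do that. It is not part of ScraperWiki, and the expected header, the `ScraperBroken` exception and the `check_rows` helper are assumptions made up for this example.

```python
# Hypothetical header the scraper expects to find in the first row.
EXPECTED_HEADER = ["Name", "Amount"]


class ScraperBroken(Exception):
    """Raised when scraped rows no longer match the expected layout."""


def check_rows(rows, expected_header=EXPECTED_HEADER):
    """Return the data rows, or raise ScraperBroken if the layout has changed."""
    if not rows:
        raise ScraperBroken("no rows found; the page layout may have changed")
    if rows[0] != expected_header:
        raise ScraperBroken(f"unexpected header: {rows[0]!r}")
    width = len(expected_header)
    bad = [row for row in rows[1:] if len(row) != width]
    if bad:
        raise ScraperBroken(f"{len(bad)} row(s) do not have {width} columns")
    return rows[1:]
```

Running a check like this before saving means a changed source page produces a clear error message rather than silently corrupted data, which makes it easier for someone else to step in and repair the scraper.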




IS THAT IT?
CERTAINLY NOT
GOOGLE REFINE
WHAT IS IT? HOW TO USE IT



Google Refine is a power tool for working with messy data,
cleaning it up, transforming it from one format into another,
extending it with web services,...
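
As a rough idea of what "cleaning up messy data" can involve, the Python sketch below groups spellings that differ only in case, spacing, punctuation, accents or word order. It is loosely modelled on the fingerprint keying idea described in Google Refine's clustering documentation, but it is a simplified approximation rather than Refine's actual implementation, and the function names are made up.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalised key so near-identical spellings collide."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop combining marks, so that accented letters match their plain forms.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation, then sort and de-duplicate the remaining words.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint; keep groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]
```

Each returned group is a candidate for merging into one canonical value. A person should still review the groups, since different things can share a key.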




VISUALISE
TELL THE STORY



There is more to it

It's not just values in a spreadsheet or database

Data can tell the story!




GOOGLE FUSION TABLES
WHAT IS IT? HOW TO USE IT



Easy visualisation http://tables.googlelabs.com/




SCRAPING WITH DRUPAL
AND NOW FOR SOMETHING COMPLETELY DIFFERENT



Feeds – http://drupal.org/project/feeds

Scraping
Feeds query path parser - project/feeds_querypath_parser
Feeds xpath parser – project/feeds_xpathparser

Cleaning up data
Feeds tamper - http://drupal.org/project/feeds_tamper
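
The Drupal modules above are configured through the site rather than coded by hand, so the following is only a Python illustration of what XPath-style selection does. The sample XML, its element names and the `extract` helper are invented. The split into one context expression plus one expression per field is an assumption about how such parsers are typically set up. Python's ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: element names and values are made up.
SAMPLE_XML = """<collections>
  <collection id="1"><name>Flood relief</name><region>Prague</region></collection>
  <collection id="2"><name>Animal shelter</name><region>Brno</region></collection>
</collections>"""


def extract(xml_text, context="./collection", fields=None):
    """Select each node matching `context`, then read one value per field path."""
    fields = fields or {"name": "name", "region": "region"}
    root = ET.fromstring(xml_text)
    return [
        {label: (node.findtext(path) or "").strip() for label, path in fields.items()}
        for node in root.findall(context)
    ]
```

Each dictionary returned corresponds to one imported item. A predicate such as `./collection[@id='2']` narrows the context to a single element.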




VISUALISE WITH DRUPAL
AND NOW FOR SOMETHING COMPLETELY DIFFERENT



Mapping
- Location – http://drupal.org/project/location
- Openlayers – http://drupal.org/project/openlayers
- Gmap – http://drupal.org/project/gmap


Graphs/Charts
- Graphs
- Graphs Charts
- Open Flash Chart
- Views



GO! SCRAPE IT!
CHALLENGE



EU Open Data Challenge
- €20,000 to win
- 28 days left to enter


http://opendatachallenge.org/




TOOLS
SCRAPING DATA



ScraperWiki – http://scraperwiki.com

PHP Simple HTML DOM – http://bit.ly/phphtmldom

PHPQuery - http://code.google.com/p/phpquery/

Open Data Kit - http://opendatakit.org/




TOOLS
CLEANING DATA



Google Refine - http://code.google.com/p/google-refine/




TOOLS
VISUALIZING DATA



Google Fusion Tables - http://tables.googlelabs.com/

The Best Tools for Visualization - http://rww.to/toolsforvis




OpenHeatmap http://bit.ly/openheatmap




THANK YOU
Q&A | LET'S CONNECT




                   QUESTIONS?




@sotak - twitter
http://sotak.co.uk - personal blog
http://atomicant.co.uk - company website


