Mapping French Open Data actors on the web with Common Crawl
guillaume.lebourgeois@data-publica.com
@glebourg
Mining the Web at Data Publica
Different needs, different techniques
   ● Scraping
   ● Focused crawling
   ● Prospective crawling
Mining the Web at Data Publica
Scraping
  ● Identified resources
  ● Configured extractors
  ● Structured content
  ● Not scalable
Mining the Web at Data Publica
Focused crawling
  ● Identified entities
  ● Fuzzy extraction
  ● Structured content using text-mining
  ● Scalable
  ● Useful to get meta information on known
    entities
Mining the Web at Data Publica
Prospective crawling
  ● No starting point
  ● Fuzzy extraction
  ● Structured content using text-mining
  ● Very hard to scale
  ● Heavy resources needed: CPU, RAM, HDD

Using a third-party crawl makes life much easier!
From a crawl to a map
Goal: build a map of French open data actors on the web
  ● As a graph
  ● Showing websites
From a crawl to a map
Using Common Crawl
  ● Large web crawl archives fully accessible
  ● Good coverage of the French web
  ● Easy access via AWS / MapReduce jobs
From a crawl to a map
Working on the French web
 ● The .fr TLD alone is not a relevant signal for detecting French websites
 ● Detecting page language
 ● Giving websites a "frenchness" score
     ○ Sw = number of French pages / total number of pages
     ○ Cutoff chosen manually by testing on French websites
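
The scoring rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used at Data Publica: the stop-word test stands in for a real language detector, and the names `looks_french`, `site_scores`, `select_sites` and the thresholds are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Assumption: a tiny stop-word list as a stand-in for a real language detector.
FRENCH_STOPWORDS = {"le", "la", "les", "des", "est", "et", "une", "pour", "dans", "que"}


def looks_french(text, min_ratio=0.15):
    """Crude page-level language test: share of words that are French stop words."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for word in words if word in FRENCH_STOPWORDS)
    return hits / len(words) >= min_ratio


def site_scores(pages, is_match):
    """Sw per website: matching pages / total pages, for (url, text) pairs."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for url, text in pages:
        host = urlparse(url).netloc.lower()
        if not host:
            continue  # skip URLs without a host part
        totals[host] += 1
        if is_match(text):
            matches[host] += 1
    return {host: matches[host] / totals[host] for host in totals}


def select_sites(scores, cutoff):
    """Keep websites whose score reaches the manually chosen cutoff."""
    return {host for host, score in scores.items() if score >= cutoff}
```

Calling `select_sites(site_scores(pages, looks_french), cutoff)` returns the set of hosts considered French. The cutoff is left as a parameter because the slides say it was tuned by hand.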
From a crawl to a map
Working on Open Data websites
 ● Building an Open Data "vocabulary"
 ● Detecting whether a page is about Open Data
 ● Giving websites an "opendataness" score
     ○ Sw = number of Open Data pages / total number of pages
     ○ Cutoff chosen manually by testing on Open Data websites
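
A minimal sketch of vocabulary-based detection follows. The slides do not list the actual vocabulary or the matching rule, so the terms in `OPEN_DATA_VOCABULARY`, the `min_hits` threshold, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical vocabulary: the real term list is not given in the slides.
OPEN_DATA_VOCABULARY = [
    "open data",
    "opendata",
    "données ouvertes",
    "données publiques",
    "jeu de données",
]


def is_open_data_page(text, vocabulary=OPEN_DATA_VOCABULARY, min_hits=2):
    """Flag a page when vocabulary terms occur at least `min_hits` times in total."""
    lowered = text.lower()
    hits = sum(len(re.findall(re.escape(term), lowered)) for term in vocabulary)
    return hits >= min_hits


def opendataness(pages):
    """Sw per website: Open Data pages / total pages, for (url, text) pairs."""
    total = Counter()
    matching = Counter()
    for url, text in pages:
        host = urlparse(url).netloc.lower()
        total[host] += 1
        if is_open_data_page(text):
            matching[host] += 1
    return {host: matching[host] / total[host] for host in total}
```

Requiring several hits rather than one is a design choice to avoid flagging pages that mention open data only in passing. A website is then kept when its score reaches the hand-picked cutoff.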
From a crawl to a map
Building graph
  ● Inside our subset
     ○ Inlinks
     ○ Outlinks
  ● Generating two files
     ○ nodes.csv (list of websites with an id)
     ○ edges.csv (directed links between websites)

[Figure: Node A with two inlinks and one outlink]
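
The two files could be produced along the following lines. This is a sketch assuming page-level links are available as (source URL, target URL) pairs. The column headers `Id`, `Label`, `Source` and `Target` follow the convention commonly used for Gephi's spreadsheet import; the slides do not specify them, and the function names are hypothetical.

```python
import csv
from urllib.parse import urlparse


def build_graph(page_links, selected_hosts):
    """Collapse page-level links into directed website-level edges.

    Only links whose two endpoints are both in the selected subset are kept,
    and self-links within one website are dropped.
    """
    ids = {host: index for index, host in enumerate(sorted(selected_hosts))}
    edges = set()
    for source_url, target_url in page_links:
        source = urlparse(source_url).netloc.lower()
        target = urlparse(target_url).netloc.lower()
        if source in ids and target in ids and source != target:
            edges.add((ids[source], ids[target]))
    return ids, sorted(edges)


def write_graph(ids, edges, nodes_path="nodes.csv", edges_path="edges.csv"):
    """Write nodes.csv (id, website) and edges.csv (directed id pairs)."""
    with open(nodes_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Id", "Label"])
        for host, node_id in sorted(ids.items(), key=lambda item: item[1]):
            writer.writerow([node_id, host])
    with open(edges_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Source", "Target"])
        writer.writerows(edges)
```

Storing edges in a set deduplicates repeated links between the same pair of websites. Keeping a link count as an edge weight would be a natural variation.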
From a crawl to a map
Building graph
  ● Links tell a lot about websites
     ○ Authorities
     ○ Hubs
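
One standard way to quantify authorities and hubs is the HITS algorithm, sketched below as a plain power iteration. The slides do not say HITS was used (a later slide simply sizes nodes by inlink count), so this is an illustration of the concept only; the function name and iteration count are assumptions.

```python
from math import sqrt


def hits(edges, iterations=50):
    """Return (authority, hub) scores for a directed edge list of (source, target)."""
    edges = list(edges)
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    authority = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the nodes linking in.
        authority = {node: 0.0 for node in nodes}
        for source, target in edges:
            authority[target] += hub[source]
        norm = sqrt(sum(value * value for value in authority.values())) or 1.0
        authority = {node: value / norm for node, value in authority.items()}
        # Hub update: sum of authority scores of the nodes linked to.
        hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            hub[source] += authority[target]
        norm = sqrt(sum(value * value for value in hub.values())) or 1.0
        hub = {node: value / norm for node, value in hub.items()}
    return authority, hub
```

A website that many good hubs link to gets a high authority score, and a website linking to many good authorities gets a high hub score. On small graphs the ranking by authority often resembles the ranking by raw inlink count.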
From a crawl to a map
Visualizing graph using Gephi
  ● Load graph
  ● Spatialize graph
     ○ links between websites create "attraction", pulling
       linked sites close to each other
     ○ the more inlinks, the bigger the node (= authority)
     ○ categorizing websites for better understanding (one
       color per category)
        ■ Companies, Non-profits/blogs, Government
           agencies
     ○ communities can now appear!
From a crawl to a map
Visualizing graph on the web
  ● Sigma.js
  ● Uses Gephi files
  ● Gives better interactivity
Analyze
● The final graph is a good way to understand
  interactions between actors
  ○ Open Data was clearly initiated by the non-profit
    movement
  ○ Companies are beginning to work on the subject
  ○ The French state has so far launched only sporadic
    initiatives
● This graph will be generated again in the near
  future, to see changes in this ecosystem
Results
● Large scale crawl made easy
  ○ Easy to focus on mining the results instead of
    finding/storing the data
● Nice workflow from raw data to an
  understandable visualisation
● The final graph is a good way to understand
  interactions between actors
Feedback
● Common Crawl
  ○ Common Crawl does not yet have an exhaustive crawl
    of the French web
  ○ Data is not as fresh as it could be
  ○ It lacks an index giving O(1) access to domains at
    least, and ideally to pages
● Methodology
  ○ Opendataness scoring can exclude relevant websites
    that are not focused enough on open data
Resources
● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf
   ○ poster by Franck Ghitalla
● http://french-opendata.data-publica.com/index.html
   ○ dynamic visualisation of the results, by Data Publica
● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation
   ○ A sneak peek into the web, by GL
● http://french-opendata.data-publica.com/
   ○ Project host page
