Mapping French Open
Data actors on the web
with Common Crawl
guillaume.lebourgeois@data-publica.com
@glebourg
Mining the Web at Data Publica
Different needs, different techniques
   ● Scraping
   ● Focused crawling
   ● Prospective crawling
Mining the Web at Data Publica
Scraping
  ● Identified resources
  ● Configured extractors
  ● Structured content
  ● Not scalable
Mining the Web at Data Publica
Focused crawling
  ● Identified entities
  ● Fuzzy extraction
  ● Structured content using text-mining
  ● Scalable
  ● Useful to get meta information on known
    entities
Mining the Web at Data Publica
Prospective crawling
  ● No starting point
  ● Fuzzy extraction
  ● Structured content using text-mining
  ● Very hard to scale
  ● Heavy resources needed: CPU, RAM,
    HDD

Using a third party makes your life easier!
From a crawl to a map
Goal: build a map of French open data
actors on the web
  ● As a graph
  ● Showing websites
From a crawl to a map
Using Common Crawl
  ● Large web crawl archives fully accessible
  ● Good coverage of the French web
  ● Easy access via AWS / MapReduce jobs
From a crawl to a map
Working on the French web
 ● The .fr TLD alone is not a reliable way to detect French websites
 ● Detecting page language
 ● Giving websites a "frenchness" score
     ○ Sw = number of French pages / total number of pages
     ○ Cutoff chosen manually by testing on French
       websites
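The frenchness score can be sketched as follows. This is a minimal illustration rather than Data Publica's actual code: it assumes each crawled page has already been tagged with a language code by a separate detection step, and the `CUTOFF` value and function names are hypothetical (the slides only say the cutoff was chosen manually).

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical cutoff; the slides only say it was chosen manually by testing.
CUTOFF = 0.5


def frenchness_scores(pages):
    """Return {website: Sw}, where Sw = French pages / total pages.

    `pages` is an iterable of (url, language_code) pairs; the language code
    is assumed to come from a separate language-detection step.
    """
    french = defaultdict(int)
    total = defaultdict(int)
    for url, lang in pages:
        site = urlparse(url).netloc.lower()
        total[site] += 1
        if lang == "fr":
            french[site] += 1
    return {site: french[site] / total[site] for site in total}


def french_websites(pages, cutoff=CUTOFF):
    """Keep the websites whose frenchness score reaches the cutoff."""
    scores = frenchness_scores(pages)
    return {site for site, score in scores.items() if score >= cutoff}


pages = [
    ("http://example.fr/a", "fr"),
    ("http://example.fr/b", "fr"),
    ("http://example.fr/c", "en"),
    ("http://example.com/x", "en"),
    ("http://example.com/y", "fr"),
    ("http://example.com/z", "en"),
]
scores = frenchness_scores(pages)
selected = french_websites(pages)
```

Grouping by host name is itself an assumption: a real pipeline may prefer to group by registered domain so that subdomains count as one website.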
From a crawl to a map
Working on Open Data websites
 ● Building an Open Data "vocabulary"
 ● Detecting whether a page is about Open
    Data
 ● Giving websites an "opendataness" score
     ○ Sw = number of Open Data pages / total number of pages
     ○ Cutoff chosen manually by testing on Open Data
       websites
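A possible shape for the opendataness score, under stated assumptions: the term list in `VOCABULARY` and the `min_hits` rule are invented for illustration, since the slides do not give the real vocabulary or the page-level detection rule.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical vocabulary; the real term list is not given in the slides.
VOCABULARY = ("open data", "opendata", "données ouvertes", "données publiques")
_PATTERN = re.compile("|".join(re.escape(t) for t in VOCABULARY), re.IGNORECASE)


def is_open_data_page(text, min_hits=2):
    """Assumed rule: a page is about Open Data if it contains at least
    `min_hits` occurrences of vocabulary terms."""
    return len(_PATTERN.findall(text)) >= min_hits


def opendataness_scores(pages):
    """Return {website: Sw}, where Sw = Open Data pages / total pages.

    `pages` is an iterable of (url, page_text) pairs.
    """
    hits = defaultdict(int)
    total = defaultdict(int)
    for url, text in pages:
        site = urlparse(url).netloc.lower()
        total[site] += 1
        if is_open_data_page(text):
            hits[site] += 1
    return {site: hits[site] / total[site] for site in total}


pages = [
    ("http://portal.example/a", "Open Data portal: download opendata sets"),
    ("http://portal.example/b", "Contact us"),
    ("http://blog.example/a", "Cooking recipes"),
]
scores = opendataness_scores(pages)
```

As with the frenchness score, websites would then be kept or dropped by comparing `Sw` to a manually chosen cutoff.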
From a crawl to a map
Building graph
  ● Inside our subset
     ○ Inlinks
     ○ Outlinks
  ● Generating two files
     ○ nodes.csv (list of websites with an id)
     ○ edges.csv (directed links between websites)


[Figure: Node A with two inlinks and one outlink]
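The graph-building step can be sketched like this. It is an illustration only: the input format (pairs of source and target URLs), the function names, and the CSV column names are assumptions, not taken from the slides.

```python
import csv
from urllib.parse import urlparse


def build_graph(links, subset):
    """Return (nodes, edges) restricted to the websites in `subset`.

    `links` is an iterable of (source_url, target_url) pairs extracted from
    crawled pages. `nodes` maps website -> integer id; `edges` is a sorted
    list of (source_id, target_id) pairs, one per directed site-to-site link.
    """
    nodes = {site: i for i, site in enumerate(sorted(subset))}
    edges = set()
    for src_url, dst_url in links:
        src = urlparse(src_url).netloc.lower()
        dst = urlparse(dst_url).netloc.lower()
        # Keep only links inside the subset, and skip links within one site.
        if src in nodes and dst in nodes and src != dst:
            edges.add((nodes[src], nodes[dst]))
    return nodes, sorted(edges)


def write_csv(nodes, edges, nodes_path="nodes.csv", edges_path="edges.csv"):
    """Write the two files; the column names here are an assumption."""
    with open(nodes_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Label"])
        writer.writerows((i, site) for site, i in nodes.items())
    with open(edges_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target"])
        writer.writerows(edges)


links = [
    ("http://a.example/p1", "http://b.example/"),
    ("http://a.example/p2", "http://b.example/x"),
    ("http://b.example/", "http://c.example/"),
    ("http://a.example/", "http://outside.example/"),
    ("http://a.example/", "http://a.example/about"),
]
nodes, edges = build_graph(links, {"a.example", "b.example", "c.example"})
```

Several page-level links between the same two websites collapse into a single directed edge here; keeping a count as an edge weight would be an equally reasonable choice.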
From a crawl to a map
Building graph
  ● Links tell a lot about websites
     ○ Authorities
     ○ Hubs
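As a rough illustration of the two roles, the sketch below uses plain degree counts: in-degree as an authority signal and out-degree as a hub signal. This is a simplification chosen for brevity, not the HITS algorithm that formally defines hubs and authorities, and the slides do not say which measure was actually computed. The edge list is made-up sample data.

```python
from collections import Counter


def degree_scores(edges):
    """Return (in_degree, out_degree) counters keyed by node id.

    In-degree is used as a rough authority signal (many sites link to it),
    out-degree as a rough hub signal (it links to many sites).
    """
    in_degree = Counter(dst for _, dst in edges)
    out_degree = Counter(src for src, _ in edges)
    return in_degree, out_degree


# Made-up directed edges (source_id, target_id).
edges = [(0, 2), (1, 2), (3, 2), (3, 0), (3, 1)]
in_degree, out_degree = degree_scores(edges)
top_authority = in_degree.most_common(1)[0][0]
top_hub = out_degree.most_common(1)[0][0]
```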
From a crawl to a map
Visualizing graph using Gephi
  ● Load graph
  ● Spatialize graph
     ○ links between websites create "attraction", pulling
       linked sites closer together
     ○ the more inlinks, the bigger the node (= authority)
     ○ categorizing websites for better understanding (one
       color per category)
        ■ Companies, Non-profits/blogs, Government
           agencies
     ○ communities can now appear!
From a crawl to a map
Visualizing graph on the web
  ● Sigma.js
  ● Uses Gephi files
  ● Gives better interactivity
Analyze
● The final graph is a good way to understand
  interactions between actors
  ○ Open Data was clearly initiated by a non-profit
    movement
  ○ Companies are beginning to work on the subject
  ○ The French state has had only sporadic initiatives
    so far
● This graph is to be generated again in the near
  future, to see changes in this ecosystem
Results
● Large scale crawl made easy
  ○ Easy to focus on mining the results instead of
    finding/storing the data
● Nice workflow from raw data to an
  understandable visualisation
● The final graph is a good way to understand
  interactions between actors
Feedback
● Common Crawl
  ○ Common Crawl doesn't have an exhaustive crawl of
    the French web for now
  ○ Data is not as fresh as it could be
  ○ It is missing an index to access at least domains,
    and maybe pages, in O(1)
● Methodology
  ○ Opendataness scoring can exclude websites that are
    relevant but not focused enough on open data
Resources
● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf
   ○ poster by Franck Ghitalla
● http://french-opendata.data-publica.com/index.html
   ○ dynamic visualisation of the results, by Data Publica
● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation
   ○ A sneak peek into the web, by GL
● http://french-opendata.data-publica.com/
   ○ Project host page
