In today's web
Information Extraction
from the Web
Benjamin Habegger
University of Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205
Seminar on Information Extraction from the Web
ENSIAS, Rabat, Morocco - June 19, 2013
About Me
@b_habegger
http://www.linkedin.com/in/benjaminhabegger
benjamin.habegger@insa-lyon.fr
Where is the web today?
Web of humans
● Interlinked documents
● Social Web
● Web 2.0
● Crowd-sourcing
Web of machines
● REST / API
● Service Interaction
● Open Data
● Semantic Web
Somehow we're creating 2 webs
Web of humans
● HTML
● JavaScript
● CSS
Web of Data
● RDF
● REST
● SPARQL
There are some interactions
Open data still has some way to go
Data thrown on the web in its original format
● Not many standardized formats
● Not many standardized semantics
● Can be
– An Excel or CSV file
– A REST service
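To make the "no standardized formats or semantics" point concrete, here is a minimal sketch of what a consumer has to do today: write one adapter per source to bring the data into a shared shape. The two samples, their field names, and the target schema (title, city) are assumptions made up for this illustration, not data from a real portal.

```python
import csv
import io
import json

# Hypothetical samples: the same kind of data as published by two sources.
CSV_EXPORT = "Titre;Ville\nWeb developer;Lyon\n"
JSON_RESPONSE = (
    '{"results": [{"job_title": "Data engineer", '
    '"location": {"city": "Lyon"}}]}'
)


def from_csv(text):
    """Map the CSV source's own column names onto the shared schema."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [{"title": row["Titre"], "city": row["Ville"]} for row in reader]


def from_json(text):
    """Map the REST-style JSON payload onto the same shared schema."""
    payload = json.loads(text)
    return [
        {"title": item["job_title"], "city": item["location"]["city"]}
        for item in payload["results"]
    ]


# Each source needs its own hand-written adapter before the data can be merged.
jobs = from_csv(CSV_EXPORT) + from_json(JSON_RESPONSE)
for job in jobs:
    print(job)
```

Every new source means another adapter like these, which is the cost that shared formats and vocabularies aim to remove.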
Still, Linked Open Data and the
Semantic Web are emerging
● Vocabularies
– FOAF
– Dublin Core
– …
● Datasets
– DBpedia
– ...
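As a rough illustration of what such vocabularies buy us, the sketch below describes a person with FOAF terms and a talk with Dublin Core terms, printed as N-Triples. The FOAF and Dublin Core namespaces are the published ones; the two example.org identifiers are hypothetical, and the small helper functions are written only for this example rather than taken from an RDF library.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical identifiers, chosen only for this example.
PERSON = "http://example.org/people/benjamin-habegger"
TALK = "http://example.org/talks/information-extraction-from-the-web"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples expects."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


triples = [
    (iri(PERSON), iri(RDF_TYPE), iri(FOAF + "Person")),
    (iri(PERSON), iri(FOAF + "name"), literal("Benjamin Habegger")),
    (iri(PERSON), iri(FOAF + "mbox"), iri("mailto:benjamin.habegger@insa-lyon.fr")),
    (iri(TALK), iri(DCTERMS + "title"), literal("Information Extraction from the Web")),
    (iri(TALK), iri(DCTERMS + "creator"), iri(PERSON)),
]

# One statement per line: subject, predicate, object, terminated by " ."
lines = [f"{s} {p} {o} ." for s, p, o in triples]
print("\n".join(lines))
```

Because the property names come from shared vocabularies, any consumer that knows FOAF and Dublin Core can interpret these statements without a source-specific adapter.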
But still, can't we dream a little?
Having (a little) smarter machines...
Shared web
Learning capabilities
Making our web robots smarter
could even help improve our web...
What does the following query give you today?
“lyon informatique emploi”
Do you see any jobs there?
Nope, just a listing of pages which
contain lists of jobs...
There's still a long way to go...
but information extraction from the web
is a small step toward making machines smarter
And there are many people
interested out there...
Freelancer.com search for web scraping
So where does information
extraction from the web fit in?
Open Data
Linked Data
Semantic Web
Information Extraction
Machine Learning
Pattern Mining
Data Integration
Standardized Vocabularies
Web Scraping
And what is it about?
...
Data for humans
Data for machines
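A minimal sketch of that step from data for humans to data for machines, using only Python's standard html.parser module. The HTML fragment, its class names, and the output fields are assumptions invented for this illustration; real job pages differ, and this is not the method presented in the second part of the talk.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: a job listing as a human would see it rendered.
PAGE = """
<ul class="jobs">
  <li><a href="/jobs/1">Web developer</a> <span class="city">Lyon</span></li>
  <li><a href="/jobs/2">Data engineer</a> <span class="city">Villeurbanne</span></li>
</ul>
"""


class JobExtractor(HTMLParser):
    """Turn each <li> of the listing into a record a program can use."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._field = None  # record field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            # Each list item starts a new, empty record.
            self.jobs.append({"title": "", "city": "", "url": ""})
        elif tag == "a" and self.jobs:
            self.jobs[-1]["url"] = attrs.get("href", "")
            self._field = "title"
        elif tag == "span" and attrs.get("class") == "city" and self.jobs:
            self._field = "city"

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None

    def handle_data(self, data):
        # Text is kept only while inside an element mapped to a field.
        if self._field and self.jobs:
            self.jobs[-1][self._field] += data.strip()


extractor = JobExtractor()
extractor.feed(PAGE)
jobs = extractor.jobs
for job in jobs:
    print(job)
```

The extraction rules here are written by hand and tied to one page layout, so they break as soon as the markup changes.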
How do we do that?
We'll see that after the break :)
http://www.slideshare.net/BenjaminHabegger/2013-06ensiasrabatiealg
