Information Extraction from the WWW using Machine Learning Techniques
Lee McCluskey, Dept of Informatics
email: lee@hud.ac.uk
Motivation
Overview of Talk
Information Extraction from the WWW – WHY?
Information Extraction from the WWW – WHY?
Information Extraction from The Web
“Natural Language Understanding”: take raw (English) text from a web page and turn it into some logic representing its meaning.
[Diagram scale: EASIER to HARDER]
Information Extraction from The Web
[Diagram: WEB PAGES, WRAPPERS, STRUCTURED DATA]
BA   red   555  sue
MSc  red   123  dave
PhD  grey  345  bill
BSc  blue  664  tom
Information Extraction
Example of Automated Extraction
Source: HTML  ======>  Destination: XML

Source (HTML):
    <h1> Residential Housing </h1>
    <ul> House For Sale
      <li> location: Hebden Bridge
      <li> agent-phone: 01422 843222
      <li> listed-price: £350,000
      <li> comments: Bijou residence on the edge of this popular little town...
    </ul>
    <hr>
    <ul> House For Sale ... </ul>
    ...

Destination (XML):
    <residential>
      <house>
        <location>
          <city> Hebden Bridge </city>
          <county> West Yorkshire </county>
          <country> UK </country>
        </location>
        <agent-phone> 01422 843222 </agent-phone>
        <listed-price> £350,000 </listed-price>
        <comments> Bijou residence on the edge of this popular little town... </comments>
      </house>
      ...
    </residential>

NB: XML + schema + recognised names wrapper
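The HTML-to-XML conversion on this slide can be sketched as a small hand-written wrapper. This is only an illustration, not the method used in the talk: the names `PAGE`, `HOUSE_RE`, `FIELD_RE` and `wrap` are hypothetical, the sample page is modelled on the slide's HTML, and the sketch assumes every field appears as a `<li> name: value` line inside a `<ul>` block. It emits `location` as a flat element, because the county and country shown in the slide's XML are not present in the HTML source.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample page, modelled on the slide's HTML source.
PAGE = """
<h1> Residential Housing </h1>
<ul> House For Sale
<li> location: Hebden Bridge
<li> agent-phone: 01422 843222
<li> listed-price: £350,000
<li> comments: Bijou residence on the edge of this popular little town...
</ul>
<hr>
"""

# Assumption: one <ul>...</ul> block per house,
# and one "<li> name: value" entry per field.
HOUSE_RE = re.compile(r"<ul>(.*?)</ul>", re.DOTALL)
FIELD_RE = re.compile(r"<li>\s*([\w-]+):\s*(.*?)\s*(?=<li>|$)", re.DOTALL)


def wrap(page: str) -> ET.Element:
    """Turn the listing page into a <residential> XML tree."""
    root = ET.Element("residential")
    for block in HOUSE_RE.findall(page):
        house = ET.SubElement(root, "house")
        for name, value in FIELD_RE.findall(block):
            # The HTML field label becomes the XML element name.
            ET.SubElement(house, name).text = value
    return root


print(ET.tostring(wrap(PAGE), encoding="unicode"))
```

A wrapper like this is tied to one page layout: if the site changes its markup, the two patterns have to be rewritten by hand, which is the kind of maintenance cost that motivates learning wrappers automatically.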
Information Extraction
Using ‘Rule Induction’ to learn wrappers for html pages
Rule Induction is an area of Machine Learning
[Diagram labels: Similarity-Based Learning; Explanation-Based Learning; Neural Networks; Learning from Examples; Learning by Observation; Rule Induction; Symbolic Learning; Sub-symbolic Learning; Genetic Approaches]
Rule Induction from Examples
Actual IE Example: University of Southern California’s Info Sciences Institute (ISI)’s “Information agent”
Heracles’ Stalker inductive algorithm
Example of training examples
Problems with Wrapper Induction
Summary
Extra Reading
Related Legal/Ethical/Professional/Methodological Issues

Editor's Notes

  1. 04/26/10 Points to make:
     1) XML is an extensible markup language for describing structured data.
     2) XML is similar to HTML in that:
        -- they are both markup languages, descending from SGML
        -- they both use tags
     3) XML differs from HTML in that:
        -- XML tags on data elements identify the meaning of the data, rather than specifying how the data should be formatted, as in HTML. XML therefore separates the three components of a document: content, structure, and presentation.
        -- relationships among data elements are expressed through simple nesting
     The example should make these points clear. It shows data from the same source, as published in HTML and in XML. Note that even though the XML document is more verbose, it provides the information in a far more convenient and usable format from a data management perspective.
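The data management point in this note can be illustrated with a short sketch, assuming the XML layout from the housing example. The names `XML_DOC` and `listings` are hypothetical and the document below is a trimmed copy of that example. Because the tags name the meaning of each value, a generic XML parser can pull out fields by name without any knowledge of page layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, trimmed from the housing example.
XML_DOC = """
<residential>
  <house>
    <location>
      <city>Hebden Bridge</city>
      <county>West Yorkshire</county>
      <country>UK</country>
    </location>
    <agent-phone>01422 843222</agent-phone>
    <listed-price>£350,000</listed-price>
  </house>
</residential>
"""


def listings(xml_text: str) -> list[dict]:
    """Return the city and numeric price of every <house> element."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for house in root.iter("house"):
        price_text = house.findtext("listed-price")
        result.append({
            # Nesting expresses the relationship: city belongs to location.
            "city": house.findtext("location/city"),
            # Assumption: prices look like "£350,000".
            "price": int(price_text.lstrip("£").replace(",", "")),
        })
    return result


print(listings(XML_DOC))
```

The same query against the HTML version would need layout-specific parsing of the `<li>` text, since HTML tags describe presentation rather than meaning.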