Data and Information
                        Extraction on the Web
                           Gestione delle Informazioni su Web - 2009/2010
                                            Tommaso Teofili
                                    tommaso [at] apache [dot] org




Monday, 12 April 2010
Agenda
                        Search

                          Goals

                          Problems

                        Data extraction

                        Information extraction

                        Mixing things together


Search - Goals

                        Find what we are looking for

                          Quickly

                          Easily

                        Get suggestions for other interesting,
                        related content

                        Turn results into useful knowledge


What are you looking for?
Problems when googling


                         Where to search for what we are looking for

                         How to write good queries (e.g., how should the
                         terms relate to each other?)

                         How to tell whether a query is good




Search sources


                        Redundant, heterogeneous, widespread,
                        public, noisy, free, sometimes standard,
                        semi-structured, linked, reachable...

                        in short:   the Web


Focused search sources

                         Target the sources that matter for the
                         desired domain

                         Where possible, filter out the noisy and
                         fragmented ones

                         Choose the most standard and
                         well-structured ones



Fragmented sources
Structured sources
Data extraction

                        Automatically collect data from the Web

                        Crawl data from domain specific sources

                        Aggregate homogeneous data (e.g., using
                        equivalence classes)

                        Save (portions of downloaded) data to a
                        convenient separate storage (DB, file system,
                        repository, etc.)


Data extraction - Crawling

                        From scratch (good luck!)

                        Leveraging existing facilities (wget, HtmlUnit,
                        Selenium, Apache HttpClient, Ning’s Async
                        HttpClient, etc.)

                        Playing with existing projects (RoadRunner,
                        Webpipe, Apache Nutch, etc.)



Data extraction - HttpClient
Data extraction - HtmlUnit
Data extraction - Aggregating

                        Downloaded resources can be assigned to
                        equivalence classes

                        The crawling process inherently defines the page
                        classes to which pages are automatically assigned

                        Relations between page classes

                        RoadRunner, Webpipe, etc.
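
The idea of assigning downloaded resources to equivalence classes can be sketched as follows. This is only a minimal illustration, not how RoadRunner or Webpipe work: those tools infer classes from page structure, whereas this sketch simply matches URL paths against hand-written patterns. The site layout (`/teams/...`), the class names, and the helper functions are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical URL patterns, one regular expression per page class.
# A real site would need its own patterns (or a structural approach).
PAGE_CLASS_PATTERNS = {
    "teams_index": re.compile(r"^/teams/?$"),
    "team": re.compile(r"^/teams/[^/]+/?$"),
    "player": re.compile(r"^/teams/[^/]+/players/[^/]+/?$"),
    "coach": re.compile(r"^/teams/[^/]+/coach/?$"),
}


def classify(path):
    """Return the name of the first page class whose pattern matches, or None."""
    for name, pattern in PAGE_CLASS_PATTERNS.items():
        if pattern.match(path):
            return name
    return None


def group_by_page_class(paths):
    """Group crawled URL paths into equivalence classes (page classes)."""
    classes = defaultdict(list)
    for path in paths:
        classes[classify(path) or "unclassified"].append(path)
    return dict(classes)


groups = group_by_page_class([
    "/teams",
    "/teams/italy",
    "/teams/italy/players/totti",
    "/teams/italy/coach",
    "/about",
])
```

Pages that match no pattern end up in an "unclassified" bucket, which makes it easy to spot parts of the site the patterns do not cover yet.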



Data extraction - EC (equivalence classes)

                        Example page classes:

                          “team indexes” class

                          “teams” class

                          “players” class

                          “coaches” class


Data extraction - Relevance


                        What do we really need?

                          It depends on the specific domain

                        Not every page in every class is relevant

                        We may be interested in only a subset of
                        the page classes found



Data extraction - Example



                         We may be interested
                         in retrieving only
                         information regarding
                         players (Player class)




Data extraction - Problems
                        Unavailable pages and unexpected server responses
                        (HTTP 404, 403, 303, etc.)

                        Security and bandwidth filters (don’t get your crawler
                        machine’s IP banned!)

                        Client-side limits (memory and storage space are
                        unlimited only in theory)

                        Encoding

                        Legal issues

                        ...
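
Two of the problems above, failing responses and encoding, can be handled defensively along these lines. This is a rough sketch under stated assumptions: `fetch` is a hypothetical callable returning a `(status, body)` pair (it could wrap any HTTP client), the set of status codes treated as retryable is a choice made for this example, and the fallback order of encodings is just one reasonable option.

```python
import time

# Assumption for this sketch: only these statuses are worth retrying.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body), backing off between retries.

    Returns the body on HTTP 200, or None if the page cannot be retrieved.
    """
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in RETRYABLE_STATUSES:
            return None  # e.g. 404 or 403: retrying will not help
        if attempt < max_attempts - 1:
            # Exponential backoff keeps the crawler polite towards the server.
            sleep(base_delay * 2 ** attempt)
    return None


def decode_body(raw, charset=None):
    """Decode bytes using the declared charset, then UTF-8, then Latin-1."""
    for encoding in (charset, "utf-8"):
        if encoding:
            try:
                return raw.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                continue
    # Latin-1 maps every byte to a character, so this never fails
    # (though the result may be wrong for other encodings).
    return raw.decode("latin-1")
```

Passing `sleep` as a parameter is only there to make the helper easy to test without actually waiting.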


From Data to Information
Data vs Information
                        Data                    Information

                          Raw                     Clean

                          Semi-structured         Structured

                          Mixed content           Focused

                          Immutable               Managed

                          Navigation oriented     Domain oriented



From Data to Information


                        We have crawled a lot of data

                        We may already have some rough structure
                        (page classes and relations)

                        We want to pick only what we need




Information extraction - Pruning

                        We want to filter out at least:

                          Banners, advertisement, etc.

                          Headers/Footers

                          Navigation bars/Search boxes

                          Everything else unrelated to the content

                        We may use XPath
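
A pruning step based on XPath might look like the sketch below. It assumes the page is well-formed XML (real HTML usually needs a tolerant parser first) and uses the limited XPath subset of Python's standard `xml.etree.ElementTree`. The sample page, the `id`/`class` values, and the `prune` helper are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html><body>
  <div id="header">Site header</div>
  <div class="nav">Home | Teams</div>
  <div id="content"><h1>Francesco Totti</h1><p>Forward</p></div>
  <div class="banner">Buy now!</div>
  <div id="footer">Copyright</div>
</body></html>"""

# XPath expressions selecting the boilerplate we want to drop.
PRUNE_XPATHS = [
    ".//div[@id='header']",
    ".//div[@id='footer']",
    ".//div[@class='nav']",
    ".//div[@class='banner']",
]


def prune(root, xpaths):
    """Remove every element matched by one of the XPath expressions."""
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for xpath in xpaths:
        for node in root.findall(xpath):
            parents[node].remove(node)
    return root


root = prune(ET.fromstring(PAGE), PRUNE_XPATHS)
remaining_ids = [div.get("id") for div in root.iter("div")]
```

After pruning, only the `content` block is left, which is what the later extraction steps work on.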


Information extraction

                        Once we have extracted the content,

                        we want to get useful information
                        from it -> knowledge

                        Look for matches between the extracted
                        data and our domain model




Information extraction - Example

                        Navigate XML (HTML DOM) nodes with XPath

                        Navigate content and find specific
                        “parts” (nodes or sub-trees)

                        Tag such “parts” as objects or properties
                        inside a (specific) domain model

                        May need to traverse the DOM multiple
                        times


Information extraction - Name
Information extraction - Date of Birth
Information extraction - Team
Information extraction - Example


                        A Player (taken from the Player page class)

                        with a name, a date of birth, and a team

                        We now know that “Francesco Totti” is a Player
                        of the “Italy” team and was born on “27/09/1976”

                        We can apply the same XPaths to every page in the
                        Player page class and get information about each player
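
The Player example can be sketched as a tiny wrapper: one XPath per property of the domain object, applied to a page of the Player class. The page markup, the XPath expressions, and the `Player` dataclass below are assumptions made for illustration; only the extracted values come from the example above. As before, the page is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    date_of_birth: str
    team: str


# Hypothetical XPaths, one per Player property.
PLAYER_XPATHS = {
    "name": ".//div[@id='content']/h1",
    "date_of_birth": ".//span[@class='dob']",
    "team": ".//a[@class='team']",
}


def extract_player(page_xml):
    """Apply each XPath to the page and build a Player from the matched text."""
    root = ET.fromstring(page_xml)
    fields = {}
    for field, xpath in PLAYER_XPATHS.items():
        text = root.findtext(xpath)
        if text is None:
            raise ValueError(f"no match for {field!r} using {xpath!r}")
        fields[field] = text.strip()
    return Player(**fields)


# Hypothetical page from the Player page class.
PAGE = """<html><body><div id="content">
  <h1>Francesco Totti</h1>
  <p>Born: <span class="dob">27/09/1976</span></p>
  <p>Team: <a class="team" href="/teams/italy">Italy</a></p>
</div></body></html>"""

player = extract_player(PAGE)
```

Raising an error when an XPath matches nothing is a deliberate choice here: a silent miss is an easy way to end up with corrupted extracted data when the source layout changes.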



Information extraction - Wrapper


                        Context navigation

                           RoadRunner

                           Webpipe

                        Statistical analysis

                           ExAlg

                        Other...



Information extraction - Problems



                        Poorly structured sources

                        Frequently changing sources

                        False positives

                        Corrupted extracted data




False positives
Information extraction - Relevance


                        Using wrappers we can get a lot of
                        information

                        We could rank what is relevant in:

                          the “page” context

                          the domain model

                        for efficiency and “reasoning” purposes


Information extraction - Metadata


                        Stream extracted information into our
                        domain model

                        Extracted information -> Metadata

                        Populated domain objects contain

                          interesting semantics

                          relations


Store Metadata
                        DB (with classic relational schema)

                        Filesystem (XML)

                        Key-Value repository

                        Index

                        Triple Store

                        ...


Query enriched data

                        Exploit the semantics of the acquired metadata to
                        build SQL-like queries (using the attributes and
                        relations of our domain model) over previously
                        unstructured data

                        Extract hidden knowledge by querying
                        aggregated metadata




Sample queries
                        Get “young players”

                          SELECT * FROM giocatore g
                          WHERE g.dob > '1993-01-01'

                        Aggregate queries

                          Find the average age in each team

                          Find the average age of World Cup
                          players
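
The sample queries could be run against a relational store roughly as follows. This is a sketch using SQLite from the Python standard library. It assumes dates of birth are stored as ISO 8601 strings, uses the `giocatore` ("player") table name from the query above with a made-up schema, and fixes a reference date so the ages are reproducible. Apart from Francesco Totti, the rows are placeholders.

```python
import sqlite3

REFERENCE_DATE = "2010-04-12"  # assumption: ages are computed as of this date

conn = sqlite3.connect(":memory:")
# Hypothetical schema; dob is stored as an ISO 8601 string (YYYY-MM-DD).
conn.execute("CREATE TABLE giocatore (name TEXT, team TEXT, dob TEXT)")
conn.executemany(
    "INSERT INTO giocatore VALUES (?, ?, ?)",
    [
        ("Francesco Totti", "Italy", "1976-09-27"),
        ("Player B", "Italy", "1993-06-01"),   # placeholder row
        ("Player C", "Team X", "1990-04-12"),  # placeholder row
        ("Player D", "Team X", "1980-04-12"),  # placeholder row
    ],
)

# "Young players": ISO 8601 strings compare correctly as plain text.
young = conn.execute(
    "SELECT name FROM giocatore WHERE dob > ? ORDER BY name",
    ("1993-01-01",),
).fetchall()

# Approximate age in years: elapsed days divided by 365.25.
AGE_EXPR = "(julianday(?) - julianday(dob)) / 365.25"

# Average age in each team.
avg_age_by_team = conn.execute(
    f"SELECT team, AVG({AGE_EXPR}) FROM giocatore GROUP BY team ORDER BY team",
    (REFERENCE_DATE,),
).fetchall()

# Average age over all players.
avg_age_overall = conn.execute(
    f"SELECT AVG({AGE_EXPR}) FROM giocatore", (REFERENCE_DATE,)
).fetchone()[0]
```

Dividing by 365.25 gives only an approximate age; a store with native date arithmetic would allow an exact computation.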


Information extraction
                             on the Web
References
                        http://www.w3.org/TR/xpath/

                        http://www.w3.org/DOM/

                        http://www.dia.uniroma3.it/db/roadRunner/

                        http://www.slideshare.net/n0on3/exalg-overview

                        http://www.ricercaitaliana.it/prin/unita_op-2006093591_002.htm

                        http://incubator.apache.org/uima/downloads/releaseDocs/2.3.0-incubating/docs/html/overview_and_setup/overview_and_setup.html

                        http://en.wikipedia.org/wiki/Web_scraping

                        http://www.alchemyapi.com/api/scrape/




Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic Search
 

Dernier

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 

Dernier (20)

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 

Data and Information Extraction on the Web

  • 1. Data and Information Extraction on the Web. Gestione delle Informazioni su Web - 2009/2010. Tommaso Teofili, tommaso [at] apache [dot] org. Monday, 12 April 2010.
  • 2. Agenda: Search (goals, problems). Data extraction. Information extraction. Mixing things together.
  • 3. Search - Goals: Find what we are looking for, quickly and easily. Have suggestions on other interesting related stuff. Turn results into useful knowledge.
  • 4. What are you looking for?
  • 5. Problems when googling: Where to search for what we are looking for. How to write good queries (e.g. relations between terms?). How to evaluate when a query is good.
  • 6. Search sources: Redundant, inhomogeneous, widespread, public, noisy, free, sometimes standard, semi-structured, linked, reachable... In one word: the Web.
  • 7. Focused search sources: Address interesting sources for the desired domain. Where possible, filter out the unclean and fragmented ones. Choose the most standard and well-structured ones.
  • 10. Data extraction: Automatically collect data from the Web. Crawl data from domain-specific sources. Aggregate homogeneous data (i.e.: using equivalence classes). Save (portions of downloaded) data to a convenient separate storage (DB, file system, repository, etc.).
  • 11. Data extraction - Crawling: From scratch (good luck!). Leveraging existing facilities (wget, HtmlUnit, Selenium, Apache HttpClient, Ning’s Async HttpClient, etc.). Playing with existing projects (RoadRunner, Webpipe, Apache Nutch, etc.).
  • 12. Data extraction - HttpClient
  • 13. Data extraction - HtmlUnit
  • 14. Data extraction - Aggregating: Downloaded resources can be assigned to equivalence classes. The crawling process inherently defines the page classes to which pages automatically belong. Relations between page classes. RoadRunner, Webpipe, etc.
  • 15. Data extraction - EC
  • 16. Data extraction - EC: “teams indexes” class, “teams” class, “players” class, “coaches” class.
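
The idea of assigning crawled pages to equivalence classes can be sketched as follows. This is only an illustration under a strong assumption that is not in the slides: pages of the same class are assumed to share a URL template, so a URL segment containing a digit is treated as an instance id. Tools such as RoadRunner look at page structure rather than URLs. The `example.org` URLs and the function names are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def page_class_key(url: str) -> str:
    """Reduce a URL to a template that stands for its page class."""
    segments = [seg for seg in urlparse(url).path.split("/") if seg]
    # Assumption: a segment containing a digit identifies one instance
    # (one team, one player); every other segment is part of the template.
    template = ["*" if re.search(r"\d", seg) else seg for seg in segments]
    return "/" + "/".join(template)


def group_by_page_class(urls):
    """Map each template to the list of URLs that belong to it."""
    classes = defaultdict(list)
    for url in urls:
        classes[page_class_key(url)].append(url)
    return dict(classes)


# Hypothetical crawl result for a football site.
urls = [
    "http://example.org/teams/index.html",
    "http://example.org/teams/12",
    "http://example.org/teams/37",
    "http://example.org/players/1024",
    "http://example.org/coaches/7",
]
page_classes = group_by_page_class(urls)
```

Here both team pages end up under the `/teams/*` key, while the index page, the player page and the coach page each get a class of their own.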
  • 17. Data extraction - Relevance: What do we really need? It depends on the specific domain. Not all pages in all classes are necessarily relevant. We could be interested only in a subset of the found page classes.
  • 18. Data extraction - Example: We may be interested in retrieving only information regarding players (Player class).
  • 19. Data extraction - Problems: Server unavailability (HTTP 404, 403, 303, etc.). Security and bandwidth filters (don’t get your crawler machine IP banned!). Client unavailability (memory and storage space are unlimited only in theory). Encoding. Legal issues. ...
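
Some of the problems listed on slide 19 can be handled defensively in the fetching code. The sketch below uses only the Python standard library, not the Java tools named in the slides. The retry count, the delay, the timeout and the fallback to UTF-8 are assumptions chosen for illustration, and `fetch` is a hypothetical helper name.

```python
import time
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen, retries=3, delay=1.0,
          sleep=time.sleep):
    """Return the decoded page body, or None if the page cannot be fetched."""
    for attempt in range(retries):
        try:
            with opener(url, timeout=10) as response:
                raw = response.read()
                # Encoding: trust the declared charset, fall back to UTF-8
                # (assumption), and never fail on undecodable bytes.
                charset = response.headers.get_content_charset() or "utf-8"
                return raw.decode(charset, errors="replace")
        except urllib.error.HTTPError as err:
            if err.code < 500:
                return None  # e.g. 403 or 404: retrying will not help
        except urllib.error.URLError:
            pass  # network-level failure: worth another attempt
        if attempt < retries - 1:
            # Back off between attempts so the crawler does not hammer
            # the server and get its IP banned.
            sleep(delay * (2 ** attempt))
    return None
```

A call such as `fetch("http://example.org/players/1024")` returns the page text or `None`. The `opener` and `sleep` parameters exist only so that the function can be exercised without network access.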
  • 20. From Data to Information
  • 21. Data vs Information (Data / Information): Rough / Clean. Semi-structured / Structured. Mixed content / Focused. Immutable / Managed. Navigation oriented / Domain oriented.
  • 22. From Data to Information: We have crawled a lot of data. We possibly have some rough structure (page classes and relations). We want to pick only what we need.
  • 23. Information extraction - Pruning: We want to filter out at least: banners, advertisements, etc.; headers/footers; navigation bars/search boxes; everything else not related to content. We may use XPath.
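
A minimal pruning sketch with XPath, assuming the page is well-formed XHTML so that Python's `xml.etree.ElementTree` can parse it (real pages usually need an HTML-tolerant parser first). The tag names, the `banner` class and the sample page are hypothetical. ElementTree supports only a subset of XPath, which is enough here.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath expressions matching the noise listed on the slide.
NOISE_XPATHS = [
    ".//header",
    ".//footer",
    ".//nav",
    ".//div[@class='banner']",
]


def prune(root):
    """Remove every sub-tree matched by NOISE_XPATHS, in place."""
    # ElementTree elements do not know their parent, so build a lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for xpath in NOISE_XPATHS:
        for element in root.findall(xpath):
            parent_of[element].remove(element)
    return root


page = ET.fromstring("""
<html><body>
  <header>Site logo</header>
  <nav><a href="/teams">Teams</a></nav>
  <div class="banner">Buy tickets!</div>
  <div id="content"><h1>Francesco Totti</h1></div>
  <footer>Contacts</footer>
</body></html>
""")
prune(page)
remaining_text = "".join(page.itertext())
```

After pruning, only the `content` block is left under `body`, so `remaining_text` contains the player name but none of the banner, navigation or footer text.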
  • 24. Information extraction - Pruning
  • 25. Information extraction - Pruning
  • 26. Information extraction: Once we have extracted content, we are interested in getting useful information from it -> knowledge. Look for matches between extracted data and our domain model.
  • 27. Information extraction - Example: Navigate XML (HTML DOM) nodes with XPath. Navigate content and find specific “parts” (nodes or sub-trees). Tag such “parts” as objects or properties inside a (specific) domain model. We may need to traverse the DOM multiple times.
  • 28. Information extraction - Name
  • 29. Information extraction - Date of Birth
  • 30. Information extraction - Team
  • 31. Information extraction - Example: A Player (taken from the Player page class) with name, date of birth and belonging to a team. We now know that “Francesco Totti” is a Player of the “Italy” team and was born on “27/09/1976”. We can apply such XPaths to all PageClass instances and get information about each player.
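
The Player example can be sketched as one XPath per property of the domain object. The markup of the player page and the three XPath expressions below are hypothetical, since the slides show them only as images. The page is assumed to be well-formed XHTML, and the `Player` class is an illustrative stand-in for the domain model.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    date_of_birth: str
    team: str


# One (hypothetical) XPath per Player property.
PLAYER_XPATHS = {
    "name": ".//h1[@class='name']",
    "date_of_birth": ".//span[@class='dob']",
    "team": ".//a[@class='team']",
}


def extract_player(page):
    """Build a Player from a page, or return None if a field is missing."""
    fields = {}
    for field, xpath in PLAYER_XPATHS.items():
        text = page.findtext(xpath)
        if text is None:
            return None  # page does not match the Player page class
        fields[field] = text.strip()
    return Player(**fields)


page = ET.fromstring("""
<div id="content">
  <h1 class="name">Francesco Totti</h1>
  <p>Born: <span class="dob">27/09/1976</span></p>
  <p>Team: <a class="team" href="/teams/italy">Italy</a></p>
</div>
""")
player = extract_player(page)
```

Applying `extract_player` to every page of the Player page class yields one domain object per player. Pages where an XPath finds nothing are skipped, which is a simple guard against the structural changes mentioned among the problems.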
  • 32. Information extraction - Wrapper: Context navigation (RoadRunner, Webpipe). Statistical analysis (ExAlg). Other...
  • 33. Information extraction - Problems: Poorly structured sources. Frequently changing sources. False positives. Corrupted extracted data.
  • 35. Information extraction - Relevance: Using wrappers we can get a lot of information. We could rank what is relevant in the “page” context and in the domain model, for efficiency and “reasoning” purposes.
  • 36. Information extraction - Relevance
  • 37. Information extraction - Metadata: Stream extracted information into our domain model. Extracted information -> Metadata. Populated domain objects contain interesting semantic relations.
  • 38. Store Metadata: DB (with classic relational schema). Filesystem (XML). Key-value repository. Index. Triple store. ...
  • 39. Query enriched data: Exploit acquired metadata semantics to build SQL-like queries (with attributes and relations of our domain model) on previously unstructured data. Extract hidden knowledge by querying aggregated metadata.
  • 40. Sample queries: Get “young players”: SELECT * FROM giocatore g WHERE g.dob > '1993-01-01'. Aggregate queries: find the average age in each team; find the average age of World Cup players.
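
The sample queries can be tried against an in-memory SQLite database, as a sketch of the "DB with classic relational schema" storage option. The `giocatore` table name comes from the slide. The column layout, the two extra players and the choice of the presentation date as reference date for computing ages are assumptions. Dates are stored as ISO strings so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE giocatore (name TEXT, dob TEXT, team TEXT)")
conn.executemany(
    "INSERT INTO giocatore VALUES (?, ?, ?)",
    [
        ("Francesco Totti", "1976-09-27", "Italy"),
        ("Player A", "1993-05-10", "Italy"),  # hypothetical row
        ("Player B", "1990-02-01", "Spain"),  # hypothetical row
    ],
)

# "Young players": born after 1 January 1993.
young = conn.execute(
    "SELECT name FROM giocatore g WHERE g.dob > '1993-01-01' ORDER BY name"
).fetchall()

# Average age (in years) in each team, measured at a reference date.
REFERENCE_DATE = "2010-04-12"  # assumption: the date of the presentation
average_age = conn.execute(
    """
    SELECT team, AVG((julianday(?) - julianday(dob)) / 365.25)
    FROM giocatore
    GROUP BY team
    ORDER BY team
    """,
    (REFERENCE_DATE,),
).fetchall()
```

`young` holds the names of the matching players and `average_age` holds one `(team, age)` pair per team. The same aggregate without `GROUP BY` gives the average age over all stored players.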
  • 41. Information extraction on the Web
  • 42. References: http://www.w3.org/TR/xpath/ http://www.w3.org/DOM/ http://www.dia.uniroma3.it/db/roadRunner/ http://www.slideshare.net/n0on3/exalg-overview http://www.ricercaitaliana.it/prin/unita_op-2006093591_002.htm http://incubator.apache.org/uima/downloads/releaseDocs/2.3.0-incubating/docs/html/overview_and_setup/overview_and_setup.html http://en.wikipedia.org/wiki/Web_scraping http://www.alchemyapi.com/api/scrape/