Slides about "Information and Data Extraction on the Web" for "Information management on the Web" course at DIA (Computer Science Department) of Roma Tre University
Data and Information Extraction on the Web
1. Data and Information
Extraction on the Web
Information Management on the Web (Gestione delle Informazioni su Web) - 2009/2010
Tommaso Teofili
tommaso [at] apache [dot] org
Monday, 12 April 2010
2. Agenda
Search
Goals
Problems
Data extraction
Information extraction
Mixing things together
3. Search - Goals
Find what we are looking for
Quickly
Easily
Get suggestions about other interesting,
related stuff
Turn results into useful knowledge
5. Problems when googling
Where to search for what we are looking for
How to write good queries (e.g. how should
the terms relate to each other?)
How to evaluate whether a query is good
6. Search sources
Redundant, heterogeneous, widespread,
public, noisy, free, sometimes standard,
semi-structured, linked, reachable...
in one word: the Web
7. Focused search sources
Target interesting sources for the desired
domain
Where possible, filter out the unclean and
fragmented ones
Choose the most standard and well
structured ones
10. Data extraction
Automatically collect data from the Web
Crawl data from domain specific sources
Aggregate homogeneous data (i.e.: using
equivalence classes)
Save (portions of downloaded) data to a
convenient separate storage (DB, file system,
repository, etc.)
11. Data extraction - Crawling
From scratch (good luck!)
Leveraging existing facilities (wget, HtmlUnit,
Selenium, Apache HttpClient, Ning’s Async
HttpClient, etc.)
Playing with existing projects (RoadRunner,
Webpipe, Apache Nutch, etc.)
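As a rough illustration of the "leveraging existing facilities" option, here is a minimal breadth-first crawler sketch built only on the Python standard library. It is not taken from the slides: the names `LinkParser`, `fetch`, and `crawl`, the same-host restriction, and the UTF-8 assumption are all choices made for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Default downloader (assumption: pages are UTF-8 text)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl restricted to the host of the seed URL."""
    host = urlparse(seed).netloc
    queue, seen, pages = deque([seed]), {seed}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # unreachable page: skip it and go on
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # drop fragments
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing a custom `fetch_page` makes it possible to try the crawl logic against an in-memory site before pointing it at a real one.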
14. Data extraction - Aggregating
Downloaded resources can be assigned to
equivalence classes
The crawling process inherently defines the page
classes to which pages are automatically assigned
Relations between page classes
RoadRunner, Webpipe, etc.
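A very crude way to picture equivalence classes is to group downloaded URLs by the shape of their path. This is only a sketch under a strong assumption (pages of the same class share a URL template that differs in numeric ids); tools such as RoadRunner work on page structure instead. The names `url_signature` and `group_by_class` are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_signature(url):
    """Replace every run of digits in the path with a placeholder."""
    return re.sub(r"\d+", "{id}", urlparse(url).path)


def group_by_class(urls):
    """Assign each URL to the class identified by its signature."""
    classes = defaultdict(list)
    for url in urls:
        classes[url_signature(url)].append(url)
    return dict(classes)
```

With this grouping, `/team/12` and `/team/7` land in the same class, while `/player/101` and `/coach/3` each start a class of their own.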
16. Data extraction - Equivalence classes (EC)
“teams indexes” class
“teams” class
“players” class
“coaches” class
17. Data extraction - Relevance
What do we really need?
Depending on the specific domain
Not every page in every class is necessarily relevant
We could be interested only in a subset of
the found page classes
18. Data extraction - Example
We may be interested
in retrieving only
information regarding
players (Player class)
19. Data extraction - Problems
Server unavailability (HTTP 404, 403, 503, etc.)
Security and bandwidth filters (don’t get your crawler
machine IP banned!)
Client unavailability (memory and storage space are
unlimited only in theory)
Encoding
Legal issues
...
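Two of the problems above (servers that are temporarily unavailable, and filters that punish aggressive clients) are commonly softened with retries and increasing pauses. The following is a sketch under assumptions, not a recipe from the slides: the set of retryable status codes, the delays, and the names `download` and `polite_fetch` are illustrative.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Assumption: these status codes are worth a second attempt.
RETRYABLE = {429, 500, 502, 503, 504}


def download(url):
    """Default downloader returning the raw response body."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def polite_fetch(url, download=download, retries=3, delay=1.0, sleep=time.sleep):
    """Retry transient failures, doubling the pause after each attempt."""
    for attempt in range(retries):
        try:
            return download(url)
        except HTTPError as error:
            if error.code not in RETRYABLE:
                return None  # e.g. 403 or 404: retrying will not help
        except URLError:
            pass  # network-level problem: worth another attempt
        if attempt < retries - 1:
            sleep(delay * 2 ** attempt)
    return None
```

Returning `None` instead of raising keeps a long crawl running; a real crawler would probably also log the failure and honour the site's robots.txt.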
21. Data vs Information
Data                   Information
Raw                    Clean
Semi-structured        Structured
Mixed content          Focused
Immutable              Managed
Navigation oriented    Domain oriented
22. From Data to Information
We have crawled a lot of data
We may have some rough structure
(page classes and relations)
We want to pick only what we need
23. Information extraction - Pruning
We want to filter out at least:
Banners, advertisement, etc.
Headers/Footers
Navigation bars/Search boxes
Everything else not related to the content
We may use XPath
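To make the pruning step concrete, here is a small sketch that removes noise nodes selected by XPath expressions. It assumes the page is well-formed XHTML (the standard-library parser rejects sloppy HTML) and that noise can be recognised by the ids, classes, and tags listed in `NOISE_XPATHS`; those expressions and the `prune` name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical selectors for banners, headers/footers, navigation, search boxes.
NOISE_XPATHS = [
    ".//div[@id='header']",
    ".//div[@id='footer']",
    ".//div[@class='banner']",
    ".//ul[@class='nav']",
    ".//form",
]


def prune(root, xpaths=NOISE_XPATHS):
    """Remove every node matched by one of the XPath expressions."""
    # ElementTree nodes do not know their parent, so build a lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for xpath in xpaths:
        for node in root.findall(xpath):
            parent = parent_of.get(node)
            if parent is not None:
                # Note: text trailing the removed node (its tail) is dropped too.
                parent.remove(node)
    return root
```

After `prune(root)` only the content-bearing sub-trees remain, which is what the later extraction steps operate on.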
26. Information extraction
Once we have extracted content
We are now interested in getting useful
information from it -> knowledge
Look for matches between extracted
data and our domain model
27. Information extraction - Example
Navigate XML (HTML DOM) nodes with XPath
Navigate content and find specific
“parts” (nodes or sub-trees)
Tag such “parts” as objects or properties
inside a (specific) domain model
May need to traverse the DOM multiple
times
31. Information extraction - Example
A Player (taken from the Player pageclass)
with name, date of birth and belonging to a
team
We now know that “Francesco Totti” is a Player
of the “Italy” team and was born on “27/09/1976”
We can apply such XPaths to all PageClass
instances and get information about each player
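The Player example can be sketched as a mapping from domain-model properties to XPath expressions. The page markup, the expressions in `PLAYER_XPATHS`, and the `Player` and `extract_player` names are assumptions made for illustration; the real page class would dictate the actual paths.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    date_of_birth: str
    team: str


# Hypothetical XPaths, one per property of the domain object.
PLAYER_XPATHS = {
    "name": ".//h1[@class='name']",
    "date_of_birth": ".//span[@class='dob']",
    "team": ".//a[@class='team']",
}


def extract_player(page_source):
    """Apply every XPath to one page; return None if any of them fails."""
    root = ET.fromstring(page_source)
    values = {}
    for field, xpath in PLAYER_XPATHS.items():
        text = root.findtext(xpath)
        if text is None:
            return None  # the page does not belong to the Player page class
        values[field] = text.strip()
    return Player(**values)
```

Running `extract_player` over every page of the Player page class yields one populated domain object per player, and pages that do not fit the wrapper are simply skipped.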
32. Information extraction - Wrapper
Context navigation
RoadRunner
Webpipe
Statistical analysis
ExAlg
Other...
33. Information extraction - Problems
Poorly structured sources
Frequently changing sources
False positives
Corrupted extracted data
35. Information extraction - Relevance
Using wrappers we can get a lot of
information
We could rank what is relevant in:
the “page” context
the domain model
For efficiency and “reasoning” purposes
37. Information extraction - Metadata
Stream extracted information into our
domain model
Extracted information -> Metadata
Populated domain objects contain
interesting semantics
relations
38. Store Metadata
DB (with classic relational schema)
Filesystem (XML)
Key-Value repository
Index
Triple Store
...
39. Query enriched data
Exploit acquired metadata semantics to build
SQL-like queries (with attributes and relations
of our domain model) on previously
unstructured data
Extract hidden knowledge by querying
aggregated metadata
40. Sample queries
Get “young players”
SELECT * FROM giocatore g WHERE g.dob
> '1993-01-01'
Aggregate queries
Find the average age in each team
Find the average age of World Cup
players
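These sample queries can be tried on an in-memory SQLite database. This is a sketch under assumptions: the `giocatore` table layout (name, dob, team), ISO-formatted dates, the reference date used to compute ages, and the helper names are all choices made for the example.

```python
import sqlite3


def create_store(players):
    """Load (name, dob, team) tuples into an in-memory giocatore table."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE giocatore (name TEXT, dob TEXT, team TEXT)")
    connection.executemany("INSERT INTO giocatore VALUES (?, ?, ?)", players)
    return connection


def young_players(connection, born_after="1993-01-01"):
    """Names of players born after the given ISO date."""
    rows = connection.execute(
        "SELECT name FROM giocatore g WHERE g.dob > ? ORDER BY name",
        (born_after,),
    )
    return [name for (name,) in rows]


def average_age_by_team(connection, on_date="2010-04-12"):
    """Average age in years per team, measured on a reference date."""
    rows = connection.execute(
        "SELECT team, AVG((julianday(?) - julianday(dob)) / 365.25) "
        "FROM giocatore GROUP BY team ORDER BY team",
        (on_date,),
    )
    return dict(rows.fetchall())
```

Dropping the `GROUP BY team` clause from the second query would give the average age over all stored players, which corresponds to the "World Cup players" aggregate when the table holds exactly those players.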