Data and Information Extraction on the Web

Gestione delle Informazioni su Web (Web Information Management) - 2009/2010
Tommaso Teofili
tommaso [at] apache [dot] org

Monday, 12 April 2010

Agenda
Search

Goals

Problems

Data extraction

Information extraction

Mixing things together


Search - Goals

Find what we are looking for

Quickly

Easily

Get suggestions for other interesting, related
content

Turn results into useful knowledge


What are you looking for?

Problems when googling

Where to search for what we are looking for

How to write good queries (e.g., how to express
relations between terms)

How to evaluate whether a query is good


Search sources

Redundant, heterogeneous, widespread,
public, noisy, free, sometimes standard,
semi-structured, linked, reachable...

in one word: the Web


Focused search sources

Target the sources that matter for the desired
domain

Where possible, filter out the unclean and
fragmented ones

Choose the most standard and
well-structured ones


Fragmented sources

Structured sources

Data extraction

Automatically collect data from the Web

Crawl data from domain-specific sources

Aggregate homogeneous data (e.g., using
equivalence classes)

Save (portions of downloaded) data to a
convenient separate storage (DB, file system,
repository, etc.)


Data extraction - Crawling

From scratch (good luck!)

Leveraging existing facilities (wget, HtmlUnit,
Selenium, Apache HttpClient, Ning’s Async
HttpClient, etc.)

Playing with existing projects (RoadRunner,
Webpipe, Apache Nutch, etc.)
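
As an illustration of the “from scratch” option, here is a minimal sketch of a breadth-first crawler in Python using only the standard library. The names `LinkParser`, `http_fetch` and `crawl`, the same-host restriction and the `max_pages` limit are assumptions made for this example and do not come from any of the tools listed above. The `fetch` parameter is injectable so the logic can be tried without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def http_fetch(url):
    """Default fetcher: download a page and decode it as UTF-8 (an assumption)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, fetch=http_fetch, max_pages=50):
    """Breadth-first crawl limited to the seed's host; returns {url: html}."""
    host = urlparse(seed).netloc
    frontier, pages = deque([seed]), {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in pages:
            continue  # already downloaded
        pages[url] = fetch(url)
        parser = LinkParser()
        parser.feed(pages[url])
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if urlparse(target).netloc == host and target not in pages:
                frontier.append(target)
    return pages
```

For example, `crawl("http://example.org/", max_pages=10)` would return a dictionary mapping each visited URL to its HTML. A real crawler would also need to honour robots.txt, throttle its requests and handle errors.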


Data extraction - HttpClient

Data extraction - HtmlUnit

Data extraction - Aggregating

Downloaded resources can be assigned to
equivalence classes

The crawling process implicitly defines page
classes, and each page is assigned to one automatically

Relations between page classes

RoadRunner, Webpipe, etc.
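
One simple way to approximate equivalence classes is to group pages that share the same structural signature. The sketch below uses the set of root-to-element tag paths as the signature and requires an exact match, which is a deliberate simplification: it is not how RoadRunner or Webpipe work, and the names `PathCollector`, `signature` and `group_by_structure` are assumptions made for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Records the set of root-to-element tag paths found in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def signature(html):
    """Structural fingerprint of a page: its set of tag paths."""
    collector = PathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def group_by_structure(pages):
    """Group the URLs of {url: html} by identical structural signature."""
    classes = defaultdict(list)
    for url, html in pages.items():
        classes[signature(html)].append(url)
    return list(classes.values())
```

Two player pages with a different number of table rows still yield the same set of paths, so they would land in the same class, while a team page with a different layout would form a class of its own.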


Data extraction - EC

“teams indexes” class

“teams” class

“players” class

“coaches” class


Data extraction - Relevance

What do we really need?

It depends on the specific domain

Not every page in every class is relevant

We could be interested only in a subset of
the discovered page classes


Data extraction - Example

We may be interested
in retrieving only
information regarding
players (Player class)


Data extraction - Problems

Server-side failures and unexpected HTTP statuses (404, 403, 303, etc.)

Security and bandwidth filters (don’t get your crawler’s
IP banned!)

Client-side limits (memory and storage space are
unlimited only in theory)

Encoding

Legal issues

...
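
Some of these problems can be softened in the fetching code itself. The following is a minimal sketch, assuming a hypothetical `polite_fetch` helper: it retries on network and server errors with a growing pause, gives up immediately on client errors such as 403 or 404, and decodes the body with the charset declared by the server. The user agent string, the retry count and the delay are illustrative values, not recommendations.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def polite_fetch(url, retries=3, delay=1.0, opener=urlopen, sleep=time.sleep):
    """Fetch a URL with retries; return the decoded body, or None on failure."""
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    for attempt in range(1, retries + 1):
        try:
            with opener(request, timeout=10) as response:
                # Use the declared charset, falling back to UTF-8.
                charset = response.headers.get_content_charset() or "utf-8"
                return response.read().decode(charset, errors="replace")
        except HTTPError as error:
            # Client errors such as 403 or 404 will not go away by retrying.
            if 400 <= error.code < 500:
                return None
        except URLError:
            pass  # network problem: fall through and retry
        if attempt < retries:
            sleep(delay * attempt)  # linear back-off to avoid hammering the server
    return None
```

The `opener` and `sleep` parameters exist only so that the function can be exercised with stand-ins instead of real network calls and real pauses.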


From Data to Information

Data vs Information

Data                   Information
Raw                    Clean
Semi-structured        Structured
Mixed content          Focused
Immutable              Managed
Navigation oriented    Domain oriented


From Data to Information

We have crawled a lot of data

We may have some rough structure
(page classes and relations)

We want to pick only what we need


Information extraction - Pruning

We want to filter out at least:

Banners, advertisements, etc.

Headers/Footers

Navigation bars/Search boxes

Anything else unrelated to the content

We may use XPath
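
A minimal sketch of XPath-based pruning, using the limited XPath subset of Python's `xml.etree.ElementTree`. It assumes the page is well-formed XHTML (real-world HTML usually needs a tolerant parser first), and the patterns in `BOILERPLATE` as well as the sample page are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative patterns, relative to the parent element; tune them per source.
BOILERPLATE = ["header", "footer", "nav", "form", "*[@class='banner']"]


def prune(root):
    """Remove every sub-tree matching a boilerplate pattern, in place."""
    for pattern in BOILERPLATE:
        # ".//pattern/.." selects the parents of the matching elements.
        for parent in root.findall(f".//{pattern}/.."):
            for child in parent.findall(pattern):
                parent.remove(child)
    return root


page = ET.fromstring(
    "<html><body>"
    "<header>Site name</header>"
    "<nav><a href='/'>Home</a></nav>"
    "<div class='banner'>Buy now!</div>"
    "<div class='content'><h1>Francesco Totti</h1></div>"
    "<footer>Contacts</footer>"
    "</body></html>"
)
prune(page)
print([child.tag for child in page.find("body")])  # only the content div is left
```

Selecting the parent first is needed because ElementTree elements do not keep a reference to their parent, so a node can only be removed through it.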


Information extraction - Pruning



Once we have extracted the content

We want to get useful information out of it
-> knowledge

Look for matches between the extracted
data and our domain model


Information extraction - Example

Navigate XML (HTML DOM) nodes with XPath

Navigate content and find specific
“parts” (nodes or sub-trees)

Tag such “parts” as objects or properties
inside a (specific) domain model

We may need to traverse the DOM multiple
times


Information extraction - Name


Information extraction - Date of Birth


Information extraction - Team


Information extraction - Example

A Player (taken from the Player pageclass)

with name, date of birth and belonging to a
team

We now know that “Francesco Totti” is a Player
of the “Italy” team and was born on “27/09/1976”

We can apply the same XPaths to all instances of the
page class and get information about each player
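
The mapping from XPath matches to a domain object can be sketched as follows. The page markup, the XPath expressions in `XPATHS` and the `Player` dataclass are all hypothetical; the actual expressions depend on the layout of the page class being wrapped.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date, datetime

# Hypothetical XPaths for a hypothetical "players" page layout.
XPATHS = {
    "name": ".//div[@id='player']/h1",
    "dob": ".//td[@class='dob']",
    "team": ".//td[@class='team']/a",
}


@dataclass
class Player:
    name: str
    dob: date
    team: str


def extract_player(root):
    """Apply each XPath to a page and map the matched text to a Player.

    Raises an exception if a node is missing or the date is malformed.
    """
    text = {field: root.findtext(path).strip() for field, path in XPATHS.items()}
    return Player(
        name=text["name"],
        dob=datetime.strptime(text["dob"], "%d/%m/%Y").date(),
        team=text["team"],
    )


page = ET.fromstring(
    "<html><body><div id='player'><h1>Francesco Totti</h1>"
    "<table><tr><td class='dob'>27/09/1976</td>"
    "<td class='team'><a href='/teams/italy'>Italy</a></td></tr></table>"
    "</div></body></html>"
)
print(extract_player(page))
```

Running `extract_player` over every page of the same class would yield one `Player` object per page.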


Information extraction - Wrapper

Context navigation

RoadRunner

Webpipe

Statistical analysis

ExAlg

Other...


Information extraction - Problems

Poorly structured sources

Frequently changing sources

False positives

Corrupted extracted data
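
One way to catch some false positives and corrupted values is to validate each extracted record before accepting it. The sketch below is only an assumption about what such checks could look like: the `validate_player` function, the name pattern and the age range of 10 to 60 years are invented for this example.

```python
import re
from datetime import date, datetime

# Letters only, with single spaces, apostrophes or hyphens between words.
NAME_PATTERN = re.compile(r"^[^\W\d_]+(?:[ '\-][^\W\d_]+)*$")


def validate_player(record, today=None):
    """Return a list of problems found in an extracted record (empty if none)."""
    today = today or date.today()
    problems = []
    if not NAME_PATTERN.match(record.get("name", "")):
        problems.append("name does not look like a person name")
    try:
        dob = datetime.strptime(record.get("dob", ""), "%d/%m/%Y").date()
        # Approximate age in years; the 10-60 range is an arbitrary example.
        if not 10 <= (today - dob).days // 365 <= 60:
            problems.append("implausible age for a player")
    except ValueError:
        problems.append("date of birth is not in DD/MM/YYYY format")
    if not record.get("team"):
        problems.append("missing team")
    return problems
```

A record such as `{"name": "Advertisement 728x90", "dob": "n/a", "team": ""}` would be reported with three problems, which suggests the wrapper matched a banner instead of a player.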


False positives

Information extraction - Relevance

Using wrappers we can get a lot of
information

We could rank what is relevant in:

the “page” context

the domain model

For efficiency and “reasoning” purposes


Information extraction - Relevance


Information extraction - Metadata

Stream extracted information into our
domain model

Extracted information -> Metadata

Populated domain objects contain

interesting semantics

relations


Store Metadata
DB (with classic relational schema)

Filesystem (XML)

Key-Value repository

Index

Triple Store

...
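
To give an idea of the triple store option, the sketch below flattens a player record into (subject, predicate, object) triples and answers simple pattern queries over an in-memory set. It only illustrates the data model and is not a real triple store; the identifiers, the predicate names and the second player (“Mario Rossi”) are made up for the example.

```python
def to_triples(player_id, name, dob, team):
    """Flatten one player record into (subject, predicate, object) triples."""
    return {
        (player_id, "type", "Player"),
        (player_id, "name", name),
        (player_id, "dateOfBirth", dob),
        (player_id, "playsFor", team),
    }


def match(triples, subject=None, predicate=None, obj=None):
    """Return the sorted triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if subject in (None, t[0])
        and predicate in (None, t[1])
        and obj in (None, t[2])
    )


store = set()
store |= to_triples("player:totti", "Francesco Totti", "1976-09-27", "team:italy")
store |= to_triples("player:rossi", "Mario Rossi", "1990-01-01", "team:italy")  # invented

# Who plays for Italy?
print(match(store, predicate="playsFor", obj="team:italy"))
```

Relations between domain objects, such as a player belonging to a team, become triples whose object is the identifier of another subject.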


Query enriched data

Exploit the semantics of the acquired metadata to
run SQL-like queries (over the attributes and relations
of our domain model) on previously unstructured data

Extract hidden knowledge by querying
aggregated metadata


Sample queries
Get “young players”

SELECT * FROM giocatore g WHERE g.dob > '1993-01-01'

Aggregate queries

Find the average age in each team

Find the average age of World Cup
players
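
The sample queries can be tried against an in-memory SQLite database. This is a sketch under several assumptions: the `giocatore` table is given `name`, `dob` and `team` columns, dates are stored as ISO text, the date on the slides is used as “today”, and all rows except the first are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE giocatore (name TEXT, dob TEXT, team TEXT)")
conn.executemany(
    "INSERT INTO giocatore VALUES (?, ?, ?)",
    [
        ("Francesco Totti", "1976-09-27", "Italy"),
        ("Made-up Player A", "1993-06-15", "Italy"),   # invented sample row
        ("Made-up Player B", "1985-03-02", "Brazil"),  # invented sample row
    ],
)

REFERENCE_DAY = "2010-04-12"  # date on the slides, used as "today"

# Age in whole years: year difference, minus one if the birthday is still to come.
AGE = (
    "(strftime('%Y', :today) - strftime('%Y', dob)) "
    "- (strftime('%m-%d', :today) < strftime('%m-%d', dob))"
)

# "Young players": born after 1 January 1993.
young = conn.execute(
    "SELECT name FROM giocatore WHERE dob > '1993-01-01'"
).fetchall()

# Average age in each team.
per_team = conn.execute(
    f"SELECT team, AVG({AGE}) FROM giocatore GROUP BY team ORDER BY team",
    {"today": REFERENCE_DAY},
).fetchall()

# Average age over all stored players.
overall = conn.execute(
    f"SELECT AVG({AGE}) FROM giocatore", {"today": REFERENCE_DAY}
).fetchone()[0]

print(young, per_team, overall)
```

Comparing ISO dates as text works because the `YYYY-MM-DD` format sorts in chronological order.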



References
http://www.w3.org/TR/xpath/

http://www.w3.org/DOM/

http://www.dia.uniroma3.it/db/roadRunner/

http://www.slideshare.net/n0on3/exalg-overview

http://www.ricercaitaliana.it/prin/unita_op-2006093591_002.htm

http://incubator.apache.org/uima/downloads/releaseDocs/2.3.0-incubating/docs/html/overview_and_setup/overview_and_setup.html

http://en.wikipedia.org/wiki/Web_scraping

http://www.alchemyapi.com/api/scrape/

