Mining Semi-structured Data:
Understanding Web-tables – Building a
Taxonomy for 2xn Tables
Emir Muñoz, MSc.
emir.munoz@deri.org
Galway, Ireland – 19 July 2012
Introduction WTT-Detection WTT-Interpretation On-going work Future work 1/39
Outline
1 Introduction
2 WTT-Detection
3 WTT-Interpretation
4 On-going work
5 Future work
Introduction I
Tables
They are used as a compact and efficient way to present
relational information.
They are inherently concise as well as information rich.
The automatic understanding of tables has many applications
including:
Knowledge management
Information retrieval
Web and text mining
Summarization, and
Content delivery to mobile devices.
Interesting for domains like: medicine, health-care, finance,
e-science (e.g., biotechnology), and public policy.
Introduction II
Web-tables (WTT) examples
Introduction III
Table understanding in Web documents includes [WH02]:
Table detection,
Functional and structural analysis, and
Table interpretation.
Cafarella et al. [CHW+08] estimated that there are around 14.1
billion HTML tables, of which 154 million contain high-quality
relational data.
This represents a large source of knowledge, yet we do not
have systems that can understand and exploit this knowledge
properly.
Table detection I
In practice, tables are not only used to present relational
information,
... they are also used to create multiple-column layouts to
facilitate easy viewing.
The presence of the HTML tag <table> does not guarantee a
relational table or, more generally, a table with content.
[WH02] A ML approach for Table Detection
Wang and Hu discriminate genuine from non-genuine tables on the
basis of their content: they check whether a table contains
logical relations among its cells or is merely used as a
mechanism for grouping content. To do so, they use a decision tree
classifier.
Table detection II
In genuine or relational tables there are logical relations
among the cells.
Non-genuine or non-relational tables are used as a mechanism
for grouping contents.
Table detection: [WH02] I
A Machine Learning Based Approach for Table Detection on The Web
They define weights derived from the traditional tf ∗ idf
measure used in IR, and define similarity based on the vector
space model.
Their initial database contains 2,851 pages from around 200
web sites, harvested from the Google directory and News using
predefined keywords known to have a higher chance of
recalling genuine tables.
They randomly selected 1,393 pages from this database,
containing 11,477 <table> nodes.
For training they used 9-fold cross validation.
They experimented with decision trees and SVMs for
separating genuine and non-genuine tables.
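The tf ∗ idf weighting and vector-space similarity mentioned above can be sketched as follows. This is a generic illustration, not the exact feature definition of [WH02]: the function names, the logarithmic idf variant and the toy token lists are all assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf*idf vector (a dict) per tokenised document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical bags of words taken from three tables.
cells = [
    ["city", "population", "mayor"],
    ["city", "population", "state"],
    ["home", "contact", "login"],
]
vecs = tfidf_vectors(cells)
```

Under these assumptions, the first two tables share weighted terms and score higher than the first and third, which have no terms in common.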
Table detection: [WH02] II
A Machine Learning Based Approach for Table Detection on The Web
1,740 are genuine (15.16%) and 9,737 are non-genuine
(84.84%) tables.
The results reported are
R = 94.25%, P = 97.50%, F = 95.88%.
(The pages were obtained by querying Google with keywords
such as “table”, “stock”, and “weather”.)
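As a quick check of how the three figures relate, the sketch below computes the balanced F-measure from the reported precision and recall. It assumes the usual harmonic-mean definition; the slides do not say which variant [WH02] use, and the function name is illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Reported values: P = 97.50%, R = 94.25%.
f = f_measure(0.9750, 0.9425)
print(f"{100 * f:.2f}%")  # prints 95.85%
```

Computed from the aggregate P and R, the harmonic mean comes out slightly below the reported 95.88%. A figure averaged over cross-validation folds would be one possible reason, though the slides do not confirm this.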
Table detection: [CP10b] I
Web-Scale Knowledge Extraction from Semi-Structured Tables
The tables of interest are called attribute/value tables.
They propose a classification algorithm for recognizing layout
tables and attribute/value tables. They adopt the Gradient
Boosted Decision Tree classification model, with classes
ATTRIBUTE/VALUE, LAYOUT, and OTHER (e.g., calendars,
forms, enumerations).
Table detection: [CP10b] II
Web-Scale Knowledge Extraction from Semi-Structured Tables
Table detection: [CP10b] III
Web-Scale Knowledge Extraction from Semi-Structured Tables
Tables list attributes but rarely contain the subject in the
table proper.
Their focus is on detecting the subject of the table. They
call this open research problem protagonist detection.
The relational tables considered in their work encode facts, or
semantic triples of the form <p, s, o>.
There are three different places where the protagonist could
be found:
a) within the table (occasionally, under a generic attribute
such as name or model);
b) within the document or the HTML <title> tag; and
c) in anchor texts, which offer well-defined boundaries for
identifying protagonist candidates, whereas the document body
offers fewer clues.
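To make the role of the protagonist concrete, here is a minimal sketch that turns the rows of an attribute/value table into <p, s, o> triples once a protagonist is known. The function name, the normalisation of attribute names and the sample rows are hypothetical; how [CP10b] actually find the protagonist is not shown here.

```python
def av_table_to_triples(protagonist, rows):
    """Turn attribute/value rows into (p, s, o) triples for one subject."""
    triples = []
    for attribute, value in rows:
        # Normalise the attribute: drop surrounding spaces and a trailing colon.
        predicate = attribute.strip().rstrip(":").strip().lower()
        if predicate and value.strip():
            triples.append((predicate, protagonist, value.strip()))
    return triples

# Hypothetical table: the subject ("Example Camera") is not in the table itself.
rows = [("Model:", "X100"), ("Weight:", "1.2 kg"), ("Color:", "")]
triples = av_table_to_triples("Example Camera", rows)
```

Rows with an empty value are skipped, so the sample yields two triples, both with the externally supplied protagonist as subject.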
Table detection: [CP10b] IV
Web-Scale Knowledge Extraction from Semi-Structured Tables
Table detection: [CP10a, CP11] I
Web-scale Table Census and Classification
They extend their previous work, proposing a much
finer-grained table-type classification and report an overall
accuracy of 75.2%.
From a total of 1.2 billion documents, they extracted 8.2
billion tables (2.6 billion unique tables).
In detail, 75% of the pages contain at least one table with an
average of 9.1 tables per document.
In preliminary experiments on identifying the protagonist of
A-V tables, they use an N-gram based approach that relies on
a commercial search engine’s web link graph.
The correct protagonist is among the top-20 ranked candidates
in 90% of the cases, and among the top-3 in 79% of the cases.
Table detection: [CP10a, CP11] II
Table classes
[CP10a, CP11] propose the following table type taxonomy.
(This proposal and others are based only on the syntactic
structure of tables.)
WTT-Interpretation I
Recovering Table Semantics
Some work focuses on mapping spreadsheets into RDF, but such
systems require human intervention.
[MFSJ10] proposed an approach that uses linked data to
interpret tables and associate their components with nodes in
a reference linked data collection.
The collection provides general-purpose knowledge as well as
specific facts about significant people, places, organizations,
events and many other entities of interest.
[SFMJ10] used RDF for exporting and encoding the
information embodied in tables.
They describe techniques to automatically infer a (partial)
semantic model for the information in tables using both the
table headings, if available, and the values stored in table cells.
The techniques have been prototyped for a subset of linked
data that covers the core of Wikipedia.
WTT-Interpretation II
Recovering Table Semantics
City          Mayor         State  Population
Boston        T. Menino     MA       610,000
New York      M. Bloomberg  NY     8,400,000
Philadelphia  M. Nutter     PA     1,500,000
Baltimore     S. Dixon      MD       640,000
Washington    A. Fenty      DC       595,000
@prefix dbp: <http://dbpedia.org/resource/> .
@prefix dbpo: <http://dbpedia.org/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix cyc: <http://www.cyc.com/2004/06/04/cyc#> .
dbp:Boston dbpo:leaderName dbp:Thomas_Menino ;
    cyc:partOf dbp:Massachusetts ;
    dbpo:populationTotal "610000"^^xsd:integer .
dbp:New_York_City ...
...
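The mapping from a table row to the Turtle above can be sketched as below. The two lookup tables are hypothetical stand-ins for the hard parts (choosing a predicate per column and linking cell strings to DBpedia resources), which the approaches cited here infer automatically; the function name is also illustrative.

```python
# Hypothetical column-to-predicate mapping; a real system would infer it.
PREDICATES = {
    "Mayor": "dbpo:leaderName",
    "State": "cyc:partOf",
    "Population": "dbpo:populationTotal",
}
# Hypothetical entity links from cell strings to DBpedia local names.
ENTITIES = {
    "Boston": "Boston",
    "T. Menino": "Thomas_Menino",
    "MA": "Massachusetts",
}

def row_to_turtle(header, row):
    """Serialise one table row as a Turtle subject block."""
    subject = "dbp:" + ENTITIES[row[0]]
    parts = []
    for column, cell in zip(header[1:], row[1:]):
        if column == "Population":
            # Numeric cells become typed literals without thousands separators.
            obj = '"%s"^^xsd:integer' % cell.replace(",", "")
        else:
            obj = "dbp:" + ENTITIES[cell]
        parts.append("%s %s" % (PREDICATES[column], obj))
    return subject + " " + " ;\n    ".join(parts) + " ."

header = ["City", "Mayor", "State", "Population"]
turtle = row_to_turtle(header, ["Boston", "T. Menino", "MA", "610,000"])
print(turtle)
```

The first column is treated as the subject of every triple in the row, which is an assumption that happens to hold for this example table.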
WTT-Interpretation III
Recovering Table Semantics
When predicting entity classes in a column, [SFMJ10] used
DBpedia (85.71%), YAGO (71.42%), WordNet (71.42%) and
Freebase (90.47%).
Correct predictions by entity type: Places (61.64%),
Persons (90.76%) and Organizations (66.67%).
To describe relations between columns in a table, they take all
pairs of entities in the same row (already linked to Wikipedia)
and query DBpedia for the set of relations.
http://dbpedia.org/ontology/largestCity
http://dbpedia.org/ontology/PopulatedPlace/largestCity
http://dbpedia.org/ontology/capital
http://dbpedia.org/ontology/PopulatedPlace/capital
http://dbpedia.org/property/capital
http://dbpedia.org/property/largestcity
WTT-Interpretation IV
Recovering Table Semantics
The relation that appears the largest number of times is
the one selected.
The evaluation test set is very small, just 5 tables taken from
Google Squared.
Another example about basketball players:
Name            Team          Position
Michael Jordan  Chicago       Shooting guard
Allen Iverson   Philadelphia  Point guard
Yao Ming        Houston       Center
Tim Duncan      San Antonio   Power forward
It is important to discover relations between the table
columns, and not only binary (2-ary) relations.
[MFJ11] also analyzed government linked data.
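The selection rule on this slide (keep the relation that appears most often across rows) can be sketched as a simple vote. The per-row relation sets below are hypothetical query results, not real DBpedia output, and the function name is an assumption.

```python
from collections import Counter

def select_relation(relations_per_row):
    """Pick the relation returned for the largest number of rows."""
    counts = Counter()
    for relations in relations_per_row:
        # Count each relation once per row, even if it was returned twice.
        counts.update(set(relations))
    if not counts:
        return None
    # Ties, if any, are broken arbitrarily in this sketch.
    return counts.most_common(1)[0][0]

# Hypothetical results for the entity pair of each row in two columns.
rows = [
    {"dbpo:capital", "dbpo:largestCity"},
    {"dbpo:capital"},
    {"dbpo:largestCity", "dbpo:capital"},
    set(),  # no relation found for this row
]
best = select_relation(rows)
```

Here dbpo:capital is found for three rows and dbpo:largestCity for two, so the former labels the column pair.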
WTT-Interpretation I
WTT as a very large repository of facts
[YTT01] focused on a probabilistic method to integrate
tables according to the category of objects represented in
each table, by clustering attributes.
[YT01, TI06] proposed methods for ontology extraction from
web-tables using the relations represented by structures within
the table. (The table structures have to be given by humans.)
An IR approach presented in [YTL11] extracts structured
data from WTT, aggregates and cleans the data, and stores
it in a database, creating a very large repository of
entity-attribute-value triples.
WTT-Interpretation II
WTT as a very large repository of facts
(How does this work?) A good example is the query “Saint
Patrick’s day”: a search engine could directly show “17
March” within its top-ranked results.
http://en.wikipedia.org/wiki/Public_holidays_in_the_Republic_of_Ireland
WTT-Interpretation III
WTT as a very large repository of facts
(Hypothesis) Recovering semantics leads to better search
and quality filtering. Some stated problems:
Take for instance, a table about trees and a piece of text like
“...North America species such as Green Ash...”. From the
WTT we could infer that “Green Ash” is a species of tree
a.k.a. “Fraxinus pennsylvanica”.
Use schema statistics to automatically compute attribute
synonyms (more complete than a thesaurus).
e.g., e-mail—email, phone—telephone, e-mail address—email
address, date—last-modified
It is still necessary to recover a large fraction of binary
relationships, and to develop techniques for recovering
numerical relationships (e.g., population, GDP) [VHM+11].
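One way schema statistics could suggest attribute synonyms is sketched below. The heuristic is an assumption made for illustration (synonyms rarely appear side by side in one schema, yet tend to share the same neighbouring attributes); the slides do not spell out the method, and the function name, threshold and toy schemas are hypothetical.

```python
from collections import Counter
from itertools import combinations

def synonym_candidates(schemas, min_shared=2):
    """Rank attribute pairs that never co-occur but share context attributes."""
    together = Counter()   # how often two attributes appear in one schema
    contexts = {}          # attribute -> attributes seen alongside it
    for schema in schemas:
        attrs = set(schema)
        for a, b in combinations(sorted(attrs), 2):
            together[(a, b)] += 1
        for a in attrs:
            contexts.setdefault(a, Counter()).update(attrs - {a})
    candidates = []
    for a, b in combinations(sorted(contexts), 2):
        if together[(a, b)] > 0:
            continue  # attributes used side by side are unlikely synonyms
        shared = len(set(contexts[a]) & set(contexts[b]))
        if shared >= min_shared:
            candidates.append((shared, a, b))
    return [(a, b) for shared, a, b in sorted(candidates, reverse=True)]

# Hypothetical table schemas (lists of column headers).
schemas = [
    ["name", "e-mail", "phone"],
    ["name", "email", "phone"],
    ["name", "email", "telephone"],
    ["title", "date"],
]
pairs = synonym_candidates(schemas)
```

On these toy schemas the sketch proposes e-mail/email and phone/telephone, two of the pairs listed on the slide.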
On-going work I
Introduction
Our initial aim was to understand relational tables.
We parse HTML pages to extract HTML tables using the
NekoHTML library.
We have a corpus comprising 8.2 billion tables.
Each table is parsed into a matrix using TARTAR [PCS+07],
taking cell spans into account.
We manually annotated 14,695 randomly chosen tables:
10,923 content-poor (74.33%) and 3,785 content-rich
(25.76%) tables.
We built a predictor for content-poor and content-rich tables
using the same features as [CP11], a max-entropy model,
and 10-fold cross-validation.
On-going work II
Introduction
From a set of 115 features, we selected 19 via a greedy
feature selection algorithm. These 19 features achieve an
accuracy of 89.46%.
Most important features:
Presence of the <select> tag in a column
Distinct strings in the 1st column
Distinct tags in a column
Distinct tags in a row
Non-empty cells in columns or rows
Presence of links
Presence of colon “:”
Presence of line breaks (<br>)
Presence of input fields (HTML)
Presence of numbers in a row
Presence of the <th> tag
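A greedy selection over a feature set can be sketched as forward selection: start with no features and repeatedly add the one that improves the score most. Whether the selection used in this work is forward, backward or another greedy variant is not stated, so this is only one plausible reading. The feature names and the toy scoring function (standing in for cross-validated accuracy) are hypothetical.

```python
def greedy_select(features, score, k):
    """Forward selection: repeatedly add the feature that most improves score."""
    selected = []
    remaining = list(features)
    best = score(selected)
    while remaining and len(selected) < k:
        gains = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(gains, key=lambda g: g[0])
        if top_score <= best:
            break  # no remaining feature helps
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best

# Toy scoring function standing in for cross-validated accuracy.
USEFUL = {"has_colon": 0.10, "has_th": 0.05, "has_links": 0.02}

def toy_score(subset):
    return 0.70 + sum(USEFUL.get(f, 0.0) for f in subset)

chosen, acc = greedy_select(
    ["has_links", "noise", "has_colon", "has_th"], toy_score, k=3
)
```

With the toy scores, features are picked in order of their individual gain and the uninformative one is never added.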
On-going work III
Introduction
Our aim is to propose a content-based taxonomy for WTT,
in contrast to previous taxonomies based on syntactic structure.
We are now developing a 2xn table predictor with classes
focused on content.
On-going work I
Why this work?
Why a new taxonomy?
[WH02] classify WTT as genuine or non-genuine tables.
[CHW+08] classify WTT as relational or non-relational tables.
Crestan and Pantel’s taxonomy is a more general-purpose
taxonomy for tables, focused on the syntax of the tables, not
on their semantics.
Intuitively, not all the classes in the taxonomy of [CP11] are
useful, and only the A-V class is used.
Moreover, what does it mean to say that a table is A-V?
An A-V table could have spatio-temporal attributes or universal
facts, or even describe a person, a company or a product.
All the previous approaches need more focus on the
“message” of the tables.
On-going work II
Why this work?
Why are 2xn tables important?
The 2xn class is larger than the A-V class.
2xn tables account for about 20% of all tables on the Web.
Previously, A-V tables were identified by the presence of “:”
(colon).
We hope that extending the research to 2xn tables will uncover
those A-V tables that are not indicated by the “colon rule”.
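The difference between the shape-based 2xn class and the “colon rule” can be illustrated with two small checks on a parsed table matrix. The reading of 2xn as “two columns, n rows”, the function names and the sample tables are assumptions made for this sketch.

```python
def is_two_column(matrix):
    """True if the table has at least one row and every row has two cells."""
    return bool(matrix) and all(len(row) == 2 for row in matrix)

def colon_ratio(matrix):
    """Fraction of rows whose first cell ends with a colon."""
    if not matrix:
        return 0.0
    hits = sum(1 for row in matrix if row and row[0].strip().endswith(":"))
    return hits / len(matrix)

# Two hypothetical tables with the same content, with and without colons.
with_colon = [["Name:", "Green Ash"], ["Species:", "Fraxinus pennsylvanica"]]
without_colon = [["Name", "Green Ash"], ["Species", "Fraxinus pennsylvanica"]]
```

Both tables pass the shape check, but only the first would be caught by a colon-based rule, which is the gap the 2xn class is meant to cover.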
On-going work
Proposed Taxonomy
We introduce new classes that could be important, for instance,
when used together with ontologies (e.g., FOAF) in a search engine.
On-going work
WTT examples according to our taxonomy
On-going work
Distribution
We are manually tagging 174,748 unique WTT.
The distribution per class so far is:
Class                        %
Social networks              34.2%
Spatio-temporal information  28.9%
Products                     28.4%
Resources                    4.3%
Universal facts              3.2%
Other                        1.0%
Events                       0.03%
Future work – Open problems I
Tables in web-pages can be used to model the data of
web-sites, in particular their main entities and the key relations
among them. This entails:
a) Web tables bear syntactic and semantic information that is
useful for determining what they are talking about. Thus,
patterns across web-tables can be exploited to automatically
understand their “message”.
b) Once the “message” of the tables of a specific web-site is
determined, it is possible to infer the main entities that this
web-site talks about.
c) Once the relevant entities of a web-site are detected, it is
plausible to recognize prominent relationships between these
entities. Thus, we will be able to link data between the chief
entities of a web-site.
d) Once predominant relations and entities of a web-site are
determined, it is possible to link data between different
web-sites.
Future work – Open problems II
Extracting RDF from Wikipedia tables (not only infoboxes).
Relation extraction – all kinds of relations.
Taxonomy definition.
Other levels, such as rankings and definitions.
Complex table understanding.
Table integration.
Protagonist detection for web-tables.
References I
If you want to go further
[CHW+08] Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang.
Webtables: exploring the power of tables on the web.
PVLDB, 1(1):538–549, 2008.
[CP10a] Eric Crestan and Patrick Pantel.
A fine-grained taxonomy of tables on the web.
In Jimmy Huang, Nick Koudas, Gareth J. F. Jones, Xindong Wu, Kevyn Collins-Thompson, and Aijun An,
editors, CIKM, pages 1405–1408. ACM, 2010.
[CP10b] Eric Crestan and Patrick Pantel.
Web-scale knowledge extraction from semi-structured tables.
In Proceedings of the 19th international conference on World wide web, WWW ’10, pages 1081–1082, New
York, NY, USA, 2010. ACM.
[CP11] Eric Crestan and Patrick Pantel.
Web-scale table census and classification.
In Irwin King, Wolfgang Nejdl, and Hang Li, editors, WSDM, pages 545–554. ACM, 2011.
[MFJ11] Varish Mulwad, Tim Finin, and Anupam Joshi.
Automatically Generating Government Linked Data from Tables.
In Working notes of AAAI Fall Symposium on Open Government Knowledge: AI Opportunities and
Challenges. November 2011.
[MFSJ10] Varish Mulwad, Tim Finin, Zareen Syed, and Anupam Joshi.
Using linked data to interpret tables.
In Proceedings of the First International Workshop on Consuming Linked Data, November 2010.
References II
If you want to go further
[PCS+07] Aleksander Pivk, Philipp Cimiano, York Sure, Matjaz Gams, Vladislav Rajkovic, and Rudi Studer.
Transforming arbitrary tables into logical form with TARTAR.
Data Knowl. Eng., 60(3):567–595, 2007.
[SFMJ10] Zareen Syed, Tim Finin, Varish Mulwad, and Anupam Joshi.
Exploiting a Web of Semantic Data for Interpreting Tables.
In Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, Raleigh NC, USA, April
26–27th 2010.
[TI06] Masahiro Tanaka and Toru Ishida.
Ontology Extraction from Tables on the Web.
In Proceedings of the International Symposium on Applications on Internet, pages 284–290, Washington
DC, USA, 2006.
[VHM+11] Petros Venetis, Alon Y. Halevy, Jayant Madhavan, Marius Pasca, Warren Shen, Fei Wu, Gengxin Miao, and
Chung Wu.
Recovering semantics of tables on the web.
PVLDB, 4(9):528–538, 2011.
[WH02] Yalin Wang and Jianying Hu.
A Machine Learning Based Approach for Table Detection on the Web.
In Proceedings of the 11th Int’l Conf. on World Wide Web (WWW’02), pages 242–250. ACM Press,
2002.
[YT01] Minoru Yoshida and Kentaro Torisawa.
Extracting Ontologies from World Wide Web via HTML Tables.
In Proceedings of the Pacific Association for Computational Linguistics (PACLING 2001), pages 332–341,
2001.
References III
If you want to go further
[YTL11] Xiaoxin Yin, Wenzhao Tan, and Chao Liu.
FACTO: a fact lookup engine based on web tables.
In Proceedings of the 20th international conference on World wide web, WWW ’11, pages 507–516, New
York, NY, USA, 2011. ACM.
[YTT01] Minoru Yoshida, Kentaro Torisawa, and Jun’ichi Tsujii.
A method to integrate tables of the World Wide Web.
In Proceedings of the International Workshop on Web Document Analysis (WDA 2001), pages 31–34,
2001.
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Dernier (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Understanding Web Tables and Building a Taxonomy for 2xN Tables

  • 1. Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy for 2xn Tables. Emir Muñoz, MSc. (emir.munoz@deri.org). Galway, Ireland – 19 July 2012.
  • 2. Outline: 1 Introduction; 2 WTT-Detection; 3 WTT-Interpretation; 4 On-going work; 5 Future work.
  • 3. Introduction I. Tables: They are used as a compact and efficient way to present relational information. They are inherently concise as well as information-rich. The automatic understanding of tables has many applications, including knowledge management, information retrieval, Web and text mining, summarization, and content delivery to mobile devices. It is of interest for domains such as medicine, health care, finance, e-science (e.g., biotechnology), and public policy.
  • 4. Introduction II. Web-tables (WTT) examples.
  • 5. Introduction III. Table understanding in Web documents includes [WH02]: table detection, functional and structural analysis, and table interpretation. Cafarella et al. [CHW+08] estimated that there are around 14.1 billion HTML tables, of which 154 million contain high-quality relational data. This represents a large source of knowledge, yet we do not have systems that can understand and exploit this knowledge properly.
  • 6. Outline: 1 Introduction; 2 WTT-Detection; 3 WTT-Interpretation; 4 On-going work; 5 Future work.
  • 7. Table detection I. In practice, tables are not only used to present relational information; they are also used to create multiple-column layouts to facilitate easy viewing. The presence of the HTML tag <table> does not guarantee a relational table or, more generally, a table with content. [WH02], an ML approach for table detection: Wang and Hu discriminated genuine from non-genuine tables on the grounds of their content. They checked whether the tables contain logical relations among the cells or are just used as a mechanism for grouping content. To do so, they used a tree classifier.
  • 8. Table detection II. In genuine (relational) tables there are logical relations among the cells. Non-genuine (non-relational) tables are used as a mechanism for grouping content.
  • 9. Table detection: [WH02] I. A Machine Learning Based Approach for Table Detection on the Web. They define weights derived from the traditional tf*idf measures used in IR, and define similarity based on the vector space model. Their initial database contains a total of 2,851 pages harvested from the Google directory and News, from around 200 web sites, using predefined keywords known to have a higher chance of recalling genuine tables. They selected 1,393 pages from this database at random (11,477 <table> nodes). For training they used 9-fold cross-validation. They experimented with decision trees and SVMs for separating genuine and non-genuine tables.
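The slide does not give the exact weighting used in [WH02], so the following is only a minimal sketch of the general idea: textbook tf*idf weights and cosine similarity in the vector space model. The function names and the toy token lists are assumptions made for illustration, not part of the original work.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (textbook tf*idf)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc) or 1  # avoid division by zero on empty documents
        vectors.append({t: (c / length) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy "cell contents" of three tables (made up for illustration).
docs = [
    ["stock", "price", "volume"],
    ["stock", "price", "change"],
    ["home", "about", "contact"],
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # tables sharing vocabulary
sim_unrelated = cosine(vecs[0], vecs[2])  # tables with no shared terms
```

Under these assumptions, the two stock tables obtain a positive similarity while the navigation-style table has similarity zero with them.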
  • 10. Table detection: [WH02] II. A Machine Learning Based Approach for Table Detection on the Web. Of the 11,477 tables, 1,740 are genuine (15.16%) and 9,737 are non-genuine (84.84%). The reported results are R = 94.25%, P = 97.50%, F = 95.88%. (The pages were obtained by querying Google using keywords like “table”, “stock”, “weather”.)
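As a quick sanity check on these figures, the sketch below recomputes the class shares and the balanced F-measure (harmonic mean of precision and recall). Computing F from the rounded P and R gives roughly 95.85%, slightly below the reported 95.88%; one possible explanation, which is only an assumption here, is that the reported value was averaged over cross-validation folds.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


genuine, non_genuine = 1740, 9737
total = genuine + non_genuine                    # number of <table> nodes
genuine_share = 100 * genuine / total            # percentage of genuine tables
non_genuine_share = 100 * non_genuine / total    # percentage of non-genuine tables

# F computed from the rounded precision and recall given on the slide.
f = 100 * f_measure(0.9750, 0.9425)
```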
  • 11. Table detection: [CP10b] I. Web-Scale Knowledge Extraction from Semi-Structured Tables. Tables called attribute/value (A-V) tables. They propose a classification algorithm for recognizing layout tables and attribute/value tables. In their work, they adopted the Gradient Boosted Decision Tree classification model, with classes ATTRIBUTE/VALUE, LAYOUT, and OTHER (e.g., calendars, forms, enumerations).
  • 12. Table detection: [CP10b] II. Web-Scale Knowledge Extraction from Semi-Structured Tables.
  • 13. Table detection: [CP10b] III. Web-Scale Knowledge Extraction from Semi-Structured Tables. Tables list attributes but rarely contain the subject in the table proper. Their focus is on detecting the subject of the table. They call this open research problem protagonist detection. The relational tables considered in their work encode facts, or semantic triples of the form <p, s, o>. There are three different places where the protagonist could be found: a) within the table (occasionally, under a generic attribute such as name or model); b) within the document or the HTML <title> tag; and c) in anchor texts, which offer well-defined boundaries for identifying protagonist candidates, whereas the document body offers fewer clues.
  • 14. Table detection: [CP10b] IV. Web-Scale Knowledge Extraction from Semi-Structured Tables.
  • 15. Table detection: [CP10a, CP11] I. Web-scale Table Census and Classification. They extend their previous work, proposing a much finer-grained table-type classification, and report an overall accuracy of 75.2%. From a total of 1.2 billion documents, they extracted 8.2 billion tables (2.6 billion unique tables). In detail, 75% of the pages contain at least one table, with an average of 9.1 tables per document. In preliminary experiments on identifying the protagonist of A-V tables, they use an N-gram-based approach that relies on a commercial search engine’s web link graph. The correct protagonist appears among the top-20 ranked candidates in 90% of the cases, and among the top-3 in 79% of the cases.
  • 16. Table detection: [CP10a, CP11] II. Table classes: [CP10a, CP11] propose the following table-type taxonomy. (This proposal and others are based only on the syntactic structure of tables.)
  • 17. Outline: 1 Introduction; 2 WTT-Detection; 3 WTT-Interpretation; 4 On-going work; 5 Future work.
  • 18. WTT-Interpretation I. Recovering Table Semantics. Some works focus on mapping spreadsheets into RDF, but such systems require human intervention. [MFSJ10] proposed an approach that uses linked data to interpret tables and associate their components with nodes in a reference linked data collection, which provides general-purpose knowledge as well as specific facts about significant people, places, organizations, events and many other entities of interest. [SFMJ10] used RDF for exporting and encoding the information embodied in tables. They describe techniques to automatically infer a (partial) semantic model for the information in tables using both the table headings, if available, and the values stored in table cells. The techniques have been prototyped for a subset of linked data that covers the core of Wikipedia.
  • 19. WTT-Interpretation II. Recovering Table Semantics.
      City          Mayor         State  Population
      Boston        T. Menino     MA     610,000
      New York      M. Bloomberg  NY     8,400,000
      Philadelphia  M. Nutter     PA     1,500,000
      Baltimore     S. Dixon      MD     640,000
      Washington    A. Fenty      DC     595,000

      @prefix dbp: <http://dbpedia.org/resource/> .
      @prefix dbpo: <http://dbpedia.org/ontology/> .
      @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
      @prefix cyc: <http://www.cyc.com/2004/06/04/cyc#> .
      dbp:Boston dbpo:leaderName dbp:Thomas_Menino ;
          cyc:partOf dbp:Massachusetts ;
          dbpo:populationTotal "610000"^^xsd:integer .
      dbp:New_York_City ...
      ...
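To make the mapping from a table row to triples concrete, here is a minimal sketch that serializes the Boston row in the Turtle form used on this slide. The dictionaries `COLUMN_PREDICATES` and `ENTITY_LINKS` are hypothetical stand-ins for the output of the interpretation step (column-to-predicate mapping and entity linking); the actual systems infer them automatically. Treating every unlinked cell as an integer literal is also a simplifying assumption.

```python
# Hypothetical result of interpreting the column headers.
COLUMN_PREDICATES = {
    "Mayor": "dbpo:leaderName",
    "State": "cyc:partOf",
    "Population": "dbpo:populationTotal",
}

# Hypothetical result of linking cell strings to DBpedia resources.
ENTITY_LINKS = {
    "Boston": "dbp:Boston",
    "T. Menino": "dbp:Thomas_Menino",
    "MA": "dbp:Massachusetts",
}


def row_to_turtle(header, row, subject_column="City"):
    """Serialize one table row as a Turtle statement about the subject cell."""
    cells = dict(zip(header, row))
    subject = ENTITY_LINKS[cells[subject_column]]
    parts = []
    for column, predicate in COLUMN_PREDICATES.items():
        value = cells[column]
        if value in ENTITY_LINKS:
            obj = ENTITY_LINKS[value]  # linked entity becomes a resource
        else:
            # Assumption: unlinked cells are integers with thousands separators.
            obj = '"{}"^^xsd:integer'.format(value.replace(",", ""))
        parts.append("{} {}".format(predicate, obj))
    return "{} {} .".format(subject, " ;\n    ".join(parts))


header = ["City", "Mayor", "State", "Population"]
turtle = row_to_turtle(header, ["Boston", "T. Menino", "MA", "610,000"])
print(turtle)
```

The subject column is passed explicitly here; deciding which column holds the subject is itself part of the interpretation problem.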
  • 20. WTT-Interpretation III. Recovering Table Semantics. When predicting entity classes in a column, [SFMJ10] used DBpedia (85.71%), Yago (71.42%), WordNet (71.42%) and Freebase (90.47%). Entity types and their correct prediction rates: Places (61.64%), Persons (90.76%) and Organizations (66.67%). To describe relations between columns in a table, they take all pairs of entities in the same row (already linked to Wikipedia) and query DBpedia for the set of relations:
      http://dbpedia.org/ontology/largestCity
      http://dbpedia.org/ontology/PopulatedPlace/largestCity
      http://dbpedia.org/ontology/capital
      http://dbpedia.org/ontology/PopulatedPlace/capital
      http://dbpedia.org/property/capital
      http://dbpedia.org/property/largestcity
  • 21. WTT-Interpretation IV. Recovering Table Semantics. The relation that appears the largest number of times is the one selected. The evaluation test set is very small: just 5 tables taken from Google Squared. Another example, about basketball players:
      Name            Team          Position
      Michael Jordan  Chicago       Shooting guard
      Allen Iverson   Philadelphia  Point guard
      Yao Ming        Houston       Center
      Tim Duncan      San Antonio   Power forward
    It is important to discover relations between the table columns, and not only binary relations. [MFJ11] also analyzed government linked data.
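The selection rule described here (keep the relation returned most often across the rows) can be sketched as a simple vote. The per-row candidate sets below are made up for illustration; in the described approach they would come from querying DBpedia for each pair of linked entities. The alphabetical tie-break is an assumption added only to make the result deterministic.

```python
from collections import Counter


def select_relation(candidates_per_row):
    """Pick the relation that is proposed by the largest number of rows."""
    votes = Counter()
    for relations in candidates_per_row:
        votes.update(set(relations))  # each row votes at most once per relation
    if not votes:
        return None
    # Ties are broken alphabetically (an assumption, not from the source).
    return max(sorted(votes), key=votes.get)


# Hypothetical query results for three rows of a two-column table.
candidates = [
    {"http://dbpedia.org/ontology/capital", "http://dbpedia.org/ontology/largestCity"},
    {"http://dbpedia.org/ontology/capital"},
    {"http://dbpedia.org/ontology/capital", "http://dbpedia.org/ontology/largestCity"},
]
best = select_relation(candidates)
```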
  • 22. WTT-Interpretation I. WTT as a very large repository of facts. [YTT01] focused on a probabilistic method to integrate tables according to the category of objects represented in each table (performing attribute clustering). [YT01, TI06] proposed methods for ontology extraction from web tables using the relations represented by structures in the table. (The table structures have to be provided by humans.) An IR approach presented in [YTL11] extracts structured data from WTT, aggregates and cleans the data, and stores them in a database, creating a very large repository of entity-attribute-value triples.
  • 23. WTT-Interpretation II. WTT as a very large repository of facts. How does this work? A good example is the query “Saint Patrick’s day”: any search engine could directly show “17 March” within its top-ranked results. http://en.wikipedia.org/wiki/Public_holidays_in_the_Republic_of_Ireland
  • 24. WTT-Interpretation III. WTT as a very large repository of facts. Hypothesis: recovering semantics leads to better search and quality filtering. Some stated problems: Take, for instance, a table about trees and a piece of text like “...North America species such as Green Ash...”. From the WTT we could infer that “Green Ash” is a species of tree, a.k.a. “Fraxinus pennsylvanica”. Use schema statistics to automatically compute attribute synonyms (more complete than a thesaurus), e.g., e-mail = email, phone = telephone, e-mail address = email address, date = last-modified. It is still necessary to recover large fractions of binary relationships, and techniques are needed for recovering numerical relationships (e.g., population, GDP) [VHM+11].
  • 25. Outline: 1 Introduction; 2 WTT-Detection; 3 WTT-Interpretation; 4 On-going work; 5 Future work.
  • 26. On-going work I. Introduction. Our initial aim was to understand relational tables. We parse HTML pages to extract HTML tables using the NekoHTML library. We have a corpus comprising 8.2 billion tables. A table is parsed as a matrix using TARTAR [PCS+07], dealing with the cell spans. We manually annotated 14,695 randomly chosen tables: 10,923 content-poor (74.33%) and 3,785 content-rich (25.76%) tables. We built a predictor of content-poor and content-rich tables using the same features as [CP11], a max-entropy model, and 10-fold cross-validation.
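The deck uses NekoHTML and TARTAR for this step. As an illustration of what "parsing a table as a matrix while dealing with cell spans" involves, here is a small stand-in sketch using only Python's standard `html.parser`. It is not the authors' pipeline: the class and function names are hypothetical, nested tables are not handled, and span attributes are assumed to be valid integers.

```python
from html.parser import HTMLParser


class TableToMatrix(HTMLParser):
    """Collects table cells row by row, keeping their rowspan/colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []       # each row: list of (text, rowspan, colspan)
        self._cell = None    # text fragments of the open cell, or None
        self._span = (1, 1)  # (rowspan, colspan) of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._span = (int(a.get("rowspan") or 1), int(a.get("colspan") or 1))
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # normalize whitespace
            self.rows[-1].append((text,) + self._span)
            self._cell = None


def expand_spans(rows):
    """Copy each spanning cell into every grid position it covers."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # skip positions filled by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


html = """
<table>
  <tr><th colspan="2">Contact</th></tr>
  <tr><td rowspan="2">Phone</td><td>555-0100</td></tr>
  <tr><td>555-0101</td></tr>
</table>
"""
parser = TableToMatrix()
parser.feed(html)
matrix = expand_spans(parser.rows)
```

Duplicating the spanning cell's text into each covered position is one possible convention; leaving the extra positions empty would be another.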
  • 27. On-going work II. Introduction. From a set of 115 features, we selected 19 via a greedy property selection algorithm. These 19 features achieved an accuracy of 89.46%. Most important features: presence of the <select> tag in a column; distinct strings in the 1st column; distinct tags in a column; distinct tags in a row; non-empty cells in columns or rows; presence of links; presence of a colon “:”; presence of line breaks <br>; presence of input fields (HTML); presence of numbers in a row; presence of the <th> tag.
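The slides do not say which greedy selection procedure was used. The sketch below shows one common variant, greedy forward selection, under that assumption: starting from an empty set, it repeatedly adds the feature that most improves a score and stops when nothing helps. The `toy_score` function and its numbers are invented for illustration; in practice the score would be something like cross-validated accuracy of the classifier.

```python
def greedy_forward_selection(features, score, max_features):
    """Add, one at a time, the feature that most improves score(subset)."""
    selected = []
    best = score(selected)
    while len(selected) < max_features:
        trials = [(score(selected + [f]), f) for f in features if f not in selected]
        if not trials:
            break
        candidate_score, candidate = max(trials)
        if candidate_score <= best:  # no remaining feature helps: stop
            break
        selected.append(candidate)
        best = candidate_score
    return selected, best


# Made-up contributions: three helpful features, everything else slightly harmful.
USEFUL = {"colon": 0.10, "th_tag": 0.06, "links": 0.03}


def toy_score(subset):
    """Stand-in for a cross-validated accuracy estimate."""
    return round(0.70 + sum(USEFUL.get(f, -0.01) for f in subset), 4)


features = ["colon", "th_tag", "links", "noise_a", "noise_b"]
selected, accuracy = greedy_forward_selection(features, toy_score, max_features=19)
```

With these toy numbers the procedure keeps the three helpful features and stops before adding either noise feature.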
  • 28. On-going work III. Introduction. Our aim is to propose a content-based taxonomy for WTT, instead of the previous ones based on syntactic structure. We are now developing a 2xn table predictor with content-focused classes.
  • 29. On-going work I. Why this work? Why a new taxonomy? [WH02] classify WTT as genuine or non-genuine tables. [CHW+08] classify WTT as relational or non-relational tables. Crestan and Pantel’s taxonomy is a more general-purpose taxonomy for tables, focused on the syntax of the tables, not on their semantics. Intuitively, not all the classes of the taxonomy of [CP11] are useful, and only the A-V class is used. Moreover, what does it mean to say that a table is A-V? An A-V table could have spatio-temporal attributes or universal facts, or even describe a person, a company or a product. All the previous approaches need a bit more focus on the “message” of the tables.
  • 30. On-going work II. Why this work? Why are 2xn tables important? The 2xn class is larger than the A-V class. They are about 20% of all tables on the Web. Previously, A-V tables were identified by the presence of “:” (colon). We hope that extending the research to 2xn tables will uncover those A-V tables that are not indicated by the “colon rule”.
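To illustrate why the 2xn shape can catch attribute/value tables that the “colon rule” misses, the sketch below implements both checks on a parsed table matrix. Both definitions are assumptions: “2xn” is taken to mean two columns and n rows, and the colon rule is approximated as “at least half of the first-column cells end with a colon” (the `min_ratio` threshold is hypothetical).

```python
def is_2xn(matrix):
    """Assumed definition: a non-empty table in which every row has two cells."""
    return len(matrix) > 0 and all(len(row) == 2 for row in matrix)


def follows_colon_rule(matrix, min_ratio=0.5):
    """Heuristic: enough first-column cells end with ':' (threshold is assumed)."""
    if not matrix:
        return False
    hits = sum(1 for row in matrix if row and row[0].rstrip().endswith(":"))
    return hits / len(matrix) >= min_ratio


with_colon = [["Name:", "Ada"], ["Email:", "ada@example.org"]]
without_colon = [["Name", "Ada"], ["Email", "ada@example.org"]]
layout = [["Home", "About", "Contact"]]

# Both attribute/value tables are 2xn, but only one satisfies the colon rule.
shape_checks = [is_2xn(t) for t in (with_colon, without_colon, layout)]
colon_checks = [follows_colon_rule(t) for t in (with_colon, without_colon, layout)]
```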
  • 31. On-going work. Proposed Taxonomy. We introduce new classes that could be important, e.g., for use with ontologies (e.g., FOAF) in a search engine.
  • 32. On-going work. WTT examples according to our taxonomy.
  • 33. On-going work. Distribution. We are manually tagging 174,748 unique WTT. The distribution per class so far is:
      Class                        %
      Social networks              34.2%
      Spatio-temporal information  28.9%
      Products                     28.4%
      Resources                    4.3%
      Universal facts              3.2%
      Other                        1.0%
      Events                       0.03%
  • 34. Outline: 1 Introduction; 2 WTT-Detection; 3 WTT-Interpretation; 4 On-going work; 5 Future work.
  • 35. Future work – Open problems I. Tables in web pages can be used to model the data of web sites, in particular their main entities and the key relations among them. This entails: a) Web tables bear syntactic and semantic information that is useful for determining what they are talking about. Thus, patterns across web tables can be exploited to automatically understand their “message”. b) Once the “message” of the tables of a specific web site is determined, it is possible to infer the main entities that this web site talks about. c) Once the relevant entities of a web site are detected, it is plausible to recognize prominent relationships between these entities. Thus, we will be able to link data between the chief entities of a web site. d) Once the predominant relations and entities of a web site are determined, it is possible to link data between different web sites.
  • 36. Future work – Open problems II. Extracting RDF from Wikipedia tables (not only infoboxes). Relation extraction (all kinds of relations). Taxonomy definition. Other levels, such as rankings and definitions. Complex table understanding. Table integration. Protagonist detection for web tables.
  • 37. References I. If you want to go further:
      [CHW+08] Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. Webtables: exploring the power of tables on the web. PVLDB, 1(1):538–549, 2008.
      [CP10a] Eric Crestan and Patrick Pantel. A fine-grained taxonomy of tables on the web. In Jimmy Huang, Nick Koudas, Gareth J. F. Jones, Xindong Wu, Kevyn Collins-Thompson, and Aijun An, editors, CIKM, pages 1405–1408. ACM, 2010.
      [CP10b] Eric Crestan and Patrick Pantel. Web-scale knowledge extraction from semi-structured tables. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 1081–1082, New York, NY, USA, 2010. ACM.
      [CP11] Eric Crestan and Patrick Pantel. Web-scale table census and classification. In Irwin King, Wolfgang Nejdl, and Hang Li, editors, WSDM, pages 545–554. ACM, 2011.
      [MFJ11] Varish Mulwad, Tim Finin, and Anupam Joshi. Automatically Generating Government Linked Data from Tables. In Working notes of AAAI Fall Symposium on Open Government Knowledge: AI Opportunities and Challenges, November 2011.
      [MFSJ10] Varish Mulwad, Tim Finin, Zareen Syed, and Anupam Joshi. Using linked data to interpret tables. In Proceedings of the First International Workshop on Consuming Linked Data, November 2010.
  • 38. References II. If you want to go further:
      [PCS+07] Aleksander Pivk, Philipp Cimiano, York Sure, Matjaz Gams, Vladislav Rajkovic, and Rudi Studer. Transforming arbitrary tables into logical form with TARTAR. Data Knowl. Eng., 60(3):567–595, 2007.
      [SFMJ10] Zareen Syed, Tim Finin, Varish Mulwad, and Anupam Joshi. Exploiting a Web of Semantic Data for Interpreting Tables. In Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, Raleigh NC, USA, April 26–27th 2010.
      [TI06] Masahiro Tanaka and Toru Ishida. Ontology Extraction from Tables on the Web. In Proceedings of the International Symposium on Applications on Internet, pages 284–290, Washington DC, USA, 2006.
      [VHM+11] Petros Venetis, Alon Y. Halevy, Jayant Madhavan, Marius Pasca, Warren Shen, Fei Wu, Gengxin Miao, and Chung Wu. Recovering semantics of tables on the web. PVLDB, 4(9):528–538, 2011.
      [WH02] Yalin Wang and Jianying Hu. A Machine Learning Based Approach for Table Detection on the Web. In Proceedings of the 11th Int’l Conf. on World Wide Web (WWW’02), pages 242–250. ACM Press, 2002.
      [YT01] Minoru Yoshida and Kentaro Torisawa. Extracting Ontologies from World Wide Web via HTML Tables. In Proceedings of the Pacific Association for Computational Linguistics (PACLING 2001), pages 332–341, 2001.
  • 39. References III. If you want to go further:
      [YTL11] Xiaoxin Yin, Wenzhao Tan, and Chao Liu. FACTO: a fact lookup engine based on web tables. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pages 507–516, New York, NY, USA, 2011. ACM.
      [YTT01] Minoru Yoshida, Kentaro Torisawa, and Jun’ichi Tsujii. A method to integrate tables of the World Wide Web. In Proceedings of the International Workshop on Web Document Analysis (WDA 2001), pages 31–34, 2001.