Ekaw2014 ziqi zhang

Learning with Partial Data for Semantic Table Interpretation
Ziqi Zhang
Department of Computer Science, University of Sheffield

Semantic Table Interpretation
•Input
• Ontology
• Relational table
•Goals/Tasks
• Column – classes/concepts
• Cell – named entities
• Column, Column – relation
[Figure: a relational table annotated against an ontology]
Ontology (excerpt): Thing; Company; Work; Time Period; … …; VideoGameCompany; Video Game; Year
Entities (excerpt): Ent:2kGames; Ent:THQ; …
Table of video games (PC):
      Name              Publisher   Year
1     Gears of War      Microsoft   2006
2     Civilization IV   2k Games    2006
3     Titan Quest       THQ         2006
< … … >
99    Civilization V    2k Games    2010
Relation annotation between columns: Rel:publishedBy
Z. Zhang / Learning with Partial Data for Semantic Table Interpretation

Motivation
•State-of-the-art (SoA) semantic table interpretation methods, e.g. [Limaye2010, Venetis2011, Mulwad2013]
•Limitation: the algorithms are ‘exhaustive’, which is unnecessary
•Goal: assign a concept to a column
•Hint: the content of the column gives useful clues
•How much of the content is needed for inference (99 rows in this example)?
• Human: SOME (learn by example)
• SoA: ALL
      Name              Publisher   Year
1     Gears of War      Microsoft   2006
2     Civilization IV   2k Games    2006
3     Titan Quest       THQ         2006
< … … >
99    Civilization V    2k Games    2010

Research Questions
•Can machines ‘learn by example’
• inference using only partial data (sample)
• achieving good accuracy
•How to choose a sample
• does it matter (e.g., in terms of accuracy)
• how to optimize
•TableMiner: Zhang, Z. (2014). Towards efficient and effective semantic table interpretation. In Proceedings of the 13th International Semantic Web Conference, 487-502
•Sample Selection (contribution of this work)

Method

TableMiner (modified)
•Incremental inference (I-Inf) to address two tasks
• Column classification
•Using some data in the column
• Cell disambiguation
•Using column label to constrain disambiguation

•Incremental inference (I-Inf)
• Notation: Tj – a column; Cj – candidate concepts for the column; Ei,j – candidate entities for a cell
• Cells are processed one at a time (1, 2, 3, … …) until Cj changes little (convergence)
• Cj = {<c1,s1’>, <c2,s2’>, <c3,s3’>, …, <c11,s11’>}
• The column label (class) is used as a constraint in selecting candidate entities for disambiguation
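
The incremental inference idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not TableMiner's implementation: the function names, the toy concept lookup, the entropy-based convergence check, the `threshold` and `min_cells` parameters, and the class membership of the toy entities are all hypothetical. Each `s` in Cj is assumed to be a numeric score.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple


def entropy(scores: Dict[str, float]) -> float:
    """Shannon entropy of the scores after normalising them to sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        return 0.0
    return -sum((s / total) * math.log(s / total)
                for s in scores.values() if s > 0)


def incremental_classify(
    cells: Iterable[str],
    candidate_concepts: Callable[[str], List[Tuple[str, float]]],
    threshold: float = 0.01,   # assumed convergence tolerance
    min_cells: int = 2,        # assumed minimum number of cells to look at
) -> Tuple[Dict[str, float], int]:
    """Update concept scores cell by cell; stop once they change little."""
    scores: Dict[str, float] = defaultdict(float)
    prev_h = 0.0
    used = 0
    for cell in cells:
        for concept, score in candidate_concepts(cell):
            scores[concept] += score
        used += 1
        h = entropy(scores)
        # Convergence test (an assumption): the entropy of Cj barely moved.
        if used >= min_cells and abs(h - prev_h) < threshold:
            break
        prev_h = h
    return dict(scores), used


def constrain_candidates(
    candidates: List[Tuple[str, Set[str]]], column_class: str
) -> List[str]:
    """Keep only candidate entities that belong to the column's class."""
    return [entity for entity, classes in candidates if column_class in classes]


# Toy lookup (hypothetical): every publisher cell votes for the same concepts.
def toy_concepts(cell: str) -> List[Tuple[str, float]]:
    return [("VideoGameCompany", 1.0), ("Company", 0.5)]


publisher_cells = ["Microsoft", "2k Games", "THQ", "2k Games"]
scores, used = incremental_classify(publisher_cells, toy_concepts)
best = max(scores, key=scores.get)

# Toy candidate entities for the cell "THQ"; class memberships are made up.
toy_candidates = [("Ent:THQ", {"VideoGameCompany", "Company"}),
                  ("toy:other_sense", {"Work"})]
kept = constrain_candidates(toy_candidates, best)
```

In this toy run the scores stabilise after two cells, so the remaining cells of the column are never looked up, and the winning class then prunes the candidate entities for disambiguation.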

Sample Selection – the Principle
•‘Order matters’
• TableMiner processes data in order until convergence
• Changing the order means
•(Possibly) Different convergence speed
•Different data are processed
•Change the order of cells in a column (and corresponding row) such that
• cells that are ‘easier’ to disambiguate come to the top
•because the class for a column depends on cells in the column

Sample Selection – ‘name length’ hypothesis
•Longer names are easier to disambiguate than shorter names
• e.g., “Manchester” vs. “Manchester United F.C.”
•Method ‘name length’ (nl)
• nl(Ti,j) = # of tokens in cell Ti,j
• Re-order table rows by sorting on column Tj using nl(Ti,j)
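
As a rough illustration of the nl method, the sketch below sorts the example table on its Name column. Whitespace tokenisation, the descending sort direction (longest names first, so that the 'easier' cells come to the top) and the function names are assumptions made for this example.

```python
from typing import List, Sequence

Row = Sequence[str]


def name_length(cell_text: str) -> int:
    """nl: number of whitespace-separated tokens in a cell (assumed tokeniser)."""
    return len(cell_text.split())


def reorder_by_name_length(rows: List[Row], col: int) -> List[Row]:
    """Sort rows so that cells with longer names in column `col` come first."""
    # sorted() is stable, so rows with equal nl keep their original order.
    return sorted(rows, key=lambda r: name_length(r[col]), reverse=True)


rows = [
    ("Gears of War", "Microsoft", "2006"),
    ("Civilization IV", "2k Games", "2006"),
    ("Titan Quest", "THQ", "2006"),
    ("Civilization V", "2k Games", "2010"),
]
reordered = reorder_by_name_length(rows, col=0)
names = [r[0] for r in reordered]
```

Whole rows are moved rather than single cells, which keeps each cell's row context intact after re-ordering.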

Sample Selection – ‘feature density’ hypothesis
•Names that have a richer feature representation are easier to disambiguate
• Bag-of-words (B.O.W.) representation using row context
• ‘one-sense-per-discourse’ (ospd) in non-subject columns

•Method ‘duplicate content cell’ (dup)
• re-arrange the target column and table following ospd
• dup(Ti,j) = # of times text of Ti,j is duplicated in column Tj
• Re-order table rows by sorting on column Tj using dup(Ti,j)
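
A minimal sketch of the dup method follows, assuming that cell texts are compared after trimming and lower-casing and that rows with the most repeated text are placed first. Both choices, and the function names, are assumptions for illustration.

```python
from collections import Counter
from typing import List, Sequence

Row = Sequence[str]


def normalise(text: str) -> str:
    """Assumed normalisation used to decide whether two cells share a text."""
    return text.strip().lower()


def reorder_by_dup(rows: List[Row], col: int) -> List[Row]:
    """Sort rows by how often the text of column `col` occurs in that column."""
    # Occurrence count; using count - 1 ("duplicates") gives the same ordering.
    counts = Counter(normalise(r[col]) for r in rows)
    return sorted(rows, key=lambda r: counts[normalise(r[col])], reverse=True)


rows = [
    ("Gears of War", "Microsoft", "2006"),
    ("Civilization IV", "2k Games", "2006"),
    ("Titan Quest", "THQ", "2006"),
    ("Civilization V", "2k Games", "2010"),
]
reordered = reorder_by_dup(rows, col=1)
publishers = [r[1] for r in reordered]
```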

•Method ‘feature representation size’ (rep)
• re-arrange the target column and table following ospd
• rep(Ti,j) = # of tokens in the B.O.W. representation of Ti,j
• Re-order table rows by sorting on column Tj using rep(Ti,j)
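
The rep method can be sketched in the same way. The assumptions here are that the bag of words for a cell is built from the other cells in its row, that cells with identical text in the target column pool their row contexts (following ospd), and that rep counts all tokens in the pooled bag, including repeats. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence

Row = Sequence[str]


def bow_by_text(rows: List[Row], col: int) -> Dict[str, Counter]:
    """Pool row-context tokens for every distinct text in column `col`."""
    bags: Dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        for j, cell in enumerate(row):
            if j != col:  # the cell itself is not part of its own context
                bags[row[col]].update(cell.lower().split())
    return bags


def reorder_by_rep(rows: List[Row], col: int) -> List[Row]:
    """Sort rows so that cells with larger B.O.W. representations come first."""
    rep = {text: sum(bag.values()) for text, bag in bow_by_text(rows, col).items()}
    return sorted(rows, key=lambda r: rep[r[col]], reverse=True)


rows = [
    ("Gears of War", "Microsoft", "2006"),
    ("Civilization IV", "2k Games", "2006"),
    ("Titan Quest", "THQ", "2006"),
    ("Civilization V", "2k Games", "2010"),
]
bags = bow_by_text(rows, col=1)
reordered = reorder_by_rep(rows, col=1)
publishers = [r[1] for r in reordered]
```

Here "2k Games" pools the context of two rows, so it ends up with the largest representation and moves to the top.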

Evaluation

Data
•Data
• Freebase as reference ontology/background knowledge
• Limaye200 – 200 Web tables from Limaye2010 originally annotated with Wikipedia
•Column classes are manually annotated
• LimayeAll – 6310 Web tables from Limaye2010
•Names in content cells are automatically mapped to Freebase

Settings
•Baseline
• TM_bs – TableMiner modified to use all cells in a column for column classification (everything else unchanged)
•Comparison*
• TM_mod^nl – TableMiner using the name length sample selection method
• TM_mod^dup – TableMiner using the duplicate content cell sample selection method
• TM_mod^rep – TableMiner using the feature representation size sample selection method
* The original TableMiner is modified. For details and other settings, see the paper.

Results
•Results in F1
                              TM_bs   TM_mod^nl   TM_mod^dup   TM_mod^rep
Classification (Limaye200)    72.1    72.3        72.0         72.1
Disambiguation (LimayeAll)    80.9    81.3        81.22        81.24
•Convergence speed in column classification
                              TM_bs   TM_mod^nl   TM_mod^dup   TM_mod^rep
Limaye200                     100%    36.3%       36.1%        35.3%
•Reduced candidate named entities for disambiguation
                              TM_bs   TM_mod^nl   TM_mod^dup   TM_mod^rep
Limaye200                     0       32.4%       48.1%        46.8%

Comparable or better accuracy
But uses only partial data for column classification
… and processes much less data for disambiguation

Conclusion
•Learning with partial data for semantic table interpretation can be both effective and efficient
•The choice of sample selection methods makes limited difference in terms of accuracy and efficiency

Thank you
@ziqizhang_zz http://staffwww.dcs.shef.ac.uk/people/Z.Zhang
