Semi-automated Exploration and Extraction of Data in Scientific Tables
EKAW 2014, Ziqi Zhang
1. Learning with Partial Data for Semantic Table Interpretation
Ziqi Zhang
Department of Computer Science, University of Sheffield
2. Semantic Table Interpretation
•Input
• Ontology
• Relational table
•Goals/Tasks
• Column – classes/concepts
• Cell – named entities
• Column, Column – relation
[Figure: a relational table annotated against an ontology]
Ontology concepts: Thing, Company, Work, Time Period, … …, VideoGameCompany, Video Game, Year
Ontology entities: Ent:2kGames, Ent:THQ, …
Relation: Rel:publishedBy
Table of video games (PC)
      Name              Publisher   Year
1     Gears of War      Microsoft   2006
2     Civilization IV   2k Games    2006
3     Titan Quest       THQ         2006
< … … >
99    Civilization V    2k Games    2010
Z. Zhang / Learning with Partial Data for Semantic Table Interpretation
3. Motivation
•State-of-the-art (SoA) semantic table interpretation methods, e.g. [Limaye2010, Venetis2011, Mulwad2013]
Limitation: the algorithms are ‘exhaustive’, but this is unnecessary
Goal: assign a concept to this column
Hint: content in the column gives useful clues
How much data do we need for inference (99 rows in this example)?
- Human: SOME (learn by example)
- SoA: ALL
      Name              Publisher   Year
1     Gears of War      Microsoft   2006
2     Civilization IV   2k Games    2006
3     Titan Quest       THQ         2006
< … … >
99    Civilization V    2k Games    2010
4. Research Questions
•Can machines ‘learn by example’?
• inference using only partial data (a sample)
• achieving good accuracy
•How to choose a sample?
• does it matter (e.g., in terms of accuracy)?
• how to optimize?
TableMiner: Zhang, Z. (2014). Towards efficient and effective semantic table interpretation. In Proceedings of the 13th International Semantic Web Conference, 487-502
Sample Selection (contribution of this work)
5. Method
6. TableMiner (modified)
•Incremental inference (I-Inf) to address two tasks
• Column classification
•Using some data in the column
• Cell disambiguation
•Using column label to constrain disambiguation
7. TableMiner (modified)
•Incremental inference (I-Inf)
• Tj – a column; Cj – candidate concepts for the column; Ei,j – candidate entities for a cell
8. TableMiner (modified)
[Figure: steps 1, 2, 3, … …] Until Cj changes little (convergence)
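The incremental loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the TableMiner implementation: the `candidate_concepts` lookup, the additive scoring, the `min_cells` guard, and the convergence test (largest change in normalised score below a threshold `eps`) are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def normalise(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores so they sum to 1 (an empty or all-zero input gives {})."""
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else {}


def incremental_classify(
    cells: Iterable[str],
    candidate_concepts: Callable[[str], Dict[str, float]],  # hypothetical lookup
    eps: float = 0.01,
    min_cells: int = 2,
) -> Tuple[List[Tuple[str, float]], int]:
    """Process cells in order, updating Cj until it changes little.

    Returns (Cj ranked by normalised score, number of cells processed).
    """
    scores: Dict[str, float] = defaultdict(float)
    previous: Dict[str, float] = {}
    processed = 0
    for cell in cells:
        processed += 1
        # Add this cell's evidence to the running concept scores.
        for concept, score in candidate_concepts(cell).items():
            scores[concept] += score
        current = normalise(scores)
        # Assumed convergence test: largest per-concept change since the last cell.
        keys = set(current) | set(previous)
        change = max(
            (abs(current.get(k, 0.0) - previous.get(k, 0.0)) for k in keys),
            default=0.0,
        )
        previous = current
        if processed >= min_cells and change < eps:
            break  # Cj changed little: stop before reading the remaining cells
    ranked = sorted(previous.items(), key=lambda kv: kv[1], reverse=True)
    return ranked, processed
```

With a lookup that returns the same single concept for every cell, the loop stops after the second cell, which is the "SOME, not ALL" behaviour the slides argue for.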
9. TableMiner (modified)
Cj = {<c1,s1′>, <c2,s2′>, <c3,s3′>, …, <c11,s11′>}
Column label (class) used as constraint in selecting candidate entities for disambiguation
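How a column label might constrain disambiguation can be illustrated with a small filter. The sketch assumes each candidate entity carries a set of class labels (`types`) and that the unfiltered list is kept when no candidate matches. Both are assumptions for illustration rather than details given on the slide, and `Ent:THQ_album` is a made-up competing candidate.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Candidate:
    """A candidate entity for a cell (hypothetical structure)."""
    entity_id: str
    types: Set[str] = field(default_factory=set)


def constrain_candidates(candidates: List[Candidate], column_class: str) -> List[Candidate]:
    """Keep only candidates whose types include the column's class label."""
    kept = [c for c in candidates if column_class in c.types]
    # Assumed fallback: if the constraint removes everything, keep all candidates.
    return kept or list(candidates)


candidates = [
    Candidate("Ent:THQ", {"VideoGameCompany", "Company"}),
    Candidate("Ent:THQ_album", {"Work"}),  # made-up competing candidate
]
filtered = constrain_candidates(candidates, "VideoGameCompany")
```

Filtering before scoring is what shrinks the candidate set, so fewer entities need to be compared for each cell.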
10. Sample Selection – the Principle
•‘Order matters’
• TableMiner processes data in order until convergence
• Changing the order means
•(Possibly) Different convergence speed
•Different data are processed
•Change the order of cells in a column (and corresponding row) such that
• cells that are ‘easier’ to disambiguate come to the top
•because the class for a column depends on cells in the column
11. Sample Selection – ‘name length’ hypothesis
•Longer names are easier to disambiguate than shorter names
• e.g., “Manchester” vs. “Manchester United F.C.”
•Method ‘name length’ (nl):
•nl(Ti,j) = # of tokens in cell Ti,j
•Re-order table rows by sorting on column Tj using nl(Ti,j)
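A minimal sketch of the nl re-ordering, assuming tokens are whitespace-separated and that longer names are placed first (the slides say 'easier' cells should come to the top). The function names are illustrative only.

```python
from typing import List, Sequence


def nl(cell_text: str) -> int:
    """Name length: number of whitespace-separated tokens in a cell (assumed tokenisation)."""
    return len(cell_text.split())


def reorder_by_name_length(rows: Sequence[Sequence[str]], j: int) -> List[Sequence[str]]:
    """Sort rows so that cells in column j with more tokens come first."""
    # sorted() is stable, so rows with equal nl keep their original relative order.
    return sorted(rows, key=lambda row: nl(row[j]), reverse=True)


rows = [
    ("Gears of War", "Microsoft", "2006"),
    ("Civilization IV", "2k Games", "2006"),
    ("Titan Quest", "THQ", "2006"),
]
reordered = reorder_by_name_length(rows, j=1)  # sort on the Publisher column
```

Whole rows are moved, not just the target cells, so each cell keeps its row context after re-ordering.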
12. Sample Selection – ‘feature density’ hypothesis
•Names that have a richer feature representation are easier to disambiguate
• Bag-of-words (B.O.W.) representation using row context
• ‘one-sense-per-discourse’ (ospd) in non-subject columns
13. Sample Selection – ‘feature density’ hypothesis
•Method ‘duplicate content cell’ (dup)
• re-arrange the target column and table following ospd
• dup(Ti,j) = # of times text of Ti,j is duplicated in column Tj
• Re-order table rows by sorting on column Tj using dup(Ti,j)
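The dup ordering could look roughly like the sketch below. It assumes exact string matching between cells, counts every occurrence of a text (including the cell itself), and puts the most frequently repeated texts first. The sort direction is inferred from the 'easier cells to the top' principle rather than stated on the slide.

```python
from collections import Counter
from typing import List, Sequence


def reorder_by_duplicates(rows: Sequence[Sequence[str]], j: int) -> List[Sequence[str]]:
    """Sort rows so that cells whose text recurs most often in column j come first."""
    # dup(Ti,j) for every distinct text in the column (assumed: exact match).
    counts = Counter(row[j] for row in rows)
    # Stable sort: rows with the same count keep their original relative order.
    return sorted(rows, key=lambda row: counts[row[j]], reverse=True)


rows = [
    ("Gears of War", "Microsoft", "2006"),
    ("Civilization IV", "2k Games", "2006"),
    ("Titan Quest", "THQ", "2006"),
    ("Civilization V", "2k Games", "2010"),
]
reordered = reorder_by_duplicates(rows, j=1)  # sort on the Publisher column
```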
14. Sample Selection – ‘feature density’ hypothesis
•Method ‘feature representation size’ (rep)
• re-arrange the target column and table following ospd
• rep(Ti,j) = # of tokens in the B.O.W. representation of Ti,j
• Re-order table rows by sorting on column Tj using rep(Ti,j)
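A sketch of the rep ordering under stated assumptions: the bag of words for a cell is built from the other cells in its row, rows sharing the same text in the target column pool their context (one reading of ospd), tokens are lowercased and whitespace-separated, and distinct tokens are counted. None of these choices is spelled out on the slide.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def bow_by_cell_text(rows: Sequence[Sequence[str]], j: int) -> Dict[str, Set[str]]:
    """Build one bag of words per distinct text in column j from its row context."""
    bows: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        for k, other in enumerate(row):
            if k != j:  # row context = the other cells in the same row
                bows[row[j]].update(other.lower().split())
    return bows


def reorder_by_rep(rows: Sequence[Sequence[str]], j: int) -> List[Sequence[str]]:
    """Sort rows so that cells with the largest bag of words come first."""
    bows = bow_by_cell_text(rows, j)
    return sorted(rows, key=lambda row: len(bows[row[j]]), reverse=True)


rows = [
    ("Gears of War", "Microsoft", "2006"),
    ("Civilization IV", "2k Games", "2006"),
    ("Titan Quest", "THQ", "2006"),
    ("Civilization V", "2k Games", "2010"),
]
bows = bow_by_cell_text(rows, j=1)
reordered = reorder_by_rep(rows, j=1)
```

Because duplicate cells pool their row context, a repeated publisher such as "2k Games" ends up with a larger representation than a publisher that appears once.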
16. Evaluation
17. Data
•Data
• Freebase as reference ontology/background knowledge
• Limaye200 – 200 Web tables from [Limaye2010], originally annotated with Wikipedia
•Column classes are manually annotated
• LimayeAll – 6310 Web tables from [Limaye2010]
•Names in content cells are automatically mapped to Freebase
18. Settings
•Baseline
• TM_bs – TableMiner modified to use all cells in a column for column classification (everything else unchanged)
•Comparison*
• TM_mod^nl – TableMiner using the name length sample selection method
• TM_mod^dup – TableMiner using the duplicate content cell sample selection method
• TM_mod^rep – TableMiner using the feature representation size sample selection method
* The original TableMiner is modified. For details and other settings see the paper.
19. Results
•Results in F1
                             TM_bs   TM_mod^nl   TM_mod^dup   TM_mod^rep
Classification (Limaye200)   72.1    72.3        72.0         72.1
Disambiguation (LimayeAll)   80.9    81.3        81.22        81.24
•Convergence speed in column classification
                             TM_bs   TM_mod^nl   TM_mod^dup   TM_mod^rep
Limaye200                    100%    36.3%       36.1%        35.3%
•Reduced candidate named entities for disambiguation
                             TM_bs   TM_mod^nl   TM_mod^dup   TM_mod^rep
Limaye200                    0       32.4%       48.1%        46.8%
Comparable or better accuracy
But uses only partial data for column classification
… and processes much less data for disambiguation
21. Conclusion
•Learning with partial data for semantic table interpretation can be both effective and efficient
•The choice of sample selection methods makes limited difference in terms of accuracy and efficiency
22. Thank you
@ziqizhang_zz http://staffwww.dcs.shef.ac.uk/people/Z.Zhang