Learning with Partial Data for Semantic Table Interpretation 
Ziqi Zhang 
Department of Computer Science, University of Sheffield
Semantic Table Interpretation 
•Input 
• Ontology 
• Relational table 
•Goals/Tasks 
• Column – classes/concepts 
• Cell – named entities 
• Column, Column – relation 
Ontology (fragment): Thing, Company, Work, Time Period, … 
Table of video games (PC), with column-class annotations in brackets: 
      Name [Video Game]    Publisher [VideoGameCompany]    Year [Year] 
1     Gears of War         Microsoft                       2006 
2     Civilization IV      2k Games                        2006 
3     Titan Quest          THQ                             2006 
…     …                    …                               … 
99    Civilization V       2k Games                        2010 
Cell annotations: 2k Games → Ent:2kGames; THQ → Ent:THQ; … 
Relation annotation: (Name, Publisher) → Rel:publishedBy 
Z. Zhang / Learning with Partial Data for Semantic Table Interpretation
Motivation 
•State-of-the-art (SoA) semantic table interpretation methods, e.g. [Limaye2010, Venetis2011, Mulwad2013] 
•Limitation: the algorithms are ‘exhaustive’, which is unnecessary 
• Goal: assign a concept to a column 
• Hint: the content of the column gives useful clues 
• How much of the content do we need for inference (99 rows in this example)? 
•Human: SOME (learn by example) 
•SoA: ALL 
Research Questions 
•Can machines ‘learn by example’ 
• inference using only partial data (sample) 
• achieving good accuracy 
•How to choose a sample 
• does it matter (e.g., in terms of accuracy) 
• how to optimize 
TableMiner: Zhang, Z. (2014). Towards efficient and effective semantic table interpretation. In Proceedings of the 13th International Semantic Web Conference, 487-502 
Sample Selection (contribution of this work) 
Method 
TableMiner (modified) 
•Incremental inference (I-Inf) to address two tasks 
• Column classification 
•Using some data in the column 
• Cell disambiguation 
•Using column label to constrain disambiguation 
•Incremental inference (I-Inf) 
• Notation: Tj – a column; Cj – candidate concepts for the column; Ei,j – candidate entities for a cell 
• Cells are processed in order (1, 2, 3, …) until Cj changes little (convergence) 
• Cj = {<c1,s1'>, <c2,s2'>, <c3,s3'>, …, <c11,s11'>} 
• The column label (class) is used as a constraint in selecting candidate entities for disambiguation 
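The incremental loop can be sketched roughly as follows. This is an illustration only, not TableMiner's implementation: the function name `incremental_classify`, the `candidate_concepts` callback, the `tol` and `min_cells` parameters, and the stopping rule (largest change in normalised concept score) are all assumptions introduced here; the actual scoring and convergence measure are described in the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def incremental_classify(
    cells: Iterable[str],
    candidate_concepts: Callable[[str], Dict[str, float]],
    tol: float = 0.01,      # hypothetical convergence threshold
    min_cells: int = 2,     # hypothetical minimum number of cells to read
) -> Tuple[Dict[str, float], int]:
    """Read cells in order, update the concept scores Cj, stop when Cj changes little."""
    totals: Dict[str, float] = defaultdict(float)  # accumulated evidence per concept
    previous: Dict[str, float] = {}
    used = 0
    for cell in cells:
        used += 1
        # Each cell votes for the concepts of its candidate entities.
        for concept, score in candidate_concepts(cell).items():
            totals[concept] += score
        norm = sum(totals.values()) or 1.0
        current = {c: s / norm for c, s in totals.items()}
        # Assumed convergence test: largest change in any normalised score.
        keys = set(current) | set(previous)
        change = max(
            (abs(current.get(c, 0.0) - previous.get(c, 0.0)) for c in keys),
            default=1.0,  # no evidence yet, so do not treat this as converged
        )
        previous = current
        if used >= min_cells and change < tol:
            break
    return previous, used


def toy_lookup(cell: str) -> Dict[str, float]:
    """Hypothetical stand-in for a knowledge-base lookup."""
    table = {
        "Gears of War": {"Video Game": 1.0},
        "Civilization IV": {"Video Game": 1.0},
        "Titan Quest": {"Video Game": 0.8, "Film": 0.2},
    }
    return table.get(cell, {})


names = ["Gears of War", "Civilization IV", "Titan Quest", "Civilization V"]
scores, used = incremental_classify(names, toy_lookup)
```

In this toy run the scores stop changing after the second cell, so the remaining cells are never looked up, which is the source of the efficiency gain the slides describe.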
Sample Selection – the Principle 
•‘Order matters’ 
• TableMiner processes data in order until convergence 
• Changing the order means 
•(Possibly) Different convergence speed 
•Different data are processed 
•Change the order of cells in a column (and corresponding row) such that 
• cells that are ‘easier’ to disambiguate come to the top 
•because the class for a column depends on cells in the column 
Sample Selection- ‘name length’ hypothesis 
•Longer names are easier to disambiguate than shorter names 
• e.g., “Manchester” vs. “Manchester United F.C.” 
•Method name length (nl): 
•nl(Ti,j) = # of tokens in cell Ti,j 
•Re-order table rows by sorting on column Tj using nl(Ti,j) 
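A minimal sketch of the name length method, assuming whitespace tokenisation and a table held as a list of rows (both assumptions; the slides do not specify either). The helper names `name_length` and `reorder_by_name_length` are made up for this illustration. Rows are sorted in descending order so that longer, presumably easier, names come to the top.

```python
from typing import List


def name_length(cell_text: str) -> int:
    """nl(Ti,j): number of tokens in the cell (whitespace tokenisation assumed)."""
    return len(cell_text.split())


def reorder_by_name_length(rows: List[List[str]], col: int) -> List[List[str]]:
    """Sort whole rows on column `col`, longest names first (stable for ties)."""
    return sorted(rows, key=lambda row: name_length(row[col]), reverse=True)


rows = [
    ["Manchester", "city"],
    ["Manchester United F.C.", "football club"],
]
reordered = reorder_by_name_length(rows, col=0)
```

Sorting whole rows rather than the column alone keeps each cell aligned with its row context, as the slide's "and corresponding row" requires.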
Sample Selection – ‘feature density’ hypothesis 
•Names that have a richer feature representation are easier to disambiguate 
• Bag-of-words (B.O.W.) representation using row context 
• ‘one-sense-per-discourse’ (ospd) in non-subject columns 
•Method ‘duplicate content cell’ (dup) 
• re-arrange the target column and table following ospd 
• dup(Ti,j) = # of times text of Ti,j is duplicated in column Tj 
• Re-order table rows by sorting on column Tj using dup(Ti,j) 
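The counting and sorting part of the dup method might look like the sketch below. It assumes exact string matching between cells and a list-of-rows table, and it does not model the ospd re-arrangement step; `reorder_by_dup` is a hypothetical name. The count used here includes the cell itself, which gives the same ordering as counting only the extra copies.

```python
from collections import Counter
from typing import List


def reorder_by_dup(rows: List[List[str]], col: int) -> List[List[str]]:
    """Sort rows so that cells whose text repeats most often in column `col` come first."""
    counts = Counter(row[col] for row in rows)  # occurrences of each cell text
    return sorted(rows, key=lambda row: counts[row[col]], reverse=True)


rows = [
    ["Gears of War", "Microsoft", "2006"],
    ["Civilization IV", "2k Games", "2006"],
    ["Titan Quest", "THQ", "2006"],
    ["Civilization V", "2k Games", "2010"],
]
reordered = reorder_by_dup(rows, col=1)  # target column: Publisher
```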
•Method ‘feature representation size’ (rep) 
• re-arrange the target column and table following ospd 
• rep(Ti,j) = # of tokens in the B.O.W. representation of Ti,j 
• Re-order table rows by sorting on column Tj using rep(Ti,j) 
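One possible reading of the rep method is sketched below, under several assumptions that are not stated on the slides: the bag-of-words of a cell is the set of lower-cased tokens from the other cells in its row, cells with identical text in the target column share one merged bag (a simple interpretation of one-sense-per-discourse), and the size is the number of distinct tokens. The names `bow_by_text` and `reorder_by_rep` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def bow_by_text(rows: List[List[str]], col: int) -> Dict[str, Set[str]]:
    """Build one bag-of-words per distinct cell text in column `col` from row context."""
    bags: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        for k, other in enumerate(row):
            if k != col:  # row context = every other cell in the same row
                bags[row[col]].update(other.lower().split())
    return bags


def reorder_by_rep(rows: List[List[str]], col: int) -> List[List[str]]:
    """Sort rows so that cells with the largest representation come first."""
    bags = bow_by_text(rows, col)
    return sorted(rows, key=lambda row: len(bags[row[col]]), reverse=True)


rows = [
    ["Gears of War", "Microsoft", "2006"],
    ["Civilization IV", "2k Games", "2006"],
    ["Titan Quest", "THQ", "2006"],
    ["Civilization V", "2k Games", "2010"],
]
bags = bow_by_text(rows, col=1)
reordered = reorder_by_rep(rows, col=1)  # target column: Publisher
```

Under these assumptions a repeated cell such as "2k Games" accumulates context from both of its rows, which is why duplication and representation size tend to favour the same cells.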
Evaluation 
Data 
•Data 
• Freebase as reference ontology/background knowledge 
• Limaye200 – 200 Web tables from Limaye2010 originally annotated with Wikipedia 
•Column classes are manually annotated 
• LimayeAll – 6310 Web tables from Limaye2010 
•Names in content cells are automatically mapped to Freebase 
Settings 
•Baseline 
• TM_bs – TableMiner modified to use all cells in a column for column classification (everything else unchanged) 
•Comparison* 
• TM_mod^nl – TableMiner using the name length sample selection method 
• TM_mod^dup – TableMiner using the duplicate content cell sample selection method 
• TM_mod^rep – TableMiner using the feature representation size sample selection method 
* The original TableMiner is modified. For details and other settings see the paper. 
Results 
•Results in F1 
                              TM_bs    TM_mod^nl    TM_mod^dup    TM_mod^rep 
Classification (Limaye200)    72.1     72.3         72.0          72.1 
Disambiguation (LimayeAll)    80.9     81.3         81.22         81.24 
•Convergence speed in column classification 
                              TM_bs    TM_mod^nl    TM_mod^dup    TM_mod^rep 
Limaye200                     100%     36.3%        36.1%         35.3% 
•Reduced candidate named entities for disambiguation 
                              TM_bs    TM_mod^nl    TM_mod^dup    TM_mod^rep 
Limaye200                     0        32.4%        48.1%         46.8% 
Comparable or better accuracy, 
but using only partial data for column classification, 
and processing much less data for disambiguation 
Conclusion 
•Learning with partial data for semantic table interpretation can be both effective and efficient 
•The choice of sample selection method makes little difference in terms of accuracy and efficiency 
Thank you 
@ziqizhang_zz http://staffwww.dcs.shef.ac.uk/people/Z.Zhang
