Presentation slides from the IC 2020 conference
https://webikeo.fr/webinar/ic-2-partie-1
Yoan Chabot, Thomas Labbé, Jixiong Liu, Raphaël Troncy
DAGOBAH: A Context-Independent Semantic Annotation System for Tabular Data
1. DAGOBAH: An End-to-End Context-Free Tabular Data Semantic Annotation System
Yoan Chabot (Orange), Thomas Labbé (Orange), Jixiong Liu (Orange), Raphaël Troncy (EURECOM)
@yoan_chabot @rtroncy @tau_labbe @yansera1
Orange restricted
DAGOBAH-IC 2020
2. Context & Goals
▪ Design a semantic engine able to query (semi-)structured data
Question: In which city was "Our Happy Lives" filmed?
Answer: I don't know.
[Figure: Wikidata graph fragment. Entities: Rennes (Q647), Belfort (Q171545), Commune of France (Q484170), The French Kissers (Q745690), Our Happy Lives (Q3344332), film (Q11424), France (Q142). Properties: instance of (P31), country of origin (P495), country (P17), narrative location (P840).]
3. Context & Goals
▪ Design a semantic engine able to query (semi-)structured data
Question: In which city was "Our Happy Lives" filmed?
Answer: In Belfort!
Input table processed by DAGOBAH:
Movie                 Location
Our Happy Lives       Belfort
The French Kissers    Rennes
[Figure: Wikidata graph fragment connected to the table by DAGOBAH. Entities: Rennes (Q647), Belfort (Q171545), Commune of France (Q484170), The French Kissers (Q745690), Our Happy Lives (Q3344332), film (Q11424), France (Q142). Properties: instance of (P31), country of origin (P495), country (P17), narrative location (P840, shown twice).]
4. Tabular Data to Knowledge Graph Matching
Movie                 Location
Our Happy Lives       Belfort
The French Kissers    Rennes
CTA: Column-Type Annotation
CEA: Cell-Entity Annotation
CPA: Columns-Property Annotation
[Figure: the table matched against a Wikidata graph fragment. Entities: Rennes (Q647), Belfort (Q171545), Commune of France (Q484170), The French Kissers (Q745690), Our Happy Lives (Q3344332), film (Q11424), France (Q142). Properties: instance of (P31), country of origin (P495), country (P17), narrative location (P840, shown twice and labelled CPA).]
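As a minimal sketch, the three annotation tasks can be written down for the example table as plain Python dictionaries. The Wikidata identifiers are the ones shown on the slide. The dictionary layout, the assignment of the classes film and Commune of France to the two columns, and the helper name `to_triples` are assumptions made for illustration, not DAGOBAH's actual data model.

```python
# Example table from the slide (row 0 is the header).
table = [
    ["Movie", "Location"],
    ["Our Happy Lives", "Belfort"],
    ["The French Kissers", "Rennes"],
]

# CEA: (row, column) -> Wikidata entity matched to the cell.
cea = {
    (1, 0): "Q3344332",  # Our Happy Lives
    (1, 1): "Q171545",   # Belfort
    (2, 0): "Q745690",   # The French Kissers
    (2, 1): "Q647",      # Rennes
}

# CTA: column -> Wikidata class (assumed reading of the slide's graph).
cta = {
    0: "Q11424",   # film
    1: "Q484170",  # Commune of France
}

# CPA: (subject column, object column) -> Wikidata property.
cpa = {(0, 1): "P840"}  # narrative location


def to_triples(table, cea, cpa):
    """Hypothetical helper: turn CEA and CPA annotations into graph triples."""
    triples = []
    for row in range(1, len(table)):  # skip the header row
        for (subj_col, obj_col), prop in cpa.items():
            subj = cea.get((row, subj_col))
            obj = cea.get((row, obj_col))
            if subj and obj:
                triples.append((subj, prop, obj))
    return triples


triples = to_triples(table, cea, cpa)
```

Each data row yields one (subject, property, object) triple, which is the kind of statement a query engine needs to answer the question about "Our Happy Lives".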
5. State of the Art
▪ Disambiguate cell values (CEA)
  ▪ Two strategies:
    ▪ For each cell, look up the most probable entity [1][2]
    ▪ Jointly disambiguate the cells, considering the entire row [3]
  ▪ Entities can be matched using:
    ▪ Syntactic comparisons [1][2]
    ▪ Alignment of ontologies [1][3]
    ▪ Word embeddings [2][3]
▪ Extract column types (CTA)
  ▪ Majority voting based on CEA outputs [4]
▪ Extract relationships between columns (CPA)
  ▪ Majority voting based on previously determined types and entities [5]
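The majority-voting idea mentioned for CTA and CPA can be sketched in a few lines. This is an illustration under stated assumptions, not the method of any cited paper: the function name `majority_vote`, the shape of the CEA output, and the per-row property candidates are all hypothetical. The identifiers reuse the movie example from the earlier slides.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent non-None candidate, or None if there is none."""
    counts = Counter(c for c in candidates if c is not None)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical CEA output for the "Location" column:
# cell text -> (matched entity, class of that entity).
cea_location = {
    "Belfort": ("Q171545", "Q484170"),  # Belfort, Commune of France
    "Rennes": ("Q647", "Q484170"),      # Rennes, Commune of France
}

# CTA: the column type is the class most cells agree on.
column_type = majority_vote(cls for _, cls in cea_location.values())

# CPA: one candidate property per row (None when no property was found);
# the column-pair relation is the property most rows agree on.
row_properties = ["P840", "P840", None]
column_relation = majority_vote(row_properties)
```

Because the vote is taken over CEA results, an error in cell disambiguation propagates to the column type and to the relation.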
[1] LIMAYE G., SARAWAGI S. & CHAKRABARTI S. (2010). Annotating and searching web tables using entities, types and relationships. In 36th International Conference on Very Large Data Bases (VLDB), p. 1338–1347.
[2] FERNANDEZ R. C., MANSOUR E., QAHTAN A. A., ELMAGARMID A., ILYAS I., MADDEN S., OUZZANI M., STONEBRAKER M. & TANG N. (2018). Seeping semantics: Linking datasets using word embeddings for data discovery. In 34th International Conference on Data Engineering (ICDE), p. 989–1000.
[3] EFTHYMIOU V., HASSANZADEH O., RODRIGUEZ-MURO M. & CHRISTOPHIDES V. (2017). Matching web tables with knowledge base entities: From entity lookups to entity embeddings. In 16th International Semantic Web Conference (ISWC), p. 260–277.
[4] MULWAD V., FININ T., SYED Z. & JOSHI A. (2010). Using linked data to interpret tables. In 1st International Workshop on Consuming Linked Data (COLD).
[5] RAN C., SHEN W., WANG J. & ZHU X. (2016). Domain-specific knowledge base enrichment using wikipedia tables. In IEEE International Conference on Data Mining (ICDM), p. 349–358.
7. Challenges Requiring Pre-processing
Challenges:
• Table nature
• Table orientation
• Column header presence
• Key column identification
• Column type detection

Example table:
Lake                  Area         Depth   County
Windermere            14,73 km²    66 m    Cumbria
Kielder Reservoir     10,86 km²    52 m    Northumberland
Ullswater             8,9 km²      63 m    Lake district
Bassenthwaite Lake    5,1 km²      21 m    Cumbria
Derwent Water         5,1 km²      22 m    Lake District

Pre-processing output:
• Relational table
• Horizontal
• Header: True, index = 0
• Key column: 0
• Primitive Typing: [Object, Unit, Unit, Object]
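The primitive typing step can be illustrated on the lake table. The sketch below assumes a simple rule, a number followed by a unit symbol is a "Unit" and everything else is an "Object", and takes the most frequent cell type per column. The regular expression and the function names are hypothetical; the slides do not describe how DAGOBAH detects these types.

```python
import re
from collections import Counter

# Assumed rule: digits, optional decimal part (comma or dot), then a unit symbol.
UNIT_PATTERN = re.compile(r"^\s*\d+(?:[.,]\d+)?\s*[A-Za-zµ°%]+[²³]?\s*$")


def primitive_type(cell):
    """Classify one cell as 'Unit' (quantity with a unit) or 'Object'."""
    return "Unit" if UNIT_PATTERN.match(cell) else "Object"


def column_types(rows):
    """Return the most frequent primitive type of each column."""
    return [
        Counter(primitive_type(cell) for cell in column).most_common(1)[0][0]
        for column in zip(*rows)
    ]


# Data rows of the example table (header removed).
rows = [
    ["Windermere", "14,73 km²", "66 m", "Cumbria"],
    ["Kielder Reservoir", "10,86 km²", "52 m", "Northumberland"],
    ["Ullswater", "8,9 km²", "63 m", "Lake district"],
    ["Bassenthwaite Lake", "5,1 km²", "21 m", "Cumbria"],
    ["Derwent Water", "5,1 km²", "22 m", "Lake District"],
]

types = column_types(rows)
```

On this table the sketch reproduces the typing listed on the slide: the first and last columns are objects, the two middle columns are quantities with units.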
15. Evaluation Dataset: SemTab 2019
SemTab 2019: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
[Table: statistics of the datasets in each SemTab round. From Jiménez-Ruiz et al. (2020). SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems.]
16. Assessment Criteria
▪ T: all the columns to annotate.
▪ P: perfect annotations, i.e. the most fine-grained classes in the (ontology) hierarchy that also appear in the ground truth.
▪ O: okay annotations, i.e. super-classes of the perfect classes (excluding very generic top classes like owl:Thing).
▪ W: wrong annotations, i.e. other annotations not in the ground truth.
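The extracted slide text lists the counts T, P, O and W but not the formulas that combine them. Assuming the CTA scores defined by the SemTab 2019 challenge, AH = (P + 0.5·O − W) / T and AP = P / (P + O + W), they can be computed as in the sketch below. The function names and the example counts are made up for illustration.

```python
def average_hierarchical_score(perfect, okay, wrong, targets):
    """AH, assuming the SemTab 2019 definition: (P + 0.5*O - W) / T."""
    return (perfect + 0.5 * okay - wrong) / targets


def average_perfect_score(perfect, okay, wrong):
    """AP, assuming the SemTab 2019 definition: P / (P + O + W)."""
    total = perfect + okay + wrong
    return perfect / total if total else 0.0


# Made-up counts: 10 target columns, 8 perfect, 4 okay and 2 wrong annotations.
ah = average_hierarchical_score(perfect=8, okay=4, wrong=2, targets=10)
ap = average_perfect_score(perfect=8, okay=4, wrong=2)
```

Under this definition AH can exceed 1 when a system returns several correct classes for the same column, since P + 0.5·O is then larger than T.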
17. Results
DAGOBAH results for Round 1:

Task         CTA              CEA                  CPA
Criteria     AH      AP       F1      Precision    F1      Precision
Baseline     0.479   0.242    0.883   0.892        0.415   0.347
Embedding    1.212   0.336    0.841   0.853        -       -

Rounds 2 to 4:

Task                   CTA              CEA                  CPA
Criteria               AH      AP       F1      Precision    F1      Precision
Round 2   Baseline     0.641   0.247    0.713   0.816        0.533   0.919
          MTab         1.414   0.276    0.911   0.911        0.881   0.929
Round 3   Baseline     0.745   0.161    0.725   0.745        0.519   0.826
          MTab         1.956   0.261    0.970   0.970        0.844   0.845
Round 4   Baseline     0.684   0.206    0.578   0.599        0.398   0.874
          MTab         2.012   0.300    0.983   0.983        0.832   0.832
✓ MTab is the winner of this challenge
✓ DAGOBAH is somewhat behind MTab because of missing Wikidata-to-DBpedia type mappings
18. Conclusions
Baseline
  Pros:
  ▪ High coverage (multiple sources)
  ▪ Computational efficiency
  Cons:
  ▪ Lookup-services dependency (reliability)
  ▪ Blackbox (indexing, scoring…)
  ▪ Queries volume
Embedding
  Pros:
  ▪ Lookup strategy independence
  ▪ Relevant clustering even with few data
  ▪ Generalization (no tailored cleaning + fewer heuristics in lookups and scoring)
  Cons:
  ▪ Computational performance
  ▪ K optimization
  ▪ Embedding dependency
▪ New homogeneity factor that improves the pre-processing
▪ Two approaches:
  ▪ Baseline composed of lookups and majority voting
  ▪ Clustering of embeddings
▪ Performance bottlenecks (due to the challenge context):
  ✓ Light data cleaning … on purpose
  ✓ Basic lookup strategies … on purpose (e.g. no use of dictionary)
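The embedding approach is only summarised here as "clustering of embeddings" with a K to optimise. One way to picture it, as a sketch under strong assumptions and not the DAGOBAH implementation, is to cluster the embeddings of all candidate entities with k-means and keep the cluster that covers the most table cells. The candidate names, the 2-D vectors, the deterministic initialisation and the coverage rule below are all hypothetical.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means with a deterministic start (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels


# Hypothetical candidates: (cell text, candidate entity, made-up embedding).
candidates = [
    ("Belfort", "candidate_place_1", (0.9, 0.1)),
    ("Belfort", "candidate_other_1", (0.1, 0.9)),
    ("Rennes", "candidate_place_2", (1.0, 0.2)),
    ("Belfort", "candidate_other_2", (0.2, 1.0)),
]

labels = kmeans([vector for _, _, vector in candidates], k=2)

# Keep the cluster whose candidates cover the largest number of distinct cells.
cells_per_cluster = {}
for (cell, _, _), label in zip(candidates, labels):
    cells_per_cluster.setdefault(label, set()).add(cell)
best = max(cells_per_cluster, key=lambda c: len(cells_per_cluster[c]))

annotations = {
    cell: name
    for (cell, name, _), label in zip(candidates, labels)
    if label == best
}
```

The choice of `k` matters: with too few clusters unrelated candidates are merged, with too many the cells of one column are split apart, which is presumably what "K optimization" refers to.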
19. Future Work
✓ Test other Wikidata embedding methods (currently TransE)
✓ Compute joint embeddings with Wikipedia/DBpedia to enhance coverage
✓ Experiment with more clustering algorithms and parameters on different datasets
✓ Learn table data embeddings and find vector transformation(s) to the KG embedding space
✓ …
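For readers unfamiliar with TransE, the embedding method currently used: it represents a relation as a translation in vector space, so that head + relation ≈ tail for true triples, and a smaller distance means a more plausible triple. The sketch below shows this scoring on made-up 2-D vectors; the vector values and the function name `transe_distance` are assumptions for illustration only.

```python
import math


def transe_distance(head, relation, tail):
    """Euclidean distance between (head + relation) and tail; lower is better."""
    translated = [h + r for h, r in zip(head, relation)]
    return math.dist(translated, tail)


# Made-up embeddings, chosen so that the first triple fits exactly.
our_happy_lives = (0.25, 0.5)
narrative_location = (0.5, 0.25)
places = {
    "Belfort": (0.75, 0.75),
    "Rennes": (0.0, 1.0),
}

# Rank candidate tails for (Our Happy Lives, narrative location, ?).
ranked = sorted(
    places,
    key=lambda name: transe_distance(our_happy_lives, narrative_location, places[name]),
)
```

Other embedding methods replace this translation-based score with a different scoring function, which is what testing alternatives to TransE would amount to.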