Dagobahic2020orange

Orange restricted
DAGOBAH
An End-to-End Context-Free Tabular Data Semantic Annotation System
Yoan Chabot (Orange, @yoan_chabot), Thomas Labbé (Orange, @tau_labbe), Jixiong Liu (Orange, @yansera1), Raphaël Troncy (EURECOM, @rtroncy)
DAGOBAH-IC 2020

Context & Goals
▪ Design a semantic engine able to query (semi-)structured data
[Figure: a user asks 'In which city was "Our Happy Lives" filmed?' and the engine answers "I don't know". The knowledge graph shown contains the entities Rennes (Q647), Belfort (Q171545), Commune of France (Q484170), The French Kissers (Q745690), Our Happy Lives (Q3344332), film (Q11424) and France (Q142), with the property labels P31 (instance of), P495 (country of origin), P17 (country) and P840 (narrative location).]

[Figure: the same question once DAGOBAH has processed the table below. The engine now answers "In Belfort!", and P840 (narrative location) appears twice in the graph.]
Movie               Location
Our Happy Lives     Belfort
The French Kissers  Rennes

Tabular Data to Knowledge Graph Matching
▪ CTA: Column-Type Annotation
▪ CEA: Cell-Entity Annotation
▪ CPA: Columns-Property Annotation
Movie               Location
Our Happy Lives     Belfort
The French Kissers  Rennes
[Figure: the table matched against the knowledge graph. Entities: Rennes (Q647), Belfort (Q171545), Commune of France (Q484170), The French Kissers (Q745690), Our Happy Lives (Q3344332), film (Q11424), France (Q142). Properties: P31 (instance of), P495 (country of origin), P17 (country), P840 (narrative location). A CPA label appears in the figure.]

State of the Art
▪ Disambiguate cell values (CEA)
  ▪ Two strategies:
    ▪ For each cell, look up the most probable entity [1][2]
    ▪ Jointly disambiguate the cells, considering the entire row [3]
  ▪ Entities can be matched using:
    ▪ Syntactic comparisons [1][2]
    ▪ Alignment of ontologies [1][3]
    ▪ Word embeddings [2][3]
▪ Extract column types (CTA)
  ▪ Majority voting based on CEA outputs [4]
▪ Extract relationships between columns (CPA)
  ▪ Majority voting based on previously determined types and entities [5]
[1] LIMAYE G., SARAWAGI S. & CHAKRABARTI S. (2010). Annotating and searching web tables using entities, types and relationships. In 36th International Conference on Very Large Data Bases (VLDB), p. 1338–1347.
[2] FERNANDEZ R. C., MANSOUR E., QAHTAN A. A., ELMAGARMID A., ILYAS I., MADDEN S., OUZZANI M., STONEBRAKER M. & TANG N. (2018). Seeping semantics: Linking datasets using word embeddings for data discovery. In 34th International Conference on Data Engineering (ICDE), p. 989–1000.
[3] EFTHYMIOU V., HASSANZADEH O., RODRIGUEZ-MURO M. & CHRISTOPHIDES V. (2017). Matching web tables with knowledge base entities: From entity lookups to entity embeddings. In 16th International Semantic Web Conference (ISWC), p. 260–277.
[4] MULWAD V., FININ T., SYED Z. & JOSHI A. (2010). Using linked data to interpret tables. In 1st International Workshop on Consuming Linked Data (COLD).
[5] RAN C., SHEN W., WANG J. & ZHU X. (2016). Domain-specific knowledge base enrichment using wikipedia tables. In IEEE International Conference on Data Mining (ICDM), p. 349–358.
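The majority-voting idea listed for CTA can be illustrated with a minimal sketch. It assumes each cell of a column has already been linked to an entity whose type is known (or `None` when the lookup failed); the function name and the sample labels are hypothetical and not taken from the cited systems.

```python
from collections import Counter

def majority_vote_type(cell_types):
    """Return the most frequent type among the annotated cells of one column.

    cell_types holds one type label per cell, or None when no entity was found.
    """
    votes = Counter(t for t in cell_types if t is not None)
    if not votes:
        return None  # no cell could be annotated, so no column type
    return votes.most_common(1)[0][0]

# Hypothetical CEA output for one column: the type of each linked entity.
column = ["film", "film", None, "television series", "film"]
print(majority_vote_type(column))
```

Unannotated cells are ignored rather than counted as a vote, so a sparse column can still receive a type.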

The DAGOBAH Approach
▪ 1st step: pre-processing to identify table characteristics (orientation, key column…)
▪ 2nd step: annotation workflows
  ▪ Method 1: Baseline lookups
  ▪ Method 2: Embedding approach
[Figure: Preprocessing feeds the annotation workflows (Baseline and Embedding).]

Challenges Requiring Pre-processing
Challenges:
• Table nature
• Table orientation
• Column header presence
• Key column identification
• Column type detection
Example table:
Lake                Area       Depth  County
Windermere          14,73 km²  66 m   Cumbria
Kielder Reservoir   10,86 km²  52 m   Northumberland
Ullswater           8,9 km²    63 m   Lake District
Bassenthwaite Lake  5,1 km²    21 m   Cumbria
Derwent Water       5,1 km²    22 m   Lake District
Pre-processing output:
• Relational table
• Horizontal
• Header: True, index = 0
• Key column: 0
• Primitive Typing: [Object, Unit, Unit, Object]

Contribution: New Homogeneity Factor
▪ We have 6 cell types:
• String_Normal (France)
• String_Datetime (2020-06-30)
• String_Uppercase (IC)
• String_Number (150 km)
• Number (454)
• Boolean (Yes)
$$Hom(x) = \left[ \frac{1}{len(x)} \sum_{t_i \in x} \left( 1 - \left( 1 - 2 \cdot \frac{count(t_i)}{len(x)} \right)^2 \right) \right]^2$$
item 1         item 2         item 3         Hom
String_Number  String_Number  String_Number  0
String_Number  String_Number  String_Normal  0.89
String_Number  String_Normal  String_Normal  0.89
String_Normal  String_Normal  String_Normal  0
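A small sketch can make the homogeneity factor concrete. It assumes that x is the list of cell-type labels of one row or column and that the sum runs over every cell. The bracketed mean alone gives 8/9 ≈ 0.89 for a two-to-one split, which matches the slide's table, whereas applying the trailing square as it appears in the extracted formula gives about 0.79. The outer exponent is therefore exposed as a parameter (`outer_exponent`, a hypothetical name) rather than settled here.

```python
from collections import Counter

def homogeneity(types, outer_exponent=2):
    """Homogeneity factor of a list of cell-type labels.

    outer_exponent=2 follows the formula as extracted from the slide;
    outer_exponent=1 returns the bracketed mean only.
    """
    n = len(types)
    if n == 0:
        return 0.0
    counts = Counter(types)
    # Each cell contributes 1 - (1 - 2 * share of its type)^2.
    inner = sum(1 - (1 - 2 * counts[t] / n) ** 2 for t in types) / n
    return inner ** outer_exponent

same = ["String_Number"] * 3
mixed = ["String_Number", "String_Number", "String_Normal"]

print(homogeneity(same))                              # 0.0
print(round(homogeneity(mixed, outer_exponent=1), 2)) # 0.89
print(round(homogeneity(mixed), 2))                   # 0.79
```

In both variants a list with a single type scores exactly 0, so any non-zero value signals mixed types.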

Example: New Homogeneity Factor
[Figure: pre-processing pipeline. Datatable corpus (CSV, TSV, HTML, …) → Converter → Table in WTC format → table orientation, header detection, primitive typing (Object, Unit, Number, Date, Unknown), key column detection → Pre-processed tables. Two algorithm labels appear in the figure: DWTC algorithm [1] and Content-based algorithm (homogeneity factor).]
Lake                Area           Depth          County          Hom. RH
Windermere          String_Number  String_Number  String unknown  0.89
Kielder Reservoir   String_Number  String_Number  String unknown  0.89
Ullswater           String_Number  String_Number  String unknown  0.89
Bassenthwaite Lake  String_Number  String_Number  String unknown  0.89
Derwent Water       String_Number  String_Number  String unknown  0.89
Hom. CH             0              0              0
$$Hom(x) = \left[ \frac{1}{len(x)} \sum_{t_i \in x} \left( 1 - \left( 1 - 2 \cdot \frac{count(t_i)}{len(x)} \right)^2 \right) \right]^2$$
$$\exists\, col \text{ where } Hom(col[0:3]) \neq 0 \;\rightarrow\; \textbf{Header = true}$$
$$Mean(CH) < Mean(RH) \;\rightarrow\; \textbf{Horizontal}$$
[1] https://subversion.assembla.com/svn/commondata/WDCFramework/tags/1/0/3/
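The two decision rules on this slide can be sketched as follows. This is an illustration under stated assumptions: cells are already reduced to type labels, header cells are typed String_Normal, "vertical" is the assumed alternative to "horizontal" (the slide only names the horizontal case), and the function names are hypothetical. The comparison of means and the non-zero test do not depend on whether the outer square of the homogeneity formula is applied.

```python
from collections import Counter
from statistics import mean

def homogeneity(types, outer_exponent=2):
    """Homogeneity factor of a list of cell-type labels (formula as extracted)."""
    n = len(types)
    if n == 0:
        return 0.0
    counts = Counter(types)
    inner = sum(1 - (1 - 2 * counts[t] / n) ** 2 for t in types) / n
    return inner ** outer_exponent

def orientation(rows):
    """rows: list of rows of type labels, header excluded."""
    rh = [homogeneity(row) for row in rows]               # row homogeneity
    ch = [homogeneity(list(col)) for col in zip(*rows)]   # column homogeneity
    # Columns more uniform than rows: records run along the rows.
    return "horizontal" if mean(ch) < mean(rh) else "vertical"

def has_header(columns):
    """columns: list of columns whose first cell is the potential header."""
    return any(homogeneity(col[:3]) != 0 for col in columns)

# Typed version of the lake table: Area, Depth, County for five rows.
body = [["String_Number", "String_Number", "String_Normal"] for _ in range(5)]
header_types = ["String_Normal", "String_Normal", "String_Normal"]
with_header = [list(col) for col in zip(header_types, *body)]
without_header = [list(col) for col in zip(*body)]

print(orientation(body))           # horizontal
print(has_header(with_header))     # True
print(has_header(without_header))  # False
```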

Evaluation: New homogeneity Factor
▪ Evaluation on SemTab 2019 Round 1 (64 tables)
$$\text{Precision} = \frac{\text{correct predictions}}{\text{all predictions}}$$
[Figure: Precision of pre-processing tasks]
SemTab2019: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/


Baseline Lookups
‒ Lookups from all table cells (4 external sources + 1 internal Wikidata ES)
‒ Wikidata as pivot metadata
‒ DBpedia translation (uri & types)
‒ TF-IDF-like types scoring
‒ Entities disambiguation with target type(s)
$$S_{spec}(t) = \frac{count(t)}{sum(t_i)} \cdot \log \frac{count(rows)}{count(t_i)}$$
[Figure: baseline workflow with numbered steps. Boxes: Pre-processed tables, Ingestion (API server), Entities Lookups (APIs, CirrusSearch), SPARQL (DBpedia entity uri & types), Types scoring, Type(s) selection, CTA output, Entities Disambiguation, CEA output. Example input rows: Windermere (14,73 km²) and Kielder Reservoir (10,86 km²).]
Example lookup results:
{title: "Q119936", label: "Windermere"},
{title: "Q390370", label: "Windermere"}
…
Example type information:
{"mainType": "populated place",
"types": "settlement",
"subTypes": ""}

Embedding Approach
‒ Embedding enrichment through Wikidata ES server
‒ Regex + Levenshtein lookup
‒ K-means clustering over candidates' space
‒ Scoring algorithm to extract best cluster and deduce target type
‒ Candidates disambiguation from clusters, types and entities scores
[Figure: embedding workflow with numbered steps 1 to 6. Boxes: Embedding (OpenKE [1]), Embedding Enrichment, Entities Lookup, Lookup candidates, Candidates clustering (lookup + table based hyperparameters), Clusters scoring, Candidates' types scoring (CTA output), Candidates' entities scoring (CEA output).]
Example input table:
Title       Director
Rushmore    Anderson
Fight Club  Fincher
Example enriched embedding entry for Q223687:
Id: ["Q223687"],
label: ["Wes Anderson"],
aliases: ["Wesley Wales Anderson"],
types: ["Q5", "dbPedia.Person"],
subTypes: ["dbPedia.Director", "Q2526255", "Q36180"]
[1] OpenKE TransE Wikidata Embeddings: http://139.129.163.161/index/toolkits#pretrained-wikidata
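The "Regex + Levenshtein lookup" step can be illustrated with a minimal sketch. The slides do not describe how the two are combined, so the following is an assumption: a label is kept as a candidate when the normalized cell text occurs in it as a whole word (regex) or lies within a small edit distance, and candidates are ranked by distance. The function names, the `max_distance` threshold and the label list are hypothetical.

```python
import re

def normalize(label):
    """Lowercase the label and drop punctuation."""
    return re.sub(r"[^\w\s]", "", label).lower().strip()

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def lookup(cell, labels, max_distance=2):
    """Return (label, distance) candidates for a cell, closest first."""
    query = normalize(cell)
    pattern = re.compile(r"\b" + re.escape(query) + r"\b")
    hits = []
    for label in labels:
        norm = normalize(label)
        distance = levenshtein(query, norm)
        if pattern.search(norm) or distance <= max_distance:
            hits.append((label, distance))
    return sorted(hits, key=lambda hit: hit[1])

labels = ["Wes Anderson", "Paul Thomas Anderson", "Paul W. S. Anderson", "David Fincher"]
for label, distance in lookup("Anderson", labels):
    print(label, distance)
```

For the cell "Anderson" this returns the three Anderson entries and leaves "David Fincher" out; choosing among them is left to the clustering and scoring steps.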

Embedding Approach Example
▪ Clusters scoring: $S_k(\text{cluster\#2})$
▪ Candidates scoring (CTA): $S_c(Q941209)$
▪ Entities scoring (CEA) and entities disambiguation: $S_e(\text{Wes Anderson}) > S_e(\text{Paul Thomas Anderson}),\ S_e(\text{Paul W. S. Anderson})$

Evaluation Dataset: SemTab 2019
[Table: Statistics of the datasets in each SemTab round]
Table from: Jiménez-Ruiz E. et al. (2020). SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems
SemTab2019: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/

Assessment Criteria
▪ T: all the columns targeted for annotation.
▪ P (perfect): the most fine-grained classes in the (ontology) hierarchy that also appear in the ground truth.
▪ O (okay): the super-classes of perfect classes (excluding very generic top classes such as owl:Thing).
▪ W (wrong): other annotations that are not in the ground truth.
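The slide defines the four counts but does not spell out how they are combined. The sketch below assumes the AH (average hierarchical) and AP (average perfect) scores as defined in the SemTab 2019 challenge description: AH = (P + 0.5·O − W) / T and AP = P / (P + O + W). Function names and the sample counts are hypothetical.

```python
def average_hierarchical_score(perfect, okay, wrong, targets):
    """AH: rewards perfect classes, half-rewards okay ones, penalizes wrong ones."""
    return (perfect + 0.5 * okay - wrong) / targets

def average_perfect_score(perfect, okay, wrong):
    """AP: share of perfect classes among all submitted annotations."""
    total = perfect + okay + wrong
    return perfect / total if total else 0.0

# Hypothetical submission: 4 target columns, 3 perfect, 2 okay and 1 wrong annotation.
print(average_hierarchical_score(3, 2, 1, 4))  # 0.75
print(average_perfect_score(3, 2, 1))          # 0.5
```

Because several okay classes can be submitted for the same column, AH is not bounded by 1, which is consistent with CTA values above 1 in the results.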

Results
▪ DAGOBAH results for Round 1:
System     CTA AH  CTA AP  CEA F1  CEA Precision  CPA F1  CPA Precision
Baseline   0.479   0.242   0.883   0.892          0.415   0.347
Embedding  1.212   0.336   0.841   0.853          -       -
▪ Rounds 2 to 4, Baseline compared with MTab:
Round    System    CTA AH  CTA AP  CEA F1  CEA Precision  CPA F1  CPA Precision
Round 2  Baseline  0.641   0.247   0.713   0.816          0.533   0.919
Round 2  MTab      1.414   0.276   0.911   0.911          0.881   0.929
Round 3  Baseline  0.745   0.161   0.725   0.745          0.519   0.826
Round 3  MTab      1.956   0.261   0.970   0.970          0.844   0.845
Round 4  Baseline  0.684   0.206   0.578   0.599          0.398   0.874
Round 4  MTab      2.012   0.300   0.983   0.983          0.832   0.832
✓ MTab is the winner of this challenge
✓ DAGOBAH is relatively behind MTab due to missing Wikidata-DBpedia type mappings

Conclusions
Baseline
  Pros:
  ▪ High coverage (multiple sources)
  ▪ Computational efficiency
  Cons:
  ▪ Lookup-services dependency (reliability)
  ▪ Blackbox (indexing, scoring…)
  ▪ Queries volume
Embedding
  Pros:
  ▪ Lookup strategy independence
  ▪ Relevant clustering even with little data
  ▪ Generalization (no tailored cleaning + fewer heuristics in lookups and scoring)
  Cons:
  ▪ Computational performance
  ▪ K optimization
  ▪ Embedding dependency
▪ New homogeneity factor that improves the pre-processing
▪ 2 approaches:
  ▪ Baseline composed of lookups and majority voting
  ▪ Clustering of embeddings
▪ Performance bottlenecks (due to the challenge context):
  ✓ Light data cleaning … on purpose
  ✓ Basic lookup strategies … on purpose (e.g. no use of dictionary)

Future Work
✓ Test other Wikidata embedding methods (currently TransE)
✓ Compute joint embeddings with Wikipedia/DBpedia to enhance coverage
✓ Experiment with more clustering algorithms and parameters on different datasets
✓ Learn data table embeddings and find vector transformation(s) to the KG embedding space
✓ …

DAGOBAH
Datatable-powered Accurate-knowledge Graph
for Outstanding and Beautiful Answers to Humans
Twitter: @yansera1
Jixiong.liu@orange.com
Slides are available: https://www.slideshare.net/JixiongLIU/dagobahic2020orange
