Orange restricted
DAGOBAH
An End-to-End Context-Free Tabular Data
Semantic Annotation System
Yoan Chabot (Orange, @yoan_chabot)
Thomas Labbé (Orange, @tau_labbe)
Jixiong Liu (Orange, @yansera1)
Raphaël Troncy (EURECOM, @rtroncy)
DAGOBAH-IC 2020
Context & Goals
▪ Design a semantic engine able to query (semi-)structured data
Question: In which city was "Our Happy Lives" filmed?
Answer: I don't know
[Figure: knowledge graph excerpt. Entities: Rennes (Q647), Belfort (Q171545), Commune of France (Q484170), The French Kissers (Q745690), Our Happy Lives (Q3344332), film (Q11424), France (Q142). Properties: instance of (P31), country of origin (P495), country (P17), narrative location (P840).]
Context & Goals (continued)
Question: In which city was "Our Happy Lives" filmed?
Answer: In Belfort!
Table given to DAGOBAH:
Movie               Location
Our Happy Lives     Belfort
The French Kissers  Rennes
[Figure: the same knowledge graph excerpt, in which narrative location (P840) now appears twice.]
Tabular Data to Knowledge Graph Matching
▪ CTA: Column-Type Annotation
▪ CEA: Cell-Entity Annotation
▪ CPA: Columns-Property Annotation
[Figure: the Movie/Location table (Our Happy Lives, Belfort; The French Kissers, Rennes) matched against the knowledge graph excerpt: Rennes (Q647), Belfort (Q171545), Commune of France (Q484170), The French Kissers (Q745690), Our Happy Lives (Q3344332), film (Q11424), France (Q142), instance of (P31), country of origin (P495), country (P17), narrative location (P840).]
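The three annotation tasks can be pictured as three small lookup structures over the same table. The sketch below is only an illustration: the class name `TableAnnotations` and the helper `triples` are hypothetical, the identifiers are the ones shown on the slide, and typing the Location column as Q484170 is an assumption drawn from the graph excerpt rather than something the slide states.

```python
from dataclasses import dataclass, field


@dataclass
class TableAnnotations:
    """Hypothetical container for the three kinds of annotation."""
    cea: dict = field(default_factory=dict)  # (row, col) -> entity id
    cta: dict = field(default_factory=dict)  # col -> class id
    cpa: dict = field(default_factory=dict)  # (subject col, object col) -> property id


table = [
    ["Movie", "Location"],
    ["Our Happy Lives", "Belfort"],
    ["The French Kissers", "Rennes"],
]

annotations = TableAnnotations(
    cea={(1, 0): "Q3344332", (1, 1): "Q171545",
         (2, 0): "Q745690", (2, 1): "Q647"},
    cta={0: "Q11424", 1: "Q484170"},  # column 1 type is an assumption
    cpa={(0, 1): "P840"},
)


def triples(table, ann):
    """Combine CEA and CPA annotations into (subject, property, object) triples."""
    result = []
    for (s_col, o_col), prop in ann.cpa.items():
        for row in range(1, len(table)):  # row 0 is the header
            subject = ann.cea.get((row, s_col))
            obj = ann.cea.get((row, o_col))
            if subject and obj:
                result.append((subject, prop, obj))
    return result


print(triples(table, annotations))
```

Once the cells and the column pair are annotated, each data row yields one graph statement, which is what lets the engine answer the "In which city" question from the table.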
State of the Art
▪ Disambiguate cell values (CEA)
  ▪ 2 strategies
    ▪ For each cell, look up the most probable entity [1][2]
    ▪ Joint disambiguation of each cell considering the entire row [3]
  ▪ Matches for entities can be made using:
    ▪ Syntactic comparisons [1][2]
    ▪ Alignment of ontologies [1][3]
    ▪ Word embeddings [2][3]
▪ Extract column type (CTA)
  ▪ Majority voting based on CEA outputs [4]
▪ Extract relationships between columns (CPA)
  ▪ Majority voting based on previously determined types and entities [5]
[1] LIMAYE G., SARAWAGI S. & CHAKRABARTI S. (2010). Annotating and searching web tables using entities, types and relationships. In 36th International Conference on Very Large Data Bases (VLDB), p. 1338–1347.
[2] FERNANDEZ R. C., MANSOUR E., QAHTAN A. A., ELMAGARMID A., ILYAS I., MADDEN S., OUZZANI M., STONEBRAKER M. & TANG N. (2018). Seeping semantics: Linking datasets using word embeddings for data discovery. In 34th International Conference on Data Engineering (ICDE), p. 989–1000.
[3] EFTHYMIOU V., HASSANZADEH O., RODRIGUEZ-MURO M. & CHRISTOPHIDES V. (2017). Matching web tables with knowledge base entities: From entity lookups to entity embeddings. In 16th International Semantic Web Conference (ISWC), p. 260–277.
[4] MULWAD V., FININ T., SYED Z. & JOSHI A. (2010). Using linked data to interpret tables. In 1st International Workshop on Consuming Linked Data (COLD).
[5] RAN C., SHEN W., WANG J. & ZHU X. (2016). Domain-specific knowledge base enrichment using Wikipedia tables. In IEEE International Conference on Data Mining (ICDM), p. 349–358.
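Majority voting for CTA, as summarised above, can be sketched in a few lines. This is a generic illustration and not the method of any cited paper: the function name `majority_vote_type`, the input shape (one list of candidate types per annotated cell) and the alphabetical tie-break are all assumptions.

```python
from collections import Counter


def majority_vote_type(cell_entity_types):
    """Return the type carried by the most cells of a column.

    cell_entity_types: one iterable of type ids per cell, taken from the
    entity chosen for that cell by a CEA step (hypothetical input shape).
    """
    votes = Counter()
    for types in cell_entity_types:
        votes.update(set(types))  # each cell votes at most once per type
    if not votes:
        return None
    best = max(votes.values())
    # Alphabetical tie-break keeps the result deterministic (an assumption).
    return sorted(t for t, v in votes.items() if v == best)[0]


column = [["Q11424"], ["Q11424", "Q5"], ["Q5"], ["Q11424"]]
print(majority_vote_type(column))
```

A plain vote like this tends to favour generic types that every entity carries, which is one reason a scoring step on top of the vote is attractive.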
The DAGOBAH Approach
▪ 1st step: pre-processing to identify table characteristics (orientation, key column…)
▪ 2nd step: annotation workflows
  ▪ Method 1: Baseline lookups
  ▪ Method 2: Embedding approach
[Diagram: Preprocessing, followed by the annotation workflows (Baseline, Embedding)]
Challenges Requiring Pre-processing
Challenges:
• Table nature
• Table orientation
• Column header presence
• Key column identification
• Column type detection
Example table:
Lake                Area       Depth  County
Windermere          14,73 km²  66 m   Cumbria
Kielder Reservoir   10,86 km²  52 m   Northumberland
Ullswater           8,9 km²    63 m   Lake district
Bassenthwaite Lake  5,1 km²    21 m   Cumbria
Derwent Water       5,1 km²    22 m   Lake District
Pre-processing output:
• Relational table
• Horizontal
• Header: True, index = 0
• Key column: 0
• Primitive Typing: [Object, Unit, Unit, Object]
Contribution: New Homogeneity Factor
▪ We have 6 cell types:
• String_Normal (France)
• String_Datetime (2020-06-30)
• String_Uppercase (IC)
• String_Number (150 km)
• Number (454)
• Boolean (Yes)
Hom(x) = \left[ \frac{1}{len(x)} \sum_{t_i \in x} \left( 1 - \left( 1 - 2 \cdot \frac{count(t_i)}{len(x)} \right)^{2} \right) \right]^{2}
item 1         item 2         item 3         Hom
String_number  String_number  String_number  0
String_number  String_number  String_normal  0.89
String_number  String_normal  String_normal  0.89
String_normal  String_normal  String_normal  0
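The homogeneity factor can be tried out with a short sketch. Two points are assumptions rather than facts from the slide: the sum is read as running over every cell of x (not over distinct types), and the trailing exponent is exposed as a `power` parameter. The second choice is deliberate: the slide's worked values (0 and 0.89) are reproduced with `power=1`, whereas the exponent as written gives about 0.79 for the mixed rows, and the slide does not settle which was intended.

```python
from collections import Counter


def hom(cell_types, power=2):
    """Homogeneity factor of a row or column given its cell types.

    power=2 follows the formula as written on the slide; power=1 drops
    the outer exponent (see the note above; this switch is an assumption).
    """
    n = len(cell_types)
    if n == 0:
        return 0.0
    counts = Counter(cell_types)
    mean_term = sum(1 - (1 - 2 * counts[t] / n) ** 2 for t in cell_types) / n
    return mean_term ** power


uniform = ["String_number"] * 3
mixed = ["String_number", "String_number", "String_normal"]

print(hom(uniform))                   # a fully uniform list scores 0
print(round(hom(mixed, power=1), 2))  # 0.89, the value shown in the table
print(round(hom(mixed, power=2), 2))  # 0.79 with the exponent as written
```

Whichever exponent is used, a list made of a single cell type scores exactly 0, which is the property the header and orientation rules rely on.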
Example: New Homogeneity Factor
[Diagram: Datatable corpus (CSV, TSV, HTML, …) → Converter → Table in WTC format → pre-processing steps (table orientation, header detection, primitive typing, key column detection) → Pre-processed tables. Algorithms shown: DWTC algorithm [1]; content-based algorithm (homogeneity factor). Primitive types: Object, Unit, Number, Date, Unknown.]
Lake                Area           Depth          Country         Hom. RH
Windermere          String_number  String_number  String unknown  0.89
Kielder Reservoir   String_number  String_number  String unknown  0.89
Ullswater           String_number  String_number  String unknown  0.89
Bassenthwaite Lake  String_number  String_number  String unknown  0.89
Derwent Water       String_number  String_number  String unknown  0.89
Hom. CH             0              0              0
Hom(x) = \left[ \frac{1}{len(x)} \sum_{t_i \in x} \left( 1 - \left( 1 - 2 \cdot \frac{count(t_i)}{len(x)} \right)^{2} \right) \right]^{2}
∃ col where Hom(col[0:3]) ≠ 0 → Header = true
Mean(CH) < Mean(RH) → Horizontal
[1] https://subversion.assembla.com/svn/commondata/WDCFramework/tags/1/0/3/
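The two decision rules on this slide can be sketched on top of the homogeneity factor. This is a minimal illustration under stated assumptions: the function names `is_horizontal` and `has_header` are hypothetical, `col[0:3]` is read as the first three cells of a column, homogeneity is computed over the whole typed table (header row included), and the cell typing of the lake table below is illustrative rather than taken from the slide.

```python
from collections import Counter
from statistics import mean


def hom(cell_types, power=2):
    """Homogeneity factor, with the sum read as running over every cell."""
    n = len(cell_types)
    if n == 0:
        return 0.0
    counts = Counter(cell_types)
    mean_term = sum(1 - (1 - 2 * counts[t] / n) ** 2 for t in cell_types) / n
    return mean_term ** power


def is_horizontal(typed_table):
    """Rule: Mean(CH) < Mean(RH) -> Horizontal."""
    row_hom = [hom(list(row)) for row in typed_table]
    col_hom = [hom(list(col)) for col in zip(*typed_table)]
    return mean(col_hom) < mean(row_hom)


def has_header(typed_table, window=3):
    """Rule: some column whose first `window` cells are not uniform -> header."""
    return any(hom(list(col)[:window]) != 0 for col in zip(*typed_table))


# Illustrative typing of the lake table: a textual header row on top of
# five data rows (Lake, Area, Depth, County).
header = ["String_Normal"] * 4
data_row = ["String_Normal", "String_Number", "String_Number", "String_Normal"]
typed = [header] + [data_row] * 5

print(is_horizontal(typed), has_header(typed), has_header(typed[1:]))
```

On this input the columns are more uniform than the rows, so the table is reported as horizontal, and removing the header row makes the top of every column uniform, so no header is detected.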
Evaluation: New Homogeneity Factor
▪ Evaluation on SemTab 2019 Round 1 (64 tables)
Precision = \frac{\text{correct predictions}}{\text{all predictions}}
[Chart: Precision of pre-processing tasks]
SemTab2019: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
Baseline Lookups
‒ Lookups from all table cells (4 external sources + 1 internal Wikidata ES)
‒ Wikidata as pivot metadata
‒ DBpedia translation (URI & types)
‒ TF-IDF-like types scoring
‒ Entities disambiguation with target type(s)
Types scoring:
S_{spec}(t) = \frac{count(t)}{sum(t_i)} \cdot \log \frac{count(rows)}{count(t_i)}
Sample input (pre-processed table):
Lake               Area
Windermere         14,73 km²
Kielder Reservoir  10,86 km²
Sample entities lookup result:
{title: "Q119936", label: "Windermere"},
{title: "Q390370", label: "Windermere"}
…
Sample types record:
{"mainType": "populated place", "types": "settlement", "subTypes": ""}
[Diagram components: pre-processed tables, API server ingestion, entities lookups (including the CirrusSearch API), SPARQL query for the DBpedia entity URI & types, types scoring, type(s) selection, entities disambiguation, CTA output, CEA output]
Embedding Approach
‒ Embedding enrichment through Wikidata ES server
‒ Regex + Levenshtein lookup
‒ K-means clustering over candidates' space
‒ Scoring algorithm to extract best cluster and deduce target type
‒ Candidates disambiguation from clusters, types and entities scores
Sample table:
Title       Director
Rushmore    Anderson
Fight Club  Fincher
Sample enriched record (Q223687):
Id: ["Q223687"],
label: ["Wes Anderson"],
aliases: ["Wesley Wales Anderson"],
types: ["Q5", "dbPedia.Person"],
subTypes: ["dbPedia.Director", "Q2526255", "Q36180"]
[Diagram components: OpenKE embedding [1], embedding enrichment, entities lookup, lookup candidates, candidates clustering (lookup + table-based hyperparameters), clusters scoring, candidates' types scoring (CTA output), candidates' entities scoring (CEA output)]
[1] OpenKE TransE Wikidata Embeddings: http://139.129.163.161/index/toolkits#pretrained-wikidata
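The Levenshtein part of the lookup step can be illustrated with a small, self-contained sketch. It is not the authors' lookup: the function names, the lower-casing, the distance threshold and the candidate labels are all assumptions, and the regex part of the lookup is left out.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def rank_candidates(cell, labels, max_distance=2):
    """Keep labels within max_distance of the cell text, closest first."""
    scored = [(levenshtein(cell.lower(), label.lower()), label) for label in labels]
    return [label for distance, label in sorted(scored) if distance <= max_distance]


# Hypothetical candidate labels for the cell "Fincher".
print(rank_candidates("Fincher", ["David Fincher", "Fincher", "Finch", "Fischer"]))
```

A whole-string distance rejects "David Fincher" for the cell "Fincher", which suggests why a regex or token-level match is paired with the edit distance in the lookup step.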
Embedding Approach Example
Clusters scoring: S_k(cluster#2)
Candidates scoring (CTA): S_c(Q941209)
Entities scoring (CEA) and disambiguation: S_e(Wes Anderson) > S_e(Paul Thomas Anderson), S_e(Paul W. S. Anderson)
Evaluation Dataset: SemTab 2019
SemTab2019: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
[Table: Statistics of the datasets in each SemTab round. Table from: Jiménez-Ruiz et al. (2020). SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems]
Assessment Criteria
▪ T: all the columns for annotation.
▪ P: perfect annotations, i.e. the most fine-grained classes in the (ontology) hierarchy that also appear in the ground truth.
▪ O: okay annotations, i.e. super-classes of perfect classes (excluding very generic top classes like owl:Thing).
▪ W: wrong annotations, i.e. other annotations not in the ground truth.
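The slide text lists T, P, O and W, but the formulas that combine them did not survive extraction. Assuming the SemTab 2019 challenge definitions, AH = (P + 0.5·O − W) / T and AP = P / (P + O + W), the two CTA criteria can be computed as follows; the function name `cta_scores` and the sample counts are hypothetical.

```python
def cta_scores(perfect, okay, wrong, targets):
    """Return (AH, AP) from annotation counts.

    Assumes the SemTab 2019 definitions: AH rewards perfect annotations,
    half-rewards okay ones and penalises wrong ones, normalised by the
    number of target columns; AP is the share of perfect annotations.
    """
    ah = (perfect + 0.5 * okay - wrong) / targets
    total = perfect + okay + wrong
    ap = perfect / total if total else 0.0
    return ah, ap


# Hypothetical counts: 10 perfect, 4 okay, 2 wrong over 10 target columns.
print(cta_scores(perfect=10, okay=4, wrong=2, targets=10))
```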
Results
▪ DAGOBAH results for Round 1:
Task       CTA           CEA               CPA
Criteria   AH     AP     F1     Precision  F1     Precision
Baseline   0.479  0.242  0.883  0.892      0.415  0.347
Embedding  1.212  0.336  0.841  0.853      -      -
▪ DAGOBAH baseline compared with MTab, Rounds 2 to 4:
Task               CTA           CEA               CPA
Criteria           AH     AP     F1     Precision  F1     Precision
Round 2  Baseline  0.641  0.247  0.713  0.816      0.533  0.919
Round 2  MTab      1.414  0.276  0.911  0.911      0.881  0.929
Round 3  Baseline  0.745  0.161  0.725  0.745      0.519  0.826
Round 3  MTab      1.956  0.261  0.970  0.970      0.844  0.845
Round 4  Baseline  0.684  0.206  0.578  0.599      0.398  0.874
Round 4  MTab      2.012  0.300  0.983  0.983      0.832  0.832
✓ MTab is the winner of this challenge
✓ DAGOBAH is relatively behind MTab due to missing Wikidata-DBpedia type mappings
Conclusions
Baseline
  Pros:
  ▪ High coverage (multiple sources)
  ▪ Computational efficiency
  Cons:
  ▪ Lookup-services dependency (reliability)
  ▪ Blackbox (indexing, scoring…)
  ▪ Queries volume
Embedding
  Pros:
  ▪ Lookup strategy independence
  ▪ Relevant clustering even with little data
  ▪ Generalization (no tailored cleaning + fewer heuristics in lookups and scoring)
  Cons:
  ▪ Computational performance
  ▪ K optimization
  ▪ Embedding dependency
▪ New homogeneity factor that improves the pre-processing
▪ 2 approaches:
  ▪ Baseline composed of lookups and majority voting
  ▪ Clustering of embeddings
▪ Performance bottlenecks (due to the challenge context):
  ✓ Light data cleaning … on purpose
  ✓ Basic lookup strategies … on purpose (e.g. no use of dictionary)
Future Work
✓ Test other Wikidata embedding methods (currently TransE)
✓ Compute joint embeddings with Wikipedia/DBpedia to enhance coverage
✓ Experiment with more clustering algorithms and parameters on different datasets
✓ Learn data table embeddings and find vector transformation(s) into the KG embedding space
✓ …
DAGOBAH
Datatable-powered Accurate-knowledge Graph
for Outstanding and Beautiful Answers to Humans
Twitter: @yansera1
Jixiong.liu@orange.com
Slides are available: https://www.slideshare.net/JixiongLIU/dagobahic2020orange

Contenu connexe

Similaire à Dagobahic2020orange

Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics EnvironmentIan Foster
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)Toshiyuki Shimono
 
CS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfCS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfssuser034ce1
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer OverviewOlav Sandstå
 
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)Konstantinos Zagoris
 
(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists
(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists
(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus LinguistsTobias Gärtner
 
AI Deeplearning Programming
AI Deeplearning ProgrammingAI Deeplearning Programming
AI Deeplearning ProgrammingPaulSombat
 
Grid based distributed in memory indexing for moving objects
Grid based distributed in memory indexing for moving objectsGrid based distributed in memory indexing for moving objects
Grid based distributed in memory indexing for moving objectsYunsu Lee
 
Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...Anastasija Nikiforova
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Machine Learning Applications
Machine Learning ApplicationsMachine Learning Applications
Machine Learning ApplicationsSimplyInsight
 
L'ingénierie dans les nuages
L'ingénierie dans les nuagesL'ingénierie dans les nuages
L'ingénierie dans les nuagesAndrew Forward
 
Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0MapR Technologies
 
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...IRJET Journal
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine LearningAmanBhalla14
 
R Programming - part 1.pdf
R Programming - part 1.pdfR Programming - part 1.pdf
R Programming - part 1.pdfRohanBorgalli
 
Personal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research MeetingPersonal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research MeetingChawanat Nakasan
 

Similaire à Dagobahic2020orange (20)

Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics Environment
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)
 
CS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfCS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdf
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer Overview
 
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
 
(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists
(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists
(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists
 
AI Deeplearning Programming
AI Deeplearning ProgrammingAI Deeplearning Programming
AI Deeplearning Programming
 
Piano rubyslava final
Piano rubyslava finalPiano rubyslava final
Piano rubyslava final
 
Grid based distributed in memory indexing for moving objects
Grid based distributed in memory indexing for moving objectsGrid based distributed in memory indexing for moving objects
Grid based distributed in memory indexing for moving objects
 
Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Armando Benitez -- Data x Desing
Armando Benitez -- Data x DesingArmando Benitez -- Data x Desing
Armando Benitez -- Data x Desing
 
Machine Learning Applications
Machine Learning ApplicationsMachine Learning Applications
Machine Learning Applications
 
L'ingénierie dans les nuages
L'ingénierie dans les nuagesL'ingénierie dans les nuages
L'ingénierie dans les nuages
 
[系列活動] 資料探勘速遊
[系列活動] 資料探勘速遊[系列活動] 資料探勘速遊
[系列活動] 資料探勘速遊
 
Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0
 
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 
R Programming - part 1.pdf
R Programming - part 1.pdfR Programming - part 1.pdf
R Programming - part 1.pdf
 
Personal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research MeetingPersonal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research Meeting
 

Dernier

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Dernier (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Dagobahic2020orange

  • 1. Orange restricted DAGOBAH An End-to-End Context-Free Tabular Data Semantic Annotation System Yoan Chabot Thomas Labbé Jixiong Liu Raphaël Troncy Orange Orange Orange EURECOM @yoan_chabot @rtroncy@tau_labbe @yansera1 DAGOBAH-IC 202001
  • 2. Context & Goals ▪ Design a semantic engine able to query (semi-)structured data DAGOBAH-IC 202002 I don’t know Q647 Rennes Q171545 Belfort Q484170 Commune of France Q745690 The French Kissers Q3344332 Our Happy Lives Q11424 film P31 instance of P31 instance of Q142 France P495 country of origin P17 country In which city was "Our Happy Lives" filmed? P840 narrative location
  • 3. Context & Goals ▪ Design a semantic engine able to query (semi-)structured data DAGOBAH-IC 202002 In which city was "Our happy lives" filmed? In Belfort! Q647 Rennes Q171545 Belfort Q484170 Commune of France Q745690 The French Kissers Q3344332 Our Happy Lives Q11424 film P31 instance of P31 instance of Q142 France P495 country of origin P17 country Movie Location Our Happy Lives Belfort The French Kissers Rennes P840 narrative location P840 narrative location DAGOBAH
  • 4. Movie Location Our Happy Lives Belfort The French Kissers Rennes Tabular Data to Knowledge Graph Matching DAGOBAH-IC 202003 CTA Column-Type Annotation CEA Cell-Entity Annotation CPA Columns-Property Annotation Q647 Rennes Q171545 Belfort Q484170 Commune of France Q745690 The French Kissers Q3344332 Our Happy Lives Q11424 film P31 instance of P31 instance of Q142 France P495 country of origin P17 country P840 narrative location P840 narrative location CPA
  • 5. State of the Art ▪ Disambiguate cell values (CEA) ▪ 2 Strategies ▪ For each cell, lookup for the most probable entity. [1] [2] ▪ Joint disambiguation of each cell considering the entire row. [3] ▪ Matches for entities can be made using: ▪ Syntactic comparisons [1][2] ▪ Alignment of ontologies [1][3] ▪ Word embeddings [2][3] ▪ Extract column type (CTA) ▪ Majority voting based on CEA outputs [4] ▪ Extract relationships between columns (CPA) ▪ Majority voting based on previously determined types and entities [5] [1] LIMAYE G., SARAWAGI S. & CHAKRABARTI S. (2010). Annotating and searching web tables using entities, types and relationships. In 36th International Conference on Very Large Data Bases (VLDB), p. 1338–1347. [2] FERNANDEZ R. C., MANSOUR E., QAHTAN A. A., ELMAGARMID A., ILYAS I., MADDEN S., OUZZANI M., STONEBRAKER M. & TANG N. (2018). Seeping semantics : Linking datasets using word embeddings for data discovery. In 34th International Conference on Data Engineering (ICDE), p. 989–1000. [3] EFTHYMIOU V., HASSANZADEH O., RODRIGUEZ-MURO M. & CHRISTOPHIDES V. (2017). Matching web tables with knowledge base entities : From entity lookups to entity embeddings. In 16th International Semantic Web Conference (ISWC), p. 260–277. [4] MULWAD V., FININ T., SYED Z. & JOSHI A. (2010). Using linked data to interpret tables. In 1 st International Workshop on Consuming Linked Data (COLD). [5] RAN C., SHEN W., WANG J. & ZHU X. (2016). Domain-specific knowledge base enrichment using wikipedia tables. In IEEE International Conference on Data Mining (ICDM), p. 349–358. DAGOBAH-IC 202004
  • 6. The DAGOBAH Approach ▪ 1st step: pre-processing to identify tables characteristics (orientation, key-column…) ▪ 2nd step: annotations workflows ▪ Method 1: Baseline lookups ▪ Method 2: Embedding approach Preprocessing Embedding Baseline Annotations workflows DAGOBAH-IC 202005
  • 7. Challenges Requiring Pre-processing Pre-processing • Relational table • Horizontal • Header: True, index = 0 • Key column: 0 • Primitive Typing: [Object, Unit, Unit, Object] Lake Area Depth County Windermere 14,73 km² 66 m Cumbria Kielder Reservoir 10,86 km² 52 m Northumberland Ullswater 8,9 km² 63 m Lake district Bassenthwaite Lake 5,1 km² 21 m Cumbria Derwent Water 5,1 km² 22 m Lake District DAGOBAH-IC 202006 Challenges: • Table nature • Table orientation • Column header presence • Key column identification • Column type detection
  • 8. Contribution: New Homogeneity Factor
    ▪ We have 6 cell types:
      • String_Normal (France)
      • String_Datetime (2020-06-30)
      • String_Uppercase (IC)
      • String_Number (150 km)
      • Number (454)
      • Boolean (Yes)
    $$\mathrm{Hom}(x) = \left[ \frac{1}{\mathrm{len}(x)} \sum_{t_i \in x} \left( 1 - \left( 1 - \frac{2 \cdot \mathrm{count}(t_i)}{\mathrm{len}(x)} \right)^{2} \right) \right]^{2}$$
    item 1          item 2          item 3          Hom
    String_number   String_number   String_number   0
    String_number   String_number   String_normal   0.89
    String_number   String_normal   String_normal   0.89
    String_normal   String_normal   String_normal   0
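    A minimal sketch of the homogeneity factor, under stated assumptions: the sum is taken over the cells of the row or column `x`, `count(t_i)` is the number of cells sharing the type of cell `i`, and `x` is non-empty. The function names are hypothetical. With this reading, the bracketed mean for a 2-versus-1 mix equals 8/9 ≈ 0.89, which is the value shown in the slide's table; applying the outer square as the formula is written gives about 0.79, so both quantities are exposed here.

```python
from collections import Counter

def bracket_mean(types):
    """Mean over cells of 1 - (1 - 2*count(t_i)/len(x))**2."""
    n = len(types)
    counts = Counter(types)
    return sum(1 - (1 - 2 * counts[t] / n) ** 2 for t in types) / n

def hom(types):
    """Homogeneity factor with the outer square, as the formula is written."""
    return bracket_mean(types) ** 2

uniform = ["String_number"] * 3
mixed = ["String_number", "String_number", "String_normal"]

print(hom(uniform))                   # 0.0
print(round(bracket_mean(mixed), 2))  # 0.89
print(round(hom(mixed), 2))           # 0.79
```

    Either way, a value of 0 means that every cell has the same type, which is the property the header and orientation tests rely on.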
  • 9. Example: New Homogeneity Factor
    [Pipeline diagram: Datatable corpus (CSV, TSV, HTML, …); Converter; Table in WTC format; Table orientation; Header detection; Primitive typing (Object, Unit, Number, Date, Unknown); Key column detection; DWTC algorithm [1]; Content-based algorithm (homogeneity factor); Pre-processed tables]
    Lake                 Area            Depth           County           Hom. RH
    Windermere           String_number   String_number   String unknown   0.89
    Kielder Reservoir    String_number   String_number   String unknown   0.89
    Ullswater            String_number   String_number   String unknown   0.89
    Bassenthwaite Lake   String_number   String_number   String unknown   0.89
    Derwent Water        String_number   String_number   String unknown   0.89
    Hom. CH              0               0               0
    $$\mathrm{Hom}(x) = \left[ \frac{1}{\mathrm{len}(x)} \sum_{t_i \in x} \left( 1 - \left( 1 - \frac{2 \cdot \mathrm{count}(t_i)}{\mathrm{len}(x)} \right)^{2} \right) \right]^{2}$$
    $$\exists\, col \ \text{where}\ \mathrm{Hom}(col[0:3]) \neq 0 \;\rightarrow\; \textbf{Header = true}$$
    $$\mathrm{Mean}(CH) < \mathrm{Mean}(RH) \;\rightarrow\; \textbf{Horizontal}$$
    [1] https://subversion.assembla.com/svn/commondata/WDCFramework/tags/1/0/3/
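    The two decision rules on this slide can be sketched as follows. This is an illustration under assumptions, not the DAGOBAH code: cells are already replaced by their type labels, the header cells and the last column are labelled `String_normal` for the sake of the example, and the helper names are hypothetical.

```python
from collections import Counter

def hom(types):
    """Homogeneity factor of a non-empty list of cell-type labels."""
    n = len(types)
    counts = Counter(types)
    mean = sum(1 - (1 - 2 * counts[t] / n) ** 2 for t in types) / n
    return mean ** 2

def has_header(columns, window=3):
    """Header = true if some column's first `window` cells are not homogeneous."""
    return any(hom(col[:window]) != 0 for col in columns)

def is_horizontal(body_rows):
    """Horizontal if mean column homogeneity is below mean row homogeneity."""
    cols = [list(c) for c in zip(*body_rows)]
    mean_ch = sum(hom(c) for c in cols) / len(cols)
    mean_rh = sum(hom(list(r)) for r in body_rows) / len(body_rows)
    return mean_ch < mean_rh

# Typed version of the lakes table (Area, Depth, County), five body rows.
body = [["String_number", "String_number", "String_normal"] for _ in range(5)]
header = ["String_normal", "String_normal", "String_normal"]

# Columns including the header cell on top.
columns = [[h] + [row[i] for row in body] for i, h in enumerate(header)]

print(has_header(columns))  # True: "Area" and "Depth" start with a string cell
print(is_horizontal(body))  # True: columns are uniform, rows are mixed
```

    Both rules only depend on whether a factor is zero and on how two means compare, so they hold in this example whether or not the outer square is applied.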
  • 10. Evaluation: New Homogeneity Factor
    ▪ Evaluation on SemTab 2019 Round 1 (64 tables)
    $$\text{Precision} = \frac{\text{correct predictions}}{\text{all predictions}}$$
    [Chart: Precision of pre-processing tasks]
    SemTab2019: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
  • 11. The DAGOBAH Approach
    ▪ 1st step: pre-processing to identify table characteristics (orientation, key column…)
    ▪ 2nd step: annotation workflows
      ▪ Method 1: Baseline lookups
      ▪ Method 2: Embedding approach
    [Diagram: Preprocessing, then annotation workflows (Baseline, Embedding)]
  • 12. Baseline Lookups
    [Workflow diagram: Pre-processed tables; Ingestion; API Server; API; CirrusSearch API; SPARQL; Entities Lookups; DBpedia entity uri & types; Type(s) selection; Types scoring; Entities Disambiguation; CTA output; CEA output]
    Example input:
      Lake                Area
      Windermere          14,73 km²
      Kielder Reservoir   10,86 km²
    Example lookup results: {title: "Q119936", label: "Windermere"}, {title: "Q390370", label: "Windermere"} …
    Example types: {"mainType": "populated place", "types": "settlement", "subTypes": ""}
    $$S_{spec}(t) = \frac{\mathrm{count}(t)}{\mathrm{sum}(t_i)} \cdot \log \frac{\mathrm{count}(rows)}{\mathrm{count}(t_i)}$$
    ‒ Lookups for all table cells (4 external sources + 1 internal Wikidata ES)
    ‒ Wikidata as pivot metadata
    ‒ DBpedia translation (uri & types)
    ‒ TF-IDF-like types scoring
    ‒ Entities disambiguation with target type(s)
  • 13. Embedding Approach
    [Workflow diagram: Embedding (OpenKE [1]); Embedding Enrichment; Entities Lookup; Lookup candidates; Candidates clustering; Lookup + table-based hyperparameters; Clusters scoring; Candidates' types scoring; CTA output; Candidates' entities scoring; CEA output]
    Example input:
      Title        Director
      Rushmore     Anderson
      Fight Club   Fincher
    Example enriched entity (Q223687): Id: ["Q223687"], label: ["Wes Anderson"], aliases: ["Wesley Wales Anderson"], types: ["Q5", "dbPedia.Person"], subTypes: ["dbPedia.Director", "Q2526255", "Q36180"]
    ‒ Embedding enrichment through Wikidata ES server
    ‒ Regex + Levenshtein lookup
    ‒ K-means clustering over candidates' space
    ‒ Scoring algorithm to extract best cluster and deduce target type
    ‒ Candidates disambiguation from clusters, types and entities scores
    [1] OpenKE TransE Wikidata Embeddings: http://139.129.163.161/index/toolkits#pretrained-wikidata
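    The Levenshtein part of the lookup step can be illustrated with a small sketch. It is a hypothetical stand-in, not the lookup used by DAGOBAH: `rank_candidates` and its token-level scoring are assumptions made for the example, and the candidate labels are typed in by hand rather than retrieved from an index.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def rank_candidates(mention, labels):
    """Sort labels by their best distance to the mention (full label or one token)."""
    def score(label):
        lowered = label.lower()
        parts = lowered.split() + [lowered]
        return min(levenshtein(mention.lower(), part) for part in parts)
    return sorted(labels, key=score)

labels = ["David Fincher", "Wes Anderson", "Paul Thomas Anderson"]
# A misspelled cell value still reaches both "Anderson" candidates first.
print(rank_candidates("Andersen", labels))
```

    String similarity alone cannot separate the two "Anderson" candidates, which is the gap the clustering and scoring steps are meant to close.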
  • 14. Embedding Approach Example
    ▪ Clusters scoring: $S_k(\text{cluster\#2})$
    ▪ Candidates scoring (CTA): $S_c(\text{Q941209})$
    ▪ Entities scoring (CEA), entities disambiguation: $S_e(\text{Wes Anderson}) > S_e(\text{Paul Thomas Anderson}),\ S_e(\text{Paul W. S. Anderson})$
  • 15. Evaluation Dataset: SemTab 2019
    [Table: Statistics of the datasets in each SemTab round]
    Table from: Jiménez-Ruiz et al. (2020). SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems.
    SemTab2019: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
  • 16. Assessment Criteria
    ▪ T: all the columns for annotation.
    ▪ P (perfect): the most fine-grained classes in the (ontology) hierarchy that also appear in the ground truth.
    ▪ O (okay): the super-classes of perfect classes (excluding very generic top classes like owl:Thing).
    ▪ W (wrong): other annotations not in the ground truth.
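    The score formulas themselves did not survive on this slide. The sketch below assumes the SemTab 2019 definitions of the two CTA scores, AH = (P + 0.5·O − W) / T and AP = P / (P + O + W), where P, O, W and T are counts of the quantities listed above. The function names and the sample counts are hypothetical.

```python
def ah_score(perfect, okay, wrong, targets):
    """Average Hierarchical score: okay annotations count half, wrong ones are penalised."""
    return (perfect + 0.5 * okay - wrong) / targets

def ap_score(perfect, okay, wrong):
    """Average Perfect score: share of perfect annotations among all submitted ones."""
    total = perfect + okay + wrong
    return perfect / total if total else 0.0

# Hypothetical counts: 4 target columns, 6 submitted annotations.
print(ah_score(perfect=3, okay=2, wrong=1, targets=4))  # 0.75
print(ap_score(perfect=3, okay=2, wrong=1))             # 0.5
```

    Because several annotations can be submitted per column, AH is not bounded by 1, which is consistent with values above 1 in the results.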
  • 17. Results
    ▪ DAGOBAH results for Round 1:
      Task        CTA     CTA     CEA     CEA         CPA     CPA
      Criteria    AH      AP      F1      Precision   F1      Precision
      Baseline    0.479   0.242   0.883   0.892       0.415   0.347
      Embedding   1.212   0.336   0.841   0.853       -       -
    ▪ Rounds 2 to 4, Baseline compared with MTab:
      Task                CTA     CTA     CEA     CEA         CPA     CPA
      Criteria            AH      AP      F1      Precision   F1      Precision
      Round 2  Baseline   0.641   0.247   0.713   0.816       0.533   0.919
      Round 2  MTab       1.414   0.276   0.911   0.911       0.881   0.929
      Round 3  Baseline   0.745   0.161   0.725   0.745       0.519   0.826
      Round 3  MTab       1.956   0.261   0.970   0.970       0.844   0.845
      Round 4  Baseline   0.684   0.206   0.578   0.599       0.398   0.874
      Round 4  MTab       2.012   0.300   0.983   0.983       0.832   0.832
    ✓ MTab is the winner of this challenge
    ✓ DAGOBAH is behind MTab, partly due to missing Wikidata to DBpedia type mappings
  • 18. Conclusions
    ▪ New homogeneity factor that improves the pre-processing
    ▪ 2 approaches:
      ▪ Baseline composed of lookups and majority voting
      ▪ Clustering of embeddings
    ▪ Performance bottlenecks (due to the challenge context):
      ✓ Light data cleaning … on purpose
      ✓ Basic lookup strategies … on purpose (e.g. no use of dictionary)
    Approach: Baseline
      Pros:
        ▪ High coverage (multiple sources)
        ▪ Computational efficiency
      Cons:
        ▪ Lookup-services dependency (reliability)
        ▪ Blackbox (indexing, scoring…)
        ▪ Queries volume
    Approach: Embedding
      Pros:
        ▪ Lookup strategy independence
        ▪ Relevant clustering even with few data
        ▪ Generalization (no tailored cleaning + fewer heuristics in lookups and scoring)
      Cons:
        ▪ Computational performance
        ▪ K optimization
        ▪ Embedding dependency
  • 19. Future Work
    ✓ Test other Wikidata embedding methods (currently TransE)
    ✓ Compute joint embeddings with Wikipedia/DBpedia to enhance coverage
    ✓ Experiment with more clustering algorithms and parameters on different datasets
    ✓ Learn data table embeddings and find vector transformation(s) to the KG embedding space
    ✓ …
  • 20. DAGOBAH: Datatable-powered Accurate-knowledge Graph for Outstanding and Beautiful Answers to Humans
    Twitter: @yansera1
    Jixiong.liu@orange.com
    Slides are available: https://www.slideshare.net/JixiongLIU/dagobahic2020orange