Named Entity Disambiguation
via
Large-Scale Graphs Analytics
Alberto Parravicini
2018-05-05
NECSTlab
Understanding trending topics
● Finance: news has a direct impact on the market.
● Advertising: targeted advertising for each user.
● Recommender Systems: targeted recommendations for each user.
Extracting Topics from Text
● The identification of topics requires 2 main steps:
1. Named Entity Recognition: spot names of persons, companies, etc.
○ High accuracy in the state of the art [1]
[1] Huang, Zhiheng, Wei Xu, and Kai Yu. "Bidirectional LSTM-CRF models for sequence tagging."
2. Named Entity Disambiguation: connecting named entities to a unique identity (e.g. Wikipedia page)
en.wikipedia.org/wiki/Donald_Trump
en.wikipedia.org/wiki/North_American_Free_Trade_Agreement
1. en.wikipedia.org/wiki/Defensive_wall
2. .../wiki/Berlin wall
3. .../wiki/The Wall (album)
4. .../wiki/Mexico-United_States_barrier
Current Approaches
Historically, most Named Entity Disambiguation techniques rely on Rule-Based Natural Language Processing (NLP)
● Pro:
○ Usually not computationally intensive
● Cons:
○ Can’t deal with ambiguity
○ Dependency on grammar and language
Our Goal: an approach which is language independent and can deal with ambiguity
Proposed Approach
● We exploit the structure of Wikipedia to obtain a large graph (~100M edges)
○ DBpedia stores information extracted from Wikipedia in a structured way, as triples:
Subject: “Donald_Trump”
Relation: “birthPlace”
Object: “Queens”
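As a rough illustration of how such subject/relation/object triples can become graph edges, here is a minimal Python sketch. The second triple and the helper name `build_adjacency` are made up for the example; they do not come from the talk or from any DBpedia tooling.

```python
from collections import defaultdict

# Each triple is (subject, relation, object). The first one is the example
# from the slide; the second one is a hypothetical extra triple.
triples = [
    ("Donald_Trump", "birthPlace", "Queens"),
    ("Queens", "isPartOf", "New_York_City"),
]

def build_adjacency(triples):
    """Map each subject vertex to its outgoing (relation, object) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return dict(graph)

graph = build_adjacency(triples)
print(graph["Donald_Trump"])
```

Subjects and objects play the role of vertices, and each relation labels a directed edge between them.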
Proposed Approach
● Our work extends the state-of-the-art method of Quantified
Collective Validation (QCV) [2]
● High Level Pipeline:
[2] Wang, Han, et al. "Language and domain independent entity linking with quantified collective validation."
Preprocessing: Graph Building → Preprocessing & PageRank
In-Production Execution: New Text → Candidate selection → Collective Optimization → Entity Disambiguation
Graph Building
From DBpedia, we build 2 graphs...
Relation Graph
It contains “standard” relations
Redirects Graph
It contains “redirection” relations,
used to solve ambiguity
... and join them together!
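The slides do not say how the two graphs are joined. One simple reading is a union of the two edge lists, with redirect edges kept under their own label so that an alias can be resolved to its canonical vertex. The sketch below assumes that reading; the edge data, the `"redirect"` label and both function names are hypothetical.

```python
# Hypothetical edge lists: (source, relation, destination).
relation_edges = [("Donald_Trump", "birthPlace", "Queens")]
redirect_edges = [("Trump", "redirect", "Donald_Trump")]

def join_graphs(relations, redirects):
    """Union of the two edge lists, dropping exact duplicates, order kept."""
    seen = set()
    joined = []
    for edge in list(relations) + list(redirects):
        if edge not in seen:
            seen.add(edge)
            joined.append(edge)
    return joined

def resolve(name, edges):
    """Follow a single redirect edge, if any, to reach the canonical vertex."""
    for source, relation, destination in edges:
        if source == name and relation == "redirect":
            return destination
    return name

joined = join_graphs(relation_edges, redirect_edges)
print(resolve("Trump", joined))
```

With this layout an ambiguous surface form such as "Trump" can be mapped to a canonical vertex before any relation edges are followed.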
Preprocessing
● We precompute 2 measures for edges and vertices
Entropy: how much “information” an edge has
Salience: how “important” each vertex is, similar to PageRank
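The slides only say that salience is similar to PageRank. As a point of reference, here is a minimal sketch of plain PageRank by power iteration. The damping factor, the iteration count and the toy graph are assumptions, and this is not the salience formula used in the talk.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration on a dict {vertex: [successors]}."""
    vertices = set(out_links)
    for targets in out_links.values():
        vertices.update(targets)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Every vertex starts each round with the "teleport" share.
        new_rank = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            targets = out_links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling vertex: spread its rank uniformly over all vertices.
                share = damping * rank[v] / n
                for t in vertices:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Toy graph: A and B both point to C, and C points back to A.
toy = {"A": ["C"], "B": ["C"], "C": ["A"]}
scores = pagerank(toy)
```

In this toy graph C collects links from both A and B, so it ends up with the highest score, while B, which nobody links to, ends up with the lowest.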
Candidate Selection
● Idea: for each named entity, pick a small number of candidate
vertices, through string similarity.
Advantage 1:
Problem size reduction
Advantage 2:
Dealing with ambiguity
Candidates:
1. http://dbpedia.org/page/Defensive_wall
2. http://.../Berlin wall
3. http://.../The Wall (album)
4. http://.../Mexico-United_States_barrier
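The talk does not name the string similarity measure it uses. The sketch below stands in Python's `difflib.SequenceMatcher` ratio for it and keeps the top-k vertex names; the vertex list, the value of `k` and the function name `select_candidates` are assumptions made for the example.

```python
from difflib import SequenceMatcher

def select_candidates(mention, vertex_names, k=4):
    """Return the k vertex names most similar to the mention (assumed measure)."""
    def score(name):
        cleaned = name.replace("_", " ").lower()
        return SequenceMatcher(None, mention.lower(), cleaned).ratio()
    return sorted(vertex_names, key=score, reverse=True)[:k]

# Hypothetical vertex names, loosely based on the candidates listed above.
vertices = [
    "Defensive_wall",
    "Berlin_Wall",
    "The_Wall_(album)",
    "Mexico-United_States_barrier",
    "Queens",
    "Donald_Trump",
]
candidates = select_candidates("wall", vertices, k=3)
print(candidates)
```

A purely string-based score like this one ranks the three names that contain "wall" first and does not surface Mexico-United_States_barrier, so a candidate of that kind has to come from some other source of information than the surface string.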
Collective Linking
[Figure: Candidates, Input Graph and Candidate Graphs; Salience and Entropy are used to select the Best Match]
Experimental Setup
Problem:
● We have huge graphs (~15M vertices, ~100M edges)
● We need fast execution time (a few seconds at most)
Solution:
● Oracle PGX, state-of-the-art toolkit for graph analytics.
○ Graph queries
○ Custom algorithms
○ Graph modifications
Preliminary Results
● We are still working on the 4th stage of the pipeline
● According to the paper, > 75% disambiguation accuracy
● With our extensions, we can already obtain almost 80% accuracy on tweets
○ Similar to in-production data
Thank you!
Named Entity Disambiguation via Large-Scale Graphs Analytics
Alberto Parravicini
alberto.parravicini@mail.polimi.it
Entropy and Salience
● Entropy: computed on each relation/edge. It measures how random the destinations of a relation are.
● Salience: computed on each vertex, similar to PageRank.
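The exact formula is not recoverable from the slides. One natural way to measure how random the destinations of a relation are is the Shannon entropy of its destination distribution, sketched below. The choice of Shannon entropy in bits, the toy edges and the function name `relation_entropy` are all assumptions.

```python
import math
from collections import Counter

def relation_entropy(edges, relation):
    """Shannon entropy (in bits) of the destinations of one relation."""
    destinations = [dst for _, rel, dst in edges if rel == relation]
    if not destinations:
        return 0.0
    total = len(destinations)
    counts = Counter(destinations)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical edges: (source, relation, destination).
edges = [
    ("A", "country", "Italy"),
    ("B", "country", "Italy"),
    ("C", "country", "Italy"),
    ("A", "birthPlace", "Milan"),
    ("B", "birthPlace", "Rome"),
]
country_entropy = relation_entropy(edges, "country")
birthplace_entropy = relation_entropy(edges, "birthPlace")
```

Here every "country" edge points to the same destination, so its entropy is zero, while "birthPlace" is split evenly over two destinations and gets one bit.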
Graph Similarity
● First, compute a measure of topological similarity: the percentage of vertices in common.
● Then, combine it with the salience and the entropy of the candidate.
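How the three terms are combined is not shown in the surviving text. The sketch below assumes a plain product of the vertex overlap, the salience and the entropy, purely to make the two steps concrete. All numbers, vertex sets and function names are hypothetical.

```python
def vertex_overlap(input_vertices, candidate_vertices):
    """Fraction of the input graph's vertices also found in the candidate graph."""
    input_vertices = set(input_vertices)
    if not input_vertices:
        return 0.0
    return len(input_vertices & set(candidate_vertices)) / len(input_vertices)

def candidate_score(input_vertices, candidate_vertices, salience, entropy):
    """Assumed combination: plain product of the three terms."""
    return vertex_overlap(input_vertices, candidate_vertices) * salience * entropy

# Hypothetical input graph and candidate graphs with made-up salience/entropy.
input_graph = {"U.S.", "Trump", "Mexico", "NAFTA"}
candidate_graphs = {
    "Mexico-United_States_barrier": ({"U.S.", "Mexico", "Trump"}, 0.5, 1.0),
    "Berlin_Wall": ({"Germany", "U.S."}, 0.5, 1.0),
}
scores = {
    name: candidate_score(input_graph, verts, salience, entropy)
    for name, (verts, salience, entropy) in candidate_graphs.items()
}
best = max(scores, key=scores.get)
print(best)
```

With equal salience and entropy, the candidate sharing more vertices with the input graph wins; a different combination (for example a weighted sum) would change the ranking only when the other two terms differ.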
Oracle PGX
[Figure: PGX Shell and Java/Python API, on top of the PGX API, on top of the PGX Engine]
● Java Interface
● PGQL (queries)
● Green-Marl (Algorithm DSL)
Leveraging Graphs
● Wikipedia pages are used to build a graph.
● We match the text to the Knowledge Base through its topological relations.
[Figure: graph with the vertices U.S., Trump, Mexico, NAFTA and the mention “Wall”]
Candidates:
1. http://dbpedia.org/page/Defensive_wall
2. http://.../Berlin wall
3. http://.../The Wall (album)
4. http://.../Mexico-United_States_barrier

Exploiting large-scale graph analytics for unsupervised Entity Linking

  • 1. Named Entity Disambiguation via Large-Scale Graphs Analytics Alberto Parravicini 2018-05-05 NECSTlab
  • 2.
  • 3. 3
  • 4. 4
  • 5. ● Finance news have a direct impact on the market. ● Advertising targeted advertising for each user. ● Recommender Systems targeted recommendation for each user. Understanding trending topics 5
  • 6. ● The identification of topics requires 2 main steps: Extracting Topics from Text 6
  • 7. ● The identification of topics requires 2 main steps: Extracting Topics from Text 1. Named Entity Recognition: spot names of persons, companies, etc… ○ High-accuracy in the state-of-the-art [1] [1] Huang, Zhiheng, Wei Xu, and Kai Yu. "Bidirectional LSTM-CRF models for sequence tagging." 7
  • 8. ● The identification of topics requires 2 main steps: Extracting Topics from Text 2. Named Entity Disambiguation: connecting named entities to a unique identity (e.g. Wikipedia page) en.wikipedia.org/wiki/ Donald_Trump en.wikipedia.org/wiki/ North_American_Free_Trade_ Agreement
  • 9. ● The identification of topics requires 2 main steps: Extracting Topics from Text 2. Named Entity Disambiguation: connecting named entities to a unique identity (e.g. Wikipedia page) en.wikipedia.org/wiki/ Donald_Trump en.wikipedia.org/wiki/ North_American_Free_Trade_ Agreement 1. en.wikipedia.org/wiki/Defensive_wall 2. .../wiki/Berlin wall 3. .../wiki/The Wall (album) 4. .../wiki/Mexico-United_States_barrier
  • 10. Historically, most Named Entity Disambiguation techniques rely on Rule-Based Natural Language Processing (NLP) Current Approaches 10 ● Pro: ○ Usually not computationally intensive ● Cons: ○ Can’t deal with ambiguity ○ Dependency on grammar and language
  • 11. Historically, most Named Entity Disambiguation techniques rely on Rule-Based Natural Language Processing (NLP) Current Approaches 11 ● Pro: ○ Usually not computationally intensive ● Cons: ○ Can’t deal with ambiguity ○ Dependency on grammar and language Our Goal: an approach which is language independent and can deal with ambiguity
  • 12. Proposed Approach ● We exploit the structure of Wikipedia to obtain a large graph (~100M edges) ○ DBpedia stores the information extracted from Wikipedia in a structured way, as subject-relation-object triples:
  • 13. Proposed Approach ● We exploit the structure of Wikipedia to obtain a large graph (~100M edges) ○ DBpedia stores the information extracted from Wikipedia in a structured way, as subject-relation-object triples: Subject “Donald_Trump”
  • 14. Proposed Approach ● We exploit the structure of Wikipedia to obtain a large graph (~100M edges) ○ DBpedia stores the information extracted from Wikipedia in a structured way, as subject-relation-object triples: Subject “Donald_Trump”, Relation “birthPlace”
  • 15. Proposed Approach ● We exploit the structure of Wikipedia to obtain a large graph (~100M edges) ○ DBpedia stores the information extracted from Wikipedia in a structured way, as subject-relation-object triples: Subject “Donald_Trump”, Relation “birthPlace”, Object “Queens”
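The triple structure above maps naturally onto a labeled directed graph. The following is a minimal Python sketch of that idea, not the authors' implementation: only the first triple comes from the slides, while the other two triples and the `build_graph` helper are hypothetical and exist purely for illustration.

```python
from collections import defaultdict

# (subject, relation, object) triples. Only the first one appears on the
# slides; the other two are made-up examples.
TRIPLES = [
    ("Donald_Trump", "birthPlace", "Queens"),
    ("Queens", "isPartOf", "New_York_City"),
    ("Donald_Trump", "residence", "New_York_City"),
]


def build_graph(triples):
    """Return {subject: [(relation, object), ...]}, a labeled adjacency list."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return dict(graph)


graph = build_graph(TRIPLES)
for relation, obj in graph["Donald_Trump"]:
    print(f"Donald_Trump --{relation}--> {obj}")
```

Each subject becomes a vertex and each relation a labeled outgoing edge; vertices that only appear as objects have no entry of their own in this simplified structure.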
  • 17. Proposed Approach ● Our work extends the state-of-the-art method of Quantified Collective Validation (QCV) [2] ● High-Level Pipeline: [2] Wang, Han, et al. "Language and domain independent entity linking with quantified collective validation."
  • 18. Proposed Approach ● Our work extends the state-of-the-art method of Quantified Collective Validation (QCV) [2] ● High-Level Pipeline: Preprocessing: Graph Building → Preprocessing & PageRank [2] Wang, Han, et al. "Language and domain independent entity linking with quantified collective validation."
  • 19. Proposed Approach ● Our work extends the state-of-the-art method of Quantified Collective Validation (QCV) [2] ● High-Level Pipeline: Preprocessing: Graph Building → Preprocessing & PageRank; In-Production Execution: New Text → Candidate selection → Collective Optimization → Entity Disambiguation [2] Wang, Han, et al. "Language and domain independent entity linking with quantified collective validation."
  • 20. Graph Building: From DBpedia, we build 2 graphs...
  • 21. Graph Building: From DBpedia, we build 2 graphs... Relation Graph: it contains “standard” relations
  • 22. Graph Building: From DBpedia, we build 2 graphs... Relation Graph: it contains “standard” relations. Redirects Graph: it contains “redirection” relations, used to resolve ambiguity
  • 23. Graph Building: ... and join them together!
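The slides do not say how the two graphs are joined. One plausible reading is a plain union of their edge sets, sketched below in Python. The `join_graphs` helper, the `"redirect"` edge label, and the sample redirect edge are all assumptions made for illustration.

```python
def join_graphs(relation_edges, redirect_edges):
    """Union two edge lists of (source, label, target) tuples.

    Duplicates are dropped and the original order is preserved.
    """
    seen = set()
    joined = []
    for edge in list(relation_edges) + list(redirect_edges):
        if edge not in seen:
            seen.add(edge)
            joined.append(edge)
    return joined


# Relation Graph: "standard" relations (this triple is from the slides).
relation_edges = [("Donald_Trump", "birthPlace", "Queens")]
# Redirects Graph: hypothetical redirect from an alternative name.
redirect_edges = [("Trump", "redirect", "Donald_Trump")]

joined = join_graphs(relation_edges, redirect_edges)
```

With the redirect edges in the same graph, an alternative surface form such as "Trump" becomes reachable from, and connected to, the canonical vertex.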
  • 24. Preprocessing ● We precompute 2 measures for edges and vertices
  • 25. Preprocessing ● We precompute 2 measures for edges and vertices. Entropy: how much “information” an edge has
  • 26. Preprocessing ● We precompute 2 measures for edges and vertices. Entropy: how much “information” an edge has. Salience: how “important” each vertex is, similar to PageRank
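The slides describe salience only as "similar to PageRank" and do not give its exact definition. As a stand-in, here is a small Python sketch of plain PageRank by power iteration. The damping factor, the iteration count, and the toy graph are assumptions, and the real salience measure may differ.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration on {vertex: [targets]}."""
    vertices = set(out_links)
    for targets in out_links.values():
        vertices.update(targets)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Every vertex receives the "teleport" share first.
        new_rank = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            targets = out_links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling vertex: spread its rank uniformly over all vertices.
                share = damping * rank[v] / n
                for t in vertices:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical): A and B both point to C, and C points back to A.
toy_rank = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

In the toy graph, C collects rank from two vertices and ends up the most "important", while B, which nothing links to, ends up the least.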
  • 28. Candidate Selection ● Idea: for each named entity, pick a small number of candidate vertices through string similarity.
  • 29. Candidate Selection ● Idea: for each named entity, pick a small number of candidate vertices through string similarity. Advantage 1: Problem size reduction. Candidates: 1. http://dbpedia.org/page/Defensive_wall 2. http://.../Berlin wall 3. http://.../The Wall (album) 4. http://.../Mexico-United_States_barrier
  • 30. Candidate Selection ● Idea: for each named entity, pick a small number of candidate vertices through string similarity. Advantage 1: Problem size reduction. Advantage 2: Dealing with ambiguity. Candidates: 1. http://dbpedia.org/page/Defensive_wall 2. http://.../Berlin wall 3. http://.../The Wall (album) 4. http://.../Mexico-United_States_barrier
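The slides do not specify which string similarity measure is used. The sketch below uses Python's standard `difflib` module as one possible choice. The `select_candidates` helper, the cutoff value, and the vertex list are assumptions for illustration.

```python
import difflib


def select_candidates(mention, vertex_names, max_candidates=4, cutoff=0.35):
    """Return up to max_candidates vertex names that resemble the mention.

    Names are lowercased and underscores become spaces before comparison;
    results are ordered from most to least similar.
    """
    normalized = {name.lower().replace("_", " "): name for name in vertex_names}
    matches = difflib.get_close_matches(
        mention.lower(), list(normalized), n=max_candidates, cutoff=cutoff
    )
    return [normalized[m] for m in matches]


VERTICES = [
    "Defensive_wall",
    "Berlin_wall",
    "The_Wall_(album)",
    "Mexico-United_States_barrier",
    "Donald_Trump",
]

candidates = select_candidates("wall", VERTICES)
```

Note that pure string similarity cannot surface "Mexico-United_States_barrier" for the mention "wall", since the two strings share almost no characters. Reaching that candidate presumably needs extra links such as the redirection relations mentioned earlier, although the slides do not spell this out.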
  • 35. Collective Linking [Figure labels: Input Graph, Candidate Graphs, Candidates, Salience, Entropy]
  • 36. Collective Linking [Figure labels: Input Graph, Candidate Graphs, Candidates, Salience, Entropy, Best Match!]
  • 37. Experimental Setup Problem: ● We have huge graphs (~15M vertices, ~100M edges) ● We need fast execution time (a few seconds at most)
  • 38. Experimental Setup Problem: ● We have huge graphs (~15M vertices, ~100M edges) ● We need fast execution time (a few seconds at most) Solution: ● Oracle PGX, state-of-the-art toolkit for graph analytics. ○ Graph queries ○ Custom algorithms ○ Graph modifications
  • 39. Preliminary Results ● We are still working on the 4th stage of the pipeline ● According to the QCV paper [2], the method reaches > 75% disambiguation accuracy ● With our extensions, we can already obtain almost 80% accuracy on tweets ○ Similar to in-production data
  • 40. Thank you! Named Entity Disambiguation via Large-Scale Graphs Analytics Alberto Parravicini alberto.parravicini@mail.polimi.it
  • 41. Entropy and Salience ● Entropy: computed on each relation/edge; it measures how random the destinations of a relation are. ● Salience: computed on each vertex, similar to PageRank.
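The formula on this slide did not survive extraction. One natural reading of "how random the destinations of a relation are" is the Shannon entropy of the distribution of a relation's target vertices, sketched below in Python. This is an assumption rather than the definition used in the talk, and the sample triples are invented.

```python
import math
from collections import Counter


def relation_entropy(triples, relation):
    """Shannon entropy (in bits) of the destinations of one relation."""
    destinations = Counter(obj for _, rel, obj in triples if rel == relation)
    total = sum(destinations.values())
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in destinations.values()
    )


# Hypothetical triples: "type" always points to the same vertex,
# while "birthPlace" points to four different ones.
SAMPLE = [
    ("A", "type", "Person"), ("B", "type", "Person"),
    ("C", "type", "Person"), ("D", "type", "Person"),
    ("A", "birthPlace", "Queens"), ("B", "birthPlace", "Milan"),
    ("C", "birthPlace", "Berlin"), ("D", "birthPlace", "Paris"),
]

type_entropy = relation_entropy(SAMPLE, "type")
birth_entropy = relation_entropy(SAMPLE, "birthPlace")
```

Under this reading, a relation whose edges all lead to the same vertex carries no information (entropy 0), while a relation with many equally likely destinations has high entropy.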
  • 42. Graph Similarity ● First, compute a measure of topological similarity ● Then, combine it with salience and entropy [Formula annotations: percentage of vertices in common; salience of candidate; entropy of candidate]
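The exact scoring formula is not recoverable from the transcript. The Python sketch below shows one way the three ingredients could be combined: the overlap is computed as a Jaccard ratio and the combination is a weighted sum. Both choices are assumptions, and every name, neighbor set, and number in the example is hypothetical.

```python
def vertex_overlap(a, b):
    """Fraction of vertices in common (Jaccard ratio) between two vertex sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_score(context, neighbors, salience, entropy, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of topological overlap, salience and entropy (assumed form)."""
    w_sim, w_sal, w_ent = weights
    return w_sim * vertex_overlap(context, neighbors) + w_sal * salience + w_ent * entropy


# Entities already linked in the input text (hypothetical).
context = {"Donald_Trump", "Mexico", "United_States"}

# Candidate -> (neighbor vertices, salience, entropy); all values are made up.
CANDIDATES = {
    "Berlin_wall": ({"Berlin", "Germany", "Cold_War"}, 0.4, 0.5),
    "Mexico-United_States_barrier": (
        {"Mexico", "United_States", "Donald_Trump", "Border"}, 0.2, 0.5,
    ),
}

scores = {
    name: candidate_score(context, neighbors, salience, entropy)
    for name, (neighbors, salience, entropy) in CANDIDATES.items()
}
best = max(scores, key=scores.get)
```

In this toy setup the barrier candidate wins despite its lower salience, because its neighborhood shares three of four vertices with the entities found in the text.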
  • 43. Oracle PGX ● Java Interface ● PGQL (queries) ● Green-Marl (Algorithm DSL) [Architecture diagram labels: Pgx Shell, Java/Python API, Pgx API, Pgx Engine]
  • 44. Leveraging Graphs ● Wikipedia pages are used to build a graph. ● We match the text to the Knowledge Base through its topological relations. [Figure labels: U.S., Trump, Mexico, NAFTA]
  • 45. Leveraging Graphs ● Wikipedia pages are used to build a graph. ● We match the text to the Knowledge Base through its topological relations. [Figure labels: U.S., Trump, Mexico, NAFTA, “Wall”] Candidates: 1. http://dbpedia.org/page/Defensive_wall 2. http://.../Berlin wall 3. http://.../The Wall (album) 4. http://.../Mexico-United_States_barrier