Exploiting large-scale graph analytics for unsupervised Entity Linking

Named Entity Disambiguation via Large-Scale Graphs Analytics
Alberto Parravicini
2018-05-05
NECSTlab

Understanding trending topics
● Finance: news has a direct impact on the market.
● Advertising: targeted advertising for each user.
● Recommender Systems: targeted recommendations for each user.

Extracting Topics from Text
● The identification of topics requires 2 main steps:

1. Named Entity Recognition: spot names of persons, companies, etc.
○ High accuracy in the state of the art [1]
[1] Huang, Zhiheng, Wei Xu, and Kai Yu. "Bidirectional LSTM-CRF models for sequence tagging."

2. Named Entity Disambiguation: connecting named entities to a unique identity (e.g. Wikipedia page)
○ en.wikipedia.org/wiki/Donald_Trump
○ .../wiki/North_American_Free_Trade_Agreement

Candidates for an ambiguous named entity (e.g. “Wall”):
1. en.wikipedia.org/wiki/Defensive_wall
2. .../wiki/Berlin wall
3. .../wiki/The Wall (album)
4. .../wiki/Mexico-United_States_barrier

Current Approaches
Historically, most Named Entity Disambiguation techniques rely on Rule-Based Natural Language Processing (NLP)
● Pro:
○ Usually not computationally intensive
● Cons:
○ Can’t deal with ambiguity
○ Dependency on grammar and language

Our Goal: an approach that is language independent and can deal with ambiguity

Proposed Approach
● We exploit the structure of Wikipedia to obtain a large graph (~100M edges)
○ DBpedia contains all the information in Wikipedia, stored in a structured way:

Subject: “Donald_Trump”
Relation: “birthPlace”
Object: “Queens”
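The subject/relation/object structure above maps naturally onto a labeled directed graph. The following is a minimal sketch of that idea, not the loader used in the talk: the function name `build_graph` and the second triple are assumptions added for illustration, and only the first triple comes from the slides.

```python
from collections import defaultdict

# Sample triples. Only the first one appears in the slides;
# the second is a hypothetical addition to make the graph non-trivial.
triples = [
    ("Donald_Trump", "birthPlace", "Queens"),
    ("Queens", "isPartOf", "New_York_City"),
]

def build_graph(triples):
    """Return an adjacency map: subject -> list of (relation, object)."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return dict(graph)

graph = build_graph(triples)
```

Each triple becomes one edge, so a knowledge base with ~100M triples yields a graph with ~100M edges.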

Proposed Approach
● Our work extends the state-of-the-art method of Quantified
Collective Validation (QCV) [2]
● High Level Pipeline:
[2] Wang, Han, et al. "Language and domain independent entity linking with quantified collective validation."

Proposed Approach
● Preprocessing: Graph Building → Preprocessing & PageRank
● In-Production Execution: New Text → Candidate selection → Collective Optimization → Entity Disambiguation

Graph Building
From DBpedia, we build 2 graphs:
● Relation Graph: it contains “standard” relations
● Redirects Graph: it contains “redirection” relations, used to solve ambiguity
... and join them together!
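One way to picture the join is as a union of the two edge lists into a single labeled graph, in which redirect edges can be followed to reach a canonical vertex. This is only a sketch under assumptions: the edge label `"redirect"`, the alias `"Trump"`, and both function names are hypothetical and do not come from DBpedia or from the talk.

```python
# Toy edge lists as (source, label, destination) tuples.
relation_edges = [("Donald_Trump", "birthPlace", "Queens")]
redirect_edges = [("Trump", "redirect", "Donald_Trump")]  # hypothetical alias

def join_graphs(relation_edges, redirect_edges):
    """Merge both edge lists into one adjacency map: vertex -> [(label, destination)]."""
    joined = {}
    for source, label, destination in relation_edges + redirect_edges:
        joined.setdefault(source, []).append((label, destination))
        joined.setdefault(destination, [])  # make sure sink vertices exist too
    return joined

def follow_redirects(graph, vertex, label="redirect"):
    """Follow redirect edges until a vertex without one is reached (cycle-safe)."""
    seen = {vertex}
    while True:
        targets = [dst for lbl, dst in graph.get(vertex, []) if lbl == label]
        if not targets or targets[0] in seen:
            return vertex
        vertex = targets[0]
        seen.add(vertex)

joined = join_graphs(relation_edges, redirect_edges)
canonical = follow_redirects(joined, "Trump")
```

Keeping the redirect edges in the same graph means an alias and its canonical page can be resolved with an ordinary traversal.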

Preprocessing
● We precompute 2 measures for edges and vertices
○ Entropy: how much “information” an edge has
○ Salience: how “important” each vertex is, similar to PageRank
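Since salience is described as similar to PageRank, a plain PageRank computed by power iteration gives a feel for what is precomputed on each vertex. This is a sketch of standard PageRank, not the salience formula of the talk or of the QCV paper; the damping factor, the iteration count, and the toy graph are assumptions.

```python
def pagerank(out_edges, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    out_edges maps each vertex to the list of vertices it links to.
    """
    vertices = set(out_edges) | {v for targets in out_edges.values() for v in targets}
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in vertices if not out_edges.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in vertices}
        for v, targets in out_edges.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: A and B both link to C.
toy_rank = pagerank({"A": ["C"], "B": ["C"], "C": []})
```

In the toy graph, the vertex that receives links from both other vertices ends up with the highest rank, which matches the intuition of an “important” vertex.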

Candidate Selection
● Idea: for each named entity, pick a small number of candidate
vertices, through string similarity.
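Candidate selection by string similarity can be sketched with the standard-library `difflib` module. The similarity measure used in the talk is not stated, so `SequenceMatcher.ratio()` is an assumption here, as are the function name `select_candidates`, the cutoff `k`, and the sample vertex names.

```python
from difflib import SequenceMatcher

def select_candidates(mention, vertex_names, k=3):
    """Return the k vertex names most similar to the mention."""
    def similarity(name):
        # Compare case-insensitively, treating underscores as spaces.
        cleaned = name.replace("_", " ").lower()
        return SequenceMatcher(None, mention.lower(), cleaned).ratio()
    return sorted(vertex_names, key=similarity, reverse=True)[:k]

# Hypothetical vertex names, loosely based on the examples in the slides.
names = [
    "Defensive_wall",
    "Berlin_Wall",
    "The_Wall_(album)",
    "Mexico-United_States_barrier",
    "Donald_Trump",
]
candidates = select_candidates("wall", names)
```

A pure string measure keeps only names that look like the mention, so it reduces the problem size but cannot, by itself, decide which candidate is the right one.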

Candidate Selection
● Advantage 1: Problem size reduction
● Advantage 2: Dealing with ambiguity
Candidates:
1. http://dbpedia.org/page/Defensive_wall
2. http://.../Berlin wall
3. http://.../The Wall (album)
4. http://.../Mexico-United_States_barrier

Collective Linking
[Figure labels: Candidates, Input Graph, Candidate Graphs, Salience, Entropy, Best Match!]

Experimental Setup
Problem:
● We have huge graphs (~15M vertices, ~100M edges)
● We need fast execution time (a few seconds at most)
Solution:
● Oracle PGX, a state-of-the-art toolkit for graph analytics.
○ Graph queries
○ Custom algorithms
○ Graph modifications

Preliminary Results
● We are still working on the 4th stage of the pipeline
● According to the QCV paper [2], the method reaches > 75% disambiguation accuracy
● With our extensions, we can already obtain almost 80% accuracy on tweets
○ Similar to in-production data

Thank you!
Named Entity Disambiguation via Large-Scale Graphs Analytics
Alberto Parravicini
alberto.parravicini@mail.polimi.it

Entropy and Salience
● Entropy: computed on each relation/edge; it measures how random the destinations of a relation are.
● Salience: computed on each vertex, similar to PageRank.
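"How random the destinations of a relation are" can be read as the Shannon entropy of the distribution of objects reached through that relation. The sketch below follows that reading; the exact formula used in the talk is not shown in the slides, so the use of base-2 Shannon entropy, the function name, and the sample triples are all assumptions.

```python
import math
from collections import Counter, defaultdict

def relation_entropy(triples):
    """Shannon entropy (in bits) of the destination distribution of each relation."""
    destinations = defaultdict(Counter)
    for _subject, relation, obj in triples:
        destinations[relation][obj] += 1
    entropy = {}
    for relation, counts in destinations.items():
        total = sum(counts.values())
        entropy[relation] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy

# Hypothetical triples: "type" always points to the same object,
# while "birthPlace" points to a different object each time.
sample = [
    ("Donald_Trump", "type", "Person"),
    ("Barack_Obama", "type", "Person"),
    ("Donald_Trump", "birthPlace", "Queens"),
    ("Barack_Obama", "birthPlace", "Honolulu"),
]
entropies = relation_entropy(sample)
```

Under this reading, a relation whose destination is always the same carries no information, while a relation with many different destinations is more discriminative.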

Graph Similarity
● First, compute a measure of topological similarity: the percentage of vertices in common.
● Then, combine it with the salience and the entropy of the candidate.
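The two steps above can be sketched as a scoring function over candidates. The slides do not give the formula, so two assumptions are made and should be read as placeholders: the "percentage of vertices in common" is taken as the Jaccard index of the two vertex sets, and the three quantities are combined with a plain product. All names and numeric values are hypothetical.

```python
def vertex_overlap(input_vertices, candidate_vertices):
    """Fraction of vertices shared by the two graphs (Jaccard index, an assumption)."""
    a, b = set(input_vertices), set(candidate_vertices)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidate_score(input_vertices, candidate_vertices, salience, entropy):
    """Combine topological similarity with salience and entropy (product is assumed)."""
    return vertex_overlap(input_vertices, candidate_vertices) * salience * entropy

def best_match(input_vertices, candidates):
    """candidates maps a name to (neighbourhood vertices, salience, entropy)."""
    return max(
        candidates,
        key=lambda name: candidate_score(input_vertices, *candidates[name]),
    )

# Hypothetical input graph and candidate neighbourhoods for the mention "Wall".
input_vertices = {"U.S.", "Trump", "Mexico", "NAFTA"}
wall_candidates = {
    "Mexico-United_States_barrier": ({"U.S.", "Mexico", "Trump"}, 0.4, 0.9),
    "Berlin_Wall": ({"Germany", "Berlin"}, 0.6, 0.9),
}
winner = best_match(input_vertices, wall_candidates)
```

With these made-up numbers, the candidate whose neighbourhood overlaps the entities found in the text wins even though the other candidate has a higher salience, which is the intuition behind matching through topological relations.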

Oracle PGX
[Figure layers: PGX Shell, Java/Python API; PGX API; PGX Engine]
● Java Interface
● PGQL (queries)
● Green-Marl (Algorithm DSL)

Leveraging Graphs
● Wikipedia pages are used to build a graph.
● We match the text to the Knowledge Base through its topological relations.
[Figure: graph with vertices U.S., Trump, Mexico, NAFTA, and the candidates for “Wall”]
