Entity Linking, the task of linking mentions (of persons, organizations, etc…) found in a document to a unique entity in a knowledge base, while deceptively simple, has proven to be very challenging to perform. This task is even harder when documents in different languages, or from restricted domains, are considered.
Entity Linking is important for understanding the topic of articles or social media posts, and can be used for marketing, advertising, and many other applications.
Most of the existing research on the topic is based on Natural Language Processing and on supervised models, which provide little flexibility and generalization capabilities.
Instead, it is possible to leverage the graph-like structure of large knowledge bases like DBpedia to vastly improve the quality of Entity Linking.
Furthermore, it is possible to represent input documents in a graph-like way and exploit measures of topological similarity between the original document and the knowledge base to collectively link all the mentions in a document at the same time.
In this work, we implement and extend the state-of-the-art Entity Linking system Quantified Collective Validation, using Oracle PGX to analyze the full DBpedia graph in memory and in parallel, in order to efficiently and effectively perform entity linking on tweets and news articles.
5. Understanding trending topics
● Finance: news has a direct impact on the market.
● Advertising: targeted advertising for each user.
● Recommender Systems: targeted recommendation for each user.
7. Extracting Topics from Text
● The identification of topics requires 2 main steps:
1. Named Entity Recognition: spot names of persons, companies, etc…
○ High accuracy in the state-of-the-art [1]
[1] Huang, Zhiheng, Wei Xu, and Kai Yu. "Bidirectional LSTM-CRF models for sequence tagging."
9. Extracting Topics from Text
● The identification of topics requires 2 main steps:
2. Named Entity Disambiguation: connecting named entities to a unique identity (e.g. Wikipedia page)
○ en.wikipedia.org/wiki/Donald_Trump
○ en.wikipedia.org/wiki/North_American_Free_Trade_Agreement
Candidates:
1. en.wikipedia.org/wiki/Defensive_wall
2. .../wiki/Berlin wall
3. .../wiki/The Wall (album)
4. .../wiki/Mexico-United_States_barrier
11. Current Approaches
Historically, most Named Entity Disambiguation techniques rely on Rule-Based Natural Language Processing (NLP)
● Pro:
○ Usually not computationally intensive
● Cons:
○ Can’t deal with ambiguity
○ Dependency on grammar and language
Our Goal: an approach which is language independent and can deal with ambiguity
16. Proposed Approach
● We exploit the structure of Wikipedia to obtain a large graph (~100M edges)
○ DBpedia contains all the information in Wikipedia, stored in a structured way:
Subject: “Donald_Trump”
Relation: “birthPlace”
Object: “Queens”
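A minimal sketch of how such subject/relation/object triples could be held in memory as a directed, labeled graph. The `Triple` type and the `build_adjacency` helper are illustrative names chosen here; they are not part of DBpedia or of the presented system.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One DBpedia-style statement: subject --relation--> object."""
    subject: str
    relation: str
    obj: str


def build_adjacency(triples):
    """Group triples by subject into {subject: [(relation, object), ...]}."""
    adjacency = defaultdict(list)
    for t in triples:
        adjacency[t.subject].append((t.relation, t.obj))
    return dict(adjacency)


# The example triple shown on the slide.
triples = [Triple("Donald_Trump", "birthPlace", "Queens")]
graph = build_adjacency(triples)
```

Each vertex is a page and each edge carries the relation name, which is the shape the later stages (entropy per relation, salience per vertex) operate on.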
19. Proposed Approach
● Our work extends the state-of-the-art method of Quantified Collective Validation (QCV) [2]
● High Level Pipeline:
○ Preprocessing: Graph Building → Preprocessing & PageRank
○ In-Production Execution: New Text → Candidate selection → Collective Optimization → Entity Disambiguation
[2] Wang, Han, et al. "Language and domain independent entity linking with quantified collective validation."
22. Graph Building
From DBpedia, we build 2 graphs:
● Relation Graph: it contains “standard” relations
● Redirects Graph: it contains “redirection” relations, used to solve ambiguity
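The split into two graphs can be sketched as follows. This is an illustration under assumptions: the relation label `wikiPageRedirects` is taken as the marker for redirection edges, the sample redirect row is made up, and `split_graphs` and `resolve` are hypothetical helper names.

```python
REDIRECT_RELATION = "wikiPageRedirects"  # assumed label for redirection edges


def split_graphs(triples):
    """Split (subject, relation, object) triples into the two graphs."""
    relation_graph = []   # "standard" relations, kept as an edge list
    redirects = {}        # alias vertex -> vertex it redirects to
    for subject, relation, obj in triples:
        if relation == REDIRECT_RELATION:
            redirects[subject] = obj
        else:
            relation_graph.append((subject, relation, obj))
    return relation_graph, redirects


def resolve(vertex, redirects):
    """Follow redirection edges until a vertex without a redirect is reached."""
    seen = set()
    while vertex in redirects and vertex not in seen:  # guard against cycles
        seen.add(vertex)
        vertex = redirects[vertex]
    return vertex


triples = [
    ("Donald_Trump", "birthPlace", "Queens"),
    ("Donald_J._Trump", REDIRECT_RELATION, "Donald_Trump"),  # made-up sample row
]
relation_graph, redirects = split_graphs(triples)
canonical = resolve("Donald_J._Trump", redirects)
```

Keeping redirects separate lets alternative names of a page collapse onto one vertex without adding noise to the relation graph.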
26. Preprocessing
● We precompute 2 measures for edges and vertices
○ Entropy: how much “information” an edge has
○ Salience: how “important” each vertex is, similar to PageRank
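Since salience is described as similar to PageRank, a plain PageRank power iteration gives a rough idea of the vertex measure. This is standard PageRank, not necessarily the exact salience used in the presented system; the damping factor, iteration count, and toy edges are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain PageRank by power iteration over a list of (source, target) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for source, target in edges:
        out_links[source].append(target)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in vertices if not out_links[v])
        new_rank = {
            v: (1.0 - damping) / n + damping * dangling / n for v in vertices
        }
        for v in vertices:
            for target in out_links[v]:
                new_rank[target] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# Toy graph: three vertices point at "Hub", which points back at "A".
edges = [("A", "Hub"), ("B", "Hub"), ("C", "Hub"), ("Hub", "A")]
scores = pagerank(edges)
```

In the toy graph, "Hub" collects rank from all three other vertices and therefore scores higher than "B" or "C", which have no incoming edges.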
30. Candidate Selection
● Idea: for each named entity, pick a small number of candidate vertices, through string similarity.
○ Advantage 1: Problem size reduction
○ Advantage 2: Dealing with ambiguity
Candidates:
1. http://dbpedia.org/page/Defensive_wall
2. http://.../Berlin wall
3. http://.../The Wall (album)
4. http://.../Mexico-United_States_barrier
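Candidate selection through string similarity can be sketched with Python's standard `difflib`. The similarity measure used by the presented system is not stated, so `SequenceMatcher` is a stand-in; the label list and the `select_candidates` helper are made up for illustration.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lowercase a label and treat underscores as spaces."""
    return label.replace("_", " ").lower()


def select_candidates(mention, vertex_labels, k=3):
    """Return the k vertex labels most similar to the mention string."""
    target = normalize(mention)
    scored = [
        (SequenceMatcher(None, target, normalize(label)).ratio(), label)
        for label in vertex_labels
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for _, label in scored[:k]]


# Sample labels (illustrative only).
labels = ["Defensive_wall", "Berlin_Wall", "The_Wall_(album)", "Queens", "NAFTA"]
candidates = select_candidates("Wall", labels, k=3)
```

Only the `k` best-matching vertices are passed on, which shrinks the problem while still keeping several plausible meanings of an ambiguous mention alive for the later collective step.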
38. Experimental Setup
Problem:
● We have huge graphs (~15M vertices, ~100M edges)
● We need fast execution time (a few seconds at most)
Solution:
● Oracle PGX, state-of-the-art toolkit for graph analytics.
○ Graph queries
○ Custom algorithms
○ Graph modifications
39. Preliminary Results
● We are still working on the 4th stage of the pipeline
● According to the paper, > 75% disambiguation accuracy
● With our extensions, we can already obtain almost 80% accuracy on tweets
○ Similar to in-production data
40. Thank you!
Named Entity Disambiguation via Large-Scale Graphs Analytics
Alberto Parravicini
alberto.parravicini@mail.polimi.it
41. Entropy and Salience
● Entropy: computed on each relation/edge; it measures how random the destinations of a relation are.
● Salience: computed on each vertex, similar to PageRank.
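One way to make "how random the destinations of a relation are" concrete is the Shannon entropy of the distribution of destination vertices for each relation. This formalization is an assumption, not a formula taken from the slides; the triples are made up and `relation_entropy` is a hypothetical helper.

```python
import math
from collections import Counter, defaultdict


def relation_entropy(triples):
    """Shannon entropy (in bits) of each relation's destination distribution."""
    destinations = defaultdict(Counter)
    for _subject, relation, obj in triples:
        destinations[relation][obj] += 1

    entropy = {}
    for relation, counts in destinations.items():
        total = sum(counts.values())
        entropy[relation] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


# Made-up triples: "type" always lands on the same vertex,
# while "birthPlace" lands on a different vertex every time.
triples = [
    ("A", "type", "Person"), ("B", "type", "Person"),
    ("C", "type", "Person"), ("D", "type", "Person"),
    ("A", "birthPlace", "Queens"), ("B", "birthPlace", "Rome"),
    ("C", "birthPlace", "Paris"), ("D", "birthPlace", "Milan"),
]
entropy = relation_entropy(triples)
```

Under this definition the fully predictable "type" relation gets 0 bits, while "birthPlace", with four equally likely destinations, gets 2 bits.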
42. Graph Similarity
● First, compute a measure of topological similarity: the percentage of vertices in common.
● Then, combine it with the salience of the candidate and the entropy of the candidate.
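The two steps can be sketched as below. The "percentage of vertices in common" is read here as a Jaccard index over neighbor sets, and the combination with salience and entropy is shown as a weighted sum. Both choices are assumptions, since the slide does not give the actual formula; the function names, weights, and sample values are hypothetical.

```python
def common_vertex_ratio(neighbors_a, neighbors_b):
    """Fraction of vertices shared by two neighborhoods (Jaccard index)."""
    a, b = set(neighbors_a), set(neighbors_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_score(similarity, salience, entropy, weights=(1.0, 1.0, 1.0)):
    """Hypothetical combination: a weighted sum of the three signals."""
    w_sim, w_sal, w_ent = weights
    return w_sim * similarity + w_sal * salience + w_ent * entropy


# Made-up neighborhoods: vertices reachable from the other mentions of a
# document versus vertices linked to one candidate.
context = {"United_States", "Mexico", "NAFTA", "Donald_Trump"}
candidate_neighbors = {"United_States", "Mexico", "Border"}
similarity = common_vertex_ratio(context, candidate_neighbors)
score = candidate_score(similarity, salience=0.2, entropy=0.5)
```

A candidate that shares many neighbors with the rest of the document scores higher, which is what lets all mentions in a document be linked collectively rather than one at a time.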
43. Oracle PGX
Layers: PGX Shell and Java/Python API → PGX API → PGX Engine
● Java Interface
● PGQL (queries)
● Green-Marl (Algorithm DSL)
45. Leveraging Graphs
● Wikipedia pages are used to build a graph.
● We match the text to the Knowledge Base through its topological relations.
Candidates:
1. http://dbpedia.org/page/Defensive_wall
2. http://.../Berlin wall
3. http://.../The Wall (album)
4. http://.../Mexico-United_States_barrier
[Figure: document graph with the mentions U.S., Trump, Mexico, NAFTA, and U.S. “Wall”]