1. SwissLink
High-Precision, Context-Free Entity Linking
Exploiting Unambiguous Labels
Roman Prokofyev, Michael Luggen, Djellel Eddine Difallah, Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
2. Entity Linking
“In natural language processing, entity linking, [...] is the task of determining
the identity of entities mentioned in text.”
https://en.wikipedia.org/wiki/Entity_linking
The identity of an entity is commonly defined as an entry in a Knowledge Base (KB).
Entity linking is usually solved in a multi-step process: Named Entity Recognition
(NER), followed by Candidate Selection, and finally Disambiguation.
3. Entity Linking
1. Named Entity Recognition (NER)
Distinguish ordinary words from defined concepts, also known as
named entities. Often involves a Part-of-Speech (POS) tagger.
2. Candidate Selection
Selecting possible candidates from the target Knowledge Base (where
entities are defined).
3. Disambiguation
Deciding which candidate is the correct identity corresponding to the
mention of a Named Entity.
4. Entity Linking
1. Named Entity Recognition (NER)
“It is a blast to visit Adam once more.”
2. Candidate Selection
Adam -> Adam (name), Adam (city in Oman), Amsterdam
3. Disambiguation
Adam -> https://en.wikipedia.org/wiki/Amsterdam
5. Motivation: High-precision context-free entity linking
● Certain applications require high-precision linked entities
○ Interactive applications where humans review results
○ Machine learning: training predictive models may require high-precision annotated text (no overfitting)
● Context-free
○ Works with any type of input: text, tweets, search queries
○ But limited to unambiguous labels
The F1 score is the harmonic mean of precision and recall, so it balances the two.
This is not necessarily the best optimization target for the task at hand.
[Figure: precision, recall, and F1 score]
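To make the trade-off concrete, the following minimal sketch computes precision, recall, and F1 from raw counts. The two sets of counts are hypothetical and chosen only for illustration; they are not results from this work.

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall, F1) computed from raw counts."""
    predicted = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
careful = precision_recall_f1(true_pos=45, false_pos=2, false_neg=55)
greedy = precision_recall_f1(true_pos=80, false_pos=40, false_neg=20)
```

With these made-up counts the `greedy` linker obtains the higher F1 even though the `careful` linker is far more precise, which is why optimizing F1 can be the wrong goal when precision matters most.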
6. Motivation: Categories of links to Wikipedia
What labels are used to link to entities (as Wikipedia pages) on the web?
● Link by the most common label
○ “web browser” -> <Web_browser> (381’623 times)
● Link by context
○ “divided into three subgroups: East, West, and South” -> <East_Slavic_languages>
● Link by reference
○ “Wikipedia” -> <Angelina_Jolie> (16’333 times)
● Erroneous link
○ “Oregon” -> <University_of_Oregon>
○ Incorrectly linked entity even when considering the context
7. Motivation: Prior probability scores
● Most important feature when not considering context
● Conditional probability P(link|label)
● Problems:
○ Does not necessarily capture ambiguity
Adam -> Adam (name), Adam (city in Oman), Amsterdam
○ Does not take categories into account
Wikipedia -> Angelina_Jolie [16’333]
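The prior probability P(link|label) can be estimated by counting how often each label points to each entity. The sketch below assumes the input is a list of (label, entity) pairs extracted from anchors; the function name, the entity identifiers, and all counts are hypothetical.

```python
from collections import Counter, defaultdict


def prior_probabilities(anchors):
    """Estimate P(entity | label) from observed (label, entity) pairs."""
    counts = defaultdict(Counter)
    for label, entity in anchors:
        counts[label][entity] += 1
    priors = {}
    for label, entity_counts in counts.items():
        total = sum(entity_counts.values())
        priors[label] = {e: c / total for e, c in entity_counts.items()}
    return priors


# Hypothetical anchor observations, for illustration only.
anchors = (
    [("Adam", "Adam_(name)")] * 5
    + [("Adam", "Amsterdam")] * 3
    + [("Adam", "Adam,_Oman")] * 2
    + [("web browser", "Web_browser")] * 10
)
priors = prior_probabilities(anchors)
```

In this toy data the most likely entity for "Adam" has a prior of only 0.5, so picking it would be wrong about half of the time. The score by itself also says nothing about which category of link produced the counts.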
8. Method (Problem)
Problem Formulation.
Given an arbitrary textual document $I_D$ as input, identify all named-entity
substrings $\{l_1, \ldots, l_k\}$ and link them to their respective entities.
Effectively, our methods return as output a set of label-entity pairs
$O_D = \{(l_1, e_z), \ldots, (l_k, e_x)\}$.
9. Method (Different Overall Approach)
Common approach
Named entity recognition -> candidate selection -> disambiguation
Context-free approach
Extract surface forms (KB or annotated corpus) -> clean and catalog -> fast string matching
Surface form: a string representing an entity in a text.
Annotated corpus: e.g. Wikipedia articles, Common Crawl
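The final step of the context-free approach, fast string matching against a catalog, could look roughly like the sketch below. It uses a greedy longest-match over word tokens with a plain dictionary lookup. The function name, the `max_tokens` parameter, the lowercasing, and the tiny catalog are all assumptions made for illustration, not the implementation behind these slides.

```python
import re


def link_entities(text, catalog, max_tokens=4):
    """Greedy longest-match lookup of surface forms in a label -> entity catalog.

    `catalog` maps lowercased surface forms to entity identifiers.
    Returns a list of (label, entity) pairs in order of appearance.
    """
    # Word tokens with their character offsets in the original text.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    links = []
    i = 0
    while i < len(tokens):
        # Try the longest span first, then shrink it token by token.
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            label = text[tokens[i][0]:tokens[i + n - 1][1]]
            entity = catalog.get(label.lower())
            if entity is not None:
                links.append((label, entity))
                i += n
                break
        else:
            i += 1  # no surface form starts at this token
    return links


# Hypothetical catalog of unambiguous surface forms.
catalog = {
    "web browser": "Web_browser",
    "east slavic languages": "East_Slavic_languages",
}
links = link_entities(
    "A web browser renders pages about East Slavic languages.", catalog
)
```

No context is consulted at any point: a span is linked only if its surface form is in the catalog, which is why the catalog must contain unambiguous labels only.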
10. Method (Catalog)
DBpedia
DBpedia labels can be considered as a catalog after the removal of ambiguous
labels. Downside: The labels in DBpedia are rather sparse.
Wikipedia
The internal links of Wikipedia are a good source of surface forms with links to
entities (Wikipedia pages). Downside: Noise is introduced by the different categories
of links (links by context, links by reference, erroneous links).
11. Method
Ratio method
Decides which surface forms are ambiguous labels that cannot be linked
without context.
Percentile method
Removes the long tail and then readjusts the weights to get better recall.
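The slides do not give the exact definitions of the two filters, so the sketch below is only one possible reading. It assumes that the percentile step keeps the most frequent entities of a label until they cover a given percentage of its links, and that the ratio step accepts a label when the runner-up entity has at most `max_ratio` percent of the links of the top entity. The weight readjustment is not modeled. All names, thresholds, and counts are hypothetical.

```python
def trim_long_tail(entity_counts, percentile=99.0):
    """Keep the most frequent entities until `percentile` % of links are covered."""
    threshold = sum(entity_counts.values()) * percentile / 100.0
    kept, covered = {}, 0
    ranked = sorted(entity_counts.items(), key=lambda kv: kv[1], reverse=True)
    for entity, count in ranked:
        if covered >= threshold:
            break
        kept[entity] = count
        covered += count
    return kept


def is_unambiguous(entity_counts, max_ratio=10.0):
    """Assumed ratio test: runner-up has at most `max_ratio` % of the top count."""
    ranked = sorted(entity_counts.values(), reverse=True)
    if len(ranked) < 2:
        return True
    return 100.0 * ranked[1] / ranked[0] <= max_ratio


def build_catalog(label_counts, percentile=99.0, max_ratio=10.0):
    """Map each label that passes both filters to its most frequent entity."""
    catalog = {}
    for label, entity_counts in label_counts.items():
        trimmed = trim_long_tail(entity_counts, percentile)
        if trimmed and is_unambiguous(trimmed, max_ratio):
            catalog[label] = max(trimmed, key=trimmed.get)
    return catalog


# Hypothetical link counts per label, for illustration only.
label_counts = {
    "web browser": {"Web_browser": 980, "Browser_game": 12, "Mosaic_(web_browser)": 8},
    "Adam": {"Adam_(name)": 50, "Amsterdam": 30, "Adam,_Oman": 20},
}
catalog = build_catalog(label_counts)
```

Under these assumptions "web browser" is kept and "Adam" is dropped, because the runner-up for "Adam" has 60% of the top entity's links. Raising `max_ratio` would admit more labels into the catalog, including more ambiguous ones.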
12. Evaluation
Curated ground truth based on Wikipedia articles allows us to compare with
manual annotations in Wikipedia (30 randomly sampled articles).
● Ratio method: low recall
● Ratio+Percentile 99: best
13. Evaluation (Discussion)
● Increasing the ratio introduces more ambiguous labels -> direct impact on
precision
● The percentile method balances this effect by separating the ambiguity
from the popularity of the entities
● In general, we observe that the Percentile-Ratio method with 99-Percentile
and 10-Ratio strikes a good balance between high-precision results (>95%)
and reasonable recall (45%, 1309 entities)