SwissLink
High-Precision, Context-Free Entity Linking
Exploiting Unambiguous Labels
Roman Prokofyev, Michael Luggen, Djellel Eddine Difallah, Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
Entity Linking
“In natural language processing, entity linking, [...] is the task of determining
the identity of entities mentioned in text.”
https://en.wikipedia.org/wiki/Entity_linking
The identity of an entity is commonly defined as an entry in a Knowledge
Base (KB).
Entity linking is usually solved in a multi-step process: Named Entity Recognition
(NER), followed by Candidate Selection, and finally Disambiguation.
Entity Linking
1. Named Entity Recognition (NER)
Distinguish ordinary words from defined concepts, also known as
named entities. Often involves a Part-of-Speech (POS) tagger.
2. Candidate Selection
Selecting possible candidates from the target Knowledge Base (where
entities are defined).
3. Disambiguation
Deciding which candidate is the correct identity corresponding to the
mention of a Named Entity.
Entity Linking
1. Named Entity Recognition (NER)
“It is a blast to visit Adam once more.”
2. Candidate Selection
Adam -> Adam (Name), Adam (City) in Oman, Amsterdam
3. Disambiguation
Adam -> https://en.wikipedia.org/wiki/Amsterdam
Motivation: High-precision context-free entity linking
● Certain applications require high-precision linked entities
○ Interactive applications where humans review results
○ Machine learning: training predictive models may require high-precision annotated text (no overfitting)
● Context-free
○ Works with any type of input: text, tweets, search queries
○ But limited to unambiguous labels
The F1 score strikes a balance (harmonic mean) between precision and recall.
This is not necessarily the best optimization for the task at hand.
[Figure: precision, recall, and F1 score]
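As a small worked example of the harmonic mean, the sketch below compares a high-precision, moderate-recall linker with a more balanced one. The function name and the numeric values are illustrative assumptions, not results from the slides.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only: a high-precision linker with moderate recall.
high_precision = f1_score(0.95, 0.45)
# Illustrative values only: a balanced linker with lower precision.
balanced = f1_score(0.70, 0.70)

print(f"high precision: {high_precision:.3f}, balanced: {balanced:.3f}")
```

The balanced linker scores higher on F1 even though far more of its links are wrong, which is why F1 may be the wrong target when every emitted link has to be trustworthy.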
Motivation: Categories of links to Wikipedia
What labels are used to link to entities (as Wikipedia pages) on the web?
● Link by the most common label: “web browser” -> <Web_browser> (381’623 times)
● Link by context: “divided into three subgroups: East, West, and South” -> <East_Slavic_languages>
● Link by reference: “Wikipedia” -> <Angelina_Jolie> (16’333 times)
● Erroneous link: “Oregon” -> <University_of_Oregon> (incorrectly linked entity even when considering the context)
Motivation: Prior probability scores
● Most important feature when not considering context
● Conditional probability P(link|label)
● Problems:
○ Does not necessarily capture ambiguity: Adam -> Adam (Name), Adam (City) in Oman, Amsterdam
○ Does not take categories into account: Wikipedia -> Angelina_Jolie [16’333]
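The prior P(link|label) can be estimated by counting how often each label points to each link. The sketch below is a minimal illustration under assumptions: the function name, the entity identifiers, and the toy counts are hypothetical and are not taken from the authors' code or data.

```python
from collections import Counter, defaultdict


def prior_probabilities(anchors):
    """Estimate P(link | label) from an iterable of (label, link) pairs."""
    counts = defaultdict(Counter)
    for label, link in anchors:
        counts[label][link] += 1
    return {
        label: {link: n / sum(links.values()) for link, n in links.items()}
        for label, links in counts.items()
    }


# Hypothetical toy counts, for illustration only.
anchors = (
    [("Adam", "Adam_(name)")] * 5
    + [("Adam", "Adam,_Oman")] * 3
    + [("Adam", "Amsterdam")] * 2
    + [("web browser", "Web_browser")] * 10
)

priors = prior_probabilities(anchors)
for label, links in priors.items():
    print(label, links)
```

In this toy data the top candidate for "Adam" has a prior of only 0.5, while "web browser" always points to one entity. The score ranks candidates but says nothing about why a link was made, so a link by reference would be counted just like a link by the most common label.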
Method (Problem)
Problem formulation.
Given an arbitrary textual document $I_D$ as input, identify all named-entity substrings $\{l_1, \ldots, l_k\}$ and link them to their respective entities.
Effectively, our methods return as output a set of label-entity pairs
$O_D = \{(l_1, e_z), \ldots, (l_k, e_x)\}$.
Method (Different Overall Approach)
Common
Named entity recognition -> candidate selection -> disambiguation
Context-free
Extract surface forms (KB or annotated corpus) -> clean and catalog -> fast string matching
Surface form: a string representing an entity in a text.
Annotated corpus: e.g. Wikipedia articles, Common Crawl
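The last step of the context-free pipeline, string matching against a cleaned catalog, might look like the sketch below. It is not the authors' implementation: the greedy longest-match strategy, the lowercased catalog keys, the `max_len` parameter, and the toy catalog are all assumptions made for illustration.

```python
def link_entities(text, catalog, max_len=4):
    """Greedy longest-match lookup of token n-grams in a label -> entity catalog."""
    tokens = text.split()
    pairs = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).strip(".,;:!?\"'").lower()
            if candidate in catalog:
                pairs.append((candidate, catalog[candidate]))
                i += n
                break
        else:
            # No n-gram starting at this token is in the catalog.
            i += 1
    return pairs


# Hypothetical catalog containing only unambiguous labels (keys lowercased).
catalog = {
    "web browser": "Web_browser",
    "university of oregon": "University_of_Oregon",
}

print(link_entities("The University of Oregon released a new web browser.", catalog))
```

Because ambiguous labels such as "Adam" never enter the catalog, a sentence that mentions only such labels yields no links at all, which trades recall for precision.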
Method (Catalog)
DBpedia
DBpedia labels can serve as a catalog once ambiguous labels are removed.
Downside: the labels in DBpedia are rather sparse.
Wikipedia
The internal links of Wikipedia are a good source of surface forms with links to
entities (Wikipedia pages). Downside: the different categories of links introduce noise.
Method
Ratio method
Decides which surface forms are too ambiguous to be linked without context.
Percentile method
Removes the long tail and then readjusts weights to get better recall.
Evaluation
A curated ground truth based on Wikipedia articles (30 randomly sampled articles)
allows us to compare with the manual annotations in Wikipedia.
● Ratio method: low recall
● Ratio+Percentile 99: best
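Precision and recall against a ground truth of label-entity pairs can be computed as in the sketch below. It assumes exact matching of (label, entity) pairs and uses hypothetical toy data. It does not reproduce the matching scheme of the evaluation linked in these slides.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted (label, entity) pairs, exact match."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical ground truth and system output, for illustration only.
gold = {
    ("web browser", "Web_browser"),
    ("oregon", "Oregon"),
    ("adam", "Amsterdam"),
    ("east", "East_Slavic_languages"),
}
predicted = {
    ("web browser", "Web_browser"),
    ("oregon", "Oregon"),
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the linker emits only two links, both correct, so precision is perfect while half of the ground-truth links are missed.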
Evaluation (Discussion)
● Increasing the ratio introduces more ambiguous labels -> direct impact on precision
● The percentile method balances this effect by separating the ambiguity from the popularity of the entities
● In general, we observe that the Percentile-Ratio method with 99-Percentile and 10-Ratio strikes a good balance between high-precision results (>95%) and reasonable recall (45%, 1309 entities)
High-Precision, Context-Free Entity Linking
Exploiting Unambiguous Labels
Links
Ground truth: https://github.com/eXascaleInfolab/Wikipedia30
Methods: https://github.com/eXascaleInfolab/kilogram
Evaluation: http://w3id.org/gerbil/experiment?id=201604300040
