The document summarizes previous work on entity linking and knowledge base population (KBP). It covers the two TAC-KBP tasks: entity linking, which grounds entity mentions in documents to entries in a knowledge base, and slot filling, which learns attributes about target entities. It reports results from the TAC-KBP 2010 evaluation, breaking entity linking accuracy down by entity type and genre: GPE entities proved particularly difficult, and name-similarity features and the handling of NIL queries had a clear impact on performance.
2. Overview of previous work
TAC-KBP 2010 - Combining Similarities and Regression Classifiers for Entity Linking
1. Task definition: KBP and EL
2. System description
3. Results
4. Conclusions
10. Knowledge acquisition
List candidates for the Greek elections in June.
What party does Tsipras represent?
How old is he?
What does Syriza mean?
How old is Samaras?
13. TAC-KBP 2010 - Combining Similarities and Regression Classifiers for Entity Linking
Knowledge Base Population
César de Pablo, Juan Perea, Paloma Martínez
15. Knowledge Base Population
Knowledge Base (KB), built from a Wikipedia dump (2008):
● title, name, type, id
● wiki text
● several facts as [name, value] pairs
Source document collection:
● 1.3 million English newswire documents, published between 1994 and 2008
● 488,240 web pages
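To make the KB structure concrete, here is a minimal sketch of one entry as a Python dataclass. The fields follow the slide above; the exact names and types are illustrative assumptions, not the official TAC-KBP schema.

from dataclasses import dataclass, field

@dataclass
class KBEntry:
    """One KB entry derived from a Wikipedia article (illustrative layout)."""
    entity_id: str                              # e.g. "E0700143"
    title: str                                  # Wikipedia article title
    name: str                                   # canonical entity name
    entity_type: str                            # PER, ORG, GPE, ...
    wiki_text: str                              # plain text of the article
    facts: list = field(default_factory=list)   # (name, value) pairs

entry = KBEntry(
    entity_id="E0700143",
    title="Reserve Bank of India",
    name="Reserve Bank of India",
    entity_type="ORG",
    wiki_text="The Reserve Bank of India is the central bank of India...",
    facts=[("headquarters", "Mumbai"), ("founded", "1935")],
)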
17. IE = KBP? QA = KBP?
Like classic IE / QA:
● Accurate extraction of facts – not annotation
● Learn facts from corpus – repetition is not important but helps confidence
● Asserting wrong information is bad
● Scalability
● Provenance
What KBP adds:
● Slots are fixed but targets change
● Leverage knowledge from the KB
● Global resolution – ground information to the KB
● Avoid contradiction
● Detect novel info
19. Tasks at TAC-KBP
● Task 1: Slot Filling – learning attributes about target entities
● Task 2: Entity Linking – grounding entity mentions in documents to KB entries
The rest of this overview focuses on Task 2, Entity Linking.
22. Entity Linking: Example
For a name string and a document, determine which entity in a KB, if any, is being referred to by the name string.

<query id="EL006455">
  <name>Reserve Bank</name>
  <docid>eng-NG-31-100316-11150589</docid>
  <entity>E0700143</entity>
</query>
<query id="EL06472">
  <name>Reserve Bank</name>
  <docid>eng-NG-31-142262-10040510</docid>
  <entity>E0421510</entity>
</query>

The same name string resolves to different entries depending on the document: E0700143 is the Reserve Bank of India, E0421510 is the Reserve Bank of Australia. A query whose mention has no matching KB entry is answered NIL.
24. Entity Linking: Challenges
Focus on confusable entities:
● Ambiguous names: Reserve Bank, Alan Jackson, Fonda
● Multiple name variants: Saddam Hussain, Saddam Hussein
● Acronym expansion: CDC, AZ
● Variety of cases: Centre for Disease Control, European Centre for Disease Control, AZ, Arizona, Astra Zeneca
● Pilot task – entity linking without text support
● Identify missing entities – then cluster (2011)
29. Entity Linking: Evaluation
Name mention – document pairs
● Accuracy (micro) = num correct / num queries
● Accuracy (macro) = grouped by entities (2009)

queries  NIL   set         genre       % NIL
3904     2229  eval 2009   news        0.571
1500     426   train 2010  web         0.284
2250     1230  eval 2010   news + web  0.547
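The two accuracy variants are easy to state in code. A minimal sketch, assuming gold and predicted answers are dicts from query id to KB entity id or "NIL" (the query ids below are illustrative):

from collections import defaultdict

def micro_accuracy(gold, predicted):
    """num correct / num queries: every query counts equally."""
    return sum(predicted.get(q) == e for q, e in gold.items()) / len(gold)

def macro_accuracy(gold, predicted):
    """Average of per-entity accuracies: frequent entities don't dominate."""
    by_entity = defaultdict(list)
    for q, e in gold.items():
        by_entity[e].append(predicted.get(q) == e)
    return sum(sum(v) / len(v) for v in by_entity.values()) / len(by_entity)

gold = {"EL006455": "E0700143", "EL06472": "E0421510", "EL09999": "NIL"}
predicted = {"EL006455": "E0700143", "EL06472": "NIL", "EL09999": "NIL"}
print(micro_accuracy(gold, predicted))  # 2/3, two of three queries correct
print(macro_accuracy(gold, predicted))  # (1 + 0 + 1) / 3, averaged per entity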
30. uc3m EL system
● Supervised architecture
● Use similarities between objects – EL queries and KB entries, or parts of them – to avoid a wide feature vector
● Three stages:
1) Candidate Entity Retrieval
2) Candidate Filtering
3) Validation (NIL classification)
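A minimal end-to-end sketch of the three stages, with toy stand-ins for each component (the real system used Lucene retrieval, model trees, and logistic regression; the scoring functions and NIL threshold below are assumptions for illustration):

from difflib import SequenceMatcher

def retrieve_candidates(name, kb):
    """1) Candidate retrieval: toy stand-in for the Lucene index lookups,
    keeping KB entries that share at least one token with the query name."""
    tokens = set(name.lower().split())
    return [e for e in kb if tokens & set(e["name"].lower().split())]

def match_score(name, document, entry):
    """2) Candidate filtering: toy stand-in for the regression model,
    combining a name similarity and a document/wiki-text similarity."""
    name_sim = SequenceMatcher(None, name.lower(), entry["name"].lower()).ratio()
    doc = set(document.lower().split())
    wiki = set(entry["wiki_text"].lower().split())
    context_sim = len(doc & wiki) / max(len(doc | wiki), 1)
    return name_sim + context_sim

def link_entity(name, document, kb, nil_threshold=0.8):
    candidates = retrieve_candidates(name, kb)
    if not candidates:
        return "NIL"
    best = max(candidates, key=lambda e: match_score(name, document, e))
    # 3) Validation: answer NIL when even the best candidate scores low
    return best["id"] if match_score(name, document, best) >= nil_threshold else "NIL"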
32. 1) Candidate Retrieval
● Each KB article is indexed with Lucene, using several indexes and fields:
  ● ALIAS – names plus aliases extracted from wiki slots: alias, abbreviation, website, etc.
  ● NER – named entities extracted from text: <id, ne, text>
  ● KB – entity slots: <id, [(slot_name, slot_value)]>
  ● WIKIPEDIA – anchorList, category, redirect, outlinks, inlinks
● Each EL query is transformed into several Lucene queries; the result is a [KB name, score] list
33. 1) Candidate Retrieval
● EL query: [Michael Jordan, eng-NG-31-100316-11150589]
● Lucene queries:
  ● name=Michael AND name=Jordan
  ● alias=Michael AND alias=Jordan
  ● abbr=Michael AND abbr=Jordan
● For each query, a ranked candidate list:
  ● [EL0989789, Michael Jordan, 25.00]
  ● [EL6565356, Michael B. Jordan, 25.00]
  ● [EL6565356, Michael I. Jordan, 25.00]
  ● [EL6565356, Michael-Hakim Jordan, 25.00]
  ● [EL6565356, Jordan, 20.00]
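The fan-out from one EL query to several fielded queries is mechanical. A small sketch that reproduces the queries above (the field list comes from the slides; the helper name and the Lucene-style query-string syntax are assumptions):

def expand_el_query(name, fields=("name", "alias", "abbr")):
    """Turn one EL query into one AND query per indexed field."""
    tokens = name.split()
    return [" AND ".join(f"{field}={tok}" for tok in tokens) for field in fields]

print(expand_el_query("Michael Jordan"))
# ['name=Michael AND name=Jordan',
#  'alias=Michael AND alias=Jordan',
#  'abbr=Michael AND abbr=Jordan']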
34. 2) Candidate Filtering
● Classification problem: decide whether (EL query + document text, KB name + wiki text) is a good match
● In practice, rank candidates by prediction confidence
● Use similarity scores as features – normalized and unnormalized
● Use a cost-sensitive classifier
● Best results: model trees with linear-regression leaves
35. Features
● Index-based scores: sim(EL query, KB entry), taken directly from the initial retrieval
● Context-similarity scores: sim(document, wiki text) or sim(document, slots)
● Name-similarity scores: sim(query name, KB entry name) – more expensive: equal, QcontainsE, EcontainsQ, Jaro, Jaro-Winkler, SLIM (based on SecondString)
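A minimal sketch of the name-similarity feature group. The boolean features follow the slide; the "ratio" value is a difflib stand-in for the string-distance scores (Jaro, Jaro-Winkler, SLIM), which the system computed with the SecondString package:

from difflib import SequenceMatcher

def name_features(query_name, entry_name):
    """Name-similarity features between a query string and a KB entry name."""
    q, e = query_name.lower(), entry_name.lower()
    return {
        "equal": float(q == e),
        "QcontainsE": float(e in q),  # query string contains the entry name
        "EcontainsQ": float(q in e),  # entry name contains the query string
        # stand-in for Jaro / Jaro-Winkler / SLIM
        "ratio": SequenceMatcher(None, q, e).ratio(),
    }

print(name_features("Reserve Bank", "Reserve Bank of India"))
# EcontainsQ fires and ratio is high (about 0.73), even though equal is 0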
36. 3) Validation
● Classification: is the selected candidate good enough, or is the answer NIL?
● Positive examples – queries with a correct candidate
● Negative examples – top-ranked entities for queries that have no link in the KB
● Balanced dataset
● Best classifier: Logistic Regression
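A sketch of this validation step with scikit-learn's LogisticRegression (the slide names the classifier; the three-feature layout and the toy training data are assumptions for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per (query, top-ranked candidate); columns could be e.g.
# retrieval score, context similarity, name similarity (toy numbers).
X_train = np.array([
    [0.9, 0.8, 1.0],  # strong match -> keep the link
    [0.8, 0.7, 0.9],
    [0.3, 0.1, 0.4],  # weak match  -> NIL
    [0.2, 0.2, 0.3],
])
y_train = np.array([1, 1, 0, 0])  # 1 = correct link, 0 = NIL

validator = LogisticRegression().fit(X_train, y_train)

def validate(features, threshold=0.5):
    """Keep the top candidate only if the link probability clears the threshold."""
    p_link = validator.predict_proba(np.array([features]))[0, 1]
    return "LINK" if p_link >= threshold else "NIL"

print(validate([0.85, 0.75, 0.95]))  # likely LINK
print(validate([0.10, 0.15, 0.20]))  # likely NIL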
37. EL results – main

Accuracy by entity type and genre:

N     type   news  web   news+web  Highest  Median
750   ORG    0.69  0.67  0.67      0.85     0.68
749   GPE    0.52  0.53  0.51      0.80     0.60
751   PER    0.76  0.82  0.85      0.96     0.85
2250  ALL    0.67  0.65  0.68      0.87     0.69

Accuracy by NIL status:

N     type   news  web   news+web
1020  noNIL  0.51  0.59  0.49
1230  NIL    0.81  0.70  0.82

● Influence of domain?
● GPE entities are particularly difficult
42. EL results – pilot w/o text

N     type   news (main)  news  +n-sim NIL  +n-sim all
2250  ALL    0.67         0.58  0.66        0.70
1020  noNIL  0.51         0.35  0.40        0.47
1230  NIL    0.81         0.77  0.88        0.88

● Including name similarity scores helped
44. EL systems comparison
● Prior on link probability / popularity (Stanford-UBC 2009, LCC 2010, Microsoft 2011)
● Learning-to-rank algorithms: ListNet (CUNY 2011)
● Expand queries: acronym expansion / coreference (NUS 2011)
● Unsupervised system – entity co-occurrence + PageRank (WebTLab 2010)
● Inductive EL – first cluster, then link (LCC 2011)
● Collective entity linking (Microsoft 2011)
45. Conclusions
● Supervised EL system
● Influence of training size – beware of the training data distribution
● Consider name similarities, even for reranking
● Improve initial candidate retrieval
● Perform collective entity linking
● Efficiency?
46. Related tasks
● Cluster Documents Mentioning Entities
● Entity coreference – document and cross-document
● Add missing links between Wikipedia pages
● Link entities to matching Wikipedia articles