Identity resolution systems indicate whether two references really describe the same person. Identity retrieval systems help you find the individual you’re after. These systems appear anywhere from analysts’ desks to border crossings. But how can you tell whether a system is any good before it’s deployed? You need to understand the problems it should tackle and how to measure how well it’s doing.
This talk considers metrics and data for evaluating identity resolution and retrieval systems. It also explores the linguistic challenges these systems face.
Linguistic Considerations of Identity Resolution (2008)
1.
GOVERNMENT USERS Conference
“Navigating the Human Terrain”
College Park, MD, May 20-21, 2008
Linguistic Considerations of Identity Resolution
David Murgatroyd
Software Architect
Basis Technology
3.
Introduction: An Exercise
Jim Killeen
Kileen, J. D.
Jaime Kilin
كلين جمس
Is there a >50% chance these refer to the same person? If… US citizens; on a ferry to Spain; in a documentary
4.
What is Identity Resolution?
Identity Resolution (aka Entity Resolution): determining whether two or more given references refer to the same entity.
Different from name matching: it’s about the identity of entities, not the similarity of names.
See also:
Murgatroyd, D. (2008). Some Linguistic Considerations of Entity Resolution and Retrieval. In Proceedings of the LREC 2008 Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management.
5.
What sorts of references?
Non-linguistic reference examples:
Numerical identifiers
— SSN
— Some portions of an address (street number, ZIP code)
Visual identifiers (e.g., pictures, symbols)
Biometrics (e.g., DNA, iris, signature, voice)
Linguistic reference examples:
Nouns or pronouns in documents (e.g., “the CEO of Basis”)
Names of associated/related entities
— Locations (e.g., Street or City Name)
— Organizations
— Individuals
Name of the entity <- we’re going to focus on this one
6.
Let’s focus on names of people
Common and familiar
Often a fairly identifying piece of personal information
Demonstrate the typical challenges of resolution with linguistic data
8.
Variation (Intentional)
Variation may be intentional
References may draw on a large set of names:
— Formality (e.g., nicknames)
— Transparency (e.g., aliases)
— Location (e.g., toponym)
— Life status
Vocation (e.g., titles)
Marital status (e.g., marriage/divorce/widowhood)
Parenthood (e.g., patronymic)
Faith (e.g., christening, pilgrimage)
Death (e.g., posthumous names)
— Dialect (e.g., adolescent girls preferring “Jenni” over “Jenny”)
— Style of text (e.g., “Sollermun” for “Solomon” in Huck Finn)
Jim Killeen
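In practice, intentional variation is handled with curated variant resources. Below is a minimal sketch, assuming a hypothetical toy nickname table; a real system would draw on a large multilingual name resource.

```python
# Expanding a reference against known intentional variants.
# NICKNAMES is a tiny made-up sample, not a real name resource.
NICKNAMES = {
    "james": {"jim", "jimmy", "jaime", "jas."},
    "jennifer": {"jenny", "jenni", "jen"},
}

def variant_forms(given_name: str) -> set[str]:
    """Return the name plus any known formality/alias variants."""
    key = given_name.lower()
    return {key} | NICKNAMES.get(key, set())

print(variant_forms("James"))  # includes 'jim', 'jaime', 'jas.'
```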
9.
Variation (Unintentional)
Variation may be unintentional, arising from:
Typos
— E.g., “Killeen” vs. “Kileen”
Guessing spelling based on pronunciation
— E.g., “Caliin”
Ambiguities inherent in the encoding (e.g., Unicode):
— Characters with the same glyph
E.g., Latin and Cyrillic small “i”
— Characters with similar glyphs
E.g., Latin “K” and Greenlandic “ĸ”
— Characters with composed/combined forms
E.g., ņ (n with cedilla) vs. ņ (n + combining cedilla); see the sketch below
Kileen, J. D.
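Both kinds of unintentional variation above are mechanically checkable. A minimal sketch using only the standard library: a plain Levenshtein edit distance for typos, and Unicode NFC normalization for composed/combining forms.

```python
import unicodedata

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("Killeen", "Kileen"))   # 1: a single dropped letter

composed = "\u0146"     # ņ as one precomposed code point
combining = "n\u0327"   # n followed by a combining cedilla
print(composed == combining)                        # False: different code points
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", combining))      # True after normalization
```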
10.
Composition
Names have differing orders:
Given v. Surname: “Killeen, Jim” v. “Jim Killeen”
Varies by culture
Name references may be partial:
“Jim” v. “Jim Killeen”
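A minimal sketch of order-insensitive, partial-tolerant comparison, assuming naive whitespace/comma tokenization; real systems need culture-aware name parsing.

```python
def name_tokens(name: str) -> frozenset[str]:
    """Naive tokenization: split on whitespace, drop commas, lowercase."""
    return frozenset(t.strip(",").lower() for t in name.split())

# Order no longer matters once names are compared as token sets.
print(name_tokens("Killeen, Jim") == name_tokens("Jim Killeen"))  # True

# A partial reference shows up as a subset relation.
print(name_tokens("Jim") <= name_tokens("Jim Killeen"))           # True
```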
11.
Under-specification
Name components may be abbreviated
Initials (e.g., “J. D.”)
Abbreviations (e.g., “Jas.”)
Name references may have incomplete…
orthography (e.g., Semitic languages)
segmentation (e.g., Asian languages)
phonology (e.g., Ideographic languages)
Kileen, J. D.
كلين جمس
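One way to handle abbreviated components is a subsequence test, since initials and traditional abbreviations (“J.”, “Jas.”) usually preserve letter order. A rough heuristic sketch, not a production matcher:

```python
def is_subsequence(short: str, long: str) -> bool:
    """True if the characters of `short` appear in `long`, in order."""
    it = iter(long)
    return all(c in it for c in short)

def component_matches(abbrev: str, full: str) -> bool:
    return is_subsequence(abbrev.rstrip(".").lower(), full.lower())

print(component_matches("J.", "James"))     # True: initial
print(component_matches("Jas.", "James"))   # True: abbreviation
print(component_matches("Jan.", "James"))   # False: 'n' never follows 'a'
```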
12.
Frequency
Any person can make up a name (an open class)
A few are common, most are very uncommon
Zipfian distribution
Lesson:
Valuable to know common names
Valuable to have a strategy for unknown names
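This lesson translates naturally into IDF-style evidence weighting: a match on a rare name says more than a match on a common one, and unseen names default to a very low count. A minimal sketch with made-up counts; real systems would use census or corpus frequencies.

```python
import math

NAME_COUNTS = {"james": 5_000_000, "killeen": 12_000}  # illustrative only
TOTAL = 300_000_000
UNSEEN_COUNT = 1  # strategy for unknown names: assume they are very rare

def match_weight(name: str) -> float:
    """IDF-style weight: rarer names carry more identity evidence."""
    count = NAME_COUNTS.get(name.lower(), UNSEEN_COUNT)
    return math.log(TOTAL / count)

print(f"{match_weight('James'):.1f}")     # ~4.1: common name, weak evidence
print(f"{match_weight('Killeen'):.1f}")   # ~10.1: rare name, strong evidence
```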
13.
Multilinguality
Names may appear in many languages-of-use
This leads to variation at many linguistic levels.
Orthographic:
transliteration confronts skew in:
— orthographic-to-phonetic mappings of the source and target languages-of-use
— sound systems between the languages
كلين جمس <-> James Klein
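The skew means one source string legitimately yields many target spellings. A minimal sketch with a tiny, purely illustrative character mapping, not a real romanization standard:

```python
from itertools import product

# One Arabic letter can map to several Latin spellings; this fragment
# is invented for illustration.
ARABIC_TO_LATIN = {
    "ك": ["k", "c"],
    "ل": ["l"],
    "ي": ["i", "ei", "ee"],
    "ن": ["n"],
}

def latin_candidates(arabic: str) -> set[str]:
    options = [ARABIC_TO_LATIN.get(ch, [ch]) for ch in arabic]
    return {"".join(p) for p in product(*options)}

print(latin_candidates("كلين"))
# six candidates, including 'klein' and 'kleen'
```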
14.
Multilinguality (cont’d)
Syntactic:
different languages-of-use may imply different name word order
Semantic:
name words which communicate meaning (e.g., titles) may vary (e.g., “Jr.” for “الأصغر”, which means “the younger”)
Pragmatic:
different languages-of-use may use different names based on the audience (e.g., “Mr. Laden” vs. “الأمير”, which means “the prince”)
16.
Inputs & Outputs
Input options include:
Pair-wise: simple integration, but no shared effort
Set-based: harder integration, but able to optimize
Output options include:
Feature-based: with weights/tuning
Probability-based:
—more principled combination
—NOTE: similarity is not probability
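A minimal sketch of the two input contracts as Python protocols; the names and the Record type are hypothetical, invented for illustration.

```python
from typing import Protocol

Record = dict[str, str]  # hypothetical stand-in for a reference

class PairwiseResolver(Protocol):
    def resolve(self, a: Record, b: Record) -> float:
        """Score one pair: simple to integrate, but no shared effort."""

class SetResolver(Protocol):
    def resolve_all(self, records: list[Record]) -> list[set[int]]:
        """Partition record indices into entities: harder to
        integrate, but able to optimize across the whole set."""

# Output caveat from the slide: a similarity score in [0, 1] is not a
# probability; treating it as one requires calibration on labeled pairs.
```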
17.
Integration Properties
Certain properties help make implementations efficient:
Reflexivity:
—Resolve(a,a) is always true
—NOTE: does not imply Resolve(a,a’) where a~a’
Commutativity:
—Resolve(a,b) <=> Resolve(b,a)
Transitivity:
—Resolve(a,b) & Resolve(b,c) => Resolve(a,c)
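These three properties are exactly what lets a union-find structure turn pairwise decisions into entity clusters efficiently. A minimal sketch over integer record ids:

```python
def cluster(n: int, resolved_pairs: list[tuple[int, int]]) -> list[int]:
    """Group n records given the pairs where Resolve(a, b) was true."""
    parent = list(range(n))        # reflexivity: every record resolves to itself

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in resolved_pairs:    # commutativity: (a, b) same as (b, a)
        parent[find(a)] = find(b)  # transitivity: merging propagates

    return [find(x) for x in range(n)]

# Resolve fired only on (0, 1) and (1, 2); transitivity links 0 and 2.
print(cluster(4, [(0, 1), (1, 2)]))   # [2, 2, 2, 3]
```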
19.
Corpora: Find or Build?
Requirements:
Annotated for ground truth
Represent linguistic challenges
Scalable/practical
Options:
Adapt public “database” corpora:
— Wikipedia:
Annotated: yes
Representative: somewhat
Scalable: yes
— Citation DBs:
Annotated: no
Representative: somewhat
Scalable: yes
20.
Corpora: Find or Build? (cont’d)
Adapt public “document” corpora:
— Co-reference documents:
Annotated: yes
Representative: less so, as often a single doc/language-of-use
Scalable: yes
Create corpora by hand:
— From scratch: “parrot sessions” (auditory or visual)
Annotated: yes
Representative: largely
Scalable: no
— From un-annotated databases:
Annotated: no
Representative: yes
Scalable/practical: no; databases may be private
— Synthesize from generative model
Annotated: yes
Representative: no, tied to generating model
Scalable: yes
21.
Metrics
Back to our initial example
[Figure: the exercise references (Jim Killeen; Kileen, J. D.; Jaime Kilin; كلين جمس) and partial variants such as “Jim”, grouped three ways: by the ground-truth reference, by System A, and by System B.]
22.
Metrics: Adopt or Create?
How to quantify the quality of the system’s resolutions vs. the reference?
Goals:
Discriminative: separates good v. bad systems for users’ needs
Interpretable: number aligns with intuition
Considerations:
Assume transitive closure (TC) of output?
Apply weights to try to be more discriminative?
Common concepts:
Precision: % of stuff in answer that’s right
Recall: % of right stuff in answer
F-Score: Harmonic mean of these = 2*P*R/(P+R)
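A minimal sketch of these three concepts computed pairwise over clusterings, using toy data drawn from the running example:

```python
from itertools import combinations

def links(clusters: list[set[str]]) -> set[frozenset[str]]:
    """All within-cluster reference pairs ('links drawn')."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

def pairwise_prf(system, reference):
    sys_links, ref_links = links(system), links(reference)
    hit = len(sys_links & ref_links)
    p = hit / len(sys_links) if sys_links else 1.0
    r = hit / len(ref_links) if ref_links else 1.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

reference = [{"Jim Killeen", "Kileen, J. D.", "Jaime Kilin"}]
system = [{"Jim Killeen", "Kileen, J. D."}, {"Jaime Kilin"}]
print(pairwise_prf(system, reference))   # (1.0, 0.333..., 0.5)
```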
23.
Candidate Metrics
Pair-wise % correct: over all N*(N-1)/2 node pairs
Pair-wise P&R: based on links drawn
Edit-distance: # of links to add/remove to reach the correct graph
Metrics used in document co-reference resolution:
MUC-6: entity-based P&R on missing links from graph
B-CUBED: average per-reference P&R of links
CEAF (Constrained Entity-Alignment F): entities aligned using some similarity measure; P&R are the % of possible similarity achieved
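Of these, B-CUBED is compact enough to sketch: per-reference precision and recall come from comparing each reference’s system cluster with its ground-truth cluster, then averaging. A minimal implementation on toy data, assuming the system clusters cover every reference:

```python
def b_cubed(system: list[set[str]], reference: list[set[str]]):
    """Average per-reference P and R (Bagga & Baldwin, 1998)."""
    sys_of = {m: c for c in system for m in c}   # reference -> system cluster
    ref_of = {m: c for c in reference for m in c}
    mentions = list(ref_of)
    p = sum(len(sys_of[m] & ref_of[m]) / len(sys_of[m]) for m in mentions)
    r = sum(len(sys_of[m] & ref_of[m]) / len(ref_of[m]) for m in mentions)
    return p / len(mentions), r / len(mentions)

reference = [{"a", "b", "c"}, {"d"}]
system = [{"a", "b"}, {"c", "d"}]
print(b_cubed(system, reference))   # (0.75, 0.666...)
```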
24.
Comparing Metrics
[Figure and table: the same groupings as on the Metrics slide, with Systems A and B scored under each candidate metric: % Correct and Pairwise F (each with and without transitive closure), MUC-6 (TC), B-CUBED (TC), CEAF (TC), and edit distance. One metric is flagged “My preference”.]
25.
Conclusion
Identity resolution systems face linguistic challenges
They need to be carefully integrated to meet these challenges
Evaluation corpora should reflect these challenges
Evaluation metrics should align with qualitative judgements
26.
Bibliography
Bagga, A., Baldwin, B. (1998). Algorithms for scoring coreference chains. In
Proceedings of the First International Conference on Language Resources
and Evaluation Workshop on Linguistic Coreference.
Fellegi, I. P., Sunter, A. B. (1969). A theory for record linkage. Journal of the
American Statistical Association, Vol. 64, No. 328, pp. 1183--1210.
Luo, X. (2005). On coreference resolution performance metrics. In Proc. of
HLT-EMNLP, pp 25--32.
Menestrina, D., Benjelloun, O., Garcia-Molina, H. (2006). Generic entity
resolution with data confidences. In First International VLDB Workshop on
Clean Databases. Seoul, Korea.
Murgatroyd, D. (2008). Some Linguistic Considerations of Entity Resolution and
Retrieval. In Proceedings of LREC 2008 Workshop on Resources and
Evaluation for Identity Matching, Entity Resolution and Entity
Management.
Spock Team (2008). The Spock Challenge. http://challenge.spock.com/
(Retrieved February 5, 2008.)
Vilain, M., Burger, J., Aberdeen, J., Connolly, D., Hirschman, L. (1995). A
model-theoretic coreference scoring scheme. In Proceedings of the 6th
Message Understanding Conference (MUC6). Morgan Kaufmann, pp. 45--52.