GOVERNMENT USERS
Conference
“Navigating the Human Terrain”
College Park, MD, May 20-21, 2008
Linguistic
Considerations of
Identity Resolution
David Murgatroyd
Software Architect
Basis Technology
2
Outline
 Introduction
 Linguistic Challenges
 Variation (Intentional & Unintentional)
 Composition
 Frequency
 Under-specification
 Multilinguality
 Integration Challenges
 Inputs & Outputs
 Properties
 Evaluation Challenges
 Corpora: Find or Build?
 Metrics: Adopt or Create?
 Conclusion
3
Introduction: An Exercise
Jim Killeen
Kileen, J. D.
Jaime Kilin
‫كلين‬ ‫جمس‬
 Is there a >50% chance these refer to the same
person? If…US Citizens; On a ferry to Spain;
In a documentary
4
What is Identity Resolution?
 Identity Resolution (aka Entity Resolution):
 determining if two or more given references refer to
the same entity.
 Different from name matching as it’s about
identity of entities not similarity of names
 See also:
 Murgatroyd, D. Some Linguistic Considerations of
Entity Resolution and Retrieval. In Proceedings of
LREC 2008 Workshop on Resources and Evaluation for
Identity Matching, Entity Resolution and Entity
Management.
5
What sorts of references?
 Non-linguistic reference examples:
 Numerical identifiers
— SSN
— Some portions of address (Street Number, Zip Code)
 Visual identifiers (e.g., pictures, symbols)
 Biometrics (e.g., DNA, iris, signature, voice)
 Linguistic reference examples:
 Nouns or pronouns in documents (e.g., “the CEO of Basis”)
 Names of associated/related entities
— Locations (e.g., Street or City Name)
— Organizations
— Individuals
 Name of entity <- we’re going to focus on this one
6
Let’s focus on names of people
 Common and familiar
 Often fairly identifying piece of personal
information
 Demonstrate typical challenges of resolution
with linguistic data
8
Variation (Intentional)
 Variation may be intentional
 References may draw on a large set of names:
— Formality (e.g., nicknames)
— Transparency (e.g., aliases)
— Location (e.g., toponym)
— Life status
 Vocation (e.g., titles)
 Marital status (e.g., marriage/divorce/widowhood)
 Parenthood (e.g., patronymic)
 Faith (e.g., christening, pilgrimage)
 Death (e.g., posthumous names)
— Dialect (e.g., adolescent girls preferring “Jenni” over “Jenny”)
— Style of text (e.g., “Sollermun” for “Solomon” in Huck Finn)
Jim Killeen
9
Variation (Unintentional)
 Variation may be unintentional, arising from:
 Typos
— E.g., “Killeen” vs. “Kileen”
 Guessing spelling based on pronunciation
— E.g., “Caliin”
 Ambiguities inherent in the encoding (e.g., Unicode):
— Characters with the same glyph
 E.g., Latin and Cyrillic small “i”
— Characters with similar glyphs
 E.g., Latin “K” and Greenlandic “ĸ”
— Characters with composed/combined forms
 E.g., ņ (n with cedilla) vs. ņ (n + combining cedilla)
Kileen, J. D.
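The encoding ambiguities above can be checked with a short sketch. It uses only Python's standard `unicodedata` module; the variable names and the choice of U+0456 as the Cyrillic look-alike of Latin "i" are illustrative assumptions, not part of the talk.

```python
import unicodedata

def describe(text):
    """Return the Unicode name of each code point in text."""
    return [unicodedata.name(ch, "UNKNOWN") for ch in text]

composed = "\u0146"    # n with cedilla as a single code point
combined = "n\u0327"   # n followed by a combining cedilla

# Canonical normalization (NFC) makes the two spellings compare equal.
same_after_nfc = (unicodedata.normalize("NFC", composed)
                  == unicodedata.normalize("NFC", combined))

latin_i = "i"          # U+0069
cyrillic_i = "\u0456"  # U+0456, drawn identically in most fonts

# Normalization does not merge look-alikes from different scripts.
still_different = (unicodedata.normalize("NFKC", latin_i)
                   != unicodedata.normalize("NFKC", cyrillic_i))

print(composed == combined)   # raw strings differ
print(same_after_nfc)
print(still_different)
print(describe(latin_i), describe(cyrillic_i))
```

Normalization handles the composed/combined case, but same-glyph and similar-glyph characters would need a separate confusables mapping, which this sketch does not attempt.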
10
Composition
 Names have differing orders:
 Given v. Surname: “Killeen, Jim” v. “Jim Killeen”
 Varies by culture
 Name references may be partial:
 “Jim” v. “Jim Killeen”
11
Under-specification
 Name components may be abbreviated
 Initials (e.g., “J. D.”)
 Abbreviations (e.g., “Jas.”)
 Name references may have incomplete…
 orthography (e.g., Semitic languages)
 segmentation (e.g., Asian languages)
 phonology (e.g., Ideographic languages)
Kileen, J. D.
‫كلين‬ ‫جمس‬
12
Frequency
 Any person can make up a name (an open class)
 A few are common, most are very uncommon
 Zipfian distribution
 Lesson:
 Valuable to know
common names
 Valuable to have a
strategy for unknown
names
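One possible way to act on both lessons is a name frequency table that reserves some probability for names it has never seen. The sketch below assumes add-one smoothing and a made-up sample of surnames; neither comes from the talk, and a real system would likely use a much larger list and a better estimator.

```python
from collections import Counter

def build_name_model(observed_names):
    """Return a function estimating how likely a name is to occur."""
    counts = Counter(name.casefold() for name in observed_names)
    total = sum(counts.values())
    vocab = len(counts)

    def probability(name):
        # Add-one smoothing, with one extra slot shared by all unseen names,
        # so an unknown name gets a small but non-zero estimate.
        return (counts[name.casefold()] + 1) / (total + vocab + 1)

    return probability

# Hypothetical sample: a few common names, one rare one.
sample = ["Smith"] * 5 + ["Jones"] * 3 + ["Killeen"]
probability = build_name_model(sample)

for name in ["Smith", "Killeen", "Kilin"]:
    print(name, round(probability(name), 3))
```

A resolver could then treat agreement on a rare name as stronger evidence than agreement on a common one, while still producing a usable estimate for names outside the table.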
13
Multilinguality
 Names may appear in many languages-of-use
 This leads to variation at many linguistic levels.
 Orthographic:
 transliteration confronts skew in:
—orthographic-to-phonetic mappings of source and
target languages-of-use
—sound systems between the languages
‫كلين‬ ‫جمس‬ <-> James Klein
14
Multilinguality (cont’d)
 Syntactic:
 different languages-of-use may imply different name
word order
 Semantic:
 name words which communicate meaning (e.g.,
titles) may vary (e.g., “Jr.” for “‫الصغر‬”, which
means “the younger”)
 Pragmatic:
 different languages-of-use may use different names
based on the audience (e.g., “Mr. Laden” vs. “‫المير‬”
which means “the prince”)
16
Inputs & Outputs
 Input options include:
 Pair-wise: simple integration, but no shared effort
 Set-based: harder integration, but able to optimize
 Output options include:
 Feature-based: with weights/tuning
 Probability-based:
—more principled combination
—NOTE: similarity is not probability
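To illustrate why probability-based outputs combine more cleanly than raw similarity scores, here is a minimal sketch in the spirit of the record-linkage weights of Fellegi and Sunter (1969). It assumes the pieces of evidence are conditionally independent, and the prior and likelihood ratios in the example are invented for illustration.

```python
import math

def posterior_match_probability(prior, likelihood_ratios):
    """Combine independent evidence in log-odds form.

    prior: P(same entity) before looking at any evidence.
    likelihood_ratios: each is P(evidence | same) / P(evidence | different).
    """
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical numbers: a 1% prior, a name agreement that is 20 times more
# likely for the same person, and a context agreement that is 5 times
# more likely.
p = posterior_match_probability(0.01, [20.0, 5.0])
print(round(p, 3))
```

A similarity score such as an edit-distance ratio cannot be multiplied into this formula directly; it would first need to be calibrated into a likelihood ratio.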
17
Integration Properties
 Certain properties help make efficient
implementations:
 Reflexivity:
—Resolve(a,a) is always true
—NOTE: does not imply Resolve(a,a’) where a~a’
 Commutativity:
—Resolve(a,b) <=> Resolve(b,a)
 Transitivity:
—Resolve(a,b) & Resolve(b,c) => Resolve(a,c)
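When Resolve has all three properties it is an equivalence relation, so pairwise decisions can be merged into clusters without testing every pair. A minimal sketch using a union-find structure follows; the class, the function names, and the example links are assumptions made for illustration, not the talk's implementation.

```python
class UnionFind:
    """Disjoint sets over a fixed collection of items."""

    def __init__(self, items):
        self.parent = {item: item for item in items}

    def find(self, item):
        # Walk to the root, halving the path as we go.
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def cluster(references, resolved_pairs):
    """Group references by the transitive closure of the resolved pairs."""
    sets = UnionFind(references)
    for a, b in resolved_pairs:
        sets.union(a, b)  # order of a and b does not matter (commutativity)
    groups = {}
    for ref in references:
        groups.setdefault(sets.find(ref), set()).add(ref)
    return sorted(groups.values(), key=sorted)

references = ["Jim Killeen", "Kileen, J. D.", "Jaime Kilin", "Jim"]
# Hypothetical system output: two links that share a reference.
links = [("Jim Killeen", "Kileen, J. D."), ("Kileen, J. D.", "Jim")]
print(cluster(references, links))
```

Here "Jim Killeen" and "Jim" land in the same cluster although no link was drawn between them, which is the efficiency gain and also the risk of assuming transitivity.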
19
Corpora: Find or Build?
 Requirements:
 Annotated for ground truth
 Represent linguistic challenges
 Scalable/practical
 Options
 Adapt public “database” corpora:
— Wikipedia:
 Annotated: yes
 Representative: somewhat
 Scalable: yes
— Citation DBs:
 Annotated: no
 Representative: somewhat
 Scalable: yes
20
Corpora: Find or Build? (cont’d)
 Adapt public “document” corpora:
— Co-reference documents:
 Annotated: yes
 Representative: less so, as often limited to a single document or language-of-use
 Scalable: yes
 Create corpora by hand:
— From scratch: “parrot sessions” (auditory or visual)
 Annotated: yes
 Representative: largely
 Scalable: no
— From un-annotated databases:
 Annotated: no
 Representative: yes
 Scalable/practical: no; databases may be private
— Synthesize from generative model
 Annotated: yes
 Representative: no, tied to generating model
 Scalable: yes
21
Metrics
 Back to our initial example
[Figure: the example name references (Jim Killeen; Kileen, J. D.; Jaime Kilin; ‫كلين‬ ‫جمس‬; Jim; J. Diw Killeen) linked in three ways: Reference, System A, System B]
22
Metrics: Adopt or Create?
 How to quantify the quality of the system’s resolutions
vs. the reference?
 Goals:
 Discriminative: separates good v. bad systems for users’ needs
 Interpretable: number aligns with intuition
 Considerations:
 Assume transitive closure (TC) of output?
 Apply weights to try to be more discriminative?
 Common concepts:
 Precision: % of stuff in answer that’s right
 Recall: % of right stuff in answer
 F-Score: Harmonic mean of these = 2*P*R/(P+R)
23
Candidate Metrics
 Pair-wise % correct: over all N*(N-1)/2 node pairs
 Pair-wise P&R: based on links drawn
 Edit-distance: # of links to add/subtract to correct
 Metrics used in document co-reference resolution:
 MUC-6: entity-based P&R on missing links from graph
 B-CUBED: average per-reference P&R of links
 CEAF (Constrained Entity-Alignment F): entities aligned
using some similarity measure; P&R are % of possible
similarity level achieved
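Two of these candidates can be sketched in a few lines: pair-wise P&R over the links implied by each cluster (so transitive closure is assumed), and B-CUBED as the per-reference average. The function names and the toy clusterings are assumptions for illustration; MUC-6 and CEAF are not covered here.

```python
from itertools import combinations

def implied_links(clusters):
    """All unordered pairs that share a cluster (transitive closure)."""
    return {frozenset(pair)
            for c in clusters
            for pair in combinations(sorted(c), 2)}

def f_score(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def pairwise_prf(reference, system):
    ref, out = implied_links(reference), implied_links(system)
    correct = len(ref & out)
    p = correct / len(out) if out else 1.0
    r = correct / len(ref) if ref else 1.0
    return p, r, f_score(p, r)

def b_cubed_prf(reference, system):
    ref_of = {m: c for c in reference for m in c}
    out_of = {m: c for c in system for m in c}
    items = list(ref_of)
    # For each reference, compare its system cluster with its true cluster.
    p = sum(len(ref_of[m] & out_of[m]) / len(out_of[m]) for m in items) / len(items)
    r = sum(len(ref_of[m] & out_of[m]) / len(ref_of[m]) for m in items) / len(items)
    return p, r, f_score(p, r)

# Toy data: the reference groups a, b, c together; the system splits off c.
reference = [{"a", "b", "c"}, {"d"}]
system = [{"a", "b"}, {"c", "d"}]
print(pairwise_prf(reference, system))
print(b_cubed_prf(reference, system))
```

On this toy case the pair-wise F is lower than the B-CUBED F, because B-CUBED gives partial credit to every reference while the pair-wise measure counts only whole links.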
24
Comparing Metrics
[Figure: the same name references, linked by Reference, System A, and System B]

Scores for each system against the reference (TC = transitive closure assumed):

System  % Correct       Pairwise F      MUC-6  B-CUBED  CEAF   Edit-dist
        No TC   TC      No TC   TC      (TC)   (TC)     (TC)   No TC   TC
A       82      79      62      61      80     78       90     1       4
B       79      82      73      71      89     85       81     3       6
My preference
25
Conclusion
 Identity resolution systems face linguistic
challenges
 They need to be carefully integrated to meet
these challenges
 Evaluation corpora should reflect these
challenges
 Evaluation metrics should align with qualitative
judgements
26
Bibliography
Bagga, A., Baldwin, B. (1998). Algorithms for scoring coreference chains. In
Proceedings of the First International Conference on Language Resources
and Evaluation Workshop on Linguistic Coreference.
Fellegi, I. P., Sunter, A. B. (1969). A theory for record linkage. Journal of the
American Statistical Association, Vol. 64, No. 328, pp. 1183--1210.
Luo, X. (2005). On coreference resolution performance metrics. In Proc. of
HLT-EMNLP, pp. 25--32.
Menestrina, D., Benjelloun, O., Garcia-Molina, H. (2006). Generic entity
resolution with data confidences. In First International VLDB Workshop on
Clean Databases. Seoul, Korea.
Murgatroyd, D. (2008). Some Linguistic Considerations of Entity Resolution and
Retrieval. In Proceedings of LREC 2008 Workshop on Resources and
Evaluation for Identity Matching, Entity Resolution and Entity
Management.
Spock Team (2008). The Spock Challenge. http://challenge.spock.com/
(Retrieved February 5.)
Vilain, M., Burger, J., Aberdeen, J., Connolly, D., Hirschman, L. (1995). A
model-theoretic coreference scoring scheme. In Proceedings of the 6th
Message Understanding Conference (MUC6). Morgan Kaufmann, pp. 45--52.
27
Questions?
More information:
http://www.basistech.com
