NAMES
Gerrit Bloothooft & David Onland, UiL-OTS Utrecht
Martin Reynaert, TiCC / Tilburg University
Katrien Depuydt & Tanneke Schoonheim, INT Leiden
1
aim
standardization of historical person names for
richer search results
OCR post-processing
nominal record linkage
onomastic research
resulting NAMES corpus
in RDF format for Linked Open Data
& lexicon service
2
issues (1)
large edit distance
Grietje – Margaretha
Acquoij – Akkooi
partial solution:
(1) semi-phonetic transcription
GRYTJE – MARGARETA
AKOY – AKOY
(2) use ‘proven’ variant pairs
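To make the edit-distance problem concrete, here is a minimal sketch that computes the plain Levenshtein distance for the name pairs on this slide, first in their written spelling and then in the semi-phonetic forms shown above. The transcription rules themselves are not given in the slides, so the semi-phonetic strings are simply copied from the slide; the function name `levenshtein` is illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insert, delete and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# Pairs copied from the slide: written spelling, then semi-phonetic form.
pairs = [
    (("Grietje", "Margaretha"), ("GRYTJE", "MARGARETA")),
    (("Acquoij", "Akkooi"), ("AKOY", "AKOY")),
]
for (raw_a, raw_b), (pho_a, pho_b) in pairs:
    print(raw_a, raw_b, levenshtein(raw_a, raw_b),
          "-> semi-phonetic:", levenshtein(pho_a, pho_b))
```

For Acquoij / Akkooi the distance drops from 4 to 0 after transcription. Grietje / Margaretha remain far apart even in semi-phonetic form, which is the case the 'proven' variant pairs are meant to cover.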
3
issues (2)
ambiguity (no one-to-one relation)
Mina – {Wilhelmina, Hermina, Jacomina..}
Lootzen – { Lodze, Loisen, Luytzen, Lothen..}
edit distance=1 (semi-phonetic)
partial solution:
(sub)standards
Mina > MINA
Willy > WIL
Wilhelmina > WILHELM
relations between standards
MINA < WILHELM
MINA < HEERMAN
WIL < WILHELM
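The (sub)standards and their relations can be pictured as two small lookup tables. The sketch below uses only the examples from this slide and assumes that "A < B" reads as "A is a sub-standard of B"; the names `STANDARD_OF`, `BROADER`, `candidate_standards` and `may_match` are hypothetical and not taken from the NAMES corpus.

```python
# Name -> (sub)standard, as listed on the slide.
STANDARD_OF = {"Mina": "MINA", "Willy": "WIL", "Wilhelmina": "WILHELM"}

# Relations between standards. Assumption: "A < B" means
# "A is a sub-standard of B".
BROADER = {
    "MINA": {"WILHELM", "HEERMAN"},
    "WIL": {"WILHELM"},
}


def candidate_standards(name: str) -> set[str]:
    """The name's own standard plus every standard it is a sub-standard of."""
    standard = STANDARD_OF[name]
    return {standard} | BROADER.get(standard, set())


def may_match(name_a: str, name_b: str) -> bool:
    """Two names are possible variants if their candidate standards overlap."""
    return bool(candidate_standards(name_a) & candidate_standards(name_b))


print(sorted(candidate_standards("Mina")))
print(may_match("Mina", "Wilhelmina"))
```

Keeping the ambiguous short form under its own standard (MINA) and linking it upward avoids forcing a one-to-one choice between Wilhelmina and Hermina.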
4
material
full material:
564.000 surnames and 190.000 first names (NAMES corpus)
from 19th century birth, marriage and death certificates
(52 million person tokens from Catch LINKS - Wiewaswie)
near exact true person resolution in LINKS project:
birth: Jannigje, daughter of Arie Kool and Cornelia van Gent
death: Jannie, daughter of Arie Kool and Cornelia van Gent
proven name variant pairs:
328.411 surname variant pairs (Ruijter/Ruyter)
134.220 first name variant pairs (Jannigje/Jannie)
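As an illustration of how variant pairs follow from resolved persons, the sketch below counts differing name spellings across certificates that were linked to the same person. The input format (a list of name tuples per resolved person) and the function name `variant_pairs` are assumptions made for this example; the slides do not describe the actual LINKS data structures.

```python
from collections import Counter

# Hypothetical input: for each resolved person, the first name as written
# on the birth certificate and on the death certificate.
linked_first_names = [
    ("Jannigje", "Jannie"),
    ("Jannie", "Jannigje"),
    ("Arie", "Arie"),
]


def variant_pairs(links):
    """Count unordered pairs of differing spellings for the same person."""
    counts = Counter()
    for name_a, name_b in links:
        if name_a != name_b:
            counts[tuple(sorted((name_a, name_b)))] += 1
    return counts


for pair, n in variant_pairs(linked_first_names).items():
    print(pair, n)
```

Counting how often a pair occurs gives a simple measure of how well 'proven' it is.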
5
initial clustering
LINKS:
127.000 SURNAMES on 15.114 standards
48.000 FIRST NAMES on 926 gender-independent standards
6
dictionaries
extraction of variants and lemmas from:
FIRST NAMES dictionary
van der Schaar, 20.000 names, 12.500 in NAMES with 2.400 lemmas
corpus of Dutch SURNAMES
CBG, 320.000 names, 85.000 in NAMES with 15.000 base names
SURNAMES in Belgium and Northern France
Debrabandere, 118.000 names, 52.000 in NAMES related to 18.700 lemmas
7
expert review of LINKS standards
(1) utilize dictionaries (known variants)
(2) choose standards which optimize variant pair coverage
maximize number of variant pairs with names under same standard
(3) combine comparable standards
Adriaans + Adriaansen
surnames: 15.114 > 10.805 standards
first names: 926 > 782 gender-independent standards
8
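The criterion in step (2) can be sketched as a simple count: a variant pair is covered when both names fall under the same standard. The standard labels and the function name `pairs_covered` below are hypothetical; only the name pairs come from the slides.

```python
def pairs_covered(variant_pairs, standard_of):
    """Number of variant pairs whose two names share one standard."""
    covered = 0
    for name_a, name_b in variant_pairs:
        std_a = standard_of.get(name_a)
        std_b = standard_of.get(name_b)
        if std_a is not None and std_a == std_b:
            covered += 1
    return covered


pairs = [("Ruijter", "Ruyter"), ("Adriaans", "Adriaansen")]

# Hypothetical assignment with two separate, comparable standards.
before = {"Ruijter": "RUYTER", "Ruyter": "RUYTER",
          "Adriaans": "ADRIAANS", "Adriaansen": "ADRIAANSEN"}

# Same assignment after combining the comparable standards (step 3).
after = dict(before, Adriaansen="ADRIAANS")

print(pairs_covered(pairs, before), pairs_covered(pairs, after))
```

Combining Adriaans and Adriaansen lowers the number of standards and raises the number of covered pairs at the same time, which matches the direction of the figures above.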
TICCL support
expectation:
- can learn from proven name variant pairs
- can assist the expert review of LINKS standards
- can automatically extend standards to remaining names
experience:
- no learning
- no assistance
workaround:
- learning: weighted edit-distance based on strings
(proof of concept developed, not yet applied)
- candidate selection: brute force comparison
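One possible reading of the weighted edit-distance workaround is an edit distance in which substitutions that occur often in proven variant pairs cost less than arbitrary ones. The sketch below is not the proof of concept mentioned on the slide; the cost table, its values and the semi-phonetic example strings are invented for illustration.

```python
def weighted_distance(a, b, sub_cost, default=1.0):
    """Edit distance with per-character-pair substitution costs.

    Insertions and deletions cost 1.0; a substitution costs the value in
    sub_cost (looked up in either order) or `default` if the pair is absent.
    """
    previous = [float(j) for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        current = [float(i)]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cost = 0.0
            else:
                cost = sub_cost.get((ch_a, ch_b),
                                    sub_cost.get((ch_b, ch_a), default))
            current.append(min(previous[j] + 1.0,
                               current[j - 1] + 1.0,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


# Hypothetical learned cost: Y and I are frequently interchanged.
costs = {("Y", "I"): 0.25}
print(weighted_distance("RUYTER", "RUITER", costs))
print(weighted_distance("RUYTER", "RUITER", {}))
```

In such a scheme the costs would be estimated from the proven variant pairs, for example from how often each substitution is observed.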
9
remaining 75% of names
brute force comparison (semi-phonetic level)
to names with standard (courtesy Marijn Schraagen)
apply simple rules:
edit-distance = 1
length > 5 characters
equal initial 2 characters
decide for:
standard shared by most of the selected names
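The selection rules above can be sketched as follows. This is not the original brute-force code: the function names, the example names and the reading that the length rule applies to the name being assigned are assumptions.

```python
from collections import Counter


def within_one_edit(a, b):
    """True if a and b differ by exactly one insert, delete or substitute."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]          # one extra character in b


def extend_standard(name, standard_of):
    """Assign the standard shared by most selected names, or None."""
    if len(name) <= 5:  # rule: length > 5 characters (assumed: of `name`)
        return None
    votes = Counter(
        standard
        for known, standard in standard_of.items()
        if known[:2] == name[:2] and within_one_edit(name, known)
    )
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical semi-phonetic names that already have a standard.
known = {"WILHELMINA": "WILHELM", "WILHELMINE": "WILHELM", "HERMINA": "HEERMAN"}
print(extend_standard("WILHELMYNA", known))
```

Ties between standards are resolved arbitrarily in this sketch; comparing every remaining name with every standardized name is quadratic, hence "brute force".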
10
final result first names
190.000 different names (52,6 million tokens)
782 standards
44.600 with expert NAMES standard (98,83% coverage of tokens)
103.450 automatically extended set (99,25% coverage of tokens)
visual inspection: fairly good quality
11
final result surnames (elements)
564.000 different names (52,4 million tokens)
10.805 standards
119.900 with expert NAMES standard (49,83% coverage of tokens)
327.497 automatically extended set (89,47 % coverage of tokens)
add high-frequency solitary names that have no variant pairs
(Broek, Hoekstra, Hoek, Groen, Boersma, Ploeg..)
11.605 standards
330.314 including solitary names (94,11 % coverage of tokens)
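The coverage percentages are token-based: they weigh each name by how often it occurs. A minimal sketch with made-up frequencies shows why adding a few frequent solitary names raises coverage noticeably; the function name `token_coverage` and all counts are hypothetical.

```python
def token_coverage(token_counts, standardized):
    """Percentage of tokens whose name has a standard."""
    total = sum(token_counts.values())
    covered = sum(n for name, n in token_counts.items() if name in standardized)
    return 100.0 * covered / total


# Made-up token counts per surname.
token_counts = {"Jansen": 900, "Broek": 80, "Xyzq": 20}

print(token_coverage(token_counts, {"Jansen"}))
print(token_coverage(token_counts, {"Jansen", "Broek"}))
```

Because name frequencies are highly skewed, a modest number of standardized names can account for most tokens.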
12
output
- names and standards with quality level
- relations between standards
in RDF format for Linked Open Data & lexicon service
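To give an impression of what such output could look like, the sketch below writes a few statements as N-Triples lines. The namespace, predicate names and quality labels are placeholders invented for this example; the slides do not specify the actual vocabulary of the NAMES corpus.

```python
BASE = "http://example.org/names/"       # hypothetical namespace
HAS_STANDARD = BASE + "hasStandard"      # hypothetical predicate
SUB_STANDARD_OF = BASE + "subStandardOf" # hypothetical predicate
QUALITY = BASE + "qualityLevel"          # hypothetical predicate


def triple(subject, predicate, obj, literal=False):
    """Format one N-Triples statement; obj is an IRI unless literal=True."""
    rendered = f'"{obj}"' if literal else f"<{obj}>"
    return f"<{subject}> <{predicate}> {rendered} ."


lines = [
    triple(BASE + "name/Mina", HAS_STANDARD, BASE + "standard/MINA"),
    triple(BASE + "name/Mina", QUALITY, "expert", literal=True),
    triple(BASE + "standard/MINA", SUB_STANDARD_OF, BASE + "standard/WILHELM"),
]
print("\n".join(lines))
```

A real export would more likely use an RDF library and an established vocabulary; plain string formatting is used here only to keep the example self-contained.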
13

NAMES Presentation by Gerrit Bloothooft, CLARIAH Toogdag 19-10-2018


Editor's notes

  1. Clustering of name variants Frequency > 10
  2. Remaining 0,73% ~400.000 tokens
  3. Rest: a limited number of frequent names without variant pairs! 2304 names, 657 rest_standaard (different fonnaam); som_aantal = 1.956.782; 1.956.782 + 47.676.966 = 49.633.748; 49.633.748 / 52.408.712 = 94.70%