NAMES Presentation by Gerrit Bloothooft, CLARIAH Toogdag 19-10-2018

Dutch corpus of person name variants. This project aims to develop a gold standard for person name variants, based mainly on the LINKS corpus of 19th- and 20th-century person names from the vital register (63 million tokens). So far, 25% of the 564.000 surnames and 189.000 first names have been standardized, based on variants associated with the same individual. This core set still requires expert review, which will be assisted by the CLARIAH tool TICCL. The review will also serve as the (statistical) learning phase of TICCL, so that it can handle previously unseen variants, and a data structure will be established to deal with ambiguities and to accommodate different levels of standardization. In a second phase, the remaining 75% of the LINKS corpus will be standardized. The corpus will be delivered both in RDF format for Linked Open Data and as a lexical service. Its usage will be tested within the CLARIAH Anansi environment.

NAMES
Gerrit Bloothooft & David Onland, UiL-OTS Utrecht
Martin Reynaert, TiCC / Tilburg University
Katrien Depuydt & Tanneke Schoonheim, INT Leiden

aim
standardization of historical person names for
richer search results
OCR post-processing
nominal record linkage
onomastic research
resulting NAMES corpus
in RDF format for Linked Open Data
& lexicon service

issues (1)
large edit distance
Grietje – Margaretha
Acquoij – Akkooi
partial solution:
(1) semi-phonetic transcription
GRYTJE – MARGARETA
AKOY – AKOY
(2) use ‘proven’ variant pairs
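
The effect of a semi-phonetic transcription on edit distance can be sketched as follows. The rewrite rules here are illustrative assumptions chosen only to reproduce the four examples on this slide; they are not the rule set used in the NAMES project.

```python
import re

def semi_phonetic(name: str) -> str:
    """Toy semi-phonetic key. The rules are assumptions made for this sketch."""
    key = name.upper()
    for pattern, repl in [("IJ", "Y"), ("IE", "Y"), ("CQU", "K"), ("TH", "T")]:
        key = key.replace(pattern, repl)
    key = re.sub(r"(.)\1+", r"\1", key)  # collapse doubled letters (KK -> K, OO -> O)
    key = re.sub(r"OI$", "OY", key)      # treat final -OI like -OY
    return key

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

for raw in ("Grietje", "Margaretha", "Acquoij", "Akkooi"):
    print(raw, "->", semi_phonetic(raw))
```

Under these toy rules Acquoij and Akkooi collapse to the same key, while GRYTJE and MARGARETA remain far apart, which is the case the 'proven' variant pairs are meant to cover.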

issues (2)
ambiguity (no one-to-one relation)
Mina – {Wilhelmina, Hermina, Jacomina..}
Lootzen – { Lodze, Loisen, Luytzen, Lothen..}
edit distance=1 (semi-phonetic)
partial solution:
(sub)standards
Mina > MINA
Willy > WIL
Wilhelmina > WILHELM
relations between standards
MINA < WILHELM
MINA < HEERMAN
WIL < WILHELM
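
One possible way to hold (sub)standards and the relations between them is a pair of lookup tables. This is a minimal sketch built only from the examples on this slide; the names `STANDARD_OF`, `BROADER`, `possible_standards` and `may_match` are hypothetical and do not describe the project's actual data structure.

```python
# name -> its (sub)standard, as in "Mina > MINA"
STANDARD_OF = {"Mina": "MINA", "Willy": "WIL", "Wilhelmina": "WILHELM"}

# substandard -> broader standards it may belong to, as in "MINA < WILHELM"
BROADER = {"MINA": {"WILHELM", "HEERMAN"}, "WIL": {"WILHELM"}}

def possible_standards(name: str) -> set[str]:
    """Return the name's own standard plus every standard reachable via '<'."""
    start = STANDARD_OF[name]
    seen, stack = {start}, [start]
    while stack:
        for parent in BROADER.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def may_match(a: str, b: str) -> bool:
    """Two names are candidate variants if their standard sets overlap."""
    return bool(possible_standards(a) & possible_standards(b))

print(sorted(possible_standards("Mina")))
```

Keeping the ambiguity explicit in the relation table lets a search for Mina return Wilhelmina as a candidate without claiming the two names are always the same.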

material
full material:
564.000 surnames and 190.000 first names (NAMES corpus)
from 19th century birth, marriage and death certificates
(52 million person tokens from Catch LINKS - Wiewaswie)
near exact true person resolution in LINKS project:
birth: Jannigje, daughter of Arie Kool and Cornelia van Gent
death: Jannie, daughter of Arie Kool and Cornelia van Gent
proven name variant pairs:
328.411 surname variant pairs (Ruijter/Ruyter)
134.220 first name variant pairs (Jannigje/Jannie)
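
Deriving proven variant pairs from linked certificates could look roughly like this. The record layout (person id, certificate type, first name) and the sample rows are assumptions for illustration; the LINKS data model is not described on the slide.

```python
from collections import Counter

# Hypothetical linked records: (person_id, certificate type, first name).
records = [
    (1, "birth", "Jannigje"),
    (1, "death", "Jannie"),
    (2, "birth", "Arie"),
    (2, "marriage", "Arie"),
]

def variant_pairs(rows):
    """Count unordered pairs of distinct spellings recorded for the same person."""
    names_by_person = {}
    for person_id, _certificate, name in rows:
        names_by_person.setdefault(person_id, set()).add(name)
    pairs = Counter()
    for names in names_by_person.values():
        ordered = sorted(names)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs[(first, second)] += 1
    return pairs

print(variant_pairs(records))
```

A person whose name is spelled identically on every certificate contributes no pair, so only genuine spelling variation ends up in the count.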

initial clustering
LINKS:
127.000 SURNAMES on 15.114 standards
48.000 FIRST NAMES on 926 gender-independent standards

dictionaries
extraction of variants and lemmas from:
FIRST NAMES dictionary
van der Schaar, 20.000 names, 12.500 in NAMES with 2.400 lemmas
corpus of Dutch SURNAMES
CBG, 320.000 names, 85.000 in NAMES with 15.000 base names
SURNAMES in Belgium and Northern France
Debrabandere, 118.000 names, 52.000 in NAMES related to 18.700 lemmas

expert review of LINKS standards
(1) utilize dictionaries (known variants)
(2) choose standards which optimize variant pair coverage
maximize number of variant pairs with names under same standard
(3) combine comparable standards
Adriaans + Adriaansen
surnames: 15.114 > 10.805 standards
first names: 926 > 782 gender-independent standards
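
The optimization target in step (2), the share of proven variant pairs whose two names fall under the same standard, can be sketched as a simple metric. The pair list and standard labels below are toy assumptions; only the Ruijter/Ruyter and Adriaans/Adriaansen spellings come from the slides.

```python
def pair_coverage(pairs, standard_of):
    """Fraction of variant pairs whose two names share one standard."""
    covered = 0
    for a, b in pairs:
        std_a, std_b = standard_of.get(a), standard_of.get(b)
        if std_a is not None and std_a == std_b:
            covered += 1
    return covered / len(pairs)

# Hypothetical toy data.
pairs = [("Ruijter", "Ruyter"), ("Adriaans", "Adriaansen")]
before = {
    "Ruijter": "RUYTER", "Ruyter": "RUYTER",
    "Adriaans": "ADRIAANS", "Adriaansen": "ADRIAANSEN",
}
# Step (3): combine the comparable standards Adriaans + Adriaansen.
after = {**before, "Adriaansen": "ADRIAANS"}

print(pair_coverage(pairs, before), pair_coverage(pairs, after))
```

Combining comparable standards reduces the number of standards and, in this toy case, raises pair coverage from half to all of the pairs.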

TICCL support
expectation:
- can learn from proven name variant pairs
- can assist the expert review of LINKS standards
- can automatically extend standards to remaining names
experience:
- no learning
- no assistance
workaround:
- learning: weighted edit-distance based on strings
(proof of concept developed, not yet applied)
- candidate selection: brute force comparison
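
The slide does not describe the proof of concept for the weighted edit distance. As a generic illustration of the idea only, the sketch below gives selected character substitutions a lower cost than the default; the cost table is a hand-set assumption, whereas in practice such weights might be estimated from the proven variant pairs.

```python
def weighted_edit_distance(a, b, sub_cost, indel=1.0):
    """Levenshtein distance where specific substitutions may cost less than 1."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cost = 0.0
            else:
                # Look the pair up in either order; unknown pairs cost 1.0.
                cost = sub_cost.get((ca, cb), sub_cost.get((cb, ca), 1.0))
            curr.append(min(prev[j] + indel,       # deletion
                            curr[j - 1] + indel,   # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]

# Hypothetical weights: Y/I and K/C are treated as cheap spelling variation.
costs = {("Y", "I"): 0.2, ("K", "C"): 0.2}

print(weighted_edit_distance("RUYTER", "RUITER", costs))
print(weighted_edit_distance("RUYTER", "RUSTER", costs))
```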

remaining 75% names
brute force comparison (semi-phonetic level)
to names with standard (courtesy Marijn Schraagen)
apply simple rules:
edit-distance = 1
length > 5 characters
equal initial 2 characters
decide for:
standard shared by most of the selected names
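
The rules above can be sketched as a brute-force lookup with a majority vote. This is a reading of the slide, not the project's code: it assumes the length rule applies to the unseen name, and the names and standard labels in the example are invented.

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def assign_standard(name, standardized):
    """Pick the standard shared by most names that pass the three rules.

    `standardized` maps semi-phonetic names that already have a standard
    to that standard. Returns None when no candidate is selected.
    """
    if len(name) <= 5:                      # rule: length > 5 characters
        return None
    votes = Counter(
        standard
        for known, standard in standardized.items()
        if known[:2] == name[:2]            # rule: equal initial 2 characters
        and edit_distance(name, known) == 1 # rule: edit distance = 1
    )
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical standardized names (semi-phonetic level).
standardized = {"JACOBUS": "JACOB", "JACOBS": "JACOB", "JACOBEA": "JACOBA"}

print(assign_standard("JACOBES", standardized))
```

Here JACOBES is within distance 1 of all three known names, and two of them share the standard JACOB, so that standard wins the vote.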

final result first names
190.000 different names (52,6 million tokens)
782 standards
44.600 with expert NAMES standard (98,83% coverage of tokens)
103.450 automatically extended set (99,25% coverage of tokens)
visual inspection: fairly good quality

final result surnames (elements)
564.000 different names (52,4 million tokens)
10.805 standards
119.900 with expert NAMES standard (49,83% coverage of tokens)
327.497 automatically extended set (89,47 % coverage of tokens)
add high-frequency solitary names that have no variant pairs
(Broek, Hoekstra, Hoek, Groen, Boersma, Ploeg..)
11.605 standards
330.314 including solitary names (94,11 % coverage of tokens)

output
- names and standards with quality level
- relations between standards
in RDF format for Linked Open Data & lexicon service

Editor's notes

Clustering of name variants
Frequency > 10
Remaining 0,73% ~400.000 tokens
Rest: a limited number of frequent names without variant pairs! 2304 names, 657 rest_standaard (different fonnaam)
som_aantal = 1.956.782
1.956.782 + 47.676.966 = 49.633.748 / 52.408.712 = 94.70%