Dutch corpus of person name variants
This project aims to develop a gold standard for person name variants, based mainly on the LINKS corpus of 19th- and 20th-century person names from the vital register (63 million tokens). About 25% of the 564.000 surnames and 189.000 first names have already been standardized, based on variants associated with the same individual. This core set still requires expert review, which will be assisted by the CLARIAH tool TICCL. The review will also serve as the (statistical) learning phase of TICCL, so that it can handle previously unseen variants, and a data structure will be established to deal with ambiguities and to accommodate different levels of standardization. In a second phase, the remaining 75% of the LINKS corpus will be standardized.
The corpus will be delivered both in RDF format for Linked Open Data and as a lexical service. Its usage will be tested within the CLARIAH Anansi environment.
NAMES Presentation by Gerrit Bloothooft, CLARIAH Toogdag 19-10-2018
1. NAMES
Gerrit Bloothooft & David Onland, UiL-OTS Utrecht
Martin Reynaert, TiCC / Tilburg University
Katrien Depuydt & Tanneke Schoonheim, INT Leiden
2. aim
standardization of historical person names for
richer search results
OCR post-processing
nominal record linkage
onomastic research
resulting NAMES corpus
in RDF format for Linked Open Data
& lexicon service
4. issues (2)
ambiguity (no one-to-one relation)
Mina – {Wilhelmina, Hermina, Jacomina..}
Lootzen – { Lodze, Loisen, Luytzen, Lothen..}
edit distance=1 (semi-phonetic)
partial solution:
(sub)standards
Mina > MINA
Willy > WIL
Wilhelmina > WILHELM
relations between standards
MINA < WILHELM
MINA < HEERMAN
WIL < WILHELM
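The (sub)standards and the relations between them can be pictured as a small directed graph. The sketch below is only an illustration, not the project's data structure: it assumes that "A < B" means standard A is a narrower form that may belong to the broader standard B, and the function names are hypothetical. It shows how an ambiguous name such as Mina keeps its own standard while still leading to every broader standard it may belong to.

```python
from collections import defaultdict

# Name -> (sub)standard, taken from the slide examples.
NAME_TO_STANDARD = {
    "Mina": "MINA",
    "Willy": "WIL",
    "Wilhelmina": "WILHELM",
}

# Relations between standards as (narrower, broader) pairs.
# Assumption: "MINA < WILHELM" on the slide is read as "MINA may belong to WILHELM".
RELATIONS = [
    ("MINA", "WILHELM"),
    ("MINA", "HEERMAN"),
    ("WIL", "WILHELM"),
]


def build_broader_index(relations):
    """Map each standard to the set of standards directly above it."""
    broader = defaultdict(set)
    for narrow, broad in relations:
        broader[narrow].add(broad)
    return broader


def candidate_standards(name, name_to_standard, broader):
    """Return the name's own standard plus every standard reachable via '<'."""
    start = name_to_standard.get(name)
    if start is None:
        return set()
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for nxt in broader.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


BROADER = build_broader_index(RELATIONS)
print(sorted(candidate_standards("Mina", NAME_TO_STANDARD, BROADER)))
# ['HEERMAN', 'MINA', 'WILHELM']
```

Keeping the relation separate from the name-to-standard mapping means a search can stop at the sub-standard (strict matching) or follow the relations upward (broader matching).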
5. material
full material:
564.000 surnames and 190.000 first names (NAMES corpus)
from 19th century birth, marriage and death certificates
(52 million person tokens from Catch LINKS - Wiewaswie)
near exact true person resolution in LINKS project:
birth: Jannigje, daughter of Arie Kool and Cornelia van Gent
death: Jannie, daughter of Arie Kool and Cornelia van Gent
proven name variant pairs:
328.411 surname variant pairs (Ruijter/Ruyter)
134.220 first name variant pairs (Jannigje/Jannie)
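Once certificates have been linked to the same person, variant pairs follow mechanically from the different spellings recorded for that person. The sketch below illustrates this under simplifying assumptions: the record layout (person id, certificate type, first name) and the person ids are hypothetical, and the linking itself, which is the hard part done in the LINKS project, is taken as given.

```python
from itertools import combinations

# Hypothetical linked records: (person_id, certificate_type, first_name).
RECORDS = [
    (1, "birth", "Jannigje"),
    (1, "death", "Jannie"),
    (2, "birth", "Arie"),
    (2, "marriage", "Arie"),
]


def variant_pairs(records):
    """Collect unordered pairs of distinct spellings recorded for one person."""
    names_by_person = {}
    for person_id, _certificate, name in records:
        names_by_person.setdefault(person_id, set()).add(name)

    pairs = set()
    for names in names_by_person.values():
        # Sorting makes each pair canonical, so (a, b) and (b, a) collapse.
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs


print(variant_pairs(RECORDS))
```

A person whose name is spelled identically on every certificate (Arie above) contributes no pair.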
7. dictionaries
extraction of variants and lemmas from:
FIRST NAMES dictionary
van der Schaar, 20.000 names, 12.500 in NAMES with 2.400 lemmas
corpus of Dutch SURNAMES
CBG, 320.000 names, 85.000 in NAMES with 15.000 base names
SURNAMES in Belgium and North France
Debrabandere, 118.000 names, 52.000 in NAMES related to 18.700 lemmas
8. (1) utilize dictionaries (known variants)
(2) choose standards which optimize variant pair coverage
maximize number of variant pairs with names under same standard
(3) combine comparable standards
Adriaans + Adriaansen
surnames: 15.114 > 10.805 standards
first names: 926 > 782 standards gender-independent
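The optimization target in step (2), and the effect of combining standards in step (3), can be made concrete with a small coverage measure. This is a sketch under assumptions, not the procedure used in the project: the standard labels (RUITER, ADRIAANS, ADRIAANSEN), the pair Adriaanse/Adriaansen and both function names are invented for illustration.

```python
def pair_coverage(pairs, standard_of):
    """Fraction of variant pairs whose two names share the same standard."""
    if not pairs:
        return 0.0
    covered = sum(
        1
        for a, b in pairs
        if a in standard_of and b in standard_of and standard_of[a] == standard_of[b]
    )
    return covered / len(pairs)


def merge_standards(standard_of, old, new):
    """Return a new assignment in which standard `old` is folded into `new`."""
    return {name: (new if std == old else std) for name, std in standard_of.items()}


# Proven variant pairs (the first is from the slides, the others are made up).
PAIRS = [
    ("Ruijter", "Ruyter"),
    ("Adriaans", "Adriaansen"),
    ("Adriaanse", "Adriaansen"),
]

# Hypothetical initial assignment with two comparable standards.
STANDARD_OF = {
    "Ruijter": "RUITER",
    "Ruyter": "RUITER",
    "Adriaans": "ADRIAANS",
    "Adriaanse": "ADRIAANSEN",
    "Adriaansen": "ADRIAANSEN",
}

before = pair_coverage(PAIRS, STANDARD_OF)
merged = merge_standards(STANDARD_OF, "ADRIAANSEN", "ADRIAANS")
after = pair_coverage(PAIRS, merged)
print(before, after)
```

In this toy example two of the three pairs are covered before the merge and all three after it, while the number of standards drops from three to two.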
expert review of LINKS standards
9. TICCL support
expectation:
- can learn from proven name variant pairs
- can assist the expert review of LINKS standards
- can automatically extend standards to remaining names
experience:
- no learning
- no assistance
workaround:
- learning: weighted edit-distance based on strings
(proof of concept developed, not yet applied)
- candidate selection: brute force comparison
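A weighted edit distance, as named in the workaround, differs from the plain Levenshtein distance in that some substitutions cost less than others. The sketch below is not the proof of concept mentioned on the slide; it is a generic illustration in which the cost table is hypothetical. In a learning setup such weights would presumably be estimated from the proven variant pairs.

```python
def weighted_edit_distance(a, b, sub_cost=None, indel_cost=1.0):
    """Levenshtein distance with per-character-pair substitution costs.

    sub_cost maps (char, char) to a cost; unlisted substitutions cost 1.0.
    Costs are treated as symmetric.
    """
    sub_cost = sub_cost or {}
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                substitution = 0.0
            else:
                substitution = sub_cost.get((ca, cb), sub_cost.get((cb, ca), 1.0))
            cur.append(
                min(
                    prev[j] + indel_cost,         # delete ca
                    cur[j - 1] + indel_cost,      # insert cb
                    prev[j - 1] + substitution,   # substitute ca -> cb
                )
            )
        prev = cur
    return prev[-1]


# Hypothetical weights: spelling alternations that are cheap in Dutch names.
WEIGHTS = {("c", "k"): 0.2, ("i", "y"): 0.2}

print(weighted_edit_distance("kool", "cool", WEIGHTS))
print(weighted_edit_distance("kool", "pool", WEIGHTS))
```

With these weights kool/cool is scored as much closer than kool/pool, although both differ by one character.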
10. remaining 75% names
brute force comparison (semi-phonetic level)
to names with standard (courtesy Marijn Schraagen)
apply simple rules:
edit-distance = 1
length > 5 characters
equal initial 2 characters
decide for:
standard shared by most of the selected names
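The selection rules and the majority decision above can be sketched as follows. Several points are assumptions: the comparison is done on the raw spelling rather than on a semi-phonetic form, the length rule is applied to the name that still lacks a standard, ties are broken by whichever standard is encountered first, and the example names and standards (JAN, HAN) are made up.

```python
from collections import Counter


def edit_distance(a, b):
    """Plain Levenshtein distance (unit cost for insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def assign_standard(name, standard_of):
    """Brute-force comparison of `name` against all names that have a standard."""
    if len(name) <= 5:                      # rule: length > 5 characters
        return None
    votes = Counter(
        standard
        for known, standard in standard_of.items()
        if known[:2] == name[:2]            # rule: equal initial 2 characters
        and edit_distance(name, known) == 1 # rule: edit distance = 1
    )
    if not votes:
        return None
    # Decide for the standard shared by most of the selected names.
    return votes.most_common(1)[0][0]


# Hypothetical names that already carry a standard.
STANDARD_OF = {
    "Jannigje": "JAN",
    "Jannetje": "JAN",
    "Hannitje": "HAN",
}

print(assign_standard("Jannitje", STANDARD_OF))  # JAN
print(assign_standard("Janie", STANDARD_OF))     # None (too short)
```

Hannitje is at edit distance 1 from Jannitje but is excluded by the initial-characters rule, so only the two JAN names vote.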
11. final result first names
190.000 different names (52,6 million tokens)
782 standards
44.600 with expert NAMES standard (98,83% coverage of tokens)
103.450 automatically extended set (99,25% coverage of tokens)
visual inspection: fairly good quality
12. final result surnames (elements)
564.000 different names (52,4 million tokens)
10.805 standards
119.900 with expert NAMES standard (49,83% coverage of tokens)
327.497 automatically extended set (89,47% coverage of tokens)
add high-frequent solitary names that have no variant pairs
(Broek, Hoekstra, Hoek, Groen, Boersma, Ploeg..)
11.605 standards
330.314 including solitary names (94,11% coverage of tokens)
13. output
- names and standards with quality level
- relations between standards
in RDF format for Linked Open Data & lexicon service
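To give an impression of what such output could look like, the sketch below writes names, standards, quality levels and relations as N-Triples lines. It is purely illustrative and does not reflect the actual NAMES data model: the example.org namespace, the hasStandard and qualityLevel properties and the quality labels are hypothetical, and reusing skos:broader for relations between standards is an assumption. Names containing spaces or non-ASCII characters would additionally need IRI escaping, which is omitted here.

```python
BASE = "http://example.org/names/"  # hypothetical namespace
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"


def uri(local):
    """Wrap a local name in the hypothetical base namespace."""
    return f"<{BASE}{local}>"


def to_ntriples(names, relations):
    """Serialize names, their standards and standard relations as N-Triples.

    names:     {name: (standard, quality_level)}
    relations: [(narrower_standard, broader_standard), ...]
    """
    lines = []
    for name, (standard, quality) in sorted(names.items()):
        subject = uri("name/" + name)
        lines.append(f'{subject} {uri("hasStandard")} {uri("standard/" + standard)} .')
        lines.append(f'{subject} {uri("qualityLevel")} "{quality}" .')
    for narrow, broad in relations:
        lines.append(
            f'{uri("standard/" + narrow)} <{SKOS_BROADER}> {uri("standard/" + broad)} .'
        )
    return lines


NAMES = {
    "Mina": ("MINA", "expert"),           # quality labels are made up
    "Wilhelmina": ("WILHELM", "expert"),
}
RELATIONS = [("MINA", "WILHELM")]

for line in to_ntriples(NAMES, RELATIONS):
    print(line)
```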
Editor's notes
Clustering of name variants
Frequency > 10
Remaining 0,73% ~400.000 tokens
Rest: a limited number of frequent names without variant pairs!
2304 names, 657 rest_standaard (different fonnaam)
som_aantal = 1.956.782
1.956.782 + 47.676.966 = 49.633.748
49.633.748 / 52.408.712 = 94.70%