IMPACT Final Conference - Language Parallel Sessions - Gotscharek
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Special resources to access 16th century German
Ludwig-Maximilians-Universität München
Annette Gotscharek
15. 10. 2011, IMPACT Conference
2. Special resources to access 16th century German
“access”?
OCR:
Role of the lexicon: defines the set of valid words.
... Geist
Geister
Teile
gemütlich …
Information Retrieval (IR):
Role of the lexicon: meaningful expansion of the user query to increase recall.
... Geist → Geister, Geiste, Geistern
Teil → Teile, Teils, Teilen
gemütlich → gemütlicher, gemütlichste ...
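The two roles of the lexicon can be illustrated with a minimal sketch. The word lists are the examples from the slide; the data layout and the function names `is_valid_word` and `expand_query` are assumptions made for illustration, not the IMPACT implementation.

```python
# Toy lexica built from the slide's examples (layout is an assumption).
OCR_LEXICON = {"Geist", "Geister", "Teile", "gemütlich"}

IR_LEXICON = {
    "Geist": ["Geister", "Geiste", "Geistern"],
    "Teil": ["Teile", "Teils", "Teilen"],
    "gemütlich": ["gemütlicher", "gemütlichste"],
}


def is_valid_word(token):
    """OCR role: the lexicon defines the set of valid words."""
    return token in OCR_LEXICON


def expand_query(term):
    """IR role: expand a query term with related word forms to raise recall."""
    return [term] + IR_LEXICON.get(term, [])


if __name__ == "__main__":
    print(is_valid_word("Geister"))
    print(expand_query("Teil"))
```

A purely modern lexicon of this kind rejects a historical spelling such as Geyster, which is the gap the adapted resources are meant to close.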
3. Special resources to access 16th century German
In IMPACT, we worked on documents from 1500-1950, but 16th century is special:
– Language period: Early New High German (1350-1650)
– Oldest and therefore most challenging period of printed books
– Large library holdings from 16th century at our partner library BSB
Linguistic features of historical language at word level:
                                    Historic    Modern       English
– Historical spelling variation:    geyſte      Geiste       spirit
– Historical morphology:            er frug     er fragte    he asked
– Obsolete vocabulary:              mirackel    Wunder (?)   miracle
– Obsolete character set:           aͤ           ä            …
Adapted linguistic resources are needed.
5. Adapted linguistic resources: structure
OCR (modern form, historical variant):
... Geist, Geyst
Geister, Geyster
Teile, Theile
gemütlich, gemüthlich …
Information Retrieval (IR) (modern word forms; historical variants):
... Geist → Geister, Geiste, Geistern; Geyster, Geyste, Geystern
Teil → Teile, Teils, Teilen; Theile, Theils, Theilen
gemütlich → gemütlicher, gemütlichste; gemüthlicher, gemüthlichste ...
6. Linguistic Resources for Historical Texts
Diachronic Groundtruth Corpus (1500-1950)
Hypothetical lexicon for rule based variants
Manually verified lexicon
8. Diachronic Groundtruth Corpus (1500-1950)
Collection of groundtruth material from different sources on the web and from non-public
electronic corpora (Institut für Deutsche Sprache Mannheim)
Large gap, especially in the 16th / 17th century:
With BSB: preparation of an additional corpus from BSB documents:
– Random selection of 100 works from digitized images of 16th and 17th century
– Mostly related to theology
– Latin texts excluded, no poems etc.
– Keyed by a service provider
– 1766 pages with ~ 858,000 tokens groundtruth material
9. Diachronic Groundtruth Corpus (1500-1950)
Gains of tokens by the extension of the corpus:
Complete corpus contains ~ 3,380,000 tokens in 500 texts from 4 centuries
basis for different analyses and lexicon building
10. Coverage on Diachronic Corpus: modern
Types (%)               1500-1549  1550-1599  1600-1649  1650-1699  1700-1749  1750-1799  1800-1849  1850-1899  1900-1949
Modern simple words          15.3       28.8       29.2       31.5       38.1       52.0       54.7       48.0       60.1
Modern compounds              5.1        6.1        6.9        8.6       7.13       15.5       20.6       28.1       27.8
Before 1750, modern resources (simple words plus compounds) cover at most about 45% of the vocabulary.
16th century: only 15% - 29% of types are modern simple words; modern closed compounds
are hardly relevant.
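The percentages in the table are type-based, so every distinct word form counts once regardless of its frequency. A minimal sketch of such a measurement follows, assuming the corpus is available as a list of tokens and the lexicon as a set. The function name `type_coverage` and the sample data are hypothetical, and the real evaluation may have normalised case or punctuation differently.

```python
def type_coverage(tokens, lexicon):
    """Return the percentage of distinct word forms (types) found in the lexicon."""
    types = set(tokens)
    if not types:
        return 0.0
    covered = sum(1 for t in types if t in lexicon)
    return 100.0 * covered / len(types)


if __name__ == "__main__":
    # Made-up sample: three types, of which only "Geist" is a modern form.
    sample_tokens = ["Geist", "Geyst", "Geist", "Theile"]
    modern_lexicon = {"Geist", "Teile"}
    print(round(type_coverage(sample_tokens, modern_lexicon), 1))
```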
12. Hypothetical lexicon for rule based variants
Systematic substitution rules (patterns) describe the difference
between modern and historical spelling:
t → th, ei → ey
(modern) teil → theyl (historic)
Based on the modern lexicon and the 140 manually collected
patterns, the set of all potential rule based historical variants can be
computed automatically (“hypothetical lexicon”).
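How a hypothetical lexicon could be computed from a modern word and a pattern set is sketched below. This is an illustration under stated assumptions, not the project's actual generator: each pattern may be applied or skipped at every position where it matches, words are handled in lowercase, and the function name `variants` is hypothetical.

```python
def variants(word, patterns):
    """Return all spellings reachable by optionally applying each
    substitution pattern (src, dst) at every position where src matches."""
    results = set()

    def walk(i, prefix):
        if i == len(word):
            results.add(prefix)
            return
        # Option 1: keep the current character unchanged.
        walk(i + 1, prefix + word[i])
        # Option 2: apply any pattern whose left-hand side starts here.
        for src, dst in patterns:
            if word.startswith(src, i):
                walk(i + len(src), prefix + dst)

    walk(0, "")
    return results


if __name__ == "__main__":
    pattern_set = [("t", "th"), ("ei", "ey")]
    print(sorted(variants("teil", pattern_set)))
```

The number of variants grows quickly with word length and pattern count, so a full pattern set applied to a large modern lexicon would need a more compact representation than an explicit set of strings.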
13. Hypothetical lexicon for rule based variants
[Diagram: the modern lexicon (…, Esel, Teil, …) is combined with the pattern set (e → eh, ei → ey, s → ß, l → ll, t → th) to produce the hypothetical lexicon (…, Esel, Esell, Esehl, Esehll, Eßel, Eßell, Eßehll, …, Teil, Teill, Teyl, Teyll, Tehill, Theil, …)]
14. Hypothetical lexicon for rule based variants
Automatic mapping from rule based historical variants to their equivalent in
the modern vocabulary is possible:
historic   modern
Geyst    = Geist + (ei → ey)
Theile   = Teile + (t → th)
Far from all historical variants can be described by simple replacement rules:
historic   modern
frug     = fragte + ?
Mirackel = ? + ?
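The mapping from a rule based variant back to its modern equivalent can be sketched by applying the patterns in reverse and keeping only results that exist in the modern lexicon. The function name `to_modern`, the lowercase handling, and the tiny lexicon are assumptions for illustration; the actual IMPACT matching procedure is not described on the slide.

```python
def to_modern(historic, patterns, modern_lexicon):
    """Return (modern_form, patterns_used) pairs for a historical spelling.

    Each pattern (src, dst) describes modern src -> historical dst, so the
    search replaces dst by src and checks the result against the lexicon.
    """
    word = historic.lower()  # simplifying assumption: ignore capitalisation
    hits = []

    def walk(i, prefix, used):
        if i == len(word):
            if prefix in modern_lexicon:
                hits.append((prefix, tuple(used)))
            return
        # Keep the current character as it is.
        walk(i + 1, prefix + word[i], used)
        # Or undo a pattern whose historical side starts here.
        for src, dst in patterns:
            if word.startswith(dst, i):
                walk(i + len(dst), prefix + src, used + [f"{src}->{dst}"])

    walk(0, "", [])
    return hits


if __name__ == "__main__":
    pattern_set = [("t", "th"), ("ei", "ey")]
    modern = {"geist", "teile", "fragte"}
    for form in ["Geyst", "Theile", "frug"]:
        print(form, to_modern(form, pattern_set, modern))
```

For frug the search returns nothing, which mirrors the slide's point: forms like frug or Mirackel need a manually assigned modern equivalent.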
15. Coverage on Diachronic Corpus: hypothetic
Types (%)               1500-1549  1550-1599  1600-1649  1650-1699  1700-1749  1750-1799  1800-1849  1850-1899  1900-1949
Modern simple words          15.3       28.8       29.2       31.5       38.1       52.0       54.7       48.0       60.1
Modern compounds              5.1        6.1        6.9        8.6       7.13       15.5       20.6       28.1       27.8
Hypothetic                   29.5       29.8       27.9       26.0       21.9       14.3        8.1        7.7        2.0
16th century: about 30% of the vocabulary is covered by the lexicon of rule based
variants
Applied as OCR lexicon via the IMPACT ABBYY External Dictionary Interface:
improvement of recognition rate (published 2009)
16. Coverage on Diachronic Corpus: missing
Types (%)               1500-1549  1550-1599  1600-1649  1650-1699  1700-1749  1750-1799  1800-1849  1850-1899  1900-1949
Modern simple words          15.3       28.8       29.2       31.5       38.1       52.0       54.7       48.0       60.1
Modern compounds              5.1        6.1        6.9        8.6       7.13       15.5       20.6       28.1       27.8
Hypothetic                   29.5       29.8       27.9       26.0       21.9       14.3        8.1        7.7        2.0
Missing                      45.9       28.7       29.7       26.0       23.5       15.1       13.9       13.5        8.1
Especially in the 16th century: Up to 46% “difficult” vocabulary.
manually verified lexicon necessary!
18. Manually verified IR-lexicon: Structure
One entry contains:
– Historical word form from the corpus
– Corresponding modern word form
– Patterns if applicable
– Corresponding modern lemma
– At least one occurrence in the corpus as an attestation for the reading
Manual assignment of modern word form and lemma
Explicit handling of variants that are not rule based
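The entry structure listed above could be modelled roughly as follows. This is a sketch only: the class and field names, the `document_id` value, and the validity check are hypothetical and do not reflect the actual IMPACT data format.

```python
from dataclasses import dataclass, field


@dataclass
class Attestation:
    document_id: str  # hypothetical identifier of the corpus text
    context: str      # snippet in which the historical form occurs


@dataclass
class LexiconEntry:
    historical_form: str
    modern_form: str
    lemma: str
    patterns: list = field(default_factory=list)      # empty if not rule based
    attestations: list = field(default_factory=list)

    def is_rule_based(self):
        """True if the variant is explained by substitution patterns."""
        return bool(self.patterns)

    def is_attested(self):
        """An entry needs at least one occurrence in the corpus."""
        return len(self.attestations) >= 1


if __name__ == "__main__":
    entry = LexiconEntry(
        historical_form="frug",
        modern_form="fragte",
        lemma="fragen",
        attestations=[Attestation("doc-001", "er frug")],
    )
    print(entry.is_rule_based(), entry.is_attested())
```

An entry for Geyst would instead carry the pattern ei → ey in `patterns`, while frug has none and depends entirely on the manually assigned modern form and lemma.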
19. Manually verified IR-lexicon: Compilation
Web-based, collaborative user interface
User support:
– For rule based variants: Suggestion of the corresponding modern word
form by the hypothetic lexicon
– Suggestion of all possible lemmas for the modern word form by a large
modern lexicon (CISLEX)
– Concordance list of the historical variant
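A concordance list shows each occurrence of a historical variant with its surrounding words, which helps an annotator decide on the reading. The sketch below is a generic keyword-in-context routine, not the tool's own code; the function name `concordance`, the `width` parameter, and the sample sentence are made up for illustration.

```python
def concordance(tokens, target, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    # Invented token sequence, used only to exercise the function.
    sample = "der geyst des herrn und der geyst der warheit".split()
    for left, key, right in concordance(sample, "geyst", width=2):
        print(f"{left:>10} [{key}] {right}")
```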
20. Manually verified IR-lexicon: Status
41,600 entries have been created for 24,800 historical word forms
from the diachronic corpus; 72,100 attestations were annotated.
IMPACT partners in Slovenia and Bulgaria are creating corresponding
lexica with an adapted version of the tool.
21. Thank you.