Here are the steps to verify attestations in a historical dictionary using the IMPACT Dictionary Attestation Tool:
1. Select a headword or lemma to verify attestations for.
2. The tool retrieves candidate attestations of the headword and its variants from the quotations in the electronic historical dictionary.
3. Review each potential attestation quote to confirm it is a true match to the headword or one of its variants.
4. Mark attestations as verified matches or reject incorrect ones.
5. Additional information, such as the source document and the position of the match in the quotation, is recorded to allow checking the source.
6. The verified attestations are then added to the lexicon entry for the headword.
Bratislava WS - Depuydt - INL - lexicon building_pdf
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
A gentle introduction to lexicon building and lexicon application
Katrien Depuydt (Institute for Dutch Lexicology, Leiden)
Outline
What is a lexicon
Lexica in IMPACT
Lexicon building and lexicon application tools
Results so far with focus on Dutch
IMPACT workshop, Bratislava, May 7, 2010 2
What is a lexicon?
Lexicon vs. electronic dictionary (1)
An electronic dictionary has
Of course, digitized full text (no images)
Primarily: for human use
Ideally: searchable with explicitly (XML) tagged information
lemma, Part of speech, meaning, quotations etc.
Example: online Oxford English Dictionary
Dictionary XML (example)
Lexicon vs. electronic dictionary (2)
A computational lexicon is
Of course, in structured digital format (XML, relational database)
Primarily for use in computer applications
Has explicitly coded information
(e.g. lemma, part of speech, morphology, semantics, syntax…).
Used for, for instance:
Linguistic annotation
Enhanced retrieval (basic: inflected forms; advanced: synonyms etc.)
Syntactic parsing, machine translation
Lexica in IMPACT
The OCR lexicon
An OCR lexicon is
A verified list of words in a language
Based on a corpus, dated to enable relevant selection
Preferably with frequency information
Preferably from the same period/text type as the documents you want OCR’d (selection!)
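As a minimal sketch of how such a word list could be derived, the following Python snippet counts word forms in a small dated corpus and keeps only documents from a chosen period. The corpus structure, the function name `build_ocr_lexicon` and the sample sentences are illustrative assumptions, not part of the IMPACT tooling, and the manual verification step is not modelled.

```python
import re
from collections import Counter

def build_ocr_lexicon(documents, first_year, last_year):
    """Count word forms in documents dated within [first_year, last_year].

    `documents` is a list of (year, text) pairs; this structure is a
    hypothetical stand-in for a dated historical corpus.
    """
    counts = Counter()
    for year, text in documents:
        if first_year <= year <= last_year:
            # Letters only, allowing word-internal hyphens and apostrophes.
            counts.update(re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower()))
    return counts

# Invented sample sentences, for illustration only.
corpus = [
    (1637, "hy heeft het wechgenomen ende wech-gevoert"),
    (1650, "sy hebben het wechgevoert"),
    (1890, "hij heeft het weggevoerd"),
]
lexicon = build_ocr_lexicon(corpus, 1600, 1700)
print(lexicon.most_common(3))
```

Restricting the date range is what implements the "selection" point: the 1890 spelling never enters a lexicon meant for 17th-century material.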
OCR lexicon example
From WNT attestation lexicon      From DBNL historical corpus
Word form       Freq.             Word form       Freq.
absoluut        8                 wechgerukt      5
absoluyt        2                 wechgeschickt   6
absoluyter      1                 wechgeven       6
absolveren      3                 wech-gevoerde   11
absolverende    1                 wechgevoerde    14
absorbeeren     1                 wech-gevoert    59
absorbeert      1                 wechgevoert     98
absorberen      1                 wechgeworpen    21
absorptie       3                 wechghenomen    12
absoute         2                 wechghevoert    7
abstineeren     1                 wechginck       5
abstinencie     1                 wechloopen      6
abstinentie     2                 wechneemt       11
abstineren      1                 wechneme        6
abstrackheyt    1                 wech-nemen      20
abstract        7                 wechnemen       74
abstracta       1                 wechneminge     12
abstracte       7                 wech-neminge    6
abstracten      4                 wechrapen       6
abstractheid    1                 wechrucken      6
abstractie      1                 wechruiming     7
abstractiën     1                 wecht           7
The IR lexicon
IR lexicon: main information categories:
wordforms (list of words) +
- frequency information
- quotations (dated sources) from corpora or electronic dictionaries
- MODERN LEMMA (cf. dictionary entry) assigned to spelling variants and morphological variants of the same word
The modern lemma forms are the main search keys for retrieval
This is standard practice in corpus linguistics and modern historical lexicography
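To illustrate how a modern lemma can act as the search key, here is a minimal Python sketch of query expansion. The dictionary `IR_LEXICON`, the functions `expand_query` and `search`, and the sample documents are hypothetical; only the word forms of wereld are taken from the variant lists in this presentation.

```python
# Hypothetical miniature IR lexicon: modern lemma -> attested word forms.
IR_LEXICON = {
    "wereld": {"wereld", "werelt", "weerelt", "waereld", "werelden"},
}

def expand_query(lemma, lexicon=IR_LEXICON):
    """Return every word form to search for when the user types a modern lemma."""
    # Unknown lemmata fall back to the literal query term.
    return lexicon.get(lemma, {lemma})

def search(lemma, documents):
    """Return the indices of documents containing any form of the lemma."""
    forms = expand_query(lemma)
    return [i for i, doc in enumerate(documents)
            if forms & set(doc.lower().split())]

docs = ["De werelt is groot", "Een nieuwe wereld", "Het boek der boeken"]
print(search("wereld", docs))
```

A user who types the modern form thus also retrieves the document that only contains the historical spelling werelt.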
<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
<lexical_entry><lemma_id>219490</lemma_id>
<modern_lemma>aantuilen</modern_lemma>
<gloss></gloss>
<POS>VRB</POS>
<ne_label></ne_label>
<language_id></language_id>
<portmanteau_lemma_id></portmanteau_lemma_id>
<wordform><form_representation>
<wordform_id>850026</wordform_id>
<written_form>tuyld</written_form>
<attestation><id>92141</id>
<token_id></token_id>
<quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen:
Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
<derivation_id>0</derivation_id>
<document_id>204</document_id>
<start_pos>119</start_pos>
<end_pos>124</end_pos>
</attestation>
</form_representation>
</wordform>
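A lexicon in this format can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on an abridged copy of the entry above: closing tags were added so that the fragment parses, the quote was shortened, and the position fields were left out because they no longer fit the shortened quote. The function name `read_entries` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the entry above, with closing tags added so that it parses.
ENTRY = """
<lexicon>
  <lexical_entry><lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <POS>VRB</POS>
    <wordform><form_representation>
      <wordform_id>850026</wordform_id>
      <written_form>tuyld</written_form>
      <attestation><id>92141</id>
        <quote>Sy acht het boertery, en tuyld daer weer op an</quote>
        <document_id>204</document_id>
      </attestation>
    </form_representation></wordform>
  </lexical_entry>
</lexicon>
"""

def read_entries(xml_text):
    """Yield (modern lemma, part of speech, written form, quote) tuples."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("lexical_entry"):
        lemma = entry.findtext("modern_lemma")
        pos = entry.findtext("POS")
        for rep in entry.iter("form_representation"):
            form = rep.findtext("written_form")
            for att in rep.iter("attestation"):
                yield lemma, pos, form, att.findtext("quote")

for row in read_entries(ENTRY):
    print(row)
```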
How to build and apply these lexica?
Lexicon building
Build a lexicon with the aims of
Benefiting OCR and OCR postcorrection
Improving retrieval by building a lexicon of variants with the modern lemma as the main entry key
Tools for lexicon building
Tools for using the lexicon (lexicon deployment) for enrichment
Lexicon cookbook
Best practice and tools to use lexica in OCR
!!! No lexicon will ever contain all variants found in historical text
Types of variation (orthographical and other)
I
uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken
uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken
d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke
uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke
uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken
uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke
uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick
uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic
uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk
(most of these can be dealt with by means of patterns)
II
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt
wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys
swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt
worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde
tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
(some of these can be dealt with by patterns and/or fuzzy matching, others can only be handled by explicit listing)
The “hypothetical” vs. the witnessed lexicon (1)
Mechanisms
- to extend the lexicon
- to assess the plausibility of “hypothetical” words without previous attestations, i.e. words we have not seen before.
The “hypothetical” vs. the witnessed lexicon (2)
Unknown inflected forms of registered lemmata: automatic
expansion from the lemma to the full paradigm of word forms:
paradigmatic expansion or reverse lemmatization
New spellings of known words can be dealt with by developing
a good model of the historical spelling.
(The database structure provides for the storage of
orthographic variant patterns.)
Previously unseen compounds can be dealt with by means of
a good model of word formation. (work scheduled for 2010)
[Diagram: virtual lexicon of generated word forms. Witnessed and hypothetical modern words are linked by transformation patterns to historical variants (Historical Variant 1, Historical Variant 2).]
What is needed for lexicon building
Build models of linguistic variation (inflection, orthography)
Collect variants
Approach
Cycle: model helps to construct lexicon, and vice versa (induction of
rules/patterns)
Combination of manual work and computational linguistics
Lexicon building toolkit to support development, containing both
computational linguistic tools and tools supporting manual work
Cf. Computational Tools and Lexica to Improve Access to Text, Jesse de Does, Katrien Depuydt
Spelling variation tools (pattern-based)
Language-independent approach:
Supervised rule (pattern) induction from pairs (“modern” word, historical word), yielding patterns like aa/ae, s/z, ….
Pattern weights are computed from example material
Additional approaches possible:
Use of aligned data (parallel historical text and modern version)
Unsupervised pattern weighting (≈ text profiling from TR5)
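A rough sketch of supervised pattern induction from word pairs is given below. It aligns each pair with `difflib.SequenceMatcher` from the Python standard library and counts the substitutions it finds; both the aligner and the use of relative frequency as a weight are assumptions made for brevity, since the slides do not specify the actual method. Note that this sketch yields context-free patterns such as a/e, whereas the patterns mentioned above (aa/ae) carry context.

```python
from collections import Counter
from difflib import SequenceMatcher

def induce_patterns(pairs):
    """Count substitution patterns (modern, historical) seen in aligned word pairs.

    Alignment uses difflib's longest-matching-block heuristic, which is an
    assumption; the slides do not name the aligner that was used.
    """
    counts = Counter()
    for modern, historical in pairs:
        matcher = SequenceMatcher(None, modern, historical, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # An empty string on one side represents an insertion or deletion.
                counts[(modern[i1:i2], historical[j1:j2])] += 1
    total = sum(counts.values())
    # Relative frequency serves as a simple pattern weight.
    return {pattern: n / total for pattern, n in counts.items()}

# Invented training pairs (modern word, historical word).
pairs = [("jaar", "jaer"), ("maar", "maer"), ("zee", "see"), ("zon", "son")]
weights = induce_patterns(pairs)
print(weights)
```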
Lemmatization
Reduction of historical word forms to modern lemma
Historical word → (1) pattern matching → standard (“modern”) spelling → (2) lemmatizer → lemma form
Dystels → (1) → distels → (2) → distel
When we have a perfect or near-perfect modern full form
lexicon, the second step is simply lexicon lookup.
But:
1) We will not have full form information for many lemmata
(especially the historical ones)
2) Even lemmata present in modern language may have historical
inflected forms different from the present-day paradigm
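The two steps can be sketched as follows, assuming a tiny set of spelling rules and a toy full form lexicon. `SPELLING_RULES`, `FULL_FORM_LEXICON` and both function names are hypothetical, and applying every rule unconditionally is a deliberate simplification.

```python
import re

# Hypothetical spelling rules (historical -> modern) and a toy full form lexicon.
SPELLING_RULES = [(r"y", "i"), (r"ck", "k")]
FULL_FORM_LEXICON = {"distels": "distel", "distel": "distel"}

def modernize(word):
    """Step 1: rewrite a historical spelling into a candidate modern spelling."""
    word = word.lower()
    for pattern, replacement in SPELLING_RULES:
        word = re.sub(pattern, replacement, word)
    return word

def lemmatize(word):
    """Step 2: look the modernized form up in the full form lexicon."""
    # Returns None for unknown forms, which is exactly the case the two
    # caveats above describe.
    return FULL_FORM_LEXICON.get(modernize(word))

print(lemmatize("Dystels"))
```

When the lookup returns `None`, the form is either missing from the full form lexicon or follows a historical paradigm, which motivates the reverse lemmatization discussed next.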
Lemmatization and reverse lemmatization
We also need a lemmatization process for these situations
A typical lemmatizer assigns some standard form (infinitive,
nominative, stem) to inflected forms. Usually based on patterns
relating the inflected form to the standard form.
But:
Matching these patterns can be hard to combine with matching
both spelling variation patterns and OCR errors
(bok/bokken/bokkeu)
We adopt the solution of actually constructing a “hypothetical modern full form lexicon” containing the most plausible paradigmatic expansions of lemmata
This construction is carried out by means of a statistical reverse lemmatizer
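The idea of reverse lemmatization can be illustrated with a small frequency-based sketch: learn ending rewrites from known (lemma, form) pairs and apply the most frequent ones to a new lemma. This is only an assumed simplification, not the statistical reverse lemmatizer used in the project; the function names and training pairs are invented.

```python
from collections import Counter
from os.path import commonprefix

def learn_suffix_rules(examples):
    """Count (lemma ending -> form ending) rewrites in (lemma, form) pairs."""
    rules = Counter()
    for lemma, form in examples:
        stem = commonprefix([lemma, form])
        # Keep one character of context so that rules are not fully unconditional.
        cut = max(len(stem) - 1, 0)
        rules[(lemma[cut:], form[cut:])] += 1
    return rules

def expand(lemma, rules, top_n=3):
    """Generate the most plausible hypothetical forms of a lemma, best first."""
    candidates = Counter()
    for (lemma_end, form_end), count in rules.items():
        if lemma.endswith(lemma_end):
            candidates[lemma[: len(lemma) - len(lemma_end)] + form_end] += count
    return [form for form, _ in candidates.most_common(top_n)]

training = [("bok", "bokken"), ("rok", "rokken"), ("stok", "stokken"), ("boek", "boeken")]
rules = learn_suffix_rules(training)
print(expand("sok", rules))
```

The generated forms are hypothetical by construction: the second candidate for sok is not a real word, which is why such forms still need attestation in real text.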
Attestation
From hypothetical (non-witnessed) lexicon content to attested word
forms in “real” text
Automatic selection of candidate attestations
Manual work: verification and correction
Two approaches
Dictionary based (INL): Woordenboek der Nederlandsche Taal
Corpus based (LMU, INL): Dutch DBNL corpus
IMPACT Dictionary Attestation Tool
Lexicon building at work: Verifying attestations in historical dictionaries
Task
Find the variants of a headword as they occur in the quotations
Headword: work
Quotations:
• We are working on what works.
• Depart from me, ye that worke iniquity.
• She worcketh knittinge of stockings.
Variants: the forms of the headword as they occur in the quotations (working, works, worke, worcketh)
IMPACT Dictionary Attestation Tool
Task
Find the variants of a headword as they occur in the quotations
Input: electronic historical dictionary (database with lemmata and quotations)
Automatically (preprocessing)
• match literally
e.g.: work → work, Work
• match using existing lexica and lists
e.g.: work → works, worked, wrought
• approximate matching
e.g.: work → worke
By hand (using the tool)
• correct automatic mismatches
e.g.: works wrongly matched to words, worms
• find missed matches
e.g.: work → worketh, wrowght
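The automatic preprocessing can be pictured as a cascade of increasingly tolerant matchers. The sketch below is an assumption about how such a cascade might look, not the tool's implementation: `KNOWN_FORMS` is a hypothetical lexicon, and the similarity ratio from Python's `difflib` stands in for whatever approximate matching the tool really uses.

```python
from difflib import SequenceMatcher

# Hypothetical list of known forms per headword (an "existing lexicon").
KNOWN_FORMS = {"work": {"works", "worked", "wrought"}}

def classify(headword, token, threshold=0.8):
    """Return how a quotation token matches the headword, or None."""
    word = token.lower()
    if word == headword:
        return "literal"
    if word in KNOWN_FORMS.get(headword, set()):
        return "lexicon"
    # Similarity ratio as a simple stand-in for approximate matching.
    if SequenceMatcher(None, headword, word).ratio() >= threshold:
        return "approximate"
    return None

def candidate_attestations(headword, quotation):
    """List (token, match type) pairs for the tokens that match the headword."""
    tokens = [t.strip(".,;:!?") for t in quotation.split()]
    return [(t, kind) for t in tokens if (kind := classify(headword, t))]

print(candidate_attestations("work", "Depart from me, ye that worke iniquity."))
```

With this threshold, worke is found automatically while worketh is missed, which mirrors the division of labour above: the manual pass exists to add such missed matches and to reject false ones.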
IMPACT Attestation Tool
[Screenshot of the tool, with callouts:]
Up-to-date overview of what is done and what needs to be done
Done by this user so far
Lemma headword
Quotations, sorted by uncertainty
IMPACT Lexicon Tool
Task
Find and verify attestations in a historical corpus
Automatically (preprocessing = apply lemmatizer)
• match literally
e.g.: work → work, Work
• match using existing lexica and lists
e.g.: work → works, worked, wrought
• matching using spelling variation module
e.g.: uiterlijk → uyterlick
By hand (using the tool)
• assign correct lemma
e.g.: was (N) / zijn (V)
• group tokens belonging together
e.g.: konings zoon → koningszoon
• select attestations
Corpus-based lexicon building: Impact Lexicon Tool
General vocabulary vs. Named entities
Tools for lexicon building described so far: applicable to general lexicon
Tools for NE recognition, classification and variant matching
- library requirement
- distinguish general vocabulary from NEs
- avoid unpleasant mix-ups like Abimelech → apemelk!
(b/p; i/e; e/0; k/ch)
Improvement of state of the art / innovation
We use existing computational linguistic approaches, but figure out
how to apply them to historical language
We develop a workflow to deal with the problems posed by historical
language, figuring out how all pieces fit together
Data selection and acquisition
Manual work
Computational linguistics tools
Some results so far with focus on Dutch
Measuring results for Dutch
We use the ground truth data developed in the project
Evaluation of EE tools
Evaluation of lexicon coverage
Evaluation of lexicon usage in IR (2010)
Evaluation of OCR and lexicon usage in OCR (2010)
Evaluation of benefit of lexicon building for OCR (for which type
of material / quality of OCR does this make sense) (2010-11)
Dutch ground truth data
Type and genre                    # words
Gold Standard Book                300k
Random Set Book                   340k
Random Set Staten Generaal        2.5M
Gold Standard Staten Generaal     500k
Gold Standard Newspapers 1        3.4M
Gold Standard Newspapers 2        170k
Random Set Newspapers             3.2M
total                             13.1M
Efficiency of lexicon building
Dictionary-based lexicon building using historical dictionary:
Woordenboek der Nederlandsche Taal
Lemmata: 220211, quotations: 1524366
Tempo: 1725 quotations/hour; 231 lemmata/hour
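For a feel of the scale, the figures above can be turned into a back-of-the-envelope estimate of manual effort. The calculation assumes that the quoted tempo holds for the entire dictionary, which the slide does not state, so the results are only indicative.

```python
# Figures from the slide above.
LEMMATA = 220_211
QUOTATIONS = 1_524_366
QUOTATIONS_PER_HOUR = 1_725
LEMMATA_PER_HOUR = 231

def hours_needed(items, items_per_hour):
    """Hours of manual work if the whole set were processed at a constant tempo."""
    return items / items_per_hour

print(f"by quotations: {hours_needed(QUOTATIONS, QUOTATIONS_PER_HOUR):.0f} hours")
print(f"by lemmata:    {hours_needed(LEMMATA, LEMMATA_PER_HOUR):.0f} hours")
print(f"quotations per lemma: {QUOTATIONS / LEMMATA:.1f}")
```

Both routes give an estimate in the range of roughly 880 to 950 hours, with an average of about seven quotations per lemma.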
Reverse lemmatization
Reminder: build hypothetical (non-attested) word forms in a “quick
and dirty” way to use in lemmatization and corpus-based lexicon
building
Using simple statistical algorithms and a simple approach to
inflection
Results:
Lexicon                               Accuracy
Small Dutch lexicon (JVKlex)          96.6%
French lexicon (Morphalou)            99.4%
Polish lexicon, verbs (Morfologik)    98.7%
Lexicon coverage (1: ground truth books)
Lexicon                                                          Type coverage   Token coverage
Modern lexicon (e-Lex)                                           46%             76%
EE3.3                                                            56%             84%
1+2                                                              63%             89%
Type frequency list, historical corpus, top 200K (freq >= 19)    70%             93%
Type frequency list, historical corpus, top 500K (freq >= 5)     78%             95%
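The difference between the two measures in these tables can be made concrete with a short sketch: type coverage counts each distinct word form once, token coverage counts every occurrence. The function name `coverage` and the sample sentence are assumptions for illustration.

```python
from collections import Counter

def coverage(tokens, lexicon):
    """Return (type coverage, token coverage) of a lexicon over a token list."""
    counts = Counter(tokens)
    covered_types = [t for t in counts if t in lexicon]
    type_cov = len(covered_types) / len(counts)
    token_cov = sum(counts[t] for t in covered_types) / len(tokens)
    return type_cov, token_cov

# Invented sample text and lexicon.
tokens = "de werelt is de werelt ende de zee is groot".split()
lexicon = {"de", "is", "zee", "groot"}
type_cov, token_cov = coverage(tokens, lexicon)
print(f"type coverage {type_cov:.0%}, token coverage {token_cov:.0%}")
```

Token coverage is normally the higher of the two, as in the tables, because frequent words are the ones most likely to be in a lexicon.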
Lexicon coverage (2: gt newspapers 18th-19th c.)
Lexicon                                                 Type coverage   Token coverage
Modern lexicon (e-Lex)                                  40%             83%
EE3.3                                                   41%             84%
1+2                                                     51%             89%
Type frequency list, historical corpus, top 200K        52%             93%
Type frequency list, historical corpus, top 500K        62%             95%
Lexicon coverage (3: gt Parl. Papers 19th c.)
Lexicon                                                 Type coverage   Token coverage
Modern lexicon (e-Lex)                                  51%             89%
EE3.3                                                   47%             88%
1+2                                                     58%             93%
Type frequency, historical corpus, top 200K             59%             96%
Type frequency, historical corpus, top 500K             68%             97%
Lexicon coverage (4: gt Parl. Papers 20th c.)
Lexicon                                                 Type coverage   Token coverage
Modern lexicon (e-Lex)                                  70%             93%
EE3.3                                                   66%             93%
1+2                                                     76%             96%
Type frequency, historical corpus, top 200K             74%             97%
Type frequency, historical corpus, top 500K             81%             98%
Lexicon coverage (5: Genesis, 1637 bible)
Lexicon                                                 Type coverage   Token coverage
Modern lexicon (e-Lex)                                  31%             61%
EE3.3                                                   62%             83%
1+2                                                     65%             89%
Type frequency, historical corpus, top 200K             76%             97%
Type frequency, historical corpus, top 500K             87%             98.6%
Lexicon coverage (6: Hooft, historiën)
Lexicon                                                 Type coverage   Token coverage
Modern lexicon (e-Lex)                                  26%             67%
EE3.3                                                   47%             88%
1+2                                                     50%             90%
Type frequency, historical corpus, top 200K             44%             93%
Type frequency, historical corpus, top 500K             58%             96%
Conclusion from this evaluation
Evident next step for Dutch lexicon building is corpus based work
First target: cover the top 200000 from the historical corpus.
– Contains 97885 types not in the witnessed historical EE3.3
lexicon
– Roughly 24% of these are covered by the modern lexicon
– Roughly 25% are names
– This leaves about 45000 common words to look into.
Measuring effect of lexicon use in IR
Example: improved recall for retrieval in a historical corpus of about 150 million tokens. Using only the modern lexicon, wereld yields 23396 hits; using the current EE3.3 lexicon, we get 34339 hits.
Simple IR will be part of the demonstrators
Hard to evaluate IR results proper without special datasets
Up to now we have measured either lemmatization accuracy or modern-to-historical word form matching accuracy
Lemmatization
Combination of lookup, matching of spelling variation, reverse
lemmatization
As yet no good evaluation set for IMPACT (current work)
Evaluation on “type” level
We will use other material here (1637 Genesis, 97144 tokens)
Approach
Restrict to “ordinary words” (no names, numbers, clitic
combinations)
Ambiguous lemmatization (context is not used) (avg. 5
suggestions per word)
Ranking based on frequency and pattern weights
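How frequency and pattern weights might be combined into a ranking is sketched below. The scoring formula (log frequency multiplied by the product of the pattern weights) is purely an assumption for illustration, as are the candidate lemmata and their numbers; the slides only state that both sources of evidence are used.

```python
import math

def rank_suggestions(candidates):
    """Sort candidate lemmata for one historical word, best first.

    Each candidate is (lemma, corpus frequency, weights of the spelling
    patterns applied). The scoring formula is an assumed example.
    """
    def score(candidate):
        _, frequency, pattern_weights = candidate
        # Exact lexicon hits apply no patterns, so their product stays 1.0.
        return math.log1p(frequency) * math.prod(pattern_weights)
    return [lemma for lemma, _, _ in sorted(candidates, key=score, reverse=True)]

# Hypothetical candidates and values for the historical form "weerelt".
candidates = [
    ("weren", 120, [0.1, 0.2, 0.3]),   # three weak patterns applied
    ("wereld", 5000, [0.6, 0.5]),      # two stronger patterns applied
]
print(rank_suggestions(candidates))
```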
Result
6265 distinct types. 5991 (95.7%) had at least one correct
suggestion
Average rank of correct suggestions: 1.23
– 5222 types found in current EE3.3 (83%)
– 65 additional types in modern lexicon
– 49 types without any match
– 969 types (15%) identified with “approximate” matching using
~500 weighted patterns and returning at most 2 suggestions
Real and hypothetical lexicon coverage
(Hooft, historiën)
Result (again restricting to ‘ordinary’ words)
36332 distinct types. Avg rank of correct suggestions: 1.23
– 20087 types found in current EE3.3 (55%)
– 1061 additional types in modern lexicon
– 2411 types without any match (7%)
– 12773 types (35%) identified with “approximate” matching using
~500 weighted patterns and returning at most 2 suggestions
(Probably about 75% of the highest-ranking approximate matches
are correct)
Evaluation of TR results
Using Finereader SDK (version 9)
External dictionary interface for experimentation
Not completely straightforward how to apply this
Translation of corpus frequencies to weights on a scale 0-100
Other details: hyphenated words, case-sensitivity, …
Workaround to circumvent the long s problem
Lexicon Data used
Corpus-based type-frequency list
EE3.2 deliverable lexicon
Finereader internal lexicon
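The translation of corpus frequencies to weights on a 0-100 scale, mentioned above, could for example be done logarithmically. The mapping below is an assumption chosen so that rare words do not all collapse to weight 0; the slides do not state which mapping was used, and the sample frequencies are invented.

```python
import math

def frequency_to_weight(frequency, max_frequency):
    """Map a corpus frequency to an integer weight between 0 and 100.

    Logarithmic scaling is an assumption; the actual mapping used with the
    external dictionary interface is not given in the slides.
    """
    if frequency <= 0:
        return 0
    return round(100 * math.log1p(frequency) / math.log1p(max_frequency))

# Invented frequencies for illustration.
frequencies = {"de": 1_000_000, "wereld": 34_000, "wechruiming": 7}
top = max(frequencies.values())
weights = {word: frequency_to_weight(f, top) for word, f in frequencies.items()}
print(weights)
```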
OCR evaluation
1. Character accuracy
2. Word accuracy
3. In case of block alignment problems, a simple alternative is bag-of-words accuracy
1. and 2. presuppose a good alignment of OCR with ground truth.
We will use word accuracy, or the simpler alternative 3. when there are alignment problems
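The contrast between word accuracy and the bag-of-words alternative can be sketched as follows. Both functions are simplified assumptions (whitespace tokenization, alignment via Python's `difflib`), not the evaluation software used in the project, and character accuracy is left out.

```python
from collections import Counter
from difflib import SequenceMatcher

def word_accuracy(ground_truth, ocr):
    """Share of ground-truth words reproduced in order (needs a good alignment)."""
    gt, hyp = ground_truth.split(), ocr.split()
    matcher = SequenceMatcher(None, gt, hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(gt)

def bag_of_words_accuracy(ground_truth, ocr):
    """Share of ground-truth words found in the OCR output, ignoring order."""
    gt, hyp = Counter(ground_truth.split()), Counter(ocr.split())
    # Counter intersection keeps the minimum count of each shared word.
    return sum((gt & hyp).values()) / sum(gt.values())

truth = "de tweede de stilste en veiligste"
ocr = "de tweede de ftillie en veiligde"
print(word_accuracy(truth, ocr), bag_of_words_accuracy(truth, ocr))
```

On well-aligned text both measures agree; when text blocks come out in a different order, the order-sensitive measure drops while the bag-of-words measure is unaffected, which is why it serves as the fallback.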
OCR results
Lexicon configurations:
(a) With ABBYY internal Dutch lexicon
(b) With combination of corpus-based historical lexicon and EE3.2 deliverable (case insensitive, taking hyphenation into account)
(c) With combination of corpus-based historical lexicon and EE3.2 deliverable, improved deployment

Dataset                                                          (a)      (b)      (c)
DPO35 (word accuracy)                                            88.8%    90.9%    94.4%
Parliamentary papers, 1826-27 selection (bag of words recall)    90.9%    94.9%    94.9%
‘The Book’
“Kort begrip der waereld-historie voor de jeugd”
J.F. Martinet
Minister (predikant) at Zutphen, from 1789.
Why this book?
Representative font and amount of spelling variation etc. for late 18th century Dutch
It has the “long s problem”:
[word image] = stilste, not ftilfte
….
The long s problem: An example ….
OCR at start of project:
A. De eerde was de gevaarlykflti om de verlei¬
ding aan 't Hof; de tweede de ftillie en veiligde;
de derde de zwaarde, daar hy byna drie millioenen
harde en onbefchaafde Menfchen beftieren moest.
Results April 2010:
A. De eerste was de gevaarlykste om de verlei-
ding aan 't Hof; de tweede de stilste en veiligste;
de derde de zwaarste, daar hy byna drie millioenen
harde en onbeschaafde Menschen bestieren moest.
Workaround: “integrated postcorrection”: tell the engine that “eerfte” is OK and postcorrect it afterwards with the lexicon.
In this way we keep it from turning into “eerde”
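The postcorrection half of this workaround can be sketched as follows: once the engine is allowed to output forms such as “eerfte”, each f is tentatively read as a long s and the result is checked against the lexicon. The toy lexicon and the function name `postcorrect_long_s` are assumptions; trying every combination is exponential in the number of f's, which is acceptable for single words.

```python
from itertools import product

# Toy lexicon of accepted spellings (hypothetical sample).
LEXICON = {"eerste", "stilste", "veiligste", "onbeschaafde", "hof"}

def postcorrect_long_s(token, lexicon=LEXICON):
    """Replace 'f' by 's' wherever that turns the token into a lexicon word."""
    if token in lexicon:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Try every combination of keeping 'f' or reading it as a long s.
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate in lexicon:
            return candidate
    return token  # leave unknown words untouched

print([postcorrect_long_s(w) for w in ["eerfte", "ftilfte", "onbefchaafde", "hof"]])
```

Genuine f's survive, as in onbeschaafde and hof, because only combinations that yield a lexicon word are accepted.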
Future work
Compound analysis
Irregular historical white space use (“impacttok++”) (cf attestations)
Corpus based lexicon extension
Testing and optimization with ground truth data
Improve the TR lexicon by extending the IR lexicon and removing false friends from the DBNL-corpus based TR lexicon
Continue work on the best way to deploy lexica in OCR, with help from ABBYY