Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Uwe Springmann (1), Dietmar Najock (2), Hermann Morgenroth (2), Helmut Schmid (1), Annette Gotscharek (1) and Florian Fink (1)
(1) CIS, Ludwig-Maximilians-Universität München
(2) Institute for Greek and Latin Languages and Literatures, Freie Universität Berlin

Overview
● Why Latin?
● Problems
● Prospects
● Progress

Why Latin?
● huge heritage: largest body of historical literary sources
● Latin publications dominate print production until about 1750
● many titles have never been reprinted
● either a key or a barrier to the cultural heritage of the Western world
● has been left out of the IMPACT project despite its importance

Some problems for OCR engines
● historical fonts
● long s (ſ)
● historical ligatures: Æ, æ, Œ, œ, ﬆ
● polytonic Greek words
● diacritics
● abbreviations
● historical spellings
Problems

Some problems for OCR engines (continued)
● historical typography and spelling are also a problem for early modern languages
● because abbreviations are ambiguous (especially in incunabula), OCR will not immediately yield fully expanded, machine-readable text
● but discretionary diacritics are helpful in POS/morphology disambiguation:
  – adverb/vocative: altè/alte
  – adverb/pronoun: quàm/quam
  – conjunction/preposition: cùm/cum
  – ablative/nominative: hastâ/hasta
Problems
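
The diacritic pairs on this slide can be read as a small lookup table. The sketch below is only an illustration under stated assumptions: the names `MARKED`, `UNMARKED` and `reading` are hypothetical, the table contains just the four pairs from the slide, and it assumes a printing that uses the marks consistently, so that the unmarked form takes the second reading of each pair.

```python
import unicodedata

# (marked form, unmarked form, reading when marked, reading when unmarked)
# Only the four pairs listed on the slide; a real table would be much larger.
_PAIRS = [
    ("altè", "alte", "adverb", "vocative"),
    ("quàm", "quam", "adverb", "pronoun"),
    ("cùm", "cum", "conjunction", "preposition"),
    ("hastâ", "hasta", "ablative", "nominative"),
]


def _nfc(text):
    """Compose base letter + combining accent so lookups are stable."""
    return unicodedata.normalize("NFC", text)


MARKED = {_nfc(marked): (base, tag) for marked, base, tag, _ in _PAIRS}
UNMARKED = {base: tag for _, base, _, tag in _PAIRS}


def reading(token):
    """Return (base form, reading); reading is None for unknown tokens."""
    key = _nfc(token).lower()
    if key in MARKED:
        return MARKED[key]
    if key in UNMARKED:
        return key, UNMARKED[key]
    return token, None
```

For example, `reading("cùm")` yields the base form `cum` tagged as a conjunction, while `reading("cum")` is tagged as a preposition. This only works if the OCR engine preserves the diacritic in the first place.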

State of the art – example pages
Prospects
[Figure: example pages from printings of 1544, 1649 and 1779]

State of the art – results for example pages
Prospects
Year   Abbyy FR 11.1   Tesseract 3.03   OCRopus 0.7
1544   83.14           70.32            74.59
1649   88.07           84.87            78.98
1779   82.13           80.77            75.46

Character accuracy in %.
Out-of-the-box performance, no language model (or default = English).
OCRopus is hampered by bad image-text segmentation.
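
The slides do not spell out how character accuracy is computed. A common definition, assumed here, is 100 × (1 − edit distance / ground-truth length). The following minimal sketch implements that definition; the function names are introduced for illustration only and are not taken from any of the evaluated tools.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_output):
    """Character accuracy in percent under the assumed definition."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = levenshtein(ground_truth, ocr_output)
    return 100.0 * (1.0 - errors / len(ground_truth))
```

Under this definition a single missed diacritic in a four-letter word (ground truth `quàm`, OCR output `quam`) already costs 25 percentage points for that word, which shows why historical glyphs weigh heavily on the scores.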

Prospects
Overcoming the obstacles
● Training (Tesseract, OCRopus)
  – (a) generate pseudo-historical images from existing texts and historical-looking computer fonts (add some degradation to the image)
  – (b) transcribe some real pages and train on true historical fonts
● Lexical resources (Tesseract) in recognition
● Post-processing
  – correct OCR errors, not historical spelling (which may be of interest in itself)
  – add annotation: expand abbreviations, ligatures, normalize spelling
  – helpful: language model, lexicon of historical word forms
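
The post-processing idea above (keep the historical form, add a normalized annotation, consult a lexicon) can be sketched as follows. Everything in the sketch is an assumption made for illustration: the `EXPANSIONS` table, the `annotate` function, the record layout and the similarity cutoff are hypothetical and do not describe the authors' tools. Candidate lookup uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib

# Hypothetical expansion table; a real one would cover many more glyphs
# and abbreviation signs.
EXPANSIONS = {"ſ": "s", "æ": "ae", "Æ": "Ae", "œ": "oe", "Œ": "Oe", "ﬆ": "st"}


def normalize(token):
    """Expand long s and ligatures character by character."""
    return "".join(EXPANSIONS.get(char, char) for char in token)


def annotate(token, lexicon, cutoff=0.7):
    """Keep the OCR token as printed and add annotation layers next to it."""
    normalized = normalize(token)
    known = normalized.lower() in lexicon
    suggestions = []
    if not known:
        # Candidates only; accepting one is left to a later (manual) step.
        suggestions = difflib.get_close_matches(
            normalized.lower(), sorted(lexicon), n=3, cutoff=cutoff
        )
    return {
        "ocr": token,
        "normalized": normalized,
        "in_lexicon": known,
        "suggestions": suggestions,
    }
```

With a toy lexicon `{"est", "esse", "causa"}`, the token `eſt` is annotated with the normalized form `est` and needs no correction, while a misrecognized `cauſn` gets `causa` as a suggestion. The original token is never overwritten, in line with the point that historical spelling should not be "corrected".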

Progress
Postcorrection: open-source tool PoCoTo
(see the paper by Vobl et al., presented by Christoph Ringlstetter)

Progress
Training on historical fonts (artificial images)
Example: Pontanus, Progymnasmata Latinitatis (1589)
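
The slides do not say which degradation was applied to the artificial training images. As one simple possibility, the sketch below flips random pixels of a binary image (salt-and-pepper noise). The `degrade` function and the tiny `GLYPH` bitmap are hypothetical stand-ins; a real pipeline would render whole text lines with a historical-looking font and would likely add further distortions.

```python
import random


def degrade(bitmap, flip_probability=0.05, seed=None):
    """Return a noisy copy of a binary bitmap (1 = ink, 0 = paper)."""
    rng = random.Random(seed)  # seeded generator keeps runs reproducible
    return [
        [1 - pixel if rng.random() < flip_probability else pixel for pixel in row]
        for row in bitmap
    ]


# A tiny hand-drawn glyph standing in for a rendered text line.
GLYPH = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]

noisy = degrade(GLYPH, flip_probability=0.1, seed=42)
```

The input bitmap is left unchanged, so the same clean rendering can be degraded several times with different seeds to enlarge the training set.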

Progress
Training on fonts, ideal lexicon
Example: Pontanus, Progymnasmata Latinitatis (1589)
Page   Abbyy FR 11.1   Tesseract 3.03   OCRopus 0.7   Tesseract (font)   Tesseract (font + lex.)   OCRopus (font)
15     87.79           80.88            80.70         91.02*             93.90*                    92.55*
16     82.94           77.41            76.94         80.12              85.65*                    80.47
17     85.25           75.98            86.07*        85.41*             91.56*                    93.93*
18     85.93           79.51            85.53         88.29*             92.68*                    89.67*
19     87.94           80.09            79.09         86.06              90.15*                    87.83

Character accuracy in %.
OCRopus: no language model!
*: accuracy better than Abbyy
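
The comparison against Abbyy can be checked mechanically. The sketch below copies the accuracies from the table for the Pontanus pages (written with decimal points) and lists, per page, the configurations that score above Abbyy. The names `SYSTEMS`, `ACCURACY` and `better_than_abbyy` are introduced here for illustration.

```python
# Column order of the table; the first entry is the Abbyy baseline.
SYSTEMS = [
    "Abbyy FR 11.1", "Tesseract 3.03", "OCRopus 0.7",
    "Tesseract (font)", "Tesseract (font + lex.)", "OCRopus (font)",
]

# Character accuracy in % per page, in the column order given above.
ACCURACY = {
    15: [87.79, 80.88, 80.70, 91.02, 93.90, 92.55],
    16: [82.94, 77.41, 76.94, 80.12, 85.65, 80.47],
    17: [85.25, 75.98, 86.07, 85.41, 91.56, 93.93],
    18: [85.93, 79.51, 85.53, 88.29, 92.68, 89.67],
    19: [87.94, 80.09, 79.09, 86.06, 90.15, 87.83],
}


def better_than_abbyy(page):
    """Names of all configurations whose accuracy exceeds Abbyy's on a page."""
    baseline, *others = ACCURACY[page]
    return [name for name, value in zip(SYSTEMS[1:], others) if value > baseline]
```

Only the font-trained Tesseract with lexicon beats Abbyy on all five pages; on pages 16 and 19 it is the only configuration that does.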

Progress
Training on historical fonts (real images)
Example: Thanner, Petronij Arbitri Sathyra (1500)
Page   Tesseract 3.03   OCRopus 0.7   OCRopus (trained)
13     41.59            44.59         93.15
14     52.38            57.77         94.61
15     53.09            62.38         95.17
16     59.09            61.45         93.27

Character accuracy in %.
Pages 1-12: training set; pages 13-16: test set.
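
Accuracy gains of this size are easier to judge as a reduction of the character error rate (100 minus accuracy). The sketch below copies the test-page figures from the table and computes the relative error reduction between two accuracy values; `RESULTS` and `relative_error_reduction` are hypothetical names used only for this illustration.

```python
# page: (Tesseract 3.03, OCRopus 0.7, OCRopus trained), character accuracy in %
RESULTS = {
    13: (41.59, 44.59, 93.15),
    14: (52.38, 57.77, 94.61),
    15: (53.09, 62.38, 95.17),
    16: (59.09, 61.45, 93.27),
}


def relative_error_reduction(accuracy_before, accuracy_after):
    """Fraction of the original character errors removed (1.0 = all of them)."""
    error_before = 100.0 - accuracy_before
    error_after = 100.0 - accuracy_after
    if error_before == 0:
        raise ValueError("no errors to reduce")
    return (error_before - error_after) / error_before


# Compare untrained and trained OCRopus on every test page.
reductions = {
    page: relative_error_reduction(untrained, trained)
    for page, (_, untrained, trained) in RESULTS.items()
}
```

For page 13, the error rate drops from 55.41% to 6.85%, so training on the first twelve pages removes roughly 88% of the character errors of untrained OCRopus on that page.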

Progress
Summary
● very old printings are hard to OCR out of the box
● Tesseract and OCRopus can be trained to results above ABBYY
● applying lexica as well as font training helps a lot
● OCRopus can be trained to accuracies > 90%, but must at present be combined with good line segmentation in a preprocessing step
● postcorrection will do the rest

Progress
Thank you for your interest!
