The document discusses OCR for typewritten documents. It describes the IMPACT project, which is supported by the European Community under the FP7 ICT Work Programme and coordinated by the National Library of the Netherlands. The presentation covers the challenges of typewritten documents for OCR, the specific approaches used in the IMPACT project's TOCR system, and some example results showing its performance.
IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
OCR for Typewritten Documents
Stefan Pletschacher
2. Overview
Introduction to Typewritten OCR
Document Types and Challenges
Specific Approaches
Results
Hansen Writing Ball, Source: Wikipedia
Stefan Pletschacher - OCR for Typewritten Documents, IMPACT Conference, London, 25.10.2011
3. (The) Short History
1870 first commercially manufactured typewriter
1970s-80s first PCs and desktop printers
Images: Sholes and Glidden typewriter, 1873; IBM 5150 PC, 1981 (Source: Wikipedia)
4. Typewritten Documents
Millions of pages of significant typewritten documents exist in archives and libraries
– Most administrative and individually produced documents of the 20th century
Typewritten documents pose unique challenges to recognition
– Each character is produced independently of the rest: glyphs can appear with different intensity/weight even within the same word
– Carbon copies are common: glyphs are blurred and connected to each other, and the background is textured
– Content: administrative documents contain names, abbreviations, numbers etc., which render lexicon-based recognition approaches less useful
In addition, the usual degradations of historical documents are present due to ageing and use
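The intensity problem named above can be illustrated with a small sketch. It is not the IMPACT method; it only shows, on made-up pixel values, why a single page-wide threshold can lose a weakly struck glyph while a threshold computed per glyph image keeps it. The function names, the midpoint rule and the grey values are all assumptions chosen for illustration.

```python
def binarise(pixels, threshold):
    """Mark pixels darker than the threshold as ink (1), others as background (0)."""
    return [[1 if value < threshold else 0 for value in row] for row in pixels]


def local_threshold(pixels):
    """Assumed rule: midpoint between the darkest and brightest value of one glyph image."""
    values = [value for row in pixels for value in row]
    return (min(values) + max(values)) / 2


def ink_pixels(binary):
    """Count the pixels that were classified as ink."""
    return sum(sum(row) for row in binary)


BACKGROUND = 220  # invented paper brightness (0 = black, 255 = white)

# Two invented 3x3 glyph images: one struck firmly, one struck weakly.
strong_glyph = [[BACKGROUND, 30, BACKGROUND]] * 3
faint_glyph = [[BACKGROUND, 170, BACKGROUND]] * 3

# One fixed threshold for both glyphs versus one threshold per glyph.
global_result = [ink_pixels(binarise(g, 128)) for g in (strong_glyph, faint_glyph)]
local_result = [ink_pixels(binarise(g, local_threshold(g))) for g in (strong_glyph, faint_glyph)]

print("global threshold:", global_result)  # the faint glyph loses all its ink
print("per-glyph threshold:", local_result)  # both glyphs keep their stroke
```

With the fixed threshold the faint glyph disappears entirely, which is one way a result can end up incomplete.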
5. Document Types and Challenges
Document types
– Manuscripts
– Scientific publications
– Index cards
– Administrative documents
– Letters
– …
Challenges
– Annotations
– Abbreviations and names
– Carbon copies (low contrast)
– Punch holes, staples etc.
– Damage from regular handling (folds, tears, stains)
– Discoloured paper (often unevenly)
6. Some Examples
7. Specific Approaches
Incorporate background knowledge about typewritten documents
Pre-processing
– Improved glyph segmentation
– Enhancement of individual glyph images
Recognition
– Perform language-independent character recognition using specifically trained classifiers
– Voting engine
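The slides do not say how the voting engine combines its inputs. As a minimal sketch, assuming each classifier returns a confidence per candidate character and each classifier carries a fixed weight, a weighted vote could look like this. The classifier names, weights and confidences below are hypothetical.

```python
from collections import defaultdict


def weighted_vote(hypotheses, weights):
    """Combine per-classifier character confidences into one decision.

    hypotheses: {classifier_name: {character: confidence}}
    weights:    {classifier_name: weight}
    Returns the winning character and the combined score table.
    """
    scores = defaultdict(float)
    for classifier, candidates in hypotheses.items():
        for character, confidence in candidates.items():
            scores[character] += weights[classifier] * confidence
    # On a tie, max() returns the candidate that was inserted first.
    winner = max(scores, key=scores.get)
    return winner, dict(scores)


# Hypothetical outputs for one glyph that could be the letter O or the digit 0.
hypotheses = {
    "template_matching": {"O": 0.6, "0": 0.4},
    "feature_based": {"0": 0.7, "O": 0.3},
}
weights = {"template_matching": 0.4, "feature_based": 0.6}

winner, scores = weighted_vote(hypotheses, weights)
print(winner, scores)
```

A vote of this kind needs no lexicon, which fits content dominated by names, abbreviations and numbers.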
8. Typewritten OCR
[Diagram: TOCR system developed in IMPACT]
Processing steps shown: Binarisation; Region Segmentation; Text Line Segmentation; Glyph Segmentation; Glyph Enhancement; Composite Character Recognition (Template Matching, Feature-based Classifier, Weights, Voting Engine); PAGE Exporter (includes word composition)
Data shown: Document Image (greyscale); Document Image (black-and-white); PAGE XML (with text regions); PAGE XML (with text lines); Glyph Elements; Enhanced Glyphs; Glyph Elements (with text); PAGE XML (completely filled)
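The last stage of the diagram, a PAGE exporter that includes word composition, can be sketched roughly as follows. This is an assumption-laden illustration, not the IMPACT exporter. Glyphs are grouped into words wherever the horizontal gap exceeds an invented threshold, and the output is a simplified PAGE-like tree with no namespace and no coordinates, so it would not validate against the real schema. The slide shows `<Region/>`; `TextRegion`, `TextLine`, `Word`, `Glyph`, `TextEquiv` and `Unicode` are used here as the corresponding PAGE element names.

```python
import xml.etree.ElementTree as ET


def compose_words(glyphs, gap_threshold):
    """Group (x_left, x_right, character) glyphs of one text line into words.

    A new word starts wherever the gap to the previous glyph exceeds
    gap_threshold (an assumed, hand-picked value in pixels).
    """
    words, current, previous_right = [], [], None
    for left, right, character in sorted(glyphs):
        if previous_right is not None and left - previous_right > gap_threshold:
            words.append(current)
            current = []
        current.append((left, right, character))
        previous_right = right
    if current:
        words.append(current)
    return words


def export_line(glyphs, gap_threshold=8):
    """Build a simplified, PAGE-like element tree for a single text line."""
    root = ET.Element("PcGts")
    page = ET.SubElement(root, "Page")
    region = ET.SubElement(page, "TextRegion")
    line = ET.SubElement(region, "TextLine")
    for word_glyphs in compose_words(glyphs, gap_threshold):
        word = ET.SubElement(line, "Word")
        for _, _, character in word_glyphs:
            glyph = ET.SubElement(word, "Glyph")
            ET.SubElement(ET.SubElement(glyph, "TextEquiv"), "Unicode").text = character
        text = "".join(character for _, _, character in word_glyphs)
        ET.SubElement(ET.SubElement(word, "TextEquiv"), "Unicode").text = text
    return root


# Invented recognised glyphs: (left edge, right edge, character).
glyphs = [(0, 10, "O"), (12, 22, "C"), (24, 34, "R"), (50, 60, "o"), (62, 72, "k")]
root = export_line(glyphs)
word_texts = [w.find("TextEquiv/Unicode").text for w in root.iter("Word")]
print(word_texts)
print(ET.tostring(root, encoding="unicode"))
```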
9. Some Results
Top: Commercial OCR
Bottom: IMPACT Typewritten OCR prototype
More complete results and thus higher overall accuracy
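The slides do not state which accuracy measure was used. One common choice, assumed here, is character accuracy based on edit distance, under which every missing character counts as an error. The strings below are invented for illustration and are not results from the presentation.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_output):
    """Assumed metric: 1 - edit distance / ground-truth length, floored at 0."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return max(0.0, 1 - edit_distance(ground_truth, ocr_output) / len(ground_truth))


ground_truth = "Index card 42"      # invented reference text
incomplete_output = "Index"         # invented output that misses most of the line
complete_output = "Index card 4Z"   # invented output with one wrong character

print(round(character_accuracy(ground_truth, incomplete_output), 3))
print(round(character_accuracy(ground_truth, complete_output), 3))
```

Under this measure an output that is error-free but incomplete scores lower than a complete output with a few substitutions.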
10. For more information visit:
PRImA
http://www.primaresearch.org
IMPACT
http://www.impact-project.eu