Session1 02.anna-maria sichani

OCR for Greek polytonic (multi
accent) historical printed documents
development, optimization and quality control
Sichani Anna-Maria, Kaddas Panagiotis, Gatos Vassilis, George Mikros

● The challenge : Greek polytonic and OCR
● Previous work
● Our goal & approach
● Data acquisition
● OCR tool development
● Graphic User interface
● Experiments - Training
● Post correction and quality control
● Conclusions - future steps
Topics covered

Greek polytonic (multi accent) scripts and
Optical Character Recognition (OCR)
technologies
● Recognition of non-Latin minority languages or scripts having a large number of
character classes is still subject of active research.
● Greek polytonic (multi accent) scripts have a large variety of diacritic marks and as a
result a large number of character classes (more than 270).
● Greek polytonic documents (both handwritten and printed) cannot be successfully
processed by the currently available OCR technologies / a huge amount of digitised
(handwritten and printed) Greek documents still remains without full text processing
and analysis capabilities.

Previous work
● (Digital) Classics / OCR for Greek polytonic (mainly Ancient Greek) -
modern fonts
○ Tesseract | Antigrapheus
○ Perseus project
● Computer Science / Machine learning
○ OCR framework combining different recognition modules
(accent recognition , recognizing characters from different text zones, text
dewarping, text line detection)
○ adaptive zoning features (to adjust the position of every zone based on
local pattern information)
○ Hidden Markov Models (HMM)

Our Approach | Goal
➔ Develop a user friendly OCR tool and workflow for Greek polytonic
historical printed texts
● COST action “Distant Reading for European Literary History(CA16204)”
distant-reading.net
● Create the 1st openly available and reusable Modern Greek Literary
Textual Collection as part of larger European Literary Textual Collection
(ELTeC) in order to apply Distant Reading methods

Data acquisition
a corpus of Modern Greek texts using polytonic scripts
● 100 19th century Greek novels, 7-400 pages
● project -specific criteria for the corpus sampling (authors' gender
representation, date of publication, length, et.c.)
● texts already digitised by various Greek institutions in pdf /jpg
formats (National Library of Greece, University libraries)
● different digitisation standards → variations in terms of quality of
digitisation
● mainly from original printed editions of the 19th and 20th century
using various typefaces and diacritics → image quality and variations
in the digitisation

OCR tool development
Proposed OCR Method (train – test phase) using image processing techniques
1. Character segmentation and recognition: all input document images are fed into an
existing OCR tool (ABBYY FineReader)
2. Manual post-procession of the results in order to improve the quality of the initial
character recognition and create an accurate training set.
3. Feature extractor algorithm, were each already segmented character is normalized
into a fixed size and then splitted into 3x3 zones. From each sub image pixel intensity is
extracted as image feature and the final descriptor for each character is the zone
feature concatenation.
4. Classification : machine learning algorithm, k-NN, in order to classify characters to the
nearest class of the train set, based on the feature values extracted.

Graphical User Interface (GUI)
* *ABBYY FineReader was used for an initial character recognition → accuracy is below 80%
in most cases → further optimization
● easy-to-use
● fast and accurate way of marking each character of a document page
● embedding user feedback (EDIT)
● optimizing initial results (discard or modify erroneous character
segmentation and recognition) → SAVE all changes into a database file

Training experiments
- training phase
● annotating 1-2 pages per document using
GUI
● database dev : 100 annotated document
pages (50k of character entries belonging to
143 unique classes)
- mass application (batch mode) for 100 novels
- performance evaluation (Levenshtein
distance)
94.2% accuracy for the best parameter of the k-
NN algorithm (k=5) ( results 98.6% -- 82.3% )
25% improvement when compared to the
initial recognition results

Post correction and quality control
➔ manually comparing & correcting the resulted OCRed files ( txt) with the
original digitised files (PDF)
➔ auxiliary resource as a look-up vocabulary to assist with and speed up the
post-correction phase
Use of corrected txt files
● evaluation ( ground truth )
● conversion to XML TEI format ( predefined schema )
Quality control and post-correction resources management
Data collection regarding
1. the time needed for the correction
2. the type and
3. the amount of errors found

Conclusions & Future steps
work-in-progress
● point out (hidden) obstacles and gaps in digitisation standards,
ground truth, post-correction
● help accessing and further processing a huge amount of Greek
historical printed polytonic documents
● enable advanced computational processes on Greek printed
polytonic textual resources
● enlarge the training set (textual resources from the 18th c+ 20th
● more sophisticated classification of characters
● update on the feature extraction mechanisms

Any questions?
Get in touch!
Many thanks!
a.sichani@sussex.ac.uk pkaddas@iit.demokritos.gr bgat@iit.demokritos.gr
@amsichani
gmikros@gmail.com

Session1 02.anna-maria sichani

Recommandé

Recommandé

Contenu connexe

Similaire à Session1 02.anna-maria sichani

Similaire à Session1 02.anna-maria sichani (20)

Plus de IMPACT Centre of Competence

Plus de IMPACT Centre of Competence (20)

Dernier

Dernier (20)

Session1 02.anna-maria sichani