SlideShare une entreprise Scribd logo
1  sur  16
Télécharger pour lire hors ligne
OCR for Greek polytonic (multi
accent) historical printed documents
development, optimization and quality control
Sichani Anna-Maria, Kaddas Panagiotis, Gatos Vassilis, George Mikros
● The challenge : Greek polytonic and OCR
● Previous work
● Our goal & approach
● Data acquisition
● OCR tool development
● Graphic User interface
● Experiments - Training
● Post correction and quality control
● Conclusions - future steps
Topics covered
Greek polytonic (multi accent) scripts and
Optical Character Recognition (OCR)
technologies
● Recognition of non-Latin minority languages or scripts having a large number of
character classes is still subject of active research.
● Greek polytonic (multi accent) scripts have a large variety of diacritic marks and as a
result a large number of character classes (more than 270).
● Greek polytonic documents (both handwritten and printed) cannot be successfully
processed by the currently available OCR technologies / a huge amount of digitised
(handwritten and printed) Greek documents still remains without full text processing
and analysis capabilities.
Previous work
● (Digital) Classics / OCR for Greek polytonic (mainly Ancient Greek) -
modern fonts
○ Tesseract | Antigrapheus
○ Perseus project
● Computer Science / Machine learning
○ OCR framework combining different recognition modules
(accent recognition , recognizing characters from different text zones, text
dewarping, text line detection)
○ adaptive zoning features (to adjust the position of every zone based on
local pattern information)
○ Hidden Markov Models (HMM)
Our Approach | Goal
➔ Develop a user friendly OCR tool and workflow for Greek polytonic
historical printed texts
● COST action “Distant Reading for European Literary History(CA16204)”
distant-reading.net
● Create the 1st openly available and reusable Modern Greek Literary
Textual Collection as part of larger European Literary Textual Collection
(ELTeC) in order to apply Distant Reading methods
Data acquisition
a corpus of Modern Greek texts using polytonic scripts
● 100 19th century Greek novels, 7-400 pages
● project -specific criteria for the corpus sampling (authors' gender
representation, date of publication, length, et.c.)
● texts already digitised by various Greek institutions in pdf /jpg
formats (National Library of Greece, University libraries)
● different digitisation standards → variations in terms of quality of
digitisation
● mainly from original printed editions of the 19th and 20th century
using various typefaces and diacritics → image quality and variations
in the digitisation
OCR tool development
Proposed OCR Method (train – test phase) using image processing techniques
1. Character segmentation and recognition: all input document images are fed into an
existing OCR tool (ABBYY FineReader)
2. Manual post-procession of the results in order to improve the quality of the initial
character recognition and create an accurate training set.
3. Feature extractor algorithm, were each already segmented character is normalized
into a fixed size and then splitted into 3x3 zones. From each sub image pixel intensity is
extracted as image feature and the final descriptor for each character is the zone
feature concatenation.
4. Classification : machine learning algorithm, k-NN, in order to classify characters to the
nearest class of the train set, based on the feature values extracted.
OCR tool development
Graphical User Interface (GUI)
* *ABBYY FineReader was used for an initial character recognition → accuracy is below 80%
in most cases → further optimization
● easy-to-use
● fast and accurate way of marking each character of a document page
● embedding user feedback (EDIT)
● optimizing initial results (discard or modify erroneous character
segmentation and recognition) → SAVE all changes into a database file
Training experiments
- training phase
● annotating 1-2 pages per document using
GUI
● database dev : 100 annotated document
pages (50k of character entries belonging to
143 unique classes)
- mass application (batch mode) for 100 novels
- performance evaluation (Levenshtein
distance)
94.2% accuracy for the best parameter of the k-
NN algorithm (k=5) ( results 98.6% -- 82.3% )
25% improvement when compared to the
initial recognition results
Post correction and quality control
➔ manually comparing & correcting the resulted OCRed files ( txt) with the
original digitised files (PDF)
➔ auxiliary resource as a look-up vocabulary to assist with and speed up the
post-correction phase
Use of corrected txt files
● evaluation ( ground truth )
● conversion to XML TEI format ( predefined schema )
Quality control and post-correction resources management
Data collection regarding
1. the time needed for the correction
2. the type and
3. the amount of errors found
Conclusions & Future steps
work-in-progress
● point out (hidden) obstacles and gaps in digitisation standards,
ground truth, post-correction
● help accessing and further processing a huge amount of Greek
historical printed polytonic documents
● enable advanced computational processes on Greek printed
polytonic textual resources
● enlarge the training set (textual resources from the 18th c+ 20th
● more sophisticated classification of characters
● update on the feature extraction mechanisms
Any questions?
Get in touch!
Many thanks!
a.sichani@sussex.ac.uk pkaddas@iit.demokritos.gr bgat@iit.demokritos.gr
@amsichani
gmikros@gmail.com

Contenu connexe

Similaire à Session1 02.anna-maria sichani

Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015
Editor IJARCET
 
Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015
Editor IJARCET
 
Muehlberger - PrestoPrime case study 2 @EUscreen Mykonos
Muehlberger - PrestoPrime case study 2 @EUscreen MykonosMuehlberger - PrestoPrime case study 2 @EUscreen Mykonos
Muehlberger - PrestoPrime case study 2 @EUscreen Mykonos
EUscreen
 

Similaire à Session1 02.anna-maria sichani (20)

Session3 01.clemens neudecker
Session3 01.clemens neudeckerSession3 01.clemens neudecker
Session3 01.clemens neudecker
 
OCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documentsOCR-D: An end-to-end open source OCR framework for historical printed documents
OCR-D: An end-to-end open source OCR framework for historical printed documents
 
O45018291
O45018291O45018291
O45018291
 
Gjergj Sheldija: Albania
Gjergj Sheldija: AlbaniaGjergj Sheldija: Albania
Gjergj Sheldija: Albania
 
Corpus studio Erwin Komen
Corpus studio Erwin KomenCorpus studio Erwin Komen
Corpus studio Erwin Komen
 
Nayeem shaik resume
Nayeem shaik resumeNayeem shaik resume
Nayeem shaik resume
 
Curation Technologies for Multilingual Europe
Curation Technologies for Multilingual EuropeCuration Technologies for Multilingual Europe
Curation Technologies for Multilingual Europe
 
A SMART LANGUAGE TRANSLATION TECHNIQUE USING OCR
A SMART LANGUAGE TRANSLATION TECHNIQUE USING OCRA SMART LANGUAGE TRANSLATION TECHNIQUE USING OCR
A SMART LANGUAGE TRANSLATION TECHNIQUE USING OCR
 
300GroupProject_handwritingsoftware.pptx
300GroupProject_handwritingsoftware.pptx300GroupProject_handwritingsoftware.pptx
300GroupProject_handwritingsoftware.pptx
 
Ijetcas14 619
Ijetcas14 619Ijetcas14 619
Ijetcas14 619
 
Digital library software
Digital library softwareDigital library software
Digital library software
 
Recognition of historical records
Recognition of historical recordsRecognition of historical records
Recognition of historical records
 
Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015
 
Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015Volume 2-issue-6-2009-2015
Volume 2-issue-6-2009-2015
 
Intro to Reproducible Research
Intro to Reproducible ResearchIntro to Reproducible Research
Intro to Reproducible Research
 
Preprocessing techniques for recognition of ancient Kannada epigraphs
Preprocessing techniques for recognition of ancient Kannada epigraphsPreprocessing techniques for recognition of ancient Kannada epigraphs
Preprocessing techniques for recognition of ancient Kannada epigraphs
 
Muehlberger - PrestoPrime case study 2 @EUscreen Mykonos
Muehlberger - PrestoPrime case study 2 @EUscreen MykonosMuehlberger - PrestoPrime case study 2 @EUscreen Mykonos
Muehlberger - PrestoPrime case study 2 @EUscreen Mykonos
 
IRJET - Language Linguist using Image Processing on Intelligent Transport Sys...
IRJET - Language Linguist using Image Processing on Intelligent Transport Sys...IRJET - Language Linguist using Image Processing on Intelligent Transport Sys...
IRJET - Language Linguist using Image Processing on Intelligent Transport Sys...
 
Arabic Handwritten Text Recognition and Writer Identification
Arabic Handwritten Text Recognition and Writer IdentificationArabic Handwritten Text Recognition and Writer Identification
Arabic Handwritten Text Recognition and Writer Identification
 
eScriptorium: An Open Source Platform for Historical Document Analysis
eScriptorium: An Open Source Platform for Historical Document AnalysiseScriptorium: An Open Source Platform for Historical Document Analysis
eScriptorium: An Open Source Platform for Historical Document Analysis
 

Plus de IMPACT Centre of Competence

Plus de IMPACT Centre of Competence (20)

Session6 01.helmut schmid
Session6 01.helmut schmidSession6 01.helmut schmid
Session6 01.helmut schmid
 
Session1 03.hsian-an wang
Session1 03.hsian-an wangSession1 03.hsian-an wang
Session1 03.hsian-an wang
 
Session7 03.katrien depuydt
Session7 03.katrien depuydtSession7 03.katrien depuydt
Session7 03.katrien depuydt
 
Session7 02.peter kiraly
Session7 02.peter kiralySession7 02.peter kiraly
Session7 02.peter kiraly
 
Session6 04.giuseppe celano
Session6 04.giuseppe celanoSession6 04.giuseppe celano
Session6 04.giuseppe celano
 
Session6 03.sandra young
Session6 03.sandra youngSession6 03.sandra young
Session6 03.sandra young
 
Session6 02.jeremi ochab
Session6 02.jeremi ochabSession6 02.jeremi ochab
Session6 02.jeremi ochab
 
Session5 04.evangelos varthis
Session5 04.evangelos varthisSession5 04.evangelos varthis
Session5 04.evangelos varthis
 
Session5 03.george rehm
Session5 03.george rehmSession5 03.george rehm
Session5 03.george rehm
 
Session5 02.tom derrick
Session5 02.tom derrickSession5 02.tom derrick
Session5 02.tom derrick
 
Session5 01.rutger vankoert
Session5 01.rutger vankoertSession5 01.rutger vankoert
Session5 01.rutger vankoert
 
Session4 04.senka drobac
Session4 04.senka drobacSession4 04.senka drobac
Session4 04.senka drobac
 
Session3 04.arnau baro
Session3 04.arnau baroSession3 04.arnau baro
Session3 04.arnau baro
 
Session3 03.christian clausner
Session3 03.christian clausnerSession3 03.christian clausner
Session3 03.christian clausner
 
Session3 02.kimmo ketunnen
Session3 02.kimmo ketunnenSession3 02.kimmo ketunnen
Session3 02.kimmo ketunnen
 
Session2 04.ashkan ashkpour
Session2 04.ashkan ashkpourSession2 04.ashkan ashkpour
Session2 04.ashkan ashkpour
 
Session2 03.juri opitz
Session2 03.juri opitzSession2 03.juri opitz
Session2 03.juri opitz
 
Session2 02.christian reul
Session2 02.christian reulSession2 02.christian reul
Session2 02.christian reul
 
Session2 01.emad mohamed
Session2 01.emad mohamedSession2 01.emad mohamed
Session2 01.emad mohamed
 
Session1 04.florian fink
Session1 04.florian finkSession1 04.florian fink
Session1 04.florian fink
 

Dernier

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 

Dernier (20)

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 

Session1 02.anna-maria sichani

  • 1. OCR for Greek polytonic (multi accent) historical printed documents development, optimization and quality control Sichani Anna-Maria, Kaddas Panagiotis, Gatos Vassilis, George Mikros
  • 2. ● The challenge : Greek polytonic and OCR ● Previous work ● Our goal & approach ● Data acquisition ● OCR tool development ● Graphic User interface ● Experiments - Training ● Post correction and quality control ● Conclusions - future steps Topics covered
  • 3. Greek polytonic (multi accent) scripts and Optical Character Recognition (OCR) technologies ● Recognition of non-Latin minority languages or scripts having a large number of character classes is still subject of active research. ● Greek polytonic (multi accent) scripts have a large variety of diacritic marks and as a result a large number of character classes (more than 270). ● Greek polytonic documents (both handwritten and printed) cannot be successfully processed by the currently available OCR technologies / a huge amount of digitised (handwritten and printed) Greek documents still remains without full text processing and analysis capabilities.
  • 4. Previous work ● (Digital) Classics / OCR for Greek polytonic (mainly Ancient Greek) - modern fonts ○ Tesseract | Antigrapheus ○ Perseus project ● Computer Science / Machine learning ○ OCR framework combining different recognition modules (accent recognition , recognizing characters from different text zones, text dewarping, text line detection) ○ adaptive zoning features (to adjust the position of every zone based on local pattern information) ○ Hidden Markov Models (HMM)
  • 5. Our Approach | Goal ➔ Develop a user friendly OCR tool and workflow for Greek polytonic historical printed texts ● COST action “Distant Reading for European Literary History(CA16204)” distant-reading.net ● Create the 1st openly available and reusable Modern Greek Literary Textual Collection as part of larger European Literary Textual Collection (ELTeC) in order to apply Distant Reading methods
  • 6. Data acquisition a corpus of Modern Greek texts using polytonic scripts ● 100 19th century Greek novels, 7-400 pages ● project -specific criteria for the corpus sampling (authors' gender representation, date of publication, length, et.c.) ● texts already digitised by various Greek institutions in pdf /jpg formats (National Library of Greece, University libraries) ● different digitisation standards → variations in terms of quality of digitisation ● mainly from original printed editions of the 19th and 20th century using various typefaces and diacritics → image quality and variations in the digitisation
  • 7.
  • 8. OCR tool development Proposed OCR Method (train – test phase) using image processing techniques 1. Character segmentation and recognition: all input document images are fed into an existing OCR tool (ABBYY FineReader) 2. Manual post-procession of the results in order to improve the quality of the initial character recognition and create an accurate training set. 3. Feature extractor algorithm, were each already segmented character is normalized into a fixed size and then splitted into 3x3 zones. From each sub image pixel intensity is extracted as image feature and the final descriptor for each character is the zone feature concatenation. 4. Classification : machine learning algorithm, k-NN, in order to classify characters to the nearest class of the train set, based on the feature values extracted.
  • 10. Graphical User Interface (GUI) * *ABBYY FineReader was used for an initial character recognition → accuracy is below 80% in most cases → further optimization ● easy-to-use ● fast and accurate way of marking each character of a document page ● embedding user feedback (EDIT) ● optimizing initial results (discard or modify erroneous character segmentation and recognition) → SAVE all changes into a database file
  • 11.
  • 12. Training experiments - training phase ● annotating 1-2 pages per document using GUI ● database dev : 100 annotated document pages (50k of character entries belonging to 143 unique classes) - mass application (batch mode) for 100 novels - performance evaluation (Levenshtein distance) 94.2% accuracy for the best parameter of the k- NN algorithm (k=5) ( results 98.6% -- 82.3% ) 25% improvement when compared to the initial recognition results
  • 13. Post correction and quality control ➔ manually comparing & correcting the resulted OCRed files ( txt) with the original digitised files (PDF) ➔ auxiliary resource as a look-up vocabulary to assist with and speed up the post-correction phase Use of corrected txt files ● evaluation ( ground truth ) ● conversion to XML TEI format ( predefined schema ) Quality control and post-correction resources management Data collection regarding 1. the time needed for the correction 2. the type and 3. the amount of errors found
  • 14.
  • 15. Conclusions & Future steps work-in-progress ● point out (hidden) obstacles and gaps in digitisation standards, ground truth, post-correction ● help accessing and further processing a huge amount of Greek historical printed polytonic documents ● enable advanced computational processes on Greek printed polytonic textual resources ● enlarge the training set (textual resources from the 18th c+ 20th ● more sophisticated classification of characters ● update on the feature extraction mechanisms
  • 16. Any questions? Get in touch! Many thanks! a.sichani@sussex.ac.uk pkaddas@iit.demokritos.gr bgat@iit.demokritos.gr @amsichani gmikros@gmail.com