Khmer OCR
LONG Seangmeng
Lecturer and researcher, GIC - ITC
seangmeng@itc.edu.kh
Scientific Day
3rd May, 2012
Khmer OCR
• What is OCR?
• Khmer OCR Project
• State of the Art
• Khmer OCR System
• Project status
• Perspectives
Optical Character Recognition (OCR)
Text Image => OCR => Editable Text
Khmer OCR Project
• 2011
• Team
– Dr. SENG Sopheap, ITC
– Mr. LONG Seangmeng, ITC
– Mr. EN Sovann (master's student)
– Ms. PRUM Sophea (PhD student)
– Mr. HAO Jeudi (5th-year student)
• Develop a Khmer OCR system
– Font independent
– Size independent
State of the Art
Author | Limitation | Result
CHEY Chanoeurn, KOSIN Chamnongthai and PINIT Kumhom | 10 characters (បពជកភណឃសវទ) | 92%
CHEY Chanoeurn, KOSIN Chamnongthai and PINIT Kumhom | 20 fonts | 92.85% (size 22), 91.66% (size 18), 89.27% (size 12)
ING Leng Ieng and MUAZ Ahmed | Limon R1 22 | 98.88%
KRUY Vanna | Font and size independent (manual preparation for new fonts) | 97%
EN Sovann | Font and size independent (manual preparation for new fonts) | 96%
Khmer OCR System
Text Image => Pre processing => Segmentation => Recognition => Post processing => Editable Text
សា លា ្រ ក រ ង រ ភ រន
សាលា្កុងភនពពញនិងសហជីព
Khmer OCR System (cont.)
• Pre processing
– Binarization
– Noise removal
– Skew detection and correction
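The slides do not say which binarization method the project uses. As an illustration only, here is a minimal sketch of one common choice, Otsu's global threshold, in plain Python. The function names and the convention that 1 means ink are assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximizes between-class variance.

    Otsu's method is an assumed stand-in: the slides only say "Binarization".
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the grey levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a greyscale image (list of rows) to 1 = ink (dark), 0 = background."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


# Tiny hypothetical image: dark strokes (20-30) on a light page (240-250).
sample = [[250, 20, 245],
          [30, 240, 25]]
binary = binarize(sample)
```

Noise removal and skew correction would operate on the resulting binary image; they are not sketched here.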
Khmer OCR System (cont.)
• Segmentation
– Levels: Page, Line, Vertical Symbol, Blob
– Figure: a page split into Line 1 and Line 2
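A minimal sketch of the page-to-line step, assuming a horizontal projection profile is used (the slides do not name the method). The input is a binary image with 1 for ink, and `segment_lines` is a hypothetical name.

```python
def segment_lines(binary):
    """Split a binary page (1 = ink) into text lines.

    Returns a list of (top, bottom) row ranges, bottom exclusive.
    Projection-profile splitting is an assumption, not the project's
    confirmed method.
    """
    # Horizontal projection: amount of ink in each pixel row.
    profile = [sum(row) for row in binary]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                   # first inked row of a new line
        elif ink == 0 and start is not None:
            lines.append((start, y))    # blank row closes the line
            start = None
    if start is not None:               # line touching the bottom edge
        lines.append((start, len(binary)))
    return lines


# Hypothetical 6-row page containing two text lines.
page = [[0, 0, 0],
        [1, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
        [1, 0, 1],
        [0, 0, 0]]
line_ranges = segment_lines(page)
```

Khmer stacks vowels and subscripts above and below the base consonant, so a real system would probably need a minimum-gap rule to keep those marks attached to their line.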
Khmer OCR System (cont.)
• Recognition
– Training images (sample images), each with a label
– Blob to be recognized => search for closest match => closest match (image and its label, e.g. ក)
Khmer OCR System (cont.)
• Recognition (cont.)
– How to find closest match?
– How to represent the blob image?
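• Fourier transform: any function f(t) with period T can be written as a Fourier series:
f(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left[ a_n \cos\left(\frac{2\pi n t}{T}\right) + b_n \sin\left(\frac{2\pi n t}{T}\right) \right]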
Blob image => 2-D Fourier transform
The blob image (B) represented by Fourier coefficients:
B[0], B[1], B[2], …
City block distance between two blobs B and B’:
Distance = |B[0] – B’[0]| + |B[1] – B’[1]| + |B[2] – B’[2]| + …
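The closest-match search can be sketched as below, under stated assumptions: each blob is described by the magnitudes of its lowest-frequency 2-D DFT coefficients, and the training sample with the smallest city block distance wins. The number of coefficients kept, the use of magnitudes, and all function names are choices made for this example; the slides do not specify them.

```python
import cmath


def dft2_magnitudes(image, keep=3):
    """Magnitudes of the keep x keep lowest-frequency 2-D DFT coefficients.

    Direct O(N^2) evaluation, adequate for small blob images.
    Keeping magnitudes only is an assumption of this sketch.
    """
    rows, cols = len(image), len(image[0])
    feats = []
    for u in range(keep):
        for v in range(keep):
            acc = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2 * cmath.pi * (u * y / rows + v * x / cols)
                    acc += image[y][x] * cmath.exp(1j * angle)
            feats.append(abs(acc))
    return feats


def city_block(a, b):
    """City block (L1) distance: |a[0]-b[0]| + |a[1]-b[1]| + ..."""
    return sum(abs(x - y) for x, y in zip(a, b))


def recognize(blob, training):
    """Return the label of the training sample closest to the blob.

    training: list of (label, image) pairs with images of equal size.
    """
    target = dft2_magnitudes(blob)
    best = min(training,
               key=lambda s: city_block(dft2_magnitudes(s[1]), target))
    return best[0]


# Hypothetical 4x4 training blobs and a noisy blob to recognize.
vertical = [[0, 1, 0, 0]] * 4
horizontal = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
training = [("vertical", vertical), ("horizontal", horizontal)]
noisy = [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]]
label = recognize(noisy, training)
```

In practice the training features would be computed once and stored rather than recomputed for every query, and blobs would be scaled to a common size first.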
Khmer OCR System (cont.)
• Post processing
ឦ
ញ
Assembling
Blob
សា លា ្រ ក រ ង រ ភ រន ពរ ព ញ រិ ន ង
សា លា ្រក រ ង រភ រន ពរព ញ រិន ង
សា្លាកងពភនពញិនង
សាលា្កុងភនពពញ
Reordering
កន
្រុង ្កុង
ពបស់
ភន
របស់
Spell Checking
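A minimal sketch of the spell-checking step, assuming a dictionary lookup with approximate matching. Python's `difflib` similarity ratio stands in for whatever method the project intends, and the two-word dictionary is hypothetical. The pair ពបស់ / របស់ comes from the slide's example strings; treating the first as the misrecognized form is an assumption.

```python
import difflib


def correct_word(word, dictionary, cutoff=0.6):
    """Return the closest dictionary word, or the word itself if none is close.

    cutoff is difflib's similarity ratio threshold (0 to 1); 0.6 is its default.
    """
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Hypothetical two-word dictionary; a real one would hold a full Khmer lexicon.
dictionary = ["របស់", "ភ្នំពេញ"]
fixed = correct_word("ពបស់", dictionary)
```

Matching works on Unicode code points here, so it should run after assembling and reordering have produced correctly ordered text.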
Project status
• Pre processing
– Binarization and noise removal √
– Skew detection and correction X
• Segmentation √
• Recognition
– Feature extraction √
– Automatic generation of training data for new fonts √
• Post processing
– Assembling and reordering rules
    • Manual √
    • Automatic X
– Spell checking X
• Performance evaluation X
Perspectives
• Joining characters
• Text layout
• Low quality text images
• Curved text lines
Thanks for your attention!
Demo & Questions???