Khmer OCR
           BarCamp
     22nd September, 2012



        LONG Seangmeng
Lecturer and researcher, GIC - ITC
     seangmeng@itc.edu.kh


                                     1
Khmer OCR
•   What is OCR?
•   Khmer OCR Project
•   State of the Art
•   Khmer OCR System
•   Project status
•   Perspectives



                            2
Optical Character Recognition (OCR)
 Text Image




                  OCR
  Editable Text




                                      3
Khmer OCR Project
• 2011
• Team
  –   Dr. SENG Sopheap, ITC
  –   Mr. LONG Seangmeng, ITC
  –   Mr. EN Sovann (master's student)
  –   Ms. PRUM Sophea (PhD student)
  –   Mr. HAO Jeudi (5th year)
• Develop a Khmer OCR system
  – Font independent
  – Size independent

                                     4
State of the Art
Author                                                Limitation                                                     Result
CHEY Chanoeurn, KOSIN Chamnongthai and PINIT Kumhom   10 characters (បបបបបបបបបប)                                     92%
CHEY Chanoeurn, KOSIN Chamnongthai and PINIT Kumhom   20 fonts                                                       92.85% (size 22)
                                                                                                                     91.66% (size 18)
                                                                                                                     89.27% (size 12)
ING Leng Ieng and MUAZ Ahmed                          Limon R1 22                                                    98.88%
KRUY Vanna                                            Font and size independent (manual preparation for new fonts)   97%
EN Sovann                                             Font and size independent (manual preparation for new fonts)   96%
                                                                        5
Khmer OCR System
• Text Image → Pre processing → Segmentation → Recognition → Post processing → Editable Text

                                              6
Khmer OCR System (cont.)
• Pre processing
  – Binarization
  – Noise removal
  – Skew detection and correction
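The slides do not say which binarization method the project uses. As a minimal sketch, assuming a grayscale page stored as a `uint8` NumPy array and a global threshold chosen by Otsu's method (both are assumptions for illustration, and the function names are hypothetical), the binarization step could look like this:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the gray level that maximizes between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    levels = np.arange(256)
    weight_bg = np.cumsum(hist)              # number of pixels with value <= t
    weight_fg = hist.sum() - weight_bg       # number of pixels with value > t
    cum_sum = np.cumsum(hist * levels)       # sum of gray values <= t
    valid = (weight_bg > 0) & (weight_fg > 0)
    mean_bg = np.zeros(256)
    mean_fg = np.zeros(256)
    mean_bg[valid] = cum_sum[valid] / weight_bg[valid]
    mean_fg[valid] = (cum_sum[-1] - cum_sum[valid]) / weight_fg[valid]
    between = np.where(valid, weight_bg * weight_fg * (mean_bg - mean_fg) ** 2, 0.0)
    return int(np.argmax(between))

def binarize(gray: np.ndarray) -> np.ndarray:
    """Return a boolean image: True marks ink (dark pixels), False marks background."""
    return gray <= otsu_threshold(gray)
```

Noise removal and skew correction would follow on the boolean image; they are not shown here.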




                                          7
Khmer OCR System (cont.)
• Segmentation
  – Page → Line (Line 1, Line 2) → Vertical Symbol → Blob
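The slides show the segmentation hierarchy but not the algorithm behind it. One common way to cut a page into lines, and a line into vertical slices, is a projection profile: count ink pixels per row (or per column) and split wherever the count drops to zero. The sketch below assumes that technique and a boolean ink image; the function names are hypothetical, and the final step from vertical symbol to blob is not shown.

```python
import numpy as np

def split_runs(profile: np.ndarray) -> list[tuple[int, int]]:
    """Return (start, end) pairs for each run of non-zero entries; end is exclusive."""
    occupied = np.concatenate(([False], profile > 0, [False]))
    # Indices where the profile switches between empty and occupied.
    changes = np.flatnonzero(occupied[1:] != occupied[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[::2], changes[1::2])]

def segment_lines(ink: np.ndarray) -> list[tuple[int, int]]:
    """Row ranges of text lines, found from the horizontal projection profile."""
    return split_runs(ink.sum(axis=1))

def segment_columns(line: np.ndarray) -> list[tuple[int, int]]:
    """Column ranges of vertical slices within one line image."""
    return split_runs(line.sum(axis=0))
```

Each row range from `segment_lines` can be cropped out of the page and passed to `segment_columns` to obtain the vertical slices of that line.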


                                8
Khmer OCR System (cont.)
 • Recognition
    – Input: blob to be recognized
    – Training images (sample images) with label
    – Search for closest match among the training images
    – Closest match: a training image and its label


                                                                                       9
Khmer OCR System (cont.)
• Recognition (cont.)
   – How to find closest match?
   – How to represent the blob image?
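       • Fourier transform: Any (sufficiently well-behaved) function f(t) with period T can be written as

         $$f(t) = \sum_{n=-\infty}^{\infty} c_n \, e^{i 2\pi n t / T}, \qquad c_n = \frac{1}{T} \int_{0}^{T} f(t) \, e^{-i 2\pi n t / T} \, dt$$
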
   Blob image => 2-D Fourier transform
   The blob image (B) represented by Fourier coefficients:
            B[0], B[1], B[2], …
   City block distance between two blobs B and B’:
            Distance = |B[0] – B’[0]| + |B[1] – B’[1]| + |B[2] – B’[2]| + …
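The recognition step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes all blob images have already been scaled to the same shape, keeps only a small block of low-frequency coefficients (the `size` parameter is an assumption), and uses hypothetical function names.

```python
import numpy as np

def fourier_features(blob: np.ndarray, size: int = 4) -> np.ndarray:
    """Low-frequency 2-D Fourier coefficients B[0], B[1], ... of a blob image."""
    spectrum = np.fft.fft2(blob.astype(float))
    return spectrum[:size, :size].ravel()

def city_block(b1: np.ndarray, b2: np.ndarray) -> float:
    """City block distance: sum of |B[k] - B'[k]| over all coefficients."""
    return float(np.sum(np.abs(b1 - b2)))

def closest_label(blob: np.ndarray, training: list) -> str:
    """Return the label of the training image closest to the blob.

    training is a list of (label, image) pairs whose images share the blob's shape.
    """
    query = fourier_features(blob)
    best = min(training, key=lambda item: city_block(query, fourier_features(item[1])))
    return best[0]
```

In practice the training features would be computed once and stored, rather than recomputed for every query as in this sketch.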


                                                                              10
Khmer OCR System (cont.)
• Post processing
   – Blob → Assembling → Reordering → Spell Checking
                                                            11
Project status
• Pre processing
   – Binarization and noise removal √
   – Skew detection and correction X
• Segmentation √
• Recognition
   – Feature extraction √
   – Automatic generation of training data for new fonts √
• Post processing
   – Assembling and reordering rules
      • Manual √
      • Automatic X
   – Spell checking X
• Performance evaluation X


                                                             12
Perspectives
•   Joining characters
•   Text layout
•   Low quality text images
•   Curved lines




                                13
Thanks for your attention!

 Demo & Questions???



                             14
