SlideShare une entreprise Scribd logo
1  sur  25
BUILDING A NAÏVE OCR
                 SYSTEM
                  Kaur Alasoo and Jaana Metsamaa
                       Data Mining (2009 fall)




28 January 2010                           http://kaur.pri.ee/projects/wordocr/
VISION


   Take a
snapshot of a
    page.
VISION


Extract one
word from
the picture.
VISION




Look up the word
   on Google,
 dictionary, etc...
REALITY
• To   increase the likelihood of success, we limited ourselves to:

 • One     font from one book.

 • Only    lowercase characters.

 • Pictures   taken with one phone.
GETTING THE DATA




 44 pictures
GETTING THE DATA




 44 pictures   242 words
0. Take a picture

                         1. Convert to grayscale

                                                   2. Apply thresholding




  WORKFLOW                                          3. Cut horizontally

                        4. Cut vertically
5. Resize




            8x16 px
FROM PICTURE TO FEATURE
        VECTOR
FROM PICTURE TO FEATURE
        VECTOR
FROM PICTURE TO FEATURE
        VECTOR
       255
FROM PICTURE TO FEATURE
        VECTOR
       255 255
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255   0
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255   0   0
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255   0   0   155
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255   0   0   155 255
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255   0   0   155 255 255
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255   0   0   155 255 255
TRAINING THE CLASSIFIER




We used 1343 characters to train a SVM with RBF
                   kernel.
RESULTS

• 10-fold   cross-validation accuracy with our method:

                              98.86%
• Character   classification accuracy using Tesseract OCR:

                              91.47%
WHAT WE LEARNED?
SOME LETTERS ARE
                    EXTREMELY RARE
15

                                                               English             Our data

11



 8



 4



 0
     e   t   a o   i   n   s   h   r   d   l   c   u m w   f    g   y    p b   v   k   j   x q   z
NOT ALL FONTS CAN BE EASILY
SEPARATED INTO CHARACTERS
WHAT WE LEARNED?

• Installing   libraries is the most difficult thing.

• Do    not overly restrict yourself.

• Donot build your on OCR system unless its absolutely
 necessary.

• The   letters are the easy part.


                                                http://kaur.pri.ee/projects/wordocr/

Contenu connexe

Similaire à Building a Naive OCR System

Tesseract OCR Engine
Tesseract OCR EngineTesseract OCR Engine
Tesseract OCR Engine
Raghu nath
 
Stathy DevOps in MSP / MKE on IAC
Stathy DevOps in MSP / MKE on IACStathy DevOps in MSP / MKE on IAC
Stathy DevOps in MSP / MKE on IAC
Stathy Touloumis
 
Automated static deobfuscation in the context of Reverse Engineering
Automated static deobfuscation in the context of Reverse EngineeringAutomated static deobfuscation in the context of Reverse Engineering
Automated static deobfuscation in the context of Reverse Engineering
zynamics GmbH
 
Esp2solr eurocon-2011-presentation-111021215049-phpapp02
Esp2solr eurocon-2011-presentation-111021215049-phpapp02Esp2solr eurocon-2011-presentation-111021215049-phpapp02
Esp2solr eurocon-2011-presentation-111021215049-phpapp02
TNR Global
 

Similaire à Building a Naive OCR System (20)

Александр Скачков "JavaScript metaprogramming with ES2015 Proxy"
Александр Скачков "JavaScript metaprogramming with ES2015 Proxy"Александр Скачков "JavaScript metaprogramming with ES2015 Proxy"
Александр Скачков "JavaScript metaprogramming with ES2015 Proxy"
 
JavaScript Metaprogramming with ES 2015 Proxy
JavaScript Metaprogramming with ES 2015 ProxyJavaScript Metaprogramming with ES 2015 Proxy
JavaScript Metaprogramming with ES 2015 Proxy
 
gPBL - Reading Assistant for Blind - Working Progress
gPBL - Reading Assistant for Blind - Working Progress gPBL - Reading Assistant for Blind - Working Progress
gPBL - Reading Assistant for Blind - Working Progress
 
6 reasons Jubilee could be a Rubyist's new best friend
6 reasons Jubilee could be a Rubyist's new best friend6 reasons Jubilee could be a Rubyist's new best friend
6 reasons Jubilee could be a Rubyist's new best friend
 
[CocoaHeads Tricity] Michał Zygar - Consuming API
[CocoaHeads Tricity] Michał Zygar - Consuming API[CocoaHeads Tricity] Michał Zygar - Consuming API
[CocoaHeads Tricity] Michał Zygar - Consuming API
 
Tesseract OCR Engine
Tesseract OCR EngineTesseract OCR Engine
Tesseract OCR Engine
 
Call Control Power Tools with Adhearsion
Call Control Power Tools with AdhearsionCall Control Power Tools with Adhearsion
Call Control Power Tools with Adhearsion
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Visual Object Category Recognition
Visual Object Category RecognitionVisual Object Category Recognition
Visual Object Category Recognition
 
Call Control Power Tools with Adhearsion
Call Control Power Tools with Adhearsion Call Control Power Tools with Adhearsion
Call Control Power Tools with Adhearsion
 
Adopting Elixir in a 10 year old codebase
Adopting Elixir in a 10 year old codebaseAdopting Elixir in a 10 year old codebase
Adopting Elixir in a 10 year old codebase
 
Cloud conference - mongodb
Cloud conference - mongodbCloud conference - mongodb
Cloud conference - mongodb
 
Cassandra Summit 2014: Astyanax — To Be or Not To Be
Cassandra Summit 2014: Astyanax — To Be or Not To BeCassandra Summit 2014: Astyanax — To Be or Not To Be
Cassandra Summit 2014: Astyanax — To Be or Not To Be
 
NLP Classifier Models & Metrics
NLP Classifier Models & MetricsNLP Classifier Models & Metrics
NLP Classifier Models & Metrics
 
Stathy DevOps in MSP / MKE on IAC
Stathy DevOps in MSP / MKE on IACStathy DevOps in MSP / MKE on IAC
Stathy DevOps in MSP / MKE on IAC
 
Automated static deobfuscation in the context of Reverse Engineering
Automated static deobfuscation in the context of Reverse EngineeringAutomated static deobfuscation in the context of Reverse Engineering
Automated static deobfuscation in the context of Reverse Engineering
 
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011
 
Esp2solr eurocon-2011-presentation-111021215049-phpapp02
Esp2solr eurocon-2011-presentation-111021215049-phpapp02Esp2solr eurocon-2011-presentation-111021215049-phpapp02
Esp2solr eurocon-2011-presentation-111021215049-phpapp02
 
Loosely Coupled Complexity - Unleash the power of your Domain Model with Comm...
Loosely Coupled Complexity - Unleash the power of your Domain Model with Comm...Loosely Coupled Complexity - Unleash the power of your Domain Model with Comm...
Loosely Coupled Complexity - Unleash the power of your Domain Model with Comm...
 
Erlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughputErlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughput
 

Plus de Kaur Alasoo (6)

The Data-Driven World
The Data-Driven WorldThe Data-Driven World
The Data-Driven World
 
What is Bioinformatics?
What is Bioinformatics?What is Bioinformatics?
What is Bioinformatics?
 
Masinõpe ja bioinformaatika
Masinõpe ja bioinformaatikaMasinõpe ja bioinformaatika
Masinõpe ja bioinformaatika
 
Teine tuutoritund 2009
Teine tuutoritund 2009Teine tuutoritund 2009
Teine tuutoritund 2009
 
Esimene tuutoritund 2009
Esimene tuutoritund 2009Esimene tuutoritund 2009
Esimene tuutoritund 2009
 
Vietnami sõda
Vietnami sõdaVietnami sõda
Vietnami sõda
 

Dernier

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Dernier (20)

WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 

Building a Naive OCR System