SlideShare une entreprise Scribd logo
BUILDING A NAÏVE OCR
                 SYSTEM
                  Kaur Alasoo and Jaana Metsamaa
                       Data Mining (2009 fall)




28 January 2010                           http://kaur.pri.ee/projects/wordocr/
VISION


   Take a
snapshot of a
    page.
VISION


Extract one
word from
the picture.
VISION




Look up the word
   on Google,
 dictionary, etc...
REALITY
• To   increase the likelihood of success, we limited ourselves to:

 • One     font from one book.

 • Only    lowercase characters.

 • Pictures   taken with one phone.
GETTING THE DATA




 44 pictures
GETTING THE DATA




 44 pictures   242 words
0. Take a picture

                         1. Convert to grayscale

                                                   2. Apply thresholding




  WORKFLOW                                          3. Cut horizontally

                        4. Cut vertically
5. Resize




            8x16 px
FROM PICTURE TO FEATURE
        VECTOR
FROM PICTURE TO FEATURE
        VECTOR
FROM PICTURE TO FEATURE
        VECTOR
       255
FROM PICTURE TO FEATURE
        VECTOR
       255 255
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255   0
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255   0   0
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255   0   0   155
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255   0   0   155 255
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255   0   0   155 255 255
FROM PICTURE TO FEATURE
        VECTOR
       255 255 255   0   0   155 255 255
TRAINING THE CLASSIFIER




We used 1343 characters to train a SVM with RBF
                   kernel.
RESULTS

• 10-fold   cross-validation accuracy with our method:

                              98.86%
• Character   classification accuracy using Tesseract OCR:

                              91.47%
WHAT WE LEARNED?
SOME LETTERS ARE
                    EXTREMELY RARE
15

                                                               English             Our data

11



 8



 4



 0
     e   t   a o   i   n   s   h   r   d   l   c   u m w   f    g   y    p b   v   k   j   x q   z
NOT ALL FONTS CAN BE EASILY
SEPARATED INTO CHARACTERS
WHAT WE LEARNED?

• Installing   libraries is the most difficult thing.

• Do    not overly restrict yourself.

• Donot build your on OCR system unless its absolutely
 necessary.

• The   letters are the easy part.


                                                http://kaur.pri.ee/projects/wordocr/

Contenu connexe

Similaire à Building a Naive OCR System

Александр Скачков "JavaScript metaprogramming with ES2015 Proxy"
Александр Скачков "JavaScript metaprogramming with ES2015 Proxy"Александр Скачков "JavaScript metaprogramming with ES2015 Proxy"
Александр Скачков "JavaScript metaprogramming with ES2015 Proxy"Fwdays
 
JavaScript Metaprogramming with ES 2015 Proxy
JavaScript Metaprogramming with ES 2015 ProxyJavaScript Metaprogramming with ES 2015 Proxy
JavaScript Metaprogramming with ES 2015 ProxyAlexandr Skachkov
 
gPBL - Reading Assistant for Blind - Working Progress
gPBL - Reading Assistant for Blind - Working Progress gPBL - Reading Assistant for Blind - Working Progress
gPBL - Reading Assistant for Blind - Working Progress Chanon Khongprasongsiri
 
6 reasons Jubilee could be a Rubyist's new best friend
6 reasons Jubilee could be a Rubyist's new best friend6 reasons Jubilee could be a Rubyist's new best friend
6 reasons Jubilee could be a Rubyist's new best friendForrest Chang
 
[CocoaHeads Tricity] Michał Zygar - Consuming API
[CocoaHeads Tricity] Michał Zygar - Consuming API[CocoaHeads Tricity] Michał Zygar - Consuming API
[CocoaHeads Tricity] Michał Zygar - Consuming APICocoaHeads Tricity
 
Tesseract OCR Engine
Tesseract OCR EngineTesseract OCR Engine
Tesseract OCR EngineRaghu nath
 
Call Control Power Tools with Adhearsion
Call Control Power Tools with AdhearsionCall Control Power Tools with Adhearsion
Call Control Power Tools with AdhearsionAdhearsion Foundation
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017StampedeCon
 
Visual Object Category Recognition
Visual Object Category RecognitionVisual Object Category Recognition
Visual Object Category RecognitionAshish Gupta
 
Call Control Power Tools with Adhearsion
Call Control Power Tools with Adhearsion Call Control Power Tools with Adhearsion
Call Control Power Tools with Adhearsion Mojo Lingo
 
Adopting Elixir in a 10 year old codebase
Adopting Elixir in a 10 year old codebaseAdopting Elixir in a 10 year old codebase
Adopting Elixir in a 10 year old codebaseMichael Klishin
 
Cloud conference - mongodb
Cloud conference - mongodbCloud conference - mongodb
Cloud conference - mongodbMitch Pirtle
 
Cassandra Summit 2014: Astyanax — To Be or Not To Be
Cassandra Summit 2014: Astyanax — To Be or Not To BeCassandra Summit 2014: Astyanax — To Be or Not To Be
Cassandra Summit 2014: Astyanax — To Be or Not To BeDataStax Academy
 
NLP Classifier Models & Metrics
NLP Classifier Models & MetricsNLP Classifier Models & Metrics
NLP Classifier Models & MetricsSanghamitra Deb
 
Stathy DevOps in MSP / MKE on IAC
Stathy DevOps in MSP / MKE on IACStathy DevOps in MSP / MKE on IAC
Stathy DevOps in MSP / MKE on IACStathy Touloumis
 
Automated static deobfuscation in the context of Reverse Engineering
Automated static deobfuscation in the context of Reverse EngineeringAutomated static deobfuscation in the context of Reverse Engineering
Automated static deobfuscation in the context of Reverse Engineeringzynamics GmbH
 
Esp2solr eurocon-2011-presentation-111021215049-phpapp02
Esp2solr eurocon-2011-presentation-111021215049-phpapp02Esp2solr eurocon-2011-presentation-111021215049-phpapp02
Esp2solr eurocon-2011-presentation-111021215049-phpapp02TNR Global
 
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011Michael McIntosh
 
Loosely Coupled Complexity - Unleash the power of your Domain Model with Comm...
Loosely Coupled Complexity - Unleash the power of your Domain Model with Comm...Loosely Coupled Complexity - Unleash the power of your Domain Model with Comm...
Loosely Coupled Complexity - Unleash the power of your Domain Model with Comm...Alberto Brandolini
 
Erlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughputErlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughputPaolo Negri
 

Similaire à Building a Naive OCR System (20)

Александр Скачков "JavaScript metaprogramming with ES2015 Proxy"
Александр Скачков "JavaScript metaprogramming with ES2015 Proxy"Александр Скачков "JavaScript metaprogramming with ES2015 Proxy"
Александр Скачков "JavaScript metaprogramming with ES2015 Proxy"
 
JavaScript Metaprogramming with ES 2015 Proxy
JavaScript Metaprogramming with ES 2015 ProxyJavaScript Metaprogramming with ES 2015 Proxy
JavaScript Metaprogramming with ES 2015 Proxy
 
gPBL - Reading Assistant for Blind - Working Progress
gPBL - Reading Assistant for Blind - Working Progress gPBL - Reading Assistant for Blind - Working Progress
gPBL - Reading Assistant for Blind - Working Progress
 
6 reasons Jubilee could be a Rubyist's new best friend
6 reasons Jubilee could be a Rubyist's new best friend6 reasons Jubilee could be a Rubyist's new best friend
6 reasons Jubilee could be a Rubyist's new best friend
 
[CocoaHeads Tricity] Michał Zygar - Consuming API
[CocoaHeads Tricity] Michał Zygar - Consuming API[CocoaHeads Tricity] Michał Zygar - Consuming API
[CocoaHeads Tricity] Michał Zygar - Consuming API
 
Tesseract OCR Engine
Tesseract OCR EngineTesseract OCR Engine
Tesseract OCR Engine
 
Call Control Power Tools with Adhearsion
Call Control Power Tools with AdhearsionCall Control Power Tools with Adhearsion
Call Control Power Tools with Adhearsion
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Visual Object Category Recognition
Visual Object Category RecognitionVisual Object Category Recognition
Visual Object Category Recognition
 
Call Control Power Tools with Adhearsion
Call Control Power Tools with Adhearsion Call Control Power Tools with Adhearsion
Call Control Power Tools with Adhearsion
 
Adopting Elixir in a 10 year old codebase
Adopting Elixir in a 10 year old codebaseAdopting Elixir in a 10 year old codebase
Adopting Elixir in a 10 year old codebase
 
Cloud conference - mongodb
Cloud conference - mongodbCloud conference - mongodb
Cloud conference - mongodb
 
Cassandra Summit 2014: Astyanax — To Be or Not To Be
Cassandra Summit 2014: Astyanax — To Be or Not To BeCassandra Summit 2014: Astyanax — To Be or Not To Be
Cassandra Summit 2014: Astyanax — To Be or Not To Be
 
NLP Classifier Models & Metrics
NLP Classifier Models & MetricsNLP Classifier Models & Metrics
NLP Classifier Models & Metrics
 
Stathy DevOps in MSP / MKE on IAC
Stathy DevOps in MSP / MKE on IACStathy DevOps in MSP / MKE on IAC
Stathy DevOps in MSP / MKE on IAC
 
Automated static deobfuscation in the context of Reverse Engineering
Automated static deobfuscation in the context of Reverse EngineeringAutomated static deobfuscation in the context of Reverse Engineering
Automated static deobfuscation in the context of Reverse Engineering
 
Esp2solr eurocon-2011-presentation-111021215049-phpapp02
Esp2solr eurocon-2011-presentation-111021215049-phpapp02Esp2solr eurocon-2011-presentation-111021215049-phpapp02
Esp2solr eurocon-2011-presentation-111021215049-phpapp02
 
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011
Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011
 
Loosely Coupled Complexity - Unleash the power of your Domain Model with Comm...
Loosely Coupled Complexity - Unleash the power of your Domain Model with Comm...Loosely Coupled Complexity - Unleash the power of your Domain Model with Comm...
Loosely Coupled Complexity - Unleash the power of your Domain Model with Comm...
 
Erlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughputErlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughput
 

Plus de Kaur Alasoo

The Data-Driven World
The Data-Driven WorldThe Data-Driven World
The Data-Driven WorldKaur Alasoo
 
What is Bioinformatics?
What is Bioinformatics?What is Bioinformatics?
What is Bioinformatics?Kaur Alasoo
 
Masinõpe ja bioinformaatika
Masinõpe ja bioinformaatikaMasinõpe ja bioinformaatika
Masinõpe ja bioinformaatikaKaur Alasoo
 
Teine tuutoritund 2009
Teine tuutoritund 2009Teine tuutoritund 2009
Teine tuutoritund 2009Kaur Alasoo
 
Esimene tuutoritund 2009
Esimene tuutoritund 2009Esimene tuutoritund 2009
Esimene tuutoritund 2009Kaur Alasoo
 

Plus de Kaur Alasoo (6)

The Data-Driven World
The Data-Driven WorldThe Data-Driven World
The Data-Driven World
 
What is Bioinformatics?
What is Bioinformatics?What is Bioinformatics?
What is Bioinformatics?
 
Masinõpe ja bioinformaatika
Masinõpe ja bioinformaatikaMasinõpe ja bioinformaatika
Masinõpe ja bioinformaatika
 
Teine tuutoritund 2009
Teine tuutoritund 2009Teine tuutoritund 2009
Teine tuutoritund 2009
 
Esimene tuutoritund 2009
Esimene tuutoritund 2009Esimene tuutoritund 2009
Esimene tuutoritund 2009
 
Vietnami sõda
Vietnami sõdaVietnami sõda
Vietnami sõda
 

Dernier

SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...CzechDreamin
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxDavid Michel
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastUXDXConf
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101vincent683379
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIES VE
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsUXDXConf
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoTAnalytics
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty SecureFemke de Vroome
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...FIDO Alliance
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfFIDO Alliance
 
Buy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptxBuy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptxEasyPrinterHelp
 
Top 10 Symfony Development Companies 2024
Top 10 Symfony Development Companies 2024Top 10 Symfony Development Companies 2024
Top 10 Symfony Development Companies 2024TopCSSGallery
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomCzechDreamin
 
Agentic RAG What it is its types applications and implementation.pdf
Agentic RAG What it is its types applications and implementation.pdfAgentic RAG What it is its types applications and implementation.pdf
Agentic RAG What it is its types applications and implementation.pdfChristopherTHyatt
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxAbida Shariff
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutesconfluent
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeCzechDreamin
 

Dernier (20)

SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
Buy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptxBuy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptx
 
Top 10 Symfony Development Companies 2024
Top 10 Symfony Development Companies 2024Top 10 Symfony Development Companies 2024
Top 10 Symfony Development Companies 2024
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
Agentic RAG What it is its types applications and implementation.pdf
Agentic RAG What it is its types applications and implementation.pdfAgentic RAG What it is its types applications and implementation.pdf
Agentic RAG What it is its types applications and implementation.pdf
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 

Building a Naive OCR System