Building a Naive OCR System

This document summarizes a student project to build a naïve optical character recognition (OCR) system. The students limited their system to one font from one book, lowercase characters only, and pictures taken with one phone. They collected 44 pictures containing 242 words. Their workflow preprocessed each image (grayscale conversion, thresholding, horizontal and vertical cuts) and resized every character to 8x16 pixels to form its feature vector. They trained a support vector machine (SVM) with a radial basis function (RBF) kernel on 1343 characters and report 98.86% accuracy under 10-fold cross-validation, compared with 91.47% character classification accuracy for Tesseract OCR. Key lessons: some letters are extremely rare in their data compared with typical English, not all fonts separate easily into characters, and installing libraries was the most difficult part.

BUILDING A NAÏVE OCR SYSTEM
Kaur Alasoo and Jaana Metsamaa
Data Mining (2009 fall)

28 January 2010 http://kaur.pri.ee/projects/wordocr/

VISION

1. Take a snapshot of a page.
2. Extract one word from the picture.
3. Look up the word on Google, in a dictionary, etc.

REALITY
• To increase the likelihood of success, we limited ourselves to:

• One font from one book.

• Only lowercase characters.

• Pictures taken with one phone.

GETTING THE DATA

44 pictures containing 242 words.

WORKFLOW

0. Take a picture
1. Convert to grayscale
2. Apply thresholding
3. Cut horizontally
4. Cut vertically
5. Resize to 8x16 px
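
The thresholding and cutting steps of the workflow can be sketched in plain Python. This is a minimal illustration, not the authors' code: the fixed cutoff of 128, the helper names, and the idea of cutting wherever a row or column contains no dark pixels are all assumptions.

```python
def threshold(gray, cutoff=128):
    """Binarize a grayscale image (list of rows): 1 = ink, 0 = background."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def runs(flags):
    """Return (start, end) index pairs for each run of consecutive True values."""
    spans, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(flags)))
    return spans


def cut_horizontally(binary):
    """Split the image into text lines at rows that contain no ink."""
    row_has_ink = [any(row) for row in binary]
    return [binary[top:bottom] for top, bottom in runs(row_has_ink)]


def cut_vertically(line):
    """Split one text line into characters at columns that contain no ink."""
    col_has_ink = [any(row[x] for row in line) for x in range(len(line[0]))]
    return [[row[left:right] for row in line] for left, right in runs(col_has_ink)]


# Tiny hypothetical image: two dark blobs separated by a white column.
gray = [
    [255, 255, 255, 255, 255],
    [255,   0, 255,  30, 255],
    [255,  10, 255,   0, 255],
    [255, 255, 255, 255, 255],
]
lines = cut_horizontally(threshold(gray))
chars = cut_vertically(lines[0])
```

A cut of this kind only works when an empty column separates neighbouring characters, which is one way to read the later slide about fonts that cannot be easily separated.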

FROM PICTURE TO FEATURE VECTOR

The pixel values of the resized character image are read out one at a time to build the feature vector. The slides show the first values being appended step by step:

255 255 255 0 0 155 255 255
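
Turning a character crop into a feature vector can be sketched as follows. The nearest-neighbour resize, the row-by-row reading order, and the reading of "8x16" as 8 pixels wide by 16 high are assumptions; the slides do not say which resize method or pixel order was used.

```python
def resize_nearest(image, width=8, height=16):
    """Resize a 2-D list of pixel values with nearest-neighbour sampling."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def to_feature_vector(image):
    """Read the pixels row by row into one flat list."""
    return [px for row in image for px in row]


# Hypothetical 4x4 character crop (0 = black ink, 255 = white paper).
crop = [
    [255, 255, 255, 255],
    [255,   0,   0, 255],
    [255,   0,   0, 255],
    [255, 255, 255, 255],
]
features = to_feature_vector(resize_nearest(crop))
```

Under these assumptions every character yields 8 * 16 = 128 values, regardless of the size of the original crop.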

TRAINING THE CLASSIFIER

We used 1343 characters to train an SVM with an RBF kernel.
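
An RBF kernel scores the similarity of two feature vectors as exp(-gamma * ||x - y||^2). The sketch below computes that value directly; the gamma of 0.01, the scaling of pixels to [0, 1], and the second vector `b` are illustrative assumptions, and the slides do not name the SVM library the authors used.

```python
import math


def rbf_kernel(x, y, gamma=0.01):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Pixel values are scaled to [0, 1] first (an assumption, not from the slides).
a = [v / 255 for v in [255, 255, 255, 0, 0, 155, 255, 255]]
b = [v / 255 for v in [255, 255, 255, 0, 0, 255, 255, 255]]  # hypothetical

same = rbf_kernel(a, a)      # identical vectors
similar = rbf_kernel(a, b)   # vectors differing in one pixel
```

Identical vectors give a kernel value of exactly 1, and the value falls towards 0 as the vectors move apart; gamma controls how quickly.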

RESULTS

• 10-fold cross-validation accuracy with our method: 98.86%

• Character classification accuracy using Tesseract OCR: 91.47%
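
For readers unfamiliar with 10-fold cross-validation, here is a minimal sketch of the procedure. The shuffling seed, the way indices are dealt into folds, and the `train` and `predict` callables are hypothetical placeholders; the slides do not describe how the authors split their data.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train, predict, k=10):
    """Mean accuracy over k folds; `train` and `predict` are caller-supplied."""
    folds = k_fold_indices(len(samples), k)
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        model = train([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, samples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# With 1343 characters, each fold holds out 134 or 135 of them.
folds = k_fold_indices(1343)
fold_sizes = sorted(len(f) for f in folds)
```

Each character is used for testing exactly once and for training in the other nine rounds, so the reported figure is an average over ten held-out sets.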

SOME LETTERS ARE EXTREMELY RARE

[Bar chart: frequency of each letter in English vs. our data. Vertical axis from 0 to 15; letters on the horizontal axis: e t a o i n s h r d l c u m w f g y p b v k j x q z]
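
A comparison like the one in the chart starts from per-letter counts. The sketch below computes each letter's share of a word list; the word list is hypothetical (the project's 242 words are not reproduced here), and expressing the shares as percentages is an assumption about the chart's vertical axis.

```python
from collections import Counter
import string


def letter_frequencies(words):
    """Percentage share of each lowercase letter across a list of words."""
    counts = Counter(ch for word in words for ch in word if ch in string.ascii_lowercase)
    total = sum(counts.values())
    return {ch: 100 * counts[ch] / total for ch in string.ascii_lowercase}


# Hypothetical word list, for illustration only.
freqs = letter_frequencies(["the", "letters", "are", "easy"])
rare = [ch for ch, pct in freqs.items() if pct == 0]
```

Letters that land in `rare` have no training examples at all, so a classifier trained on such data cannot learn to recognize them.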

NOT ALL FONTS CAN BE EASILY SEPARATED INTO CHARACTERS

WHAT WE LEARNED?

• Installing libraries is the most difficult thing.

• Do not overly restrict yourself.

• Do not build your own OCR system unless it's absolutely necessary.

• The letters are the easy part.

http://kaur.pri.ee/projects/wordocr/
