This document summarizes a student project to build a naïve optical character recognition (OCR) system. The students limited their system to one font from one book, lowercase characters only, and pictures taken with one phone. They collected data from 44 pictures containing 242 words. Their workflow involved preprocessing the images and then extracting an 8x16 pixel feature vector for each character. They used a support vector machine classifier with a radial basis function kernel, achieving 98.86% accuracy in 10-fold cross-validation on their data, compared to 91.47% for the Tesseract OCR engine. Key lessons were that letter frequencies in their data differed significantly from those of general English, that not all fonts separate easily into characters, and that installing libraries was more difficult than expected.
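The feature extraction and kernel mentioned in the summary can be illustrated with a minimal sketch. It is not the students' code: the orientation (8 columns by 16 rows), the nearest-neighbour resampling, the function names `to_feature_vector` and `rbf_kernel`, the `gamma` value, and the sample glyph are all assumptions made for illustration. Only the 8x16 size and the choice of a radial basis function kernel come from the project.

```python
import math

WIDTH, HEIGHT = 8, 16  # assumed orientation: 8 columns by 16 rows


def to_feature_vector(bitmap, width=WIDTH, height=HEIGHT):
    """Resample a 2D list of 0/1 pixels to width x height and flatten it.

    Nearest-neighbour sampling is an assumption of this sketch.
    """
    src_h = len(bitmap)
    src_w = len(bitmap[0])
    features = []
    for row in range(height):
        src_row = row * src_h // height
        for col in range(width):
            src_col = col * src_w // width
            features.append(float(bitmap[src_row][src_col]))
    return features


def rbf_kernel(x, y, gamma=0.01):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# A hypothetical 4x4 glyph: a vertical bar in the second column.
glyph = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
vector = to_feature_vector(glyph)
blank = [0.0] * (WIDTH * HEIGHT)

print(len(vector))                 # one feature per pixel of the 8x16 grid
print(rbf_kernel(vector, vector))  # identical inputs give the maximum similarity
print(rbf_kernel(vector, blank))   # the value shrinks as the inputs differ
```

A support vector machine with this kernel compares each new character's vector against stored training vectors in this way; in practice a library implementation would be used rather than hand-written code.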
5. REALITY
• To increase the likelihood of success, we limited ourselves to:
• One font from one book.
• Only lowercase characters.
• Pictures taken with one phone.
23. SOME LETTERS ARE EXTREMELY RARE
[Bar chart: letter frequencies, English vs. our data; vertical axis from 0 to 15; letters on the horizontal axis: e t a o i n s h r d l c u m w f g y p b v k j x q z]
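A comparison like the one in the chart starts from counting letters in the collected words. The sketch below shows one way to do that, assuming the words are available as a plain list of lowercase strings. The function names `letter_frequencies` and `missing_letters` and the sample words are hypothetical; the project's 242 words are not reproduced here, and no English reference frequencies are included.

```python
from collections import Counter
import string

# Letter order taken from the chart's horizontal axis.
CHART_ORDER = "etaoinshrdlcumwfgypbvkjxqz"


def letter_frequencies(words):
    """Return the percentage share of each lowercase letter in `words`."""
    counts = Counter(
        ch for word in words for ch in word if ch in string.ascii_lowercase
    )
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty input
    return {ch: 100.0 * counts[ch] / total for ch in CHART_ORDER}


def missing_letters(words):
    """Letters that never occur, i.e. classes with no training examples."""
    freqs = letter_frequencies(words)
    return [ch for ch in CHART_ORDER if freqs[ch] == 0.0]


# Hypothetical sample data for illustration only.
sample_words = ["the", "letters", "are", "the", "easy", "part"]
freqs = letter_frequencies(sample_words)

for ch in CHART_ORDER:
    print(ch, round(freqs[ch], 1))
print("never seen:", missing_letters(sample_words))
```

Letters reported as never seen are the ones a classifier trained on such a small sample cannot learn at all, which is the practical consequence of rare letters.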
24. NOT ALL FONTS CAN BE EASILY SEPARATED INTO CHARACTERS
25. WHAT WE LEARNED?
• Installing libraries is the most difficult thing.
• Do not overly restrict yourself.
• Do not build your own OCR system unless it's absolutely necessary.
• The letters are the easy part.
http://kaur.pri.ee/projects/wordocr/