Optical Character Recognition (OCR)
Shobhit Saxena
Amity University
Saxenashobhit1988@gmail.com
Nidhi Sharma
Amity University
nidhi9392@gmail.com
Abstract—Optical character recognition, usually abbreviated to
OCR, is the mechanical or electronic conversion of scanned images
of handwritten, typewritten or printed text into machine-encoded text.
It provides full alphanumeric recognition of printed or handwritten
characters at electronic speed by simply scanning the form. It is
widely used as a form of data entry from an original paper data
source, whether documents, sales receipts, mail, or any number of
printed records.
It is a common method of digitizing printed texts so that they can be
electronically searched, stored more compactly, displayed online,
and used in machine processes such as machine translation,
text-to-speech and text mining. OCR is a field of research in pattern
recognition, artificial intelligence and computer vision. More
recently, the term Intelligent Character Recognition (ICR) has been
used to describe the process of interpreting image data, in particular
alphanumeric text.
I. INTRODUCTION
The area of Optical Character Recognition (OCR) involves
locating the characters in an image and converting them into text
files. The characters in the image cannot be processed as such and
need to be represented in a suitable character encoding.
With computers available at low cost and digital data being
convenient to handle, everyone aims at storing data in digital form.
Hard copies of text documents are stored in digital form by scanning
them as image files. These images do not support text-based
operations such as editing or summarizing, and it is quite a tedious
task to feed this data into computer systems manually. This is where
OCR comes into play.
II. PROBLEMS AND MOTIVATION
Currently, OCR is available as a beta product (a product in the
experimental stage), and research is still being carried out in this
field. OCR employs a part of Artificial Intelligence, which is itself
a topic under research.
The main problem of an OCR system is to interpret the images of
characters correctly. This procedure makes use of pattern
classification algorithms. Several algorithms are available, and
others are being formulated, from which one can be chosen for the
OCR implementation.
III. APPROACH AND GOAL OF THIS PROJECT
The goal of this project is to implement OCR using the LVQ
(Learning Vector Quantization) algorithm. This project uses the
supervised learning approach of pattern classification. First the
image is examined to detect the possible presence of a character,
and if such a region is found, it is associated with a suitable
character code. The main steps involved in OCR are:
A. Pre-processing
The digitized images are usually in gray tone, and for a clear
document, a simple histogram-based threshold approach is sufficient
for converting them to two-tone images. The histogram of gray
values of the pixels shows two prominent peaks, and a middle gray
value located between the peaks is a good choice for the threshold.
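The thresholding step described above can be sketched as follows. This is a minimal illustration, not the project's own code: it assumes 8-bit gray values, assumes that one peak lies in the dark half of the range and the other in the bright half, and the function names histogram_threshold and binarize are hypothetical.

```python
def histogram_threshold(pixels, levels=256):
    """Pick a threshold midway between the two dominant histogram peaks."""
    hist = [0] * levels
    for row in pixels:
        for value in row:
            hist[value] += 1
    # Assumption: the image is bimodal, with one peak in each half of the range.
    mid = levels // 2
    dark_peak = max(range(mid), key=lambda g: hist[g])
    light_peak = max(range(mid, levels), key=lambda g: hist[g])
    return (dark_peak + light_peak) // 2


def binarize(pixels, threshold):
    """Map gray values to 1 (ink) or 0 (background)."""
    return [[1 if value < threshold else 0 for value in row] for row in pixels]
```

For a noisy or unevenly lit page the two-peak assumption may not hold, and a different thresholding method would be needed.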
For salt and pepper noise we generally use a median filter. A median
filter replaces the value of a pixel by the median of the gray levels in
the neighborhood of that pixel (the original value of the pixel is
included in the computation of the median). Median filters provide
excellent noise reduction, with considerably less blurring than linear
smoothing filters of similar size.
(The images illustrated below are only an example)
Fig 1.1 Image with salt and pepper noise
Fig 1.2 Image without salt and pepper noise
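A median filter of the kind described for salt and pepper noise can be sketched as below. This is an illustrative version under stated assumptions: the image is a list of rows of integer gray values, the neighborhood is a square window that is clipped at the image borders, and the name median_filter is hypothetical.

```python
from statistics import median_low


def median_filter(pixels, size=3):
    """Replace each pixel by the median of its size x size neighborhood."""
    height, width = len(pixels), len(pixels[0])
    radius = size // 2
    out = [row[:] for row in pixels]
    for y in range(height):
        for x in range(width):
            # The window includes the pixel itself and is clipped at the borders.
            window = [
                pixels[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # median_low always returns a value that occurs in the window.
            out[y][x] = median_low(window)
    return out
```

Because the output value is always one of the input values, isolated white or black specks are removed without introducing new gray levels.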
B. Segmentation
Segmentation is one of the most important phases of an OCR system.
Good segmentation techniques increase the performance of OCR.
Segmentation subdivides an image into its constituent regions or
objects. In OCR, we try to extract the basic constituents of the
script, which are the characters. This is needed because our
classifier recognizes only these characters. The segmentation phase
is also a significant source of recognition error because of touching
characters, which the classifier cannot handle properly. Even in
good quality documents, some adjacent characters touch each other
because of an inappropriate scanning resolution.
Segmentation of Line: Text lines are detected by horizontal
scanning. We scan the scanned document page horizontally from
the top and find the last row containing only white pixels before a
black pixel is found. Then we find the first row containing only
white pixels just after the end of the black pixels. We repeat this
process over the entire page to find all lines.
Segmentation of Words: After finding a particular line we
separate the individual words. This is done by vertical scanning.
Segmentation of Individual Characters: Once we have the
words, we segment them into individual characters. Before
segmenting a word into individual characters, we locate the
head line. This is done by finding the rows having the maximum
number of black pixels in the word. After locating the head line we
remove it, i.e. convert it to white pixels. After the head line is
removed, the word is divided into three horizontal parts known as
the upper zone, middle zone and lower zone. Individual characters
are separated from each zone by applying vertical scanning.
Fig 1.3 Output of the segmentation
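The horizontal and vertical scanning described in this section can be sketched with projection profiles. The code below is only an illustration under assumptions: the page is a binarized list of rows with 1 for black and 0 for white, words are assumed to be separated by gaps of at least min_gap white columns, and all function names are hypothetical.

```python
def runs(profile):
    """Return (start, end) spans of consecutive non-zero entries; end is exclusive."""
    spans, start = [], None
    for i, count in enumerate(profile):
        if count and start is None:
            start = i
        elif not count and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_lines(page):
    """Horizontal scan: group rows that contain black pixels into text lines."""
    return runs([sum(row) for row in page])


def segment_columns(page, top, bottom):
    """Vertical scan inside one line: group columns that contain black pixels."""
    width = len(page[0])
    return runs([sum(page[y][x] for y in range(top, bottom)) for x in range(width)])


def merge_close(spans, min_gap):
    """Join column spans separated by fewer than min_gap white columns into words."""
    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def head_line_row(page, top, bottom, left, right):
    """Row with the most black pixels inside a word box (the head line)."""
    return max(range(top, bottom), key=lambda y: sum(page[y][left:right]))
```

A typical use would call segment_lines on the page, segment_columns on each line, and merge_close on the resulting spans to obtain word boxes. Touching characters produce a single span and are not split by this sketch.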
Classification
Classification is performed based on the extracted features.
For initial classification of characters, we consider three
features as follows:
• Mean Distance
• Histogram of projection based on pixel value
• Histogram of projection based on spatial position of pixel
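The paper does not define these features precisely, so the following is only one plausible reading: projection histograms taken as the black-pixel counts per row and per column of a binarized glyph, and mean distance taken as the average distance of black pixels from their centroid. Both interpretations and both function names are assumptions.

```python
from math import hypot


def projection_histograms(glyph):
    """Black-pixel counts per row and per column of a binary glyph."""
    rows = [sum(row) for row in glyph]
    cols = [sum(col) for col in zip(*glyph)]
    return rows, cols


def mean_distance(glyph):
    """Average distance of black pixels from their centroid (assumed definition)."""
    points = [(x, y) for y, row in enumerate(glyph) for x, v in enumerate(row) if v]
    if not points:
        return 0.0
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sum(hypot(x - cx, y - cy) for x, y in points) / len(points)
```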
Feature Extraction:
Feature extraction is one of the most important steps in developing a
classification system. This step describes the features we selected
for classifying the segmented characters.
Fig 1.4 Control flow of OCR
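The LVQ classifier named as the goal of this project can be sketched as follows. This is a minimal LVQ1 illustration, not the authors' implementation: feature vectors are plain lists of numbers, one or more labeled prototypes are supplied by the caller, the learning rate decays linearly, and the names nearest, train_lvq and classify are hypothetical.

```python
def nearest(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(
        range(len(prototypes)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(prototypes[k][0], x)),
    )


def train_lvq(samples, prototypes, rate=0.3, epochs=20):
    """LVQ1 training. samples and prototypes are lists of (vector, label)."""
    protos = [(list(vector), label) for vector, label in prototypes]
    for epoch in range(epochs):
        step = rate * (1 - epoch / epochs)  # linearly decaying learning rate
        for x, label in samples:
            weights, proto_label = protos[nearest(protos, x)]
            # Pull the winner toward the sample if the labels agree, else push it away.
            sign = 1 if proto_label == label else -1
            for i in range(len(weights)):
                weights[i] += sign * step * (x[i] - weights[i])
    return protos


def classify(prototypes, x):
    """Label of the prototype nearest to x."""
    return prototypes[nearest(prototypes, x)][1]
```

In an OCR setting the vectors would be the features extracted from each segmented character, and the labels would be the character codes.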
C. Where are we today?
The advent of the array method of scanning, coupled with higher
speeds and more compact computing power, has led to the concept
of "Image Processing". Image processing does not have to utilize
optical recognition to be successful. For example, the ability to
change any document into an electronically digitized item may
effectively replace microfilm devices. This provides the user with
a much more convenient method of sorting images than handling
actual documents or microfilm pictures. Image processing relies on
larger, more complex arrays than early third-generation OCR
scanners.
COMPARISON TABLE OF OCR AND OMR
ITEM                                    OCR  OMR
Handprint recognition                   Y    N
Machine print recognition               Y    N
Recognition of checks and "X"s          Y    Y
Requires timing tracks/form IDs         N    Y
Requires registration marks             Y    N
Electronic image storage and retrieval  Y    N
D. Design objectives:
The design objectives cover the key points to be considered in
designing the software. Some of the important design issues are:
System must be user friendly: A system is of little use unless it
eases the work of its operator. User-friendly systems are easy to
learn and to adapt to.
System must make the task comprehensible: A system must have
clear objectives and should do exactly what it is told to do.
System must be transparent: The working of the system must be
clear to the user so that it can easily be modified and
troubleshot.
2. Acquiring input data:
This section covers the issues involved in feeding data to the system
for processing.
Input data can be acquired by two methods:
Scanning the image: This method involves the use of scanners.
Obtaining a pre-scanned image: In this method an existing image
file is fed to the OCR system.
3. Components of OCR:
3.1 Character Tracer:
This component will locate characters in the image.
3.2 Mean Square Error Recognizer:
Recognizes characters of target image, with the help of training
image.
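A mean square error recognizer of this kind can be sketched as template matching. The sketch assumes that each character cut from the training image has been scaled to the same size as the target glyph and stored in a dictionary keyed by character; the names mean_square_error and recognize are hypothetical.

```python
def mean_square_error(a, b):
    """Mean of squared pixel differences between two equally sized glyphs."""
    diffs = [(p - q) ** 2 for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def recognize(glyph, templates):
    """Return the character whose training template has the lowest MSE."""
    return min(templates, key=lambda ch: mean_square_error(glyph, templates[ch]))
```

For example, with templates for "I" and "L", a target glyph that differs from the "I" template in a single pixel is still recognized as "I", because its error against "L" is larger.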
3.3 Handwriting Recognizer:
This feature learns to recognize the handwriting of an individual. It
has the following subcomponents:
a) Training image:
This image is used as reference image.
b) Configure:
This component selects the recognition process, either the MSE
recognizer or the character aspect ratio analyzer.
c) Process:
This feature processes the target image.
3.4 Validation rules:
Validation rules are used to ensure proper functioning of any system.
Validation rules can be applied on both input as well as output.
Validation rules for the image fed to the OCR system:
Format: The input image should be in a proper format; images
stored as raster graphics are generally easy to process and interpret.
Memory size: The image should be within the memory size limits.
Resolution: The resolution of the image determines its quality and
its dimensions. The image should be of appropriate resolution
(400x400).
Availability: The image must be present at the location specified
and must exist at the time of processing by the system.
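The input checks above can be sketched as a single function. Several details are assumptions made only for illustration: the input is taken to be a PNG file, the size limit MAX_BYTES is arbitrary, and the messages "Image too large" and "Unsupported resolution" are hypothetical additions to the error messages the paper lists.

```python
import os
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MAX_BYTES = 5 * 1024 * 1024  # assumed size limit, not taken from the paper
REQUIRED_SIZE = (400, 400)   # resolution quoted in the text


def validate_input(path):
    """Return None if the image passes all checks, otherwise an error message."""
    if not os.path.isfile(path):               # availability
        return "Image not found"
    if os.path.getsize(path) > MAX_BYTES:      # memory size
        return "Image too large"
    with open(path, "rb") as f:
        header = f.read(24)
    # A PNG file starts with an 8-byte signature followed by the IHDR chunk.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return "Invalid format"
    width, height = struct.unpack(">II", header[16:24])
    if (width, height) != REQUIRED_SIZE:       # resolution
        return "Unsupported resolution"
    return None
```

Returning a message instead of raising an exception keeps the sketch close to the error messages shown to the user; a real system might prefer exceptions.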
3.5 Validation rules for the output:
Once the input is fed properly and is processed, the output should be
validated before presenting it to the user. Some of the common
validation rules for the output are:
Character encoding: The characters of the output file should be
encoded using a proper method. The most commonly used encoding
is Unicode, because it can represent the characters of nearly every
written language.
Number of characters generated: Enough characters must be
recognized to make the output meaningful. The output is of no use
if the OCR system cannot recognize the majority of characters.
Format of output file: The output file should be in a format that
suits the user. The output file is checked for errors and consistency
using checksums that are calculated at various stages of processing.
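The output rules above can be sketched as follows. The paper does not say which encoding form, threshold or checksum it uses, so UTF-8, a minimum recognition ratio of 0.5, a "?" placeholder for unrecognized characters and SHA-256 are all assumptions, as are the function names.

```python
import hashlib


def validate_output(text, expected_chars, min_ratio=0.5, unknown="?"):
    """Return a list of problems found in the recognized text (empty if none)."""
    problems = []
    try:
        text.encode("utf-8")  # character encoding check
    except UnicodeEncodeError:
        problems.append("text cannot be encoded as UTF-8")
    # Count characters that were actually recognized, ignoring whitespace.
    recognized = sum(1 for ch in text if ch != unknown and not ch.isspace())
    if expected_chars and recognized / expected_chars < min_ratio:
        problems.append("too few characters recognized")
    return problems


def checksum(text):
    """Checksum of the output, to be compared between processing stages."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Comparing the checksum computed right after recognition with the one computed after the file is written would reveal corruption introduced in between.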
Error messages:
Error messages are produced to assist users in operating the system.
Errors are violations of the validation rules or unexpected behavior
of the system. Error messages should be simple and descriptive and
must give an overview of the problem that occurred. Typical error
messages in the case of OCR can be:
Image not found: This error message is to be displayed when the
system is unable to locate the image to be processed.
Invalid format: This message can be displayed when the files are
of unknown format or the file header is broken.
Out of memory: This situation arises when the physical memory
is scarce.
3.6 Interfaces:
Interfaces define a way to interact with the system. Characteristics of
a good interface are:
3.7 There are two popular interfaces:
a) Single complete view: In this type of interface the control
switches are displayed in a window all at once.
b) Tabbed view: In tabbed view, only the control switches related to
the selected option are displayed at any one time. Tabbed view is
simpler and has greater clarity. For our project we prefer the tabbed
view.
IMPLEMENTATION
4.1 Character Tracer
4.2 Input-file selection
4.3 Input-image for character tracer
4.4 Character Tracer Output
4.5 Handwriting Recognizer Training
4.6 Handwriting recognizer input image
4.7 Handwriting Recognizer Configuration
5.1 Handwriting recognizer target image
5.2 Handwriting Recognizer processing and output
6. Maintenance
The job of the developer continues even after delivering the product,
in the form of maintenance. Maintenance is necessary to ensure the
proper functioning and allow the system to adapt to ever increasing
needs.
6.1 Types of Maintenance:
a) Fixing
This type of maintenance involves the removal of errors. It can
further be divided into the following types:
Corrective: Corrective maintenance involves the identification
and removal of defects.
Adaptive: Adaptive maintenance involves modifying the software
so that it adapts to changes in the runtime environment, such as a
change in OS, hardware or database.
Since this project is built in Java, it will not need adaptive
maintenance for a change in OS; only the corresponding JVM needs
to be installed on the new host OS.
b) Enhancing:
This type of maintenance involves extending the software's
functionality as demanded. It has the following sub-types:
Perfective: Changes made at the user's request, i.e. when the user
demands specific changes. This may include changes in layout,
GUI etc.
Preventative: This involves making the system more maintainable.
In OCR, enhancing may involve increasing the range of fonts that
can be recognized and the ways in which the recognized text may be
represented.
7.3 Documentation and user’s training.
To make the most of the system, its users have to be made aware of
how to exploit the system’s functionality as fully as possible. The
user needs to be trained in the following areas:
a) Hardware requirements: The end user is the one who has to
interact with the system on a day-to-day basis, and so must be
trained in the hardware issues in order to troubleshoot minor
problems and minimize the risk of damage to the system.
b) Average processing time: The user must be aware of the average
time required by the system to complete its processing, so as to
wait for an appropriate time before instructing the system to do
another job.
c) Proper input methods: Proper input methods are a must for a
system to work efficiently; hence the user needs to provide input
in the desired manner.
Result and conclusion
OCR is the acronym for Optical Character Recognition. This
technology allows a machine to automatically recognize characters
through an optical mechanism. Human beings recognize many
objects in this manner; our eyes are the "optical mechanism." But
while the brain "sees" the input, the ability to comprehend these
signals varies in each person according to many factors. By
reviewing these variables, we can understand the challenges faced by
the technologist developing an OCR system.
First, if we read a page in a language other than our own, we may
recognize the various characters, but be unable to recognize words.
However, on the same page, we are usually able to interpret
numerical statements - the symbols for numbers are universally used.
This explains why many OCR systems recognize numbers only,
while relatively few understand the full alphanumeric character
range.
Second, there is similarity between many numerical and alphabetical
symbol shapes. For example, while examining a string of characters
combining letters and numbers, there is very little visible difference
between a capital letter "O" and the numeral "0." As humans, we can
re-read the sentence or entire paragraph to help us determine the
accurate meaning. This procedure, however, is much more difficult
for a machine.
Third, we rely on contrast to help us recognize characters. We may
find it very difficult to read text which appears against a very dark
background, or is printed over other words or graphics. Again,
programming a system to interpret only the relevant data and
disregard the rest is a difficult task for OCR engineers.
There are many other problems which challenge the developers of
OCR systems. In this paper, we will review the history,
advancements, abilities and limitations of existing systems. This
analysis should help determine if OCR is the correct application for
your company's needs, and if so, which type of system to implement.