Recognition of Handwritten Digits
Shriram Nandakumar, Deepa Naik
Student Numbers: 244935, 232887
nandakum@student.tut.fi, deepa.naik@student.tut.fi
ABSTRACT
In this project, pattern classification for handwritten digit
recognition is performed. Initial pre-processing is applied to
the original gray-scale images of the MNIST database and
features are extracted. Naïve Bayes and Logistic Regression
classifiers are applied and their performances are evaluated
based on percentage accuracy and confusion matrices. The
effect of regularization is investigated for the Logistic
Regression classifier. It is observed that Logistic Regression
significantly outperforms the Naïve Bayes classifier.
1. INTRODUCTION
Pattern recognition is a branch of machine learning that
deals with classification of an object into a correct class
based on measurements about the object [1]. It is an
important problem in various engineering and scientific
disciplines like robotics, stock market analysis, psychology,
medicine and many other fields.
Any pattern recognition system typically consists of the
following five stages [1]:
1) Sensing,
2) Preprocessing,
3) Feature extraction and (or) selection,
4) Classification,
5) Post processing.
Sensing involves measurement or acquisition of data.
Preprocessing refers to filtering and other cleaning
operations performed on the raw data. The amount of data
after preprocessing is usually still huge and is seldom used as such.
The feature extraction step aids in representing this massive data
succinctly by converting it into feature vectors. A further
reduction can be achieved by the optional feature selection
stage. A classifier is trained by using the feature vectors
obtained from training data. The final post-processing stage
decides upon an action based on classification results.
Depending on the type of learning procedure used, pattern
recognition is broadly classified into supervised and
unsupervised learning. In supervised classification, we
present examples of feature vectors along with their correct
classes to teach a classifier. In contrast, in
unsupervised classification, or clustering, there is no explicit
teacher and no labeled training samples. “The classification of the
feature vectors must be based on similarity between them
based on which they are divided into natural groupings” [1].
One of the well-studied applications of pattern recognition
and classification is handwritten character recognition.
Handwritten character recognition finds applications in
ZIP-code recognition, automatic printed-form acquisition and
check reading [2].
In this project, two supervised classification methods are
considered, viz. the Naive Bayes and Logistic Regression.
Both are instances of the statistical approach to classification.
This report is organized as follows. Section 2 presents the
theory behind the methods used. The implementation of the methods
is described in Section 3. Results are presented and discussed in
Section 4. Finally, conclusions are drawn in Section 5.
2. THEORY / BACKGROUND
This section covers the mathematical framework of the
classifiers used.
2.1 Naive Bayes Classification
Naïve Bayes is derived from the classical Bayes theorem.
Given a set of features \mathbf{x} = \{x_1, x_2, x_3, \ldots, x_d\}, the class that maximizes
the posterior probability is chosen. Using Bayes' rule, this can be written as

\hat{y} = \underset{y \in Y}{\arg\max}\; P(y \mid x_1, x_2, \ldots, x_d) = \underset{y \in Y}{\arg\max}\; P(x_1, x_2, \ldots, x_d \mid y)\, P(y)     (1)

where P(y \mid x_1, x_2, \ldots, x_d) is the posterior probability of class y,
P(x_1, x_2, \ldots, x_d \mid y) is the likelihood and P(y) is the prior
probability of class y. The evidence P(x_1, x_2, \ldots, x_d) is omitted because it
does not depend on y and therefore does not affect the maximization.
Naive Bayes additionally assumes that the features are statistically
independent given the class, so the likelihood can be expressed as a
product of per-feature terms:

\hat{y} = \underset{y \in Y}{\arg\max}\; P(y) \prod_{i=1}^{d} P(x_i \mid y)     (2)
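As an illustration, the decision rule in (2) can be implemented directly. The following is a minimal MATLAB sketch assuming Gaussian class-conditional densities (the usual default in MATLAB's Naive Bayes implementation used later in Section 3.3); here x is a 1 × d test vector, prior is a K × 1 vector of class priors, and mu and sigma are K × d matrices of per-class feature means and standard deviations — all variable names are illustrative assumptions, not the project code:

% Minimal sketch of the decision rule (2) with Gaussian likelihoods.
% Log-probabilities are used to avoid numerical underflow with many features.
logPost = log(prior);                                    % log P(y)
for k = 1:numel(prior)
    logPost(k) = logPost(k) + sum(log(normpdf(x, mu(k,:), sigma(k,:))));
end
[~, yHat] = max(logPost);                                % argmax over classes, as in (2)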
2.2 Logistic Regression
Logistic regression can be seen as a counterpart of linear
regression for classification: it is trained iteratively and
minimizes a different cost function. Linear regression for
classification minimizes the following cost function [3]:
E_{tr}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \left(\mathbf{w}^T \mathbf{x}_n - y_n\right)^2     (3)

where E_{tr} is the mean-square training error, \mathbf{w} is the weight
(linear regression coefficient) vector, \mathbf{x}_n is the n-th training
sample, and y_n is the class label of the n-th training sample.
The above minimization has the closed-form solution

\mathbf{w} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}     (4)

where \mathbf{X} is the N × (D+1) matrix whose rows are the N training samples
(each augmented with a constant bias component), D is the dimensionality of
each training sample, and the class label vector \mathbf{y} is a column vector
of length N. Classification is then done as (Fig. 1):

h(\mathbf{x}) = \mathrm{sign}\left(\mathbf{w}^T \mathbf{x}\right)     (5)
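In MATLAB, (4) and (5) amount to one line each; a minimal sketch, assuming X is the N × (D+1) matrix described above (with a constant bias column) and y holds labels in {−1, +1}:

w    = (X' * X) \ (X' * y);   % closed-form solution (4); backslash avoids forming the inverse explicitly
yHat = sign(X * w);           % classification rule (5)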
Logistic regression, on the other hand, minimizes the
following error measure [3]:

E_{in}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n \mathbf{w}^T \mathbf{x}_n}\right)     (6)
The \ln(\cdot) term is often called the cross-entropy error.
Minimizing (6) is equivalent to maximizing the following
likelihood function [3]:
\prod_{n=1}^{N} P(y_n \mid \mathbf{x}_n) = \prod_{n=1}^{N} \theta\left(y_n \mathbf{w}^T \mathbf{x}_n\right) = \prod_{n=1}^{N} \frac{1}{1 + e^{-y_n \mathbf{w}^T \mathbf{x}_n}}     (7)

where \theta(\cdot) is the sigmoidal threshold function and P(y_n \mid \mathbf{x}_n)
is the posterior probability of the class of the n-th training sample.
Indeed, (6) is obtained by taking the natural logarithm of (7),
negating it and dividing by N.
Unlike linear regression for classification, there is no
closed-form solution, and the minimization is typically carried
out by a gradient descent procedure. Also, the logistic
regression model uses a sigmoidal threshold function
(a flattened-out S) rather than the signum function (Fig. 2).

Fig. 1. Linear Regression for Classification [3]
The output of the logistic regression classifier can be
interpreted in a probabilistic sense as the posterior
distribution of the class labels. Hence logistic
regression classifiers are also called soft-threshold
classifiers.
Fig 2. Logistic Regression Model [3]
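A minimal sketch of the gradient-descent training mentioned above, for the two-class case, assuming X is N × (D+1) with a bias column, y is an N × 1 vector of ±1 labels, and the step size eta and iteration count T are hand-picked values (not tuned here):

N   = size(X, 1);
w   = zeros(size(X, 2), 1);
eta = 0.1;  T = 1000;                           % fixed step size and number of iterations
for t = 1:T
    s = y .* (X * w);                           % margins y_n * w' * x_n
    g = -(X' * (y ./ (1 + exp(s)))) / N;        % gradient of the cross-entropy error (6)
    w = w - eta * g;                            % gradient descent update
end
p = 1 ./ (1 + exp(-(X * w)));                   % sigmoid output, interpretable as P(y = +1 | x)

In practice, the step size would be tuned or replaced by a more elaborate scheme, which is one reason a packaged solver (Glmnet, Section 3.4) is used in this project.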
2.3 Regularization
In machine learning and pattern recognition, there are
theoretically infinite ways of solving any problem. Thus it
is important to have an objective criterion for assessing the
accuracy of candidate approaches and for selecting the right
model for a data set at hand.
An extremely simple model will often under-fit the data,
while an extremely complex model over-fits it. The
former can be recognized by a large error during training and
the latter by an extremely small one. An under-fitted model fails to
capture the problem at hand, while an over-fitted model fails to
generalize to unseen inputs.
Regularization is one way of combating the problem of
over-fitting. A regularized linear regression solves the
following constrained minimization problem [3]:
E_{in}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \left(\mathbf{w}^T \mathbf{x}_n - y_n\right)^2 = \frac{1}{N} (\mathbf{X}\mathbf{w} - \mathbf{y})^T (\mathbf{X}\mathbf{w} - \mathbf{y}), \quad \text{subject to } \|\mathbf{w}\|_2^2 \le C, \text{ where } C \text{ is a constant}     (8)

where \|\mathbf{w}\|_2 is the \ell_2 norm of the weight vector.
The above constrained minimization is equivalent to minimizing

E_{in}(\mathbf{w}) + \frac{\lambda}{N} \|\mathbf{w}\|_2^2     (9)
and has the closed-form solution [3]:

\mathbf{w}_{\mathrm{reg}} = \left(\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I}\right)^{-1} \mathbf{X}^T \mathbf{y}     (10)

where \lambda is called the regularization parameter and \mathbf{I} is the identity
matrix. The regularization parameter is tunable and puts a
brake on the \ell_2 norm of the weight vector.
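A minimal sketch of (10), under the same assumptions about X and y as before; setting lambda to 0 recovers the unregularized solution (4):

lambda = 1e-2;                                            % example value of the regularization parameter
wReg = (X' * X + lambda * eye(size(X, 2))) \ (X' * y);    % regularized closed-form solution (10)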
On a similar basis, regularization also applies to logistic
regression, and as in the case of plain logistic regression, the
solution has to be obtained by iterative methods. A
regularized logistic regression classifier maximizes the
following [3]:

J(\mathbf{w}) = \ln \prod_{n=1}^{N} P(y_n \mid \mathbf{x}_n) - \lambda \|\mathbf{w}\|_2^2     (11)

The notation follows the same nomenclature as before.
2.4 Cross-validation
“Validation techniques are motivated by two fundamental
problems in pattern recognition: model selection and
performance estimation. Almost invariably, all pattern
recognition techniques have one or more free parameters”
[5]. Examples include the number of neighbors in a k-nearest-neighbor
classification rule, the number of hidden layers and the learning
parameters in a multi-layer perceptron, and, in our case, the
regularization parameter in logistic regression.
Once a model is chosen, its performance is typically
measured by the true error rate, i.e. the classifier’s error rate on
the entire population. With only a finite set of examples
available, this out-of-sample error has to be estimated as closely
as possible [5].
The simplest cross-validation technique is to split the
training set further into two parts: one used for usual training
and the other for validation. During training, the free
parameters are varied and the performance is measured
with the aid of the validation set. The most commonly followed
approach is K-fold cross-validation, where K experiments
are conducted, each using K−1 folds for training and the remaining
fold for validation. More sophisticated approaches
such as random K-fold sub-sampling are also used; a sketch of the
basic K-fold procedure is given below.
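As an illustration, the K-fold procedure can be sketched as follows in MATLAB, here using the Naive Bayes classifier of Section 3.3 as the model being evaluated (variable names such as TrainData and TrainClass follow Section 3.3; numeric class labels are assumed):

K    = 5;
N    = size(TrainData, 1);
fold = mod(randperm(N), K) + 1;                     % random assignment of the N samples to K folds
acc  = zeros(K, 1);
for k = 1:K
    tr = (fold ~= k);  va = (fold == k);            % K-1 folds for training, one fold for validation
    obj    = NaiveBayes.fit(TrainData(tr, :), TrainClass(tr));
    pred   = obj.predict(TrainData(va, :));
    acc(k) = mean(pred == TrainClass(va));           % validation accuracy of fold k
end
cvAccuracy = mean(acc);                              % estimate averaged over the K experiments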
3. IMPLEMENTATION
This section gives details of the database, preprocessing
steps and classifier implementation.
3.1. Database
The MNIST database of handwritten digits has a training
set of 60,000 examples and a test set of 10,000 examples.
Both the training and the test data consist of gray-level images
of size 28 × 28 [4].
3.2. Preprocessing
The raw 28 × 28 gray-scale images are read and bounding
boxes are created by discarding unnecessary white borders.
The images are then downscaled to size 10 × 10. 100-
dimensional feature vectors are then formed by converting
the 10 × 10 images to 100 × 1 vectors. The above procedure
is done for the images in both training and test set.
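A minimal sketch of this preprocessing for a single image, assuming img is one 28 × 28 gray-scale digit with background pixels equal to zero (imresize requires the Image Processing Toolbox):

[r, c] = find(img > 0);                            % locate non-background pixels
crop   = img(min(r):max(r), min(c):max(c));        % bounding box around the digit
small  = imresize(crop, [10 10]);                  % downscale to 10 x 10
feat   = double(small(:))';                        % 1 x 100 feature vector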
3.3. Naïve Bayes Implementation
The built-in NaiveBayes class of MATLAB™ is used. Training is
done as:
obj = NaiveBayes.fit(TrainData, TrainClass);
NaiveBayes.fit builds a NaiveBayes classifier object obj.
Testing is accomplished by using the obj returned by the
function NaiveBayes.fit as shown below.
PredClass = obj.predict(TestData);
obj.predict returns a vector PredClass of predicted class
labels for TestData.
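As a side note, NaiveBayes.fit has been superseded in more recent MATLAB releases; an equivalent call using the current Statistics and Machine Learning Toolbox API would be (a sketch, not what was used in this project):

obj       = fitcnb(TrainData, TrainClass);         % Gaussian Naive Bayes by default
PredClass = predict(obj, TestData);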
3.4. Logistic Regression Implementation
The MATLAB™ implementation of the Glmnet package
developed at Stanford University is used. Training is
done using glmnet as follows [6]:
model = glmnet(TrainData, TrainClass, 'multinomial');
Care has to be taken that the class labels do not start from
the numeric value 0. The model variables can be accessed as follows
[6]:
model.lambda: a 100-element vector containing all values used for
the penalty parameter λ.
model.beta: a cell array of 10 matrices (one per class), each of size
100 × 100. Each 100-element column holds the coefficients for the
100 pixel features at one value of the parameter λ; there are 100 such
columns, one for each λ on the path.
model.a0: a vector containing the bias constants for the
logistic regression model, one for each value of the parameter λ.
Prediction of labels for test data is done as follows [6]:
yHat = glmnetPredict(model, X_test, lambda, 'class');
For logistic regression without regularization, lambda is
chosen to be the minimum value of model.lambda. A 5-fold cross-
validation is then performed to choose the best value of
lambda, as sketched below.
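The cross-validation itself can be carried out with the same two calls shown above; a minimal sketch of the 5-fold search over the lambda path (the fold construction and variable names are illustrative assumptions, not the exact project code):

K    = 5;
N    = size(TrainData, 1);
fold = mod(randperm(N), K) + 1;                      % assign samples to 5 folds
lams = model.lambda;                                 % candidate lambda values from the fitted path
acc  = zeros(numel(lams), K);
for k = 1:K
    tr = (fold ~= k);  va = (fold == k);
    m  = glmnet(TrainData(tr, :), TrainClass(tr), 'multinomial');
    for j = 1:numel(lams)
        yHat      = glmnetPredict(m, TrainData(va, :), lams(j), 'class');
        acc(j, k) = mean(yHat == TrainClass(va));    % validation accuracy for this lambda and fold
    end
end
[~, best]  = max(mean(acc, 2));                      % lambda with the highest mean validation accuracy
bestLambda = lams(best);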
4. RESULTS AND DISCUSSION
The confusion matrices of the aforementioned classifiers are
shown in Tables 1, 2 and 3. The accuracy of the Naïve Bayes
classifier is found to be 83.2%. There is no significant
difference between the performance of logistic regression without
and with regularization: the former yields an accuracy of
93.66% and the latter 93.7%.
The optimal regularization parameter for logistic regression,
after a 5-fold cross-validation, is found to be λ = 10^{-4} (Fig. 3).
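For reference, the confusion matrices and accuracies reported below can be computed from the predicted and true test labels as in the following sketch (confusionmat is part of the Statistics Toolbox; variable names follow Section 3.3):

C   = confusionmat(TestClass, PredClass);           % rows: true class, columns: predicted class
acc = 100 * sum(diag(C)) / sum(C(:));               % overall accuracy in percent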
TABLE 1.
CONFUSION MATRIX OF NAÏVE BAYES CLASSIFIER
(The rows correspond to true class and the columns correspond to predicted class.
Same applies for Table 2 and Table 3)
No 0 1 2 3 4 5 6 7 8 9
0 940 0 17 3 2 14 8 1 8 4
1 1 910 21 8 34 16 23 75 54 20
2 4 45 847 23 5 3 8 35 27 7
3 0 2 8 873 0 71 1 12 11 9
4 5 39 5 0 842 12 37 20 18 58
5 4 11 12 34 10 705 15 3 35 5
6 12 26 8 0 20 18 857 0 5 0
7 1 37 54 24 0 5 0 821 17 23
8 11 55 48 33 14 23 6 15 688 46
9 2 10 12 12 55 25 3 46 111 837
TABLE 2.
CONFUSION MATRIX OF LOGISTIC REGRESSION (WITHOUT
REGULARIZATION)
No 0 1 2 3 4 5 6 7 8 9
0 962 3 1 2 2 4 10 0 6 1
1 2 1081 16 4 6 11 0 10 20 7
2 0 8 962 15 4 7 6 6 7 3
3 0 4 19 931 0 28 1 6 22 12
4 0 7 3 2 936 3 3 5 10 17
5 7 4 3 25 0 810 13 3 26 9
6 4 5 5 1 10 6 921 0 2 0
7 2 3 5 8 0 6 0 969 6 22
8 3 18 17 16 2 15 4 2 866 10
9 0 2 1 6 22 2 0 27 9 928
TABLE 3.
CONFUSION MATRIX OF LOGISTIC REGRESSION (WITH
REGULARIZATION)
No 0 1 2 3 4 5 6 7 8 9
0 964 3 3 2 2 6 10 0 6 1
1 3 1084 16 4 5 8 0 12 19 6
2 0 8 962 16 4 6 7 6 6 3
3 0 4 16 929 0 28 1 6 21 13
4 0 6 2 1 937 3 3 5 11 19
5 5 4 3 27 0 812 13 3 27 10
6 3 5 6 1 10 7 920 0 2 0
7 2 3 5 8 0 6 0 967 6 20
8 3 16 18 15 2 14 4 3 868 10
9 0 2 1 7 22 2 0 26 8 927
Fig 3. Prediction accuracy of logistic regression as a function of
the regularization parameter (log10 λ).
5. CONCLUSION
In this paper, two methods for the recognition of handwritten
digits, namely Naïve Bayes and logistic regression, are
compared. The results show that logistic regression performs
better than Naïve Bayes. It is also observed
that there is little difference in the performance of
logistic regression with and without regularization.
REFERENCES
1. J. Tohka, Lecture Notes, SGN-2506: Introduction to
Pattern Recognition, Tampere University of Technology.
2. http://tcts.fpms.ac.be/rdf/hcrinuk.htm
3. Y. S. Abu-Mostafa et al., “Learning from Data,” AMLBook,
ISBN: 978-1600490064.
4. The MNIST database of handwritten digits,
http://yann.lecun.com/exdb/mnist/
5. R. Gutierrez-Osuna, Lecture Slides, Intelligent Sensor
Systems, Wright State University.
6. Glmnet for MATLAB, http://web.stanford.edu/~hastie/glmnet_matlab/