Recognition of Handwritten Digits
Shriram Nandakumar, Deepa Naik
Student Numbers: 244935, 232887
nandakum@student.tut.fi, deepa.naik@student.tut.fi
ABSTRACT
In this project, pattern classification for handwritten digit
recognition is performed. Initial pre-processing is applied to
the original gray-scale images of the MNIST database and
features are extracted. Naïve Bayes and Logistic Regression
classifiers are applied and their performances are evaluated
based on percentage accuracy and confusion matrices. The
effect of regularization is investigated for the Logistic
Regression classifier. It is observed that Logistic Regression
significantly outperforms the Naïve Bayes classifier.
1. INTRODUCTION
Pattern recognition is a branch of machine learning that
deals with classification of an object into a correct class
based on measurements about the object [1]. It is an
important problem in various engineering and scientific
disciplines like robotics, stock market analysis, psychology,
medicine and many other fields.
Any pattern recognition system typically consists of the
following five stages [1]:
1) Sensing,
2) Preprocessing,
3) Feature extraction and (or) selection,
4) Classification,
5) Post processing.
Sensing involves measurement or acquisition of data.
Preprocessing refers to filtering and other cleaning
operations performed on the raw data. The amount of data
after preprocessing is usually still huge and is seldom used as such.
The feature extraction step aids in representing this massive data
succinctly by converting it into feature vectors. A further
reduction can be achieved by the optional feature selection
stage. A classifier is trained by using the feature vectors
obtained from training data. The final post-processing stage
decides upon an action based on classification results.
Depending on the type of learning procedure used, pattern
recognition is broadly classified into supervised and
unsupervised learning. In supervised classification, we
present examples of feature vectors along with their correct
classes to teach a classifier. In contrast, in
unsupervised classification, or clustering, there is no explicit
teacher and no labeled training samples. “The classification of the
feature vectors must be based on similarity between them
based on which they are divided into natural groupings” [1].
One of the well-studied applications of pattern recognition
and classification is handwritten character recognition.
Handwritten character recognition finds applications in
ZIP-code recognition, automatic printed-form acquisition and
check reading [2].
In this project, two supervised classification methods are
considered, viz. the Naive Bayes and Logistic Regression.
Both are instances of the statistical approach to classification.
This report is organized as follows. Section 2 presents the
theory behind the methods used. The implementation of the methods
is described in Section 3. Results are presented and discussed in
Section 4. Finally, conclusions are drawn in Section 5.
2. THEORY / BACKGROUND
This section covers the mathematical framework of the
classifiers used.
2.1 Naive Bayes Classification
Naïve Bayes is derived from the classical Bayes theorem.
Given a set of features \mathbf{x} = \{x_1, x_2, x_3, \ldots, x_d\}, the class that maximizes
the posterior probability is chosen. Using Bayes' rule, this can be written as

\hat{y} = \underset{y \in Y}{\arg\max}\; P(y \mid x_1, x_2, \ldots, x_d) = \underset{y \in Y}{\arg\max}\; P(x_1, x_2, \ldots, x_d \mid y)\, P(y)     (1)

where P(y \mid x_1, x_2, \ldots, x_d) is the posterior probability of class y,
P(x_1, x_2, \ldots, x_d \mid y) is the likelihood and P(y) is the prior
probability of class y. The evidence P(x_1, x_2, \ldots, x_d) is omitted because it
does not depend on y and therefore does not affect the maximization.
Naive Bayes additionally assumes that the features are statistically
independent given the class, so the likelihood can be expressed as a
product of per-feature terms:

\hat{y} = \underset{y \in Y}{\arg\max}\; P(y) \prod_{i=1}^{d} P(x_i \mid y)     (2)
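As an illustration, the decision rule in (2) can be implemented directly. The following is a minimal MATLAB sketch assuming Gaussian class-conditional densities (the usual default in MATLAB's Naive Bayes implementation used later in Section 3.3); here x is a 1 × d test vector, prior is a K × 1 vector of class priors, and mu and sigma are K × d matrices of per-class feature means and standard deviations — all variable names are illustrative assumptions, not the project code:

% Minimal sketch of the decision rule (2) with Gaussian likelihoods.
% Log-probabilities are used to avoid numerical underflow with many features.
logPost = log(prior);                                    % log P(y)
for k = 1:numel(prior)
    logPost(k) = logPost(k) + sum(log(normpdf(x, mu(k,:), sigma(k,:))));
end
[~, yHat] = max(logPost);                                % argmax over classes, as in (2)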
2.2 Logistic Regression
Logistic regression can be seen as a counterpart of linear
regression for classification: it is trained iteratively and
minimizes a different cost function. Linear regression for
classification minimizes the following cost function [3]:
E_{tr}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \left(\mathbf{w}^T \mathbf{x}_n - y_n\right)^2     (3)

where E_{tr} is the mean-square training error, \mathbf{w} is the weight
(linear regression coefficient) vector, \mathbf{x}_n is the n-th training
sample, and y_n is the class label of the n-th training sample.
The above minimization has the closed-form solution

\mathbf{w} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}     (4)

where \mathbf{X} is the N × (D+1) matrix whose rows are the N training samples
(each augmented with a constant bias component), D is the dimensionality of
each training sample, and the class label vector \mathbf{y} is a column vector
of length N. Classification is then done as (Fig. 1):

h(\mathbf{x}) = \mathrm{sign}\left(\mathbf{w}^T \mathbf{x}\right)     (5)
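In MATLAB, (4) and (5) amount to one line each; a minimal sketch, assuming X is the N × (D+1) matrix described above (with a constant bias column) and y holds labels in {−1, +1}:

w    = (X' * X) \ (X' * y);   % closed-form solution (4); backslash avoids forming the inverse explicitly
yHat = sign(X * w);           % classification rule (5)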
Logistic regression, on the other hand, minimizes the
following error measure [3]:

E_{in}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n \mathbf{w}^T \mathbf{x}_n}\right)     (6)
The \ln(\cdot) term is often called the cross-entropy error.
Minimizing (6) is equivalent to maximizing the following
likelihood function [3]:
\prod_{n=1}^{N} P(y_n \mid \mathbf{x}_n) = \prod_{n=1}^{N} \theta\left(y_n \mathbf{w}^T \mathbf{x}_n\right) = \prod_{n=1}^{N} \frac{1}{1 + e^{-y_n \mathbf{w}^T \mathbf{x}_n}}     (7)

where \theta(\cdot) is the sigmoidal threshold function and P(y_n \mid \mathbf{x}_n)
is the posterior probability of the class of the n-th training sample.
Indeed, (6) is obtained by taking the natural logarithm of (7),
negating it and dividing by N.
Unlike linear regression for classification, there is no
closed-form solution, and the minimization is typically carried
out by a gradient descent procedure. Also, the logistic
regression model uses a sigmoidal threshold function
(a flattened-out S) rather than the signum function (Fig. 2).

Fig. 1. Linear Regression for Classification [3]
The output of the logistic regression classifier can be
interpreted in a probabilistic sense as the posterior
distribution of the class labels. Hence logistic
regression classifiers are also called soft-threshold
classifiers.
Fig 2. Logistic Regression Model [3]
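A minimal sketch of the gradient-descent training mentioned above, for the two-class case, assuming X is N × (D+1) with a bias column, y is an N × 1 vector of ±1 labels, and the step size eta and iteration count T are hand-picked values (not tuned here):

N   = size(X, 1);
w   = zeros(size(X, 2), 1);
eta = 0.1;  T = 1000;                           % fixed step size and number of iterations
for t = 1:T
    s = y .* (X * w);                           % margins y_n * w' * x_n
    g = -(X' * (y ./ (1 + exp(s)))) / N;        % gradient of the cross-entropy error (6)
    w = w - eta * g;                            % gradient descent update
end
p = 1 ./ (1 + exp(-(X * w)));                   % sigmoid output, interpretable as P(y = +1 | x)

In practice, the step size would be tuned or replaced by a more elaborate scheme, which is one reason a packaged solver (Glmnet, Section 3.4) is used in this project.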
2.3 Regularization
In machine learning and pattern recognition, there are
theoretically infinite ways of solving any problem. Thus it
is important to have an objective criterion for assessing the
accuracy of candidate approaches and for selecting the right
model for a data set at hand.
An extremely simple model will often under-fit the data,
while an extremely complex model over-fits it. The
former can be recognized by a large error during training and
the latter by an extremely small one. An under-fitted model fails to
capture the problem at hand, while an over-fitted model fails to
generalize to unseen inputs.
Regularization is one way of combating the problem of
over-fitting. A regularized linear regression solves the
following constrained minimization problem [3]:
E_{in}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \left(\mathbf{w}^T \mathbf{x}_n - y_n\right)^2 = \frac{1}{N} (\mathbf{X}\mathbf{w} - \mathbf{y})^T (\mathbf{X}\mathbf{w} - \mathbf{y}), \quad \text{subject to } \|\mathbf{w}\|_2^2 \le C, \text{ where } C \text{ is a constant}     (8)

where \|\mathbf{w}\|_2 is the \ell_2 norm of the weight vector.
The above constrained minimization is equivalent to minimizing

E_{in}(\mathbf{w}) + \frac{\lambda}{N} \|\mathbf{w}\|_2^2     (9)
and has the closed-form solution [3]:

\mathbf{w}_{\mathrm{reg}} = \left(\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I}\right)^{-1} \mathbf{X}^T \mathbf{y}     (10)

where \lambda is called the regularization parameter and \mathbf{I} is the identity
matrix. The regularization parameter is tunable and puts a
brake on the \ell_2 norm of the weight vector.
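A minimal sketch of (10), under the same assumptions about X and y as before; setting lambda to 0 recovers the unregularized solution (4):

lambda = 1e-2;                                            % example value of the regularization parameter
wReg = (X' * X + lambda * eye(size(X, 2))) \ (X' * y);    % regularized closed-form solution (10)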
On a similar basis, regularization also applies to logistic
regression, and as in the case of plain logistic regression, the
solution has to be obtained by iterative methods. A
regularized logistic regression classifier maximizes the
following [3]:

J(\mathbf{w}) = \ln \prod_{n=1}^{N} P(y_n \mid \mathbf{x}_n) - \lambda \|\mathbf{w}\|_2^2     (11)

The notation follows the same nomenclature as before.
2.4 Cross-validation
“Validation techniques are motivated by two fundamental
problems in pattern recognition: model selection and
performance estimation. Almost invariably, all pattern
recognition techniques have one or more free parameters”
[5]. Examples include the number of neighbors in a k-nearest-neighbor
classification rule, the number of hidden layers and the learning
parameters in a multi-layer perceptron, and, in our case, the
regularization parameter in logistic regression.
Once a model is chosen, its performance is typically
measured by the true error rate, i.e. the classifier’s error rate on
the entire population. With only a finite set of examples
available, this out-of-sample error has to be estimated as closely
as possible [5].
The simplest cross-validation technique is to split the
training set further into two parts: one used for usual training
and the other for validation. During training, the free
parameters are varied and the performance is measured
with the aid of the validation set. The most commonly followed
approach is K-fold cross-validation, where K experiments
are conducted, each using K−1 folds for training and the remaining
fold for validation. More sophisticated approaches
such as random K-fold sub-sampling are also used; a sketch of the
basic K-fold procedure is given below.
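As an illustration, the K-fold procedure can be sketched as follows in MATLAB, here using the Naive Bayes classifier of Section 3.3 as the model being evaluated (variable names such as TrainData and TrainClass follow Section 3.3; numeric class labels are assumed):

K    = 5;
N    = size(TrainData, 1);
fold = mod(randperm(N), K) + 1;                     % random assignment of the N samples to K folds
acc  = zeros(K, 1);
for k = 1:K
    tr = (fold ~= k);  va = (fold == k);            % K-1 folds for training, one fold for validation
    obj    = NaiveBayes.fit(TrainData(tr, :), TrainClass(tr));
    pred   = obj.predict(TrainData(va, :));
    acc(k) = mean(pred == TrainClass(va));           % validation accuracy of fold k
end
cvAccuracy = mean(acc);                              % estimate averaged over the K experiments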
3. IMPLEMENTATION
This section gives details of the database, preprocessing
steps and classifier implementation.
3.1. Database
The MNIST database of handwritten digits has a training
set of 60,000 examples and a test set of 10,000 examples.
Both the training and the test data consist of gray-level images
of size 28 × 28 [4].
3.2. Preprocessing
The raw 28 × 28 gray-scale images are read and bounding
boxes are created by discarding unnecessary white borders.
The images are then downscaled to size 10 × 10. 100-
dimensional feature vectors are then formed by converting
the 10 × 10 images to 100 × 1 vectors. The above procedure
is done for the images in both training and test set.
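A minimal sketch of this preprocessing for a single image, assuming img is one 28 × 28 gray-scale digit with background pixels equal to zero (imresize requires the Image Processing Toolbox):

[r, c] = find(img > 0);                            % locate non-background pixels
crop   = img(min(r):max(r), min(c):max(c));        % bounding box around the digit
small  = imresize(crop, [10 10]);                  % downscale to 10 x 10
feat   = double(small(:))';                        % 1 x 100 feature vector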
3.3. Naïve Bayes Implementation
The built-in NaiveBayes class of MATLAB™ is used. Training is
done as:
obj = NaiveBayes.fit(TrainData, TrainClass);
NaiveBayes.fit builds a NaiveBayes classifier object obj.
Testing is accomplished by using the obj returned by the
function NaiveBayes.fit as shown below.
PredClass = obj.predict(TestData);
obj.predict returns a vector PredClass of predicted class
labels for TestData.
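As a side note, NaiveBayes.fit has been superseded in more recent MATLAB releases; an equivalent call using the current Statistics and Machine Learning Toolbox API would be (a sketch, not what was used in this project):

obj       = fitcnb(TrainData, TrainClass);         % Gaussian Naive Bayes by default
PredClass = predict(obj, TestData);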
3.4. Logistic Regression Implementation
The MATLAB™ implementation of the Glmnet package
developed at Stanford University is used. Training is
done using glmnet as follows [6]:
model = glmnet(TrainData, TrainClass, 'multinomial');
Care has to be taken that the class labels do not start from
the numeric value 0. The model variables can be accessed as follows
[6]:
model.lambda: a 100-element vector containing all values used for
the penalty parameter λ.
model.beta: a cell array of 10 matrices (one per class), each of size
100 × 100. Each 100-element column holds the coefficients for the
100 pixel features at one value of the parameter λ; there are 100 such
columns, one for each λ on the path.
model.a0: a vector containing the bias constants for the
logistic regression model, one for each value of the parameter λ.
Prediction of labels for test data is done as follows [6]:
yHat = glmnetPredict(model, X_test, lambda, 'class');
For logistic regression without regularization, lambda is
chosen to be the minimum value of model.lambda. A 5-fold cross-
validation is then performed to choose the best value of
lambda, as sketched below.
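The cross-validation itself can be carried out with the same two calls shown above; a minimal sketch of the 5-fold search over the lambda path (the fold construction and variable names are illustrative assumptions, not the exact project code):

K    = 5;
N    = size(TrainData, 1);
fold = mod(randperm(N), K) + 1;                      % assign samples to 5 folds
lams = model.lambda;                                 % candidate lambda values from the fitted path
acc  = zeros(numel(lams), K);
for k = 1:K
    tr = (fold ~= k);  va = (fold == k);
    m  = glmnet(TrainData(tr, :), TrainClass(tr), 'multinomial');
    for j = 1:numel(lams)
        yHat      = glmnetPredict(m, TrainData(va, :), lams(j), 'class');
        acc(j, k) = mean(yHat == TrainClass(va));    % validation accuracy for this lambda and fold
    end
end
[~, best]  = max(mean(acc, 2));                      % lambda with the highest mean validation accuracy
bestLambda = lams(best);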
4. RESULTS AND DISCUSSION
The confusion matrices of the aforementioned classifiers are
shown in Tables 1, 2 and 3. The accuracy of the Naïve Bayes
classifier is found to be 83.2%. There is no significant
difference between the performance of logistic regression without
and with regularization: the former yields an accuracy of
93.66% and the latter 93.7%.
The optimal regularization parameter for logistic regression,
after a 5-fold cross-validation, is found to be λ = 10^{-4} (Fig. 3).
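For reference, the confusion matrices and accuracies reported below can be computed from the predicted and true test labels as in the following sketch (confusionmat is part of the Statistics Toolbox; variable names follow Section 3.3):

C   = confusionmat(TestClass, PredClass);           % rows: true class, columns: predicted class
acc = 100 * sum(diag(C)) / sum(C(:));               % overall accuracy in percent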
TABLE 1.
CONFUSION MATRIX OF NAÏVE BAYES CLASSIFIER
(The rows correspond to true class and the columns correspond to predicted class.
Same applies for Table 2 and Table 3)
No 0 1 2 3 4 5 6 7 8 9
0 940 0 17 3 2 14 8 1 8 4
1 1 910 21 8 34 16 23 75 54 20
2 4 45 847 23 5 3 8 35 27 7
3 0 2 8 873 0 71 1 12 11 9
4 5 39 5 0 842 12 37 20 18 58
5 4 11 12 34 10 705 15 3 35 5
6 12 26 8 0 20 18 857 0 5 0
7 1 37 54 24 0 5 0 821 17 23
8 11 55 48 33 14 23 6 15 688 46
9 2 10 12 12 55 25 3 46 111 837
TABLE 2.
CONFUSION MATRIX OF LOGISTIC REGRESSION (WITHOUT
REGULARIZATION)
No 0 1 2 3 4 5 6 7 8 9
0 962 3 1 2 2 4 10 0 6 1
1 2 1081 16 4 6 11 0 10 20 7
2 0 8 962 15 4 7 6 6 7 3
3 0 4 19 931 0 28 1 6 22 12
4 0 7 3 2 936 3 3 5 10 17
5 7 4 3 25 0 810 13 3 26 9
6 4 5 5 1 10 6 921 0 2 0
7 2 3 5 8 0 6 0 969 6 22
8 3 18 17 16 2 15 4 2 866 10
9 0 2 1 6 22 2 0 27 9 928
TABLE 3.
CONFUSION MATRIX OF LOGISTIC REGRESSION (WITH
REGULARIZATION)
No 0 1 2 3 4 5 6 7 8 9
0 964 3 3 2 2 6 10 0 6 1
1 3 1084 16 4 5 8 0 12 19 6
2 0 8 962 16 4 6 7 6 6 3
3 0 4 16 929 0 28 1 6 21 13
4 0 6 2 1 937 3 3 5 11 19
5 5 4 3 27 0 812 13 3 27 10
6 3 5 6 1 10 7 920 0 2 0
7 2 3 5 8 0 6 0 967 6 20
8 3 16 18 15 2 14 4 3 868 10
9 0 2 1 7 22 2 0 26 8 927
Fig 3. Prediction accuracy of logistic regression as a function of
the regularization parameter (log10 λ).
5. CONCLUSION
In this paper, two methods for the recognition of handwritten
digits, namely Naïve Bayes and logistic regression, are
compared. The results show that logistic regression performs
better than Naïve Bayes. It is also observed
that there is little difference in the performance of
logistic regression with and without regularization.
REFERENCES
1. J. Tohka, Lecture Notes, SGN-2506: Introduction to
Pattern Recognition, Tampere University of Technology.
2. http://tcts.fpms.ac.be/rdf/hcrinuk.htm
3. Y. S. Abu-Mostafa et al., “Learning from Data,” AMLBook,
ISBN: 978-1600490064.
4. The MNIST database of handwritten digits,
http://yann.lecun.com/exdb/mnist/
5. R. Gutierrez-Osuna, Lecture Slides, Intelligent Sensor
Systems, Wright State University.
6. Glmnet for MATLAB, http://web.stanford.edu/~hastie/glmnet_matlab/