Indian Institute of Technology Jodhpur
Computer Science and Engineering
Sixth Semester (2015-2016)
Machine Learning (Building and comparing various machine learning
models to recognize handwritten digits)
Team Members: Shrey Maheshwari (ug201314017)
Ravi Prakash Gupta (ug201310027)
1 Introduction
The data file contains grayscale images of hand-drawn digits, from zero through
nine. Each image is 16 pixels high and 16 pixels wide, for a total of 256
pixels. Each pixel has a single pixel value indicating the lightness or
darkness of that pixel. Each image is single-channel with 8-bit depth, so this
pixel value is an integer between 0 and 255, inclusive.
We binarized the images as follows: value = 1 if the pixel value > 127, and
value = 0 otherwise. Previously each pixel value took 8 bits; now each pixel
value takes only 1 bit, so one image takes only 256 bits.
The data set (train.csv) has 266 columns. The first 256 columns are the pixel
values and the other 10 indicate the label, i.e. the digit that was drawn by
the user. We divided our data into two sets:
1. Training data, which comprises 80% of the data.
2. Test data, which comprises 20% of the data.
Figure 1: Data.
Figure 1 shows the data.
The test data set (test.csv) is the same as the training set, except that
it does not contain the "label" column.
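The preprocessing described above can be sketched in a few lines of numpy. This is only an illustration under stated assumptions: the function names `binarize` and `split_rows` are hypothetical, the rows are shuffled before splitting (the report does not say whether they were), and a small random array stands in for train.csv.

```python
import numpy as np

def binarize(pixels, threshold=127):
    """Map 8-bit pixel values to 1 where value > threshold, else 0."""
    return (np.asarray(pixels) > threshold).astype(np.uint8)

def split_rows(data, train_fraction=0.8, seed=0):
    """Shuffle the rows (an assumption) and split them into two parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    cut = int(train_fraction * len(data))
    return data[order[:cut]], data[order[cut:]]

# Stand-in for the real file: 10 rows of 256 pixels plus 10 label columns.
rng = np.random.default_rng(1)
pixels = binarize(rng.integers(0, 256, size=(10, 256)))
labels = np.eye(10, dtype=np.uint8)   # one row per digit, one-hot encoded
data = np.hstack([pixels, labels])    # 266 columns, as in the report
train, test = split_rows(data)        # 80% / 20% of the rows
```

With 10 rows, the split yields 8 training rows and 2 test rows, each with 266 columns.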
Figure 2: Visualization of data
Classification is the process of assigning new data to a category based on
training data in known categories. In this paper, we use a number of
human-identified digit images split into a training set and a test set. A
classifier learns from the training images and labels and produces output for
the test images. The output is then compared with the test labels to evaluate
the classification performance. A good classifier should be able to learn from
the training data while still generalizing well enough to be accurate on the
test set.
2 Theory
The given problem falls under the category of supervised learning. Supervised
learning is the machine learning task of inferring a function from labelled
training data. The training data consist of a set of training examples. In
supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory
signal). Our problem is a multiclass classification problem. To solve it we
used logistic regression. Logistic regression is a regression model where the
dependent variable (DV) is categorical. The logistic function is defined as
follows:
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
Figure 3: Logistic Function.
Figure 3 shows the Logistic Function.
The range of the logistic function is (0, 1), so each prediction falls in
(0, 1) and can be read as the probability that the input is the digit for
which that logistic regression model was trained. Our final answer is the
index at which this probability is maximum.
We applied the gradient descent algorithm because it is simple to implement.
Gradient descent is a first-order optimization algorithm. To find a local
minimum of a function using gradient descent, one takes steps proportional to
the negative of the gradient (or of the approximate gradient) of the function
at the current point. If instead one takes steps proportional to the positive
of the gradient, one approaches a local maximum of that function; the
procedure is then known as gradient ascent. We used around 500 iterations to
reach saturation, after which the error was no longer decreasing much. We
plotted the error against the number of iterations to check that the error was
always decreasing. Had that not been the case, we would have reduced our
learning rate.
We used regularization to ensure that our model does not overfit the training
data. Regularization, in mathematics and statistics and particularly in the
fields of machine learning and inverse problems, refers to a process of
introducing additional information in order to solve an ill-posed problem or
to prevent overfitting. In general, a regularization term R(f) is added to a
general loss function:
$$\min_f \sum_{i=1}^{n} V(f(\hat{x}_i), \hat{y}_i) + \lambda R(f)$$
for a loss function V that describes the cost of predicting f(x) when the label
is y, such as the square loss or hinge loss, and for the term λ, which controls
the importance of the regularization term. R(f) is typically a penalty on the
complexity of f, such as restrictions for smoothness or bounds on the vector
space norm.
There are 10 labels, but logistic regression is a binary classifier, so we
need to train 10 logistic regression classifiers, one per digit. Then we
applied the one-vs-all method to choose the final label.
The one-vs.-all strategy involves training a single classifier per class, with
the samples of that class as positive samples and all other samples as
negatives. This strategy requires the base classifiers to produce a
real-valued confidence score for their decisions, rather than just a class
label; discrete class labels alone can lead to ambiguities, where multiple
classes are predicted for a single sample.
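The ambiguity mentioned above can be made concrete with a small numpy sketch. The score values below are invented for illustration and are not results from this project; the 0.5 cut-off used for the hard decisions is likewise an assumption.

```python
import numpy as np

# Hypothetical confidence scores from 10 one-vs-all classifiers, two samples.
scores = np.array([
    [0.05, 0.10, 0.02, 0.81, 0.03, 0.07, 0.01, 0.04, 0.62, 0.02],
    [0.20, 0.30, 0.10, 0.15, 0.25, 0.05, 0.12, 0.08, 0.18, 0.22],
])

# Discrete labels only: each classifier says "mine" when its score >= 0.5.
hard = scores >= 0.5
claims_per_sample = hard.sum(axis=1)   # how many classes claim each sample

# Real-valued scores: take the single most confident classifier per sample.
predicted = scores.argmax(axis=1)
```

With hard decisions the first sample is claimed by two classes (3 and 8) and the second by none, whereas taking the maximum score gives exactly one answer per sample (3 and 1 respectively).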
3 Implementation (Data Structures and Algorithms)
First we initialized our learning parameter, which we denote by θ, to all
zeros. Our hypothesis was the sigmoid of Xθ, where X is the training data and
θ is the learned parameter. We used the sigmoid function because its output
lies between 0 and 1. We wanted the output of the hypothesis to lie between 0
and 1 because this is a multiclass problem that we are converting into 10
binary classifiers. The sigmoid function is
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
Its range is (0, 1).
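We defined our cost function, which measures the error between our prediction
and the actual value, as
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$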
It is a convex function, so gradient descent cannot get stuck in a local
optimum. The initial value of the error for the 10 classifiers was
array([ 0.69314718, 0.69314718, 0.69314718, 0.69314718, 0.69314718,
0.69314718, 0.69314718, 0.69314718, 0.69314718, 0.69314718])
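The initial value 0.69314718 is ln 2: with θ = 0 every hypothesis output is 0.5, so the cross-entropy cost is −log(0.5) regardless of the data. The sketch below checks this numerically. It assumes the unregularized cost and uses random stand-in data; the names `sigmoid` and `cost` and the array shapes are illustrative choices, not the project's actual code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, Y):
    """Mean cross-entropy, one value per classifier.

    X: (m, n) inputs, theta: (n, k) parameters, Y: (m, k) one-hot labels.
    """
    H = sigmoid(X @ theta)
    return -np.mean(Y * np.log(H) + (1 - Y) * np.log(1 - H), axis=0)

m, n, k = 20, 256, 10
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(m, n)).astype(float)   # binarized pixels
Y = np.eye(k)[rng.integers(0, k, size=m)]           # one-hot digit labels
theta = np.zeros((n, k))                            # all-zero initialization
initial = cost(theta, X, Y)                         # ten values, each ln 2
```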
We initialized the learning rate with some value. A high initial value caused
problems, so we decreased it until the problem was solved. Finding the best
learning rate is a matter of trial and error. Then we implemented the loop in
which we updated our learning parameter as
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
In every iteration we stored the value of the cost error to check that it
decreased with every iteration. Initially the cost error did not decrease
continuously; it sometimes increased in between. We found that the learning
rate was too high, which caused overshooting. So we decreased the learning
rate and set it to the maximum value at which the cost error did not
overshoot.
Then we plotted the cost error against the number of iterations to check that
the function is strictly decreasing and to see whether the error had
saturated. After 200 iterations the error had still not saturated, so we
increased the number of iterations. We found the best number of iterations by
trial and error and finally decided to keep it at 500.
Figure 4: Cost Error Function.
Figure 4 shows the Cost Error Function.
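The training loop and the cost-history check described above can be sketched as follows for a single binary classifier. This is a minimal illustration, not the project's code: it omits the regularization term and any intercept, uses a small synthetic problem instead of the digit data, and the function names are hypothetical. The defaults α = 0.5 and 500 iterations mirror the values reported in this section.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_descent(X, y, alpha=0.5, iterations=500):
    """Batch gradient descent; returns theta and the cost after each step."""
    m, n = X.shape
    theta = np.zeros(n)                  # all-zero initialization
    history = []                         # Python list of cost values
    for _ in range(iterations):
        gradient = X.T @ (sigmoid(X @ theta) - y) / m
        theta = theta - alpha * gradient
        history.append(cost(theta, X, y))
    return theta, history

# Synthetic stand-in: 200 binary feature vectors with a simple target.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 16)).astype(float)
y = (X[:, 0] + X[:, 1] >= 1).astype(float)

theta, history = gradient_descent(X, y)
# The check done by eye on the plot: the cost never goes up.
is_decreasing = all(b <= a for a, b in zip(history, history[1:]))
```

If `is_decreasing` were false, the remedy described in the text is to lower `alpha` and rerun.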
Our hypothesis is built on the linear model x1θ1 + x2θ2 + x3θ3 + x4θ4 + ...
We then predicted the output on the test data by multiplying the matrix of
test data by the learned parameters. We obtained 10 outputs for every new
example, so the output for each test example is a vector of 10 values. The
value at each index indicates the probability that the test example is the
digit equal to that index. So we choose the index with the maximum value as
our final answer.
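The prediction step above, followed by an accuracy measurement against one-hot labels, might look like the sketch below. The parameter matrix and test data are hand-made toy values chosen so that the result is easy to follow; `predict_digits` and `accuracy` are hypothetical helper names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_digits(X_test, theta):
    """Index of the most confident of the 10 classifiers, per test row."""
    probabilities = sigmoid(X_test @ theta)   # shape (m, 10)
    return probabilities.argmax(axis=1)

def accuracy(predicted, one_hot_labels):
    """Fraction of rows whose predicted digit matches the one-hot label."""
    actual = one_hot_labels.argmax(axis=1)
    return float(np.mean(predicted == actual))

# Toy setup: 10 features, feature d switched on means "digit d".
theta = 5 * np.eye(10) - 2        # hand-made parameters, not learned
X_test = np.eye(10)               # one test example per digit
labels = np.eye(10)
labels[9] = np.eye(10)[0]         # deliberately mislabel the last example

predicted = predict_digits(X_test, theta)
acc = accuracy(predicted, labels)  # 9 of the 10 rows match
```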
In sensitive applications such as cheque reading in banks, we cannot afford to
make a single mistake, so a different approach is needed. If all 10 values of
our prediction are less than some threshold (say 0.7), i.e. none of the models
is confident about the prediction, we can output "not able to recognize" so
that the case is handled manually.
Then we tried different combinations of learning rate and number of iterations
to achieve the best accuracy on the test data.
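The rejection rule for sensitive applications (no score reaching the threshold, say 0.7) can be sketched as below. The sentinel value -1 for "not able to recognize", the function name, and the example scores are assumptions made for illustration.

```python
import numpy as np

def predict_or_reject(probabilities, threshold=0.7):
    """Most confident digit per row, or -1 when every score is below threshold."""
    probabilities = np.asarray(probabilities)
    best = probabilities.argmax(axis=1)
    confident = probabilities.max(axis=1) >= threshold
    return np.where(confident, best, -1)

# Hypothetical scores: the first row is confident, the second is not.
scores = np.array([
    [0.01, 0.02, 0.95, 0.01, 0.03, 0.02, 0.01, 0.04, 0.02, 0.01],
    [0.10, 0.55, 0.05, 0.08, 0.02, 0.03, 0.60, 0.04, 0.06, 0.07],
])
result = predict_or_reject(scores)
```

Here the first row is accepted as digit 2, while the second row (best score 0.60) is returned as -1 and would be passed on for manual handling.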
We finally set the number of iterations to 500 and the learning rate to 0.5.
Data structures: arrays, matrices and lists.
We used matrices to store the training data, test data and learned parameters.
This was the best data structure to use because we sometimes need to transpose
the data or to add and remove rows and columns. Matrix multiplication was also
very easy to perform, and we saved a lot of time by using vectorization
instead of loops, which was only possible with the matrix data structure in
the numpy module. We used Python's list data structure to store the value of
the cost error at every iteration, since it is easy to append values to a
list.
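To illustrate the vectorization point, the sketch below computes the same linear scores Xθ once with explicit Python loops and once with a single numpy matrix product. The shapes (256 pixels, 10 classifiers) follow the report; the random data and function names are stand-ins.

```python
import numpy as np

def linear_scores_loop(X, theta):
    """Compute X @ theta one multiplication at a time."""
    m, n = X.shape
    k = theta.shape[1]
    out = np.zeros((m, k))
    for i in range(m):
        for c in range(k):
            for j in range(n):
                out[i, c] += X[i, j] * theta[j, c]
    return out

def linear_scores_vectorized(X, theta):
    """Same result as a single matrix product."""
    return X @ theta

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5, 256)).astype(float)
theta = rng.normal(size=(256, 10))
same = np.allclose(linear_scores_loop(X, theta),
                   linear_scores_vectorized(X, theta))
```

Both versions agree to floating-point tolerance; the vectorized form replaces three nested loops with one call.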
We used arrays (one-dimensional matrices) to plot the graphs and to hold some
other useful information. We used the Matplotlib package to plot the graph: we
provided values for the x-axis, plotted the cost error at those x-axis values
and joined the points. We obtained a decreasing curve which saturated after
some value of x.
import numpy as np
def sigmoid(z):
    a = np.exp(-z)  # e^(-z)
    a = 1 + a       # 1 + e^(-z)
    a = 1 / a       # 1 / (1 + e^(-z))
    return a
4 Application
Handwritten digit recognition has wide applications. It can be used in banks
for reading amounts, although this is very sensitive because we cannot afford
a single mistake; the model should therefore output "not able to recognize"
when its confidence in a classification is low, so that the case is handled
manually. It can be extended to character recognition for various languages.
It can be used in post offices for reading postal codes, where it would reduce
the workload significantly and make the process faster.
The same project can be extended to read telephone numbers. In that case we
first need to separate the digits and then recognize them individually. It can
also convert a handwritten document into an editable digital document. This is
useful for applications that translate sentences from one language to another.
Those applications only work on typed input, so a handwritten character
recognition system can recognize the characters and then provide them as input
to the translation application.
5 Result
Figure 5: Table of accuracy obtained.
Figure 5 shows the table of accuracy obtained.
6 Conclusion
In this paper, a method to increase handwritten digit recognition rates by
combining feature extraction methods is proposed. Experimental results showed
that complementary features can significantly improve recognition performance.
The proposed concavity feature extraction method in conjunction with gradient
features gave the highest recognition accuracy in the majority of experiments.
The method worked well with chaincode features as well, being one of the two
top performers. It also has the lowest feature count among the observed
complementary features, which lowers the computational cost of classification.
Experiments using reduced training sets showed that the proposed concavity
method outperforms the other observed approaches, making it useful for
applications that require a small training set. Adding training instances from
another dataset affected the recognition accuracy differently for different
datasets. Accuracy increased on two datasets and decreased on one, indicating
that the learning process is sensitive to small differences in image retrieval
and preprocessing. Overall, the proposed method achieved the best performance.