4 Jul 2016

- 1. Indian Institute of Technology Jodhpur, Computer Science and Engineering, Sixth Semester (2015-2016). Machine Learning (Building and comparing various machine learning models to recognize handwritten digits). Team Members: Shrey Maheshwari (ug201314017), Ravi Prakash Gupta (ug201310027). Mentor: Prof. K. R. Chowdhary
- 2. Contents: 1 Introduction (3), 2 Theory (5), 3 Implementation (Data Structures and Algorithms) (7), 4 Application (10), 5 Result (11), 6 Conclusion (12)
- 3. 1 Introduction
The data file contains grayscale images of hand-drawn digits, from zero through nine. Each image is 16 pixels high and 16 pixels wide, for a total of 256 pixels. Each pixel has a single pixel value indicating its lightness or darkness. Each image is single-channel with 8-bit depth, so this pixel value is an integer between 0 and 255, inclusive. We binarized the images as follows: value = 1 if the pixel value > 127, and value = 0 otherwise. Previously each pixel value took 8 bits; now each takes only 1 bit, so one image takes only 256 bits.
The data set (train.csv) has 266 columns. The first 256 columns are the pixel values, and the other 10 indicate the label, i.e. the digit that was drawn by the user. We divided our data into two sets: 1. Training data, which comprises 80% of the data. 2. Test data, which comprises 20% of the data.
Figure 1: Data. Figure 1 shows the data.
The test data set (test.csv) is the same as the training set, except that it does not contain the "label" column.
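The thresholding rule and the 80/20 split described above can be sketched roughly as follows. This is only an illustration on synthetic data, not the authors' code: the helper names `binarize` and `train_test_split`, the shuffling before the split, and the reading of the 10 label columns as a one-hot encoding are all assumptions.

```python
import numpy as np

def binarize(pixels, threshold=127):
    """Map 8-bit pixel values to 1 if value > threshold, else 0."""
    return (pixels > threshold).astype(np.uint8)

def train_test_split(data, train_fraction=0.8, seed=0):
    """Shuffle the rows, then split them into training and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    cut = int(train_fraction * len(data))
    return data[order[:cut]], data[order[cut:]]

# Synthetic stand-in for train.csv: 50 rows of 256 pixels
# followed by 10 label columns (assumed one-hot).
rng = np.random.default_rng(1)
pixels = rng.integers(0, 256, size=(50, 256))
labels = np.eye(10, dtype=np.uint8)[rng.integers(0, 10, size=50)]
data = np.hstack([binarize(pixels), labels])

train, test = train_test_split(data)
X_train, Y_train = train[:, :256], train[:, 256:]
X_test, Y_test = test[:, :256], test[:, 256:]
```

With the real file, the synthetic `pixels` and `labels` arrays would be replaced by the columns loaded from train.csv.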
- 4. Figure 2: Visualization of data.
Classification is the process of assigning new data to a category based on training data whose categories are known. In this paper, we use a set of human-identified digit images split into a training set and a test set. A classifier learns from the training images and labels and produces output for the test images. The output is then compared to the test labels to evaluate classification performance. A good classifier should learn from the training data while generalizing well enough to be accurate on the test set.
- 5. 2 Theory
The given problem falls under the category of supervised learning. Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). Our problem is a multiclass classification problem. To solve it we used logistic regression, a regression model in which the dependent variable (DV) is categorical. The logistic function is defined as follows:
$$\sigma(t) = \frac{e^{t}}{1 + e^{t}} = \frac{1}{1 + e^{-t}}$$
Figure 3: Logistic Function. Figure 3 shows the logistic function.
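As a short check that is not part of the original slides, taking $\sigma(t) = 1/(1 + e^{-t})$, the two common ways of writing the logistic function agree, and a few of its values and one symmetry can be read off directly:

```latex
\begin{align*}
\frac{e^{t}}{1 + e^{t}}
  &= \frac{e^{t}\,e^{-t}}{(1 + e^{t})\,e^{-t}}
   = \frac{1}{e^{-t} + 1} = \sigma(t), \\
\sigma(0) &= \tfrac{1}{2}, \qquad
\lim_{t \to \infty} \sigma(t) = 1, \qquad
\lim_{t \to -\infty} \sigma(t) = 0, \\
1 - \sigma(t) &= \frac{e^{-t}}{1 + e^{-t}} = \frac{1}{1 + e^{t}} = \sigma(-t).
\end{align*}
```

The limits are approached but never reached, so the output always lies strictly between 0 and 1.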
- 6. The range of the logistic function is (0, 1), so each prediction falls in (0, 1) and is read as the probability that the input is the digit for which that logistic regression was trained. Our final answer is the index at which this probability is maximum.
We applied the gradient descent algorithm because it is simple to implement. Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent. We used around 500 iterations to reach saturation, after which the error was no longer decreasing much. We plotted the error against the number of iterations to ensure that the error was always decreasing; had that not been the case, we would have reduced our learning rate.
We used regularization to ensure that our model does not overfit the training data. Regularization, in mathematics and statistics and particularly in the fields of machine learning and inverse problems, refers to a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting. In general, a regularization term R(f) is added to a general loss function:
$$\min_{f} \sum_{i=1}^{n} V(f(\hat{x}_i), \hat{y}_i) + \lambda R(f)$$
for a loss function V that describes the cost of predicting f(x) when the label is y, such as the square loss or hinge loss, and for the term λ, which controls the importance of the regularization term. R(f) is typically a penalty on the complexity of f, such as restrictions for smoothness or bounds on the vector space norm.
There are 10 labels, but logistic regression is a binary classifier, so we need to train 10 logistic regression classifiers, one per digit. We then applied the one-vs-all method to choose the final answer. The one-vs-all strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for their decisions, rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample.
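To make the regularized objective and the one-vs-all setup concrete, here is a minimal sketch. The slides do not say which penalty R(f) was used, so the squared L2 norm of the parameters is an assumption here, as are the helper names `regularized_cost` and `one_vs_all_targets` and the use of the mean rather than the sum over samples.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Mean logistic loss plus lam * ||theta||^2 (assumed L2 form of R)."""
    h = sigmoid(X @ theta)
    eps = 1e-12  # keeps log() finite if a prediction saturates at 0 or 1
    data_loss = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    return data_loss + lam * np.sum(theta ** 2)

def one_vs_all_targets(digits, num_classes=10):
    """Column k is 1 where the sample is digit k and 0 elsewhere."""
    return (digits[:, None] == np.arange(num_classes)).astype(float)

# Tiny illustrative data set: three samples with two features each.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
digits = np.array([3, 7, 3])
Y = one_vs_all_targets(digits)

# Cost of the "is it a 3?" classifier at the all-zero starting point.
cost_digit_3 = regularized_cost(np.zeros(2), X, Y[:, 3], lam=0.1)
```

Each column of `Y` is the binary target for one classifier, so training the 10 classifiers amounts to minimizing this cost once per column.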
- 7. 3 Implementation (Data Structures and Algorithms)
First we initialized our learning parameter, denoted θ, to all zeros. Our hypothesis was the sigmoid of Xθ, where X is the training data and θ is the learned parameter. We used the sigmoid function because its output lies between 0 and 1, which we need because we are converting this multiclass problem into 10 binary classifiers. The sigmoid function is
$$S(t) = \frac{1}{1 + e^{-t}}$$
Its range is (0, 1). We defined our cost function, which measures the error between our prediction and the actual value, as
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
It is a convex function, so the problem of converging to a local optimum does not arise. The initial value of the error, one entry per classifier, was
array([ 0.69314718, 0.69314718, 0.69314718, 0.69314718, 0.69314718, 0.69314718, 0.69314718, 0.69314718, 0.69314718, 0.69314718])
With θ = 0 every prediction is 0.5, so each cost equals ln 2 ≈ 0.693.
We initialized the learning rate with some value; initially a high value was causing problems, so we decreased it until the problem was solved. Finding the best learning rate is a matter of trial and error. Then we implemented the loop in which we updated our learning parameter as
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
In every iteration we stored the value of the cost to check that it was decreasing with every iteration. Initially there was a problem in the project: the cost was not decreasing continuously and sometimes increased in between. We found that the learning rate was too high, which caused overshooting. So we decreased the learning rate and set it to the maximum value at which the cost did not overshoot.
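The training loop described above might look roughly like the following vectorized sketch. It is not the authors' code: the function names, the synthetic data, and the learning rate of 0.02 (chosen small enough for this random data) are assumptions. It does reproduce the initial cost of ln 2 ≈ 0.69314718 per classifier that follows from starting at θ = 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, Y):
    """Cross-entropy J(theta), one value per column (classifier) of Y."""
    H = sigmoid(X @ theta)
    eps = 1e-12  # avoids log(0) when a prediction saturates
    return -np.mean(Y * np.log(H + eps) + (1 - Y) * np.log(1 - H + eps), axis=0)

def gradient_descent(X, Y, alpha, num_iters):
    """Batch gradient descent for all one-vs-all classifiers at once."""
    m, n = X.shape
    theta = np.zeros((n, Y.shape[1]))   # all-zero start, as in the slides
    history = [cost(theta, X, Y)]       # cost before any update
    for _ in range(num_iters):
        grad = X.T @ (sigmoid(X @ theta) - Y) / m
        theta = theta - alpha * grad
        history.append(cost(theta, X, Y))
    return theta, history

# Synthetic stand-in data: 200 binary "images" of 256 pixels, 10 classes.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 256)).astype(float)
Y = np.eye(10)[rng.integers(0, 10, size=200)]

theta, history = gradient_descent(X, Y, alpha=0.02, num_iters=50)
```

Keeping `history` as a plain list mirrors the slides' approach of appending the cost at every iteration so that overshooting shows up as an increase between consecutive entries.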
- 8. Then we plotted the cost against the number of iterations to check that the function is strictly decreasing and to see whether the error had saturated. We found that after 200 iterations the error had still not saturated, so we increased the number of iterations. We arrived at the best number of iterations by trial and error and finally settled on 500.
Figure 4: Cost Error Function. Figure 4 shows the cost error function.
The sigmoid in our hypothesis is applied to a linear model, $x_1\theta_1 + x_2\theta_2 + x_3\theta_3 + x_4\theta_4 + \dots$ We then predicted the output on the test data by multiplying the matrix of test data by the learned parameters. We obtained 10 outputs for every new example, i.e. the output for each test example is a vector of 10 values. The value at each index indicates the probability of that test example being the digit with that index, so we choose the index with the maximum value as our final answer. In sensitive applications such as cheque reading in banks we cannot afford to make a single mistake, so a different approach is needed.
- 9. If all 10 values of our prediction are less than some threshold (say 0.7), i.e. none of the models is confident about the prediction, we can output "not able to recognize" so that the case is handled manually. Then we tried different combinations of learning rate and number of iterations to achieve the best accuracy on the test data. We finally set the number of iterations to 500 and the learning rate to 0.5.
Data structures: arrays, matrices and lists
We used matrices to store the training data, the test data and the learned parameters. The matrix was the best data structure for this because we need to transpose the data and sometimes add or remove rows and columns. Matrix multiplication was also easy to perform, and we saved a lot of time by using vectorization instead of loops, which was only possible with the matrix data structure in the numpy module. We used Python lists to store the cost at every iteration, since it is easy to append values to a list. We used arrays (single-dimension matrices) to plot the graphs and to hold some other useful information. We used the Matplotlib package to plot the graph: we provided the x-axis values, plotted the cost at those values and joined the points. We obtained a decreasing curve that saturated after some value of x.

    import numpy as np

    def sigmoid(z):
        # Logistic function 1 / (1 + e^(-z)), applied element-wise.
        return 1.0 / (1.0 + np.exp(-z))
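The prediction rule (pick the index with the highest probability) and the rejection threshold for sensitive applications can be sketched as follows. The function name `predict_with_rejection`, the use of -1 to mean "not able to recognize", and the toy parameter values are all hypothetical and chosen only to make the behaviour easy to follow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_with_rejection(X_test, theta, threshold=0.7):
    """Return the most probable digit per row, or -1 if no classifier is confident."""
    probs = sigmoid(X_test @ theta)              # shape (num_examples, 10)
    best = np.argmax(probs, axis=1)              # index of the largest probability
    confident = probs.max(axis=1) >= threshold   # at least one value reaches the threshold
    return np.where(confident, best, -1)

# Hypothetical parameters for a toy problem with 3 features and 10 classes.
theta = np.full((3, 10), -4.0)
theta[0, 2] = 4.0   # feature 0 strongly indicates digit 2
theta[1, 5] = 0.5   # feature 1 only weakly indicates digit 5

X_test = np.array([[1.0, 0.0, 0.0],   # confident: digit 2
                   [0.0, 1.0, 0.0],   # best guess is 5, but below the threshold
                   [0.0, 0.0, 0.0]])  # every classifier outputs 0.5
predictions = predict_with_rejection(X_test, theta)
```

Rows marked -1 are the cases that would be passed on for manual handling.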
- 10. 4 Application
Handwritten digit recognition has wide applications. It can be used in banks for reading amounts, although that use is very sensitive because we cannot afford a single mistake; the model should therefore report "not able to recognize" when its confidence in a classification is low, so that the case is handled manually. It can be extended to character recognition for various languages. It can be used in post offices for postal code reading, where it would reduce the workload significantly and make the process faster. The same project can be extended to read telephone numbers; in that case we first need to separate the individual digits and then recognize each one. It can also convert a handwritten document into an editable digital document. This is useful for applications that translate sentences from one language to another: such applications work only on typed input, so a handwritten character recognition system can recognize the characters and pass them to the translator application.
- 11. 5 Result
Figure 5: Table of accuracy obtained. Figure 5 shows the table of accuracy obtained.
- 12. 6 Conclusion
In this paper, a method to increase handwritten digit recognition rates by combining feature extraction methods is proposed. Experimental results showed that complementary features can significantly improve recognition performance. The proposed concavity feature extraction method in conjunction with gradient features gave the highest recognition accuracy in the majority of experiments. The method worked well with chaincode features as well, being one of the two top performers. It also has the lowest feature count among the observed complementary features, which lowers the computational cost of classification. Experiments using reduced training sets showed that the proposed concavity method outperforms the other observed approaches, making it useful for applications that require a small training set. Adding training instances from another dataset affected the recognition accuracy differently for different datasets: accuracy increased on two datasets and decreased on one, indicating that the learning process is sensitive to small differences in image retrieval and preprocessing. Overall, the proposed method achieved the best performance.