This document summarizes the results of running various machine learning algorithms (Naive Bayes, Perceptron, Mira) on image classification tasks using digit and face datasets of varying sizes. It provides tables showing the training time and accuracy for each algorithm on both datasets as the amount of training data is increased from 500 to 5000 samples. The algorithms are evaluated and compared based on these metrics.
Table 7: Perceptron - Face - Time
Time Taken for Face Data (in seconds)
Face Training Data Set Size Time 1 Time 2 Time 3 Mean SD
500 2.4 2.1 2.3 2.266 0.124
1000 4.5 4.9 4.8 4.733 0.169
1500 7.7 7.6 7.4 7.56 0.124
2000 9.7 9.8 10.1 9.866 0.169
2500 12.4 12.2 12.7 12.433 0.205
3000 15.1 14.9 15.2 15.066 0.124
3500 17.3 17.6 17.5 17.466 0.124
4000 20.1 20.2 19.2 19.83 0.449
4500 22.7 22.6 22.4 22.566 0.125
5000 25.6 25.4 25.8 25.6 0.163
Figure 7: Perceptron Face Time
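For reference, the Mean and SD columns in these tables appear to be the arithmetic mean and the population standard deviation of the three runs; a minimal check against the first row of Table 7 (the run values are copied from the table, the code itself is only illustrative):

```python
from statistics import mean, pstdev   # pstdev: population standard deviation (divides by n)

runs = [2.4, 2.1, 2.3]                # Times 1-3 for the 500-sample row of Table 7
print(mean(runs))                     # 2.2666...  (reported in the table as 2.266)
print(pstdev(runs))                   # 0.1247...  (reported in the table as 0.124)
```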
Table 8: Perceptron - Face - Accuracy
Accuracy for Face Data (in %)
Face Training Data Set Size Run 1 Run 2 Run 3 Mean SD
500 67.1 70.4 73 70.16 2.091
1000 74.4 75.1 72.1 73.866 1.281
1500 77.4 79.6 76 77.666 1.48
2000 80.1 78.4 81.2 79.9 1.152
2500 82.3 81.2 80 81.16 0.939
3000 84.2 81.9 84.3 83.1 1.108
3500 84.4 85.7 83.5 84.6 0.90
4000 86.4 89.1 87.4 88.25 1.35
4500 87.2 88.1 88.5 87.93 0.54
5000 90.1 89.1 88.9 89.366 0.525
Figure 8: Perceptron Face Accuracy
Table 9: Mira - Digit - Time
Time Taken for Digit Data (in seconds)
Digit Training Data Set Size Time 1 Time 2 Time 3 Mean SD
500 24.3 23.6 24.9 24.26 0.531
1000 46.1 47.4 48.2 47.233 0.865
1500 67.2 69.8 68.9 68.633 1.078
2000 91.7 93.2 94.3 93.06 1.065
2500 109.4 114.8 115.8 113.33 2.81
3000 138.5 144.7 141.9 141.7 2.53
3500 152.6 158.1 162.2 157.63 3.93
4000 183.2 186.3 178.1 182.533 3.381
4500 203.8 193.5 199 198.767 4.208
5000 227.3 220.5 233.1 226.966 5.149
Figure 9: Mira Digit Time
Table 10: Mira - Digit - Accuracy
Accuracy for Digit Data (in %)
Digit Training Data Set Size Run 1 Run 2 Run 3 Mean SD
500 72.3 76.4 73.1 73.93 1.536
1000 74.1 76.2 77.5 75.93 1.40
1500 75.1 76.4 73.8 75.1 1.061
2000 82.3 80 78.1 80.13 1.717
2500 79.4 80.3 81.2 80.3 0.734
3000 80.1 78.9 79.9 79.633 0.524
3500 82 80.4 81.3 81.23 0.8
4000 80.1 79.5 78.8 79.46 0.3
4500 81 80.2 82.1 81.1 0.778
5000 79.2 80.4 80.8 80.13 0.679
Figure 10: Mira Digit Accuracy
Table 11: Mira - Face - Time
Time Taken for Face Data (in seconds)
Face Training Data Set Size Time 1 Time 2 Time 3 Mean SD
500 5.732 5.845 5.912 5.829 0.074
1000 8.325 8.826 8.967 8.706 0.275
1500 11.458 12.241 11.251 11.65 0.426
2000 13.696 13.176 14.218 13.696 0.425
2500 16.885 16.903 17.48 17.089 0.276
3000 18.74 18.539 18.436 18.571 0.126
3500 20.561 20.704 20.019 20.428 0.295
4000 22.498 22.014 22.328 22.28 0.200
4500 24.893 24.481 23.947 24.440 0.387
5000 28.115 28.947 28.571 28.544 0.340
Figure 11: Mira Face Time
Table 12: Mira - Face - Accuracy
Accuracy for Face Data (in %)
Face Training Data Set Size Run 1 Run 2 Run 3 Mean SD
500 67.8 66.2 69.3 67.766 1.26
1000 72.5 74.8 75.9 74.4 1.416
1500 76.1 77.8 77.1 77 0.697
2000 79.1 78.3 77.9 78.433 0.498
2500 81.5 81.8 82.6 81.96 0.464
3000 84.1 83.7 83.9 83.9 0.163
3500 86.4 86.5 85.9 86.266 0.262
4000 86.9 87.1 87 87 0.0816
4500 89.1 88.7 90.1 89.3 0.588
5000 88.9 91.6 88.5 89.66 1.376
Figure 12: Mira Face Accuracy
PROBLEM 3 - DISCUSSION OF ALGORITHMS AND RESULTS
(A) Naive Bayes:
A naive Bayes classifier models a joint distribution over a label Y and a set of observed
random variables, or features, {F1,F2,...Fn}, using the assumption that the full joint
distribution can be factored as follows (features are conditionally independent given
the label):
$$P(F_1, \ldots, F_n, Y) = P(Y) \prod_{i} P(F_i \mid Y)$$
To classify a datum, we can find the most probable label given the feature values for
each pixel, using Bayes theorem:
$$P(y \mid f_1 \ldots f_n) = \frac{P(f_1 \ldots f_n \mid y)\, P(y)}{P(f_1 \ldots f_n)} = \frac{P(y) \prod_{i=1}^{m} P(f_i \mid y)}{P(f_1 \ldots f_n)}$$
$$\arg\max_y P(y \mid f_1 \ldots f_n) = \arg\max_y \frac{P(y) \prod_{i=1}^{m} P(f_i \mid y)}{P(f_1 \ldots f_n)} = \arg\max_y P(y) \prod_{i=1}^{m} P(f_i \mid y)$$
Because multiplying many probabilities together often results in underflow, we will
instead compute log probabilities which have the same argmax.
$$\arg\max_y \log\!\left( P(y) \prod_{i=1}^{m} P(f_i \mid y) \right) = \arg\max_y \log P(y, f_1 \ldots f_n) = \arg\max_y \left( \log P(y) + \sum_{i=1}^{m} \log P(f_i \mid y) \right)$$
We use math.log() from Python's standard math module to compute the logarithms.
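As a minimal sketch, the log-space argmax could be computed as follows, assuming the priors and conditional probabilities have already been estimated and stored in dictionaries (the function and variable names here are illustrative, not the project's actual data structures):

```python
import math

def classify(features, priors, cond_probs, labels):
    """Return the label with the highest joint log-probability.

    features   -- dict mapping pixel locations to 0/1 values
    priors     -- dict: priors[y] = P(y)
    cond_probs -- dict: cond_probs[(pixel, value, y)] = P(F_i = value | y)
    labels     -- iterable of candidate labels
    """
    best_label, best_log_joint = None, float("-inf")
    for y in labels:
        # log P(y) + sum_i log P(f_i | y): the same argmax as the product form,
        # but without the numerical underflow from multiplying many small numbers.
        log_joint = math.log(priors[y])
        for pixel, value in features.items():
            log_joint += math.log(cond_probs[(pixel, value, y)])
        if log_joint > best_log_joint:
            best_label, best_log_joint = y, log_joint
    return best_label
```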
Parameter Estimation:
Our naive Bayes model has several parameters to estimate. One parameter is the prior
distribution over labels (digits, or face/not-face), P(Y ). We can estimate P(Y ) directly
from the training data:
$$\hat{P}(y) = \frac{c(y)}{n}$$
where c(y) is the number of training instances with label y and n is the total number of
training instances. The other parameters to estimate are the conditional probabilities
of our features given each label y: $P(F_i \mid Y = y)$. We do this for each possible feature
value ($f_i \in \{0, 1\}$).
$$\hat{P}(F_i = f_i \mid Y = y) = \frac{c(f_i, y)}{\sum_{f_i} c(f_i, y)}$$
where $c(f_i, y)$ is the number of times pixel $F_i$ took value $f_i$ in the training examples of
label $y$.
Smoothing: The parameter estimates above are unsmoothed, that is, they are the raw
empirical estimates of the parameters $P(f_i \mid y)$. Such estimates are rarely adequate
in real systems. Minimally, we need to make sure that no parameter ever receives
an estimate of zero, but good smoothing can boost accuracy quite a bit by reducing
overfitting. In this project, we use Laplace smoothing, which adds k counts to every
possible observation value:
$$P(F_i = f_i \mid Y = y) = \frac{c(F_i = f_i, Y = y) + k}{\sum_{f_i} \left( c(F_i = f_i, Y = y) + k \right)}$$
If k = 0, the probabilities are unsmoothed. As k grows larger, the probabilities are
smoothed more and more. The validation set can be used to determine a good value
of k.
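A minimal sketch of the counting and smoothing described above, assuming binary pixel features and a list of (features, label) training pairs (the counter layout and names are assumptions for illustration, not the project's actual code):

```python
from collections import defaultdict

def estimate_parameters(training_data, labels, k):
    """Estimate P(y) and Laplace-smoothed P(F_i = f_i | Y = y).

    training_data -- list of (features, label) pairs; features is a dict pixel -> 0/1
    labels        -- list of possible labels
    k             -- smoothing strength (k = 0 gives the unsmoothed empirical estimates)
    """
    label_counts = defaultdict(int)      # c(y)
    feature_counts = defaultdict(int)    # c(F_i = f_i, Y = y)
    pixels = set()
    for features, y in training_data:
        label_counts[y] += 1
        for pixel, value in features.items():
            feature_counts[(pixel, value, y)] += 1
            pixels.add(pixel)

    n = len(training_data)
    priors = {y: label_counts[y] / n for y in labels}

    cond_probs = {}
    for y in labels:
        for pixel in pixels:
            # Denominator: smoothed counts summed over both possible pixel values.
            denom = sum(feature_counts[(pixel, v, y)] + k for v in (0, 1))
            for value in (0, 1):
                cond_probs[(pixel, value, y)] = (feature_counts[(pixel, value, y)] + k) / denom
    return priors, cond_probs
```

With k = 0, any pixel value that never co-occurs with a label receives a probability of zero, which is exactly the problem the smoothing term avoids.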
Conclusion: As we increase the training data from 10% to 100%, the accuracy jumps from
a mere 62.8% to a reasonable 91% in face recognition, and from 71% to 77% in digit
recognition. The trade-off is time: more data means more processing. From the data
collected, we see a difference of about 3 seconds when training on face data and about
8 seconds when training on digit data.
We also see that the gain in accuracy diminishes after some point. In our project
data there is not much difference in accuracy between 50% training data and 100%
training data, so we could reduce training time by using 50% of the data instead of
training on 100%.
(B) Perceptron:
The perceptron algorithm uses a weight vector to make decisions, unlike Naive Bayes,
which uses probabilities. A weight vector $w^y$ is kept for each class $y$. Given a feature
vector $f$ (in our case, a map from pixel locations to indicators of whether they are on),
the perceptron computes the class $y$ whose weight vector is most similar to the input
vector $f$, scoring each class with:
$$\operatorname{score}(f, y) = \sum_{i} f_i \, w_i^{y}$$
The class with the highest score is chosen as the predicted label for that data instance.
Before classification can be performed, the weights have to be learnt by the algorithm.
For this, the training set is scanned one instance at a time. When we come to an instance
$(f, y)$, we find the label with the highest score:
$$y' = \arg\max_{y''} \operatorname{score}(f, y'')$$
We compare $y'$, the result of the previous equation, to the true label $y$. If the
two labels are equal ($y' = y$), then we have classified the instance correctly and can
proceed with the other items in the training set. Otherwise, we wrongly guessed $y'$ in
place of the true label $y$. That means that $w^{y}$ should have scored $f$ higher and $w^{y'}$
should have scored $f$ lower. To prevent this error in the future, we update these two
weight vectors accordingly:
$$w^{y} = w^{y} + f$$
$$w^{y'} = w^{y'} - f$$
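A compact sketch of this training loop, assuming the weights and feature vectors are plain dictionaries keyed by pixel location (a stand-in for whatever counter structure the actual implementation uses):

```python
def score(weights_y, features):
    """Dot product between one class's weight vector and the feature vector."""
    return sum(weights_y.get(pixel, 0) * value for pixel, value in features.items())

def train_perceptron(training_data, labels, iterations=3):
    """training_data: list of (features, true_label); features: dict pixel -> 0/1."""
    weights = {y: {} for y in labels}
    for _ in range(iterations):
        for features, y in training_data:
            # Predict the label whose weight vector scores the feature vector highest.
            guess = max(labels, key=lambda c: score(weights[c], features))
            if guess != y:
                # Mistake: reward the true label's weights and penalize the guessed label's.
                for pixel, value in features.items():
                    weights[y][pixel] = weights[y].get(pixel, 0) + value
                    weights[guess][pixel] = weights[guess].get(pixel, 0) - value
    return weights
```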
Conclusion: We ran our algorithm for 3 iterations. We see a huge difference in accuracy,
from 67% to 90% (digits) and from 70% to 90% (faces). This happens because the
perceptron iterates repeatedly through the training data, which also leads to a higher
training time. The perceptron makes weight corrections and keeps improving after each
iteration, which is also the reason behind the significant accuracy.
(C) Mira:
Similar to the perceptron classifier, the MIRA classifier also keeps a weight vector $w^y$
for each label $y$. Here too we scan over the data, one instance at a time. When we come
to an instance $(f, y)$, we find the label with the highest score:
$$y' = \arg\max_{y''} \operatorname{score}(f, y'')$$
We compare $y'$ to the true label $y$. If the labels are equal ($y' = y$), we have classified
the instance correctly and do nothing. Otherwise, we guessed $y'$ when we should have
guessed $y$. The difference between MIRA and the perceptron is that in MIRA we update
the weight vectors of these labels with a variable step size $\tau$:
$$w^{y} = w^{y} + \tau f$$
$$w^{y'} = w^{y'} - \tau f$$
Here $\tau \geq 0$ is chosen such that it minimizes
$$\min_{w'} \frac{1}{2} \sum_{c} \lVert (w')^{c} - w^{c} \rVert_2^2$$
subject to the condition that
$$(w')^{y} \cdot f \geq (w')^{y'} \cdot f + 1$$
This is equivalent to
$$\min_{\tau} \lVert \tau f \rVert_2^2 \quad \text{subject to} \quad \tau \geq \frac{(w^{y'} - w^{y}) \cdot f + 1}{2\lVert f \rVert_2^2} \ \text{and} \ \tau \geq 0$$
Note that $w^{y'} \cdot f \geq w^{y} \cdot f$, so the condition $\tau \geq 0$ is always satisfied given
$\tau \geq \frac{(w^{y'} - w^{y}) \cdot f + 1}{2\lVert f \rVert_2^2}$.
Solving this problem, we get
$$\tau = \frac{(w^{y'} - w^{y}) \cdot f + 1}{2\lVert f \rVert_2^2}$$
We cap the maximum possible value of $\tau$ with a positive constant $C$:
$$\tau = \min\left( C, \; \frac{(w^{y'} - w^{y}) \cdot f + 1}{2\lVert f \rVert_2^2} \right)$$
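The resulting update could be sketched as follows, reusing the dictionary-based weight layout from the perceptron sketch above; the function name and argument layout are illustrative (C is the cap described in the text):

```python
def mira_update(weights, features, y_true, y_guess, C):
    """Apply one MIRA correction after a misclassification (y_guess != y_true).

    weights  -- dict: weights[label] is a dict pixel -> weight
    features -- dict pixel -> 0/1 feature values
    C        -- positive cap on the step size tau
    """
    w_true, w_guess = weights[y_true], weights[y_guess]
    # (w^{y'} - w^{y}) . f + 1
    numerator = sum((w_guess.get(p, 0) - w_true.get(p, 0)) * v
                    for p, v in features.items()) + 1
    # 2 * ||f||_2^2
    denominator = 2.0 * sum(v * v for v in features.values())
    tau = min(C, numerator / denominator)
    for p, v in features.items():
        w_true[p] = w_true.get(p, 0) + tau * v    # w^{y}  <- w^{y}  + tau * f
        w_guess[p] = w_guess.get(p, 0) - tau * v  # w^{y'} <- w^{y'} - tau * f
```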
Conclusion: The accuracy increased from 67% to approximately 90% as we increased the
training data. Similar to the perceptron, the accuracy improves as we increase the
iterations, because the weights keep getting updated. The training time is high when
compared to Naive Bayes but is similar to the perceptron.