A brief introduction to learning (training) linear classifiers in machine learning classification, covering such things as classifier evaluation, log-likelihood, and gradient ascent.
2. The component most responsible for learning (improving) in a model is the Quality Metric.
3. Likelihood Function
Quality Metrics improve the coefficients of a classification model using a likelihood
function.
A likelihood function l(w) measures the quality of fit of coefficients w; training seeks to
maximize l(w), bringing it as close to 1 as possible.
4. Maximum Likelihood Estimation
Maximum Likelihood Estimation: choosing the coefficients w that maximize the probability
of the observed data, P(yi|xi, w), over all N data points. The equation is written as:
l(w) = Π (i = 1 to N) P(yi | xi, w)
Oftentimes, maximum likelihood estimation uses gradient ascent to achieve this goal.
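As a concrete sketch (the function and variable names here are hypothetical, not from the slides), the likelihood being maximized can be computed directly as a product over the data points:

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def likelihood(w, data):
    """l(w): the product over all N data points of P(y_i | x_i, w),
    where P(y = +1 | x, w) = sigmoid(w . h(x))."""
    total = 1.0
    for h, y in data:
        score = sum(wj * hj for wj, hj in zip(w, h))
        p = sigmoid(score)                    # P(y = +1 | x, w)
        total *= p if y == +1 else (1.0 - p)  # probability of the observed label
    return total
```

With w = [0.0] every point gets probability 0.5, and larger (well-aligned) coefficients push the likelihood toward 1.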
5. Gradient Ascent
The gradient ∇l(w) is the vector of partial derivatives of l(w), one for each
coefficient wj; it points in the direction of steepest increase.
Gradient Ascent is then an iterative optimization algorithm that repeatedly steps in the
direction of the gradient until the highest point is reached.
A contour plot visualizes the trajectory taken by the gradient ascent algorithm to
the maximum point.
7. 1[yi=+1] is the indicator function: a piecewise function equal to 1 if yi = +1 and to 0
otherwise. The condition inside the brackets changes with the values and number of possible
outputs.
8. Example

hj(xi)   yi   P(yi=+1|xi, w) = 1/(1+e^(-w^T h(xi)))   Derivative contribution
2.5      +1   0.92414                                 2.5(1 - 0.92414) =  0.18965
0.3      -1   0.5744                                  0.3(0 - 0.5744)  = -0.17232
2.8      +1   0.9427                                  2.8(1 - 0.9427)  =  0.16044
0.5      +1   0.6225                                  0.5(1 - 0.6225)  =  0.18875

Derivative of l(w) with respect to wj = 0.18965 - 0.17232 + 0.16044 + 0.18875 = 0.36652
where the indicator function 1[yi=+1] = {1 if yi = +1; 0 if yi = -1}
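The table above can be checked directly in code (a minimal sketch; the variable names are invented for illustration):

```python
# Rows of the table: (h_j(x_i), y_i, P(y_i = +1 | x_i, w))
rows = [
    (2.5, +1, 0.92414),
    (0.3, -1, 0.5744),
    (2.8, +1, 0.9427),
    (0.5, +1, 0.6225),
]

def indicator(y):
    """1[y = +1]: 1 if y is +1, otherwise 0."""
    return 1.0 if y == +1 else 0.0

# dl/dw_j = sum over i of h_j(x_i) * (1[y_i = +1] - P(y = +1 | x_i, w))
derivative = sum(h * (indicator(y) - p) for h, y, p in rows)
```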
9. Oftentimes, the log likelihood is used in place of the likelihood when computing
derivatives.
This is because the log turns the product over data points into a sum, which makes the
math involved easier.
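The equivalence is easy to verify numerically (using the probabilities from the example above):

```python
import math

# Probability each data point assigns to its observed label
probs = [0.92414, 1 - 0.5744, 0.9427, 0.6225]

likelihood = math.prod(probs)                     # product over data points
log_likelihood = sum(math.log(p) for p in probs)  # sum of logs: easier to differentiate

# The log of the product equals the sum of the logs, so maximizing
# either one yields the same coefficients.
```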
10. Interpretation of Derivatives
Assuming hj(xi) = 1 for simplicity, each data point contributes Δi = 1[yi=+1] - P(y=+1|xi, w):

          P(y=+1|xi, w) ≈ 1                  P(y=+1|xi, w) ≈ 0
yi = +1   Δi ≈ (1-1) ≈ 0:                    Δi ≈ (1-0) ≈ 1:
          good coefficients                  coefficients too small (false negative)
yi = -1   Δi ≈ (0-1) ≈ -1:                   Δi ≈ (0-0) ≈ 0:
          coefficients too large             good coefficients
          (false positive)
12. Determining Step Size
Step sizes are determined by a process of trial and error.
Candidate step sizes are compared on a learning curve, where the y-axis is the likelihood
and the x-axis is the number of iterations.
Too small a step size: a slowly growing curve
Too large a step size: oscillations (zig-zags) in the curve
13. [Figure: learning curves showing too small a step size (green line) and too large a
step size (red line, teal line)]
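Both failure modes can be reproduced on a toy objective (a sketch under assumed settings, not the slides' actual data): maximizing f(w) = -w^2, whose maximum is at w = 0.

```python
def ascend(step_size, w0=1.0, n_iters=5):
    """Gradient ascent on f(w) = -w^2 (gradient: -2w), recording the path."""
    w, path = w0, [w0]
    for _ in range(n_iters):
        w = w + step_size * (-2.0 * w)
        path.append(w)
    return path

too_small = ascend(step_size=0.01)  # w creeps toward 0: slowly growing curve
too_large = ascend(step_size=0.9)   # w overshoots and flips sign each step: zig-zags
```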
14. Measures for Classifier Evaluation
Error = # of mistakes/# of data points
Accuracy = # of correct predictions/# of data points
Ex: In a dataset of 30 data points, 20 are predicted correctly
Error = 10/30 Accuracy = 20/30
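The two measures are complementary, as a short sketch shows (the helper name is invented for illustration):

```python
def evaluate(n_correct, n_total):
    """Error and accuracy as defined above; note that error = 1 - accuracy."""
    accuracy = n_correct / n_total
    error = (n_total - n_correct) / n_total
    return error, accuracy

# The slide's example: 30 data points, 20 predicted correctly.
error, accuracy = evaluate(n_correct=20, n_total=30)
```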
15. Overfitting in Classification
Overfitting: Training a model too tightly to a certain set of data points, to the extent that
it cannot make accurate predictions with new data.
Overfitting in classification models leads to overconfident predictions (extremely high
and low probabilities)
16. Signs of Overfitting
Strong Indicators of overfitting include:
● Model accuracy nearing or equalling 100% with training data
● Extremely large coefficients
● Overly complex decision boundaries
17. Overconfidence example:
Remember the formula: P(y=+1|xi) = 1/(1+e^(-w^T h(xi)))
When “good” = 0.5 and “awful” = -3,
“good” + “good” + “awful” = 0.5 + 0.5 - 3 = -2, so P(y=+1|xi) = 1/(1+e^2) = 0.119
When “good” = 5 and “awful” = -30,
“good” + “good” + “awful” = 5 + 5 - 30 = -20, so P(y=+1|xi) = 1/(1+e^20) = 2.061e-9 (a really small number)
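The numbers above can be reproduced directly (a small sketch; the function name is hypothetical):

```python
import math

def p_positive(score):
    """P(y = +1 | x) = 1 / (1 + e^(-w^T h(x))), with score = w^T h(x)."""
    return 1.0 / (1.0 + math.exp(-score))

# "good" + "good" + "awful" with moderate coefficients (0.5, 0.5, -3): score -2
moderate = p_positive(0.5 + 0.5 - 3.0)
# The same sentence with coefficients scaled by 10 (5, 5, -30): score -20
overconfident = p_positive(5.0 + 5.0 - 30.0)
```

Scaling the coefficients by 10 turns a moderate probability (~0.12) into a near-certain one (~2e-9), even though the decision is the same.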
18. Method to Mitigate Overfitting
Overfitting can be prevented, or at least reduced, by “penalizing” large coefficients.
This is done by altering the quality metric:
Without a penalty: Total Quality = Measure of Fit (the data likelihood)
With a penalty: Total Quality = Measure of Fit - Measure of Magnitude of Coefficients
19. Measures of Magnitude of Coefficients
There are two common values used as measures of magnitude in order to penalize large
coefficients:
L2 norm: the sum of the coefficients’ squares
L1 norm: the sum of the coefficients’ absolute values
Both are scaled by λ, the “tuning parameter”, which is chosen using a validation set.
Greater λ -> smaller coefficients
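A minimal sketch of the two measures for a hypothetical coefficient vector:

```python
w = [1.0, -2.0, 0.5]

l2_norm_squared = sum(wj ** 2 for wj in w)  # sum of squares: 1 + 4 + 0.25
l1_norm = sum(abs(wj) for wj in w)          # sum of absolute values: 1 + 2 + 0.5
```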
20. L2 Regularization
The squared L2 norm is the sum of the squared coefficients: ||w||2^2 = w1^2 + w2^2 + … + wD^2
L2 regularization is then done by taking the derivative of l(w) and subtracting the
derivative of λ||w||2^2, which is 2λwj for each coefficient wj.
Quality Metric = l(w) - λ||w||2^2, with partial derivatives ∂l(w)/∂wj - 2λwj
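One regularized gradient-ascent step can be sketched as follows (hypothetical names; the likelihood gradient is taken as given):

```python
def l2_regularized_step(w, grad_l, step_size, lam):
    """One gradient-ascent step on l(w) - lambda * ||w||_2^2:
    the penalty contributes -2 * lambda * w_j to each partial derivative."""
    return [wj + step_size * (gj - 2.0 * lam * wj)
            for wj, gj in zip(w, grad_l)]

# With a zero likelihood gradient, the penalty alone shrinks every coefficient toward 0.
w_new = l2_regularized_step(w=[1.0, -2.0], grad_l=[0.0, 0.0], step_size=0.1, lam=0.5)
```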
21. L1 Regularization
The L1 norm is the sum of the absolute values of the coefficients, ||w||1 = |w1| + |w2| + … + |wD|
(thus penalizing large positive and large negative coefficients equally).
Thus in L1 regularization:
Quality Metric = l(w) - λ||w||1
This leads to sparse solutions, where many coefficients wj = 0.
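A small sketch of the L1-penalized quality metric (invented names, toy numbers): two coefficient vectors with the same fit, where the sparser one scores higher quality.

```python
def l1_quality(log_likelihood, w, lam):
    """Total quality = measure of fit minus lambda times the L1 norm."""
    return log_likelihood - lam * sum(abs(wj) for wj in w)

# Same measure of fit (-1.0), same total coefficient weight, but one vector is sparse.
sparse_q = l1_quality(-1.0, [2.0, 0.0, 0.0], lam=0.1)
dense_q = l1_quality(-1.0, [1.0, 1.0, 1.0], lam=0.1)
```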
22. Impact of Regularization
L2: coefficients approach (but don’t reach) zero as λ increases
L1: coefficients approach and reach zero as λ increases