# chap4_Parametric_Methods.ppt


• 3. Outline  Introduction  Maximum Likelihood Estimation  Evaluating an Estimator: Bias and Variance  The Bayes Estimator  Parametric Classification  Regression  Tuning Model Complexity:  Bias / Variance Dilemma  Model Selection Procedures
• 4. Introduction  A statistic is any value that is calculated from a given sample.  In statistical inference, we make a decision using the information provided by a sample.  Our first approach is parametric where we assume that the sample is drawn from some distribution that obeys a known model, for example, Gaussian.  The advantage of the parametric approach is that the model is defined up to a small number of parameters—for example, mean, variance—the sufficient statistics of the distribution.
• 5.  Once those parameters are estimated from the sample, the whole distribution is known.  We estimate the parameters of the distribution from the given sample, plug in these estimates to the assumed model, and get an estimated distribution, which we then use to make a decision.  The method we use to estimate the parameters of a distribution is maximum likelihood estimation.  We start with density estimation, which is the general case of estimating p(x).  We use this for classification where the estimated densities are the class densities, p(x|Ci), and priors, P(Ci), to be able to calculate the posteriors, P(Ci|x), and make our decision.
• 6. Parametric Estimation  $\mathcal{X} = \{x^t\}_{t=1}^{N}$ where $x^t \sim p(x)$.  Here x is one-dimensional and the densities are univariate.  Parametric estimation: assume a form for $p(x\,|\,\theta)$ and estimate $\theta$, its sufficient statistics, using $\mathcal{X}$, e.g., $\mathcal{N}(\mu, \sigma^2)$ where $\theta = \{\mu, \sigma^2\}$.
• 7. Maximum Likelihood Estimation  In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data.  This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable.  For example, if a population is known to follow a normal distribution but its mean and variance are unknown, MLE can be used to estimate them from a sample.
• 8.  Let us say we have an independent and identically distributed (iid) sample $\mathcal{X} = \{x^t\}_{t=1}^{N}$.  We assume that the $x^t$ are instances drawn from some known probability density family, $p(x\,|\,\theta)$, defined up to parameters $\theta$.  Likelihood of θ given the sample $\mathcal{X}$: $l(\theta\,|\,\mathcal{X}) \equiv p(\mathcal{X}\,|\,\theta) = \prod_{t=1}^{N} p(x^t\,|\,\theta)$  Log likelihood: $\mathcal{L}(\theta\,|\,\mathcal{X}) \equiv \log l(\theta\,|\,\mathcal{X}) = \sum_{t=1}^{N} \log p(x^t\,|\,\theta)$  Maximum likelihood estimator (MLE): $\theta^{*} = \arg\max_{\theta} \mathcal{L}(\theta\,|\,\mathcal{X})$
• 9. Examples: Bernoulli Density  The Bernoulli distribution is a probability distribution for a random variable that can take only two possible values: 1 for success or 0 for failure.  Two states, failure/success, $x \in \{0, 1\}$: $P(x) = p^{x}(1-p)^{1-x}$  Log likelihood: $\mathcal{L}(p\,|\,\mathcal{X}) = \log \prod_t p^{x^t}(1-p)^{1-x^t} = \left(\sum_t x^t\right)\log p + \left(N - \sum_t x^t\right)\log(1-p)$  MLE: $\hat{p} = \sum_t x^t / N$ (MLE: Maximum Likelihood Estimation)
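As a quick, hedged illustration (not part of the original slides), the following NumPy sketch checks the closed-form Bernoulli MLE $\hat{p} = \sum_t x^t / N$ against a brute-force grid maximization of the log likelihood; the helper name `bernoulli_log_likelihood` and the true parameter 0.3 are arbitrary choices for the example.

```python
import numpy as np

def bernoulli_log_likelihood(p, x):
    """L(p|X) = (sum_t x^t) log p + (N - sum_t x^t) log(1 - p)."""
    s = x.sum()
    return s * np.log(p) + (len(x) - s) * np.log(1.0 - p)

rng = np.random.default_rng(0)
x = (rng.random(1000) < 0.3).astype(int)   # iid sample from Bernoulli(p = 0.3)

p_mle = x.mean()                           # closed-form MLE: sum_t x^t / N

# Numerical check: evaluate the log likelihood on a grid and pick the maximizer.
grid = np.linspace(0.01, 0.99, 981)
p_grid = grid[np.argmax([bernoulli_log_likelihood(p, x) for p in grid])]

print(p_mle, p_grid)                       # both should be close to 0.3
```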
• 10.  In probability theory and statistics, the Bernoulli distribution, named after Swiss mathematician Jacob Bernoulli, is the discrete probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q=1-p.  Less formally, it can be thought of as a model for the set of possible outcomes of any single experiment that asks a yes–no question.  Such questions lead to outcomes that are boolean valued: a single bit whose value is success / yes / true / one with probability p and failure / no / false / zero with probability q.
• 11.  It can be used to represent a (possibly biased) coin toss where 1 and 0 would represent "heads" and "tails", respectively, and p would be the probability of the coin landing on heads (or vice versa where 1 would represent tails and p would be the probability of tails).  The Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted (so n would be 1 for such a binomial distribution).  It is also a special case of the two-point distribution, for which the possible outcomes need not be 0 and 1.
• 12. Examples: Multinomial Density  K > 2 states, $x_i \in \{0,1\}$, the K states being mutually exclusive: $P(x_1, x_2, \ldots, x_K) = \prod_i p_i^{x_i}$  $\mathcal{L}(p_1, p_2, \ldots, p_K\,|\,\mathcal{X}) = \log \prod_t \prod_i p_i^{x_i^t}$, where $x_i^t = 1$ if experiment t chooses state i and $x_i^t = 0$ otherwise  MLE: $\hat{p}_i = \sum_t x_i^t / N$
• 13. Gaussian (Normal) Distribution  $p(x) = \mathcal{N}(\mu, \sigma^2)$: $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$  Given a sample $\mathcal{X} = \{x^t\}_{t=1}^{N}$ with $x^t \sim \mathcal{N}(\mu, \sigma^2)$, the log likelihood of the Gaussian sample is $\mathcal{L}(\mu, \sigma\,|\,\mathcal{X}) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{\sum_t (x^t - \mu)^2}{2\sigma^2}$ (Why? N samples?)
• 14. Gaussian (Normal) Distribution  MLE for μ and σ²: $m = \frac{\sum_t x^t}{N}$, $s^2 = \frac{\sum_t (x^t - m)^2}{N}$
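A minimal sketch (assuming NumPy) of the Gaussian ML estimates above: the sample mean m and the divide-by-N sample variance s²; the true parameters 2.0 and 1.5 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # iid sample from N(mu = 2, sigma = 1.5)

m = x.mean()                    # MLE of mu:      m   = sum_t x^t / N
s2 = ((x - m) ** 2).mean()      # MLE of sigma^2: s^2 = sum_t (x^t - m)^2 / N

print(m, np.sqrt(s2))           # roughly 2.0 and 1.5
```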
• 15. Evaluating an Estimator: Bias and Variance  Machine learning is a branch of Artificial Intelligence that allows machines to perform data analysis and make predictions.  However, if the machine learning model is not accurate, it can make prediction errors, and these prediction errors are usually known as bias and variance.  In machine learning, these errors will always be present, as there is always a slight difference between the model's predictions and the actual values.  The main aim of ML/data science analysts is to reduce these errors in order to get more accurate results.
• 16. Errors in Machine Learning?  In machine learning, an error is a measure of how accurately an algorithm can make predictions on previously unseen data.  On the basis of these errors, we select the machine learning model that performs best on the particular dataset.  There are mainly two types of errors in machine learning.
• 17. What is Bias?  In general, a machine learning model analyses the data, finds patterns in it, and makes predictions.  While training, the model learns these patterns in the dataset and applies them to test data for prediction.  While making predictions, a difference occurs between the values predicted by the model and the actual/expected values, and this difference is known as bias error or error due to bias.
• 18. What is Variance?  Variance specifies the amount of variation in the prediction if a different training dataset were used.  In simple words, variance tells how much a random variable differs from its expected value.  Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at capturing the hidden mapping between input and output variables.  Variance errors are either low variance or high variance.
• 19. Different Combinations of Bias-Variance
• 20.  Low-Bias, Low-Variance: The combination of low bias and low variance is the ideal machine learning model. However, it is practically impossible to achieve.  Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model has a large number of parameters and hence leads to overfitting.  High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses too few parameters. It leads to underfitting.  High-Bias, High-Variance: With high bias and high variance, predictions are inconsistent and also inaccurate on average.
• 21. How to identify High Variance or High Bias?  High variance can be identified if the model has low training error and high test error.  High bias can be identified if the model has high training error, and the test error is similar to the training error.
• 22. Bias-Variance Trade-Off  While building a machine learning model, it is really important to take care of bias and variance in order to avoid overfitting and underfitting.  If the model is very simple with few parameters, it may have low variance and high bias.  Whereas, if the model has a large number of parameters, it will have high variance and low bias.  So, we need to strike a balance between the bias and variance errors, and this balance between the bias error and variance error is known as the Bias-Variance trade-off.
• 23.  For an accurate prediction of the model, algorithms need low variance and low bias.  But this is not fully achievable because bias and variance are related to each other:  If we decrease the variance, it will increase the bias.  If we decrease the bias, it will increase the variance. Hence, the Bias-Variance trade-off is about finding the sweet spot that balances bias and variance errors.
• 24. Bias and Variance  Unknown parameter θ  Estimator $d_i = d(\mathcal{X}_i)$ on sample $\mathcal{X}_i$  Bias: $b_{\theta}(d) = E[d] - \theta$  Variance: $E[(d - E[d])^2]$  If $b_{\theta}(d) = 0$, d is an unbiased estimator of θ  If $E[(d - E[d])^2] \to 0$ as $N \to \infty$, d is a consistent estimator of θ
• 25. Expected value  If the probability distribution of X admits a probability density function f(x), then the expected value can be computed as $E[X] = \int_{-\infty}^{\infty} x\, f(x)\, dx$.  It follows directly from the discrete case definition that if X is a constant random variable, i.e. X = b for some fixed real number b, then the expected value of X is also b.  The expected value of an arbitrary function of X, g(X), with respect to the probability density function f(x) is given by the inner product of f and g: $E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx$. http://en.wikipedia.org/wiki/Expected_value
• 26. Bias and Variance  For example: $E[m] = E\!\left[\frac{\sum_t x^t}{N}\right] = \frac{1}{N}\sum_t E[x^t] = \frac{N\mu}{N} = \mu$, so m is an unbiased estimator of μ.  $\mathrm{Var}[m] = \mathrm{Var}\!\left[\frac{\sum_t x^t}{N}\right] = \frac{1}{N^2}\sum_t \mathrm{Var}[x^t] = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N}$  Var[m] → 0 as N → ∞, so m is also a consistent estimator.
• 27. Bias and Variance  For example (see pp. 65-66): $E[s^2] = \left(\frac{N-1}{N}\right)\sigma^2 \neq \sigma^2$, so $s^2$ is a biased estimator of σ², and $\left(\frac{N}{N-1}\right)s^2$ is an unbiased estimator of σ².  Mean square error: $r(d, \theta) = E[(d - \theta)^2] = (E[d] - \theta)^2 + E[(d - E[d])^2] = \mathrm{Bias}^2 + \mathrm{Variance}$ (see p. 66, next slide)
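The bias of s² can be checked empirically. The following sketch (assuming NumPy; the sample size N = 5 and σ² = 4 are arbitrary) averages s² over many small samples and shows that E[s²] ≈ ((N−1)/N)σ², while (N/(N−1))s² is unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, N, trials = 4.0, 5, 100_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
m = samples.mean(axis=1, keepdims=True)
s2 = ((samples - m) ** 2).mean(axis=1)      # biased MLE s^2 (divide by N)

print(s2.mean())                            # ~ (N - 1)/N * sigma^2 = 3.2
print((N / (N - 1)) * s2.mean())            # ~ sigma^2 = 4.0 after bias correction
```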
• 29. Standard Deviation  In statistics, the standard deviation is often estimated from a random sample drawn from the population.  The most common measure used is the sample standard deviation, defined by $s = \sqrt{\frac{1}{N-1}\sum_{t=1}^{N}(x^t - \bar{x})^2}$, where $\{x^1, \ldots, x^N\}$ is the sample (formally, realizations from a random variable X) and $\bar{x}$ is the sample mean. http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation
• 30. Bayes’ Estimator  Sometimes, before looking at a sample, we (or experts of the application) may have some prior information on the possible value range that a parameter, θ, may take.  This information is quite useful and should be used, especially when the sample is small.  The prior information does not tell us exactly what the parameter value is (otherwise we would not need the sample), and we model this uncertainty by viewing θ as a random variable and by defining a prior density for it, p(θ).
• 31. What is a Bayesian estimator?  A Bayesian estimator is an estimator of an unknown parameter θ that minimizes the expected loss for all observations x of X.  An estimator is Bayesian if it uses the Bayes theorem to predict the most likely class of some observed data.  Because the class of data is an unknown parameter and not a random variable, it is not possible to express the probability of that class using the standard concept of probability.
• 32. How does the Bayes estimator differ from MLE?  The difference between these two approaches is that the parameters in maximum likelihood estimation are fixed but unknown, whereas the parameters in the Bayesian approach are treated as random variables with known prior distributions.
• 33. Bayes’ Estimator  Treat θ as a random variable with prior p(θ)  Bayes’ rule: $p(\theta\,|\,\mathcal{X}) = \frac{p(\mathcal{X}\,|\,\theta)\, p(\theta)}{p(\mathcal{X})}$  Maximum a Posteriori (MAP): $\theta_{MAP} = \arg\max_{\theta} p(\theta\,|\,\mathcal{X})$  Maximum Likelihood (ML): $\theta_{ML} = \arg\max_{\theta} p(\mathcal{X}\,|\,\theta)$  Bayes’ estimator: $\theta_{Bayes} = E[\theta\,|\,\mathcal{X}] = \int \theta\, p(\theta\,|\,\mathcal{X})\, d\theta$
• 34. Maximum a Posteriori (MAP):  In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity that equals the mode of the posterior distribution.  The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. Maximum Likelihood (ML):  In statistics, maximum likelihood (ML) is a method of estimating the parameters of an assumed probability distribution, given some observed data.  This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable.
• 35. MAP vs. ML  If p(θ) is a uniform distribution, then $\theta_{MAP} = \arg\max_{\theta} p(\theta\,|\,\mathcal{X}) = \arg\max_{\theta} p(\mathcal{X}\,|\,\theta)\, p(\theta)/p(\mathcal{X}) = \arg\max_{\theta} p(\mathcal{X}\,|\,\theta) = \theta_{ML}$, since p(θ)/p(X) is a constant.  Hence θMAP = θML.
• 36. Bayes’ Estimator: Example  Suppose $x^t \sim \mathcal{N}(\theta, \sigma^2)$ and $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$, so that $p(\mathcal{X}\,|\,\theta) = \frac{1}{(2\pi)^{N/2}\sigma^{N}}\exp\!\left[-\frac{\sum_t (x^t - \theta)^2}{2\sigma^2}\right]$ and $p(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{(\theta - \mu_0)^2}{2\sigma_0^2}\right]$.  Then $\theta_{ML} = m$, and since $p(\theta\,|\,\mathcal{X})$ is normal, $\theta_{Bayes} = \theta_{MAP} = E[\theta\,|\,\mathcal{X}] = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\, m + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\, \mu_0$  The Bayes’ estimator is a weighted average of the prior mean μ0 and the sample mean m. A numerical sketch follows below.
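A small sketch of this weighted average (assuming NumPy; the prior N(0, 0.25), σ² = 1, and the true θ = 1 are arbitrary): as N grows, the Bayes estimate moves from the prior mean toward the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                  # known data variance
mu0, sigma0_2 = 0.0, 0.25     # prior: theta ~ N(mu0, sigma0^2)
theta_true = 1.0

for N in (1, 10, 1000):
    x = rng.normal(theta_true, np.sqrt(sigma2), size=N)
    m = x.mean()                                     # ML estimate
    w = (N / sigma2) / (N / sigma2 + 1.0 / sigma0_2)
    theta_bayes = w * m + (1.0 - w) * mu0            # weighted average of m and mu0
    print(N, round(m, 3), round(theta_bayes, 3))
```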
• 37. Parametric Classification  The discriminant function: $g_i(x) = p(x\,|\,C_i)\, P(C_i)$, or equivalently $g_i(x) = \log p(x\,|\,C_i) + \log P(C_i)$  Assume that the class densities $p(x\,|\,C_i)$ are Gaussian: $p(x\,|\,C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\!\left[-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right]$, so that $g_i(x) = -\frac{1}{2}\log 2\pi - \log\sigma_i - \frac{(x-\mu_i)^2}{2\sigma_i^2} + \log P(C_i)$ (the log likelihood of a Gaussian sample)
• 38.  Given the sample $\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}$, where $r_i^t = 1$ if $x^t \in C_i$ and $r_i^t = 0$ if $x^t \in C_j,\ j \neq i$  Maximum Likelihood (ML) estimates are $\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}$, $m_i = \frac{\sum_t x^t r_i^t}{\sum_t r_i^t}$, $s_i^2 = \frac{\sum_t (x^t - m_i)^2 r_i^t}{\sum_t r_i^t}$  The discriminant becomes $g_i(x) = -\frac{1}{2}\log 2\pi - \log s_i - \frac{(x - m_i)^2}{2 s_i^2} + \log\hat{P}(C_i)$
• 39.  The first term is a constant and, if the priors are equal, those terms can be dropped.  If we further assume that the variances are equal, $g_i(x)$ becomes $g_i(x) = -(x - m_i)^2$, and we choose $C_i$ if $|x - m_i| = \min_k |x - m_k|$. A sketch of this plug-in classifier follows below.
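The plug-in classifier described on these slides can be sketched in a few lines of NumPy. This is an illustration, not the book's code; the helper names `fit_class_gaussians` and `discriminant` and the two synthetic classes are arbitrary.

```python
import numpy as np

def fit_class_gaussians(x, r):
    """x: (N,) inputs, r: (N,) integer class labels. Returns classes, priors, means, variances."""
    classes = np.unique(r)
    priors = np.array([(r == c).mean() for c in classes])   # P_hat(C_i)
    means  = np.array([x[r == c].mean() for c in classes])  # m_i
    vars_  = np.array([x[r == c].var() for c in classes])   # s_i^2 (MLE, divide by N_i)
    return classes, priors, means, vars_

def discriminant(x, priors, means, vars_):
    """g_i(x) = -0.5 log(2 pi) - log s_i - (x - m_i)^2 / (2 s_i^2) + log P_hat(C_i)."""
    x = np.asarray(x)[:, None]
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(vars_)
            - (x - means) ** 2 / (2 * vars_) + np.log(priors))

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1.0, 200), rng.normal(2, 0.5, 300)])
r = np.concatenate([np.zeros(200, int), np.ones(300, int)])

classes, priors, means, vars_ = fit_class_gaussians(x, r)
pred = classes[np.argmax(discriminant(x, priors, means, vars_), axis=1)]
print((pred == r).mean())     # training accuracy of the plug-in classifier
```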
• 40. [Figure] Equal variances: a single boundary halfway between the means. Likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are equal and the posteriors intersect at one point, which is the threshold of decision.
• 41. [Figure] Variances are different: two boundaries. Likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are unequal and the posteriors intersect at two points.
• 42. Regression  $r = f(x) + \epsilon$, with estimator $g(x\,|\,\theta)$  $\epsilon \sim \mathcal{N}(0, \sigma^2)$, so $p(r\,|\,x) \sim \mathcal{N}(g(x\,|\,\theta), \sigma^2)$  Regression assumes zero-mean Gaussian noise added to the model; here, the model is linear.  $p(r\,|\,x)$: the probability of the output given the input.  Given a sample $\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}$, the log likelihood is $\mathcal{L}(\theta\,|\,\mathcal{X}) = \log\prod_{t=1}^{N} p(x^t, r^t) = \sum_{t=1}^{N}\log p(r^t\,|\,x^t) + \sum_{t=1}^{N}\log p(x^t)$
• 43. Regression: From LogL to Error  Ignoring the second term (because it does not depend on our estimator): $\mathcal{L}(\theta\,|\,\mathcal{X}) = \log\prod_{t=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(r^t - g(x^t\,|\,\theta))^2}{2\sigma^2}\right] = -N\log(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2}\sum_{t=1}^{N}\left[r^t - g(x^t\,|\,\theta)\right]^2$  Maximizing this is equivalent to minimizing $E(\theta\,|\,\mathcal{X}) = \frac{1}{2}\sum_{t=1}^{N}\left[r^t - g(x^t\,|\,\theta)\right]^2$, the least squares estimate.
• 44. Example: Linear Regression  Let $g(x^t\,|\,w_1, w_0) = w_1 x^t + w_0$ and minimize $E(\theta\,|\,\mathcal{X}) = \frac{1}{2}\sum_{t=1}^{N}\left[r^t - g(x^t\,|\,\theta)\right]^2$.  Setting the derivatives to zero, we obtain $\sum_t r^t = N w_0 + w_1 \sum_t x^t$ and $\sum_t r^t x^t = w_0\sum_t x^t + w_1\sum_t (x^t)^2$  In matrix form $A\mathbf{w} = \mathbf{y}$ with $A = \begin{bmatrix} N & \sum_t x^t \\ \sum_t x^t & \sum_t (x^t)^2 \end{bmatrix}$, $\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$, $\mathbf{y} = \begin{bmatrix} \sum_t r^t \\ \sum_t r^t x^t \end{bmatrix}$, so $\mathbf{w} = A^{-1}\mathbf{y}$ (Exercise!!). See the sketch below.
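A minimal sketch (assuming NumPy) that builds A and y exactly as above and solves A w = y; the synthetic line r = 0.5 + 2x plus noise is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 50)
r = 0.5 + 2.0 * x + rng.normal(0, 0.3, 50)     # r^t = w0 + w1 * x^t + noise

A = np.array([[len(x),  x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])
w0, w1 = np.linalg.solve(A, y)                 # solve A w = y
print(w0, w1)                                  # roughly 0.5 and 2.0
```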
• 45. Example: Polynomial Regression  $g(x^t\,|\,w_k, \ldots, w_2, w_1, w_0) = w_k (x^t)^k + \cdots + w_2 (x^t)^2 + w_1 x^t + w_0$  Let $D = \begin{bmatrix} 1 & x^1 & (x^1)^2 & \cdots & (x^1)^k \\ 1 & x^2 & (x^2)^2 & \cdots & (x^2)^k \\ \vdots & & & & \vdots \\ 1 & x^N & (x^N)^2 & \cdots & (x^N)^k \end{bmatrix}$, $\mathbf{r} = \begin{bmatrix} r^1 \\ r^2 \\ \vdots \\ r^N \end{bmatrix}$  We can obtain $A = D^{T}D$, $\mathbf{y} = D^{T}\mathbf{r}$, and from $A\mathbf{w} = \mathbf{y}$, $\mathbf{w} = (D^{T}D)^{-1}D^{T}\mathbf{r}$ (see page 75). A sketch of this fit appears below.
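A corresponding sketch for the polynomial case (assuming NumPy): `np.vander` builds the design matrix D and the normal equations DᵀD w = Dᵀr are solved directly; the degree-2 synthetic target is arbitrary.

```python
import numpy as np

def fit_polynomial(x, r, k):
    """Solve D^T D w = D^T r where D has columns 1, x, x^2, ..., x^k."""
    D = np.vander(x, k + 1, increasing=True)   # N x (k + 1) design matrix
    return np.linalg.solve(D.T @ D, D.T @ r)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
r = 1.0 - 2.0 * x + 0.5 * x ** 2 + rng.normal(0, 0.1, x.size)

print(fit_polynomial(x, r, k=2))               # roughly [1.0, -2.0, 0.5]
```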
• 46. Other Error Measures  Square Error: $E(\theta\,|\,\mathcal{X}) = \frac{1}{2}\sum_{t=1}^{N}\left[r^t - g(x^t\,|\,\theta)\right]^2$  Relative Square Error: $E(\theta\,|\,\mathcal{X}) = \frac{\sum_{t=1}^{N}\left[r^t - g(x^t\,|\,\theta)\right]^2}{\sum_{t=1}^{N}\left[r^t - \bar{r}\right]^2}$  Absolute Error: $E(\theta\,|\,\mathcal{X}) = \sum_t |r^t - g(x^t\,|\,\theta)|$  ε-sensitive Error: $E(\theta\,|\,\mathcal{X}) = \sum_t \mathbf{1}\!\left(|r^t - g(x^t\,|\,\theta)| > \epsilon\right)$
• 47. Bias and Variance (see Eq. 4.17)  The expected square error at a particular point x, with respect to a fixed g(x) and variations in r based on p(r|x), is the estimate for the error at point x: $E[(r - g(x))^2\,|\,x] = E[(r - E[r|x])^2\,|\,x] + (E[r|x] - g(x))^2$, i.e., noise + squared error.  Now note that g(·) is a random variable (a function) of the sample S, so the expectation of our estimate for the error at point x (with respect to sample variation) is $E_S\!\left[E[(r - g(x))^2\,|\,x]\right] = E[(r - E[r|x])^2\,|\,x] + E_S\!\left[(E[r|x] - g(x))^2\right]$
• 48. Bias and Variance  Taking the expected value (an average over samples X, all of size N and drawn from the same joint density p(x, r)), the squared error term decomposes as $E_S\!\left[(E[r|x] - g(x))^2\,|\,x\right] = (E[r|x] - E_S[g(x)])^2 + E_S\!\left[(g(x) - E_S[g(x)])^2\right]$, i.e., squared error = bias² + variance (see Eq. 4.11 and pages 66 and 76).
• 49. Estimating Bias and Variance  Samples $\mathcal{X}_i = \{x^t_i, r^t_i\}$, i = 1, ..., M, t = 1, ..., N, are used to fit $g_i(x)$, i = 1, ..., M: $\bar{g}(x) = \frac{1}{M}\sum_{i=1}^{M} g_i(x)$, $\mathrm{Bias}^2(g) = \frac{1}{N}\sum_t \left[\bar{g}(x^t) - f(x^t)\right]^2$, $\mathrm{Variance}(g) = \frac{1}{NM}\sum_t\sum_i \left[g_i(x^t) - \bar{g}(x^t)\right]^2$
• 50. Bias/Variance Dilemma  Examples: $g_i(x) = 2$ has no variance and high bias; $g_i(x) = \sum_t r^t_i / N$ has lower bias but some variance.  As we increase model complexity, bias decreases (a better fit to data) and variance increases (the fit varies more with the data).  Bias/Variance dilemma (Geman et al., 1992). A simulation estimating these two terms follows below.
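To make the dilemma concrete, the sketch below (an illustration only, assuming NumPy; the sine target, noise level, and polynomial degrees are arbitrary) fits M polynomial models on M independently drawn noisy samples and estimates Bias² and Variance with the formulas of the previous slide: bias falls and variance rises as the degree grows.

```python
import numpy as np

def f(x):                         # the "true" function
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
M, N, noise = 200, 25, 0.3
x = np.linspace(0, 1, N)          # fixed evaluation points

def bias2_and_variance(degree):
    """Fit M models of a given polynomial degree on M noisy samples, then estimate
    Bias^2 = mean_t (gbar - f)^2 and Variance = mean_{t,i} (g_i - gbar)^2."""
    fits = np.empty((M, N))
    for i in range(M):
        r = f(x) + rng.normal(0, noise, N)       # sample X_i
        fits[i] = np.polyval(np.polyfit(x, r, degree), x)
    gbar = fits.mean(axis=0)
    return ((gbar - f(x)) ** 2).mean(), ((fits - gbar) ** 2).mean()

for degree in (0, 1, 3, 9):
    print(degree, bias2_and_variance(degree))    # bias falls, variance rises with degree
```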
• 52. [Figure] Polynomial Regression: the best fit minimizes error; lower-order polynomials underfit and higher-order polynomials overfit.
• 54. Model Selection Procedures There are a number of procedures we can use to fine-tune model complexity.  Cross-validation  Regularization  Structural risk minimization (SRM)  Minimum description length (MDL)  Bayesian Model Selection
• 55. Model Selection Procedures  Cross-validation:  Measure generalization accuracy by testing on data unused during training, to find the optimal complexity.  Regularization:  Penalize complex models: E' = error on data + λ · model complexity  Structural risk minimization (SRM):  Find the model that is simplest in terms of order and best in terms of empirical error on the data.  Model complexity measures: polynomials of increasing order, VC dimension, ...  Minimum description length (MDL):  The Kolmogorov complexity of a data set is defined as the shortest description of the data.
• 56. Model Selection Procedures  Bayesian Model Selection:  Prior on models, p(model): $p(\text{model}\,|\,\text{data}) = \frac{p(\text{data}\,|\,\text{model})\; p(\text{model})}{p(\text{data})}$  Discussions:  When the prior is chosen such that we give higher probabilities to simpler models, the Bayesian approach, regularization, SRM, and MDL are equivalent.  Cross-validation is the best approach if there is a large enough validation dataset.
• 57. Cross-Validation  Cross-validation is a technique in which we train our model using a subset of the data set and then evaluate it using the complementary subset.  The three steps involved in cross-validation are as follows: 1. Reserve some portion of the sample data set. 2. Train the model using the rest of the data set. 3. Test the model using the reserved portion of the data set. Methods of Cross Validation  Validation  LOOCV (Leave One Out Cross Validation)  K-Fold Cross Validation
• 58. Methods of Cross Validation Validation  In this method, we perform training on 50% of the given data set and the remaining 50% is used for testing.  The major drawback of this method is that, since we train on only 50% of the dataset, the remaining 50% may contain important information that we miss while training our model, i.e., higher bias.
• 59. Methods of Cross Validation LOOCV (Leave One Out Cross Validation)  In this method, we perform training on the whole data set but leave out a single data point, and we iterate this for each data point.  It has advantages as well as disadvantages.  An advantage of this method is that we make use of all data points, hence it has low bias.  The major drawback is that it leads to higher variation in the test estimate, as we are testing against a single data point; if that data point is an outlier, it can lead to higher variation.  Another drawback is that it takes a lot of execution time, as it iterates as many times as there are data points.
• 60. Methods of Cross Validation K-Fold Cross Validation  In this method, we split the data set into k subsets (known as folds), then train on k−1 of the subsets and leave one subset out for evaluating the trained model.  We iterate k times, with a different subset reserved for testing each time. Note:  A value of k = 10 is commonly suggested, as a lower value of k moves towards the simple validation method and a higher value of k leads to the LOOCV method. A minimal k-fold sketch follows below.
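A minimal, hand-rolled k-fold sketch (assuming NumPy; polynomial models and the sine data are arbitrary stand-ins for any model family) that trains on k−1 folds, tests on the held-out fold, and averages the test error:

```python
import numpy as np

def cross_validate(x, r, degree, k=10, seed=0):
    """Average held-out squared error of a degree-`degree` polynomial over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)   # k roughly equal folds
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.polyfit(x[train], r[train], degree)       # train on k - 1 folds
        pred = np.polyval(w, x[test])                     # evaluate on the held-out fold
        errors.append(((r[test] - pred) ** 2).mean())
    return float(np.mean(errors))

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 60)
r = np.sin(3 * x) + rng.normal(0, 0.2, x.size)

for d in (1, 3, 5, 9):
    print(d, cross_validate(x, r, d))    # pick the complexity with the lowest CV error
```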
• 62. Regularization  Regularization refers to techniques that are used to calibrate machine learning models in order to minimize the adjusted loss function and prevent overfitting or underfitting. Using regularization, we can fit our machine learning model appropriately on a given data set and hence reduce its errors.
• 63. Regularization Techniques  There are two main types of regularization techniques: Ridge Regularization and Lasso Regularization.
• 64. Ridge Regularization  Also known as Ridge Regression, it modifies over-fitted or under-fitted models by adding a penalty equivalent to the sum of the squares of the magnitudes of the coefficients.  This means that the mathematical function representing our machine learning model is minimized and the coefficients are calculated.  The magnitudes of the coefficients are squared and added; Ridge Regression performs regularization by shrinking the coefficients. The cost function of ridge regression is $\sum_{t=1}^{N}\left[r^t - g(x^t\,|\,\mathbf{w})\right]^2 + \lambda\sum_j w_j^2$. A sketch of the closed-form solution appears below.
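A sketch of the closed-form ridge solution w = (DᵀD + λI)⁻¹Dᵀr (assuming NumPy; the deliberately over-complex degree-9 design matrix and the λ values are arbitrary, and the intercept is penalized here only to keep the example short):

```python
import numpy as np

def ridge_fit(D, r, lam):
    """Minimize ||r - D w||^2 + lam * ||w||^2  ->  w = (D^T D + lam I)^{-1} D^T r."""
    return np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ r)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
r = 1.0 + 2.0 * x + rng.normal(0, 0.2, x.size)
D = np.vander(x, 10, increasing=True)          # deliberately over-complex degree-9 model

for lam in (0.0, 0.1, 10.0):
    print(lam, np.round(ridge_fit(D, r, lam), 2))   # larger lam shrinks the coefficients
```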
• 65. Lasso Regularization  It modifies over-fitted or under-fitted models by adding a penalty equivalent to the sum of the absolute values of the coefficients.  Lasso regression also performs coefficient minimization, but instead of squaring the magnitudes of the coefficients, it uses their absolute values.  This means that coefficients can be shrunk all the way to zero, so some features are effectively removed from the model.
• 66. Key Differences between Ridge and Lasso Regression  Ridge regression helps us to reduce only the overfitting in the model while keeping all the features present in the model.  It reduces the complexity of the model by shrinking the coefficients, whereas Lasso regression helps in reducing overfitting as well as performing automatic feature selection.  Lasso regression tends to shrink coefficients to exactly zero, whereas Ridge regression never sets a coefficient exactly to zero.

### Editor's notes

1. S consists of $\{(x^t, r^t)\}$. Its distribution is $P(r^1, \ldots, r^N \mid x^1, \ldots, x^N)\, P(x^1, \ldots, x^N)$.