1. Introduction to Statistical Machine Learning

Christfried Webers
Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University
Canberra

February – June 2011

© 2011 Christfried Webers, NICTA, The Australian National University

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

2. Part IV: Probability and Uncertainty

Boxes with Apples and Oranges
Bayes' Theorem
Bayes' Probabilities
Probability Distributions
Gaussian Distribution over a Vector
Decision Theory
Model Selection - Key Ideas

3. Simple Experiment

1 Choose a box.
  Red box: p(B = r) = 4/10
  Blue box: p(B = b) = 6/10
2 Choose any item of the selected box with equal probability.

Given that we have chosen an orange, what is the probability that the box we chose was the blue one?

4. What do we know?

p(F = o | B = b) = 1/4
p(F = a | B = b) = 3/4
p(F = o | B = r) = 3/4
p(F = a | B = r) = 1/4

Given that we have chosen an orange, what is the probability that the box we chose was the blue one?

p(B = b | F = o)?

5. Calculating the Posterior p(B = b | F = o)

Bayes' theorem:

p(B = b | F = o) = \frac{p(F = o | B = b) \, p(B = b)}{p(F = o)}

Sum rule for the denominator:

p(F = o) = p(F = o, B = b) + p(F = o, B = r)
         = p(F = o | B = b) p(B = b) + p(F = o | B = r) p(B = r)
         = \frac{1}{4} \times \frac{6}{10} + \frac{3}{4} \times \frac{4}{10} = \frac{9}{20}

Therefore

p(B = b | F = o) = \frac{1}{4} \times \frac{6}{10} \times \frac{20}{9} = \frac{1}{3}

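The same calculation can be spelled out in a few lines of code. This is a minimal sketch of the box-and-fruit posterior; the dictionary names are purely illustrative, the numbers are those given on the slide.

```python
# Sketch of the posterior computation for the box-and-fruit example.
prior = {"r": 4 / 10, "b": 6 / 10}       # p(B)
lik_orange = {"r": 3 / 4, "b": 1 / 4}    # p(F = o | B)

# Sum rule: p(F = o) = sum over boxes of p(F = o | B) p(B)
evidence = sum(lik_orange[box] * prior[box] for box in prior)

# Bayes' theorem: p(B = b | F = o)
posterior_blue = lik_orange["b"] * prior["b"] / evidence

print(evidence)        # 0.45  (= 9/20)
print(posterior_blue)  # 0.333... (= 1/3)
```
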
6. Bayes' Theorem - revisited

Before choosing an item from a box, the most complete information is in p(B), the prior.
Note that in our example p(B = b) = 6/10. Therefore, choosing the box from the prior alone, we would opt for the blue box.
Once we observe some data (e.g. choose an orange), we can calculate p(B = b | F = o), the posterior probability, via Bayes' theorem.
After observing an orange, the posterior probability is p(B = b | F = o) = 1/3 and therefore p(B = r | F = o) = 2/3.
Having observed an orange, it is now more likely that the orange came from the red box.

7. Bayes' Rule

p(Y | X) = \frac{p(X | Y) \, p(Y)}{p(X)} = \frac{p(X | Y) \, p(Y)}{\sum_{Y} p(X | Y) \, p(Y)}

posterior = likelihood × prior / normalisation

8. Bayes' Probabilities

classical or frequentist interpretation of probabilities
Bayesian view: probabilities represent uncertainty
Example: Will the Arctic ice cap have disappeared by the end of the century?
fresh evidence can change the opinion on ice loss
goal: quantify uncertainty and revise it in the light of new evidence
use the Bayesian interpretation of probability

9. Andrey Kolmogorov - Axiomatic Probability Theory (1933)

Let (Ω, F, P) be a measure space with P(Ω) = 1.
Then (Ω, F, P) is a probability space, with sample space Ω, event space F and probability measure P.

1. Axiom: P(E) ≥ 0 for all E ∈ F.
2. Axiom: P(Ω) = 1.
3. Axiom: P(E_1 ∪ E_2 ∪ ...) = \sum_i P(E_i) for any countable sequence of pairwise disjoint events E_1, E_2, ....

10. Richard Threlkeld Cox (1946, 1961)

Assume numerical values are used to represent degrees of belief.
Define a set of axioms encoding common-sense properties of such beliefs.
This results in a set of rules for manipulating degrees of belief which are equivalent to the sum and product rules of probability.
Many other authors have proposed different sets of axioms and properties.
Result: the numerical quantities all behave according to the rules of probability.
These quantities are denoted Bayesian probabilities.

11. Curve fitting - revisited

uncertainty about the parameter w is captured in the prior probability p(w)
observed data D = {t_1, ..., t_N}
calculate the uncertainty in w after the data D have been observed:

p(w | D) = \frac{p(D | w) \, p(w)}{p(D)}

p(D | w) viewed as a function of w is the likelihood function
the likelihood expresses how probable the observed data are for different values of w
it is not a probability distribution over w

12. Likelihood Function - Frequentist versus Bayesian

Likelihood function p(D | w)

Frequentist Approach:
  w is considered a fixed parameter
  its value is defined by some 'estimator'
  error bars on the estimated w are obtained from the distribution of possible data sets D

Bayesian Approach:
  there is only one single data set D (the one observed)
  uncertainty in the parameters comes from a probability distribution over w

13. Frequentist Estimator - Maximum Likelihood

choose w for which the likelihood p(D | w) is maximal
equivalently, choose w for which the probability of the observed data is maximal
Machine Learning: the error function is the negative log of the likelihood function
log is a monotonic function
maximising the likelihood ⟺ minimising the error
Example: A fair-looking coin is tossed three times and always lands on heads.
The maximum likelihood estimate of the probability of landing heads gives 1.

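A small sketch of the coin example, assuming the Bernoulli likelihood introduced on a later slide; the variable names are illustrative.

```python
import numpy as np

# Three tosses, all heads (1 = heads, 0 = tails).
data = np.array([1, 1, 1])

# For Bern(x | mu), the log likelihood is sum_n [x_n log mu + (1 - x_n) log(1 - mu)],
# which is maximised by the sample mean.
mu_ml = data.mean()
print(mu_ml)   # 1.0 -- maximum likelihood claims heads is certain

# Minimising the error (negative log likelihood) over a grid gives the same answer.
mu_grid = np.linspace(1e-3, 1 - 1e-3, 999)
neg_log_lik = -np.sum(data[:, None] * np.log(mu_grid)
                      + (1 - data[:, None]) * np.log(1 - mu_grid), axis=0)
print(mu_grid[np.argmin(neg_log_lik)])   # close to 1
```
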
14. Bayesian Approach

including prior knowledge is easy (via the prior over w)
BUT: if the prior is badly chosen, it can lead to bad results
the choice of prior is subjective
sometimes the choice of prior is motivated by a convenient mathematical form
need to sum/integrate over the whole parameter space
advances in sampling (Markov Chain Monte Carlo methods)
advances in approximation schemes (Variational Bayes, Expectation Propagation)

15. The Gaussian Distribution

x ∈ R
Gaussian distribution with mean µ and variance σ²:

N(x | µ, σ²) = \frac{1}{(2πσ²)^{1/2}} \exp\left\{ -\frac{1}{2σ²} (x - µ)² \right\}

[Figure: the density N(x | µ, σ²), centred at µ with width 2σ.]

16. The Gaussian Distribution

N(x | µ, σ²) > 0

\int_{-∞}^{∞} N(x | µ, σ²) dx = 1

Expectation of x:

E[x] = \int_{-∞}^{∞} N(x | µ, σ²) \, x \, dx = µ

Expectation of x²:

E[x²] = \int_{-∞}^{∞} N(x | µ, σ²) \, x² \, dx = µ² + σ²

Variance of x:

var[x] = E[x²] - E[x]² = σ²

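These moments can be checked numerically. A minimal sketch, with illustrative values for µ and σ:

```python
import numpy as np

mu, sigma = 1.5, 0.8   # illustrative parameters

def gauss(x, mu, sigma):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
p = gauss(x, mu, sigma)

print(np.trapz(p, x))           # ~1               (normalisation)
print(np.trapz(p * x, x))       # ~mu              (E[x])
print(np.trapz(p * x**2, x))    # ~mu^2 + sigma^2  (E[x^2])
print(np.trapz(p * x**2, x) - np.trapz(p * x, x) ** 2)   # ~sigma^2 (var[x])
```
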
17. The Mode of a Probability Distribution

Mode of a distribution: the value that occurs most frequently in a probability distribution.
For a probability density function: the value x at which the probability density attains its maximum.
The Gaussian distribution has one mode (it is unimodal); the mode is µ.
If there are multiple local maxima in the probability distribution, the distribution is called multimodal (example: a mixture of three Gaussians).

[Figures: the unimodal Gaussian N(x | µ, σ²) with mode µ, and a multimodal density p(x).]

18. The Bernoulli Distribution

Two possible outcomes x ∈ {0, 1} (e.g. a coin, which may be damaged).

p(x = 1 | µ) = µ for 0 ≤ µ ≤ 1
p(x = 0 | µ) = 1 − µ

Bernoulli distribution:

Bern(x | µ) = µ^x (1 − µ)^{1−x}

Expectation of x: E[x] = µ
Variance of x: var[x] = µ(1 − µ)

19. The Binomial Distribution

Flip a coin N times. What is the distribution of observing heads exactly m times?
This is a distribution over m ∈ {0, ..., N}.

Binomial distribution:

Bin(m | N, µ) = \binom{N}{m} µ^m (1 − µ)^{N−m}

[Figure: histogram of Bin(m | N = 10, µ = 0.25) over m = 0, ..., 10.]

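A minimal sketch of the binomial pmf, evaluated for the setting of the figure (N = 10, µ = 0.25); the Bernoulli distribution is the special case N = 1. The function name is illustrative.

```python
import numpy as np
from math import comb

def binomial_pmf(m, N, mu):
    """Bin(m | N, mu) = C(N, m) mu^m (1 - mu)^(N - m)."""
    return comb(N, m) * mu**m * (1 - mu) ** (N - m)

N, mu = 10, 0.25
pmf = np.array([binomial_pmf(m, N, mu) for m in range(N + 1)])

print(pmf.round(3))                      # peaks around m = 2
print(pmf.sum())                         # 1.0 -- it is a distribution over m
print((np.arange(N + 1) * pmf).sum())    # mean N * mu = 2.5
```
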
20. The Beta Distribution

Beta distribution:

Beta(µ | a, b) = \frac{Γ(a + b)}{Γ(a) Γ(b)} µ^{a−1} (1 − µ)^{b−1}

[Figure: Beta(µ | a, b) for the four parameter settings (a, b) = (0.1, 0.1), (1, 1), (2, 3) and (8, 4).]

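A small sketch of the beta density, written directly with the gamma function; the parameter pairs are those of the figure panels, and the grid-based mean check is only an assumption-laden approximation.

```python
import numpy as np
from math import gamma

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) = Gamma(a+b) / (Gamma(a) Gamma(b)) * mu^(a-1) (1-mu)^(b-1)."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu ** (a - 1) * (1 - mu) ** (b - 1)

mu = np.linspace(0.001, 0.999, 999)
for a, b in [(1, 1), (2, 3), (8, 4)]:        # three of the four panels in the figure
    density = beta_pdf(mu, a, b)
    mean = np.trapz(mu * density, mu)
    print(a, b, round(mean, 3))              # close to a / (a + b)

# The (a, b) = (0.1, 0.1) panel has integrable singularities at mu = 0 and mu = 1,
# so a naive uniform grid like the one above is too coarse near the endpoints.
```
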
21. The Gaussian Distribution over a Vector x

x ∈ R^D
Gaussian distribution with mean µ ∈ R^D and covariance matrix Σ ∈ R^{D×D}:

N(x | µ, Σ) = \frac{1}{(2π)^{D/2} |Σ|^{1/2}} \exp\left\{ -\frac{1}{2} (x − µ)^T Σ^{−1} (x − µ) \right\}

where |Σ| is the determinant of Σ.

[Figure: contours of a two-dimensional Gaussian centred at µ, with principal axes along the eigenvectors u_1, u_2 and lengths proportional to λ_1^{1/2}, λ_2^{1/2}.]

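A minimal sketch of evaluating this density at a point; µ and Σ below are illustrative values, not from the slides.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) for a single point x in R^D."""
    D = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])

print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))
```
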
22. The Gaussian Distribution over a vector x

We can find a linear transformation to a new coordinate system y in which x becomes

y = U^T (x − µ),

where U is the eigenvector matrix of the covariance matrix Σ with eigenvalue matrix E:

Σ U = U E,  E = \begin{pmatrix} λ_1 & 0 & \dots & 0 \\ 0 & λ_2 & \dots & 0 \\ \dots & \dots & \dots & \dots \\ 0 & 0 & \dots & λ_D \end{pmatrix}

U can be made an orthogonal matrix, so the columns u_i of U are unit vectors which are orthogonal to each other: u_i^T u_j = 1 if i = j, and 0 if i ≠ j.

Now we can write Σ and its inverse (prove that Σ Σ^{−1} = I):

Σ = U E U^T = \sum_{i=1}^{D} λ_i u_i u_i^T,    Σ^{−1} = U E^{−1} U^T = \sum_{i=1}^{D} \frac{1}{λ_i} u_i u_i^T

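These identities are easy to check numerically; a minimal sketch with an illustrative covariance matrix:

```python
import numpy as np

Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])   # illustrative covariance matrix

# np.linalg.eigh returns the eigenvalues lambda_i and an orthogonal eigenvector matrix U.
eigvals, U = np.linalg.eigh(Sigma)
E = np.diag(eigvals)

print(np.allclose(Sigma @ U, U @ E))        # Sigma U = U E
print(np.allclose(U.T @ U, np.eye(2)))      # columns of U are orthonormal
print(np.allclose(Sigma, U @ E @ U.T))      # Sigma = U E U^T
print(np.allclose(np.linalg.inv(Sigma),
                  U @ np.diag(1 / eigvals) @ U.T))   # Sigma^{-1} = U E^{-1} U^T
```
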
23. The Gaussian Distribution over a vector x

Now use the linear transformation y = U^T (x − µ) and Σ^{−1} = U E^{−1} U^T to transform the exponent (x − µ)^T Σ^{−1} (x − µ) into

y^T E^{−1} y = \sum_{j=1}^{D} \frac{y_j²}{λ_j}

Exponentiating the sum (and taking care of the normalisation factors) results in a product of scalar-valued Gaussian distributions along the orthogonal directions u_j:

p(y) = \prod_{j=1}^{D} \frac{1}{(2πλ_j)^{1/2}} \exp\left\{ -\frac{y_j²}{2λ_j} \right\}

[Figure: the two-dimensional Gaussian contour again, with the rotated coordinates y_1, y_2 along the eigenvector directions u_1, u_2 and axis lengths λ_1^{1/2}, λ_2^{1/2}.]

24. Partitioned Gaussians

Given a joint Gaussian distribution N(x | µ, Σ), where µ is the mean vector, Σ the covariance matrix, and Λ ≡ Σ^{−1} the precision matrix.
Assume that the variables can be partitioned into two sets:

x = \begin{pmatrix} x_a \\ x_b \end{pmatrix},  µ = \begin{pmatrix} µ_a \\ µ_b \end{pmatrix},  Σ = \begin{pmatrix} Σ_{aa} & Σ_{ab} \\ Σ_{ba} & Σ_{bb} \end{pmatrix},  Λ = \begin{pmatrix} Λ_{aa} & Λ_{ab} \\ Λ_{ba} & Λ_{bb} \end{pmatrix}

Λ_{aa} = (Σ_{aa} − Σ_{ab} Σ_{bb}^{−1} Σ_{ba})^{−1}
Λ_{ab} = −(Σ_{aa} − Σ_{ab} Σ_{bb}^{−1} Σ_{ba})^{−1} Σ_{ab} Σ_{bb}^{−1}

Conditional distribution:

p(x_a | x_b) = N(x_a | µ_{a|b}, Λ_{aa}^{−1})
µ_{a|b} = µ_a − Λ_{aa}^{−1} Λ_{ab} (x_b − µ_b)

Marginal distribution:

p(x_a) = N(x_a | µ_a, Σ_{aa})

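A minimal sketch of the conditional p(x_a | x_b) for a two-dimensional Gaussian, where x_a and x_b are each one-dimensional; the numbers are illustrative (chosen only to be a valid covariance matrix), and x_b = 0.7 matches the conditioning value used in the figure on the next slide.

```python
import numpy as np

mu = np.array([0.5, 0.5])
Sigma = np.array([[0.04, 0.03],
                  [0.03, 0.09]])

Lambda = np.linalg.inv(Sigma)            # precision matrix
L_aa, L_ab = Lambda[0, 0], Lambda[0, 1]

x_b = 0.7
mu_a_given_b = mu[0] - (1 / L_aa) * L_ab * (x_b - mu[1])   # conditional mean
var_a_given_b = 1 / L_aa                                   # conditional variance

print(mu_a_given_b, var_a_given_b)

# Equivalent form via the Schur complement, as a cross-check:
print(mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x_b - mu[1]),
      Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1])
```
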
25. Partitioned Gaussians

[Figure: contours of a Gaussian distribution over two variables x_a and x_b (left), and the marginal distribution p(x_a) and conditional distribution p(x_a | x_b) for x_b = 0.7 (right).]

26. Nonlinear Change of Variables in Distributions

Given some density p_x(x).
Consider a nonlinear change of variables x = g(y).
What is the new probability distribution p_y(y) in terms of the variable y?

p_y(y) = p_x(x) \left| \frac{dx}{dy} \right| = p_x(g(y)) \, |g'(y)|

For vector-valued x and y:

p_y(y) = p_x(x) \, |J|,  where J_{ij} = \frac{∂x_i}{∂y_j}.

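A numerical check of the scalar formula; the choices p_x = N(0, 1) and x = g(y) = y³ + y below are illustrative assumptions (g is a monotone bijection of the real line, so only |g'(y)| is needed).

```python
import numpy as np

def p_x(x):
    """Density of x: standard normal (illustrative choice)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def g(y):
    return y**3 + y          # monotone bijection on R

def g_prime(y):
    return 3 * y**2 + 1

y = np.linspace(-3, 3, 100_001)
p_y = p_x(g(y)) * np.abs(g_prime(y))    # p_y(y) = p_x(g(y)) |g'(y)|

print(np.trapz(p_y, y))                 # ~1: the transformed density is properly normalised
```
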
27. Decision Theory - Key Ideas

Two classes C_1 and C_2
joint distribution p(x, C_k)
using Bayes' theorem:

p(C_k | x) = \frac{p(x | C_k) \, p(C_k)}{p(x)}

Example: cancer treatment (k = 2)
data x: an X-ray image
C_1: patient has cancer (C_2: patient has no cancer)
p(C_1) is the prior probability of a person having cancer
p(C_1 | x) is the posterior probability of a person having cancer after having seen the X-ray data

28. Decision Theory - Key Ideas

Need a rule which assigns each value of the input x to one of the available classes.
The input space is partitioned into decision regions R_k.
This leads to decision boundaries or decision surfaces.
Probability of a mistake:

p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
           = \int_{R_1} p(x, C_2) \, dx + \int_{R_2} p(x, C_1) \, dx

[Figure: an input space partitioned into decision regions R_1 and R_2.]

29. Decision Theory - Key Ideas

Probability of a mistake:

p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
           = \int_{R_1} p(x, C_2) \, dx + \int_{R_2} p(x, C_1) \, dx

goal: minimise p(mistake)

[Figure: the joint densities p(x, C_1) and p(x, C_2) over x, with a decision boundary at x_0 dividing the axis into the regions R_1 and R_2.]

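A minimal sketch of this minimisation in one dimension, using illustrative Gaussian class-conditional densities and a simple threshold rule (x < x_0 → decide C_1, x ≥ x_0 → decide C_2); all numbers are assumptions for the example.

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

prior = {1: 0.5, 2: 0.5}
x = np.linspace(-10, 10, 200_001)
joint1 = prior[1] * gauss(x, -1.0, 1.0)    # p(x, C1), illustrative
joint2 = prior[2] * gauss(x, 2.0, 1.5)     # p(x, C2), illustrative

def p_mistake(x0):
    """Integral over R1 (x < x0) of p(x, C2) plus integral over R2 (x >= x0) of p(x, C1)."""
    R1, R2 = x < x0, x >= x0
    return np.trapz(joint2[R1], x[R1]) + np.trapz(joint1[R2], x[R2])

thresholds = np.linspace(-2, 3, 501)
errors = [p_mistake(t) for t in thresholds]
best = thresholds[int(np.argmin(errors))]
print(best, min(errors))   # the best threshold sits where p(x, C1) = p(x, C2)
```
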
30. Decision Theory - Key Ideas

multiple classes
instead of minimising the probability of mistakes, maximise the probability of correct classification:

p(correct) = \sum_{k=1}^{K} p(x ∈ R_k, C_k) = \sum_{k=1}^{K} \int_{R_k} p(x, C_k) \, dx

31. Minimising the Expected Loss

Not all mistakes are equally costly.
Weight each misclassification of x to the wrong class C_j, instead of assigning it to the correct class C_k, by a factor L_{kj}.
The expected loss is now

E[L] = \sum_k \sum_j \int_{R_j} L_{kj} \, p(x, C_k) \, dx

Goal: minimise the expected loss E[L].

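A minimal sketch of the resulting decision rule: for each x, assign the class j that minimises \sum_k L_{kj} p(C_k | x). The loss matrix below is an assumption in the spirit of the cancer example, where missing a cancer is taken to be 100 times worse than a false alarm.

```python
import numpy as np

L = np.array([[0.0, 100.0],    # true class C1 (cancer):  loss of deciding j = 0, 1
              [1.0,   0.0]])   # true class C2 (healthy): loss of deciding j = 0, 1

def decide(posteriors):
    """posteriors: array of p(C_k | x) for k = 0, 1. Returns the class index j minimising the expected loss."""
    expected_loss_per_class = L.T @ posteriors   # entry j: sum_k L[k, j] p(C_k | x)
    return int(np.argmin(expected_loss_per_class))

print(decide(np.array([0.02, 0.98])))    # even a 2% cancer posterior triggers class 0 (cancer)
print(decide(np.array([0.001, 0.999])))  # a very confident 'healthy' posterior gives class 1
```
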
32. The Reject Region

Avoid making automated decisions on difficult cases.
Difficult cases:
  the posterior probabilities p(C_k | x) are very small
  the joint distributions p(x, C_k) have comparable values

[Figure: the posteriors p(C_1 | x) and p(C_2 | x) over x; inputs for which the largest posterior falls below a threshold θ form the reject region.]

33. S-fold Cross Validation

Given a set of N data items and targets.
Goal: Find the best model (type of model, number of parameters such as the order p of the polynomial, or the regularisation constant λ). Avoid overfitting.
Solution: Train a machine learning algorithm with some of the data, evaluate it with the rest.
If we have plenty of data:
1 Train a range of models, or a model with a range of parameters.
2 Compare the performance on an independent data set (the validation set) and choose the model with the best predictive performance.
3 Still, overfitting to the validation set can occur. Therefore, use a third test set for the final evaluation. (Keep the test set in a safe and never give it to the developers ;–)

34. S-fold Cross Validation

With few data there is a dilemma: few training data or few test data.
The solution is cross-validation.
Use a portion of (S − 1)/S of the available data for training, but use all the data to assess the performance.
For very scarce data one may use S = N, which is also called the leave-one-out technique.

35. S-fold Cross Validation

Partition the data into S groups.
Use S − 1 groups to train a set of models that are then evaluated on the remaining group.
Repeat for all S choices of the held-out group, and average the performance scores from the S runs.

[Figure: the four runs for S = 4, each holding out a different quarter of the data.]

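A minimal sketch of the procedure; train_model and evaluate are placeholder functions (not from the lecture) standing in for whatever model and performance measure are being compared.

```python
import numpy as np

def s_fold_cross_validation(X, t, S, train_model, evaluate):
    """Return the performance score averaged over the S held-out groups."""
    N = len(X)
    indices = np.random.permutation(N)
    folds = np.array_split(indices, S)            # S groups of (roughly) equal size

    scores = []
    for s in range(S):
        held_out = folds[s]
        train = np.concatenate([folds[j] for j in range(S) if j != s])
        model = train_model(X[train], t[train])   # train on the other S - 1 groups
        scores.append(evaluate(model, X[held_out], t[held_out]))
    return np.mean(scores)

# Choosing S = N corresponds to the leave-one-out technique from the previous slide.
```
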