1. Introduction to Statistical Machine Learning

Christfried Webers
Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University
Canberra

February – June 2011

© 2011 Christfried Webers, NICTA, The Australian National University

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

2. Part IV: Probability and Uncertainty

Boxes with Apples and Oranges
Bayes' Theorem
Bayes' Probabilities
Probability Distributions
Gaussian Distribution over a Vector
Decision Theory
Model Selection - Key Ideas

3. Simple Experiment

1 Choose a box.
  Red box: p(B = r) = 4/10
  Blue box: p(B = b) = 6/10
2 Choose any item of the selected box with equal probability.

Given that we have chosen an orange, what is the probability that the box we chose was the blue one?

4. What do we know?

p(F = o | B = b) = 1/4
p(F = a | B = b) = 3/4
p(F = o | B = r) = 3/4
p(F = a | B = r) = 1/4

Given that we have chosen an orange, what is the probability that the box we chose was the blue one?

p(B = b | F = o)?

5. Calculating the Posterior p(B = b | F = o)

Bayes' theorem:

p(B = b | F = o) = \frac{p(F = o | B = b) \, p(B = b)}{p(F = o)}

Sum rule for the denominator:

p(F = o) = p(F = o, B = b) + p(F = o, B = r)
         = p(F = o | B = b) p(B = b) + p(F = o | B = r) p(B = r)
         = \frac{1}{4} \times \frac{6}{10} + \frac{3}{4} \times \frac{4}{10} = \frac{9}{20}

Therefore

p(B = b | F = o) = \frac{1}{4} \times \frac{6}{10} \times \frac{20}{9} = \frac{1}{3}

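The same calculation can be spelled out in a few lines of code. This is a minimal sketch of the box-and-fruit posterior; the dictionary names are purely illustrative, the numbers are those given on the slide.

```python
# Sketch of the posterior computation for the box-and-fruit example.
prior = {"r": 4 / 10, "b": 6 / 10}       # p(B)
lik_orange = {"r": 3 / 4, "b": 1 / 4}    # p(F = o | B)

# Sum rule: p(F = o) = sum over boxes of p(F = o | B) p(B)
evidence = sum(lik_orange[box] * prior[box] for box in prior)

# Bayes' theorem: p(B = b | F = o)
posterior_blue = lik_orange["b"] * prior["b"] / evidence

print(evidence)        # 0.45  (= 9/20)
print(posterior_blue)  # 0.333... (= 1/3)
```
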
6. Bayes' Theorem - revisited

Before choosing an item from a box, the most complete information is in p(B), the prior.
Note that in our example p(B = b) = 6/10. Therefore, choosing the box from the prior alone, we would opt for the blue box.
Once we observe some data (e.g. choose an orange), we can calculate p(B = b | F = o), the posterior probability, via Bayes' theorem.
After observing an orange, the posterior probability is p(B = b | F = o) = 1/3 and therefore p(B = r | F = o) = 2/3.
Having observed an orange, it is now more likely that the orange came from the red box.

7. Bayes' Rule

p(Y | X) = \frac{p(X | Y) \, p(Y)}{p(X)} = \frac{p(X | Y) \, p(Y)}{\sum_{Y} p(X | Y) \, p(Y)}

posterior = likelihood × prior / normalisation

8. Bayes' Probabilities

classical or frequentist interpretation of probabilities
Bayesian view: probabilities represent uncertainty
Example: Will the Arctic ice cap have disappeared by the end of the century?
fresh evidence can change the opinion on ice loss
goal: quantify uncertainty and revise it in the light of new evidence
use the Bayesian interpretation of probability

9. Andrey Kolmogorov - Axiomatic Probability Theory (1933)

Let (Ω, F, P) be a measure space with P(Ω) = 1.
Then (Ω, F, P) is a probability space, with sample space Ω, event space F and probability measure P.

1. Axiom: P(E) ≥ 0 for all E ∈ F.
2. Axiom: P(Ω) = 1.
3. Axiom: P(E_1 ∪ E_2 ∪ ...) = \sum_i P(E_i) for any countable sequence of pairwise disjoint events E_1, E_2, ....

10. Richard Threlkeld Cox (1946, 1961)

Assume numerical values are used to represent degrees of belief.
Define a set of axioms encoding common-sense properties of such beliefs.
This results in a set of rules for manipulating degrees of belief which are equivalent to the sum and product rules of probability.
Many other authors have proposed different sets of axioms and properties.
Result: the numerical quantities all behave according to the rules of probability.
These quantities are denoted Bayesian probabilities.

11. Curve fitting - revisited

uncertainty about the parameter w is captured in the prior probability p(w)
observed data D = {t_1, ..., t_N}
calculate the uncertainty in w after the data D have been observed:

p(w | D) = \frac{p(D | w) \, p(w)}{p(D)}

p(D | w) viewed as a function of w is the likelihood function
the likelihood expresses how probable the observed data are for different values of w
it is not a probability distribution over w

12. Likelihood Function - Frequentist versus Bayesian

Likelihood function p(D | w)

Frequentist Approach:
  w is considered a fixed parameter
  its value is defined by some 'estimator'
  error bars on the estimated w are obtained from the distribution of possible data sets D

Bayesian Approach:
  there is only one single data set D (the one observed)
  uncertainty in the parameters comes from a probability distribution over w

13. Frequentist Estimator - Maximum Likelihood

choose w for which the likelihood p(D | w) is maximal
equivalently, choose w for which the probability of the observed data is maximal
Machine Learning: the error function is the negative log of the likelihood function
log is a monotonic function
maximising the likelihood ⟺ minimising the error
Example: A fair-looking coin is tossed three times and always lands on heads.
The maximum likelihood estimate of the probability of landing heads gives 1.

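A small sketch of the coin example, assuming the Bernoulli likelihood introduced on a later slide; the variable names are illustrative.

```python
import numpy as np

# Three tosses, all heads (1 = heads, 0 = tails).
data = np.array([1, 1, 1])

# For Bern(x | mu), the log likelihood is sum_n [x_n log mu + (1 - x_n) log(1 - mu)],
# which is maximised by the sample mean.
mu_ml = data.mean()
print(mu_ml)   # 1.0 -- maximum likelihood claims heads is certain

# Minimising the error (negative log likelihood) over a grid gives the same answer.
mu_grid = np.linspace(1e-3, 1 - 1e-3, 999)
neg_log_lik = -np.sum(data[:, None] * np.log(mu_grid)
                      + (1 - data[:, None]) * np.log(1 - mu_grid), axis=0)
print(mu_grid[np.argmin(neg_log_lik)])   # close to 1
```
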
14. Bayesian Approach

including prior knowledge is easy (via the prior over w)
BUT: if the prior is badly chosen, it can lead to bad results
the choice of prior is subjective
sometimes the choice of prior is motivated by a convenient mathematical form
need to sum/integrate over the whole parameter space
advances in sampling (Markov Chain Monte Carlo methods)
advances in approximation schemes (Variational Bayes, Expectation Propagation)

15. The Gaussian Distribution

x ∈ R
Gaussian distribution with mean µ and variance σ²:

N(x | µ, σ²) = \frac{1}{(2πσ²)^{1/2}} \exp\left\{ -\frac{1}{2σ²} (x - µ)² \right\}

[Figure: the density N(x | µ, σ²), centred at µ with width 2σ.]

16. The Gaussian Distribution

N(x | µ, σ²) > 0

\int_{-∞}^{∞} N(x | µ, σ²) dx = 1

Expectation of x:

E[x] = \int_{-∞}^{∞} N(x | µ, σ²) \, x \, dx = µ

Expectation of x²:

E[x²] = \int_{-∞}^{∞} N(x | µ, σ²) \, x² \, dx = µ² + σ²

Variance of x:

var[x] = E[x²] - E[x]² = σ²

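These moments can be checked numerically. A minimal sketch, with illustrative values for µ and σ:

```python
import numpy as np

mu, sigma = 1.5, 0.8   # illustrative parameters

def gauss(x, mu, sigma):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
p = gauss(x, mu, sigma)

print(np.trapz(p, x))           # ~1               (normalisation)
print(np.trapz(p * x, x))       # ~mu              (E[x])
print(np.trapz(p * x**2, x))    # ~mu^2 + sigma^2  (E[x^2])
print(np.trapz(p * x**2, x) - np.trapz(p * x, x) ** 2)   # ~sigma^2 (var[x])
```
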
17. The Mode of a Probability Distribution

Mode of a distribution: the value that occurs most frequently in a probability distribution.
For a probability density function: the value x at which the probability density attains its maximum.
The Gaussian distribution has one mode (it is unimodal); the mode is µ.
If there are multiple local maxima in the probability distribution, the distribution is called multimodal (example: a mixture of three Gaussians).

[Figures: the unimodal Gaussian N(x | µ, σ²) with mode µ, and a multimodal density p(x).]

18. The Bernoulli Distribution

Two possible outcomes x ∈ {0, 1} (e.g. a coin, which may be damaged).

p(x = 1 | µ) = µ for 0 ≤ µ ≤ 1
p(x = 0 | µ) = 1 − µ

Bernoulli distribution:

Bern(x | µ) = µ^x (1 − µ)^{1−x}

Expectation of x: E[x] = µ
Variance of x: var[x] = µ(1 − µ)

19. The Binomial Distribution

Flip a coin N times. What is the distribution of observing heads exactly m times?
This is a distribution over m ∈ {0, ..., N}.

Binomial distribution:

Bin(m | N, µ) = \binom{N}{m} µ^m (1 − µ)^{N−m}

[Figure: histogram of Bin(m | N = 10, µ = 0.25) over m = 0, ..., 10.]

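A minimal sketch of the binomial pmf, evaluated for the setting of the figure (N = 10, µ = 0.25); the Bernoulli distribution is the special case N = 1. The function name is illustrative.

```python
import numpy as np
from math import comb

def binomial_pmf(m, N, mu):
    """Bin(m | N, mu) = C(N, m) mu^m (1 - mu)^(N - m)."""
    return comb(N, m) * mu**m * (1 - mu) ** (N - m)

N, mu = 10, 0.25
pmf = np.array([binomial_pmf(m, N, mu) for m in range(N + 1)])

print(pmf.round(3))                      # peaks around m = 2
print(pmf.sum())                         # 1.0 -- it is a distribution over m
print((np.arange(N + 1) * pmf).sum())    # mean N * mu = 2.5
```
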
20. The Beta Distribution

Beta distribution:

Beta(µ | a, b) = \frac{Γ(a + b)}{Γ(a) Γ(b)} µ^{a−1} (1 − µ)^{b−1}

[Figure: Beta(µ | a, b) for the four parameter settings (a, b) = (0.1, 0.1), (1, 1), (2, 3) and (8, 4).]

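A small sketch of the beta density, written directly with the gamma function; the parameter pairs are those of the figure panels, and the grid-based mean check is only an assumption-laden approximation.

```python
import numpy as np
from math import gamma

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) = Gamma(a+b) / (Gamma(a) Gamma(b)) * mu^(a-1) (1-mu)^(b-1)."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu ** (a - 1) * (1 - mu) ** (b - 1)

mu = np.linspace(0.001, 0.999, 999)
for a, b in [(1, 1), (2, 3), (8, 4)]:        # three of the four panels in the figure
    density = beta_pdf(mu, a, b)
    mean = np.trapz(mu * density, mu)
    print(a, b, round(mean, 3))              # close to a / (a + b)

# The (a, b) = (0.1, 0.1) panel has integrable singularities at mu = 0 and mu = 1,
# so a naive uniform grid like the one above is too coarse near the endpoints.
```
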
21. The Gaussian Distribution over a Vector x

x ∈ R^D
Gaussian distribution with mean µ ∈ R^D and covariance matrix Σ ∈ R^{D×D}:

N(x | µ, Σ) = \frac{1}{(2π)^{D/2} |Σ|^{1/2}} \exp\left\{ -\frac{1}{2} (x − µ)^T Σ^{−1} (x − µ) \right\}

where |Σ| is the determinant of Σ.

[Figure: contours of a two-dimensional Gaussian centred at µ, with principal axes along the eigenvectors u_1, u_2 and lengths proportional to λ_1^{1/2}, λ_2^{1/2}.]

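A minimal sketch of evaluating this density at a point; µ and Σ below are illustrative values, not from the slides.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) for a single point x in R^D."""
    D = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])

print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))
```
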
22. The Gaussian Distribution over a vector x

We can find a linear transformation to a new coordinate system y in which x becomes

y = U^T (x − µ),

where U is the eigenvector matrix of the covariance matrix Σ with eigenvalue matrix E:

Σ U = U E,  E = \begin{pmatrix} λ_1 & 0 & \dots & 0 \\ 0 & λ_2 & \dots & 0 \\ \dots & \dots & \dots & \dots \\ 0 & 0 & \dots & λ_D \end{pmatrix}

U can be made an orthogonal matrix, so the columns u_i of U are unit vectors which are orthogonal to each other: u_i^T u_j = 1 if i = j, and 0 if i ≠ j.

Now we can write Σ and its inverse (prove that Σ Σ^{−1} = I):

Σ = U E U^T = \sum_{i=1}^{D} λ_i u_i u_i^T,    Σ^{−1} = U E^{−1} U^T = \sum_{i=1}^{D} \frac{1}{λ_i} u_i u_i^T

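These identities are easy to check numerically; a minimal sketch with an illustrative covariance matrix:

```python
import numpy as np

Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])   # illustrative covariance matrix

# np.linalg.eigh returns the eigenvalues lambda_i and an orthogonal eigenvector matrix U.
eigvals, U = np.linalg.eigh(Sigma)
E = np.diag(eigvals)

print(np.allclose(Sigma @ U, U @ E))        # Sigma U = U E
print(np.allclose(U.T @ U, np.eye(2)))      # columns of U are orthonormal
print(np.allclose(Sigma, U @ E @ U.T))      # Sigma = U E U^T
print(np.allclose(np.linalg.inv(Sigma),
                  U @ np.diag(1 / eigvals) @ U.T))   # Sigma^{-1} = U E^{-1} U^T
```
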
23. The Gaussian Distribution over a vector x

Now use the linear transformation y = U^T (x − µ) and Σ^{−1} = U E^{−1} U^T to transform the exponent (x − µ)^T Σ^{−1} (x − µ) into

y^T E^{−1} y = \sum_{j=1}^{D} \frac{y_j²}{λ_j}

Exponentiating the sum (and taking care of the normalisation factors) results in a product of scalar-valued Gaussian distributions along the orthogonal directions u_j:

p(y) = \prod_{j=1}^{D} \frac{1}{(2πλ_j)^{1/2}} \exp\left\{ -\frac{y_j²}{2λ_j} \right\}

[Figure: the two-dimensional Gaussian contour again, with the rotated coordinates y_1, y_2 along the eigenvector directions u_1, u_2 and axis lengths λ_1^{1/2}, λ_2^{1/2}.]

24. Partitioned Gaussians

Given a joint Gaussian distribution N(x | µ, Σ), where µ is the mean vector, Σ the covariance matrix, and Λ ≡ Σ^{−1} the precision matrix.
Assume that the variables can be partitioned into two sets:

x = \begin{pmatrix} x_a \\ x_b \end{pmatrix},  µ = \begin{pmatrix} µ_a \\ µ_b \end{pmatrix},  Σ = \begin{pmatrix} Σ_{aa} & Σ_{ab} \\ Σ_{ba} & Σ_{bb} \end{pmatrix},  Λ = \begin{pmatrix} Λ_{aa} & Λ_{ab} \\ Λ_{ba} & Λ_{bb} \end{pmatrix}

Λ_{aa} = (Σ_{aa} − Σ_{ab} Σ_{bb}^{−1} Σ_{ba})^{−1}
Λ_{ab} = −(Σ_{aa} − Σ_{ab} Σ_{bb}^{−1} Σ_{ba})^{−1} Σ_{ab} Σ_{bb}^{−1}

Conditional distribution:

p(x_a | x_b) = N(x_a | µ_{a|b}, Λ_{aa}^{−1})
µ_{a|b} = µ_a − Λ_{aa}^{−1} Λ_{ab} (x_b − µ_b)

Marginal distribution:

p(x_a) = N(x_a | µ_a, Σ_{aa})

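A minimal sketch of the conditional p(x_a | x_b) for a two-dimensional Gaussian, where x_a and x_b are each one-dimensional; the numbers are illustrative (chosen only to be a valid covariance matrix), and x_b = 0.7 matches the conditioning value used in the figure on the next slide.

```python
import numpy as np

mu = np.array([0.5, 0.5])
Sigma = np.array([[0.04, 0.03],
                  [0.03, 0.09]])

Lambda = np.linalg.inv(Sigma)            # precision matrix
L_aa, L_ab = Lambda[0, 0], Lambda[0, 1]

x_b = 0.7
mu_a_given_b = mu[0] - (1 / L_aa) * L_ab * (x_b - mu[1])   # conditional mean
var_a_given_b = 1 / L_aa                                   # conditional variance

print(mu_a_given_b, var_a_given_b)

# Equivalent form via the Schur complement, as a cross-check:
print(mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x_b - mu[1]),
      Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1])
```
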
25. Partitioned Gaussians

[Figure: contours of a Gaussian distribution over two variables x_a and x_b (left), and the marginal distribution p(x_a) and conditional distribution p(x_a | x_b) for x_b = 0.7 (right).]

26. Nonlinear Change of Variables in Distributions

Given some density p_x(x).
Consider a nonlinear change of variables x = g(y).
What is the new probability distribution p_y(y) in terms of the variable y?

p_y(y) = p_x(x) \left| \frac{dx}{dy} \right| = p_x(g(y)) \, |g'(y)|

For vector-valued x and y:

p_y(y) = p_x(x) \, |J|,  where J_{ij} = \frac{∂x_i}{∂y_j}.

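A numerical check of the scalar formula; the choices p_x = N(0, 1) and x = g(y) = y³ + y below are illustrative assumptions (g is a monotone bijection of the real line, so only |g'(y)| is needed).

```python
import numpy as np

def p_x(x):
    """Density of x: standard normal (illustrative choice)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def g(y):
    return y**3 + y          # monotone bijection on R

def g_prime(y):
    return 3 * y**2 + 1

y = np.linspace(-3, 3, 100_001)
p_y = p_x(g(y)) * np.abs(g_prime(y))    # p_y(y) = p_x(g(y)) |g'(y)|

print(np.trapz(p_y, y))                 # ~1: the transformed density is properly normalised
```
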
27. Decision Theory - Key Ideas

Two classes C_1 and C_2
joint distribution p(x, C_k)
using Bayes' theorem:

p(C_k | x) = \frac{p(x | C_k) \, p(C_k)}{p(x)}

Example: cancer treatment (k = 2)
data x: an X-ray image
C_1: patient has cancer (C_2: patient has no cancer)
p(C_1) is the prior probability of a person having cancer
p(C_1 | x) is the posterior probability of a person having cancer after having seen the X-ray data

28. Decision Theory - Key Ideas

Need a rule which assigns each value of the input x to one of the available classes.
The input space is partitioned into decision regions R_k.
This leads to decision boundaries or decision surfaces.
Probability of a mistake:

p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
           = \int_{R_1} p(x, C_2) \, dx + \int_{R_2} p(x, C_1) \, dx

[Figure: an input space partitioned into decision regions R_1 and R_2.]

29. Decision Theory - Key Ideas

Probability of a mistake:

p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
           = \int_{R_1} p(x, C_2) \, dx + \int_{R_2} p(x, C_1) \, dx

goal: minimise p(mistake)

[Figure: the joint densities p(x, C_1) and p(x, C_2) over x, with a decision boundary at x_0 dividing the axis into the regions R_1 and R_2.]

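A minimal sketch of this minimisation in one dimension, using illustrative Gaussian class-conditional densities and a simple threshold rule (x < x_0 → decide C_1, x ≥ x_0 → decide C_2); all numbers are assumptions for the example.

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

prior = {1: 0.5, 2: 0.5}
x = np.linspace(-10, 10, 200_001)
joint1 = prior[1] * gauss(x, -1.0, 1.0)    # p(x, C1), illustrative
joint2 = prior[2] * gauss(x, 2.0, 1.5)     # p(x, C2), illustrative

def p_mistake(x0):
    """Integral over R1 (x < x0) of p(x, C2) plus integral over R2 (x >= x0) of p(x, C1)."""
    R1, R2 = x < x0, x >= x0
    return np.trapz(joint2[R1], x[R1]) + np.trapz(joint1[R2], x[R2])

thresholds = np.linspace(-2, 3, 501)
errors = [p_mistake(t) for t in thresholds]
best = thresholds[int(np.argmin(errors))]
print(best, min(errors))   # the best threshold sits where p(x, C1) = p(x, C2)
```
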
30. Decision Theory - Key Ideas

multiple classes
instead of minimising the probability of mistakes, maximise the probability of correct classification:

p(correct) = \sum_{k=1}^{K} p(x ∈ R_k, C_k) = \sum_{k=1}^{K} \int_{R_k} p(x, C_k) \, dx

31. Minimising the Expected Loss

Not all mistakes are equally costly.
Weight each misclassification of x to the wrong class C_j, instead of assigning it to the correct class C_k, by a factor L_{kj}.
The expected loss is now

E[L] = \sum_k \sum_j \int_{R_j} L_{kj} \, p(x, C_k) \, dx

Goal: minimise the expected loss E[L].

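A minimal sketch of the resulting decision rule: for each x, assign the class j that minimises \sum_k L_{kj} p(C_k | x). The loss matrix below is an assumption in the spirit of the cancer example, where missing a cancer is taken to be 100 times worse than a false alarm.

```python
import numpy as np

L = np.array([[0.0, 100.0],    # true class C1 (cancer):  loss of deciding j = 0, 1
              [1.0,   0.0]])   # true class C2 (healthy): loss of deciding j = 0, 1

def decide(posteriors):
    """posteriors: array of p(C_k | x) for k = 0, 1. Returns the class index j minimising the expected loss."""
    expected_loss_per_class = L.T @ posteriors   # entry j: sum_k L[k, j] p(C_k | x)
    return int(np.argmin(expected_loss_per_class))

print(decide(np.array([0.02, 0.98])))    # even a 2% cancer posterior triggers class 0 (cancer)
print(decide(np.array([0.001, 0.999])))  # a very confident 'healthy' posterior gives class 1
```
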
32. The Reject Region

Avoid making automated decisions on difficult cases.
Difficult cases:
  the posterior probabilities p(C_k | x) are very small
  the joint distributions p(x, C_k) have comparable values

[Figure: the posteriors p(C_1 | x) and p(C_2 | x) over x; inputs for which the largest posterior falls below a threshold θ form the reject region.]

33. S-fold Cross Validation

Given a set of N data items and targets.
Goal: Find the best model (type of model, number of parameters such as the order p of the polynomial, or the regularisation constant λ). Avoid overfitting.
Solution: Train a machine learning algorithm with some of the data, evaluate it with the rest.
If we have plenty of data:
1 Train a range of models, or a model with a range of parameters.
2 Compare the performance on an independent data set (the validation set) and choose the model with the best predictive performance.
3 Still, overfitting to the validation set can occur. Therefore, use a third test set for the final evaluation. (Keep the test set in a safe and never give it to the developers ;–)

34. S-fold Cross Validation

With few data there is a dilemma: few training data or few test data.
The solution is cross-validation.
Use a portion of (S − 1)/S of the available data for training, but use all the data to assess the performance.
For very scarce data one may use S = N, which is also called the leave-one-out technique.

35. S-fold Cross Validation

Partition the data into S groups.
Use S − 1 groups to train a set of models that are then evaluated on the remaining group.
Repeat for all S choices of the held-out group, and average the performance scores from the S runs.

[Figure: the four runs for S = 4, each holding out a different quarter of the data.]

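A minimal sketch of the procedure; train_model and evaluate are placeholder functions (not from the lecture) standing in for whatever model and performance measure are being compared.

```python
import numpy as np

def s_fold_cross_validation(X, t, S, train_model, evaluate):
    """Return the performance score averaged over the S held-out groups."""
    N = len(X)
    indices = np.random.permutation(N)
    folds = np.array_split(indices, S)            # S groups of (roughly) equal size

    scores = []
    for s in range(S):
        held_out = folds[s]
        train = np.concatenate([folds[j] for j in range(S) if j != s])
        model = train_model(X[train], t[train])   # train on the other S - 1 groups
        scores.append(evaluate(model, X[held_out], t[held_out]))
    return np.mean(scores)

# Choosing S = N corresponds to the leave-one-out technique from the previous slide.
```
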