Introduction to advanced Monte Carlo methods

An introduction to advanced (?) MCMC methods


Christian P. Robert

Université Paris-Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian

Royal Statistical Society, October 13, 2010

Motivating example

1 Motivating example

2 The Metropolis-Hastings Algorithm

Latent structures make life harder!

Even simple models may lead to computational complications,
as in latent variable models

$$f(x|\theta) = \int f^\star(x, x^\star|\theta)\,\mathrm{d}x^\star$$

If $(x, x^\star)$ observed, fine!
If only $x$ observed, trouble!

Example (Mixture models)
Models of mixtures of distributions:

$$X \sim f_j \quad \text{with probability } p_j,$$

for $j = 1, 2, \ldots, k$, with overall density

$$X \sim p_1 f_1(x) + \cdots + p_k f_k(x).$$

For a sample of independent random variables $(X_1, \ldots, X_n)$, the
sample density is

$$\prod_{i=1}^n \{p_1 f_1(x_i) + \cdots + p_k f_k(x_i)\}.$$

Expanding this product involves $k^n$ elementary terms: prohibitive
to compute in large samples.
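The product itself can still be evaluated factor by factor at a cost of order $nk$; only its expansion into $k^n$ terms is out of reach. A minimal Python sketch of that evaluation follows, assuming Gaussian components with a common known standard deviation. The function name `mixture_loglik`, the simulated data and the seed are illustrative choices, not taken from the slides.

```python
import math
import random


def mixture_loglik(data, weights, means, sd=1.0):
    """Log-likelihood of a Gaussian mixture, evaluated in O(n * k) operations."""
    total = 0.0
    for x in data:
        # Density of one observation: sum_j p_j N(x | mu_j, sd^2).
        dens = sum(
            p * math.exp(-0.5 * ((x - m) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
            for p, m in zip(weights, means)
        )
        total += math.log(dens)
    return total


# Hypothetical data set: 200 draws from 0.3 N(0, 1) + 0.7 N(2.5, 1).
rng = random.Random(0)
data = [
    rng.gauss(0.0, 1.0) if rng.random() < 0.3 else rng.gauss(2.5, 1.0)
    for _ in range(200)
]

# Compare the generating means with the configuration where they swap roles.
ll_true = mixture_loglik(data, (0.3, 0.7), (0.0, 2.5))
ll_swapped = mixture_loglik(data, (0.3, 0.7), (2.5, 0.0))
print(ll_true, ll_swapped)
```

Evaluating `mixture_loglik` over a grid of $(\mu_1, \mu_2)$ values is one way to draw a log-likelihood surface for a two-component mixture.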

[Figure: log-likelihood surface of the mixture $0.3\,N(\mu_1, 1) + 0.7\,N(\mu_2, 1)$ in the $(\mu_1, \mu_2)$ plane, both axes ranging from −1 to 3.]

A typology of Bayes computational problems
(i) use of a complex parameter space, as for instance in
constrained parameter sets like those resulting from imposing
stationarity constraints in dynamic models;
(ii) use of a complex sampling model with an intractable
likelihood, as for instance in missing data and graphical
models;
(iii) use of a huge dataset;
(iv) use of a complex prior distribution (which may be the
posterior distribution associated with an earlier sample);
(v) use of a complex inferential procedure as for instance, Bayes
factors

$$B^\pi_{01}(x) = \frac{P(\theta \in \Theta_0 \mid x)}{P(\theta \in \Theta_1 \mid x)} \bigg/ \frac{\pi(\theta \in \Theta_0)}{\pi(\theta \in \Theta_1)}.$$

The Metropolis-Hastings Algorithm


1 Motivating example

2 The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
The Metropolis–Hastings algorithm
A collection of Metropolis-Hastings algorithms
Extensions
Convergence assessment


Running Monte Carlo via Markov Chains

Fact: It is not necessary to use a sample from the distribution $f$ to
approximate the integral

$$I = \int h(x) f(x)\,\mathrm{d}x.$$

We can obtain $X_1, \ldots, X_n \sim f$ (approx) without directly
simulating from $f$, using an ergodic Markov chain with
stationary distribution $f$.


Running Monte Carlo via Markov Chains (2)

Idea
For an arbitrary starting value $x^{(0)}$, an ergodic chain $(X^{(t)})$ is
generated using a transition kernel with stationary distribution $f$.

This ensures the convergence in distribution of $(X^{(t)})$ to a random
variable from $f$.
For a "large enough" $T_0$, $X^{(T_0)}$ can be considered as
distributed from $f$.
This produces a dependent sample $X^{(T_0)}, X^{(T_0+1)}, \ldots$, which is
generated from $f$, sufficient for most approximation purposes.



Problem:
How can one build a Markov chain with a given stationary
distribution?

MH basics
Algorithm that converges to the objective (target) density $f$
using an arbitrary transition kernel density $q(x, y)$,
called the instrumental (or proposal) distribution.


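The MH algorithm

Algorithm (Metropolis–Hastings)
Given $x^{(t)}$,
1 Generate $Y_t \sim q(x^{(t)}, y)$.
2 Take

$$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \rho(x^{(t)}, Y_t), \\ x^{(t)} & \text{with prob. } 1 - \rho(x^{(t)}, Y_t), \end{cases}$$

where

$$\rho(x, y) = \min\left\{ \frac{f(y)}{f(x)}\,\frac{q(y, x)}{q(x, y)},\ 1 \right\}.$$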
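The two steps of the algorithm can be sketched in a few lines of Python. This is a minimal illustration and not the author's code: the function `metropolis_hastings`, the Exp(1) target, the multiplicative log-normal proposal and its scale `S` are all assumptions chosen so that the proposal is not symmetric and the ratio $q(y, x)/q(x, y)$ actually matters.

```python
import math
import random


def metropolis_hastings(log_f, propose, log_q, x0, n_iter, rng):
    """Generic MH sampler.

    log_f(x): log of the target density, up to an additive constant.
    propose(x, rng): draws y from q(x, .).
    log_q(x, y): log density of proposing y from x, up to a constant.
    """
    chain = [x0]
    x = x0
    n_accept = 0
    for _ in range(n_iter):
        y = propose(x, rng)
        # log of f(y) q(y, x) / (f(x) q(x, y))
        log_ratio = log_f(y) + log_q(y, x) - log_f(x) - log_q(x, y)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < log_ratio:
            x = y
            n_accept += 1
        chain.append(x)
    return chain, n_accept / n_iter


# Assumed illustration: Exp(1) target with a multiplicative log-normal proposal.
S = 1.0  # hypothetical proposal scale on the log axis


def log_f(x):
    return -x if x > 0 else -math.inf


def propose(x, rng):
    return x * math.exp(S * rng.gauss(0.0, 1.0))


def log_q(x, y):
    # Log-normal density of y given x, dropping constants free of x and y.
    return -math.log(y) - (math.log(y) - math.log(x)) ** 2 / (2.0 * S * S)


rng = random.Random(1)
chain, rate = metropolis_hastings(log_f, propose, log_q, 1.0, 20000, rng)
mean = sum(chain) / len(chain)
print(mean, rate)
```

Only differences of `log_f` and of `log_q` enter the ratio, which is why both can be specified up to additive constants.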


Features

Independent of normalizing constants for both $f$ and $q(x, \cdot)$
(i.e., those constants independent of $x$)
Never moves to values with $f(y) = 0$
The chain $(x^{(t)})_t$ may take the same value several times in a
row, even though $f$ is a density wrt Lebesgue measure
The sequence $(y_t)_t$ is usually not a Markov chain
Satisfies the detailed balance condition

$$f(x) K(x, y) = f(y) K(y, x)$$

[Figure: transitions $P(\theta \to \theta')$ and $P(\theta' \to \theta)$ between two states $\theta$ and $\theta'$.]

[Green, 1995]


Convergence properties

1 The M-H Markov chain is reversible, with invariant/stationary
density $f$.
2 As $f$ is a probability measure, the chain is positive recurrent.
3 If

$$\Pr\left[ \frac{f(Y_t)\, q(Y_t, X^{(t)})}{f(X^{(t)})\, q(X^{(t)}, Y_t)} \ge 1 \right] < 1, \qquad (1)$$

i.e., if the event $\{X^{(t+1)} = X^{(t)}\}$ occurs with positive
probability, then the chain is aperiodic.


Convergence properties (2)
4 If

$$q(x, y) > 0 \ \text{ for every } (x, y), \qquad (2)$$

the chain is irreducible.
5 For M-H, $f$-irreducibility implies Harris recurrence.
6 Thus, under conditions (1) and (2)
(i) For $h$, with $E_f|h(X)| < \infty$,

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} h(X^{(t)}) = \int h(x)\,\mathrm{d}f(x) \quad \text{a.e. } f.$$

(ii) and

$$\lim_{n \to \infty} \left\| \int K^n(x, \cdot)\,\mu(\mathrm{d}x) - f \right\|_{TV} = 0$$

for every initial distribution $\mu$, where $K^n(x, \cdot)$ denotes the
kernel for $n$ transitions.


The Independent Case

The instrumental distribution $q(x, \cdot)$ is independent of $x$ and is
denoted $g$.

Algorithm (Independent Metropolis-Hastings)
Given $x^{(t)}$,
1 Generate $Y_t \sim g(y)$
2 Take

$$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \min\left\{ \dfrac{f(Y_t)\, g(x^{(t)})}{f(x^{(t)})\, g(Y_t)},\ 1 \right\}, \\ x^{(t)} & \text{otherwise.} \end{cases}$$



Properties
The resulting sample is not iid but there exist strong convergence
properties:
Theorem (Ergodicity)
The algorithm produces a uniformly ergodic chain if there exists a
constant $M$ such that

$$f(x) \le M g(x), \qquad x \in \operatorname{supp} f.$$

In this case,

$$\left\| K^n(x, \cdot) - f \right\|_{TV} \le \left( 1 - \frac{1}{M} \right)^n.$$

[Mengersen & Tweedie, 1996]
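As a hedged illustration of the independent case and of the constant $M$, the sketch below targets a standard normal density with a standard Cauchy proposal. This pair is an assumption made for the example, not one used in the slides. For it, $f/g$ is maximal at $x = \pm 1$, so that $M = \sqrt{2\pi/e} \approx 1.52$.

```python
import math
import random


def log_f(x):
    """Log density of the assumed target N(0, 1)."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def log_g(x):
    """Log density of the assumed proposal, a standard Cauchy."""
    return -math.log(math.pi * (1.0 + x * x))


def independent_mh(n_iter, x0, rng):
    """Independent Metropolis-Hastings: proposals do not depend on x."""
    x, chain = x0, []
    for _ in range(n_iter):
        # Cauchy draw by inversion of the cdf.
        y = math.tan(math.pi * (rng.random() - 0.5))
        # log of f(y) g(x) / (f(x) g(y))
        log_ratio = (log_f(y) - log_g(y)) - (log_f(x) - log_g(x))
        if math.log(1.0 - rng.random()) < log_ratio:
            x = y
        chain.append(x)
    return chain


# sup f/g is attained at x = 1 (and x = -1) for this pair of densities.
M = math.exp(log_f(1.0) - log_g(1.0))
tv_bound_10 = (1.0 - 1.0 / M) ** 10  # bound after n = 10 transitions

rng = random.Random(2)
chain = independent_mh(20000, 0.0, rng)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
print(M, tv_bound_10, mean, var)
```

Swapping the roles (Cauchy target, normal proposal) would make $f/g$ unbounded, in which case the theorem no longer applies.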


Example (Noisy AR(1))
Hidden Markov chain from a regular AR(1) model,

$$x_{t+1} = \varphi x_t + \epsilon_{t+1}, \qquad \epsilon_t \sim N(0, \tau^2),$$

and observables

$$y_t | x_t \sim N(x_t^2, \sigma^2).$$

The distribution of $x_t$ given $x_{t-1}$, $x_{t+1}$ and $y_t$ is proportional to

$$\exp\left\{ \frac{-1}{2\tau^2} \left[ (x_t - \varphi x_{t-1})^2 + (x_{t+1} - \varphi x_t)^2 + \frac{\tau^2}{\sigma^2} (y_t - x_t^2)^2 \right] \right\}.$$


Example (Noisy AR(1) too)
Use for proposal the $N(\mu_t, \omega_t^2)$ distribution, with

$$\mu_t = \varphi\,\frac{x_{t-1} + x_{t+1}}{1 + \varphi^2} \quad \text{and} \quad \omega_t^2 = \frac{\tau^2}{1 + \varphi^2}.$$

Ratio

$$\pi(x)/q_{\mathrm{ind}}(x) = \exp\left\{ -(y_t - x_t^2)^2 / 2\sigma^2 \right\}$$

is bounded.

[Figure: (top) last 500 realisations of the chain $\{X_k\}_k$ out of 10,000
iterations; (bottom) histogram of the chain, compared with
the target distribution.]


Random walk Metropolis–Hastings

Instead, use a local perturbation as proposal

Yt = X (t) + εt ,

where εt ∼ g, independent of X (t) .
The instrumental density is now of the form g(y − x) and the
Markov chain is a random walk if g is symmetric

g(x) = g(−x)


Algorithm (Random walk Metropolis)
Given $x^{(t)}$,
1 Generate $Y_t \sim g(y - x^{(t)})$
2 Take

$$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \min\left\{ 1,\ \dfrac{f(Y_t)}{f(x^{(t)})} \right\}, \\ x^{(t)} & \text{otherwise.} \end{cases}$$
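A minimal Python sketch of the random walk version follows, assuming a Gaussian perturbation $g$ and, purely for illustration, a standard normal target. The function name `random_walk_metropolis` and the scale 2.4 are hypothetical choices. Because $g$ is symmetric, the proposal densities cancel and only $f(Y_t)/f(x^{(t)})$ is needed.

```python
import math
import random


def random_walk_metropolis(log_f, x0, scale, n_iter, rng):
    """Random walk Metropolis with N(0, scale^2) increments."""
    x, log_fx = x0, log_f(x0)
    chain, n_accept = [x0], 0
    for _ in range(n_iter):
        y = x + scale * rng.gauss(0.0, 1.0)  # local perturbation of x
        log_fy = log_f(y)
        # Symmetric proposal: accept with probability min(1, f(y) / f(x)).
        if math.log(1.0 - rng.random()) < log_fy - log_fx:
            x, log_fx = y, log_fy
            n_accept += 1
        chain.append(x)
    return chain, n_accept / n_iter


def log_target(x):
    """Assumed target: N(0, 1), up to an additive constant."""
    return -0.5 * x * x


rng = random.Random(3)
chain, rate = random_walk_metropolis(log_target, 0.0, 2.4, 20000, rng)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
print(rate, mean, var)
```

The current value of `log_f` is cached so that each iteration costs a single new evaluation of the target.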


Probit illustration

Likelihood and posterior given by

$$\pi(\beta | y, X) \propto \ell(\beta | y, X) \propto \prod_{i=1}^{n} \Phi(x_i^T \beta)^{y_i} \left( 1 - \Phi(x_i^T \beta) \right)^{n_i - y_i}$$

under the flat prior.
A random walk proposal works well for a small number of
predictors. Use the maximum likelihood estimate $\hat\beta$ as starting
value and the asymptotic (Fisher) covariance matrix of the MLE, $\hat\Sigma$, as
scale.


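MCMC algorithm

Probit random-walk Metropolis-Hastings
Initialization: Set $\beta^{(0)} = \hat\beta$ and compute $\hat\Sigma$
Iteration $t$:
1 Generate $\tilde\beta \sim N_{k+1}(\beta^{(t-1)}, \tau \hat\Sigma)$
2 Compute

$$\rho(\beta^{(t-1)}, \tilde\beta) = \min\left\{ 1,\ \frac{\pi(\tilde\beta | y)}{\pi(\beta^{(t-1)} | y)} \right\}$$

3 With probability $\rho(\beta^{(t-1)}, \tilde\beta)$ set $\beta^{(t)} = \tilde\beta$;
otherwise set $\beta^{(t)} = \beta^{(t-1)}$.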


R bank benchmark
Probit modelling with no intercept over the four measurements.
Three different scales $\tau = 1, 0.1, 10$: best mixing behavior is
associated with $\tau = 1$.
Average of the parameters over 9,000 MCMC iterations gives plug-in
estimate

$$\hat p_i = \Phi\left( -1.2193 x_{i1} + 0.9540 x_{i2} + 0.9795 x_{i3} + 1.1481 x_{i4} \right).$$

[Figure: diagnostic plots of the chain for each of the four parameters.]
ˆ


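$$\pi(\theta | x) \propto \prod_{j=1}^{n} \left( \sum_{\ell=1}^{k} p_\ell f(x_j | \mu_\ell, \sigma_\ell) \right) \pi(\theta)$$

Metropolis-Hastings proposal:

$$\theta^{(t+1)} = \begin{cases} \theta^{(t)} + \omega \varepsilon^{(t)} & \text{if } u^{(t)} < \rho^{(t)} \\ \theta^{(t)} & \text{otherwise} \end{cases}$$

where

$$\rho^{(t)} = \frac{\pi(\theta^{(t)} + \omega \varepsilon^{(t)} | x)}{\pi(\theta^{(t)} | x)} \wedge 1$$

and $\omega$ scaled for good acceptance rate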
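The proposal above can be sketched for the two-mean mixture $.7 N(\mu_1, 1) + .3 N(\mu_2, 1)$ with $\theta = (\mu_1, \mu_2)$. Several ingredients below are assumptions made only for the illustration: the simulated data, the independent $N(0, 10^2)$ priors on the means (a proper prior is used so that the posterior is well defined), the starting point, and the scale $\omega = 0.3$.

```python
import math
import random

WEIGHTS = (0.7, 0.3)  # known weights of the two components
PRIOR_SD = 10.0       # assumption: independent N(0, 10^2) priors on the means


def log_posterior(mu, data):
    """Log of pi(theta | x) up to a constant, with theta = (mu_1, mu_2)."""
    lp = sum(-0.5 * (m / PRIOR_SD) ** 2 for m in mu)
    for x in data:
        dens = sum(p * math.exp(-0.5 * (x - m) ** 2) for p, m in zip(WEIGHTS, mu))
        if dens <= 0.0:  # guard against numerical underflow far from the data
            return -math.inf
        lp += math.log(dens)
    return lp


def random_walk_mixture(data, mu0, omega, n_iter, rng):
    """theta' = theta + omega * eps, accepted with probability rho."""
    theta = tuple(mu0)
    lp = log_posterior(theta, data)
    chain, n_accept = [theta], 0
    for _ in range(n_iter):
        prop = tuple(m + omega * rng.gauss(0.0, 1.0) for m in theta)
        lp_prop = log_posterior(prop, data)
        if math.log(1.0 - rng.random()) < lp_prop - lp:
            theta, lp = prop, lp_prop
            n_accept += 1
        chain.append(theta)
    return chain, n_accept / n_iter


# Hypothetical data: 100 draws from .7 N(0, 1) + .3 N(2.5, 1).
rng = random.Random(4)
data = [
    rng.gauss(0.0, 1.0) if rng.random() < 0.7 else rng.gauss(2.5, 1.0)
    for _ in range(100)
]
chain, rate = random_walk_mixture(data, (1.0, 1.0), 0.3, 2000, rng)
print(rate, chain[-1])
```

Rerunning with different values of `omega` and watching `rate` is the trial-and-error calibration that "scaled for good acceptance rate" refers to.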


Random walk MCMC output for $.7 N(\mu_1, 1) + .3 N(\mu_2, 1)$

[Figures: successive states of the random walk chain in the $(\mu_1, \mu_2)$ plane, both axes ranging from −1 to 4. With scale 1: iterations 1, 10, 100, 500 and 1000. With scale $\sqrt{.1}$: iterations 10, 100, 500, 1000, 10,000 and 5000.]



Uniform ergodicity prohibited by random walk structure
At best, geometric ergodicity:

Theorem (Sufficient ergodicity)
For a symmetric density $f$, log-concave in the tails, and a positive
and symmetric density $g$, the chain $(X^{(t)})$ is geometrically ergodic.
[Mengersen & Tweedie, 1996]

Example (Comparison of tail effects)
Random-walk Metropolis–Hastings algorithms
based on a $N(0, 1)$ instrumental
for the generation of (a) a $N(0, 1)$ distribution and (b) a
distribution with density $\psi(x) \propto (1 + |x|)^{-3}$

[Figure: panels (a) and (b), 90% confidence envelopes of
the means over 200 iterations, derived from 500
parallel independent chains.]

Extensions

There are many other families of MH algorithms
Adaptive Rejection Metropolis Sampling
Reversible Jump
Langevin algorithms
to name just a few...

Langevin Algorithms

Proposal based on the Langevin diffusion $L_t$, defined by the
stochastic differential equation

$$\mathrm{d}L_t = \mathrm{d}B_t + \frac{1}{2} \nabla \log f(L_t)\,\mathrm{d}t,$$

where $B_t$ is the standard Brownian motion
Theorem
The Langevin diffusion is the only non-explosive diffusion which is
reversible with respect to $f$.

Discretization

Because continuous time cannot be simulated, consider the
discretised sequence

$$x^{(t+1)} = x^{(t)} + \frac{\sigma^2}{2} \nabla \log f(x^{(t)}) + \sigma \varepsilon_t, \qquad \varepsilon_t \sim N_p(0, I_p),$$

where $\sigma^2$ corresponds to the discretisation step

[Figures: density plots for the example $f(x) = \exp(-x^4)$, over $-1.5 \le x \le 1.5$, for $\sigma^2 = .1$, $.01$, $.001$, $.0001$ and $.0001^*$.]
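The discretised Langevin sequence can be simulated directly for the example $f(x) = \exp(-x^4)$, for which $\nabla \log f(x) = -4x^3$. The Python sketch below is a minimal illustration: the function name `langevin_chain`, the run lengths, the starting values and the early-stop guard `max_abs` are assumptions added for the example. The guard only prevents floating-point overflow once a run has clearly diverged.

```python
import math
import random


def grad_log_f(x):
    """Gradient of log f for f(x) = exp(-x^4)."""
    return -4.0 * x ** 3


def langevin_chain(sigma2, n_iter, x0, rng, max_abs=1e6):
    """Discretised Langevin sequence without any accept/reject step."""
    sigma = math.sqrt(sigma2)
    x, chain = x0, [x0]
    for _ in range(n_iter):
        x = x + 0.5 * sigma2 * grad_log_f(x) + sigma * rng.gauss(0.0, 1.0)
        chain.append(x)
        if abs(x) > max_abs:  # assumed guard: stop once the run has escaped
            break
    return chain


rng = random.Random(5)

# Small step: the sequence stays in the bulk of f.
stable = langevin_chain(0.01, 50000, 0.0, rng)
second_moment = sum(x * x for x in stable) / len(stable)

# Larger step started in the tail: the drift overshoots more at each move.
escaped = langevin_chain(0.2, 100, 3.0, rng)
print(second_moment, len(escaped), escaped[-1])
```

For reference, $E_f[X^2] = \Gamma(3/4)/\Gamma(1/4) \approx 0.338$ under $f(x) \propto \exp(-x^4)$, which gives a rough check on the small-step run.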

Discretization

Unfortunately, the discretized chain may be transient, for instance
when

$$\lim_{x \to \pm\infty} \left| \sigma^2\, \nabla \log f(x)\, |x|^{-1} \right| > 1$$

Example of $f(x) = \exp(-x^4)$ when $\sigma^2 = .2$

MH correction

Accept the new value $Y_t$ with probability

$$\frac{f(Y_t)}{f(x^{(t)})} \cdot \frac{\exp\left\{ -\left\| x^{(t)} - Y_t - \frac{\sigma^2}{2} \nabla \log f(Y_t) \right\|^2 \big/ 2\sigma^2 \right\}}{\exp\left\{ -\left\| Y_t - x^{(t)} - \frac{\sigma^2}{2} \nabla \log f(x^{(t)}) \right\|^2 \big/ 2\sigma^2 \right\}} \wedge 1.$$

Choice of the scaling factor $\sigma$
Should lead to an acceptance rate of 0.574 to achieve optimal
convergence rates (when the components of $x$ are uncorrelated)
[Roberts & Rosenthal, 1998]
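A hedged Python sketch of the corrected algorithm for $f(x) = \exp(-x^4)$ follows. The acceptance probability is the generic Metropolis–Hastings ratio $f(y)\,q(y, x) / \{f(x)\,q(x, y)\}$, with $q(x, \cdot)$ the Gaussian density of the discretised Langevin move started at $x$. The names `mala` and `log_q`, the run length and the value $\sigma^2 = 0.3$ are illustrative assumptions; this step size is not tuned to reach the 0.574 acceptance rate.

```python
import math
import random


def log_f(x):
    return -x ** 4


def grad_log_f(x):
    return -4.0 * x ** 3


def log_q(x, y, sigma2):
    """Log density (up to a constant) of proposing y from x."""
    drift = x + 0.5 * sigma2 * grad_log_f(x)
    return -((y - drift) ** 2) / (2.0 * sigma2)


def mala(sigma2, n_iter, x0, rng):
    """Langevin proposal followed by a Metropolis-Hastings accept/reject step."""
    sigma = math.sqrt(sigma2)
    x, chain, n_accept = x0, [x0], 0
    for _ in range(n_iter):
        y = x + 0.5 * sigma2 * grad_log_f(x) + sigma * rng.gauss(0.0, 1.0)
        # log of f(y) q(y, x) / (f(x) q(x, y))
        log_ratio = (log_f(y) + log_q(y, x, sigma2)
                     - log_f(x) - log_q(x, y, sigma2))
        if math.log(1.0 - rng.random()) < log_ratio:
            x = y
            n_accept += 1
        chain.append(x)
    return chain, n_accept / n_iter


rng = random.Random(6)
chain, rate = mala(0.3, 20000, 0.0, rng)
second_moment = sum(x * x for x in chain) / len(chain)
print(rate, second_moment)
```

In practice `sigma2` would be adjusted over a few pilot runs until `rate` is in the neighbourhood of the recommended value.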

Optimizing the Acceptance Rate

Problem of choice of the transition kernel from a practical point of
view
Most common alternatives:
1 a fully automated algorithm like ARMS;
2 an instrumental density $g$ which approximates $f$, such that
$f/g$ is bounded for uniform ergodicity to apply;
3 a random walk.
In both cases 2 and 3, the choice of $g$ is critical.

Case of the random walk

Different approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is
moving correctly, since it may instead indicate that the random walk
is moving too slowly on the surface of $f$.
If $x^{(t)}$ and $y_t$ are close, i.e. $f(x^{(t)}) \simeq f(y_t)$, $y_t$ is accepted with
probability

$$\min\left\{ \frac{f(y_t)}{f(x^{(t)})},\ 1 \right\} \simeq 1.$$

For multimodal densities with well separated modes, the negative
effect of limited moves on the surface of $f$ clearly shows.

Case of the random walk (2)

If the average acceptance rate is low, the successive values of $f(y_t)$
tend to be small compared with $f(x^{(t)})$, which means that the
random walk moves quickly on the surface of $f$ since it often
reaches the "borders" of the support of $f$

Rule of thumb

In small dimensions, aim at an average acceptance rate of 50%. In
large dimensions, at an average acceptance rate of 25%.
[Gelman, Gilks and Roberts, 1995]

This rule is to be taken with a pinch of salt!
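The link between scale, acceptance rate and actual movement can be checked numerically. The sketch below runs a Gaussian random walk Metropolis sampler on a standard normal target for three scales and reports, next to the acceptance rate, the mean squared jump between successive states. The target, the scales 0.1, 2.4 and 50, and the use of the mean squared jump as a measure of movement are all assumptions made for this illustration.

```python
import math
import random


def rw_summary(scale, n_iter, rng):
    """Acceptance rate and mean squared jump of a random walk on N(0, 1)."""
    x, n_accept, sq_jump = 0.0, 0, 0.0
    for _ in range(n_iter):
        y = x + scale * rng.gauss(0.0, 1.0)
        # log of f(y) / f(x) for the N(0, 1) target
        if math.log(1.0 - rng.random()) < 0.5 * (x * x - y * y):
            sq_jump += (y - x) ** 2
            x = y
            n_accept += 1
    return n_accept / n_iter, sq_jump / n_iter


rng = random.Random(7)
scales = (0.1, 2.4, 50.0)  # hypothetical small, moderate and large scales
results = [rw_summary(s, 5000, rng) for s in scales]
rates = [r for r, _ in results]
jumps = [j for _, j in results]
for s, r, j in zip(scales, rates, jumps):
    print(s, round(r, 3), round(j, 4))
```

The smallest scale accepts almost every proposal yet barely moves, and the largest scale moves far but very rarely, which is one way to read the warning that a high acceptance rate is not in itself a sign of good exploration.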

Example (Noisy AR(1) continued)
For a Gaussian random walk with scale $\omega$ small enough, the
random walk never jumps to the other mode. But if the scale $\omega$ is
sufficiently large, the Markov chain explores both modes and gives a
satisfactory approximation of the target distribution.

[Figure: Markov chain based on a random walk with scale $\omega = .1$]

[Figure: Markov chain based on a random walk with scale $\omega = .5$]

Where do we stand?
MCMC in a nutshell:
Running a sequence $X_{t+1} = \Psi(X_t, Y_t)$ provides an approximation
to the target density $f$ when the detailed balance condition holds

$$f(x) K(x, y) = f(y) K(y, x)$$

Easiest implementation of the principle is random walk
Metropolis-Hastings

$$Y_t = X^{(t)} + \varepsilon_t$$

Practical convergence requires sufficient energy from the
proposal, which is calibrated by trial and error.


Convergence diagnostics

How many iterations?



Rule # 1 There is no absolute number of simulations, i.e.
1,000 is neither large, nor small.
Rule # 2 It takes [much] longer to check for convergence
than for the chain itself to converge.
Rule # 3 MCMC is a "what-you-get-is-what-you-see"
algorithm: it fails to tell about unexplored parts of the space.
Rule # 4 When in doubt, run MCMC chains in parallel and
check for consistency.

Many "quick-&-dirty" solutions in the literature, but not
necessarily 100% trustworthy.


Example (Bimodal target)

$$f(x) = \frac{\exp(-x^2/2)}{\sqrt{2\pi}}\ \frac{4(x - .3)^2 + .01}{4(1 + (.3)^2) + .01}$$

and use of random walk Metropolis–Hastings algorithm with
variance .04
Evaluation of the missing mass by

$$\sum_{t=1}^{T-1} \left[ \theta_{(t+1)} - \theta_{(t)} \right] f(\theta_{(t)}),$$

where $\theta_{(1)} \le \cdots \le \theta_{(T)}$ denote the ordered values of the chain.

[Figure: density of $f$ over $-4 \le x \le 4$; sequence [in blue] and mass evaluation [in brown] over 2000 iterations.]

[Philippe & Robert, 2001]
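The mass evaluation is a Riemann sum over the sorted values visited by the chain, so it stays below 1 for as long as part of the support of $f$ is unexplored. A minimal Python sketch is given below. The function names, the starting value 1.0, the run length and the seed are assumptions; the target and the proposal variance .04 follow the example.

```python
import math
import random


def f(x):
    """Normalised bimodal density of the example."""
    gauss = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return gauss * (4.0 * (x - 0.3) ** 2 + 0.01) / (4.0 * (1.0 + 0.3 ** 2) + 0.01)


def riemann_mass(sample):
    """Riemann sum of f over the ordered sample: sum of gaps times f."""
    s = sorted(sample)
    return sum((b - a) * f(a) for a, b in zip(s, s[1:]))


def random_walk(n_iter, x0, variance, rng):
    """Random walk Metropolis-Hastings on f with Gaussian increments."""
    sd = math.sqrt(variance)
    x, chain = x0, [x0]
    for _ in range(n_iter):
        y = x + sd * rng.gauss(0.0, 1.0)
        if rng.random() < f(y) / f(x):
            x = y
        chain.append(x)
    return chain


rng = random.Random(8)
chain = random_walk(2000, 1.0, 0.04, rng)  # hypothetical starting value 1.0
mass_500 = riemann_mass(chain[:500])
mass_2000 = riemann_mass(chain)
print(mass_500, mass_2000)

# Sanity check on a fine regular grid: the sum should be close to 1.
grid = [-8.0 + 0.001 * i for i in range(16001)]
grid_mass = riemann_mass(grid)
```

Since the sorted values of `chain[:500]` span a sub-range of those of the full chain, comparing `mass_500` with `mass_2000` gives a rough picture of how the explored mass builds up.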


Effective sample size
How many iid simulations from $\pi$ are equivalent to $N$ simulations
from the MCMC algorithm?

Based on estimated $k$-th order auto-correlation,

$$\rho_k = \operatorname{corr}\left( x^{(t)}, x^{(t+k)} \right),$$

effective sample size

$$N^{\mathrm{ess}} = n \left( 1 + 2 \sum_{k=1}^{T_0} \hat\rho_k \right)^{-1/2},$$

Only partial indicator that fails to signal chains stuck in one
mode of the target
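A hedged Python sketch of this computation follows. The empirical autocorrelations $\hat\rho_k$ are summed up to a truncation lag and plugged into $n (1 + 2\sum_k \hat\rho_k)^{-\text{power}}$. The exponent is exposed as a parameter because conventions differ: `power=1.0`, the default assumed here, gives the widely used $n / (1 + 2\sum_k \hat\rho_k)$, while `power=0.5` reproduces the expression with exponent $-1/2$. The AR(1) test series, the truncation lags and the function names are illustrative assumptions.

```python
import random


def autocorrelations(chain, max_lag):
    """Empirical autocorrelations of orders 1, ..., max_lag."""
    n = len(chain)
    mean = sum(chain) / n
    centred = [x - mean for x in chain]
    var = sum(c * c for c in centred) / n
    return [
        sum(centred[t] * centred[t + k] for t in range(n - k)) / (n * var)
        for k in range(1, max_lag + 1)
    ]


def effective_sample_size(chain, max_lag, power=1.0):
    """n * (1 + 2 * sum of autocorrelations) ** (-power)."""
    rho = autocorrelations(chain, max_lag)
    return len(chain) * (1.0 + 2.0 * sum(rho)) ** (-power)


rng = random.Random(9)
n = 20000

# Independent draws: the effective sample size should be close to n.
iid = [rng.gauss(0.0, 1.0) for _ in range(n)]
ess_iid = effective_sample_size(iid, 20)

# AR(1) series with coefficient 0.9, for which 1 + 2 * sum(rho_k) is about 19.
ar = [0.0]
for _ in range(n - 1):
    ar.append(0.9 * ar[-1] + rng.gauss(0.0, 1.0))
ess_ar = effective_sample_size(ar, 100)
print(ess_iid, ess_ar)
```

The truncation lag matters: too small a value ignores real dependence, while too large a value adds noise from poorly estimated high-order autocorrelations.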


Tempering

Facilitate exploration of $\pi$ by flattening the target: simulate from
$\pi_\alpha(x) \propto \pi(x)^\alpha$ for $\alpha > 0$ small enough
Determine where the modal regions of $\pi$ are (possibly with
parallel versions using different $\alpha$'s)
Recycle simulations from $\pi(x)^\alpha$ into simulations from $\pi$ by
importance sampling
Simple modification of the Metropolis–Hastings algorithm,
with new acceptance

$$\left( \frac{\pi(\theta' | x)}{\pi(\theta | x)} \right)^{\alpha} \frac{q(\theta | \theta')}{q(\theta' | \theta)} \wedge 1$$
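The tempering recipe can be sketched on a toy target. Everything specific below is an assumption chosen for illustration: the one-dimensional target is an equal mixture of $N(-3, 0.5^2)$ and $N(3, 0.5^2)$, the proposal is a symmetric Gaussian random walk (so the $q$ ratio cancels), and $\alpha = 0.1$. The tempered draws are then recycled by importance sampling with weights proportional to $\pi(x)^{1-\alpha}$.

```python
import math
import random


def log_pi(x):
    """Log of the assumed target: 0.5 N(-3, 0.5^2) + 0.5 N(3, 0.5^2), up to a constant."""
    a = -0.5 * ((x + 3.0) / 0.5) ** 2
    b = -0.5 * ((x - 3.0) / 0.5) ** 2
    m = max(a, b)  # log-sum-exp for numerical stability
    return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))


def tempered_random_walk(alpha, scale, n_iter, x0, rng):
    """Random walk Metropolis targeting pi(x)^alpha."""
    x, lp, chain = x0, log_pi(x0), []
    for _ in range(n_iter):
        y = x + scale * rng.gauss(0.0, 1.0)
        lp_y = log_pi(y)
        # Acceptance ratio (pi(y) / pi(x))^alpha for a symmetric proposal.
        if math.log(1.0 - rng.random()) < alpha * (lp_y - lp):
            x, lp = y, lp_y
        chain.append(x)
    return chain


def reweighted_mean(chain, alpha, h):
    """Self-normalised importance sampling estimate of E_pi[h(X)]."""
    log_w = [(1.0 - alpha) * log_pi(x) for x in chain]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    return sum(wi * h(x) for wi, x in zip(w, chain)) / sum(w)


rng = random.Random(10)

# Untempered chain (alpha = 1) with a small scale, started in the right-hand mode.
plain = tempered_random_walk(1.0, 0.5, 20000, 3.0, rng)

# Tempered chain (alpha = 0.1), then recycled towards pi.
tempered = tempered_random_walk(0.1, 1.5, 20000, 3.0, rng)
p_right = reweighted_mean(tempered, 0.1, lambda x: 1.0 if x > 0 else 0.0)
print(min(plain), min(tempered), p_right)
```

Under the assumed target, the probability of the right-hand mode is 0.5 by symmetry, which `p_right` can be compared against.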


Tempering with the mean mixture

[Figure: three panels labelled 1, 0.5 and 0.2, each drawn over the range −1 to 4 on both axes.]
