1. An introduction to advanced (?) MCMC methods
Christian P. Robert
Université Paris-Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian
Royal Statistical Society, October 13, 2010
Motivating example
1 Motivating example
2 The Metropolis-Hastings Algorithm
Latent structures make life harder!
Even simple models may lead to computational complications, as in latent variable models
\[ f(x|\theta) = \int f^\star(x, x^\star|\theta)\, dx^\star \]
If $(x, x^\star)$ is observed, fine!
If only $x$ is observed, trouble!
Example (Mixture models)
Models of mixtures of distributions:
\[ X \sim f_j \quad \text{with probability } p_j, \]
for $j = 1, 2, \ldots, k$, with overall density
\[ X \sim p_1 f_1(x) + \cdots + p_k f_k(x). \]
For a sample of independent random variables $(X_1, \ldots, X_n)$, sample density
\[ \prod_{i=1}^{n} \{ p_1 f_1(x_i) + \cdots + p_k f_k(x_i) \}. \]
Expanding this product involves $k^n$ elementary terms: prohibitive to compute in large samples.
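To see where the $k^n$ count comes from, the product can be expanded by picking one component per observation. The index vector $(j_1, \ldots, j_n)$ below is notation introduced here for illustration only:

```latex
\prod_{i=1}^{n} \sum_{j=1}^{k} p_j f_j(x_i)
  \;=\; \sum_{(j_1,\ldots,j_n) \in \{1,\ldots,k\}^n} \; \prod_{i=1}^{n} p_{j_i} f_{j_i}(x_i)
```

The right-hand sum runs over $k^n$ allocations of observations to components; with $k = 2$ and $n = 100$ this is already $2^{100} \approx 1.27 \times 10^{30}$ terms.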
A typology of Bayes computational problems
(i) use of a complex parameter space, as for instance in
constrained parameter sets like those resulting from imposing
stationarity constraints in dynamic models;
(ii) use of a complex sampling model with an intractable
likelihood, as for instance in missing data and graphical
models;
(iii) use of a huge dataset;
(iv) use of a complex prior distribution (which may be the
posterior distribution associated with an earlier sample);
(v) use of a complex inferential procedure, as for instance Bayes factors
\[ B^{\pi}_{01}(x) = \frac{P(\theta \in \Theta_0 \mid x)}{P(\theta \in \Theta_1 \mid x)} \bigg/ \frac{\pi(\theta \in \Theta_0)}{\pi(\theta \in \Theta_1)}. \]
The Metropolis-Hastings Algorithm
1 Motivating example
2 The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
The Metropolis–Hastings algorithm
A collection of Metropolis-Hastings algorithms
Extensions
Convergence assessment
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains
Fact: It is not necessary to use a sample from the distribution f to approximate the integral
\[ I = \int h(x) f(x)\, dx. \]
We can obtain $X_1, \ldots, X_n \sim f$ (approximately) without directly simulating from f, using an ergodic Markov chain with stationary distribution f.
Running Monte Carlo via Markov Chains (2)
Idea
For an arbitrary starting value $x^{(0)}$, an ergodic chain $(X^{(t)})$ is generated using a transition kernel with stationary distribution f.
Ensures the convergence in distribution of $(X^{(t)})$ to a random variable from f.
For a “large enough” $T_0$, $X^{(T_0)}$ can be considered as distributed from f.
Produces a dependent sample $X^{(T_0)}, X^{(T_0+1)}, \ldots$, which is (approximately) generated from f, sufficient for most approximation purposes.
The Metropolis–Hastings algorithm
Problem:
How can one build a Markov chain with a given stationary
distribution?
MH basics
Algorithm that converges to the objective (target) density f, using an arbitrary transition kernel density q(x, y), called the instrumental (or proposal) distribution.
The MH algorithm
Algorithm (Metropolis–Hastings)
Given $x^{(t)}$,
1. Generate $Y_t \sim q(x^{(t)}, y)$.
2. Take
\[ X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \rho(x^{(t)}, Y_t), \\ x^{(t)} & \text{with prob. } 1 - \rho(x^{(t)}, Y_t), \end{cases} \]
where
\[ \rho(x, y) = \min\left\{ \frac{f(y)}{f(x)}\, \frac{q(y, x)}{q(x, y)},\; 1 \right\}. \]
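The two steps of the algorithm can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `metropolis_hastings`, the N(0, 1) target and the asymmetric proposal $y \sim N(x/2, 1)$ are choices made here, not part of the slides. Densities are handled on the log scale to avoid underflow.

```python
import math
import random


def metropolis_hastings(log_f, propose, log_q, x0, n_iter, rng):
    """Generic Metropolis-Hastings chain (illustrative sketch).

    log_f(x)        -- log target density, up to an additive constant
    propose(x, rng) -- draws y from q(x, .)
    log_q(x, y)     -- log proposal density q(x, y), up to a constant free of x
    """
    x = x0
    chain = [x]
    for _ in range(n_iter):
        y = propose(x, rng)
        # log of f(y) q(y, x) / (f(x) q(x, y))
        log_ratio = log_f(y) + log_q(y, x) - log_f(x) - log_q(x, y)
        # accept y with probability min(ratio, 1); 1 - random() lies in (0, 1]
        if math.log(1.0 - rng.random()) < log_ratio:
            x = y
        chain.append(x)
    return chain


# Assumed example: N(0, 1) target with an asymmetric proposal y ~ N(x / 2, 1)
rng = random.Random(1)
chain = metropolis_hastings(
    log_f=lambda x: -0.5 * x * x,
    propose=lambda x, r: r.gauss(0.5 * x, 1.0),
    log_q=lambda x, y: -0.5 * (y - 0.5 * x) ** 2,
    x0=0.0,
    n_iter=20000,
    rng=rng,
)
mean = sum(chain) / len(chain)
var = sum((v - mean) ** 2 for v in chain) / len(chain)
print(round(mean, 2), round(var, 2))
```

Because only the ratio $f(y)q(y,x)/(f(x)q(x,y))$ enters, neither `log_f` nor `log_q` needs its normalizing constant.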
Features
Independent of normalizing constants for both f and q(x, ·) (i.e., those constants independent of x)
Never moves to values with f(y) = 0
The chain $(x^{(t)})_t$ may take the same value several times in a row, even though f is a density wrt Lebesgue measure
The sequence $(y_t)_t$ is usually not a Markov chain
Satisfies the detailed balance condition
\[ f(x) K(x, y) = f(y) K(y, x) \]
[Green, 1995]
[Figure: transitions P(θ → θ′) and P(θ′ → θ) between two states θ and θ′]
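A short check of detailed balance for the Metropolis–Hastings kernel, written for $x \neq y$, where the kernel density is $K(x, y) = q(x, y)\,\rho(x, y)$ (the case $x = y$ is trivial):

```latex
f(x)\, K(x, y)
  = f(x)\, q(x, y)\, \min\left\{ \frac{f(y)\, q(y, x)}{f(x)\, q(x, y)},\, 1 \right\}
  = \min\{ f(y)\, q(y, x),\; f(x)\, q(x, y) \}
```

The last expression is symmetric in $(x, y)$, so it also equals $f(y)\,K(y, x)$.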
Convergence properties
1. The M-H Markov chain is reversible, with invariant/stationary density f.
2. As f is a probability measure, the chain is positive recurrent.
3. If
\[ \Pr\left[ \frac{f(Y_t)\, q(Y_t, X^{(t)})}{f(X^{(t)})\, q(X^{(t)}, Y_t)} \ge 1 \right] < 1, \qquad (1) \]
i.e., if the event $\{X^{(t+1)} = X^{(t)}\}$ occurs with positive probability, then the chain is aperiodic.
Convergence properties (2)
4. If
\[ q(x, y) > 0 \quad \text{for every } (x, y), \qquad (2) \]
the chain is irreducible.
5. For M-H, f-irreducibility implies Harris recurrence.
6. Thus, under conditions (1) and (2),
(i) for h with $E_f|h(X)| < \infty$,
\[ \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} h(X^{(t)}) = \int h(x)\, df(x) \quad \text{a.e. } f; \]
(ii) and
\[ \lim_{n \to \infty} \left\| \int K^n(x, \cdot)\, \mu(dx) - f \right\|_{TV} = 0 \]
for every initial distribution µ, where $K^n(x, \cdot)$ denotes the kernel for n transitions.
A collection of Metropolis-Hastings algorithms
The Independent Case
The instrumental distribution q(x, ·) is independent of x and is denoted g.
Algorithm (Independent Metropolis-Hastings)
Given $x^{(t)}$,
1. Generate $Y_t \sim g(y)$.
2. Take
\[ X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \min\left\{ \dfrac{f(Y_t)\, g(x^{(t)})}{f(x^{(t)})\, g(Y_t)},\; 1 \right\}, \\ x^{(t)} & \text{otherwise.} \end{cases} \]
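A minimal Python sketch of the independent sampler follows. The N(0, 1) target and the N(0, 2²) proposal are assumptions chosen for illustration (for this pair f/g is bounded by M = 2); the acceptance ratio is written through the importance weight w = f/g, since $f(y)g(x)/(f(x)g(y)) = w(y)/w(x)$.

```python
import math
import random


def independent_mh(log_f, sample_g, log_g, x0, n_iter, rng):
    """Independent Metropolis-Hastings chain (illustrative sketch)."""
    x = x0
    log_w_x = log_f(x) - log_g(x)          # log of f/g at the current state
    chain = [x]
    for _ in range(n_iter):
        y = sample_g(rng)                  # proposal does not depend on x
        log_w_y = log_f(y) - log_g(y)
        # accept with probability min{w(y) / w(x), 1}
        if math.log(1.0 - rng.random()) < log_w_y - log_w_x:
            x, log_w_x = y, log_w_y
        chain.append(x)
    return chain


# Assumed example: target N(0, 1), proposal g = N(0, 2^2), both unnormalised
rng = random.Random(7)
chain = independent_mh(
    log_f=lambda x: -0.5 * x * x,
    sample_g=lambda r: r.gauss(0.0, 2.0),
    log_g=lambda x: -x * x / 8.0,
    x0=0.0,
    n_iter=20000,
    rng=rng,
)
mean = sum(chain) / len(chain)
var = sum((v - mean) ** 2 for v in chain) / len(chain)
print(round(mean, 2), round(var, 2))
```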
Properties
The resulting sample is not iid, but there exist strong convergence properties:
Theorem (Ergodicity)
The algorithm produces a uniformly ergodic chain if there exists a constant M such that
\[ f(x) \le M g(x), \qquad x \in \operatorname{supp} f. \]
In this case,
\[ \| K^n(x, \cdot) - f \|_{TV} \le \left( 1 - \frac{1}{M} \right)^{n}. \]
[Mengersen & Tweedie, 1996]
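As a small arithmetic illustration of the bound, the number of iterations needed to bring $(1 - 1/M)^n$ below a tolerance can be computed directly. The helper name `iterations_for_tv` and the tolerance 0.01 are assumptions made for this sketch; it requires M > 1.

```python
import math


def iterations_for_tv(M, eps):
    """Smallest n with (1 - 1/M)**n <= eps, assuming M > 1 and 0 < eps < 1."""
    return math.ceil(math.log(eps) / math.log(1.0 - 1.0 / M))


n_tight = iterations_for_tv(2.0, 0.01)    # proposal close to the target
n_loose = iterations_for_tv(10.0, 0.01)   # proposal far from the target
print(n_tight, n_loose)
```

The closer g is to f (M near 1), the fewer iterations the bound requires.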
Example (Noisy AR(1))
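Hidden Markov chain from a regular AR(1) model,
\[ x_{t+1} = \varphi x_t + \epsilon_{t+1}, \qquad \epsilon_t \sim N(0, \tau^2), \]
and observables
\[ y_t | x_t \sim N(x_t^2, \sigma^2). \]
The distribution of $x_t$ given $x_{t-1}$, $x_{t+1}$ and $y_t$ is proportional to
\[ \exp\left\{ \frac{-1}{2\tau^2} \left[ (x_t - \varphi x_{t-1})^2 + (x_{t+1} - \varphi x_t)^2 + \frac{\tau^2}{\sigma^2} (y_t - x_t^2)^2 \right] \right\}. \]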
Example (Noisy AR(1) too)
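Use for proposal the $N(\mu_t, \omega_t^2)$ distribution, with
\[ \mu_t = \varphi\, \frac{x_{t-1} + x_{t+1}}{1 + \varphi^2} \quad \text{and} \quad \omega_t^2 = \frac{\tau^2}{1 + \varphi^2}. \]
Ratio
\[ \pi(x) / q_{\mathrm{ind}}(x) = \exp\{ -(y_t - x_t^2)^2 / 2\sigma^2 \} \]
is bounded.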
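A single-site update with this independent proposal might look as follows. The function name `update_xt` and all numerical values (neighbours at 0, $y_t = 1$, φ = 0.9, τ = 1, σ = 0.5) are hypothetical choices made for illustration. Since the proposal matches the AR(1) part of the conditional, only the observation term remains in the acceptance ratio.

```python
import math
import random


def update_xt(x_t, x_prev, x_next, y_t, phi, tau, sigma, rng):
    """One independent MH update of x_t given its neighbours and y_t (sketch)."""
    mu = phi * (x_prev + x_next) / (1.0 + phi ** 2)
    omega = tau / math.sqrt(1.0 + phi ** 2)
    prop = rng.gauss(mu, omega)
    # target / proposal is exp{-(y_t - x^2)^2 / (2 sigma^2)} up to a constant
    log_w_prop = -(y_t - prop ** 2) ** 2 / (2.0 * sigma ** 2)
    log_w_curr = -(y_t - x_t ** 2) ** 2 / (2.0 * sigma ** 2)
    if math.log(1.0 - rng.random()) < log_w_prop - log_w_curr:
        return prop
    return x_t


# Hypothetical values, for illustration only
rng = random.Random(2)
x_t = 0.5
chain = [x_t]
for _ in range(5000):
    x_t = update_xt(x_t, x_prev=0.0, x_next=0.0, y_t=1.0,
                    phi=0.9, tau=1.0, sigma=0.5, rng=rng)
    chain.append(x_t)
print(min(chain), max(chain))
```

With $y_t > 0$ the conditional is bimodal (near $\pm\sqrt{y_t}$), and the independent proposal can reach both modes.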
[Figure: (top) last 500 realisations of the chain $\{X_k\}_k$ out of 10,000 iterations; (bottom) histogram of the chain, compared with the target distribution.]
Random walk Metropolis–Hastings
Instead, use a local perturbation as proposal
\[ Y_t = X^{(t)} + \varepsilon_t, \]
where $\varepsilon_t \sim g$, independent of $X^{(t)}$.
The instrumental density is now of the form g(y − x) and the Markov chain is a random walk if g is symmetric, g(x) = g(−x).
Algorithm (Random walk Metropolis)
Given $x^{(t)}$,
1. Generate $Y_t \sim g(y - x^{(t)})$.
2. Take
\[ X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \min\left\{ 1,\; \dfrac{f(Y_t)}{f(x^{(t)})} \right\}, \\ x^{(t)} & \text{otherwise.} \end{cases} \]
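The random walk version needs only the target and a scale. The sketch below assumes a Gaussian increment g and a N(0, 1) target, both chosen here for illustration; the function name `random_walk_metropolis` is likewise hypothetical.

```python
import math
import random


def random_walk_metropolis(log_f, x0, scale, n_iter, rng):
    """Random walk Metropolis with Gaussian increments (illustrative sketch)."""
    x, log_fx = x0, log_f(x0)
    chain, accepted = [x], 0
    for _ in range(n_iter):
        y = x + scale * rng.gauss(0.0, 1.0)   # symmetric perturbation
        log_fy = log_f(y)
        # accept with probability min{1, f(y) / f(x)}
        if math.log(1.0 - rng.random()) < log_fy - log_fx:
            x, log_fx = y, log_fy
            accepted += 1
        chain.append(x)
    return chain, accepted / n_iter


# Assumed example: N(0, 1) target, scale 1
rng = random.Random(5)
chain, rate = random_walk_metropolis(lambda x: -0.5 * x * x, 0.0, 1.0, 20000, rng)
mean = sum(chain) / len(chain)
print(round(mean, 2), round(rate, 2))
```

The proposal density cancels from the ratio because g is symmetric.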
Probit illustration
Likelihood and posterior given by
\[ \pi(\beta|y, X) \propto \ell(\beta|y, X) \propto \prod_{i=1}^{n} \Phi(x_i^{T}\beta)^{y_i} \left(1 - \Phi(x_i^{T}\beta)\right)^{n_i - y_i} \]
under the flat prior.
A random walk proposal works well for a small number of predictors. Use the maximum likelihood estimate $\hat\beta$ as starting value and the asymptotic (Fisher) covariance matrix of the MLE, $\hat\Sigma$, as scale.
MCMC algorithm
Probit random-walk Metropolis-Hastings
Initialization: Set $\beta^{(0)} = \hat\beta$ and compute $\hat\Sigma$
Iteration t:
1. Generate $\tilde\beta \sim N_{k+1}(\beta^{(t-1)}, \tau\hat\Sigma)$
2. Compute
\[ \rho(\beta^{(t-1)}, \tilde\beta) = \min\left\{ 1,\; \frac{\pi(\tilde\beta|y)}{\pi(\beta^{(t-1)}|y)} \right\} \]
3. With probability $\rho(\beta^{(t-1)}, \tilde\beta)$ set $\beta^{(t)} = \tilde\beta$; otherwise set $\beta^{(t)} = \beta^{(t-1)}$.
R bank benchmark
Probit modelling with no intercept over the four measurements.
Three different scales τ = 1, 0.1, 10: best mixing behavior is associated with τ = 1.
Average of the parameters over 9,000 MCMC iterations gives plug-in estimate
\[ \hat p_i = \Phi(-1.2193\, x_{i1} + 0.9540\, x_{i2} + 0.9795\, x_{i3} + 1.1481\, x_{i4}). \]
[Figure: one row of three plots for each of the four coefficients]
Example (Mixture models)
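\[ \pi(\theta|x) \propto \prod_{j=1}^{n} \left( \sum_{\ell=1}^{k} p_\ell\, f(x_j|\mu_\ell, \sigma_\ell) \right) \pi(\theta) \]
Metropolis-Hastings proposal:
\[ \theta^{(t+1)} = \begin{cases} \theta^{(t)} + \omega\varepsilon^{(t)} & \text{if } u^{(t)} < \rho^{(t)}, \\ \theta^{(t)} & \text{otherwise,} \end{cases} \]
where
\[ \rho^{(t)} = \frac{\pi(\theta^{(t)} + \omega\varepsilon^{(t)}|x)}{\pi(\theta^{(t)}|x)} \wedge 1 \]
and ω scaled for good acceptance rate.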
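A rough Python sketch of this random walk on the two means of a mixture .7N(µ1, 1) + .3N(µ2, 1) is given below. Several ingredients are assumptions made here and not taken from the slides: the simulated data (true means 0 and 2.5), the N(0, 10²) prior on each mean, the scale ω = 0.3 and the starting value (1, 1).

```python
import math
import random


def log_posterior(mu1, mu2, data, p=0.7, prior_sd=10.0):
    """Log posterior of (mu1, mu2) for p N(mu1, 1) + (1 - p) N(mu2, 1).

    The N(0, prior_sd^2) prior on each mean is an assumption of this sketch.
    """
    lp = -(mu1 ** 2 + mu2 ** 2) / (2.0 * prior_sd ** 2)
    for x in data:
        a = math.log(p) - 0.5 * (x - mu1) ** 2
        b = math.log(1.0 - p) - 0.5 * (x - mu2) ** 2
        m = max(a, b)
        lp += m + math.log(math.exp(a - m) + math.exp(b - m))  # stable log-sum-exp
    return lp


rng = random.Random(3)
# Simulated data with assumed true means 0 and 2.5
data = [rng.gauss(0.0, 1.0) if rng.random() < 0.7 else rng.gauss(2.5, 1.0)
        for _ in range(100)]

omega = 0.3                  # random walk scale, to be tuned
theta = (1.0, 1.0)           # arbitrary starting value
lp = log_posterior(theta[0], theta[1], data)
chain, accepted = [theta], 0
for _ in range(3000):
    cand = (theta[0] + omega * rng.gauss(0.0, 1.0),
            theta[1] + omega * rng.gauss(0.0, 1.0))
    lp_cand = log_posterior(cand[0], cand[1], data)
    # u < rho with rho = pi(cand | x) / pi(theta | x) ∧ 1, on the log scale
    if math.log(1.0 - rng.random()) < lp_cand - lp:
        theta, lp = cand, lp_cand
        accepted += 1
    chain.append(theta)
print(accepted / 3000, chain[-1])
```

Each posterior evaluation costs O(nk) operations, so the $k^n$ expansion of the likelihood is never needed.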
[Figure sequence: random walk MCMC output for .7N(µ1, 1) + .3N(µ2, 1) with scale 1, in the (µ1, µ2) plane, after 1, 10, 100, 500 and 1000 iterations]
[Figure sequence: random walk MCMC output for .7N(µ1, 1) + .3N(µ2, 1) with scale √.1, in the (µ1, µ2) plane, after 10, 100, 500, 1000, 5000 and 10,000 iterations]
Convergence properties
Uniform ergodicity prohibited by random walk structure
At best, geometric ergodicity:
Theorem (Sufficient ergodicity)
For a symmetric density f, log-concave in the tails, and a positive and symmetric density g, the chain $(X^{(t)})$ is geometrically ergodic.
[Mengersen & Tweedie, 1996]
Example (Comparison of tail effects)
Random-walk Metropolis–Hastings algorithms based on a N(0, 1) instrumental for the generation of (a) a N(0, 1) distribution and (b) a distribution with density $\psi(x) \propto (1 + |x|)^{-3}$.
[Figure: panels (a) and (b), 90% confidence envelopes of the means, derived from 500 parallel independent chains]
Extensions
There are many other families of MH algorithms
Adaptive Rejection Metropolis Sampling
Reversible Jump
Langevin algorithms
to name just a few...
Langevin Algorithms
Proposal based on the Langevin diffusion $L_t$, defined by the stochastic differential equation
\[ dL_t = dB_t + \frac{1}{2} \nabla \log f(L_t)\, dt, \]
where $B_t$ is the standard Brownian motion.
Theorem
The Langevin diffusion is the only non-explosive diffusion which is reversible with respect to f.
Discretization
Because continuous time cannot be simulated, consider the discretised sequence
\[ x^{(t+1)} = x^{(t)} + \frac{\sigma^2}{2} \nabla \log f(x^{(t)}) + \sigma \varepsilon_t, \qquad \varepsilon_t \sim N_p(0, I_p), \]
where $\sigma^2$ corresponds to the discretisation step.
[Figure sequence: density plots for the example f(x) = exp(−x⁴), with σ² = .1, .01, .001, .0001 and .0001∗]
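The discretised sequence can be simulated directly. The sketch below takes the example f(x) = exp(−x⁴), for which ∇ log f(x) = −4x³, with σ² = .01; the function name `langevin_chain`, the chain length and the starting value are assumptions made here. No accept/reject step is applied, so the output only approximates f.

```python
import math
import random


def langevin_chain(grad_log_f, x0, sigma2, n_iter, rng):
    """Discretised Langevin sequence without MH correction (sketch)."""
    sigma = math.sqrt(sigma2)
    x = x0
    chain = [x]
    for _ in range(n_iter):
        # x + (sigma^2 / 2) * grad log f(x) + sigma * eps, with eps ~ N(0, 1)
        x = x + 0.5 * sigma2 * grad_log_f(x) + sigma * rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


rng = random.Random(11)
chain = langevin_chain(lambda x: -4.0 * x ** 3, x0=0.0, sigma2=0.01,
                       n_iter=50000, rng=rng)
second_moment = sum(v * v for v in chain) / len(chain)
print(round(second_moment, 2))
```

For comparison, under f ∝ exp(−x⁴) the exact second moment is Γ(3/4)/Γ(1/4) ≈ 0.338.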
Unfortunately, the discretized chain may be transient, for instance when
\[ \lim_{x \to \pm\infty} \left| \sigma^2\, \nabla \log f(x)\, |x|^{-1} \right| > 1. \]
Example of f(x) = exp(−x⁴) when σ² = .2
MH correction
Accept the new value $Y_t$ with probability
\[ \frac{f(Y_t)}{f(x^{(t)})} \cdot \frac{\exp\left\{ -\left\| x^{(t)} - Y_t - \frac{\sigma^2}{2} \nabla \log f(Y_t) \right\|^2 \big/ 2\sigma^2 \right\}}{\exp\left\{ -\left\| Y_t - x^{(t)} - \frac{\sigma^2}{2} \nabla \log f(x^{(t)}) \right\|^2 \big/ 2\sigma^2 \right\}} \wedge 1, \]
that is, the Metropolis–Hastings ratio $f(y)\,q(y, x) / (f(x)\,q(x, y))$ for the discretised Langevin proposal.
Choice of the scaling factor σ
Should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of x are uncorrelated)
[Roberts & Rosenthal, 1998]
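A one-dimensional sketch of the corrected (Metropolis-adjusted) Langevin step is given below, again for the example f(x) = exp(−x⁴). The function name `mala` and the value σ² = 0.3 are assumptions chosen for illustration; no attempt is made here to tune the acceptance rate towards 0.574.

```python
import math
import random


def mala(log_f, grad_log_f, x0, sigma2, n_iter, rng):
    """Langevin proposal with Metropolis-Hastings correction (1-D sketch)."""
    sigma = math.sqrt(sigma2)

    def log_q(x, y):
        # log density, up to a constant, of proposing y from x
        return -(y - x - 0.5 * sigma2 * grad_log_f(x)) ** 2 / (2.0 * sigma2)

    x = x0
    chain, accepted = [x], 0
    for _ in range(n_iter):
        y = x + 0.5 * sigma2 * grad_log_f(x) + sigma * rng.gauss(0.0, 1.0)
        # log of f(y) q(y, x) / (f(x) q(x, y))
        log_ratio = log_f(y) - log_f(x) + log_q(y, x) - log_q(x, y)
        if math.log(1.0 - rng.random()) < log_ratio:
            x = y
            accepted += 1
        chain.append(x)
    return chain, accepted / n_iter


rng = random.Random(13)
chain, rate = mala(lambda x: -x ** 4, lambda x: -4.0 * x ** 3,
                   x0=0.0, sigma2=0.3, n_iter=20000, rng=rng)
second_moment = sum(v * v for v in chain) / len(chain)
print(round(rate, 2), round(second_moment, 2))
```

In practice σ would be adjusted over pilot runs until the observed acceptance rate is near the recommended value.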
Optimizing the Acceptance Rate
Problem of choice of the transition kernel from a practical point of view
Most common alternatives:
1. a fully automated algorithm like ARMS;
2. an instrumental density g which approximates f, such that f/g is bounded for uniform ergodicity to apply;
3. a random walk.
In both cases 2 and 3, the choice of g is critical.
Case of the random walk
Different approach to acceptance rates
A high acceptance rate does not necessarily indicate that the algorithm is moving correctly, since it may instead indicate that the random walk is moving too slowly on the surface of f.
If $x^{(t)}$ and $y_t$ are close, i.e. $f(x^{(t)}) \simeq f(y_t)$, then $y_t$ is accepted with probability
\[ \min\left\{ \frac{f(y_t)}{f(x^{(t)})},\; 1 \right\} \simeq 1. \]
For multimodal densities with well separated modes, the negative effect of limited moves on the surface of f clearly shows.
Case of the random walk (2)
If the average acceptance rate is low, the successive values of $f(y_t)$ tend to be small compared with $f(x^{(t)})$, which means that the random walk moves quickly on the surface of f since it often reaches the “borders” of the support of f.
Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, aim at an average acceptance rate of 25%.
[Gelman, Gilks and Roberts, 1995]
This rule is to be taken with a pinch of salt!
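How the acceptance rate responds to the scale can be checked empirically. The sketch below runs a Gaussian random walk Metropolis on a N(0, 1) target for three scales; the target, the scales 0.1, 1 and 10, and the helper name `acceptance_rate` are assumptions made for illustration.

```python
import math
import random


def acceptance_rate(scale, n_iter=20000, seed=0):
    """Empirical acceptance rate of random walk Metropolis on N(0, 1)."""
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n_iter):
        y = x + scale * rng.gauss(0.0, 1.0)
        # log f(y) - log f(x) for the standard normal target
        if math.log(1.0 - rng.random()) < 0.5 * (x * x - y * y):
            x = y
            accepted += 1
    return accepted / n_iter


rates = {s: acceptance_rate(s) for s in (0.1, 1.0, 10.0)}
for s, r in rates.items():
    print(s, round(r, 3))
```

A very small scale is accepted almost always but moves slowly, while a very large scale is mostly rejected; a scale is usually tuned by repeating such pilot runs.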
Example (Noisy AR(1) continued)
For a Gaussian random walk with scale ω small enough, the random walk never jumps to the other mode. But if the scale ω is sufficiently large, the Markov chain explores both modes and gives a satisfactory approximation of the target distribution.
[Figure: Markov chain based on a random walk with scale ω = .1]
[Figure: Markov chain based on a random walk with scale ω = .5]
Where do we stand?
MCMC in a nutshell:
Running a sequence $X_{t+1} = \Psi(X_t, Y_t)$ provides an approximation to the target density f when the detailed balance condition holds
\[ f(x) K(x, y) = f(y) K(y, x) \]
Easiest implementation of the principle is random walk Metropolis-Hastings
\[ Y_t = X^{(t)} + \varepsilon_t \]
Practical convergence requires sufficient energy from the proposal, which is calibrated by trial and error.
Convergence assessment
Convergence diagnostics
How many iterations?
Rule #1: There is no absolute number of simulations, i.e. 1,000 is neither large nor small.
Rule #2: It takes [much] longer to check for convergence than for the chain itself to converge.
Rule #3: MCMC is a “what-you-get-is-what-you-see” algorithm: it fails to tell about unexplored parts of the space.
Rule #4: When in doubt, run MCMC chains in parallel and check for consistency.
Many “quick-&-dirty” solutions in the literature, but not necessarily 100% trustworthy.
Example (Bimodal target)
\[ f(x) = \frac{\exp\{-x^2/2\}}{\sqrt{2\pi}}\; \frac{4(x - .3)^2 + .01}{4(1 + (.3)^2) + .01} \]
and use of random walk Metropolis–Hastings algorithm with variance .04
Evaluation of the missing mass by
\[ \sum_{t=1}^{T-1} \left[ \theta_{(t+1)} - \theta_{(t)} \right] f(\theta_{(t)}), \]
where $\theta_{(1)} \le \cdots \le \theta_{(T)}$ are the ordered values of the chain.
[Figure: plot of the density f over (−4, 4)]
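The mass evaluation can be sketched as a Riemann sum over the sorted chain values. This reading of the formula (ordered values, left endpoints) is an assumption, as are the starting value −1, the chain length 2000 and the helper name `riemann_mass`.

```python
import math
import random


def f(x):
    """Bimodal target density of the example."""
    return (math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
            * (4.0 * (x - 0.3) ** 2 + 0.01) / (4.0 * (1.0 + 0.3 ** 2) + 0.01))


def riemann_mass(values):
    """Riemann sum of f over the ordered chain values (left endpoints)."""
    s = sorted(values)
    return sum((s[i + 1] - s[i]) * f(s[i]) for i in range(len(s) - 1))


rng = random.Random(4)
x = -1.0                                   # assumed starting value
chain = [x]
for _ in range(2000):
    y = x + 0.2 * rng.gauss(0.0, 1.0)      # random walk with variance .04
    if rng.random() * f(x) < f(y):         # accept w.p. min{f(y) / f(x), 1}
        x = y
    chain.append(x)

mass = riemann_mass(chain)
print(round(mass, 3))
```

A value well below 1 suggests that part of the support, for instance a second mode, has not been visited yet.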
[Figure: sequence (in blue) and mass evaluation (in brown) against the iteration index, from 0 to 2000]
[Philippe & Robert, 2001]
Effective sample size
How many iid simulations from π are equivalent to N simulations from the MCMC algorithm?
Based on estimated k-th order auto-correlation,
\[ \rho_k = \operatorname{corr}\left( x^{(t)}, x^{(t+k)} \right), \]
effective sample size
\[ N^{\mathrm{ess}} = n \left( 1 + 2 \sum_{k=1}^{T_0} \hat\rho_k \right)^{-1/2}. \]
Only a partial indicator, which fails to signal chains stuck in one mode of the target.
Tempering
Facilitate exploration of π by flattening the target: simulate from $\pi_\alpha(x) \propto \pi(x)^{\alpha}$ for α > 0 small enough
Determine where the modal regions of π are (possibly with parallel versions using different α’s)
Recycle simulations from $\pi(x)^{\alpha}$ into simulations from π by importance sampling
Simple modification of the Metropolis–Hastings algorithm, with new acceptance
\[ \left( \frac{\pi(\theta'|x)}{\pi(\theta|x)} \right)^{\alpha} \frac{q(\theta|\theta')}{q(\theta'|\theta)} \wedge 1 \]
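A sketch of tempering with importance-sampling recycling is given below, for a symmetric random walk proposal so that the q ratio cancels. The target (an equal mixture of N(−2, 1) and N(2, 1)), α = 0.3, the scale 2 and the function names are all assumptions made for illustration. Weights proportional to $\pi^{1-\alpha}$ turn draws from $\pi_\alpha$ into a self-normalised estimate under π.

```python
import math
import random


def log_pi(x):
    """Assumed bimodal target: 0.5 N(-2, 1) + 0.5 N(2, 1), unnormalised."""
    a = -0.5 * (x + 2.0) ** 2
    b = -0.5 * (x - 2.0) ** 2
    m = max(a, b)
    return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))


def tempered_rw_chain(log_target, alpha, scale, x0, n_iter, rng):
    """Random walk MH on the flattened target pi**alpha (sketch)."""
    x, lp = x0, log_target(x0)
    chain = [x]
    for _ in range(n_iter):
        y = x + scale * rng.gauss(0.0, 1.0)
        lpy = log_target(y)
        # symmetric proposal: accept w.p. (pi(y) / pi(x))**alpha ∧ 1
        if math.log(1.0 - rng.random()) < alpha * (lpy - lp):
            x, lp = y, lpy
        chain.append(x)
    return chain


def reweighted_mean(h, chain, log_target, alpha):
    """Self-normalised importance sampling with weights pi**(1 - alpha)."""
    log_w = [(1.0 - alpha) * log_target(x) for x in chain]
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    return sum(wi * h(x) for wi, x in zip(w, chain)) / sum(w)


rng = random.Random(17)
chain = tempered_rw_chain(log_pi, alpha=0.3, scale=2.0, x0=0.0,
                          n_iter=30000, rng=rng)
estimate = reweighted_mean(lambda x: x * x, chain, log_pi, alpha=0.3)
print(round(estimate, 2))
```

For this assumed target the exact value of E[X²] is 5, which gives a reference point for the reweighted estimate.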
Tempering with the mean mixture
[Figure: three panels titled 1, 0.5 and 0.2, each with both axes ranging from −1 to 4]