1. Density exploration methods
Pierre Jacob
Department of Statistics, University of Oxford
pierre.jacob at stats.ox.ac.uk
March 2014
Pierre Jacob Density exploration 1/ 49
2. Outline
1 MCMC and multimodal target distributions
2 Parallel MCMC, tempering and equi-energy moves
3 Wang–Landau algorithm
4. MCMC and multimodal target distributions
Algorithm 1 Metropolis–Hastings targeting π
1: Init X0 ∈ X.
2: for t = 1 to T do
3: Sample X⋆ from some proposal distribution q(Xt−1, ·).
4: Compute the acceptance ratio:
α(Xt−1, X⋆) = min ( 1, [π(X⋆) q(X⋆, Xt−1)] / [π(Xt−1) q(Xt−1, X⋆)] ).
5: With probability α(Xt−1, X⋆), set Xt = X⋆;
otherwise Xt = Xt−1.
6: end for
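As a sketch only (not the code behind these slides), Algorithm 1 can be written in a few lines of Python with a Gaussian random-walk proposal; the target `log_pi`, the step size `sigma`, and the chain length are illustrative assumptions.

```python
import math
import random

def metropolis_hastings(log_pi, x0, sigma, T, seed=0):
    """Random-walk Metropolis-Hastings targeting exp(log_pi).

    With a symmetric Gaussian proposal, q(X*, X)/q(X, X*) = 1,
    so the acceptance ratio reduces to min(1, pi(X*)/pi(X)).
    """
    rng = random.Random(seed)
    x, chain = x0, [x0]
    for _ in range(T):
        x_star = x + rng.gauss(0.0, sigma)             # sample from q(x, .)
        log_alpha = min(0.0, log_pi(x_star) - log_pi(x))
        if rng.random() < math.exp(log_alpha):         # accept w.p. alpha
            x = x_star
        chain.append(x)                                # else keep x
    return chain

# Example: standard normal target, log pi(x) = -x^2/2 up to a constant.
chain = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, sigma=1.0, T=5000)
```

Working on the log scale avoids underflow when π takes very small values, which matters for the tempered densities discussed later.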
5. MCMC and multimodal target distributions
MCMC methods allow us to approximate
∫ φ(x)π(x)dx,
as long as π can be evaluated / estimated point-wise; they do so by
generating a chain
X0, X1, . . . , XT .
The guarantees are largely asymptotic, as T → ∞.
For multimodal target distributions the non-asymptotic regime
might be very different.
6. MCMC and multimodal target distributions
Figure : Posterior distribution of (µ1, µ2) in a Gaussian mixture model.
See Stephens (1997), Bayesian methods for mixtures of normal distributions, PhD thesis. Figure obtained using PAWL.
7. MCMC and multimodal target distributions
Figure : Posterior distribution of (r, θ) in a theta-Ricker Hidden Markov
model. See Polansky et al. (2009), Likelihood ridges and multimodality
in population growth rate models. Figure obtained using SMC2.
8. MCMC and multimodal target distributions
Figure : Toy example: a mixture of well-separated normal distributions.
9. MCMC and multimodal target distributions
Figure : Markov chain still stuck in one mode after 50,000 iterations.
10. MCMC and multimodal target distributions
Figure : Feast your eyes on the moustarget distribution!
11. MCMC and multimodal target distributions
Note that multimodal distributions are not difficult to sample
from if the modes are not well separated.
In fact we can [re]define a mode as a region from which
Metropolis–Hastings cannot escape.
Non-asymptotic Error Bounds for Sequential MCMC
Methods in Multimodal Settings. N. Schweizer 2012
12. Outline
1 MCMC and multimodal target distributions
2 Parallel MCMC, tempering and equi-energy moves
3 Wang–Landau algorithm
13. Parallel MCMC
A first idea is to run N chains independently, from various
starting points chosen to be “spread out”.
The chains can thus find multiple modes; other benefits include
parallelization and convergence diagnostics.
What if there are > N modes? What if all the chains are
initialized in the attraction zone of the same mode?
14. Parallel MCMC
Figure : Parallel MCMC on the moustarget distribution
15. Parallel MCMC
Figure : Parallel MCMC on the moustarget distribution
16. Parallel Tempering
The idea of parallel tempering is to run N chains targeting
different versions of π, of “increasing difficulty”.
Introduce “inverse temperatures”:
0 < γ1 < . . . < γN = 1.
Introduce “tempered” distributions πγn for n = 1, . . . , N.
For γ ≈ 0, πγ is considered easier to sample from, because the
variations of π are smaller.
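A small numerical illustration of this flattening effect (the two-mode density below is an illustrative stand-in, not the deck's example): raising π to a power γ < 1 shrinks the peak-to-valley ratio between the modes, so a random-walk chain crosses far more easily.

```python
import math

# Unnormalized two-mode density: modes at 0 and 10, deep valley at 5.
def log_pi(x):
    return math.log(math.exp(-0.5 * x ** 2) + math.exp(-0.5 * (x - 10.0) ** 2))

def peak_to_valley(gamma):
    # Ratio pi^gamma(mode) / pi^gamma(valley), the barrier the chain crosses.
    return math.exp(gamma * (log_pi(0.0) - log_pi(5.0)))

barrier_cold = peak_to_valley(1.0)   # barrier faced by the chain targeting pi
barrier_hot = peak_to_valley(0.1)    # far smaller barrier under gamma = 0.1
```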
17. Parallel Tempering
One MCMC chain per inverse temperature, for instance using
a Metropolis-Hastings kernel targeting πγn .
Note that the local modes of πγ are the same for every γ.
The chains interact through “swap moves”.
18. Parallel Tempering
When a “swap move” is to be performed, do the following.
Sample indices k1, k2 uniformly in {1, . . . , N}.
With acceptance probability
min ( 1, [πγk1(xk2) πγk2(xk1)] / [πγk1(xk1) πγk2(xk2)] ),
exchange the values of xk1 and xk2.
This doesn’t change the joint target distribution
πγ1 ⊗ πγ2 ⊗ . . . ⊗ πγN.
In particular the N-th chain still targets πγN = π.
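The swap move above can be sketched as follows; the temperature ladder and the Gaussian target are illustrative assumptions, and on the log scale the slide's ratio simplifies conveniently.

```python
import math
import random

def swap_move(xs, gammas, log_pi, rng):
    """One swap move between two uniformly chosen chains.

    The log of pi^{g1}(x2) pi^{g2}(x1) / (pi^{g1}(x1) pi^{g2}(x2))
    simplifies to (g1 - g2) * (log_pi(x2) - log_pi(x1)).
    """
    k1, k2 = rng.randrange(len(xs)), rng.randrange(len(xs))
    log_alpha = (gammas[k1] - gammas[k2]) * (log_pi(xs[k2]) - log_pi(xs[k1]))
    if rng.random() < math.exp(min(0.0, log_alpha)):
        xs[k1], xs[k2] = xs[k2], xs[k1]   # exchange states, keep temperatures
    return xs

# Illustrative ladder, states and target (assumptions, not from the slides):
rng = random.Random(1)
states = [0.0, 5.0, 10.0]
states = swap_move(states, [0.2, 0.6, 1.0], lambda x: -0.5 * (x - 10.0) ** 2, rng)
```

Swaps are what let the mode-hopping of the "hot" chains propagate up to the chain targeting π itself.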
19. Parallel Tempering
Figure : Parallel Tempering on the moustarget distribution, with γ
equally spaced in [0.5, 1].
20. Parallel Tempering
Figure : Parallel Tempering on the moustarget distribution, with γ
equally spaced in [0.5, 1].
21. Parallel Tempering
Figure : Parallel Tempering on the moustarget distribution, with γ
equally spaced in [0.1, 1].
22. Parallel Tempering
Figure : Parallel Tempering on the moustarget distribution, with γ
equally spaced in [0.1, 1].
23. Parallel Tempering
The choice of (γn)n=1,...,N is essential.
Taking γ1 very low increases the exploration for the chain
targeting πγ1 .
If the increments γn − γn−1 are too large, the swap moves
tend to be rejected, which decreases the exploration for the
“upper” chains.
24. SMC sampler
Sequence of distributions, for instance πγn for n = 1, . . . , N
such that 0 < γ1 < . . . < γN = 1. Say N = 100.
M particles (say 10,000), moved by sequential importance sampling
from µ to πγ1 and then from πγn−1 to πγn.
When the effective sample size is low, resample and then apply an
MCMC move to each particle (say 5 steps per particle).
The ability to recover modes is sensitive to the choice of the
initial distribution µ.
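The resampling trigger on this slide can be written directly from the definition of the effective sample size; the threshold of half the particle count and the toy weights below are illustrative assumptions.

```python
import random

def ess(weights):
    """Effective sample size of (unnormalized) importance weights."""
    s = sum(weights)
    return 1.0 / sum((w / s) ** 2 for w in weights)

def resample(particles, weights, rng):
    """Multinomial resampling: M draws with probability prop. to weight."""
    return rng.choices(particles, weights=weights, k=len(particles))

rng = random.Random(0)
particles = [0.0, 1.0, 2.0, 3.0]
weights = [0.97, 0.01, 0.01, 0.01]          # degenerate weights, low ESS
if ess(weights) < len(particles) / 2.0:     # a common threshold (M/2)
    particles = resample(particles, weights, rng)
    weights = [1.0] * len(particles)        # weights reset after resampling
```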
25. SMC sampler
Figure : SMC sampler on the moustarget distribution
µ = N ( (400, −100), diag(32², 32²) )
27. Equi-energy sampler
Same initial setting as parallel tempering: N chains (X^n_t), n = 1, . . . , N,
each targeting πγn using an MH kernel.
The first chain (X^1_t) simply targets πγ1 using MH.
For an upper chain (X^n_t), with probability ε perform an
equi-energy move, otherwise take an MH step targeting πγn.
Adaptive Equi-Energy Sampler : Convergence and
Illustration, Schreck, Fort, Moulines, 2013.
28. Equi-energy sampler
An equi-energy move consists in proposing a point from the
history of the chain just below, (X^{n−1}_t), and then accepting it
with an MH-type acceptance probability.
[Whereas in Parallel Tempering we proposed the current state
of any other chain.]
The proposal is reduced to points with roughly similar values
of π(x) (hence “equi-energy”).
29. Equi-energy sampler
Introduce a sequence ξ0 = 0 < ξ1 < . . . < ξS = +∞ cutting
the density axis R+ into S intervals.
Introduce H(x, y) such that H(x, y) = 1 if π(x) and π(y) are
in the same interval [ξk, ξk+1), and 0 otherwise.
Introduce the proposal distribution given X^n_t and
θ = {X^{n−1}_k}1≤k≤t:
gθ(X^n_t, dy) ∝ Σ_{k=1}^{t} H(X^{n−1}_k, X^n_t) δ_{X^{n−1}_k}(dy).
Then the proposed point y is accepted with probability
min ( 1, π^{γn−γn−1}(y) / π^{γn−γn−1}(X^n_t) )
(similar to the swap acceptance probability in Parallel Tempering)
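One equi-energy move can be sketched as below. For convenience the rings are expressed as intervals of log π values (equivalent to cutting the density axis at the ξk after a log transform); the history, ring boundaries and temperatures in the example call are illustrative assumptions.

```python
import bisect
import math
import random

def equi_energy_move(x, history, log_pi, log_rings, g_n, g_prev, rng):
    """One equi-energy move for the chain targeting pi^{g_n}.

    log_rings holds the (sorted) log-density cutpoints defining the rings.
    """
    ring = lambda z: bisect.bisect_right(log_rings, log_pi(z))
    # Restrict the proposal to past states of the chain below that lie
    # in the same energy ring as the current state x:
    candidates = [y for y in history if ring(y) == ring(x)]
    if not candidates:
        return x                      # no equi-energy point available
    y = rng.choice(candidates)
    # Accept w.p. min(1, pi^{g_n - g_prev}(y) / pi^{g_n - g_prev}(x)):
    log_alpha = (g_n - g_prev) * (log_pi(y) - log_pi(x))
    return y if rng.random() < math.exp(min(0.0, log_alpha)) else x

# Illustrative call: history of the chain below, two ring cutpoints.
rng = random.Random(0)
log_pi = lambda z: -0.5 * z * z
x_new = equi_energy_move(0.0, [-2.0, 0.1, 3.0], log_pi,
                         log_rings=[-4.0, -1.0], g_n=1.0, g_prev=0.5, rng=rng)
```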
30. Outline
1 MCMC and multimodal target distributions
2 Parallel MCMC, tempering and equi-energy moves
3 Wang–Landau algorithm
31. Wang–Landau algorithm
The main idea is to force the chain to avoid regions that have
already been visited.
The concept of region is formalized by a partition of the state
space.
The self-avoiding effect is achieved by an adaptation of the
transition kernel.
Determining the density of states for classical statistical
models: A random walk algorithm to produce a flat
histogram, F. Wang and D. Landau, Physical Review E 2001.
32. Wang–Landau algorithm
Partition the state space:
X = ∪_{i=1}^{d} Xi.
Desired frequencies of visit:
ϕ = (ϕ1, . . . , ϕd) such that Σ_{i=1}^{d} ϕi = 1.
34. Wang–Landau algorithm
Penalized distribution for any θ = (θ1, . . . , θd):
πθ(x) ∝ π(x) / θ(J(x)),
where J(x) is such that x ∈ X_{J(x)}.
There is a θ⋆ such that:
∀i ∈ {1, . . . , d}, ∫_{Xi} πθ⋆(x) dx = ϕi,
i.e. πθ⋆ gives the desired mass ϕi to each bin Xi.
These ideal penalties θ⋆ are not available.
37. Wang–Landau algorithm
Algorithm 2 Wang-Landau with deterministic schedule (ηt)
1: Init θ0 > 0, X0 ∈ X.
2: for t = 1 to T do
3: Sample Xt from Kθt−1 (Xt−1, ·), MH kernel targeting πθt−1 .
4: Update the penalties:
log θt(i) ← log θt−1(i) + ηt (1_{Xi}(Xt) − ϕi)
5: end for
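Algorithm 2 can be sketched as follows. The bimodal target, the two bins split at x = 3, the random-walk kernel, and the schedule ηt = 1/t are all illustrative assumptions, with ϕi = 1/d as on the next slide.

```python
import math
import random

def wang_landau(log_pi, which_bin, d, x0, sigma, T, seed=0):
    """Wang-Landau with deterministic schedule eta_t = 1/t and phi_i = 1/d.

    The MH step targets the penalized density pi_theta(x) = pi(x)/theta(J(x)).
    """
    rng = random.Random(seed)
    log_theta = [0.0] * d
    x, chain = x0, [x0]
    for t in range(1, T + 1):
        x_star = x + rng.gauss(0.0, sigma)
        # MH acceptance for the penalized target:
        log_alpha = (log_pi(x_star) - log_theta[which_bin(x_star)]) \
                  - (log_pi(x) - log_theta[which_bin(x)])
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = x_star
        # Penalty update: log theta_t(i) += eta_t * (1_{X_i}(x_t) - phi_i).
        j, eta_t = which_bin(x), 1.0 / t
        for i in range(d):
            log_theta[i] += eta_t * ((1.0 if i == j else 0.0) - 1.0 / d)
        chain.append(x)
    return chain, log_theta

# Two modes (at 0 and 6) and two bins split at x = 3 (assumptions):
log_pi = lambda x: math.log(math.exp(-0.5 * x ** 2)
                            + math.exp(-0.5 * (x - 6.0) ** 2))
chain, log_theta = wang_landau(log_pi, lambda x: 0 if x < 3.0 else 1,
                               d=2, x0=0.0, sigma=2.0, T=2000)
```

Because the update penalizes the bin currently occupied, the chain is pushed out of regions it has already visited, which is exactly the self-avoiding effect described above.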
38. Wang–Landau algorithm
If ηt → 0 “fast enough”, (θt)t≥0 converges.
If ϕi = 1/d for each bin i:
θt(i) → ∫_{Xi} π(x) dx =: ψi as t → ∞,
at least up to a multiplicative constant.
(Xt)t≥0 is asymptotically distributed according to πθ⋆.
Convergence of the Wang-Landau algorithm,
G. Fort, B. Jourdain, E. Kuhn, T. Lelievre, G. Stoltz
2012, on arXiv.
39. Wang–Landau algorithm
Choice of (ηt) can have a huge impact on the results.
Define the counters:
νt(i) := Σ_{n=1}^{t} 1_{Xi}(Xn).
Flat Histogram (FH) is reached when:
max_{i∈{1,...,d}} |νt(i)/t − ϕi| < c
for some c > 0.
Instead of decreasing (ηt) at each iteration, decrease only
when the Flat Histogram criterion is reached.
40. Wang–Landau algorithm
Algorithm 3 Wang-Landau with stochastic schedule (ηκt )
1: Init θ0 = 1, X0 ∈ X.
2: Init κ0 ← 0.
3: for t = 1 to T do
4: Sample Xt from Kθt−1 (Xt−1, ·), MH kernel targeting πθt−1 .
5: If (FH) then κt ← κt−1 + 1, otherwise κt ← κt−1.
6: Update the penalties:
log θt(i) ← log θt−1(i) + ηκt (1_{Xi}(Xt) − ϕi)
7: end for
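The Flat Histogram check driving κt in Algorithm 3 follows directly from its definition; the counts and the value c = 0.1 below are illustrative.

```python
def flat_histogram(counts, phi, c):
    """FH holds when max_i |nu_t(i)/t - phi_i| < c."""
    t = sum(counts)
    return max(abs(n / t - p) for n, p in zip(counts, phi)) < c

# kappa_t (hence the step size eta_{kappa_t}) only moves when FH holds:
assert flat_histogram([48, 52], [0.5, 0.5], c=0.1)        # flat enough
assert not flat_histogram([90, 10], [0.5, 0.5], c=0.1)    # still unbalanced
```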
41. Wang–Landau algorithm
To be sure that eventually, for any c > 0:
max_{i∈{1,...,d}} |νt(i)/t − ϕi| < c,
we have proved that, for any fixed η > 0:
∀i ∈ {1, . . . , d}, νt(i)/t → ϕi in probability as t → ∞,
which implies:
E [ inf { t ≥ 0 : ∀i ∈ {1, . . . , d}, |νt(i)/t − ϕi| < c } ] < ∞.
The Wang-Landau algorithm reaches the Flat Histogram
criterion in finite time, PJ & R. Ryder, AAP 2013.
42. Wang–Landau algorithm
N chains (X^{(1)}_t, . . . , X^{(N)}_t) using the same kernel Kθt
targeting πθt at time t.
The interaction is made through the common penalties (θt).
The update was
log θt(i) ← log θt−1(i) + η (1_{Xi}(Xt) − ϕi).
43. Wang–Landau algorithm
N chains (X^{(1)}_t, . . . , X^{(N)}_t) using the same kernel Kθt
targeting πθt at time t.
The interaction is made through the common penalties (θt).
The update is now
log θt(i) ← log θt−1(i) + η ( (1/N) Σ_{k=1}^{N} 1_{Xi}(X^{(k)}_t) − ϕi ).
44. Wang–Landau algorithm
Default choice of partition: along density values
(a one-dimensional quantity, whatever the dimension of X!).
Introduce a sequence ξ0 = 0 < ξ1 < . . . < ξd = +∞ cutting
the density axis R+ in d intervals.
Some sense of a good range can be grasped from pilot runs.
45. Wang–Landau algorithm
Figure : Wang-Landau on the moustarget distribution. Colours represent
the partition.
49. Wang–Landau algorithm
In some situations we might have some intuition on the
direction along which the modes are spread.
For instance, if we knew that the modes of the moustarget
distribution were along the y-axis:
∀i ∈ {1, . . . , d}, Xi = R × (y_{i−1}, y_i),
with −∞ = y0 < y1 < . . . < yd = +∞.
50. Wang–Landau algorithm
Figure : Wang-Landau using the y-axis to partition the space.
53. Bibliography
The Wang-Landau algorithm in general state spaces:
applications and convergence analysis, Y. Atchadé and J.
Liu, Statistica Sinica 2010.
Determining the density of states for classical statistical
models: A random walk algorithm to produce a flat
histogram, F. Wang and D. Landau, Physical Review E 2001.
An Adaptive Interacting Wang-Landau Algorithm for
Automatic Density Exploration, L. Bornn, PJ, P. Del Moral,
A. Doucet, JCGS 2013.
The Wang-Landau algorithm reaches the Flat Histogram
criterion in finite time, PJ & R. Ryder, AAP 2013.
Adaptive Equi-Energy Sampler : Convergence and
Illustration, Schreck, Fort, Moulines, 2013.
Efficiency of the Wang-Landau algorithm: a simple test
case, G. Fort, B. Jourdain, E. Kuhn, T. Lelievre, G. Stoltz,
2014.