# Unbiased MCMC with couplings

Talk given at ML Oxford on July 19, 2019.

### Unbiased MCMC with couplings

1. Unbiased MCMC with couplings. Pierre E. Jacob, Department of Statistics, Harvard University. Joint work with John O’Leary, Yves Atchadé, and other fantastic people acknowledged throughout. Department of Statistics, University of Oxford, July 19, 2019. Pierre E. Jacob Unbiased MCMC
2. Outline. (1) It is about Monte Carlo and bias; (2) Unbiased estimators from Markov kernels; (3) Design of coupled Markov chains; (4) It can fail; (5) Beyond parallel computation and burn-in.
3. Outline. (1) It is about Monte Carlo and bias; (2) Unbiased estimators from Markov kernels; (3) Design of coupled Markov chains; (4) It can fail; (5) Beyond parallel computation and burn-in.
4. Setting. Continuous or discrete space of dimension $d$. Target probability distribution $\pi$, with probability density function $x \mapsto \pi(x)$. Goal: approximate $\pi$, e.g. approximate $\mathbb{E}_\pi[h(X)] = \int h(x)\pi(x)\,dx = \pi(h)$, for a class of “test” functions $h$.
5. Markov chain Monte Carlo. Initially, $X_0 \sim \pi_0$, then $X_t \mid X_{t-1} \sim P(X_{t-1}, \cdot)$ for $t = 1, \ldots, T$. Estimator: $\frac{1}{T-b}\sum_{t=b+1}^{T} h(X_t)$, where $b$ iterations are discarded as burn-in. Might converge to $\mathbb{E}_\pi[h(X)]$ as $T \to \infty$ by the ergodic theorem. Biased for any fixed $b, T$, since $\pi_0 \neq \pi$. Averaging independent copies of such estimators for fixed $T, b$ would not provide a consistent estimator of $\mathbb{E}_\pi[h(X)]$ as the number of independent copies goes to infinity.
6. Example: Metropolis–Hastings kernel $P$. Initialize: $X_0 \sim \pi_0$. At each iteration $t \geq 0$, with Markov chain at state $X_t$: (1) propose $X^\star \sim k(X_t, \cdot)$, (2) sample $U \sim \mathcal{U}([0, 1])$, (3) if $U \leq \frac{\pi(X^\star)\,k(X^\star, X_t)}{\pi(X_t)\,k(X_t, X^\star)}$, set $X_{t+1} = X^\star$, otherwise set $X_{t+1} = X_t$. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika, 1970.
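The three steps above can be sketched in a few lines. This is a minimal illustration, assuming a standard Normal target, a Gaussian random-walk proposal with standard deviation 0.5 and $\pi_0 = \mathcal{N}(10, 3^2)$ (the setting of the next two slides); the function names `log_target`, `mh_step`, `run_mh` and `burn_in_average` are hypothetical and not taken from the talk.

```python
import math
import random


def log_target(x):
    """Log-density of N(0, 1) up to an additive constant (assumed target)."""
    return -0.5 * x * x


def mh_step(x, rng, proposal_std=0.5):
    """One random-walk Metropolis-Hastings step from state x."""
    proposal = rng.gauss(x, proposal_std)
    # The Normal random-walk proposal is symmetric, so the proposal
    # densities k cancel in the acceptance ratio.
    log_ratio = log_target(proposal) - log_target(x)
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return proposal if math.log(u) <= log_ratio else x


def run_mh(n_iterations, rng, init_mean=10.0, init_std=3.0):
    """Run the chain from X_0 ~ N(init_mean, init_std^2); returns [X_0, ..., X_T]."""
    chain = [rng.gauss(init_mean, init_std)]
    for _ in range(n_iterations):
        chain.append(mh_step(chain[-1], rng))
    return chain


def burn_in_average(chain, b):
    """Standard MCMC estimator of E[X]: average of X_t for t = b+1, ..., T."""
    kept = chain[b + 1:]
    return sum(kept) / len(kept)
```

With a long run and a generous burn-in, `burn_in_average(run_mh(50000, random.Random(1)), 1000)` should land close to the target mean 0, but for any fixed `b` and `T` the estimator remains biased, which is the point of the slide.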
7. MCMC path. $\pi = \mathcal{N}(0, 1)$, MH with Normal proposal std = 0.5, $\pi_0 = \mathcal{N}(10, 3^2)$.
8. MCMC marginal distributions. $\pi = \mathcal{N}(0, 1)$, RWMH with Normal proposal std = 0.5, $\pi_0 = \mathcal{N}(10, 3^2)$.
9. Parallel computing. $R$ processors generate unbiased estimators, independently. The number of estimators completed by time $t$ is random. There are different ways of aggregating estimators, either discarding on-going calculations at time $t$ or not. Different regimes can be considered, e.g. as $R \to \infty$ for fixed $t$, as $t \to \infty$ for fixed $R$, as $t$ and $R$ both go to infinity. Glynn & Heidelberger, Analysis of parallel replicated simulations under a completion time constraint, 1991.
10. Parallel computing
11. Outline. (1) It is about Monte Carlo and bias; (2) Unbiased estimators from Markov kernels; (3) Design of coupled Markov chains; (4) It can fail; (5) Beyond parallel computation and burn-in.
12. Coupled chains. Generate two Markov chains $(X_t)$ and $(Y_t)$ as follows: sample $X_0$ and $Y_0$ from $\pi_0$ (independently or not); sample $X_1 \mid X_0 \sim P(X_0, \cdot)$; for $t \geq 1$, sample $(X_{t+1}, Y_t) \mid (X_t, Y_{t-1}) \sim \bar{P}((X_t, Y_{t-1}), \cdot)$. $\bar{P}$ is such that $X_{t+1} \mid X_t \sim P(X_t, \cdot)$ and $Y_t \mid Y_{t-1} \sim P(Y_{t-1}, \cdot)$, so each chain marginally evolves according to $P$. $\Rightarrow$ $X_t$ and $Y_t$ have the same distribution for all $t \geq 0$. $\bar{P}$ is also such that $\exists \tau$ such that $X_\tau = Y_{\tau-1}$, and the chains are faithful, i.e. they stay together after meeting.
13. Debiasing idea (one slide version). Limit as a telescopic sum, for all $k \geq 0$: $\mathbb{E}_\pi[h(X)] = \lim_{t\to\infty} \mathbb{E}[h(X_t)] = \mathbb{E}[h(X_k)] + \sum_{t=k+1}^{\infty} \mathbb{E}[h(X_t) - h(X_{t-1})]$. Since for all $t \geq 0$, $X_t$ and $Y_t$ have the same distribution, $= \mathbb{E}[h(X_k)] + \sum_{t=k+1}^{\infty} \mathbb{E}[h(X_t) - h(Y_{t-1})]$. If we can swap expectation and limit, $= \mathbb{E}\left[h(X_k) + \sum_{t=k+1}^{\infty} (h(X_t) - h(Y_{t-1}))\right]$, so $h(X_k) + \sum_{t=k+1}^{\infty} (h(X_t) - h(Y_{t-1}))$ is unbiased.
14. Unbiased estimators. Unbiased estimator is given by $H_k(X, Y) = h(X_k) + \sum_{t=k+1}^{\tau-1} (h(X_t) - h(Y_{t-1}))$, with the convention $\sum_{t=k+1}^{\tau-1}\{\cdot\} = 0$ if $\tau - 1 < k + 1$. $h(X_k)$ alone is biased; the other terms correct for the bias. Cost: $\tau - 1$ calls to $\bar{P}$ and $1 + \max(0, k - \tau)$ calls to $P$. Glynn & Rhee, Exact estimation for Markov chain equilibrium expectations, 2014. Note: same reasoning would work with arbitrary lags $L \geq 1$.
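To make the indexing of $H_k$ concrete, here is a small sketch that evaluates the estimator and its stated cost from stored trajectories. It assumes the chains have already been simulated and that `xs[t]` holds $X_t$ and `ys[t]` holds $Y_t$; the function names and the toy trajectories are hypothetical and chosen only so the arithmetic can be checked by hand.

```python
def unbiased_estimate(h, xs, ys, k, tau):
    """H_k = h(X_k) + sum_{t=k+1}^{tau-1} (h(X_t) - h(Y_{t-1})).

    xs[t] holds X_t, ys[t] holds Y_t, and tau is the meeting time,
    i.e. the first t such that X_t == Y_{t-1}.
    """
    estimate = h(xs[k])
    # range() is empty when tau - 1 < k + 1, matching the slide's convention.
    for t in range(k + 1, tau):
        estimate += h(xs[t]) - h(ys[t - 1])
    return estimate


def kernel_calls(k, tau):
    """Cost stated on the slide, as (calls to P-bar, calls to P)."""
    return tau - 1, 1 + max(0, k - tau)


# Toy trajectories (made up for illustration): X_3 == Y_2, so tau = 3.
toy_xs = [4.0, 3.0, 2.0, 1.0, 0.5]
toy_ys = [6.0, 5.0, 1.0, 0.5]
toy_tau = 3
```

With $h$ the identity, `unbiased_estimate(lambda x: x, toy_xs, toy_ys, 0, toy_tau)` is $4 + (3 - 6) + (2 - 5) = -2$, while for $k \geq \tau$ the correction sum is empty and the estimate is simply $h(X_k)$.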
15. Conditions. Jacob, O’Leary, Atchadé, Unbiased MCMC with couplings. (1) Marginal chain converges: $\mathbb{E}[h(X_t)] \to \mathbb{E}_\pi[h(X)]$, and $h(X_t)$ has $(2 + \eta)$-finite moments for all $t$. (2) Meeting time $\tau$ has geometric tails: $\exists C < +\infty$, $\exists \delta \in (0, 1)$, $\forall t \geq 0$, $\mathbb{P}(\tau > t) \leq C\delta^t$. (3) Chains stay together: $X_t = Y_{t-1}$ for all $t \geq \tau$. Condition 2 is itself implied by e.g. a geometric drift condition. Under these conditions, $H_k(X, Y)$ is unbiased, has finite expected cost and finite variance, for all $k$.
16. Conditions: update. Middleton, Deligiannidis, Doucet, Jacob, Unbiased MCMC for intractable target distributions. (1) Marginal chain converges: $\mathbb{E}[h(X_t)] \to \mathbb{E}_\pi[h(X)]$, and $h(X_t)$ has $(2 + \eta)$-finite moments for all $t$. (2) Meeting time $\tau$ has polynomial tails: $\exists C < +\infty$, $\exists \delta > 2(2\eta^{-1} + 1)$, $\forall t \geq 0$, $\mathbb{P}(\tau > t) \leq C t^{-\delta}$. (3) Chains stay together: $X_t = Y_{t-1}$ for all $t \geq \tau$. Condition 2 is itself implied by e.g. a polynomial drift condition.
17. Improved unbiased estimators. Efficiency matters, thus in practice we recommend a variation of the previous estimator, defined for integers $k \leq m$ as $H_{k:m}(X, Y) = \frac{1}{m-k+1} \sum_{t=k}^{m} H_t(X, Y)$, which can also be written $\frac{1}{m-k+1} \sum_{t=k}^{m} h(X_t) + \sum_{t=k+1}^{\tau-1} \min\left(1, \frac{t-k}{m-k+1}\right) (h(X_t) - h(Y_{t-1}))$, i.e. standard MCMC average + bias correction term. As $k \to \infty$, bias correction is zero with increasing probability. Note: changing the lag is another way of modifying efficiency while preserving the lack of bias.
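The equivalence of the two expressions for $H_{k:m}$ can be checked numerically. The sketch below is only an illustration: it assumes stored trajectories with `xs[t]` $= X_t$ and `ys[t]` $= Y_t$, and the function names are hypothetical.

```python
def h_single(h, xs, ys, k, tau):
    """H_k = h(X_k) + sum_{t=k+1}^{tau-1} (h(X_t) - h(Y_{t-1}))."""
    value = h(xs[k])
    for t in range(k + 1, tau):
        value += h(xs[t]) - h(ys[t - 1])
    return value


def time_averaged_direct(h, xs, ys, k, m, tau):
    """H_{k:m} computed as the plain average of H_k, ..., H_m."""
    total = sum(h_single(h, xs, ys, s, tau) for s in range(k, m + 1))
    return total / (m - k + 1)


def time_averaged_split(h, xs, ys, k, m, tau):
    """H_{k:m} as (MCMC average, bias correction); their sum is the estimator."""
    n = m - k + 1
    mcmc_average = sum(h(xs[t]) for t in range(k, m + 1)) / n
    bias_correction = sum(
        min(1.0, (t - k) / n) * (h(xs[t]) - h(ys[t - 1]))
        for t in range(k + 1, tau)
    )
    return mcmc_average, bias_correction
```

The weight $\min(1, (t-k)/(m-k+1))$ arises because the term $h(X_t) - h(Y_{t-1})$ appears in $H_s$ for every $s \in \{k, \ldots, \min(m, t-1)\}$, that is in $\min(m-k+1, t-k)$ of the averaged estimators.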
18. Efficiency. Writing $H_{k:m}(X, Y) = \mathrm{MCMC}_{k:m} + \mathrm{BC}_{k:m}$, then, denoting the MSE of $\mathrm{MCMC}_{k:m}$ by $\mathrm{MSE}_{k:m}$, $\mathbb{V}[H_{k:m}(X, Y)] \leq \mathrm{MSE}_{k:m} + 2\sqrt{\mathrm{MSE}_{k:m}}\sqrt{\mathbb{E}[\mathrm{BC}_{k:m}^2]} + \mathbb{E}[\mathrm{BC}_{k:m}^2]$. Under geometric drift condition and regularity assumptions on $h$, for some $\delta < 1$, $C < \infty$, $\mathbb{E}[\mathrm{BC}_{k:m}^2] \leq \frac{C\delta^k}{(m-k+1)^2}$, and $\delta$ is directly related to tails of the meeting time. Similarly under polynomial drift, we obtain a term of the form $k^{-\delta}$ in the numerator instead of $\delta^k$.
19. Signed measure estimator. Replacing function evaluations by delta masses leads to $\hat{\pi}(\cdot) = \frac{1}{m-k+1} \sum_{t=k}^{m} \delta_{X_t}(\cdot) + \sum_{t=k+1}^{\tau-1} \min\left(1, \frac{t-k}{m-k+1}\right) (\delta_{X_t}(\cdot) - \delta_{Y_{t-1}}(\cdot))$, which is of the form $\hat{\pi}(\cdot) = \sum_{n=1}^{N} \omega_n \delta_{Z_n}(\cdot)$, where $\sum_{n=1}^{N} \omega_n = 1$ but some $\omega_n$ might be negative. Unbiasedness reads: $\mathbb{E}[\mathbb{E}_{\hat{\pi}}[h(X)]] = \mathbb{E}_\pi[h(X)]$.
20. Assessing convergence of MCMC. Total variation distance between $X_k \sim \pi_k$ and $\pi = \lim_{k\to\infty} \pi_k$: $\|\pi_k - \pi\|_{\mathrm{TV}} = \frac{1}{2} \sup_{h : |h| \leq 1} |\mathbb{E}[h(X_k)] - \mathbb{E}_\pi[h(X)]| = \frac{1}{2} \sup_{h : |h| \leq 1} \left|\mathbb{E}\left[\sum_{t=k+1}^{\tau-1} h(X_t) - h(Y_{t-1})\right]\right| \leq \mathbb{E}[\max(0, \tau - k - 1)]$. [Figure: density of the meeting time; upper bound against $k$.]
21. Assessing convergence of MCMC. With $L$-lag couplings, $\tau^{(L)} = \inf\{t \geq L : X_t = Y_{t-L}\}$, $d_{\mathrm{TV}}(\pi_k, \pi) \leq \mathbb{E}\left[0 \vee \lceil (\tau^{(L)} - L - k)/L \rceil\right]$, $d_{W}(\pi_k, \pi) \leq \mathbb{E}\left[\sum_{j=1}^{\lceil (\tau^{(L)} - L - k)/L \rceil} \|X_{k+jL} - Y_{k+(j-1)L}\|_1\right]$. [Figure: $d_{\mathrm{TV}}$ against iterations, for SSG and PT.] Biswas & Jacob, Estimating Convergence of Markov chains with L-Lag Couplings, 2019.
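The total variation bounds on the last two slides are expectations over the meeting time, so they can be approximated by averaging over independently sampled meeting times. The sketch below assumes such a sample is already available as a list of integers; the function name `tv_upper_bound` and the numbers in the usage note are hypothetical.

```python
def tv_upper_bound(meeting_times, k, lag=1):
    """Monte Carlo estimate of E[max(0, ceil((tau - lag - k) / lag))].

    meeting_times: independent draws of the lag-`lag` meeting time tau.
    With lag=1 this reduces to the average of max(0, tau - k - 1).
    """
    total = 0
    for tau in meeting_times:
        # Integer ceiling division, exact for negative numerators too.
        steps = -((-(tau - lag - k)) // lag)
        total += max(0, steps)
    # The estimate can exceed 1, in which case the bound is uninformative
    # because a total variation distance never exceeds 1.
    return total / len(meeting_times)
```

For example, with made-up meeting times `[3, 5, 10, 4]` and lag 1, the estimate is 4.5 at $k = 0$, 1.25 at $k = 4$ and 0 at $k = 9$; it decreases in $k$ and vanishes once $k$ exceeds every observed meeting time.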
22. Outline. (1) It is about Monte Carlo and bias; (2) Unbiased estimators from Markov kernels; (3) Design of coupled Markov chains; (4) It can fail; (5) Beyond parallel computation and burn-in.
23. Designing coupled chains. To implement the proposed unbiased estimators, we need to sample from a Markov kernel $\bar{P}$ such that, when $(X_{t+1}, Y_t)$ is sampled from $\bar{P}((X_t, Y_{t-1}), \cdot)$: marginally $X_{t+1} \mid X_t \sim P(X_t, \cdot)$ and $Y_t \mid Y_{t-1} \sim P(Y_{t-1}, \cdot)$; it is possible that $X_{t+1} = Y_t$ exactly for some $t \geq 0$; if $X_t = Y_{t-1}$, then $X_{t+1} = Y_t$ almost surely.
24. Couplings of MCMC algorithms. Many practical couplings in the literature... Propp & Wilson, Exact sampling with coupled Markov chains and applications to statistical mechanics, Random Structures & Algorithms, 1996. Johnson, Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths, JASA, 1996. Neal, Circularly-coupled Markov chain sampling, UoT tech report, 1999. Pinto & Neal, Improving Markov chain Monte Carlo estimators by coupling to an approximating chain, UoT tech report, 2001. Glynn & Rhee, Exact estimation for Markov chain equilibrium expectations, Journal of Applied Probability, 2014.
25. Couplings. $(X, Y)$ follows a coupling of $p$ and $q$ if $X \sim p$ and $Y \sim q$. The coupling inequality states that $\mathbb{P}(X = Y) \leq 1 - \|p - q\|_{\mathrm{TV}}$, for any coupling, with $\|p - q\|_{\mathrm{TV}} = \frac{1}{2} \int |p(x) - q(x)|\,dx$. Maximal couplings achieve the bound.
26. Maximal coupling of Gamma and Normal
27. Maximal coupling: sampling algorithm. Requires: evaluations of $p$ and $q$, sampling from $p$ and $q$. (1) Sample $X \sim p$ and $W \sim \mathcal{U}([0, 1])$. If $W \leq q(X)/p(X)$, output $(X, X)$. (2) Otherwise, sample $Y^\star \sim q$ and $W^\star \sim \mathcal{U}([0, 1])$ until $W^\star > p(Y^\star)/q(Y^\star)$, and output $(X, Y^\star)$. Output: a pair $(X, Y)$ such that $X \sim p$, $Y \sim q$ and $\mathbb{P}(X = Y)$ is maximal.
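The two-step algorithm above can be sketched as follows for the special case of two Normal distributions with a common standard deviation. That special case is an assumption made for illustration (the slide's algorithm applies to general $p$ and $q$), and the function names are hypothetical. For equal-variance Normals the overlap $1 - \|p - q\|_{\mathrm{TV}}$ has the closed form $2\Phi(-|\mu_p - \mu_q| / (2\sigma))$, which gives something to compare the empirical meeting frequency against.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sample_maximal_coupling(mean_p, mean_q, std, rng):
    """Return (X, Y) with X ~ p = N(mean_p, std^2) and Y ~ q = N(mean_q, std^2)."""
    x = rng.gauss(mean_p, std)
    # Step 1: output (X, X) if W <= q(X) / p(X), written without a division.
    if rng.random() * normal_pdf(x, mean_p, std) <= normal_pdf(x, mean_q, std):
        return x, x
    # Step 2: propose Y from q until W > p(Y) / q(Y).
    while True:
        y = rng.gauss(mean_q, std)
        if rng.random() * normal_pdf(y, mean_q, std) > normal_pdf(y, mean_p, std):
            return x, y


def normal_overlap(mean_p, mean_q, std):
    """1 - TV distance between two Normals sharing the same std."""
    z = -abs(mean_p - mean_q) / (2.0 * std)
    return 2.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Drawing many pairs with, say, `mean_p=0.0`, `mean_q=1.0`, `std=1.0`, the fraction of pairs with `x == y` should be close to `normal_overlap(0.0, 1.0, 1.0)`, roughly 0.62, while the two coordinates keep their respective marginal means.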
28. Back to Metropolis–Hastings (kernel $P$). At each iteration $t$, Markov chain at state $X_t$: (1) propose $X^\star \sim k(X_t, \cdot)$, (2) sample $U \sim \mathcal{U}([0, 1])$, (3) if $U \leq \frac{\pi(X^\star)\,k(X^\star, X_t)}{\pi(X_t)\,k(X_t, X^\star)}$, set $X_{t+1} = X^\star$, otherwise set $X_{t+1} = X_t$. How to propagate two MH chains from states $X_t$ and $Y_{t-1}$ such that $\{X_{t+1} = Y_t\}$ can happen?
29. Coupling of Metropolis–Hastings (kernel $\bar{P}$). At each iteration $t$, two Markov chains at states $X_t$, $Y_{t-1}$: (1) propose $(X^\star, Y^\star)$ from max coupling of $k(X_t, \cdot)$, $k(Y_{t-1}, \cdot)$, (2) sample $U \sim \mathcal{U}([0, 1])$, (3) if $U \leq \frac{\pi(X^\star)\,k(X^\star, X_t)}{\pi(X_t)\,k(X_t, X^\star)}$, set $X_{t+1} = X^\star$, otherwise set $X_{t+1} = X_t$; if $U \leq \frac{\pi(Y^\star)\,k(Y^\star, Y_{t-1})}{\pi(Y_{t-1})\,k(Y_{t-1}, Y^\star)}$, set $Y_t = Y^\star$, otherwise set $Y_t = Y_{t-1}$. The same $U$ is used for both chains.
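Putting the coupled kernel together with the chain construction from the "Coupled chains" slide gives the following sketch. It assumes a standard Normal target, a Gaussian random-walk proposal with standard deviation 0.5 and $\pi_0 = \mathcal{N}(10, 3^2)$; all function names are hypothetical, and plain densities are used for brevity where a more careful implementation would probably work with log-densities.

```python
import math
import random


def log_target(x):
    """Log-density of N(0, 1) up to a constant (assumed target)."""
    return -0.5 * x * x


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def max_coupled_proposals(x, y, std, rng):
    """Maximal coupling of N(x, std^2) and N(y, std^2)."""
    x_prop = rng.gauss(x, std)
    if rng.random() * normal_pdf(x_prop, x, std) <= normal_pdf(x_prop, y, std):
        return x_prop, x_prop
    while True:
        y_prop = rng.gauss(y, std)
        if rng.random() * normal_pdf(y_prop, y, std) > normal_pdf(y_prop, x, std):
            return x_prop, y_prop


def mh_step(x, rng, std=0.5):
    """Single-chain kernel P (symmetric proposal, so k cancels)."""
    proposal = rng.gauss(x, std)
    log_u = math.log(1.0 - rng.random())
    return proposal if log_u <= log_target(proposal) - log_target(x) else x


def coupled_mh_step(x, y, rng, std=0.5):
    """Coupled kernel P-bar: coupled proposals and one common uniform."""
    x_prop, y_prop = max_coupled_proposals(x, y, std, rng)
    log_u = math.log(1.0 - rng.random())
    x_next = x_prop if log_u <= log_target(x_prop) - log_target(x) else x
    y_next = y_prop if log_u <= log_target(y_prop) - log_target(y) else y
    return x_next, y_next


def run_coupled_chains(rng, extra_steps=10, max_iterations=100_000):
    """Return (xs, ys, tau) with xs[t] = X_t, ys[t] = Y_t, X_tau == Y_{tau-1}."""
    xs = [rng.gauss(10.0, 3.0)]      # X_0 ~ pi_0
    ys = [rng.gauss(10.0, 3.0)]      # Y_0 ~ pi_0, independent of X_0
    xs.append(mh_step(xs[0], rng))   # X_1 | X_0 ~ P
    tau = None
    steps_after_meeting = 0
    for _ in range(max_iterations):
        if tau is not None and steps_after_meeting >= extra_steps:
            break
        x_next, y_next = coupled_mh_step(xs[-1], ys[-1], rng)  # (X_{t+1}, Y_t)
        xs.append(x_next)
        ys.append(y_next)
        if tau is None:
            if x_next == y_next:
                tau = len(xs) - 1    # index of the first X_t equal to Y_{t-1}
        else:
            steps_after_meeting += 1
    if tau is None:
        raise RuntimeError("chains did not meet within max_iterations")
    return xs, ys, tau
```

Once the two states are equal, the two proposal distributions coincide, the maximal coupling returns identical proposals, and the shared uniform yields identical accept/reject decisions, so the chains stay together as required.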
30. Coupling of Gibbs. Gibbs sampler: update component $i$ of the chain, leaving $\pi(dx^i \mid x^1, \ldots, x^{i-1}, x^{i+1}, \ldots, x^d)$ invariant. For instance, we can propose $X^\star \sim k(X_t^i, \cdot)$ to replace $X_t^i$, and accept or not with a Metropolis–Hastings step. These proposals can be maximally coupled across two chains, at each component update. The chains meet when all components have met. Likewise we can couple parallel tempering chains, and meeting occurs when entire ensembles of chains meet.
31. Hamiltonian Monte Carlo. Introduce potential energy $U(q) = -\log \pi(q)$, and total energy $E(q, p) = U(q) + \frac{1}{2}|p|^2$. Hamiltonian dynamics for $(q(s), p(s))$, where $s \geq 0$: $\frac{d}{ds} q(s) = \nabla_p E(q(s), p(s))$, $\frac{d}{ds} p(s) = -\nabla_q E(q(s), p(s))$. Solving Hamiltonian dynamics exactly is not feasible, but discretization + Metropolis–Hastings correction ensure that $\pi$ remains invariant. Common random numbers can make two HMC chains contract, under assumptions on the target such as strong log-concavity. Heng & Jacob, Unbiased HMC with couplings, Biometrika, 2019.
32. Coupling of HMC. Figure 2 of Mangoubi & Smith, Rapid mixing of HMC on strongly log-concave distributions, 2017. Coupling two copies $X_1, X_2, \ldots$ (blue) and $Y_1, Y_2, \ldots$ (green) of HMC by choosing the same momentum $p_i$ at every step. See also Bou-Rabee, Eberle & Zimmer, Coupling and Convergence for Hamiltonian Monte Carlo, 2018.
33. Particle Independent Metropolis–Hastings. Initialization: run SMC, obtain $\hat{\pi}^{(0)}(\cdot)$, $\hat{Z}^{(0)}$, where $\mathbb{E}[\hat{Z}^{(0)}] = Z$, the normalizing constant of $\pi$. At each iteration $t$, given $(\hat{\pi}^{(t)}(\cdot), \hat{Z}^{(t)})$: (1) run SMC, obtain $(\pi^\star(\cdot), Z^\star)$, (2) sample $U \sim \mathcal{U}([0, 1])$, (3) if $U \leq Z^\star / \hat{Z}^{(t)}$, set $(t+1)$-th state to $(\pi^\star(\cdot), Z^\star)$, otherwise set $(t+1)$-th state to $(\hat{\pi}^{(t)}(\cdot), \hat{Z}^{(t)})$. For any number of particles, $T^{-1} \sum_{t=1}^{T} \hat{\pi}^{(t)}(\cdot) \to \pi$ as $T \to \infty$. Andrieu, Doucet, Holenstein, Particle MCMC, JRSS B, 2010.
34. Coupled Particle Independent Metropolis–Hastings. At each iteration $t$, given $(\hat{\pi}^{(t)}(\cdot), \hat{Z}^{(t)})$ and $(\tilde{\pi}^{(t-1)}(\cdot), \tilde{Z}^{(t-1)})$: (1) run SMC, obtain $(\pi^\star(\cdot), Z^\star)$, (2) sample $U \sim \mathcal{U}([0, 1])$, (3) if $U \leq Z^\star / \hat{Z}^{(t)}$, set $(t+1)$-th “1st” state to $(\pi^\star(\cdot), Z^\star)$, otherwise set $(t+1)$-th “1st” state to $(\hat{\pi}^{(t)}(\cdot), \hat{Z}^{(t)})$, (4) if $U \leq Z^\star / \tilde{Z}^{(t-1)}$, set $t$-th “2nd” state to $(\pi^\star(\cdot), Z^\star)$, otherwise set $t$-th “2nd” state to $(\tilde{\pi}^{(t-1)}(\cdot), \tilde{Z}^{(t-1)})$.
35. Unbiased SMC. PIMH turns any SMC sampler into an MCMC scheme that can be debiased using the coupled chains machinery. Thus any MCMC algorithm that can be plugged into an SMC sampler can subsequently be debiased. No need for couplings of the MCMC chains themselves. Middleton, Deligiannidis, Doucet, Jacob, Unbiased Smoothing using Particle Independent Metropolis-Hastings, AISTATS, 2019.
36. Outline. (1) It is about Monte Carlo and bias; (2) Unbiased estimators from Markov kernels; (3) Design of coupled Markov chains; (4) It can fail; (5) Beyond parallel computation and burn-in.
37. Bimodal target. Target is mixture of univariate Normal distributions: $\pi = 0.5 \cdot \mathcal{N}(-4, 1) + 0.5 \cdot \mathcal{N}(+4, 1)$. MCMC: random walk Metropolis–Hastings, with proposal standard deviation $\sigma$ that will vary. Initial distribution $\pi_0$ will vary.
38. Bimodal target. With $\sigma = 3$, $\pi_0 = \mathcal{N}(10, 10^2)$... [Figure: density against $x$.]
39. Bimodal target. With $\sigma = 1$, $\pi_0 = \mathcal{N}(10, 10^2)$... [Figure: density against $x$.]
40. Bimodal target. With $\sigma = 1$, $\pi_0 = \mathcal{N}(10, 1^2)$... [Figure: density against $x$.]
41. Outline. (1) It is about Monte Carlo and bias; (2) Unbiased estimators from Markov kernels; (3) Design of coupled Markov chains; (4) It can fail; (5) Beyond parallel computation and burn-in.
42. Unbiased property. Unbiased MCMC estimators can be used to tackle problems for which standard MCMC methods are ill-suited. For instance, problems where one needs to approximate $\int \left( \int h(x_1, x_2)\,\pi_2(x_2 \mid x_1)\,dx_2 \right) \pi_1(x_1)\,dx_1$, or equivalently $\mathbb{E}_1[\mathbb{E}_2[h(X_1, X_2) \mid X_1]]$.
43. Example 1: cut distribution. First model, parameter $\theta_1$, data $Y_1$, posterior $\pi_1(\theta_1 \mid Y_1)$. Second model, parameter $\theta_2$, data $Y_2$, conditional posterior: $\pi_2(\theta_2 \mid Y_2, \theta_1) = \frac{p_2(\theta_2 \mid \theta_1)\,p_2(Y_2 \mid \theta_1, \theta_2)}{p_2(Y_2 \mid \theta_1)}$. Plummer, Cuts in Bayesian graphical models, 2015. Jacob, Murray, Holmes, Robert, Better together? Statistical learning in models made of modules.
44. Example: epidemiological study. Model of virus prevalence: $\forall i = 1, \ldots, I$, $Z_i \sim \mathrm{Binomial}(N_i, \varphi_i)$, where $Z_i$ is the number of women infected with high-risk HPV in a sample of size $N_i$ in country $i$. Impact of prevalence on cervical cancer occurrence: $\forall i = 1, \ldots, I$, $Y_i \sim \mathrm{Poisson}(\lambda_i T_i)$, $\log(\lambda_i) = \theta_{2,1} + \theta_{2,2}\varphi_i$, where $Y_i$ is the number of cancer cases arising from $T_i$ woman-years of follow-up in country $i$. Plummer, Cuts in Bayesian graphical models, 2015.
45. Example: epidemiological study. Approximations of the cut distribution marginals. [Figure: marginal densities of $\theta_{2,1}$ and $\theta_{2,2}$.] Red curves were obtained by long MCMC run targeting $\pi_2(\theta_2 \mid Y_2, \theta_1)$, for draws $\theta_1$ from first posterior $\pi_1(\theta_1 \mid Y_1)$. Black bars were obtained with unbiased MCMC.
46. Example 2: normalizing constants. We might be interested in normalizing constant $Z$ of $\pi(x) \propto \exp(-U(x))$, defined as $Z = \int \exp(-U(x))\,dx$. Introduce, $\forall \lambda \in [0, 1]$, $\pi_\lambda(x) \propto \exp(-U_\lambda(x))$. Thermodynamic integration or path sampling identity: $\log \frac{Z_1}{Z_0} = -\int_0^1 \frac{\int \nabla_\lambda U_\lambda(x)\,\pi_\lambda(dx)}{q(\lambda)}\,q(d\lambda)$, where $q(d\lambda)$ is any distribution supported on $[0, 1]$. Rischard, Jacob, Pillai, Unbiased estimation of log normalizing constants with applications to Bayesian cross-validation.
47. Example 3: Bayesian cross-validation. Data $y = \{y_1, \ldots, y_n\}$, made of $n$ units. Partition $y$ into training $T$ and validation $V$ sets. For each split $y = \{T, V\}$, assess predictive performance with $\log p(V \mid T) = \log \int p(V \mid \theta, T)\,p(d\theta \mid T)$. Note, $\log p(V \mid T)$ is a log-ratio of normalizing constants. Finally we might want to approximate $\mathrm{CV} = \binom{n}{n_T}^{-1} \sum_{T, V} \log p(V \mid T)$. Rischard, Jacob, Pillai, Unbiased estimation of log normalizing constants with applications to Bayesian cross-validation.
48. Discussion. Perfect samplers, designed to sample i.i.d. from $\pi$, would yield the same benefits and more. What about regeneration? Implications of lack of bias are numerous. Proposed estimators can be obtained using coupled MCMC kernels, or using the “MCMC → SMC → coupled PIMH” route. Cannot work if underlying MCMC doesn’t work... but parallelism allows for more expensive MCMC kernels. Thank you for listening!
49. References. With John O’Leary, Yves F. Atchadé: Unbiased Markov chain Monte Carlo with couplings, 2019. With Fredrik Lindsten, Thomas Schön: Smoothing with Couplings of Conditional Particle Filters, 2018. With Jeremy Heng: Unbiased Hamiltonian Monte Carlo with couplings, 2019. With Lawrence Middleton, George Deligiannidis, Arnaud Doucet: Unbiased Markov chain Monte Carlo for intractable target distributions, 2019; Unbiased Smoothing using Particle Independent Metropolis-Hastings, 2019.
50. References. With Maxime Rischard, Natesh Pillai: Unbiased estimation of log normalizing constants with applications to Bayesian cross-validation, 2019. With Niloy Biswas: Estimating Convergence of Markov chains with L-Lag Couplings, 2019. Funding provided by the National Science Foundation, grants DMS-1712872 and DMS-1844695.