These are the slides I gave during three lectures at the YES III meeting in Eindhoven, October 5-7, 2009. They include a presentation of the Savage-Dickey paradox and a comparison of major importance sampling solutions, both obtained in collaboration with Jean-Michel Marin.
On some computational methods for Bayesian model choice
Christian P. Robert
CREST-INSEE and Université Paris Dauphine
http://www.ceremade.dauphine.fr/~xian
YES III workshop, Eindhoven, October 7, 2009
Joint work with J.-M. Marin, M. Beaumont, N. Chopin, J.-M. Cornuet, and D. Wraith
Outline
1 Introduction
2 Importance sampling solutions
3 Cross-model solutions
4 Nested sampling
5 ABC model choice
Introduction
Bayes tests
Construction of Bayes tests
Definition (Test)
Given a hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a statistical model, a test is a statistical procedure that takes its values in {0, 1}.
Theorem (Bayes test)
The Bayes estimator associated with π and with the 0–1 loss is
$$\delta^\pi(x) = \begin{cases} 1 & \text{if } \pi(\theta \in \Theta_0 \mid x) > \pi(\theta \notin \Theta_0 \mid x), \\ 0 & \text{otherwise.} \end{cases}$$
Bayes factor
Definition (Bayes factors)
For testing hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0, under the prior
$$\pi(\Theta_0)\,\pi_0(\theta) + \pi(\Theta_0^c)\,\pi_1(\theta)\,,$$
the central quantity is
$$B_{01} = \frac{\pi(\Theta_0 \mid x)}{\pi(\Theta_0^c \mid x)} \bigg/ \frac{\pi(\Theta_0)}{\pi(\Theta_0^c)} = \frac{\int_{\Theta_0} f(x \mid \theta)\,\pi_0(\theta)\,d\theta}{\int_{\Theta_0^c} f(x \mid \theta)\,\pi_1(\theta)\,d\theta}$$
[Jeffreys, 1939]
Self-contained concept
Outside the decision-theoretic environment:
- eliminates the impact of π(Θ0), but depends on the choice of (π0, π1)
- Bayesian/marginal equivalent to the likelihood ratio
- Jeffreys' scale of evidence:
  - if $\log_{10}(B_{10}^\pi)$ is between 0 and 0.5, the evidence against H0 is weak,
  - if $\log_{10}(B_{10}^\pi)$ is between 0.5 and 1, the evidence is substantial,
  - if $\log_{10}(B_{10}^\pi)$ is between 1 and 2, the evidence is strong, and
  - if $\log_{10}(B_{10}^\pi)$ is above 2, the evidence is decisive
- Requires the computation of the marginal/evidence under both hypotheses/models
Model choice
Model choice and model comparison
Choice between models: several models are available for the same observation,
$$M_i : x \sim f_i(x \mid \theta_i)\,, \qquad i \in I\,,$$
where I can be finite or infinite.
Replace hypotheses with models, but keep marginal likelihoods and Bayes factors.
Bayesian model choice
Probabilise the entire model/parameter space:
- allocate probabilities $p_i$ to all models $M_i$
- define priors $\pi_i(\theta_i)$ for each parameter space $\Theta_i$
- compute
$$\pi(M_i \mid x) = \frac{p_i \int_{\Theta_i} f_i(x \mid \theta_i)\,\pi_i(\theta_i)\,d\theta_i}{\sum_j p_j \int_{\Theta_j} f_j(x \mid \theta_j)\,\pi_j(\theta_j)\,d\theta_j}$$
- take the largest $\pi(M_i \mid x)$ to determine the "best" model, or use the averaged predictive
$$\sum_j \pi(M_j \mid x) \int_{\Theta_j} f_j(x' \mid \theta_j)\,\pi_j(\theta_j \mid x)\,d\theta_j$$
Evidence
All these problems end up with a similar quantity, the evidence
$$Z_k = \int_{\Theta_k} \pi_k(\theta_k)\,L_k(\theta_k)\,d\theta_k\,,$$
aka the marginal likelihood.
Importance sampling solutions
Regular importance
Bayes factor approximation
When approximating the Bayes factor
$$B_{01} = \frac{\int_{\Theta_0} f_0(x \mid \theta_0)\,\pi_0(\theta_0)\,d\theta_0}{\int_{\Theta_1} f_1(x \mid \theta_1)\,\pi_1(\theta_1)\,d\theta_1}\,,$$
use of importance functions $\varphi_0$ and $\varphi_1$ and of
$$\widehat{B}_{01} = \frac{n_0^{-1} \sum_{i=1}^{n_0} f_0(x \mid \theta_0^i)\,\pi_0(\theta_0^i) \big/ \varphi_0(\theta_0^i)}{n_1^{-1} \sum_{i=1}^{n_1} f_1(x \mid \theta_1^i)\,\pi_1(\theta_1^i) \big/ \varphi_1(\theta_1^i)}\,, \qquad \theta_k^i \sim \varphi_k\,.$$
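To make the estimator concrete, here is a tiny R sketch (ours, not part of the original slides) for testing H0: μ = 0 versus H1: μ ∼ N(0, 1) with a single observation x ∼ N(μ, 1); since the null model has no free parameter, only the denominator calls for importance sampling, here with a normal importance function centred at x.

R code sketch
set.seed(1)
x=0.7; n1=1e5
phi1=rnorm(n1,mean=x)                                # sample from the importance function
w=dnorm(x,mean=phi1)*dnorm(phi1)/dnorm(phi1,mean=x)  # f(x|mu) pi1(mu) / phi1(mu)
B01=dnorm(x)/mean(w)                                 # numerator f(x|0) is available exactly
B01                                                  # compare with dnorm(x)/dnorm(x,sd=sqrt(2))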
Diabetes in Pima Indian women
Example (R benchmark)
“A population of women who were at least 21 years old, of Pima
Indian heritage and living near Phoenix (AZ), was tested for
diabetes according to WHO criteria. The data were collected by
the US National Institute of Diabetes and Digestive and Kidney
Diseases.”
200 Pima Indian women, with observed variables:
- plasma glucose concentration in oral glucose tolerance test
- diastolic blood pressure
- diabetes pedigree function
- presence/absence of diabetes
Probit modelling on Pima Indian women
Probability of diabetes as a function of the above variables:
$$\mathrm{P}(y = 1 \mid x) = \Phi(x_1\beta_1 + x_2\beta_2 + x_3\beta_3)\,.$$
Test of H0 : β3 = 0 for the 200 observations of Pima.tr, based on a g-prior modelling:
$$\beta \sim \mathcal{N}_3\!\left(0,\, n\,(X^{\mathrm{T}}X)^{-1}\right)$$
MCMC 101 for probit models
Use of either a random walk proposal
$$\beta' = \beta + \varepsilon$$
in a Metropolis–Hastings algorithm (since the likelihood is available), or of a Gibbs sampler that takes advantage of the missing/latent variable
$$z \mid y, x, \beta \sim \mathcal{N}(x^{\mathrm{T}}\beta, 1)\;\mathbb{I}_{z \ge 0}^{\,y}\,\mathbb{I}_{z \le 0}^{\,1-y}$$
(since $\beta \mid y, X, z$ is then distributed as a multivariate normal)
[Gibbs three times faster]
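As an illustration, here is a minimal R sketch (ours, not verbatim from the slides) of this latent-variable Gibbs sampler under the g-prior $\beta \sim \mathcal{N}(0, n(X^{\mathrm{T}}X)^{-1})$; the name gibbsprobit matches the function called later in these slides, but this particular implementation is a reconstruction that only returns the β draws.

R code sketch
library(truncnorm)   # rtruncnorm, for the truncated normal draws of z
gibbsprobit=function(Niter,y,X){
  n=nrow(X); p=ncol(X)
  # posterior covariance of beta given z: prior precision (X'X)/n plus X'X
  Sigma=solve(crossprod(X)*(n+1)/n)
  beta=rep(0,p); betas=matrix(0,Niter,p)
  for (t in 1:Niter){
    m=X%*%beta
    # z_i restricted to [0,Inf) when y_i=1 and to (-Inf,0] when y_i=0
    z=rtruncnorm(n,a=ifelse(y==1,0,-Inf),b=ifelse(y==1,Inf,0),mean=m,sd=1)
    mu=Sigma%*%crossprod(X,z)           # posterior mean of beta given z
    beta=as.vector(mu+t(chol(Sigma))%*%rnorm(p))
    betas[t,]=beta
  }
  betas
}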
Diabetes in Pima Indian women
MCMC output for $\hat\beta_{\max} = 1.5$, $\hat\beta = 1.15$, $k = 40$, and 20,000 simulations.
[Figure: raw MCMC sequences and histograms of the simulated values]
Importance sampling for the Pima Indian dataset
Use of the importance function inspired from the MLE estimate distribution,
$$\beta \sim \mathcal{N}(\hat\beta, \hat\Sigma)$$
R importance sampling code (dmvlnorm denotes the log-density of the multivariate normal, e.g. dmvnorm(...,log=TRUE) from the mvtnorm package; the definition of model2, omitted in the slide, mirrors that of model1 with design matrix X2)
library(mvtnorm)   # rmvnorm
model1=summary(glm(y~-1+X1,family=binomial(link="probit")))
model2=summary(glm(y~-1+X2,family=binomial(link="probit")))
# draws from the normal importance functions centred at the MLEs
is1=rmvnorm(Niter,mean=model1$coeff[,1],sigma=2*model1$cov.unscaled)
is2=rmvnorm(Niter,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)
# importance sampling approximation of the Bayes factor
bfis=mean(exp(probitlpost(is1,y,X1)-dmvlnorm(is1,mean=model1$coeff[,1],
  sigma=2*model1$cov.unscaled)))/mean(exp(probitlpost(is2,y,X2)-
  dmvlnorm(is2,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)))
Diabetes in Pima Indian women
Comparison of the variation of the Bayes factor approximations, based on 100 replicas of 20,000 simulations each, for simulation from the prior and from the above importance sampler.
[Figure: boxplots of the estimates, "Monte Carlo" vs "Importance sampling"]
Bridge sampling
Special case: if
$$\pi_1(\theta_1 \mid x) \propto \tilde\pi_1(\theta_1 \mid x)\,, \qquad \pi_2(\theta_2 \mid x) \propto \tilde\pi_2(\theta_2 \mid x)$$
live on the same space ($\Theta_1 = \Theta_2$), then
$$B_{12} \approx \frac{1}{n} \sum_{i=1}^n \frac{\tilde\pi_1(\theta_i \mid x)}{\tilde\pi_2(\theta_i \mid x)}\,, \qquad \theta_i \sim \pi_2(\theta \mid x)$$
[Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
Bridge sampling variance
The bridge sampling estimator does poorly if
$$\frac{\mathrm{var}(\widehat{B}_{12})}{B_{12}^2} \approx \frac{1}{n}\, \mathbb{E}\left[\left(\frac{\pi_1(\theta) - \pi_2(\theta)}{\pi_2(\theta)}\right)^2\right]$$
is large, i.e. if π1 and π2 have little overlap...
An infamous example
When
$$\alpha(\theta) = \frac{1}{\tilde\pi_1(\theta \mid x)\,\tilde\pi_2(\theta \mid x)}\,,$$
harmonic mean approximation to $B_{12}$:
$$\widehat{B}_{21} = \frac{\dfrac{1}{n_1}\sum_{i=1}^{n_1} 1 \big/ \tilde\pi_1(\theta_{1i} \mid x)}{\dfrac{1}{n_2}\sum_{i=1}^{n_2} 1 \big/ \tilde\pi_2(\theta_{2i} \mid x)}\,, \qquad \theta_{ji} \sim \pi_j(\theta \mid x)$$
The original harmonic mean estimator
When $\theta_{kt} \sim \pi_k(\theta \mid x)$,
$$\frac{1}{T} \sum_{t=1}^T \frac{1}{L(\theta_{kt} \mid x)}$$
is an unbiased estimator of $1/m_k(x)$.
[Newton & Raftery, 1994]
Highly dangerous: most often leads to an infinite variance!!!
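A quick empirical check (ours) on a conjugate normal toy problem, where the evidence is known in closed form, illustrates this instability: repeated runs of the harmonic mean estimator scatter wildly around the truth.

R code sketch
set.seed(42)
x=1.5                                    # one observation from N(theta,1), theta ~ N(0,1)
hm=replicate(50,{
  th=rnorm(1e4,mean=x/2,sd=sqrt(1/2))    # exact posterior draws
  1/mean(1/dnorm(x,mean=th))             # harmonic mean of the likelihood values
})
c(truth=dnorm(x,sd=sqrt(2)),range(hm))   # estimates vary wildly across replications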
"The Worst Monte Carlo Method Ever"
"The good news is that the Law of Large Numbers guarantees that this estimator is consistent, i.e., it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution.
The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it's easy for people to not realize this, and to naïvely accept estimates that are nowhere close to the correct value of the marginal likelihood."
[Radford Neal's blog, Aug. 23, 2008]
Optimal bridge sampling
The optimal choice of auxiliary function is
$$\alpha^\star = \frac{n_1 + n_2}{n_1\,\pi_1(\theta \mid x) + n_2\,\pi_2(\theta \mid x)}\,,$$
leading to
$$\widehat{B}_{12} \approx \frac{\dfrac{1}{n_2} \sum_{i=1}^{n_2} \dfrac{\tilde\pi_1(\theta_{2i} \mid x)}{n_1\,\pi_1(\theta_{2i} \mid x) + n_2\,\pi_2(\theta_{2i} \mid x)}}{\dfrac{1}{n_1} \sum_{i=1}^{n_1} \dfrac{\tilde\pi_2(\theta_{1i} \mid x)}{n_1\,\pi_1(\theta_{1i} \mid x) + n_2\,\pi_2(\theta_{1i} \mid x)}}\,, \qquad \theta_{ki} \sim \pi_k(\theta \mid x)$$
Back later!
Optimal bridge sampling (2)
Reason:
$$\frac{\mathrm{Var}(\widehat{B}_{12})}{B_{12}^2} \approx \frac{1}{n_1 n_2} \left\{ \frac{\int \pi_1(\theta)\,\pi_2(\theta)\,[n_1\,\pi_1(\theta) + n_2\,\pi_2(\theta)]\,\alpha(\theta)^2\,d\theta}{\left(\int \pi_1(\theta)\,\pi_2(\theta)\,\alpha(\theta)\,d\theta\right)^2} - 1 \right\}$$
(by the δ method)
Drag: the dependence on the unknown normalising constants is solved iteratively.
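The fixed-point iteration below is a compact R sketch (ours) of this iterative scheme, for two unnormalised densities p1t and p2t on the same space with samples th1 ∼ π1 and th2 ∼ π2; since only the ratio of normalising constants matters, the current value of B12 is folded into the weights.

R code sketch
bridge=function(p1t,p2t,th1,th2,niter=20){
  n1=length(th1); n2=length(th2); B=1
  for (it in 1:niter){
    num=mean(p1t(th2)/(n1*p1t(th2)+n2*B*p2t(th2)))   # average over the pi2 sample
    den=mean(p2t(th1)/(n1*p1t(th1)+n2*B*p2t(th1)))   # average over the pi1 sample
    B=num/den
  }
  B
}
# toy check where B12=2: p1t integrates to 2, p2t to 1, both posteriors N(0,1)
set.seed(1)
bridge(function(t)2*dnorm(t),dnorm,rnorm(1e4),rnorm(1e4))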
Extension to varying dimensions
When π1 and π2 live on spaces of different dimensions, for instance θ2 = (θ1, ψ), introduction of a pseudo-posterior density ω(ψ|θ1, x) to augment π1(θ1|x) into the joint distribution π1(θ1|x) × ω(ψ|θ1, x) on θ2, so that
$$B_{12} = \frac{\int \tilde\pi_1(\theta_1 \mid x)\,\omega(\psi \mid \theta_1, x)\;\alpha(\theta_1, \psi)\,\pi_2(\theta_1, \psi \mid x)\,d\theta_1\,d\psi}{\int \tilde\pi_2(\theta_1, \psi \mid x)\;\alpha(\theta_1, \psi)\,\pi_1(\theta_1 \mid x)\,\omega(\psi \mid \theta_1, x)\,d\theta_1\,d\psi}\,.$$
Illustration for the Pima Indian dataset
Use of the MLE-induced conditional of β3 given (β1, β2) as a pseudo-posterior, and of a mixture of both MLE approximations on β3 in the bridge sampling estimate.
R bridge sampling code (hmprobit is the Metropolis–Hastings sampler for the probit posterior, probitlpost the probit log-posterior, meanw the conditional MLE mean of β3, and ginv comes from the MASS package)
library(MASS)   # ginv
cova=model2$cov.unscaled
expecta=model2$coeff[,1]
# conditional variance of beta3 given (beta1,beta2) in the MLE approximation
covw=cova[3,3]-t(cova[1:2,3])%*%ginv(cova[1:2,1:2])%*%cova[1:2,3]
probit1=hmprobit(Niter,y,X1)
probit2=hmprobit(Niter,y,X2)
pseudo=rnorm(Niter,meanw(probit1),sqrt(covw))   # pseudo-posterior draws of beta3
probit1p=cbind(probit1,pseudo)
bfbs=mean(exp(probitlpost(probit2[,1:2],y,X1)+dnorm(probit2[,3],meanw(probit2[,1:2]),
  sqrt(covw),log=T))/(dmvnorm(probit2,expecta,cova)+dnorm(probit2[,3],expecta[3],
  cova[3,3])))/mean(exp(probitlpost(probit1p,y,X2))/(dmvnorm(probit1p,expecta,cova)+
  dnorm(pseudo,expecta[3],cova[3,3])))
Diabetes in Pima Indian women (cont'd)
Comparison of the variation of the Bayes factor approximations, based on 100 replicas of 20,000 simulations each, for simulation from the prior, the above bridge sampler, and the above importance sampler.
[Figure: boxplots of the estimates, "MC", "Bridge" and "IS"]
Ratio importance sampling
Another identity:
$$B_{12} = \frac{\mathbb{E}_\varphi[\tilde\pi_1(\theta)/\varphi(\theta)]}{\mathbb{E}_\varphi[\tilde\pi_2(\theta)/\varphi(\theta)]}$$
for any density ϕ with sufficiently large support.
[Torrie & Valleau, 1977]
Use of a single sample θ1, . . . , θn from ϕ:
$$\widehat{B}_{12} = \frac{\sum_{i=1}^n \tilde\pi_1(\theta_i)/\varphi(\theta_i)}{\sum_{i=1}^n \tilde\pi_2(\theta_i)/\varphi(\theta_i)}$$
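In R, this single-sample estimator is nearly a one-liner (a sketch of ours, reusing the toy pair of unnormalised densities with known ratio 2 and a heavy-tailed Student-t choice for ϕ):

R code sketch
set.seed(2)
th=rt(1e5,df=3)                  # single sample from phi
w1=2*dnorm(th)/dt(th,df=3)       # pi1-tilde / phi
w2=dnorm(th)/dt(th,df=3)         # pi2-tilde / phi
sum(w1)/sum(w2)                  # estimate of B12 (true value 2)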
Ratio importance sampling (2)
Approximate variance:
$$\frac{\mathrm{var}(\widehat{B}_{12})}{B_{12}^2} \approx \frac{1}{n}\, \mathbb{E}_\varphi\left[\frac{(\pi_1(\theta) - \pi_2(\theta))^2}{\varphi(\theta)^2}\right]$$
Optimal choice:
$$\varphi^\star(\theta) = \frac{|\,\pi_1(\theta) - \pi_2(\theta)\,|}{\int |\,\pi_1(\eta) - \pi_2(\eta)\,|\,d\eta}$$
[Chen, Shao & Ibrahim, 2000]
Improving upon bridge sampler
Theorem 5.5.3: the asymptotic variance of the optimal ratio importance sampling estimator is smaller than the asymptotic variance of the optimal bridge sampling estimator.
[Chen, Shao & Ibrahim, 2000]
Does not require the normalising constant
$$\int |\,\pi_1(\eta) - \pi_2(\eta)\,|\,d\eta\,,$$
but does require a simulation from
$$\varphi^\star(\theta) \propto |\,\pi_1(\theta) - \pi_2(\theta)\,|\,.$$
Varying dimensions
Generalisation to point null situations
When
$$B_{12} = \frac{\int_{\Theta_1} \tilde\pi_1(\theta_1)\,d\theta_1}{\int_{\Theta_2} \tilde\pi_2(\theta_2)\,d\theta_2}$$
and $\Theta_2 = \Theta_1 \times \Psi$, we get $\theta_2 = (\theta_1, \psi)$ and
$$B_{12} = \mathbb{E}^{\pi_2}\left[\frac{\tilde\pi_1(\theta_1)\,\omega(\psi \mid \theta_1)}{\tilde\pi_2(\theta_1, \psi)}\right]\,,$$
which holds for any conditional density ω(ψ|θ1).
X-dimen'al bridge sampling
Generalisation of the previous identity: for any α,
$$B_{12} = \frac{\mathbb{E}_{\pi_2}[\tilde\pi_1(\theta_1)\,\omega(\psi \mid \theta_1)\,\alpha(\theta_1, \psi)]}{\mathbb{E}_{\pi_1 \times \omega}[\tilde\pi_2(\theta_1, \psi)\,\alpha(\theta_1, \psi)]}$$
and, for any density ϕ,
$$B_{12} = \frac{\mathbb{E}_\varphi[\tilde\pi_1(\theta_1)\,\omega(\psi \mid \theta_1)/\varphi(\theta_1, \psi)]}{\mathbb{E}_\varphi[\tilde\pi_2(\theta_1, \psi)/\varphi(\theta_1, \psi)]}$$
[Chen, Shao & Ibrahim, 2000]
Optimal choice: ω(ψ|θ1) = π2(ψ|θ1)
[Theorem 5.8.2]
Harmonic means
Approximating Zk from a posterior sample
Use of the [harmonic mean] identity
$$\mathbb{E}^{\pi_k}\left[\frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)} \,\middle|\, x\right] = \int \frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\; \frac{\pi_k(\theta_k)\,L_k(\theta_k)}{Z_k}\,d\theta_k = \frac{1}{Z_k}\,,$$
no matter what the proposal ϕ(·) is.
[Gelfand & Dey, 1994; Bartolucci et al., 2006]
Direct exploitation of the MCMC output.
Comparison with regular importance sampling
Harmonic mean: constraint opposed to the usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation
$$\widehat{Z}_{1k} = 1 \bigg/ \frac{1}{T} \sum_{t=1}^T \frac{\varphi(\theta_k^{(t)})}{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}$$
to have a finite variance.
E.g., use finite-support kernels (like the Epanechnikov kernel) for ϕ.
Comparison with regular importance sampling (cont'd)
Compare $\widehat{Z}_{1k}$ with a standard importance sampling approximation
$$\widehat{Z}_{2k} = \frac{1}{T} \sum_{t=1}^T \frac{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}{\varphi(\theta_k^{(t)})}\,,$$
where the $\theta_k^{(t)}$'s are generated from the density ϕ(·) (with fatter tails, like t's).
HPD indicator as ϕ
Use the convex hull of the MCMC simulations corresponding to the 10% HPD region (easily derived!) and take ϕ as the indicator average
$$\varphi(\theta) = \frac{1}{T} \sum_t \mathbb{I}_{d(\theta, \theta^{(t)}) \le \epsilon}\,.$$
Diabetes in Pima Indian women (cont'd)
Comparison of the variation of the Bayes factor approximations, based on 100 replicas of 20,000 simulations each, for the above harmonic mean sampler and importance samplers.
[Figure: boxplots of the estimates, "Harmonic mean" vs "Importance sampling", on a 3.102-3.116 scale]
Mixtures to bridge
Approximating Zk using a mixture representation
Bridge sampling redux: design a specific mixture for simulation [importance sampling] purposes, with density
$$\tilde\varphi_k(\theta_k) \propto \omega_1\, \pi_k(\theta_k)\, L_k(\theta_k) + \varphi(\theta_k)\,,$$
where ϕ(·) is arbitrary (but normalised).
Note: ω1 is not a probability weight.
Approximating Z using a mixture representation (cont'd)
Corresponding MCMC (= Gibbs) sampler. At iteration t:
1. Take δ(t) = 1 with probability
$$\omega_1\,\pi_k(\theta_k^{(t-1)})\,L_k(\theta_k^{(t-1)}) \Big/ \left\{ \omega_1\,\pi_k(\theta_k^{(t-1)})\,L_k(\theta_k^{(t-1)}) + \varphi(\theta_k^{(t-1)}) \right\}$$
and δ(t) = 2 otherwise;
2. If δ(t) = 1, generate $\theta_k^{(t)} \sim \mathrm{MCMC}(\theta_k^{(t-1)}, \theta_k^{(t)})$, where $\mathrm{MCMC}(\theta_k, \theta_k')$ denotes an arbitrary MCMC kernel associated with the posterior $\pi_k(\theta_k \mid x) \propto \pi_k(\theta_k)\,L_k(\theta_k)$;
3. If δ(t) = 2, generate $\theta_k^{(t)} \sim \varphi(\theta_k)$ independently.
Evidence approximation by mixtures
Rao-Blackwellised estimate
$$\hat\xi = \frac{1}{T} \sum_{t=1}^T \frac{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})}$$
converges to $\omega_1 Z_k / \{\omega_1 Z_k + 1\}$.
Deduce $\hat Z_{3k}$ from $\omega_1 \hat Z_{3k} / \{\omega_1 \hat Z_{3k} + 1\} = \hat\xi$, i.e.
$$\hat Z_{3k} = \frac{1}{\omega_1}\; \frac{\sum_{t=1}^T \omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) \Big/ \left\{ \omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)}) \right\}}{\sum_{t=1}^T \varphi(\theta_k^{(t)}) \Big/ \left\{ \omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)}) \right\}}$$
[Bridge sampler]
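A self-contained R sketch (ours) of the whole scheme on the conjugate toy model x ∼ N(θ, 1), θ ∼ N(0, 1), where the exact evidence is available for checking; ϕ is an arbitrary normal density and the MCMC kernel a plain random walk.

R code sketch
set.seed(3)
x=1.5
piL=function(th) dnorm(th)*dnorm(x,mean=th)   # unnormalised pi(theta)L(theta)
phi=function(th) dnorm(th,mean=x)             # arbitrary normalised density
w1=1; Tn=1e5; th=0; num=den=0
for (t in 1:Tn){
  p1=w1*piL(th)/(w1*piL(th)+phi(th))
  if (runif(1)<p1){                           # delta=1: MCMC move targeting piL
    prop=th+rnorm(1,sd=0.5)
    if (runif(1)<piL(prop)/piL(th)) th=prop
  } else th=rnorm(1,mean=x)                   # delta=2: independent draw from phi
  r=w1*piL(th)/(w1*piL(th)+phi(th))           # Rao-Blackwellised term
  num=num+r; den=den+(1-r)
}
(num/den)/w1                                  # Z3 estimate
dnorm(x,sd=sqrt(2))                           # exact evidence, for comparison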
Chib’s solution
Chib's representation
Direct application of Bayes' theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk),
$$Z_k = m_k(x) = \frac{f_k(x \mid \theta_k)\, \pi_k(\theta_k)}{\pi_k(\theta_k \mid x)}$$
Use of an approximation to the posterior:
$$\widehat{Z}_k = \widehat{m}_k(x) = \frac{f_k(x \mid \theta_k^*)\, \pi_k(\theta_k^*)}{\hat\pi_k(\theta_k^* \mid x)}\,.$$
Case of latent variables
For missing variables z as in mixture models, natural Rao-Blackwell estimate
$$\hat\pi_k(\theta_k^* \mid x) = \frac{1}{T} \sum_{t=1}^T \pi_k(\theta_k^* \mid x, z_k^{(t)})\,,$$
where the $z_k^{(t)}$'s are Gibbs-sampled latent variables.
Label switching
A mixture model [special case of missing variable model] is invariant under permutations of the indices of the components.
E.g., the mixtures
$$0.3\,\mathcal{N}(0, 1) + 0.7\,\mathcal{N}(2.3, 1)$$
and
$$0.7\,\mathcal{N}(2.3, 1) + 0.3\,\mathcal{N}(0, 1)$$
are exactly the same!
The component parameters θi are therefore not identifiable marginally, since they are exchangeable.
Connected difficulties
1. Number of modes of the likelihood of order O(k!): maximization and even [MCMC] exploration of the posterior surface are harder.
2. Under exchangeable priors on (θ, p) [priors invariant under permutation of the indices], all posterior marginals are identical: the posterior expectation of θ1 is equal to the posterior expectation of θ2.
License
Since the Gibbs output does not produce exchangeability, the Gibbs sampler has not explored the whole parameter space: it lacks the energy to switch enough component allocations at once.
[Figure: raw sequences and pairwise scatterplots of the simulated (pi, µi, σi), showing no label switching]
Label switching paradox
We should observe the exchangeability of the components [label
switching] to conclude about convergence of the Gibbs sampler.
If we observe it, then we do not know how to estimate the
parameters.
If we do not, then we are uncertain about the convergence!!!
Compensation for label switching
For mixture models, $z_k^{(t)}$ usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory:
$$\pi_k(\theta_k \mid x) = \pi_k(\sigma(\theta_k) \mid x) = \frac{1}{k!} \sum_{\sigma \in \mathfrak{S}_k} \pi_k(\sigma(\theta_k) \mid x)$$
for all σ's in $\mathfrak{S}_k$, the set of all permutations of {1, . . . , k}.
Consequences on the numerical approximation, biased by an order k!
Recover the theoretical symmetry by using
$$\tilde\pi_k(\theta_k^* \mid x) = \frac{1}{T\, k!} \sum_{\sigma \in \mathfrak{S}_k} \sum_{t=1}^T \pi_k(\sigma(\theta_k^*) \mid x, z_k^{(t)})\,.$$
[Berkhof, Mechelen, & Gelman, 2003]
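A short R sketch (ours; dens_given_z and the componentwise storage of θ* as a k × p matrix are assumptions about the surrounding Gibbs code) of the symmetrised Rao-Blackwell estimate, feasible by complete enumeration for small k:

R code sketch
library(combinat)   # permn, all permutations of 1:k
sym_chib_density=function(theta_star,zs,dens_given_z,k){
  # theta_star: k x p matrix of component parameters; zs: list of sampled z's
  perms=permn(k); total=0
  for (s in perms) for (z in zs)
    total=total+dens_given_z(theta_star[s,,drop=FALSE],z)
  total/(length(zs)*factorial(k))   # the (1/Tk!) double sum of the slide
}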
Galaxy dataset
n = 82 galaxies as a mixture of k normal distributions with both mean and variance unknown.
[Roeder, 1992]
[Figure: histogram of the (standardised) galaxy data with the average density estimate]
Galaxy dataset (k)
Using only the original estimate, with $\theta_k^*$ as the MAP estimator,
$$\log(\hat m_k(x)) = -105.1396$$
for k = 3 (based on $10^3$ simulations), while introducing the permutations leads to
$$\log(\hat m_k(x)) = -103.3479$$
Note that
$$-105.1396 + \log(3!) = -103.3479$$

  k        2        3        4        5        6        7        8
  mk(x)  -115.68  -103.35  -102.66  -101.93  -102.88  -105.48  -108.44

Estimates of the marginal likelihoods by the symmetrised Chib's approximation (based on $10^5$ Gibbs iterations and, for k > 5, 100 permutations selected at random in $\mathfrak{S}_k$).
[Lee, Marin, Mengersen & Robert, 2008]
Case of the probit model
For the completion by z,
$$\hat\pi(\theta \mid x) = \frac{1}{T} \sum_t \pi(\theta \mid x, z^{(t)})$$
is a simple average of normal densities.
R Chib's approximation code (gibbsprobit returns the Gibbs output for the probit model and, as before, dmvlnorm denotes the log multivariate normal density)
gibbs1=gibbsprobit(Niter,y,X1)
gibbs2=gibbsprobit(Niter,y,X2)
# Bayes factor approximation based on Chib's representation, evaluated at the MLEs
bfchi=mean(exp(dmvlnorm(t(t(gibbs2$mu)-model2$coeff[,1]),mean=rep(0,3),
  sigma=gibbs2$Sigma2)-probitlpost(model2$coeff[,1],y,X2)))/
  mean(exp(dmvlnorm(t(t(gibbs1$mu)-model1$coeff[,1]),mean=rep(0,2),
  sigma=gibbs1$Sigma2)-probitlpost(model1$coeff[,1],y,X1)))
Diabetes in Pima Indian women (cont'd)
Comparison of the variation of the Bayes factor approximations, based on 100 replicas of 20,000 simulations each, for the above Chib's and importance samplers.
[Figure: boxplots of the estimates, "Chib's method" vs "importance sampling", on a 0.0240-0.0255 scale]
The Savage–Dickey ratio
Special representation of the Bayes factor used for simulation: given a test H0 : θ = θ0 in a model f(x|θ, ψ) with a nuisance parameter ψ, under priors π0(ψ) and π1(θ, ψ) such that
$$\pi_1(\psi \mid \theta_0) = \pi_0(\psi)\,,$$
then
$$B_{01} = \frac{\pi_1(\theta_0 \mid x)}{\pi_1(\theta_0)}\,,$$
with the obvious notations
$$\pi_1(\theta) = \int \pi_1(\theta, \psi)\,d\psi\,, \qquad \pi_1(\theta \mid x) = \int \pi_1(\theta, \psi \mid x)\,d\psi\,.$$
[Dickey, 1971; Verdinelli & Wasserman, 1995]
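On a toy normal example (ours), the Savage–Dickey representation reduces to comparing a posterior density estimate with the prior density at θ0:

R code sketch
set.seed(4)
x=0.7                                      # x ~ N(theta,1), theta ~ N(0,1), H0: theta=0
post=rnorm(1e5,mean=x/2,sd=sqrt(1/2))      # draws from pi1(theta|x)
B01=approxfun(density(post))(0)/dnorm(0)   # posterior vs prior density at theta0=0
c(B01,dnorm(x)/dnorm(x,sd=sqrt(2)))        # exact value for comparison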
Measure-theoretic difficulty
The representation depends on the choice of versions of conditional densities:
$$B_{01} = \frac{\int \pi_0(\psi)\,f(x \mid \theta_0, \psi)\,d\psi}{\int \pi_1(\theta, \psi)\,f(x \mid \theta, \psi)\,d\psi\,d\theta} \quad \text{[by definition]}$$
$$= \frac{\int \pi_1(\psi \mid \theta_0)\,f(x \mid \theta_0, \psi)\,d\psi\;\pi_1(\theta_0)}{\int \pi_1(\theta, \psi)\,f(x \mid \theta, \psi)\,d\psi\,d\theta\;\pi_1(\theta_0)} \quad \text{[specific version of } \pi_1(\psi \mid \theta_0)\text{]}$$
$$= \frac{\int \pi_1(\theta_0, \psi)\,f(x \mid \theta_0, \psi)\,d\psi}{m_1(x)\,\pi_1(\theta_0)} \quad \text{[specific version of } \pi_1(\theta_0, \psi)\text{]}$$
$$= \frac{\pi_1(\theta_0 \mid x)}{\pi_1(\theta_0)}$$
Hence Dickey's (1971) condition is not a condition.
Similar measure-theoretic difficulty
Verdinelli–Wasserman extension:
$$B_{01} = \frac{\pi_1(\theta_0 \mid x)}{\pi_1(\theta_0)}\; \mathbb{E}^{\pi_1(\psi \mid x, \theta_0)}\left[\frac{\pi_0(\psi)}{\pi_1(\psi \mid \theta_0)}\right]$$
depends on similar choices of versions.
The Monte Carlo implementation relies on continuous versions of all densities without making mention of it.
[Chen, Shao & Ibrahim, 2000]
Computational implementation
Starting from the (new) prior
$$\tilde\pi_1(\theta, \psi) = \pi_1(\theta)\,\pi_0(\psi)\,,$$
define the associated posterior
$$\tilde\pi_1(\theta, \psi \mid x) = \pi_0(\psi)\,\pi_1(\theta)\,f(x \mid \theta, \psi) \big/ \tilde m_1(x)$$
and impose
$$\frac{\tilde\pi_1(\theta_0 \mid x)}{\pi_1(\theta_0)} = \frac{\int \pi_0(\psi)\,f(x \mid \theta_0, \psi)\,d\psi}{\tilde m_1(x)}$$
to hold. Then
$$B_{01} = \frac{\tilde\pi_1(\theta_0 \mid x)}{\pi_1(\theta_0)}\; \frac{\tilde m_1(x)}{m_1(x)}$$
First ratio
If $(\theta^{(1)}, \psi^{(1)}), \ldots, (\theta^{(T)}, \psi^{(T)}) \sim \tilde\pi_1(\theta, \psi \mid x)$, then
$$\frac{1}{T} \sum_t \tilde\pi_1(\theta_0 \mid x, \psi^{(t)})$$
converges to $\tilde\pi_1(\theta_0 \mid x)$ (if the right version is used in θ0).
When $\tilde\pi_1(\theta_0 \mid x, \psi)$ is unavailable, replace it with
$$\frac{1}{T} \sum_{t=1}^T \tilde\pi_1(\theta_0 \mid x, z^{(t)}, \psi^{(t)})\,.$$
Bridge revival (1)
Since $\tilde m_1(x)/m_1(x)$ is unknown, apparent failure!
Use of the identity
$$\mathbb{E}^{\tilde\pi_1(\theta,\psi \mid x)}\left[\frac{\pi_1(\theta, \psi)\,f(x \mid \theta, \psi)}{\pi_0(\psi)\,\pi_1(\theta)\,f(x \mid \theta, \psi)}\right] = \mathbb{E}^{\tilde\pi_1(\theta,\psi \mid x)}\left[\frac{\pi_1(\psi \mid \theta)}{\pi_0(\psi)}\right] = \frac{m_1(x)}{\tilde m_1(x)}$$
to (biasedly) estimate $\tilde m_1(x)/m_1(x)$ by
$$T \bigg/ \sum_{t=1}^T \frac{\pi_1(\psi^{(t)} \mid \theta^{(t)})}{\pi_0(\psi^{(t)})}\,,$$
based on the same sample from $\tilde\pi_1$.
Bridge revival (2)
Alternative identity:
$$\mathbb{E}^{\pi_1(\theta,\psi \mid x)}\left[\frac{\pi_0(\psi)\,\pi_1(\theta)\,f(x \mid \theta, \psi)}{\pi_1(\theta, \psi)\,f(x \mid \theta, \psi)}\right] = \mathbb{E}^{\pi_1(\theta,\psi \mid x)}\left[\frac{\pi_0(\psi)}{\pi_1(\psi \mid \theta)}\right] = \frac{\tilde m_1(x)}{m_1(x)}$$
suggests using a second sample $(\bar\theta^{(1)}, \bar\psi^{(1)}, z^{(1)}), \ldots, (\bar\theta^{(T)}, \bar\psi^{(T)}, z^{(T)}) \sim \pi_1(\theta, \psi \mid x)$ and
$$\frac{1}{T} \sum_{t=1}^T \frac{\pi_0(\bar\psi^{(t)})}{\pi_1(\bar\psi^{(t)} \mid \bar\theta^{(t)})}$$
Resulting estimate:
$$\widehat{B}_{01} = \frac{1}{T} \sum_t \frac{\tilde\pi_1(\theta_0 \mid x, z^{(t)}, \psi^{(t)})}{\pi_1(\theta_0)}\; \frac{1}{T} \sum_{t=1}^T \frac{\pi_0(\bar\psi^{(t)})}{\pi_1(\bar\psi^{(t)} \mid \bar\theta^{(t)})}$$
Diabetes in Pima Indian women (cont'd)
Comparison of the variation of the Bayes factor approximations, based on 100 replicas of 20,000 simulations each, for the above importance, Chib's, Savage–Dickey and bridge samplers.
[Figure: boxplots of the estimates, "IS", "Chib", "Savage-Dickey" and "Bridge", on a 2.8-3.4 scale]
Cross-model solutions
Variable selection
Bayesian variable selection
Example of a regression setting: one dependent random variable y and a set {x1, . . . , xk} of k explanatory variables.
Question: are all the xi's involved in the regression?
Assumption: every subset {i1, . . . , iq} of q (0 ≤ q ≤ k) explanatory variables, {1n, xi1, . . . , xiq}, is a proper set of explanatory variables for the regression of y [intercept included in every corresponding model].
Computational issue: $2^k$ models in competition...
[Marin & Robert, Bayesian Core, 2007]
Reversible jump
Idea: set up a proper measure-theoretic framework for designing moves between models Mk.
[Green, 1995]
Create a reversible kernel K on $H = \bigcup_k \{k\} \times \Theta_k$ such that
$$\int_A \int_B K(x, dy)\,\pi(x)\,dx = \int_B \int_A K(y, dx)\,\pi(y)\,dy$$
for the invariant density π [x is of the form (k, θ(k))].
Local moves
For a move between two models, M1 and M2, the Markov chain being in state θ1 ∈ M1, denote by K1→2(θ1, dθ) and K2→1(θ2, dθ) the corresponding kernels, under the detailed balance condition
$$\pi(d\theta_1)\,K_{1\to 2}(\theta_1, d\theta) = \pi(d\theta_2)\,K_{2\to 1}(\theta_2, d\theta)\,,$$
and take, wlog, dim(M2) > dim(M1).
Proposal expressed as
$$\theta_2 = \Psi_{1\to 2}(\theta_1, v_{1\to 2})\,,$$
where $v_{1\to 2}$ is a random variable of dimension dim(M2) − dim(M1), generated as
$$v_{1\to 2} \sim \varphi_{1\to 2}(v_{1\to 2})\,.$$
Local moves (2)
In this case, q1→2(θ1, dθ2) has density
$$\varphi_{1\to 2}(v_{1\to 2}) \left|\frac{\partial \Psi_{1\to 2}(\theta_1, v_{1\to 2})}{\partial(\theta_1, v_{1\to 2})}\right|^{-1}\,,$$
by the Jacobian rule.
Reverse importance link: if $\varrho_{1\to 2}$ is the probability of choosing a move to M2 while in M1, the acceptance probability reduces to
$$\alpha(\theta_1, v_{1\to 2}) = 1 \wedge \frac{\pi(M_2, \theta_2)\,\varrho_{2\to 1}}{\pi(M_1, \theta_1)\,\varrho_{1\to 2}\,\varphi_{1\to 2}(v_{1\to 2})} \left|\frac{\partial \Psi_{1\to 2}(\theta_1, v_{1\to 2})}{\partial(\theta_1, v_{1\to 2})}\right|\,.$$
If several models are considered simultaneously, with probability $\varrho_{1\to 2}$ of choosing a move to M2 while in M1, as in
$$K(x, B) = \sum_{m=1}^\infty \int \rho_m(x, y)\,q_m(x, dy) + \omega(x)\,\mathbb{I}_B(x)$$
Generic reversible jump acceptance probability
Acceptance probability of $\theta_2 = \Psi_{1\to 2}(\theta_1, v_{1\to 2})$ is
$$\alpha(\theta_1, v_{1\to 2}) = 1 \wedge \frac{\pi(M_2, \theta_2)\,\varrho_{2\to 1}}{\pi(M_1, \theta_1)\,\varrho_{1\to 2}\,\varphi_{1\to 2}(v_{1\to 2})} \left|\frac{\partial \Psi_{1\to 2}(\theta_1, v_{1\to 2})}{\partial(\theta_1, v_{1\to 2})}\right|\,,$$
while the acceptance probability of θ1 with $(\theta_1, v_{1\to 2}) = \Psi_{1\to 2}^{-1}(\theta_2)$ is
$$\alpha(\theta_1, v_{1\to 2}) = 1 \wedge \frac{\pi(M_1, \theta_1)\,\varrho_{1\to 2}\,\varphi_{1\to 2}(v_{1\to 2})}{\pi(M_2, \theta_2)\,\varrho_{2\to 1}} \left|\frac{\partial \Psi_{1\to 2}(\theta_1, v_{1\to 2})}{\partial(\theta_1, v_{1\to 2})}\right|^{-1}$$
Difficult calibration.
Green's sampler
Algorithm. Iteration t (t ≥ 1): if $x^{(t)} = (m, \theta^{(m)})$,
1. Select model Mn with probability πmn
2. Generate $u_{mn} \sim \varphi_{mn}(u)$ and set $(\theta^{(n)}, v_{nm}) = \Psi_{m\to n}(\theta^{(m)}, u_{mn})$
3. Take $x^{(t+1)} = (n, \theta^{(n)})$ with probability
$$\min\left\{\frac{\pi(n, \theta^{(n)})\,\pi_{nm}\,\varphi_{nm}(v_{nm})}{\pi(m, \theta^{(m)})\,\pi_{mn}\,\varphi_{mn}(u_{mn})} \left|\frac{\partial \Psi_{m\to n}(\theta^{(m)}, u_{mn})}{\partial(\theta^{(m)}, u_{mn})}\right|, 1\right\}$$
and take $x^{(t+1)} = x^{(t)}$ otherwise.
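A compact R sketch (ours, under obvious simplifying assumptions) of Green's sampler for a toy pair of nested models for x1, . . . , xn ∼ N(μ, 1): M1 fixes μ = 0 and M2 puts μ ∼ N(0, 1). The completion uses the identity transform Ψ1→2(∅, v) = v, so the Jacobian is 1, and a jump is always proposed (ϱ12 = ϱ21 = 1), with a within-model random walk move on μ when the jump out of M2 is rejected.

R code sketch
set.seed(5)
x=rnorm(30,mean=0.4); n=length(x)
lpost2=function(mu) sum(dnorm(x,mu,log=TRUE))+dnorm(mu,log=TRUE)
lpost1=sum(dnorm(x,0,log=TRUE))          # M1 carries no parameter
Tn=1e4; m=1; mu=0; ms=integer(Tn)
for (t in 1:Tn){
  if (m==1){                             # propose jump M1 -> M2
    v=rnorm(1,mean(x),1/sqrt(n))         # v ~ phi_{1->2}
    if (log(runif(1))<lpost2(v)-lpost1-dnorm(v,mean(x),1/sqrt(n),log=TRUE)){
      m=2; mu=v }
  } else {                               # propose jump M2 -> M1 (drop mu)
    if (log(runif(1))<lpost1+dnorm(mu,mean(x),1/sqrt(n),log=TRUE)-lpost2(mu)){
      m=1; mu=0
    } else {                             # within-model random walk move on mu
      prop=mu+rnorm(1,sd=0.3)
      if (log(runif(1))<lpost2(prop)-lpost2(mu)) mu=prop
    }
  }
  ms[t]=m
}
mean(ms==2)/mean(ms==1)                  # estimates p(M2|x)/p(M1|x)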
Interpretation
The representation puts us back in a fixed-dimension setting:
- M1 × V1→2 and M2 are in one-to-one relation.
- Reversibility imposes that θ1 is derived as $(\theta_1, v_{1\to 2}) = \Psi_{1\to 2}^{-1}(\theta_2)$.
- Appears like a regular Metropolis–Hastings move from the couple (θ1, v1→2) to θ2 when the stationary distributions are π(M1, θ1) × ϕ1→2(v1→2) and π(M2, θ2), and when the proposal distribution is deterministic.
Saturation schemes
Alternative
Saturation of the parameter space $H = \bigcup_k \{k\} \times \Theta_k$ by creating
- θ = (θ1, . . . , θD)
- a model index M
- pseudo-priors πj(θj | M = k) for j ≠ k
[Carlin & Chib, 1995]
Validation by
$$\mathbb{P}(M = k \mid x) = \int \mathbb{P}(M = k \mid x, \theta)\,\pi(\theta \mid x)\,d\theta \;\propto\; p_k Z_k\,,$$
where the (marginal) posterior is [not πk!]
$$\pi(\theta \mid x) = \sum_{k=1}^D \mathbb{P}(\theta, M = k \mid x) = \sum_{k=1}^D p_k\, Z_k\, \pi_k(\theta_k \mid x) \prod_{j \ne k} \pi_j(\theta_j \mid M = k)\,.$$