Monash University short course, part II

MCMC and likelihood-free methods
Part/day II: ABC

Christian P. Robert

Université Paris-Dauphine, IuF, & CREST

Monash University, EBS, July 18, 2012


Outline
simulation-based methods in Econometrics

Genetics of ABC

Approximate Bayesian computation

ABC for model choice

ABC model choice consistency


Econ’ections





Usages of simulation in Econometrics

Similar exploration of simulation-based techniques in Econometrics
Simulated method of moments
Method of simulated moments
Simulated pseudo-maximum-likelihood
Indirect inference
[Gouriéroux & Monfort, 1996]



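Simulated method of moments

Given observations $y^o_{1:n}$ from a model

$$y_t = r(y_{1:(t-1)}, \epsilon_t, \theta), \qquad \epsilon_t \sim g(\cdot)$$

simulate $\epsilon^\star_{1:n}$, derive

$$y_t(\theta) = r(y_{1:(t-1)}, \epsilon^\star_t, \theta)$$

and estimate θ by

$$\arg\min_\theta \sum_{t=1}^n (y^o_t - y_t(\theta))^2$$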



A variant matches the sums of the observed and simulated series rather than each time point, and estimates θ by

$$\arg\min_\theta \left\{ \sum_{t=1}^n y^o_t - \sum_{t=1}^n y_t(\theta) \right\}^2$$


Method of simulated moments

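Given a statistic vector K(y) with

$$E_\theta[K(Y_t) \mid y_{1:(t-1)}] = k(y_{1:(t-1)}; \theta)$$

find an unbiased estimator of $k(y_{1:(t-1)}; \theta)$,

$$\tilde k(\epsilon_t, y_{1:(t-1)}; \theta)$$

Estimate θ by

$$\arg\min_\theta \left\| \sum_{t=1}^n \left[ K(y_t) - \sum_{s=1}^S \tilde k(\epsilon^s_t, y_{1:(t-1)}; \theta)/S \right] \right\|$$

[Pakes & Pollard, 1989]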


Indirect inference

Minimise (in θ) the distance between estimators $\hat\beta$ based on pseudo-models for genuine observations and for observations simulated under the true model and the parameter θ.

[Gouriéroux, Monfort, & Renault, 1993; Smith, 1993; Gallant & Tauchen, 1996]


Indirect inference (PML vs. PSE)

Example of the pseudo-maximum-likelihood (PML)

$$\hat\beta(y) = \arg\max_\beta \sum_t \log f(y_t \mid \beta, y_{1:(t-1)})$$

leading to

$$\arg\min_\theta \|\hat\beta(y^o) - \hat\beta(y_1(\theta), \ldots, y_S(\theta))\|^2$$

when $y_s(\theta) \sim f(y|\theta)$, $s = 1, \ldots, S$


Example of the pseudo-score-estimator (PSE)

$$\hat\beta(y) = \arg\min_\beta \sum_t \left\{ \frac{\partial \log f}{\partial \beta}(y_t \mid \beta, y_{1:(t-1)}) \right\}^2$$

leading to

$$\arg\min_\theta \|\hat\beta(y^o) - \hat\beta(y_1(\theta), \ldots, y_S(\theta))\|^2$$

when $y_s(\theta) \sim f(y|\theta)$, $s = 1, \ldots, S$


Consistent indirect inference

...in order to get a unique solution the dimension of
the auxiliary parameter β must be larger than or equal to
the dimension of the initial parameter θ. If the problem is
just identified the different methods become easier...

Consistency depends on the criterion and on the asymptotic identifiability of θ
[Gouriéroux & Monfort, 1996, p. 66]


AR(2) vs. MA(1) example

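true (MA) model
$$y_t = \epsilon_t - \theta \epsilon_{t-1}$$

and [wrong!] auxiliary (AR) model

$$y_t = \beta_1 y_{t-1} + \beta_2 y_{t-2} + u_t$$

R code
# observed MA(1) series with theta = 0.5
eps=rnorm(250)
x=eps
x[2:250]=eps[2:250]-0.5*eps[1:249]
simeps=rnorm(250)   # common noise reused for every candidate theta
propeta=seq(-.99,.99,length.out=199)
dist=rep(0,199)
# AR(2) auxiliary estimate on the observed series
bethat=as.vector(arima(x,c(2,0,0),include.mean=FALSE)$coef)
for (t in 1:199){
  # MA(1) series simulated under the candidate theta, then AR(2) fit
  y=c(simeps[1],simeps[2:250]-propeta[t]*simeps[1:249])
  dist[t]=sum((as.vector(arima(y,c(2,0,0),include.mean=FALSE)$coef)-bethat)^2)
}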


One sample:
[Figure: distance (vertical axis, 0 to 0.8) over the range −1.0 to 1.0]

Many samples:
[Figure: horizontal axis from 0.2 to 1.0, vertical axis from 0 to 6]


Choice of pseudo-model

Pick model such that
1. $\hat\beta(\theta)$ not flat (i.e. sensitive to changes in θ)
2. $\hat\beta(\theta)$ not dispersed (i.e. robust against changes in $y_s(\theta)$)
[Frigessi & Heggland, 2004]


ABC using indirect inference (1)

We present a novel approach for developing summary statistics
for use in approximate Bayesian computation (ABC) algorithms by
using indirect inference (...) In the indirect inference approach to
ABC the parameters of an auxiliary model fitted to the data become
the summary statistics. Although applicable to any ABC technique,
we embed this approach within a sequential Monte Carlo algorithm
that is completely adaptive and requires very little tuning (...)

[Drovandi, Pettitt & Faddy, 2011]

Indirect inference provides summary statistics for ABC...


ABC using indirect inference (2)

...the above result shows that, in the limit as h → 0, ABC will
be more accurate than an indirect inference method whose auxiliary
statistics are the same as the summary statistic that is used for
ABC(...) Initial analysis showed that which method is more
accurate depends on the true value of θ.

[Fearnhead and Prangle, 2012]

Indirect inference provides estimates rather than global inference...


Genetics of ABC

Genetic background of ABC

ABC is a recent computational technique that only requires being
able to sample from the likelihood f(·|θ).
This technique stemmed from population genetics models, about
15 years ago, and population geneticists still contribute
significantly to methodological developments of ABC.
[Griffith & al., 1997; Tavaré & al., 1999]

Population genetics

[Part derived from the teaching material of Raphael Leblois, ENS Lyon, November 2010]

Describe the genotypes, estimate the allele frequencies,
determine their distribution among individuals, populations
and between populations;

Predict and understand the evolution of gene frequencies in
populations as a result of various factors.

Analyses the effect of various evolutionary forces (mutation, drift,
migration, selection) on the evolution of gene frequencies in time
and space.

A[B]ctors
The mechanisms of evolution
Towards an explanation of the mechanisms of evolutionary
changes, mathematical models of evolution started to be developed
in the years 1920-1930.

Ronald A Fisher (1890-1962), Sewall Wright (1889-1988), John BS Haldane (1892-1964)

Wright-Fisher model

A population of constant size, in which individuals reproduce at the same time.
Each gene in a generation is a copy of a gene of the previous generation.
In the absence of mutation and selection, allele frequencies drift (increase and decrease) inevitably until the fixation of an allele.
Drift thus leads to the loss of genetic variation within populations.
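
The drift mechanism described above can be sketched in a few lines of R. This is a minimal illustration, assuming a population of `n_genes` gene copies and a single allele tracked by its frequency; the helper `wf_drift` and all numerical values are hypothetical and not taken from the course material.

```r
# Hypothetical helper: allele frequency path under pure drift
# (no mutation, no selection), Wright-Fisher reproduction
wf_drift <- function(n_genes, p0, n_gen) {
  freq <- numeric(n_gen + 1)
  freq[1] <- p0
  for (g in 1:n_gen) {
    # each gene is a copy of a gene drawn at random from the previous generation
    freq[g + 1] <- rbinom(1, n_genes, freq[g]) / n_genes
  }
  freq
}

set.seed(1)
paths <- replicate(200, wf_drift(n_genes = 50, p0 = 0.5, n_gen = 2000))
final <- paths[nrow(paths), ]
# proportion of replicates in which the allele is fixed or lost
mean(final == 0 | final == 1)
```

With such a small population and so many generations, essentially every replicate should end at frequency 0 or 1, which is the loss of variation the slide refers to.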


Coalescent theory
[Kingman, 1982; Tajima, Tavaré, &tc]

Coalescence theory is interested in the genealogy of a sample of
genes, back in time to the common ancestor of the sample.

Common ancestor

Time of coalescence (T)

The different lineages merge (coalesce) when we go back in the past.
Modelling of the genetic drift process by “going back in time” to the common ancestor of a sample of genes.
Neutral mutations

Under the assumption of neutrality of the genetic markers studied, the mutations are independent of the genealogy, i.e. the genealogy depends only on the demographic processes.
We construct the genealogy according to the demographic parameters (e.g. N), then we add a posteriori the mutations on the different branches, from the MRCA to the leaves of the tree.
We thus obtain polymorphism data under the demographic and mutational models considered.
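
The two-stage construction (genealogy first, mutations added afterwards) can be sketched as follows. This is only an illustration under stated assumptions: time is measured in standard coalescent units, `theta` denotes a scaled mutation rate, and only the number of mutations (not their position on the tree) is simulated. The helper `sim_coalescent` is hypothetical.

```r
# Hypothetical helper: one coalescent genealogy for n sampled genes,
# followed by neutral mutations dropped on the branches
sim_coalescent <- function(n, theta) {
  k <- n:2                                        # number of lineages still present
  times <- rexp(length(k), rate = choose(k, 2))   # time spent with k lineages
  tree_length <- sum(k * times)                   # total branch length of the tree
  list(tmrca = sum(times),                        # time to the most recent common ancestor
       n_mut = rpois(1, theta / 2 * tree_length)) # mutations added a posteriori
}

set.seed(2)
sims <- replicate(5000, unlist(sim_coalescent(n = 10, theta = 5)))
rowMeans(sims)
```

Because the mutations are drawn given the tree, the genealogy itself never depends on `theta`, which mirrors the neutrality assumption.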


Demo-genetic inference

Each model is characterized by a set of parameters θ that cover
historical (divergence time, admixture time, ...), demographic
(population sizes, admixture rates, migration rates, ...) and genetic
(mutation rate, ...) factors.
The goal is to estimate these parameters from a dataset of
polymorphism (DNA sample) y observed at the present time.
Problem: most of the time, we cannot calculate the likelihood of
the polymorphism data f(y|θ).

Intractable likelihood

Missing (too missing!) data structure:

$$f(y|\theta) = \int_G f(y|G, \theta) f(G|\theta)\, dG$$

The genealogies are considered as nuisance parameters.
This problem thus differs from the phylogenetic approach
where the tree is the parameter of interest.

A genuine example of application

Pygmy populations: do they have a common origin? Are there
many exchanges between Pygmy and non-Pygmy populations?
Scenarios under competition

Several possible scenarios, scenario choice by ABC

[Verdu et al., 2009]
Simulation results

Scenario 1A is chosen: it is largely supported over the others, which argues for a common origin of the West African Pygmy populations.
[Verdu et al., 2009]
Most likely scenario

Evolutionary scenario: a story is “told” from these inferences.

[Verdu et al., 2009]




ABC basics

Intractable likelihoods

Cases when the likelihood function f(y|θ) is unavailable and when the completion step

$$f(y|\theta) = \int_Z f(y, z|\theta)\, dz$$

is impossible or too costly because of the dimension of z
MCMC cannot be implemented!

Illustrations

Example
Stochastic volatility model: for t = 1, ..., T,

$$y_t = \exp(z_t)\epsilon_t, \qquad z_t = a + b z_{t-1} + \sigma\eta_t,$$

T very large makes it difficult to include z within the simulated parameters
[Figure: highest weight trajectories, plotted against t from 0 to 1000]

Example
Potts model: if y takes values on a grid Y of size $k^n$ and

$$f(y|\theta) \propto \exp\Big\{ \theta \sum_{l\sim i} \mathbb{I}_{y_l = y_i} \Big\}$$

where l∼i denotes a neighbourhood relation, n moderately large
prohibits the computation of the normalising constant

Example (Genesis)
Phylogenetic tree: in population genetics, reconstitution of a common ancestor from a sample of genes via a phylogenetic tree that is close to impossible to integrate out
[100 processor days with 4 parameters]
[Cornuet et al., 2009, Bioinformatics]

The ABC method

Bayesian setting: target is π(θ)f (x|θ)

When likelihood f(x|θ) not in closed form, likelihood-free rejection
technique:
ABC algorithm
For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly
simulating
$$\theta' \sim \pi(\theta), \qquad z \sim f(z|\theta'),$$
until the auxiliary variable z is equal to the observed value, z = y.

[Tavaré et al., 1997]
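
A minimal sketch of this exact-match algorithm, assuming a discrete model where z = y has positive probability: a binomial observation with a uniform prior, chosen here only because the exact posterior, Beta(y + 1, n − y + 1), is available for comparison. The data values are hypothetical, and the "keep simulating until" loop is replaced by a vectorised batch of proposals.

```r
set.seed(3)
n <- 20; y <- 7                 # hypothetical data: y successes out of n trials
N <- 2e5                        # number of proposals
theta <- runif(N)               # theta' drawn from the prior
z <- rbinom(N, n, theta)        # pseudo-data z drawn from f(.|theta')
abc <- theta[z == y]            # keep only exact matches z = y
c(abc = mean(abc), exact = (y + 1) / (n + 2))
```

The accepted values are exact draws from the posterior, at the cost of discarding all proposals whose pseudo-data differ from y.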


Why does it work?!

The proof is trivial:

$$f(\theta_i) \propto \sum_{z \in D} \pi(\theta_i) f(z|\theta_i) \mathbb{I}_y(z) \propto \pi(\theta_i) f(y|\theta_i) = \pi(\theta_i|y).$$

[Accept–Reject 101]

Earlier occurrence

‘Bayesian statistics and Monte Carlo methods are ideally
suited to the task of passing many models over one
dataset’
[Don Rubin, Annals of Statistics, 1984]

Note Rubin (1984) does not promote this algorithm for
likelihood-free simulation but frequentist intuition on posterior
distributions: parameters from posteriors are more likely to be
those that could have generated the data.


A as A...pproximative

When y is a continuous random variable, equality z = y is replaced
with a tolerance condition,

$$\rho(y, z) \le \epsilon$$

where ρ is a distance
Output distributed from

$$\pi(\theta)\, P_\theta\{\rho(y, z) < \epsilon\} \propto \pi(\theta \mid \rho(y, z) < \epsilon)$$

[Pritchard et al., 1999]

ABC algorithm

Algorithm 1 Likelihood-free rejection sampler
for i = 1 to N do
  repeat
    generate θ′ from the prior distribution π(·)
    generate z from the likelihood f(·|θ′)
  until ρ{η(z), η(y)} ≤ ε
  set θ_i = θ′
end for

where η(y) defines a (not necessarily sufficient) statistic
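
As an illustration of the rejection sampler with a summary statistic, here is a sketch on a normal mean problem. The model, the prior N(0, 10²), the choice η = sample mean, and the choice of ε as a small quantile of the simulated distances are all assumptions made for the example. The repeat/until loop is vectorised into a single batch of N proposals.

```r
set.seed(4)
n <- 20
y <- rnorm(n, mean = 1)               # hypothetical observed sample
eta <- function(x) mean(x)            # summary statistic (sufficient here)
N <- 1e5
theta <- rnorm(N, sd = 10)            # theta' from the prior
# column j holds a pseudo-sample z of size n simulated from N(theta_j, 1)
z <- matrix(rnorm(N * n, mean = rep(theta, each = n)), nrow = n)
rho <- abs(colMeans(z) - eta(y))      # distance between summaries
eps <- quantile(rho, 0.005)           # tolerance set as a 0.5% quantile
abc <- theta[rho <= eps]
post_mean <- mean(y) * (n * 100) / (n * 100 + 1)   # exact conjugate posterior mean
c(abc = mean(abc), exact = post_mean)
```

Setting ε through a quantile fixes the acceptance rate in advance; the resulting sample is only approximately distributed from the posterior, with the approximation improving as the quantile shrinks.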


Output

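The likelihood-free algorithm samples from the marginal in z of:

$$\pi_\epsilon(\theta, z|y) = \frac{\pi(\theta) f(z|\theta) \mathbb{I}_{A_{\epsilon,y}}(z)}{\int_{A_{\epsilon,y} \times \Theta} \pi(\theta) f(z|\theta)\, dz\, d\theta},$$

where $A_{\epsilon,y} = \{z \in D \mid \rho(\eta(z), \eta(y)) < \epsilon\}$.
The idea behind ABC is that the summary statistics coupled with a
small tolerance should provide a good approximation of the
posterior distribution:

$$\pi_\epsilon(\theta|y) = \int \pi_\epsilon(\theta, z|y)\, dz \approx \pi(\theta|y).$$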

Convergence of ABC (ﬁrst attempt)

What happens when ε → 0?


For δ > 0, there exists $\epsilon_0$ such that $\epsilon < \epsilon_0$ implies

$$\frac{\pi(\theta) \int f(z|\theta) \mathbb{I}_{A_{\epsilon,y}}(z)\, dz}{\int_{A_{\epsilon,y} \times \Theta} \pi(\theta) f(z|\theta)\, dz\, d\theta} \in \frac{\pi(\theta) f(y|\theta) (1 \mp \delta)\, \mu(B_\epsilon)}{\int_\Theta \pi(\theta) f(y|\theta)\, d\theta\, (1 \pm \delta)\, \mu(B_\epsilon)}$$

where the $\mu(B_\epsilon)$ terms cancel.

[Proof extends to other continuous-in-0 kernels K]

Convergence of ABC (second attempt)



Probit modelling on Pima Indian women
Example (R benchmark)
200 Pima Indian women with observed variables
plasma glucose concentration in oral glucose tolerance test
diastolic blood pressure
diabetes pedigree function
presence/absence of diabetes



Probability of diabetes function of above variables
P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) ,

Test of H0: β3 = 0 for 200 observations of Pima.tr based on a
g-prior modelling:
$$\beta \sim \mathcal{N}_3(0, n (X^T X)^{-1})$$
Use of importance function inspired from the MLE estimate
distribution
$$\beta \sim \mathcal{N}(\hat\beta, \hat\Sigma)$$

Pima Indian benchmark

Figure: Comparison between density estimates of the marginals on β1
(left), β2 (center) and β3 (right) from ABC rejection samples (red) and
MCMC samples (black).

MA example

Back to the MA(q) model

$$x_t = \epsilon_t + \sum_{i=1}^q \vartheta_i \epsilon_{t-i}$$

Simple prior: uniform over the inverse [real and complex] roots in

$$Q(u) = 1 - \sum_{i=1}^q \vartheta_i u^i$$

under the identifiability conditions
Simple prior: uniform prior over the identifiability zone, e.g.
triangle for MA(2)


MA example (2)
ABC algorithm thus made of
1. picking a new value (ϑ1, ϑ2) in the triangle
2. generating an iid sequence $(\epsilon_t)_{-q < t \le T}$
3. producing a simulated series $(x'_t)_{1 \le t \le T}$
Distance: basic distance between the series

$$\rho((x'_t)_{1 \le t \le T}, (x_t)_{1 \le t \le T}) = \sum_{t=1}^T (x_t - x'_t)^2$$

or distance between summary statistics like the q autocorrelations

$$\tau_j = \sum_{t=j+1}^T x_t x_{t-j}$$
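
The three steps and the summary-statistic distance can be sketched for an MA(2) model as follows. This is a hedged illustration: the helpers `sim_ma2`, `tau` and `rtriangle`, the "true" value (0.6, 0.2), the series length and the 1% tolerance quantile are hypothetical choices, and the triangle is taken to be −2 < ϑ1 < 2, ϑ1 + ϑ2 > −1, ϑ1 − ϑ2 < 1.

```r
set.seed(5)
T_len <- 200
# hypothetical helper: simulate an MA(2) series of length T_len
sim_ma2 <- function(th, T_len) {
  e <- rnorm(T_len + 2)                       # iid sequence (eps_t), -q < t <= T
  e[3:(T_len + 2)] + th[1] * e[2:(T_len + 1)] + th[2] * e[1:T_len]
}
# summary statistics tau_1, tau_2 (unnormalised autocovariances)
tau <- function(x) {
  n <- length(x)
  c(sum(x[2:n] * x[1:(n - 1)]), sum(x[3:n] * x[1:(n - 2)]))
}
# uniform draws over the triangle, by rejection from the enclosing box
rtriangle <- function(N) {
  out <- matrix(NA_real_, 0, 2)
  while (nrow(out) < N) {
    cand <- cbind(runif(N, -2, 2), runif(N, -1, 1))
    keep <- cand[, 1] + cand[, 2] > -1 & cand[, 1] - cand[, 2] < 1
    out <- rbind(out, cand[keep, , drop = FALSE])
  }
  out[1:N, , drop = FALSE]
}

x_obs <- sim_ma2(c(0.6, 0.2), T_len)          # stand-in for the observed series
N <- 20000
prior <- rtriangle(N)                         # step 1
dists <- apply(prior, 1, function(th)         # steps 2-3 and distance on summaries
  sum((tau(sim_ma2(th, T_len)) - tau(x_obs))^2))
abc <- prior[dists <= quantile(dists, 0.01), ]
colMeans(abc)
```

Replacing `tau` by the identity would give the basic distance between the raw series, which is far less discriminating since the simulated noise differs from one series to the next.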

Comparison of distance impact

Evaluation of the tolerance on the ABC sample against both
distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
[Figure: two panels, for θ1 (left) and θ2 (right)]

Homonymy
The ABC algorithm is not to be confused with the ABC algorithm
The Artificial Bee Colony algorithm is a swarm based meta-heuristic
algorithm that was introduced by Karaboga in 2005 for optimizing
numerical problems. It was inspired by the intelligent foraging
behavior of honey bees. The algorithm is specifically based on the
model proposed by Tereshko and Loengarov (2005) for the foraging
behaviour of honey bee colonies. The model consists of three
essential components: employed and unemployed foraging bees, and
food sources. The first two components, employed and unemployed
foraging bees, search for rich food sources (...) close to their hive.
The model also defines two leading modes of behaviour (...):
recruitment of foragers to rich food sources resulting in positive
feedback and abandonment of poor sources by foragers causing
negative feedback.

[Karaboga, Scholarpedia]


ABC advances

Simulating from the prior is often poor in efficiency

Either modify the proposal distribution on θ to increase the density
of x's within the vicinity of y...
[Marjoram et al., 2003; Bortot et al., 2007; Sisson et al., 2007]

...or by viewing the problem as a conditional density estimation
and by developing techniques to allow for larger ε
[Beaumont et al., 2002]

...or even by including ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]

Alphabet soup

ABC-NP
Better usage of [prior] simulations by adjustment: instead of throwing away
θ′ such that ρ(η(z), η(y)) > ε, replace θ's with locally regressed transforms
(use with BIC)

$$\theta^* = \theta - \{\eta(z) - \eta(y)\}^T \hat\beta$$
[Csilléry et al., TEE, 2010]

where $\hat\beta$ is obtained by [NP] weighted least square regression on
(η(z) − η(y)) with weights

$$K_\delta\{\rho(\eta(z), \eta(y))\}$$

[Beaumont et al., 2002, Genetics]
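
A sketch of this local regression adjustment on a normal mean problem, where the correct answer is known. The model, the prior N(0, 10²), the loose 5% tolerance, and the Epanechnikov form chosen for $K_\delta$ are assumptions made for the illustration. The sample mean of the pseudo-data is simulated directly from its N(θ, 1/n) distribution.

```r
set.seed(6)
n <- 20
y <- rnorm(n, mean = 1)                       # hypothetical observed sample
N <- 1e5
theta <- rnorm(N, sd = 10)                    # prior draws
eta_z <- rnorm(N, mean = theta, sd = 1 / sqrt(n))   # eta(z): mean of n pseudo-observations
d <- abs(eta_z - mean(y))                     # rho(eta(z), eta(y))
delta <- quantile(d, 0.05)                    # deliberately loose tolerance
keep <- d <= delta
w <- 1 - (d[keep] / delta)^2                  # Epanechnikov kernel weights
diff_eta <- eta_z[keep] - mean(y)
fit <- lm(theta[keep] ~ diff_eta, weights = w)      # weighted least squares
beta_hat <- unname(coef(fit)[2])
theta_star <- theta[keep] - diff_eta * beta_hat     # locally regressed transforms
c(raw_sd = sd(theta[keep]), adjusted_sd = sd(theta_star), exact_sd = 1 / sqrt(n))
```

In this Gaussian setting the regression of θ on η(z) is exactly linear, so the adjusted sample should recover a spread close to the true posterior standard deviation even though the tolerance is large.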


ABC-NP (regression)

Also found in the subsequent literature, e.g. in Fearnhead-Prangle (2012):
weight each simulation directly by

$$K_\delta\{\rho(\eta(z(\theta)), \eta(y))\}$$

or

$$\frac{1}{S} \sum_{s=1}^S K_\delta\{\rho(\eta(z^s(\theta)), \eta(y))\}$$

[consistent estimate of f(η|θ)]
Curse of dimensionality: poor estimate when d = dim(η) is large...

ABC-NP (density estimation)

Use of the kernel weights

$$K_\delta\{\rho(\eta(z(\theta)), \eta(y))\}$$

leads to the NP estimate of the posterior expectation

$$\frac{\sum_i \theta_i K_\delta\{\rho(\eta(z(\theta_i)), \eta(y))\}}{\sum_i K_\delta\{\rho(\eta(z(\theta_i)), \eta(y))\}}$$

[Blum, JASA, 2010]

The same kernel weights lead to the NP estimate of the posterior conditional density

$$\sum_i \tilde K_b(\theta_i - \theta) K_\delta\{\rho(\eta(z(\theta_i)), \eta(y))\}$$

[Blum, JASA, 2010]

ABC-NP (density estimations)

Other versions incorporating regression adjustments

$$\sum_i \tilde K_b(\theta_i^* - \theta) K_\delta\{\rho(\eta(z(\theta_i)), \eta(y))\}$$


In all cases, error

$$E[\hat g(\theta|y)] - g(\theta|y) = c b^2 + c \delta^2 + O_P(b^2 + \delta^2) + O_P(1/n\delta^d)$$

$$\mathrm{var}(\hat g(\theta|y)) = \frac{c}{n b \delta^d} (1 + o_P(1))$$

[Blum, JASA, 2010]
[standard NP calculations]

ABC-NCH

Incorporating non-linearities and heteroscedasticities:

$$\theta^* = \hat m(\eta(y)) + [\theta - \hat m(\eta(z))] \frac{\hat\sigma(\eta(y))}{\hat\sigma(\eta(z))}$$

where
$\hat m(\eta)$ estimated by non-linear regression (e.g., neural network)
$\hat\sigma(\eta)$ estimated by non-linear regression on residuals

$$\log\{\theta_i - \hat m(\eta_i)\}^2 = \log \sigma^2(\eta_i) + \xi_i$$

[Blum & François, 2009]

ABC-NCH (2)

Why neural network?
fights curse of dimensionality
selects relevant summary statistics
provides automated dimension reduction
offers a model choice capability
improves upon multinomial logistic
[Blum & François, 2009]

ABC-MCMC

Markov chain $(\theta^{(t)})$ created via the transition function

$$\theta^{(t+1)} = \begin{cases} \theta' \sim K_\omega(\theta'|\theta^{(t)}) & \text{if } x \sim f(x|\theta') \text{ is such that } x = y \\ & \text{and } u \sim U(0,1) \le \dfrac{\pi(\theta') K_\omega(\theta^{(t)}|\theta')}{\pi(\theta^{(t)}) K_\omega(\theta'|\theta^{(t)})}, \\ \theta^{(t)} & \text{otherwise,} \end{cases}$$

has the posterior π(θ|y) as stationary distribution
[Marjoram et al., 2003]

ABC-MCMC (2)

Algorithm 2 Likelihood-free MCMC sampler
Use Algorithm 1 to get $(\theta^{(0)}, z^{(0)})$
for t = 1 to N do
  Generate θ′ from $K_\omega(\cdot|\theta^{(t-1)})$,
  Generate z′ from the likelihood f(·|θ′),
  Generate u from U[0,1],
  if $u \le \dfrac{\pi(\theta') K_\omega(\theta^{(t-1)}|\theta')}{\pi(\theta^{(t-1)}) K_\omega(\theta'|\theta^{(t-1)})}\, \mathbb{I}_{A_{\epsilon,y}}(z')$ then
    set $(\theta^{(t)}, z^{(t)}) = (\theta', z')$
  else
    $(\theta^{(t)}, z^{(t)}) = (\theta^{(t-1)}, z^{(t-1)})$,
  end if
end for

A toy example

Case of
$$x \sim \tfrac{1}{2} N(\theta, 1) + \tfrac{1}{2} N(-\theta, 1)$$
under prior θ ∼ N(0, 10)

ABC sampler
# N (number of simulations) and x (observed value) are set beforehand
thetas=rnorm(N,sd=10)
zed=sample(c(1,-1),N,rep=TRUE)*thetas+rnorm(N,sd=1)
# tolerance taken as the 1% quantile of the distances
eps=quantile(abs(zed-x),.01)
abc=thetas[abs(zed-x)<eps]

ABC-MCMC sampler
# zed and eps are reused from the ABC sampler
metas=rep(0,N)
metas[1]=rnorm(1,sd=10)
zed[1]=x
for (t in 2:N){
  # random-walk proposal, then pseudo-observation from the mixture
  metas[t]=rnorm(1,mean=metas[t-1],sd=5)
  zed[t]=rnorm(1,mean=(1-2*(runif(1)<.5))*metas[t],sd=1)
  # stay put if outside the tolerance or if the prior-ratio test fails
  if ((abs(zed[t]-x)>eps)||(runif(1)>dnorm(metas[t],sd=10)/dnorm(metas[t-1],sd=10))){
    metas[t]=metas[t-1]
    zed[t]=zed[t-1]}
}

x = 2
[Figure: density plotted against θ from −4 to 4]

x = 10
[Figure: density plotted against θ from −10 to 10]

x = 50
[Figure: density plotted against θ from −40 to 40]

ABCµ

[Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]

Use of a joint density

$$f(\theta, \epsilon|y) \propto \xi(\epsilon|y, \theta) \times \pi_\theta(\theta) \times \pi_\epsilon(\epsilon)$$

where y is the data, and ξ(ε|y, θ) is the prior predictive density of
ρ(η(z), η(y)) given θ and y when z ∼ f(z|θ)
Warning! Replacement of ξ(ε|y, θ) with a non-parametric kernel
approximation.

ABCµ details

Multidimensional distances $\rho_k$ (k = 1, ..., K) and errors
$\epsilon_k = \rho_k(\eta_k(z), \eta_k(y))$, with

$$\epsilon_k \sim \xi_k(\epsilon|y, \theta) \approx \hat\xi_k(\epsilon|y, \theta) = \frac{1}{B h_k} \sum_b K[\{\epsilon_k - \rho_k(\eta_k(z_b), \eta_k(y))\}/h_k]$$

then used in replacing ξ(ε|y, θ) with $\min_k \hat\xi_k(\epsilon|y, \theta)$
ABCµ involves acceptance probability

$$\frac{\pi(\theta', \epsilon')}{\pi(\theta, \epsilon)}\, \frac{q(\theta', \theta)\, q(\epsilon', \epsilon)}{q(\theta, \theta')\, q(\epsilon, \epsilon')}\, \frac{\min_k \hat\xi_k(\epsilon'|y, \theta')}{\min_k \hat\xi_k(\epsilon|y, \theta)}$$

ABCµ multiple errors

[© Ratmann et al., PNAS, 2009]

ABCµ for model choice

[© Ratmann et al., PNAS, 2009]

Questions about ABCµ

For each model under comparison, marginal posterior on ε used to
assess the fit of the model (HPD includes 0 or not).
Is the data informative about ε? [Identifiability]
How is the prior π(ε) impacting the comparison?
How is using both ξ(ε|x0, θ) and $\pi_\epsilon(\epsilon)$ compatible with a
standard probability model? [remindful of Wilkinson]
Where is the penalisation for complexity in the model
comparison?
[X, Mengersen & Chen, 2010, PNAS]

ABC-PRC

Another sequential version producing a sequence of Markov
transition kernels $K_t$ and of samples $(\theta_1^{(t)}, \ldots, \theta_N^{(t)})$ (1 ≤ t ≤ T)

ABC-PRC Algorithm
1. Select $\theta^\star$ at random from previous $\theta_i^{(t-1)}$'s with probabilities $\omega_i^{(t-1)}$ (1 ≤ i ≤ N).
2. Generate
   $$\theta_i^{(t)} \sim K_t(\theta|\theta^\star), \qquad x \sim f(x|\theta_i^{(t)}),$$
3. Check that ρ(x, y) < ε, otherwise start again.

[Sisson et al., 2007]

Why PRC?

Partial rejection control: Resample from a population of weighted
particles by pruning away particles with weights below threshold C,
replacing them by new particles obtained by propagating an
existing particle by an SMC step and modifying the weights
accordingly.
[Liu, 2001]
PRC justification in ABC-PRC:
Suppose we then implement the PRC algorithm for some
c > 0 such that only identically zero weights are smaller
than c
Trouble is, there is no such c...

ABC-PRC weight

Probability $\omega_i^{(t)}$ computed as
$$\omega_i^{(t)} \propto \pi(\theta_i^{(t)})\, L_{t-1}(\theta^\star|\theta_i^{(t)})\, \{\pi(\theta^\star)\, K_t(\theta_i^{(t)}|\theta^\star)\}^{-1},$$

where $L_{t-1}$ is an arbitrary transition kernel.
In case
$$L_{t-1}(\theta'|\theta) = K_t(\theta|\theta'),$$
all weights are equal under a uniform prior.
Inspired from Del Moral et al. (2006), who use backward kernels
$L_{t-1}$ in SMC to achieve unbiasedness

ABC-PRC bias

Lack of unbiasedness of the method


A mixture example (1)

Toy model of Sisson et al. (2007): if

$$\theta \sim U(-10, 10), \qquad x|\theta \sim 0.5\, N(\theta, 1) + 0.5\, N(\theta, 1/100),$$

then the posterior distribution associated with y = 0 is the normal
mixture
$$\theta|y = 0 \sim 0.5\, N(0, 1) + 0.5\, N(0, 1/100)$$
restricted to [−10, 10].
Furthermore, true target available as

$$\pi(\theta \mid |x| < \epsilon) \propto \Phi(\epsilon - \theta) - \Phi(-\epsilon - \theta) + \Phi(10(\epsilon - \theta)) - \Phi(-10(\epsilon + \theta)).$$

“Ugly, squalid graph...”
[Figure: two rows of five panels, each plotted against θ from −3 to 3 with vertical range 0 to 1]

Comparison of τ = 0.15 and τ = 1/0.15 in $K_t$

A PMC version
Use of the same kernel idea as ABC-PRC but with IS correction
(check PMC)
Generate a sample at iteration t by

$$\hat\pi_t(\theta^{(t)}) \propto \sum_{j=1}^N \omega_j^{(t-1)} K_t(\theta^{(t)}|\theta_j^{(t-1)})$$

modulo acceptance of the associated $x_t$, and use an importance
weight associated with an accepted simulation $\theta_i^{(t)}$

$$\omega_i^{(t)} \propto \pi(\theta_i^{(t)}) / \hat\pi_t(\theta_i^{(t)}).$$

Still likelihood free

Our ABC-PMC algorithm
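Given a decreasing sequence of approximation levels $\epsilon_1 \ge \ldots \ge \epsilon_T$,

1. At iteration t = 1,
   For i = 1, ..., N
     Simulate $\theta_i^{(1)} \sim \pi(\theta)$ and $x \sim f(x|\theta_i^{(1)})$ until $\rho(x, y) < \epsilon_1$
     Set $\omega_i^{(1)} = 1/N$
   Take $\tau_2^2$ as twice the empirical variance of the $\theta_i^{(1)}$'s
2. At iteration 2 ≤ t ≤ T,
   For i = 1, ..., N, repeat
     Pick $\theta_i^\star$ from the $\theta_j^{(t-1)}$'s with probabilities $\omega_j^{(t-1)}$
     generate $\theta_i^{(t)}|\theta_i^\star \sim N(\theta_i^\star, \tau_t^2)$ and $x \sim f(x|\theta_i^{(t)})$
   until $\rho(x, y) < \epsilon_t$
   Set $\omega_i^{(t)} \propto \pi(\theta_i^{(t)}) \Big/ \sum_{j=1}^N \omega_j^{(t-1)} \varphi\big(\tau_t^{-1}\{\theta_i^{(t)} - \theta_j^{(t-1)}\}\big)$
   Take $\tau_{t+1}^2$ as twice the weighted empirical variance of the $\theta_i^{(t)}$'s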
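
A sketch of this ABC-PMC scheme on the mixture toy model of Sisson et al. (2007) with y = 0. The tolerance schedule `eps_seq`, the number of particles, the absolute-value distance, and the helper names are assumptions made for the illustration. Proposals falling outside the prior support are redrawn, since they would otherwise receive a zero weight.

```r
set.seed(7)
y <- 0
N <- 1000
eps_seq <- c(2, 0.5, 0.1)                        # hypothetical decreasing tolerances
prior_d <- function(th) dunif(th, -10, 10)
# pseudo-data from 0.5 N(theta, 1) + 0.5 N(theta, 1/100)
sim_x <- function(th) rnorm(1, th, ifelse(runif(1) < 0.5, 1, 0.1))

# iteration 1: plain rejection sampling from the prior
theta <- numeric(N)
for (i in 1:N) {
  repeat {
    th <- runif(1, -10, 10)
    if (abs(sim_x(th) - y) < eps_seq[1]) break
  }
  theta[i] <- th
}
w <- rep(1 / N, N)

for (t in 2:length(eps_seq)) {
  # scale: square root of twice the weighted empirical variance
  tau <- sqrt(2 * cov.wt(matrix(theta), wt = w)$cov[1, 1])
  new_theta <- numeric(N)
  for (i in 1:N) {
    repeat {
      th <- rnorm(1, sample(theta, 1, prob = w), tau)   # move a resampled particle
      if (prior_d(th) > 0 && abs(sim_x(th) - y) < eps_seq[t]) break
    }
    new_theta[i] <- th
  }
  # importance weight: prior density over the mixture-of-normals proposal density
  new_w <- prior_d(new_theta) /
    sapply(new_theta, function(th) sum(w * dnorm(th, theta, tau)))
  theta <- new_theta
  w <- new_w / sum(new_w)
}
c(weighted_mean = sum(w * theta), ess = 1 / sum(w^2))
```

The weights only involve the prior and the Gaussian proposal kernel, so the scheme remains likelihood free; the likelihood enters solely through the accept/reject step on the pseudo-data.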


Sequential Monte Carlo

SMC is a simulation technique to approximate a sequence of
related probability distributions $\pi_n$ with $\pi_0$ “easy” and $\pi_T$ as
target.
Iterated IS as PMC: particles moved from time n−1 to time n via
kernel $K_n$ and use of a sequence of extended targets $\tilde\pi_n$

$$\tilde\pi_n(z_{0:n}) = \pi_n(z_n) \prod_{j=0}^{n-1} L_j(z_{j+1}, z_j)$$

where the $L_j$'s are backward Markov kernels [check that $\pi_n(z_n)$ is a
marginal]
[Del Moral, Doucet & Jasra, Series B, 2006]

Sequential Monte Carlo (2)

Algorithm 3 SMC sampler
sample $z_i^{(0)} \sim \gamma_0(x)$ (i = 1, ..., N)
compute weights $w_i^{(0)} = \pi_0(z_i^{(0)}) / \gamma_0(z_i^{(0)})$
for t = 1 to N do
  if $ESS(w^{(t-1)}) < N_T$ then
    resample N particles $z^{(t-1)}$ and set weights to 1
  end if
  generate $z_i^{(t)} \sim K_t(z_i^{(t-1)}, \cdot)$ and set weights to
  $$w_i^{(t)} = w_i^{(t-1)}\, \frac{\pi_t(z_i^{(t)})\, L_{t-1}(z_i^{(t)}, z_i^{(t-1)})}{\pi_{t-1}(z_i^{(t-1)})\, K_t(z_i^{(t-1)}, z_i^{(t)})}$$
end for

ABC-SMC
[Del Moral, Doucet & Jasra, 2009]

True derivation of an SMC-ABC algorithm
Use of a kernel $K_n$ associated with target $\pi_{\epsilon_n}$ and derivation of the
backward kernel
$$L_{n-1}(z, z') = \frac{\pi_{\epsilon_n}(z')\, K_n(z', z)}{\pi_{\epsilon_n}(z)}$$

Update of the weights
$$w_{in} \propto w_{i(n-1)}\, \frac{\sum_{m=1}^M \mathbb{I}_{A_{\epsilon_n}}(x_{in}^m)}{\sum_{m=1}^M \mathbb{I}_{A_{\epsilon_{n-1}}}(x_{i(n-1)}^m)}$$

when $x_{in}^m \sim K(x_{i(n-1)}, \cdot)$

ABC-SMCM

Modification: Makes M repeated simulations of the pseudo-data z
given the parameter, rather than using a single [M = 1] simulation,
leading to weight that is proportional to the number of accepted
$z_i$'s
$$\omega(\theta) = \frac{1}{M} \sum_{i=1}^M \mathbb{I}_{\rho(\eta(y), \eta(z_i)) < \epsilon}$$

[limit in M means exact simulation from (tempered) target]

Properties of ABC-SMC

The ABC-SMC method properly uses a backward kernel L(z, z′) to
simplify the importance weight and to remove the dependence on
the unknown likelihood from this weight. Update of importance
weights is reduced to the ratio of the proportions of surviving
particles
Major assumption: the forward kernel K is supposed to be invariant
against the true target [tempered version of the true posterior]
Adaptivity in ABC-SMC algorithm only found in on-line
construction of the thresholds $\epsilon_t$, slowly enough to keep a large
number of accepted transitions

A mixture example (2)
Recovery of the target, whether using a ﬁxed standard deviation of
τ = 0.15 or τ = 1/0.15, or a sequence of adaptive τt ’s.
[Figure: five panels, each plotted against θ from −3 to 3 with vertical range 0 to 1]

Yet another ABC-PRC
Another version of ABC-PRC called PRC-ABC with a proposal
distribution $M_t$, S replications of the pseudo-data, and a PRC step:
PRC-ABC Algorithm
1. Select $\theta^\star$ at random from the $\theta_i^{(t-1)}$'s with probabilities $\omega_i^{(t-1)}$
2. Generate $\theta_i^{(t)} \sim K_t$, $x_1, \ldots, x_S \stackrel{iid}{\sim} f(x|\theta_i^{(t-1)})$, set
   $$\omega_i^{(t)} = \sum_s K_t\{\eta(x_s) - \eta(y)\} \Big/ K_t(\theta_i^{(t)}),$$
   and accept with probability $p^{(i)} = \omega_i^{(t)} / c_{t,N}$
3. Set $\omega_i^{(t)} \propto \omega_i^{(t)} / p^{(i)}$ (1 ≤ i ≤ N)

[Peters, Sisson & Fan, Stat Computing, 2012]

Use of a NP estimate $M_t(\theta) = \sum \omega_i^{(t-1)} \psi(\theta|\theta_i^{(t-1)})$ (with a
fixed variance?!)
S = 1 recommended for “greatest gains in sampler
performance”
$c_{t,N}$ upper quantile of non-zero weights at time t

Wilkinson’s exact BC
ABC approximation error (i.e. non-zero tolerance) replaced with
exact simulation from a controlled approximation to the target,
convolution of true posterior with kernel function
$$\pi_\epsilon(\theta, z|y) = \frac{\pi(\theta) f(z|\theta) K_\epsilon(y - z)}{\int \pi(\theta) f(z|\theta) K_\epsilon(y - z)\, dz\, d\theta},$$
with $K_\epsilon$ kernel parameterised by bandwidth ε.
[Wilkinson, 2008]
Theorem
The ABC algorithm based on the assumption of a randomised
observation $y = \tilde y + \xi$, $\xi \sim K_\epsilon$, and an acceptance probability of

$$K_\epsilon(y - z)/M$$

gives draws from the posterior distribution π(θ|y).
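
A sketch of this kernel-acceptance step in a case where the answer is known: a normal model with a Gaussian kernel. The model z ∼ N(θ, 1), the prior N(0, 3²), the observation and the bandwidth are assumptions made for the illustration; M is taken as the maximum of the kernel, $K_\epsilon(0)$.

```r
set.seed(8)
y <- 1.5                        # hypothetical observation
eps <- 0.5                      # bandwidth: sd of the Gaussian kernel K_eps
N <- 2e5
theta <- rnorm(N, sd = 3)       # prior draws
z <- rnorm(N, mean = theta)     # pseudo-data z ~ N(theta, 1)
M <- dnorm(0, sd = eps)         # upper bound on the kernel
accept <- runif(N) < dnorm(y - z, sd = eps) / M
abc <- theta[accept]
# exact posterior mean under the noisy model y = z + xi, xi ~ N(0, eps^2)
exact_mean <- y * 9 / (9 + 1 + eps^2)
c(abc = mean(abc), exact = exact_mean)
```

The accepted values are exact draws from the posterior of the model in which the observation carries an additional N(0, ε²) error, rather than approximate draws from the posterior of the original model.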


How exact a BC?

“Using ε to represent measurement error is
straightforward, whereas using ε to model the model
discrepancy is harder to conceptualize and not as
commonly used”
[Richard Wilkinson, 2008]

Pros
Pseudo-data from true model and observed data from noisy
model
Interesting perspective in that outcome is completely
controlled
Link with ABCµ and assuming y is observed with a
measurement error with density $K_\epsilon$
Relates to the theory of model approximation
[Kennedy & O’Hagan, 2001]
Cons
Requires $K_\epsilon$ to be bounded by M
True approximation error never assessed
Requires a modification of the standard ABC algorithm

ABC for HMMs

Specific case of a hidden Markov model

$$X_{t+1} \sim Q_\theta(X_t, \cdot), \qquad Y_{t+1} \sim g_\theta(\cdot|x_t)$$

where only $y^0_{1:n}$ is observed.
[Dean, Singh, Jasra, & Peters, 2011]
Use of specific constraints, adapted to the Markov structure:
$$\{y_1 \in B(y_1^0, \epsilon)\} \times \cdots \times \{y_n \in B(y_n^0, \epsilon)\}$$

ABC-MLE for HMMs

ABC-MLE defined by
$$\hat\theta_n^\epsilon = \arg\max_\theta P_\theta\big(Y_1 \in B(y_1^0, \epsilon), \ldots, Y_n \in B(y_n^0, \epsilon)\big)$$

Exact MLE for the likelihood [same basis as Wilkinson!]

$$p_\theta^\epsilon(y_1^0, \ldots, y_n^0)$$

corresponding to the perturbed process

$$(x_t, y_t + \epsilon z_t)_{1 \le t \le n}, \qquad z_t \sim U(B(0, 1))$$

