ABC, NCE, GANs & VAEs
Christian P. Robert
U. Paris Dauphine & Warwick U.
CDT masterclass, May 2022
Outline
1 Geyer’s 1994 logistic
2 Links with bridge sampling
3 Noise contrastive estimation
4 Generative models
5 Variational autoencoders (VAEs)
6 Generative adversarial networks
(GANs)
An early entry
A standard issue in Bayesian inference is to approximate the marginal likelihood (or evidence)
E_k = ∫_{Θ_k} π_k(ϑ_k) L_k(ϑ_k) dϑ_k,
aka the marginal likelihood.
[Jeffreys, 1939]
Bayes factor
For testing hypotheses H_0 : ϑ ∈ Θ_0 vs. H_a : ϑ ∉ Θ_0, under the prior
π(Θ_0)π_0(ϑ) + π(Θ_0^c)π_1(ϑ),
central quantity
B_01 = [π(Θ_0|x) / π(Θ_0^c|x)] / [π(Θ_0) / π(Θ_0^c)] = ∫_{Θ_0} f(x|ϑ)π_0(ϑ) dϑ / ∫_{Θ_0^c} f(x|ϑ)π_1(ϑ) dϑ
[Kass & Raftery, 1995; Jeffreys, 1939]
Bayes factor approximation
When approximating the Bayes factor
B_01 = ∫_{Θ_0} f_0(x|ϑ_0)π_0(ϑ_0) dϑ_0 / ∫_{Θ_1} f_1(x|ϑ_1)π_1(ϑ_1) dϑ_1
use of importance functions ϖ_0 and ϖ_1 and
B̂_01 = [n_0^{-1} Σ_{i=1}^{n_0} f_0(x|ϑ_0^i)π_0(ϑ_0^i)/ϖ_0(ϑ_0^i)] / [n_1^{-1} Σ_{i=1}^{n_1} f_1(x|ϑ_1^i)π_1(ϑ_1^i)/ϖ_1(ϑ_1^i)]
when ϑ_0^i ∼ ϖ_0(ϑ) and ϑ_1^i ∼ ϖ_1(ϑ)
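As a toy illustration (not in the original slides), a minimal R sketch of the above estimator for two hypothetical models M_0: ϑ ∼ N(−1,1) and M_1: ϑ ∼ N(+1,1), both with x|ϑ ∼ N(ϑ,1); the importance functions ϖ_0, ϖ_1 are Gaussian approximations of each posterior:
set.seed(1)
x=rnorm(20,0.5); n=length(x)
loglik=function(th) sapply(th,function(t) sum(dnorm(x,t,log=TRUE)))
evid=function(pm0,N=1e4){
  pm=(n*mean(x)+pm0)/(n+1); pv=1/(n+1)     # Gaussian approximation of the posterior
  th=rnorm(N,pm,sqrt(pv))                  # draws from the importance function
  mean(exp(loglik(th)+dnorm(th,pm0,1,log=TRUE)-dnorm(th,pm,sqrt(pv),log=TRUE)))
}
evid(-1)/evid(1)                           # importance sampling estimate of B01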
Forgetting and learning
Counterintuitive choice of importance function based on mixtures
If ϑ_{it} ∼ ϖ_i(ϑ) (i = 1, ..., I, t = 1, ..., T_i)
E_π[h(ϑ)] ≈ (1/T_i) Σ_{t=1}^{T_i} h(ϑ_{it}) π(ϑ_{it}) / ϖ_i(ϑ_{it})
replaced with
E_π[h(ϑ)] ≈ Σ_{i=1}^{I} Σ_{t=1}^{T_i} h(ϑ_{it}) π(ϑ_{it}) / Σ_{j=1}^{I} T_j ϖ_j(ϑ_{it})
Preserves unbiasedness and brings stability (while forgetting about the original index)
[Geyer, 1991, unpublished; Owen & Zhou, 2000]
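A minimal R sketch (not in the original slides) comparing the two estimators above on a toy case, π = N(0,1), h(ϑ) = ϑ², with two hypothetical proposals ϖ_1 = N(0,2²) and a badly located ϖ_2 = N(3,1):
set.seed(1)
T1=T2=1e4
w1=function(x) dnorm(x,0,2)               # proposal 1
w2=function(x) dnorm(x,3,1)               # proposal 2 (poorly located)
x1=rnorm(T1,0,2); x2=rnorm(T2,3,1)
h=function(x) x^2
est1=mean(h(x1)*dnorm(x1)/w1(x1))         # classic estimator, proposal 1
est2=mean(h(x2)*dnorm(x2)/w2(x2))         # classic estimator, proposal 2
xx=c(x1,x2)                               # mixture estimator, indices "forgotten"
mix=sum(h(xx)*dnorm(xx)/(T1*w1(xx)+T2*w2(xx)))
c(est1,est2,mix)                          # true value is 1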
Enters the logistic
If considering unnormalised ϖ_j's, i.e.
ϖ_j(ϑ) = c_j ϖ̃_j(ϑ),   j = 1, ..., I
and realisations ϑ_{it}'s from the mixture
ϖ(ϑ) = (1/T) Σ_{j=1}^{I} T_j ϖ_j(ϑ) = (1/T) Σ_{j=1}^{I} ϖ̃_j(ϑ) exp(η_j),   η_j := log(c_j) + log(T_j)
Geyer (1994) introduces allocation probabilities for the mixture components
p_j(ϑ, η) = ϖ̃_j(ϑ) e^{η_j} / Σ_{m=1}^{I} ϖ̃_m(ϑ) e^{η_m}
to construct a pseudo-log-likelihood
ℓ(η) := Σ_{i=1}^{I} Σ_{t=1}^{T_i} log p_i(ϑ_{it}, η)
Enters the logistic (2)
Estimating η as
η̂ = arg max_η ℓ(η)
produces the reverse logistic regression estimator of the constants c_j
• partial forgetting of the initial distribution
• objective function equivalent to a multinomial logistic regression with the log ϖ̃_i(ϑ_{it})'s as covariates
• randomness reversed from the T_i's to the ϑ_{it}'s
• constants c_j identifiable up to a constant
• resulting biased importance sampling estimator
Illustration
Special case when I = 2, c_1 = 1, T_1 = T_2 = T
−ℓ(c_2) = Σ_{t=1}^{T} log{1 + c_2 ϖ̃_2(ϑ_{1t})/ϖ_1(ϑ_{1t})} + Σ_{t=1}^{T} log{1 + ϖ_1(ϑ_{2t})/c_2 ϖ̃_2(ϑ_{2t})}
and
ϖ_1(ϑ) = ϕ(ϑ; 0, 3²),   ϖ̃_2(ϑ) = exp{−(ϑ − 5)²/2},   c_2 = 1/√(2π)
tg=function(x)exp(-(x-5)**2/2)                         # unnormalised tilde-varpi2
pl=function(a)                                          # -l(c2) as a function of a = c2
  sum(log(1+a*tg(x)/dnorm(x,0,3)))+sum(log(1+dnorm(y,0,3)/(a*tg(y))))
nrm=matrix(0,3,1e2)
for(i in 1:3)
  for(j in 1:1e2){
    x=rnorm(10**(i+1),0,3)                              # sample from varpi1
    y=rnorm(10**(i+1),5,1)                              # sample from varpi2
    nrm[i,j]=optimise(pl,c(.01,1))$minimum              # minimise -l(c2) in c2
  }
Illustration
[Figure: boxplots of the resulting estimates of c_2 over 100 replications, for the three sample sizes]
Illustration
Full logistic
[figure]
Outline
1 Geyer’s 1994 logistic
2 Links with bridge sampling
3 Noise contrastive estimation
4 Generative models
5 Variational autoencoders (VAEs)
6 Generative adversarial networks
(GANs)
Bridge sampling
Approximation of Bayes factors (and other ratios of integrals)
Special case:
If
π_1(ϑ_1|x) ∝ π̃_1(ϑ_1|x)
π_2(ϑ_2|x) ∝ π̃_2(ϑ_2|x)
live on the same space (Θ_1 = Θ_2), then
B_12 ≈ (1/n) Σ_{i=1}^{n} π̃_1(ϑ_i|x) / π̃_2(ϑ_i|x),   ϑ_i ∼ π_2(ϑ|x)
[Bennett, 1976; Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
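A minimal R sketch (not in the original slides) of this naive bridge estimator for two hypothetical unnormalised targets, π̃_1(x) = exp(−x²/2) and π̃_2(x) = exp(−x²/8), whose true ratio of normalising constants is 1/2:
set.seed(1)
p1t=function(x) exp(-x^2/2)               # unnormalised N(0,1), Z1=sqrt(2*pi)
p2t=function(x) exp(-x^2/8)               # unnormalised N(0,4), Z2=sqrt(8*pi)
th=rnorm(1e5,0,2)                         # draws from pi2
mean(p1t(th)/p2t(th))                     # estimates Z1/Z2 = 0.5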
Bridge sampling variance
The bridge sampling estimator does poorly if
var(B̂_12)/B_12² ≈ (1/n) E_{π_2}[ {(π_1(ϑ) − π_2(ϑ))/π_2(ϑ)}² ]
is large, i.e. if π_1 and π_2 have little overlap...
(Further) bridge sampling
General identity:
c_1/c_2 = B_12 = ∫ π̃_2(ϑ|x)α(ϑ)π_1(ϑ|x) dϑ / ∫ π̃_1(ϑ|x)α(ϑ)π_2(ϑ|x) dϑ   ∀ α(·)
≈ [ (1/n_1) Σ_{i=1}^{n_1} π̃_2(ϑ_{1i}|x)α(ϑ_{1i}) ] / [ (1/n_2) Σ_{i=1}^{n_2} π̃_1(ϑ_{2i}|x)α(ϑ_{2i}) ],   ϑ_{ji} ∼ π_j(ϑ|x)
Optimal bridge sampling
The optimal choice of auxiliary function is
α*(ϑ) = (n_1 + n_2) / (n_1 π_1(ϑ|x) + n_2 π_2(ϑ|x))
leading to
B̂_12 ≈ [ (1/n_1) Σ_{i=1}^{n_1} π̃_2(ϑ_{1i}|x) / {n_1 π_1(ϑ_{1i}|x) + n_2 π_2(ϑ_{1i}|x)} ] / [ (1/n_2) Σ_{i=1}^{n_2} π̃_1(ϑ_{2i}|x) / {n_1 π_1(ϑ_{2i}|x) + n_2 π_2(ϑ_{2i}|x)} ]
Optimal bridge sampling (2)
Reason:
Var(B̂_12)/B_12² ≈ (1/(n_1 n_2)) [ ∫ π_1(ϑ)π_2(ϑ)[n_1π_1(ϑ) + n_2π_2(ϑ)]α(ϑ)² dϑ / {∫ π_1(ϑ)π_2(ϑ)α(ϑ) dϑ}² − 1 ]
(by the δ method)
Drawback: dependence on the unknown normalising constants, solved iteratively
Back to the logistic
When T_1 = T_2 = T, optimising
−ℓ(c_2) = Σ_{t=1}^{T} log{1 + c_2 ϖ̃_2(ϑ_{1t})/ϖ_1(ϑ_{1t})} + Σ_{t=1}^{T} log{1 + ϖ_1(ϑ_{2t})/c_2 ϖ̃_2(ϑ_{2t})}
cancelling the derivative in c_2
Σ_{t=1}^{T} ϖ̃_2(ϑ_{1t}) / {c_2 ϖ̃_2(ϑ_{1t}) + ϖ_1(ϑ_{1t})} − c_2^{−1} Σ_{t=1}^{T} ϖ_1(ϑ_{2t}) / {ϖ_1(ϑ_{2t}) + c_2 ϖ̃_2(ϑ_{2t})} = 0
leads to
c_2' = [ Σ_{t=1}^{T} ϖ_1(ϑ_{2t}) / {ϖ_1(ϑ_{2t}) + c_2 ϖ̃_2(ϑ_{2t})} ] / [ Σ_{t=1}^{T} ϖ̃_2(ϑ_{1t}) / {c_2 ϖ̃_2(ϑ_{1t}) + ϖ_1(ϑ_{1t})} ]
EM step for the maximum pseudo-likelihood estimation
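A sketch (not in the original slides) of this fixed-point update, reusing the earlier toy example with ϖ_1 = N(0,3²) and ϖ̃_2(ϑ) = exp{−(ϑ−5)²/2}, so the target value is c_2 = 1/√(2π) ≈ 0.3989:
set.seed(1)
n=1e4
w1=function(x) dnorm(x,0,3)               # varpi1
w2t=function(x) exp(-(x-5)^2/2)           # unnormalised tilde-varpi2
th1=rnorm(n,0,3); th2=rnorm(n,5,1)        # samples from varpi1 and varpi2
c2=1                                      # arbitrary starting value
for(it in 1:50)                           # EM / fixed-point iterations
  c2=sum(w1(th2)/(w1(th2)+c2*w2t(th2)))/sum(w2t(th1)/(c2*w2t(th1)+w1(th1)))
c2                                        # close to 1/sqrt(2*pi) = 0.3989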
Mixtures as proposals
Design a specific mixture for simulation purposes, with density
ϕ̃(ϑ) ∝ ω_1 π(ϑ)L(ϑ) + ϕ(ϑ),
where ϕ(ϑ) is arbitrary (but normalised)
Note: ω_1 is not a probability weight
[Chopin & Robert, 2011]
evidence approximation by mixtures
Rao-Blackwellised estimate
ξ̂ = (1/T) Σ_{t=1}^{T} ω_1 π(ϑ^{(t)})L(ϑ^{(t)}) / {ω_1 π(ϑ^{(t)})L(ϑ^{(t)}) + ϕ(ϑ^{(t)})},
converges to ω_1 Z / {ω_1 Z + 1}
Deduce Ẑ from ω_1 Ẑ / {ω_1 Ẑ + 1} = ξ̂
Back to the bridge sampling optimal estimate
[Chopin & Robert, 2011]
Non-parametric MLE
"At first glance, the problem appears to be an exercise in calculus or numerical analysis, and not amenable to statistical formulation" Kong et al. (JRSS B, 2002)
• use of Fisher information
• non-parametric MLE based on simulations
• comparison of sampling schemes through variances
• Rao–Blackwellised improvements by invariance constraints
[Meng, 2011, IRCEM]
NPMLE
Observing
Y_{ij} ∼ F_i(t) = c_i^{−1} ∫_{−∞}^{t} ω_i(x) dF(x)
with ω_i known and F unknown
"Maximum likelihood estimate" defined by the weighted empirical cdf
Σ_{i,j} ω_i(y_{ij}) p(y_{ij}) δ_{y_{ij}}
maximising in p
Π_{ij} c_i^{−1} ω_i(y_{ij}) p(y_{ij})
Result such that
Σ_{ij} ĉ_r^{−1} ω_r(y_{ij}) / Σ_s n_s ĉ_s^{−1} ω_s(y_{ij}) = 1
[Vardi, 1985]
NPMLE
The same equation characterises the bridge sampling estimator
Σ_{ij} ĉ_r^{−1} ω_r(y_{ij}) / Σ_s n_s ĉ_s^{−1} ω_s(y_{ij}) = 1
[Gelman & Meng, 1998; Tan, 2004]
end of the Series B 2002 discussion
"...essentially every Monte Carlo activity may be interpreted as parameter estimation by maximum likelihood in a statistical model. We do not claim that this point of view is necessary; nor do we seek to establish a working principle from it."
• restriction to discrete support measures [may be] suboptimal [Ritov & Bickel, 1990; Robins et al., 1997, 2000, 2003]
• group averaging versions in-between multiple mixture estimators and quasi-Monte Carlo version [Owen & Zhou, 2000; Cornuet et al., 2012; Owen, 2003]
• statistical analogy provides at best a narrative thread
end of the Series B 2002 discussion
"The hard part of the exercise is to construct a submodel such that the gain in precision is sufficient to justify the additional computational effort"
• garden of forking paths, with infinite possibilities
• no free lunch (variance, budget, time)
• Rao–Blackwellisation may be detrimental in Markov setups
end of the 2002 discussion
“The statistician can considerably improve the efficiency of the
estimator by using the known values of different functionals
such as moments and probabilities of different sets. The
algorithm becomes increasingly efficient as the number of
functionals becomes larger. The result, however, is an extremely
complicated algorithm, which is not necessarily faster.” Y. Ritov
“...the analyst must violate the likelihood principle and eschew
semiparametric, nonparametric or fully parametric maximum
likelihood estimation in favour of non-likelihood-based locally
efficient semiparametric estimators.” J. Robins
Outline
1 Geyer’s 1994 logistic
2 Links with bridge sampling
3 Noise contrastive estimation
4 Generative models
5 Variational autoencoders (VAEs)
6 Generative adversarial networks
(GANs)
Noise contrastive estimation
New estimation principle for parameterised and unnormalised statistical models, also based on nonlinear logistic regression
Case of a parameterised model with density
p(x; α) = p̃(x; α) / Z(α)
and intractable normalising constant Z(α)
Estimating Z(α) as an extra parameter is impossible via maximum likelihood methods
Use of estimation techniques bypassing the constant, like contrastive divergence (Hinton, 2002) and score matching (Hyvärinen, 2005)
[Gutmann & Hyvärinen, 2010]
NCE principle
As in Geyer's method, given a sample x_1, ..., x_T from p(x; α)
• generate an artificial sample y_1, ..., y_T from a known distribution q
• maximise the classification log-likelihood (where ϑ = (α, c))
ℓ(ϑ; x, y) := Σ_{i=1}^{T} log h(x_i; ϑ) + Σ_{i=1}^{T} log{1 − h(y_i; ϑ)}
of a logistic regression model which discriminates the observed data from the simulated data, where
h(z; ϑ) = c p̃(z; α) / {c p̃(z; α) + q(z)}
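A toy R sketch (not in the original slides) of this classification objective, for an unnormalised Gaussian p̃(x; µ) = exp{−(x−µ)²/2} with unknown constant c = 1/√(2π) and a hypothetical noise distribution q = N(0, 2²):
set.seed(1)
n=1e4
x=rnorm(n,1,1)                            # data, true mu = 1
y=rnorm(n,0,2)                            # noise sample from q
q=function(z) dnorm(z,0,2)
nce=function(par){                        # par = (mu, nu = log c)
  cp=function(z) exp(par[2]-(z-par[1])^2/2)   # c * ptilde(z; mu)
  h=function(z) cp(z)/(cp(z)+q(z))
  -(sum(log(h(x)))+sum(log(1-h(y))))      # negative classification log-likelihood
}
fit=optim(c(0,0),nce)
c(mu=fit$par[1],c=exp(fit$par[2]))        # close to (1, 0.3989)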
NCE consistency
Objective function that converges (in T) to
J(ϑ) = E[log h(x; ϑ) + log{1 − h(y; ϑ)}]
Defining f(·) = log p(·; ϑ) and
J̃(f) = E_p[log r(f(x) − log q(x)) + log{1 − r(f(y) − log q(y))}]
Assuming q(·) positive everywhere,
• J̃(·) attains its maximum at f*(·) = log p(·), the true distribution
• maximisation performed without any normalisation constraint
NCE consistency
Under regularity conditions, assuming the true distribution belongs to the parametric family, the solution
ϑ̂_T = arg max_ϑ ℓ(ϑ; x, y)   (1)
converges to the true ϑ
Consequence: the log-normalisation constant is consistently estimated by maximising (1)
Convergence of noise contrastive estimation
Opposition of Monte Carlo MLE à la Geyer (1994, JASA)
L = (1/n) Σ_{i=1}^{n} [ log{p̃(x_i; ϑ) / p̃(x_i; ϑ^0)} − log{(1/m) Σ_{j=1}^{m} p̃(z_j; ϑ) / p̃(z_j; ϑ^0)} ]
where (1/m) Σ_{j=1}^{m} p̃(z_j; ϑ)/p̃(z_j; ϑ^0) ≈ Z(ϑ)/Z(ϑ^0), and
x_1, ..., x_n ∼ p*,   z_1, ..., z_m ∼ p(z; ϑ^0)
[Riou-Durand & Chopin, 2018]
Convergence of noise contrastive estimation
and of noise contrastive estimation à la Gutmann and Hyvärinen (2012)
L(ϑ, ν) = (1/n) Σ_{i=1}^{n} log q_{ϑ,ν}(x_i) + (1/m) Σ_{i=1}^{m} log[1 − q_{ϑ,ν}(z_i)]^{m/n}
log{q_{ϑ,ν}(z) / (1 − q_{ϑ,ν}(z))} = log{p̃(z; ϑ) / p̃(z; ϑ^0)} + ν + log(n/m)
x_1, ..., x_n ∼ p*,   z_1, ..., z_m ∼ p(z; ϑ^0)
[Riou-Durand & Chopin, 2018]
Poisson transform
Equivalent likelihoods
L(ϑ, ν) = (1/n) Σ_{i=1}^{n} log{p̃(x_i; ϑ) / p̃(x_i; ϑ^0)} + ν − e^ν Z(ϑ)/Z(ϑ^0)
and
L(ϑ, ν) = (1/n) Σ_{i=1}^{n} log{p̃(x_i; ϑ) / p̃(x_i; ϑ^0)} + ν − (e^ν/m) Σ_{j=1}^{m} p̃(z_j; ϑ) / p̃(z_j; ϑ^0)
sharing the same ϑ̂ as the originals
NCE consistency
Under mild assumptions, almost surely
ξ̂^{MCMLE}_{n,m} → ξ̂_n   and   ξ̂^{NCE}_{n,m} → ξ̂_n   as m → ∞
where ξ̂_n is the maximum likelihood estimator associated with x_1, ..., x_n ∼ p(·; ϑ) and
e^{−ν̂} = Z(ϑ̂) / Z(ϑ^0)
[Geyer, 1994; Riou-Durand & Chopin, 2018]
NCE asymptotics
Under less mild assumptions (more robust for NCE), asymptotic normality of both NCE and MC-MLE estimates as
n → +∞,   m/n → τ
√n (ξ̂^{MCMLE}_{n,m} − ξ*) ≈ N_d(0, Σ^{MCMLE})   and   √n (ξ̂^{NCE}_{n,m} − ξ*) ≈ N_d(0, Σ^{NCE})
with the important ordering
Σ^{MCMLE} ≥ Σ^{NCE}
showing that NCE dominates MCMLE in terms of mean square error (for iid simulations), except when ϑ^0 = ϑ*, where
Σ^{MCMLE} = Σ^{NCE} = (1 + τ^{−1}) Σ^{RMLNCE}
[Geyer, 1994; Riou-Durand & Chopin, 2018]
NCE asymptotics
[figure]
[Riou-Durand & Chopin, 2018]
NCE contrast distribution
Choice of q(·) free but
• easy to sample from
• must allow for an analytical expression of its log-pdf
• must be close to the true density p(·), so that the mean squared error E[|ϑ̂_T − ϑ*|²] is small
Learning an approximation q̂ to p(·), for instance via normalising flows
[Tabak and Turner, 2013; Jia & Seljak, 2019]
Density estimation by normalising flows
"A normalizing flow describes the transformation of a probability density through a sequence of invertible mappings. By repeatedly applying the rule for change of variables, the initial density 'flows' through the sequence of invertible mappings. At the end of this sequence we obtain a valid probability distribution and hence this type of flow is referred to as a normalizing flow."
[Rezende & Mohamed, 2015; Papamakarios et al., 2021]
Density estimation by normalising flows
Based on invertible and twice-differentiable transforms (diffeomorphisms) g_i(·) = g(·; η_i) of a standard distribution ϕ(·)
Representation
z = g_1 ∘ · · · ∘ g_p(x),   x ∼ ϕ(x)
Density of z by the Jacobian transform
ϕ(x(z)) × |det J_{g_1∘···∘g_p}(z)| = ϕ(x(z)) Π_i |dg_i/dz_{i−1}|^{−1},   where z_i = g_i(z_{i−1})
Flow defined as x → z_1 → · · · → z_p = z
[Rezende & Mohamed, 2015; Papamakarios et al., 2021]
Density estimation by normalising flows
Composition of transforms
(g_1 ∘ g_2)^{−1} = g_2^{−1} ∘ g_1^{−1}   (2)
det J_{g_1∘g_2}(u) = det J_{g_1}(g_2(u)) × det J_{g_2}(u)   (3)
[Rezende & Mohamed, 2015; Papamakarios et al., 2021]
Density estimation by normalising flows
Normalising flows are
• a flexible family of densities
• easy to train by optimisation (e.g., maximum likelihood estimation, variational inference)
• a neural version of density estimation and generative modelling
• trained from observed densities
• natural tools for approximate Bayesian inference (variational inference, ABC, synthetic likelihood)
Invertible linear-time transformations
Family of transformations
g(z) = z + u h(wᵀz + b),   u, w ∈ R^d, b ∈ R
with h a smooth element-wise non-linearity, with derivative h′
Jacobian term computed in O(d) time
ψ(z) = h′(wᵀz + b) w
|det ∂g/∂z| = |det(I_d + u ψ(z)ᵀ)| = |1 + uᵀψ(z)|
[Rezende & Mohamed, 2015]
Invertible linear-time transformations
Density q(z) obtained by transforming an initial density ϕ(·) through the sequence of maps g_i, i.e.
z = g_p ∘ · · · ∘ g_1(x)
and
log q(z) = log ϕ(x) − Σ_{k=1}^{p} log|1 + u_kᵀ ψ_k(z_{k−1})|
[Rezende & Mohamed, 2015]
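A minimal R sketch (not in the original slides) of this density propagation in one dimension, for a hypothetical flow of K = 3 planar maps g_k(z) = z + u_k tanh(w_k z + b_k) applied to standard normal draws; the parameter values are arbitrary, chosen so that u_k w_k > −1 (invertibility):
set.seed(1)
K=3
u=c(0.8,-0.4,1.2); w=c(1,2,0.5); b=c(0,-1,0.5)
x=rnorm(1e4)                              # base sample, phi = N(0,1)
z=x; logq=dnorm(x,log=TRUE)
for(k in 1:K){
  psi=(1-tanh(w[k]*z+b[k])^2)*w[k]        # h'(w z + b) w, with h = tanh
  logq=logq-log(abs(1+u[k]*psi))          # Jacobian correction
  z=z+u[k]*tanh(w[k]*z+b[k])              # apply g_k
}
head(cbind(z,logq))                       # flow sample with its log-density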
General theory of normalising flows
"Normalizing flows provide a general mechanism for defining expressive probability distributions, only requiring the specification of a (usually simple) base distribution and a series of bijective transformations."
T(u; ψ) = g_p(g_{p−1}(· · · g_1(u; η_1) · · · ; η_{p−1}); η_p)
[Papamakarios et al., 2021]
General theory of normalising flows
"...how expressive are flow-based models? Can they represent any distribution p(x), even if the base distribution is restricted to be simple? We show that this universal representation is possible under reasonable conditions on p(x)."
Obvious when considering the inverse conditional cdf transforms, assuming differentiability
[Papamakarios et al., 2021]
General theory of normalising flows
[Hyvärinen & Pajunen (1999)]
• Write p_x(x) = Π_{i=1}^{d} p(x_i | x_{<i})
• define z_i = F_i(x_i, x_{<i}) = P(X_i ≤ x_i | x_{<i})
• deduce that det J_F(x) = p_x(x)
• conclude that p_z(z) = 1, the Uniform on (0, 1)^d
[Papamakarios et al., 2021]
General theory of normalising flows
"Minimizing the Monte Carlo approximation of the Kullback–Leibler divergence [between the true and the model densities] is equivalent to fitting the flow-based model to the sample by maximum likelihood estimation."
ML estimate of the flow-based model parameters by
arg max_ψ Σ_{i=1}^{n} log ϕ(T^{−1}(x_i; ψ)) + log|det J_{T^{−1}}(x_i; ψ)|
Note the possible use of the reverse Kullback–Leibler divergence when learning an approximation (VA, IS, ABC) to a known [up to a constant] target p(x)
[Papamakarios et al., 2021]
Constructing flows
Autoregressive flows
Component-wise transform (i = 1, ..., d)
z′_i = τ(z_i; h_i)   [transformer]   where h_i = c_i(z_{1:(i−1)})   [conditioner]   = c_i(z_{1:(i−1)}; ϕ_i)
Jacobian
log|det J_ϕ(z)| = log Π_{i=1}^{d} ∂τ/∂z_i (z_i; h_i) = Σ_{i=1}^{d} log ∂τ/∂z_i (z_i; h_i)
Multiple choices for
• transformer τ(·; ϕ)
• conditioner c(·) (neural network)
[Papamakarios et al., 2021]
Practical considerations
"Implementing a flow often amounts to composing as many transformations as computation and memory will allow. Working with such deep flows introduces additional challenges of a practical nature."
• the more the merrier?!
• batch normalisation for maintaining stable gradients (between layers)
• fighting the curse of dimension ("evaluating T incurs an increasing computational cost as dimensionality grows") with multiscale architecture (clamping: component-wise stopping rules)
• "...early work on flow precursors dismissed the autoregressive approach as prohibitively expensive", addressed by sharing parameters within the conditioners c_i(·)
[Papamakarios et al., 2021]
Applications
"Normalizing flows have two primitive operations: density calculation and sampling. In turn, flows are effective in any application requiring a probabilistic model with either of those capabilities."
• density estimation [speed of convergence?]
• proxy generative model
• importance sampling for integration, by minimising the distance to the integrand or the IS variance [finite?]
• MCMC flow substitute for HMC
• optimised reparameterisation of the target for MCMC [exact?]
• variational approximation by maximising the evidence lower bound (ELBO) to the posterior on the parameter η = T(u, ϕ)
Σ_{i=1}^{n} log p(x^{obs}, T(u_i; ϕ)) [joint] + log|det J_T(u_i; ϕ)|
• substitutes for likelihood-free inference on either π(η|x^{obs}) or p(x^{obs}|η)
[Papamakarios et al., 2021]
A[nother] revolution in machine learning?
"One area where neural networks are being actively developed is density estimation in high dimensions: given a set of points x ∼ p(x), the goal is to estimate the probability density p(·). As there are no explicit labels, this is usually considered an unsupervised learning task. We have already discussed that classical methods based for instance on histograms or kernel density estimation do not scale well to high-dimensional data. In this regime, density estimation techniques based on neural networks are becoming more and more popular. One class of these neural density estimation techniques are normalizing flows."
[Cranmer et al., PNAS, 2020]
Crucially lacking
No connection with statistical density estimation, with no
general study of convergence (in training sample size) to the
true density
...or in evaluating approximation error (as in ABC)
[Kobyzev et al., 2019; Papamakarios et al., 2021]
Reconnecting with Geyer (1994)
"...neural networks can be trained to learn the likelihood ratio function p(x|ϑ_0)/p(x|ϑ_1) or p(x|ϑ_0)/p(x), where in the latter case the denominator is given by a marginal model integrated over a proposal or the prior (...) The key idea is closely related to the discriminator network in GANs mentioned above: a classifier is trained using supervised learning to discriminate two sets of data, though in this case both sets come from the simulator and are generated for different parameter points ϑ_0 and ϑ_1. The classifier output function can be converted into an approximation of the likelihood ratio between ϑ_0 and ϑ_1! This manifestation of the Neyman-Pearson lemma in a machine learning setting is often called the likelihood ratio trick."
[Cranmer et al., PNAS, 2020]
A comparison with MLE
[figures]
[Gutmann & Hyvärinen, 2012]
Outline
1 Geyer’s 1994 logistic
2 Links with bridge sampling
3 Noise contrastive estimation
4 Generative models
5 Variational autoencoders (VAEs)
6 Generative adversarial networks
(GANs)
Generative models
"Deep generative models that can learn via the principle of maximum likelihood differ with respect to how they represent or approximate the likelihood." I. Goodfellow
Likelihood function
L(ϑ|x_1, ..., x_n) ∝ Π_{i=1}^{n} p_model(x_i|ϑ)
leading to the MLE
ϑ̂(x_1, ..., x_n) = arg max_ϑ Σ_{i=1}^{n} log p_model(x_i|ϑ)
with
ϑ̂(x_1, ..., x_n) = arg min_ϑ D_KL(p_data || p_model(·|ϑ))
Likelihood complexity
Explicit solutions:
• domino representation ("fully visible belief networks")
p_model(x) = Π_{t=1}^{T} p_model(x_t | x_{1:t−1})
• "non-linear independent component analysis" (cf. normalising flows)
p_model(x) = p_z(g_ϕ^{−1}(x)) |∂g_ϕ^{−1}(x)/∂x|
Likelihood complexity
Further explicit solutions:
• variational approximations
log p_model(x; ϑ) ≥ L(x; ϑ)
represented by variational autoencoders
• Markov chain Monte Carlo (MCMC) maximisation
Likelihood complexity
Implicit solutions, involving sampling from the model p_model without computing the density:
• ABC algorithms for MLE derivation [Piccini & Anderson, 2017]
• generative stochastic networks [Bengio et al., 2014]
• generative adversarial networks (GANs) [Goodfellow et al., 2014]
Variational autoencoders (VAEs)
1 Geyer’s 1994 logistic
2 Links with bridge sampling
3 Noise contrastive estimation
4 Generative models
5 Variational autoencoders (VAEs)
6 Generative adversarial networks
(GANs)
Variational autoencoders
"... provide a principled framework for learning deep latent-variable models and corresponding inference models (...) can be viewed as two coupled, but independently parameterized models: the encoder or recognition model, and the decoder or generative model. These two models support each other. The recognition model delivers to the generative model an approximation to its posterior over latent random variables, which it needs to update its parameters inside an iteration of "expectation maximization" learning. Reversely, the generative model is a scaffolding of sorts for the recognition model to learn meaningful representations of the data (...) The recognition model is the approximate inverse of the generative model according to Bayes rule."
[Kingma & Welling, 2019]
Autoencoders
"An autoencoder is a neural network that is trained to attempt to copy its input x to its output r = g(h) via a hidden layer h = f(x) (...) [they] are designed to be unable to copy perfectly"
• undercomplete autoencoders (with dim(h) < dim(x))
• regularised autoencoders, with objective L(x, g ∘ f(x)) + Ω(h), where the penalty is akin to a log-prior
• denoising autoencoders (learning x on a noisy version x̃ of x)
• stochastic autoencoders (learning p_decode(x|h) for a given p_encode(h|x), w/o compatibility)
[Goodfellow et al., 2016, p.496]
Variational autoencoders (VAEs)
"The key idea behind the variational autoencoder is to attempt to sample values of Z that are likely to have produced X = x, and compute p(x) just from those."
Representation of the (marginal) likelihood p_ϑ(x) based on a latent variable z
p_ϑ(x) = ∫ p_ϑ(x|z) p_ϑ(z) dz
Machine learning is usually preoccupied only with maximising p_ϑ(x) (in ϑ) by simulating z efficiently (i.e., not from the prior)
log p_ϑ(x) − D[q_ϕ(·|x)||p_ϑ(·|x)] = E_{q_ϕ(·|x)}[log p_ϑ(x|Z)] − D[q_ϕ(·|x)||p_ϑ(·)]
since x is fixed (Bayesian analogy)
[Kingma & Welling, 2019]
Variational autoencoders (VAEs)
log p_ϑ(x) − D[q_ϕ(·|x)||p_ϑ(·|x)] = E_{q_ϕ(·|x)}[log p_ϑ(x|Z)] − D[q_ϕ(·|x)||p_ϑ(·)]
• lhs is the quantity to maximise (plus an error term, small for a good approximation q_ϕ, or a regularisation)
• rhs can be optimised by stochastic gradient descent when q_ϕ is manageable
• link with autoencoders, as q_ϕ(z|x) "encodes" x into z, and p_ϑ(x|z) "decodes" z to reconstruct x
[Doersch, 2021]
Variational autoencoders (VAEs)
"One major division in machine learning is generative versus discriminative modeling (...) To turn a generative model into a discriminator we need Bayes rule."
Representation of the (marginal) likelihood p_ϑ(x) based on a latent variable z
Variational approximation q_ϕ(z|x) (also called the encoder) to the posterior distribution on the latent variable z, p_ϑ(z|x), associated with the conditional distribution p_ϑ(x|z) (also called the decoder)
Example: q_ϕ(z|x) Normal distribution N_d(µ(x), Σ(x)) with
• (µ(x), Σ(x)) estimated by a deep neural network
• (µ(x), Σ(x)) estimated by ABC (synthetic likelihood)
[Kingma & Welling, 2014]
ELBO objective
Since
log p_ϑ(x) = E_{q_ϕ(z|x)}[log p_ϑ(x)]
 = E_{q_ϕ(z|x)}[log{p_ϑ(x, z) / p_ϑ(z|x)}]
 = E_{q_ϕ(z|x)}[log{p_ϑ(x, z) / q_ϕ(z|x)}] + E_{q_ϕ(z|x)}[log{q_ϕ(z|x) / p_ϑ(z|x)}]   (second term: KL ≥ 0)
the evidence lower bound (ELBO) is defined by
L_{ϑ,ϕ}(x) = E_{q_ϕ(z|x)}[log p_ϑ(x, z)] − E_{q_ϕ(z|x)}[log q_ϕ(z|x)]
and used as the objective function to be maximised in (ϑ, ϕ)
ELBO maximisation
Stochastic gradient step, one parameter at a time
In iid settings
L_{ϑ,ϕ}(x) = Σ_{i=1}^{n} L_{ϑ,ϕ}(x_i)
and
∇_ϑ L_{ϑ,ϕ}(x_i) = E_{q_ϕ(z|x_i)}[∇_ϑ log p_ϑ(x_i, Z)] ≈ ∇_ϑ log p_ϑ(x_i, z̃(x_i))
for one simulation z̃(x_i) ∼ q_ϕ(z|x_i), but ∇_ϕ L_{ϑ,ϕ}(x_i) is more difficult to compute
ELBO maximisation (2)
Reparameterisation (a form of normalising flow)
If z = g(x, ϕ, ε) ∼ q_ϕ(z|x) when ε ∼ r(ε),
E_{q_ϕ(z|x_i)}[h(Z)] = E_r[h(g(x_i, ϕ, ε))]
and
∇_ϕ E_{q_ϕ(z|x_i)}[h(Z)] = ∇_ϕ E_r[h ∘ g(x_i, ϕ, ε)] = E_r[∇_ϕ h ∘ g(x_i, ϕ, ε)] ≈ ∇_ϕ h ∘ g(x_i, ϕ, ε̃)
for one simulation ε̃ ∼ r, leading to an unbiased estimator of the gradient of the ELBO
∇_{ϑ,ϕ} {log p_ϑ(x, g(x, ϕ, ε)) − log q_ϕ(g(x, ϕ, ε)|x)}
or equivalently
∇_{ϑ,ϕ} {log p_ϑ(x, g(x, ϕ, ε)) − log r(ε) + log|∂z/∂ε|}
[Kingma & Welling, 2014]
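A toy R sketch (not in the original slides) of this reparameterised gradient, for the hypothetical conjugate model z ∼ N(0,1), x|z ∼ N(z,1) with a single observation and q_ϕ(z|x) = N(m, e^{2s}), so z = m + e^s ε; the gradients below are the hand-derived derivatives of log p_ϑ(x,z) − log q_ϕ(z|x) for this particular toy model:
set.seed(1)
x=1.5                                     # single observation
elbo_grad=function(m,s){
  eps=rnorm(1)                            # eps ~ r = N(0,1)
  z=m+exp(s)*eps                          # z = g(x, phi, eps)
  c(x-2*z,(x-2*z)*exp(s)*eps+1)           # d/dm and d/ds of log p(x,z)-log q(z|x)
}
m=0; s=0
for(t in 1:5000){                         # plain stochastic gradient ascent
  g=elbo_grad(m,s); m=m+0.01*g[1]; s=s+0.01*g[2]
}
c(m,exp(2*s))                             # exact posterior is N(x/2, 1/2)
Since both q_ϕ and the exact posterior are Gaussian, the ELBO maximiser coincides with the posterior mean and variance, which provides a check on the stochastic gradient output.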
Marginal likelihood estimation
Since
log p_ϑ(x) = log E_{q_ϕ(z|x)}[p_ϑ(x, Z) / q_ϕ(Z|x)]
an importance sampling estimate of the log-marginal likelihood is
log p_ϑ(x) ≈ log (1/T) Σ_{t=1}^{T} p_ϑ(x, z_t) / q_ϕ(z_t|x),   z_t ∼ q_ϕ(z|x)
When T = 1,
log p_ϑ(x)   [ideal objective]   ≈   log p_ϑ(x, z_1(x)) / q_ϕ(z_1(x)|x)   [ELBO objective]
the ELBO estimator.
Generative adversarial networks
1 Geyer’s 1994 logistic
2 Links with bridge sampling
3 Noise contrastive estimation
4 Generative models
5 Variational autoencoders (VAEs)
6 Generative adversarial networks
(GANs)
Generative adversarial networks (GANs)
"Generative adversarial networks (GANs) provide an algorithmic framework for constructing generative models with several appealing properties:
– they do not require a likelihood function to be specified, only a generating procedure;
– they provide samples that are sharp and compelling;
– they allow us to harness our knowledge of building highly accurate neural network classifiers."
[Mohamed & Lakshminarayanan, 2016]
Implicit generative models
Representation of random variables as
x = G_ϑ(z),   z ∼ µ(z)
where µ(·) is a reference distribution and G_ϑ a multi-layered and highly non-linear transform (as, e.g., in normalising flows)
• more general and flexible than "prescriptive" if implicit (black box)
• connected with pseudo-random variable generation
• call for likelihood-free inference on ϑ
[Mohamed & Lakshminarayanan, 2016]
Intractable likelihoods
Cases when the likelihood function f(y|ϑ) is unavailable and when the completion step
f(y|ϑ) = ∫_Z f(y, z|ϑ) dz
is impossible or too costly because of the dimension of z
MCMC cannot be implemented!
The ABC method
Bayesian setting: target is π(ϑ)f(x|ϑ)
When the likelihood f(x|ϑ) is not in closed form, likelihood-free rejection technique:
ABC algorithm
For an observation y ∼ f(y|ϑ), under the prior π(ϑ), keep jointly simulating
ϑ′ ∼ π(ϑ),   z ∼ f(z|ϑ′),
until the auxiliary variable z is equal to the observed value, z = y.
[Tavaré et al., 1997]
Why does it work?!
The proof is trivial:
f(ϑ_i) ∝ Σ_{z∈D} π(ϑ_i) f(z|ϑ_i) I_y(z) ∝ π(ϑ_i) f(y|ϑ_i) = π(ϑ_i|y).
[Accept–Reject 101]
ABC as A...pproximative
When y is a continuous random variable, equality z = y is replaced with a tolerance condition,
ρ{η(z), η(y)} ≤ ε
where ρ is a distance and η(y) defines a (not necessarily sufficient) statistic
Output distributed from
π(ϑ) P_ϑ{ρ(y, z) < ε} ∝ π(ϑ | ρ(η(y), η(z)) < ε)
[Pritchard et al., 1999]
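A minimal R sketch (not in the original slides) of ABC rejection on a toy Normal-mean model, with the sample mean as summary statistic, a hypothetical N(0,10²) prior, and the tolerance set as an empirical quantile of the distances:
set.seed(1)
y=rnorm(100,2,1); etay=mean(y)            # observed data and summary
N=1e4
theta=rnorm(N,0,10)                       # simulate from the prior
z=matrix(rnorm(100*N,theta,1),nrow=N)     # one pseudo-dataset per theta
rho=abs(rowMeans(z)-etay)                 # distance between summary statistics
keep=rho<quantile(rho,0.01)               # tolerance = 1% empirical quantile
c(mean(theta[keep]),sd(theta[keep]))      # ABC posterior mean and sd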
ABC posterior
The likelihood-free algorithm samples from the marginal in z of
π_ε(ϑ, z|y) = π(ϑ) f(z|ϑ) I_{A_{ε,y}}(z) / ∫_{A_{ε,y}×Θ} π(ϑ) f(z|ϑ) dz dϑ,
where A_{ε,y} = {z ∈ D | ρ(η(z), η(y)) < ε}.
The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:
π_ε(ϑ|y) = ∫ π_ε(ϑ, z|y) dz ≈ π(ϑ|η(y)).
MA example
Back to the MA(2) model
x_t = ε_t + Σ_{i=1}^{2} ϑ_i ε_{t−i}
Simple prior: uniform over the inverse [real and complex] roots in
Q(u) = 1 − Σ_{i=1}^{2} ϑ_i u^i
under identifiability conditions, i.e., a uniform prior over the identifiability zone
MA example (2)
ABC algorithm thus made of
1. picking a new value (ϑ_1, ϑ_2) in the triangle
2. generating an iid sequence (ε_t)_{−2<t≤T}
3. producing a simulated series (x′_t)_{1≤t≤T}
Distance: basic distance between the series
ρ((x′_t)_{1≤t≤T}, (x_t)_{1≤t≤T}) = Σ_{t=1}^{T} (x_t − x′_t)²
or distance between summary statistics like the 2 autocorrelations
τ_j = Σ_{t=j+1}^{T} x_t x_{t−j}
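A possible R sketch of this ABC scheme (not the code used for the figures in the slides), with hypothetical true values (ϑ_1, ϑ_2) = (0.6, 0.2), the two autocorrelation summaries above, and a 1% quantile tolerance:
set.seed(1)
Tobs=100
tauj=function(x,j) sum(x[(j+1):Tobs]*x[1:(Tobs-j)])   # autocorrelation summaries
y=arima.sim(n=Tobs,model=list(ma=c(0.6,0.2)))         # observed MA(2) series
sy=c(tauj(y,1),tauj(y,2))
N=1e4; theta=matrix(NA,N,2); dist=rep(NA,N)
for(i in 1:N){
  repeat{                                             # uniform prior on the triangle
    t1=runif(1,-2,2); t2=runif(1,-1,1)
    if(t2>abs(t1)-1) break
  }
  eps=rnorm(Tobs+2)
  z=eps[3:(Tobs+2)]+t1*eps[2:(Tobs+1)]+t2*eps[1:Tobs] # simulated series
  theta[i,]=c(t1,t2)
  dist[i]=sum((c(tauj(z,1),tauj(z,2))-sy)^2)
}
colMeans(theta[dist<quantile(dist,0.01),])            # crude ABC posterior means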
Comparison of distance impact
Evaluation of the tolerance on the ABC sample against both
distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
Comparison of distance impact
[Figures: ABC samples of θ_1 and θ_2 under both distances, for the four tolerance levels]
Occurrence of simulation in Econometrics
Simulation-based techniques in Econometrics
• Simulated method of moments
• Method of simulated moments
• Simulated pseudo-maximum-likelihood
• Indirect inference
[Gouriéroux & Monfort, 1996]
Simulated method of moments
Given observations y^o_{1:n} from a model
y_t = r(y_{1:(t−1)}, ε_t, ϑ),   ε_t ∼ g(·)
simulate ε*_{1:n}, derive
y*_t(ϑ) = r(y_{1:(t−1)}, ε*_t, ϑ)
and estimate ϑ by
arg min_ϑ Σ_{t=1}^{n} (y^o_t − y*_t(ϑ))²
Simulated method of moments
With the same simulated ε*_{1:n} and y*_t(ϑ) = r(y_{1:(t−1)}, ε*_t, ϑ), alternatively estimate ϑ by
arg min_ϑ { Σ_{t=1}^{n} y^o_t − Σ_{t=1}^{n} y*_t(ϑ) }²
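A toy R sketch (not in the original slides) of the first, per-observation criterion for a hypothetical AR(1) model y_t = ϑ y_{t−1} + ε_t, conditioning on the observed past and freezing one simulated noise sequence ε*:
set.seed(1)
n=500; theta0=0.7
y=as.numeric(arima.sim(n=n,model=list(ar=theta0)))  # observed AR(1) series
epstar=rnorm(n-1)                                   # frozen simulated noise
ystar=function(theta) theta*y[1:(n-1)]+epstar       # one-step-ahead simulation
crit=function(theta) sum((y[2:n]-ystar(theta))^2)
optimise(crit,c(-1,1))$minimum                      # close to theta0 = 0.7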
Indirect inference
Minimise (in ϑ) the distance between estimators β̂ based on pseudo-models for genuine observations and for observations simulated under the true model and the parameter ϑ.
[Gouriéroux, Monfort & Renault, 1993; Smith, 1993; Gallant & Tauchen, 1996]
Indirect inference (PML vs. PSE)
Example of the pseudo-maximum-likelihood (PML)
β̂(y) = arg max_β Σ_t log f*(y_t | β, y_{1:(t−1)})
leading to
arg min_ϑ ||β̂(y^o) − β̂(y_1(ϑ), ..., y_S(ϑ))||²
when
y^s(ϑ) ∼ f(y|ϑ),   s = 1, ..., S
Indirect inference (PML vs. PSE)
Example of the pseudo-score-estimator (PSE)
β̂(y) = arg min_β [ Σ_t ∂ log f*/∂β (y_t | β, y_{1:(t−1)}) ]²
leading to
arg min_ϑ ||β̂(y^o) − β̂(y_1(ϑ), ..., y_S(ϑ))||²
when
y^s(ϑ) ∼ f(y|ϑ),   s = 1, ..., S
AR(2) vs. MA(1) example
true (MA) model
yt = εt − ϑεt−1
and [wrong!] auxiliary (AR) model
yt = β1yt−1 + β2yt−2 + ut
R code
x=eps=rnorm(250)
x[2:250]=x[2:250]-0.5*x[1:249] #MA(1) series from the frozen noise
simeps=rnorm(250)
propeta=seq(-.99,.99,le=199)
dist=rep(0,199)
bethat=as.vector(arima(x,c(2,0,0),include.mean=FALSE)$coef) #AR(2) auxiliary fit
for (t in 1:199)
  dist[t]=sum((as.vector(arima(c(simeps[1],simeps[2:250]-propeta[t]*
    simeps[1:249]),c(2,0,0),include.mean=FALSE)$coef)-bethat)^2)
AR(2) vs. MA(1) example
One sample: [figure: distance as a function of θ]
Many samples: [figure]
Bayesian synthetic likelihood
Approach contemporary (?) of ABC where the distribution of a summary statistic s(·) is replaced with a parametric family, e.g.
g(s|ϑ) = ϕ(s; µ(ϑ), Σ(ϑ))
when ϑ is the [true] parameter value behind the data
Normal parameters µ(ϑ), Σ(ϑ) unknown in closed form and evaluated by simulation, based on a Monte Carlo sample of z_i ∼ f(z|ϑ)
Outcome used as a substitute in the posterior updating
[Wood, 2010; Drovandi & al., 2015; Price & al., 2018]
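A minimal R sketch (not in the original slides) of the synthetic likelihood for a toy N(ϑ, 1) model with the sample mean as summary, the Gaussian parameters being estimated from M simulated datasets and the posterior evaluated on a grid under a hypothetical N(0, 10²) prior:
set.seed(1)
xobs=rnorm(50,2,1); sobs=mean(xobs)
synlik=function(theta,M=200){
  s=replicate(M,mean(rnorm(50,theta,1)))  # simulated summaries at theta
  dnorm(sobs,mean(s),sd(s),log=TRUE)      # plug-in Normal approximation
}
grid=seq(0,4,by=0.05)
logpost=sapply(grid,synlik)+dnorm(grid,0,10,log=TRUE)
grid[which.max(logpost)]                  # close to the true theta = 2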
Asymptotics of BSL
Based on three approximations
1. representation of data information by summary statistic
information
2. Normal substitute for summary distribution
3. Monte Carlo versions of mean and variance
Existence of Bernstein-von Mises convergence under consistency
of selected covariance estimator
[Frazier & al., 2021]
Asymptotics of BSL
Assumptions
• Central Limit Theorem on S_n = s(x_{1:n})
• Identifiability of the parameter ϑ based on S_n
• Existence of some prior moment of Σ(ϑ)
• sub-Gaussian tail of the simulated summaries
• Monte Carlo effort in n^γ for γ > 0
Similarity with ABC sufficient conditions, but BSL point estimators generally asymptotically less efficient
[Frazier & al., 2018; Li & Fearnhead, 2018]
Asymptotics of ABC
For a sample y = y(n) and a tolerance ε = ε_n, when n → +∞, assuming a parametric model ϑ ∈ R^k, k fixed
• Concentration of the summary η(z): there exists b(ϑ) such that
η(z) − b(ϑ) = o_{P_ϑ}(1)
• Consistency:
Π_{ε_n}(‖ϑ − ϑ_0‖ ≤ δ | y) = 1 + o_p(1)
• Convergence rate: there exists δ_n = o(1) such that
Π_{ε_n}(‖ϑ − ϑ_0‖ ≤ δ_n | y) = 1 + o_p(1)
[Frazier & al., 2018]
Asymptotics of ABC
Under assumptions
(A1) ∃ σ_n → +∞ such that
P_ϑ(σ_n^{−1} ‖η(z) − b(ϑ)‖ > u) ≤ c(ϑ)h(u),   lim_{u→+∞} h(u) = 0
(A2)
Π(‖b(ϑ) − b(ϑ_0)‖ ≤ u) ≳ u^D,   u ≈ 0
posterior consistency and posterior concentration rate λ_T that depends on the deviation control of d²{η(z), b(ϑ)}
posterior concentration rate for b(ϑ) bounded from below by O(ε_T)
[Frazier & al., 2018]
Asymptotics of ABC
Under (A1) and (A2),
Π_{ε_n}(‖b(ϑ) − b(ϑ_0)‖ ≲ ε_n + σ_n h^{−1}(ε_n^D) | y) = 1 + o_{p_0}(1)
If also ‖ϑ − ϑ_0‖ ≤ L‖b(ϑ) − b(ϑ_0)‖^α locally and ϑ → b(ϑ) is 1-1, then
Π_{ε_n}(‖ϑ − ϑ_0‖ ≲ ε_n^α + σ_n^α (h^{−1}(ε_n^D))^α =: δ_n | y) = 1 + o_{p_0}(1)
[Frazier & al., 2018]
Further ABC assumptions
(B1) Concentration of the summary η: Σ_n(ϑ) ∈ R^{k_1×k_1} is o(1),
Σ_n(ϑ)^{−1}{η(z) − b(ϑ)} ⇒ N_{k_1}(0, Id),   (Σ_n(ϑ)Σ_n(ϑ_0)^{−1})_n = C^o
(B2) b(ϑ) is C¹ and
‖ϑ − ϑ_0‖ ≲ ‖b(ϑ) − b(ϑ_0)‖
(B3) Dominated convergence and
lim_n P_ϑ(Σ_n(ϑ)^{−1}{η(z) − b(ϑ)} ∈ u + B(0, u_n)) / Π_j u_n(j) → ϕ(u)
[Frazier & al., 2018]
ABC asymptotic regime
Set Σ_n(ϑ) = σ_n D(ϑ) for ϑ ≈ ϑ_0 and Z^o = Σ_n(ϑ_0)^{−1}(η(y) − b(ϑ_0)), then under (B1) and (B2)
• when ε_n σ_n^{−1} → +∞
Π_{ε_n}[ε_n^{−1}(ϑ − ϑ_0) ∈ A | y] ⇒ U_{B_0}(A),   B_0 = {x ∈ R^k ; ‖b′(ϑ_0)^T x‖ ≤ 1}
• when ε_n σ_n^{−1} → c
Π_{ε_n}[Σ_n(ϑ_0)^{−1}(ϑ − ϑ_0) − Z^o ∈ A | y] ⇒ Q_c(A),   Q_c ≠ N
• when ε_n σ_n^{−1} → 0 and (B3) holds, set
V_n = [b′(ϑ_0)]^T Σ_n(ϑ_0) b′(ϑ_0)
then
Π_{ε_n}[V_n^{−1}(ϑ − ϑ_0) − Z̃^o ∈ A | y] ⇒ Φ(A)
[Frazier & al., 2018]
conclusion on ABC consistency
• asymptotic description of ABC: different regimes depending on how ε_n compares to σ_n
• no point in choosing ε_n arbitrarily small: just ε_n = o(σ_n)
• no gain in iterative ABC
• results under weak conditions, by not studying g(η(z)|ϑ)
[Frazier & al., 2018]
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 

Recently uploaded (20)

Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
The Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravityThe Black hole shadow in Modified Gravity
The Black hole shadow in Modified Gravity
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 

CDT 22 slides.pdf

  • 14. Bridge sampling Approximation of Bayes factors (and other ratios of integrals). Special case: if π1(ϑ1|x) ∝ π̃1(ϑ1|x) and π2(ϑ2|x) ∝ π̃2(ϑ2|x) live on the same space (Θ1 = Θ2), then B12 ≈ (1/n) Σ_{i=1}^n π̃1(ϑi|x)/π̃2(ϑi|x), with ϑi ∼ π2(ϑ|x) [Bennett, 1976; Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
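To make this estimator concrete, a minimal R sketch (toy unnormalised densities, not from the deck) where both normalising constants equal √(2π), so the true ratio is 1:
p1til=function(x) exp(-x^2/2)        # unnormalised N(0,1) density
p2til=function(x) exp(-(x-1)^2/2)    # unnormalised N(1,1) density
th2=rnorm(1e5,1,1)                   # draws from the (normalised) second target
mean(p1til(th2)/p2til(th2))          # bridge estimate of c1/c2, close to 1 here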
  • 15-16. Bridge sampling variance The bridge sampling estimator does poorly if var(B̂12)/B12² ≈ (1/n) Eπ2[{(π1(ϑ) − π2(ϑ))/π2(ϑ)}²] is large, i.e. if π1 and π2 have little overlap...
  • 17. (Further) bridge sampling General identity (consistent with the special case above): c1/c2 = B12 = ∫ π̃1(ϑ|x)α(ϑ)π2(ϑ|x)dϑ / ∫ π̃2(ϑ|x)α(ϑ)π1(ϑ|x)dϑ, for every α(·), estimated by [(1/n2) Σ_{i=1}^{n2} π̃1(ϑ2i|x)α(ϑ2i)] / [(1/n1) Σ_{i=1}^{n1} π̃2(ϑ1i|x)α(ϑ1i)], where ϑji ∼ πj(ϑ|x)
  • 18. Optimal bridge sampling The optimal choice of auxiliary function is α⋆(ϑ) = (n1 + n2)/{n1π1(ϑ|x) + n2π2(ϑ|x)}, leading to B̂12 ≈ [(1/n2) Σ_{i=1}^{n2} π̃1(ϑ2i|x)/{n1π1(ϑ2i|x) + n2π2(ϑ2i|x)}] / [(1/n1) Σ_{i=1}^{n1} π̃2(ϑ1i|x)/{n1π1(ϑ1i|x) + n2π2(ϑ1i|x)}]
  • 19-20. Optimal bridge sampling (2) Reason: Var(B̂12)/B12² ≈ (1/(n1n2)) [ ∫ π1(ϑ)π2(ϑ){n1π1(ϑ) + n2π2(ϑ)}α(ϑ)² dϑ / {∫ π1(ϑ)π2(ϑ)α(ϑ) dϑ}² − 1 ] (by the δ method) Drawback: dependence on the unknown normalising constants, solved iteratively
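Since α⋆ depends on the unknown ratio, the estimate is usually produced by a fixed-point iteration; a minimal R sketch on two toy unnormalised Gaussian densities (assumed example, true ratio 1/2):
p1til=function(x) exp(-x^2/2)              # c1 = sqrt(2*pi)
p2til=function(x) exp(-(x-2)^2/8)          # unnormalised N(2,4), c2 = 2*sqrt(2*pi)
n1=n2=1e4; th1=rnorm(n1,0,1); th2=rnorm(n2,2,2)
B=1                                        # initial guess of c1/c2
for (it in 1:50){
  num=mean(p1til(th2)/(n1*p1til(th2)+n2*B*p2til(th2)))
  den=mean(p2til(th1)/(n1*p1til(th1)+n2*B*p2til(th1)))
  B=num/den                                # optimal-bridge update of c1/c2
}
B                                          # should approach 1/2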
  • 21-23. Back to the logistic When T1 = T2 = T, optimising −ℓ(c2) = Σ_{t=1}^T log{1 + c2ϖ̃2(ϑ1t)/ϖ1(ϑ1t)} + Σ_{t=1}^T log{1 + ϖ1(ϑ2t)/c2ϖ̃2(ϑ2t)} and cancelling the derivative in c2, Σ_{t=1}^T ϖ̃2(ϑ1t)/{c2ϖ̃2(ϑ1t) + ϖ1(ϑ1t)} − c2⁻¹ Σ_{t=1}^T ϖ1(ϑ2t)/{ϖ1(ϑ2t) + c2ϖ̃2(ϑ2t)} = 0, leads to c2′ = Σ_{t=1}^T ϖ1(ϑ2t)/{ϖ1(ϑ2t) + c2ϖ̃2(ϑ2t)} / Σ_{t=1}^T ϖ̃2(ϑ1t)/{c2ϖ̃2(ϑ1t) + ϖ1(ϑ1t)}, an EM-type step for the maximum pseudo-likelihood estimation
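A small R sketch of this fixed-point/EM update on a toy pair of components (own example: ϖ1 the N(0,2) density, ϖ̃2(ϑ) = exp{−(ϑ−3)²/2}, so the true c2 is 1/√(2π)):
w1=function(x) dnorm(x,0,2)               # normalised first component
w2til=function(x) exp(-(x-3)^2/2)         # unnormalised second component
T0=1e4; th1=rnorm(T0,0,2); th2=rnorm(T0,3,1)
c2=1                                      # starting value
for (it in 1:100){
  num=sum(w1(th2)/(w1(th2)+c2*w2til(th2)))
  den=sum(w2til(th1)/(c2*w2til(th1)+w1(th1)))
  c2=num/den                              # EM-type update from the slide
}
c2; 1/sqrt(2*pi)                          # estimate vs. true constant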
  • 24-25. Mixtures as proposals Design a specific mixture for simulation purposes, with density φ̃(ϑ) ∝ ω1π(ϑ)L(ϑ) + φ(ϑ), where φ(ϑ) is arbitrary (but normalised) Note: ω1 is not a probability weight [Chopin & Robert, 2011]
  • 26. evidence approximation by mixtures Rao-Blackwellised estimate ξ̂ = (1/T) Σ_{t=1}^T ω1π(ϑ^(t))L(ϑ^(t)) / {ω1π(ϑ^(t))L(ϑ^(t)) + φ(ϑ^(t))}, which converges to ω1Z/{ω1Z + 1} Deduce Ẑ from ω1Ẑ/{ω1Ẑ + 1} = ξ̂ Back to the bridge sampling optimal estimate [Chopin & Robert, 2011]
  • 27-28. Non-parametric MLE “At first glance, the problem appears to be an exercise in calculus or numerical analysis, and not amenable to statistical formulation” Kong et al. (JRSS B, 2002) – use of Fisher information – non-parametric MLE based on simulations – comparison of sampling schemes through variances – Rao–Blackwellised improvements by invariance constraints [Meng, 2011, IRCEM]
  • 29-32. NPMLE Observing Yij ∼ Fi(t) = c_i⁻¹ ∫_{−∞}^t ωi(x) dF(x), with the ωi known and F unknown “Maximum likelihood estimate” defined by the weighted empirical cdf Σ_{i,j} ωi(yij)p(yij)δ_{yij}, maximising in p the product Π_{ij} c_i⁻¹ωi(yij)p(yij) Result such that Σ_{ij} ĉ_r⁻¹ωr(yij) / Σ_s ns ĉ_s⁻¹ωs(yij) = 1 [Vardi, 1985], which coincides with the bridge sampling estimator [Gelman & Meng, 1998; Tan, 2004]
  • 33. end of the Series B 2002 discussion “...essentially every Monte Carlo activity may be interpreted as parameter estimation by maximum likelihood in a statistical model. We do not claim that this point of view is necessary; nor do we seek to establish a working principle from it.” – restriction to discrete support measures [may be] suboptimal [Ritov & Bickel, 1990; Robins et al., 1997, 2000, 2003] – group averaging versions in-between multiple mixture estimators and quasi-Monte Carlo versions [Owen & Zhou, 2000; Cornuet et al., 2012; Owen, 2003] – the statistical analogy provides at best a narrative thread
  • 34. end of the Series B 2002 discussion “The hard part of the exercise is to construct a submodel such that the gain in precision is sufficient to justify the additional computational effort” – garden of forking paths, with infinite possibilities – no free lunch (variance, budget, time) – Rao–Blackwellisation may be detrimental in Markov setups
  • 35. end of the 2002 discussion “The statistician can considerably improve the efficiency of the estimator by using the known values of different functionals such as moments and probabilities of different sets. The algorithm becomes increasingly efficient as the number of functionals becomes larger. The result, however, is an extremely complicated algorithm, which is not necessarily faster.” Y. Ritov “...the analyst must violate the likelihood principle and eschew semiparametric, nonparametric or fully parametric maximum likelihood estimation in favour of non-likelihood-based locally efficient semiparametric estimators.” J. Robins
  • 36. Outline 1 Geyer’s 1994 logistic 2 Links with bridge sampling 3 Noise contrastive estimation 4 Generative models 5 Variational autoencoders (VAEs) 6 Generative adversarial networks (GANs)
  • 37-39. Noise contrastive estimation New estimation principle for parameterised and unnormalised statistical models, also based on nonlinear logistic regression Case of a parameterised model with density p(x; α) = p̃(x; α)/Z(α) and intractable normalising constant Z(α) Estimating Z(α) as an extra parameter is impossible via maximum likelihood methods Use of estimation techniques bypassing the constant, like contrastive divergence (Hinton, 2002) and score matching (Hyvärinen, 2005) [Gutmann & Hyvärinen, 2010]
  • 40-41. NCE principle As in Geyer’s method, given a sample x1, . . . , xT from p(x; α): – generate an artificial sample y1, . . . , yT from a known distribution q – maximise the classification log-likelihood (where ϑ = (α, c)) ℓ(ϑ; x, y) := Σ_{i=1}^T log h(xi; ϑ) + Σ_{i=1}^T log{1 − h(yi; ϑ)} of a logistic regression model which discriminates the observed data from the simulated data, where h(z; ϑ) = cp̃(z; α)/{cp̃(z; α) + q(z)}
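A minimal R sketch of this classification objective (own toy example: p̃(x; µ) = exp{−(x−µ)²/2} with unknown (µ, log c) and Gaussian contrast q = N(0, 2)):
xobs=rnorm(1000,1,1)                      # data, true mu=1, true c=1/sqrt(2*pi)
ynoise=rnorm(1000,0,2)                    # contrastive sample from q
nce=function(par){                        # par = (mu, log c)
  logit=function(z) par[2]-(z-par[1])^2/2-dnorm(z,0,2,log=TRUE)   # log{c p~(z)/q(z)}
  -sum(plogis(logit(xobs),log.p=TRUE))-sum(plogis(-logit(ynoise),log.p=TRUE))
}
fit=optim(c(0,0),nce)                     # minimises the negative classification log-lik
fit$par; c(1,-log(sqrt(2*pi)))            # estimates vs. true (mu, log c)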
  • 42-44. NCE consistency Objective function that converges (in T) to J(ϑ) = E[log h(x; ϑ) + log{1 − h(y; ϑ)}] Defining f(·) = log p(·; ϑ) and, with r(·) the logistic function, J̃(f) = Ep[log r(f(x) − log q(x)) + log{1 − r(f(y) − log q(y))}] Assuming q(·) positive everywhere, – J̃(·) attains its maximum at f⋆(·) = log p(·), the true distribution – the maximisation is performed without any normalisation constraint Under regularity conditions, assuming the true distribution belongs to the parametric family, the solution ϑ̂T = arg max_ϑ ℓ(ϑ; x, y) (1) converges to the true ϑ Consequence: the log-normalisation constant is consistently estimated by maximising (1)
  • 45-46. Convergence of noise contrastive estimation Opposition of Monte Carlo MLE à la Geyer (1994, JASA), L = (1/n) Σ_{i=1}^n log{p̃(xi; ϑ)/p̃(xi; ϑ0)} − log[(1/m) Σ_{j=1}^m p̃(zj; ϑ)/p̃(zj; ϑ0)] (the bracketed average being ≈ Z(ϑ)/Z(ϑ0)), x1, . . . , xn ∼ p∗, z1, . . . , zm ∼ p(z; ϑ0), and of noise contrastive estimation à la Gutmann and Hyvärinen (2012), L(ϑ, ν) = (1/n) Σ_{i=1}^n log qϑ,ν(xi) + (m/n)(1/m) Σ_{j=1}^m log[1 − qϑ,ν(zj)], with log{qϑ,ν(z)/(1 − qϑ,ν(z))} = log{p̃(z; ϑ)/p̃(z; ϑ0)} + ν + log(n/m) [Riou-Durand & Chopin, 2018]
  • 47. Poisson transform Equivalent likelihoods L(ϑ, ν) = (1/n) Σ_{i=1}^n log{p̃(xi; ϑ)/p̃(xi; ϑ0)} + ν − e^ν Z(ϑ)/Z(ϑ0) and L(ϑ, ν) = (1/n) Σ_{i=1}^n log{p̃(xi; ϑ)/p̃(xi; ϑ0)} + ν − (e^ν/m) Σ_{j=1}^m p̃(zj; ϑ)/p̃(zj; ϑ0), sharing the same ϑ̂ as the originals
  • 48. NCE consistency Under mild assumptions, almost surely ξ̂^MCMLE_{n,m} → ξ̂n and ξ̂^NCE_{n,m} → ξ̂n as m → ∞, the maximum likelihood estimator associated with x1, . . . , xn ∼ p(·; ϑ), and e^{−ν̂} = Z(ϑ̂)/Z(ϑ0) [Geyer, 1994; Riou-Durand & Chopin, 2018]
  • 49-50. NCE asymptotics Under less mild assumptions (more robust for NCE), asymptotic normality of both NCE and MC-MLE estimates as n → +∞, m/n → τ: √n(ξ̂^MCMLE_{n,m} − ξ∗) ≈ Nd(0, Σ^MCMLE) and √n(ξ̂^NCE_{n,m} − ξ∗) ≈ Nd(0, Σ^NCE), with the important ordering Σ^MCMLE ⪰ Σ^NCE, showing that NCE dominates MCMLE in terms of mean square error (for iid simulations), except when ϑ0 = ϑ∗, where Σ^MCMLE = Σ^NCE = (1 + τ⁻¹)Σ^RMLNCE [Geyer, 1994; Riou-Durand & Chopin, 2018]
  • 52-54. NCE contrast distribution Choice of q(·) free but it – should be easy to sample from – must allow for an analytical expression of its log-pdf – must be close to the true density p(·), so that the mean squared error E[|ϑ̂T − ϑ⋆|²] is small Learning an approximation q̂ of p(·), for instance via normalising flows [Tabak & Turner, 2013; Jia & Seljak, 2019]
  • 55-58. Density estimation by normalising flows “A normalizing flow describes the transformation of a probability density through a sequence of invertible mappings. By repeatedly applying the rule for change of variables, the initial density ‘flows’ through the sequence of invertible mappings. At the end of this sequence we obtain a valid probability distribution and hence this type of flow is referred to as a normalizing flow.” Based on invertible and twice-differentiable transforms (diffeomorphisms) gi(·) = g(·; ηi) of a standard distribution φ(·) Representation z = g1 ◦ · · · ◦ gp(x), x ∼ φ(x), i.e. the flow x → z1 → . . . → zp = z with zi = gi(z_{i−1}) Density of z by the Jacobian transform: φ(x(z)) × |det J_{g1◦···◦gp}(z)| = φ(x(z)) Π_i |dgi/dz_{i−1}|⁻¹ Composition of transforms: (g1 ◦ g2)⁻¹ = g2⁻¹ ◦ g1⁻¹ and det J_{g1◦g2}(u) = det J_{g1}(g2(u)) × det J_{g2}(u) [Rezende & Mohamed, 2015; Papamakarios et al., 2021]
  • 59. Density estimation by normalising flows Normalising flows are – a flexible family of densities – easy to train by optimisation (e.g., maximum likelihood estimation, variational inference) – a neural version of density estimation and generative modelling – trained from observed densities – natural tools for approximate Bayesian inference (variational inference, ABC, synthetic likelihood)
  • 60-69. Invertible linear-time transformations Family of transformations g(z) = z + uh(w′z + b), u, w ∈ R^d, b ∈ R, with h a smooth element-wise non-linearity with derivative h′ Jacobian term computed in O(d) time: with ψ(z) = h′(w′z + b)w, |det ∂g/∂z| = |det(Id + uψ(z)′)| = |1 + u′ψ(z)| Density q(z) obtained by transforming an initial density φ(z) through the sequence of maps gi, i.e. z = gp ◦ · · · ◦ g1(x) and log q(z) = log φ(x) − Σ_{k=1}^p log|1 + u′ψk(z_{k−1})| [Rezende & Mohamed, 2015]
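A minimal R sketch of one such planar layer and its log-density correction (toy 2-D parameter values, assumed):
h=function(t) tanh(t); hp=function(t) 1-tanh(t)^2
u=c(0.5,-0.3); w=c(1,2); b=0.1            # layer parameters (toy choices)
flow=function(z){
  a=sum(w*z)+b
  list(z=z+u*h(a),                        # transformed point
       logdet=log(abs(1+sum(u*w)*hp(a)))) # log|1 + u'psi(z)|, psi(z) = h'(w'z+b) w
}
z=rnorm(2); out=flow(z)
sum(dnorm(z,log=TRUE))-out$logdet         # log-density of out$z under the flow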
  • 70-73. General theory of normalising flows “Normalizing flows provide a general mechanism for defining expressive probability distributions, only requiring the specification of a (usually simple) base distribution and a series of bijective transformations.” T(u; ψ) = gp(g_{p−1}(. . . g1(u; η1) . . . ; η_{p−1}); ηp) “...how expressive are flow-based models? Can they represent any distribution p(x), even if the base distribution is restricted to be simple? We show that this universal representation is possible under reasonable conditions on p(x).” Obvious when considering the inverse conditional cdf transforms, assuming differentiability [Hyvärinen & Pajunen, 1999]: – write px(x) = Π_{i=1}^d p(xi|x_{1:i−1}) – define zi = Fi(xi, x_{1:i−1}) = P(Xi ≤ xi|x_{1:i−1}) – deduce that det JF(x) = p(x) – conclude that pz(z) = 1, uniform on (0, 1)^d “Minimizing the Monte Carlo approximation of the Kullback–Leibler divergence [between the true and the model densities] is equivalent to fitting the flow-based model to the sample by maximum likelihood estimation.” MLE: estimate the flow-based model parameters by arg max_ψ Σ_{i=1}^n log φ(T⁻¹(xi; ψ)) + log|det J_{T⁻¹}(xi; ψ)| Note the possible use of the reverse Kullback–Leibler divergence when learning an approximation (VA, IS, ABC) to a known [up to a constant] target p(x) [Papamakarios et al., 2021]
  • 75-95. Autoregressive flows Component-wise transform (i = 1, . . . , d) z′i = τ(zi; hi) (the transformer), where hi = ci(z_{1:(i−1)}; ϕi) (the conditioner) Since the Jacobian is triangular, log|det Jϕ(z)| = Σ_{i=1}^d log|∂τ(zi; hi)/∂zi| Table 1: multiple choices for – the transformer τ(·; ϕ) – the conditioner c(·) (neural network) [Papamakarios et al., 2021]
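A minimal R sketch of a two-dimensional affine autoregressive flow (toy hand-coded conditioners; real flows would use neural networks for mu2 and s2):
mu2=function(x1) 0.5*x1                  # toy conditioner for the location
s2 =function(x1) 0.3*tanh(x1)            # toy conditioner for the log-scale
forward=function(z){                     # base draw z -> data point x, with log|det J|
  x1=1+exp(0.2)*z[1]
  x2=mu2(x1)+exp(s2(x1))*z[2]
  list(x=c(x1,x2), logdet=0.2+s2(x1))
}
logdens=function(x){                     # model log-density via the inverse transform
  z1=(x[1]-1)/exp(0.2); z2=(x[2]-mu2(x[1]))/exp(s2(x[1]))
  sum(dnorm(c(z1,z2),log=TRUE))-(0.2+s2(x[1]))
}
out=forward(rnorm(2)); logdens(out$x)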
  • 96-97. Practical considerations “Implementing a flow often amounts to composing as many transformations as computation and memory will allow. Working with such deep flows introduces additional challenges of a practical nature.” – the more the merrier?! – batch normalisation for maintaining stable gradients (between layers) – fighting the curse of dimension (“evaluating T incurs an increasing computational cost as dimensionality grows”) with multiscale architectures (clamping: component-wise stopping rules) – “...early work on flow precursors dismissed the autoregressive approach as prohibitively expensive”, addressed by sharing parameters within the conditioners ci(·) [Papamakarios et al., 2021]
  • 98-99. Applications “Normalizing flows have two primitive operations: density calculation and sampling. In turn, flows are effective in any application requiring a probabilistic model with either of those capabilities.” – density estimation [speed of convergence?] – proxy generative model – importance sampling for integration, by minimising the distance to the integrand or the IS variance [finite?] – MCMC flow substitute for HMC – optimised reparameterisation of the target for MCMC [exact?] – variational approximation by maximising the evidence lower bound (ELBO) to the posterior on the parameter η = T(u, ϕ), Σ_{i=1}^n log p(x^obs, T(ui; ϕ)) + log|det JT(ui; ϕ)| – substitutes for likelihood-free inference on either π(η|x^obs) or p(x^obs|η) [Papamakarios et al., 2021]
  • 100. A[nother] revolution in machine learning? “One area where neural networks are being actively de- veloped is density estimation in high dimensions: given a set of points x ∼ p(x), the goal is to estimate the probability density p(·). As there are no explicit la- bels, this is usually considered an unsupervised learning task. We have already discussed that classical methods based for instance on histograms or kernel density esti- mation do not scale well to high-dimensional data. In this regime, density estimation techniques based on neu- ral networks are becoming more and more popular. One class of these neural density estimation techniques are normalizing flows.” [Cranmer et al., PNAS, 2020]
  • 101. Crucially lacking No connection with statistical density estimation, with no general study of convergence (in training sample size) to the true density ...or in evaluating approximation error (as in ABC) [Kobyzev et al., 2019; Papamakarios et al., 2021]
  • 102. Reconnecting with Geyer (1994) “...neural networks can be trained to learn the likelihood ratio function p(x|ϑ0)/p(x|ϑ1) or p(x|ϑ0)/p(x), where in the latter case the denominator is given by a marginal model integrated over a proposal or the prior (...) The key idea is closely related to the discriminator network in GANs mentioned above: a classifier is trained us- ing supervised learning to discriminate two sets of data, though in this case both sets come from the simulator and are generated for different parameter points ϑ0 and ϑ1. The classifier output function can be converted into an approximation of the likelihood ratio between ϑ0 and ϑ1! This manifestation of the Neyman-Pearson lemma in a machine learning setting is often called the likeli- hood ratio trick.” [Cranmer et al., PNAS, 2020]
  • 103-105. A comparison with MLE [figures comparing the estimation accuracy of NCE with MLE, omitted] [Gutmann & Hyvärinen, 2012]
  • 106. Outline 1 Geyer’s 1994 logistic 2 Links with bridge sampling 3 Noise contrastive estimation 4 Generative models 5 Variational autoencoders (VAEs) 6 Generative adversarial networks (GANs)
  • 107. Generative models “Deep generative models that can learn via the principle of maximum likelihood differ with respect to how they represent or approximate the likelihood.” I. Goodfellow Likelihood function L(ϑ|x1, . . . , xn) ∝ Π_{i=1}^n pmodel(xi|ϑ), leading to the MLE estimate ϑ̂(x1, . . . , xn) = arg max_ϑ Σ_{i=1}^n log pmodel(xi|ϑ), i.e. ϑ̂(x1, . . . , xn) = arg min_ϑ DKL(pdata||pmodel(·|ϑ))
  • 108-151. Likelihood complexity Explicit solutions: – domino representation (“fully visible belief networks”) pmodel(x) = Π_{t=1}^T pmodel(xt|Pa(xt)) – “non-linear independent component analysis” (cf. normalising flows) pmodel(x) = pz(g_ϕ⁻¹(x)) |det ∂g_ϕ⁻¹(x)/∂x| – variational approximations log pmodel(x; ϑ) ≥ L(x; ϑ), represented by variational autoencoders – Markov chain Monte Carlo (MCMC) maximisation
  • 152. Likelihood complexity Implicit solutions, involving sampling from the model pmodel without computing the density: – ABC algorithms for MLE derivation [Piccini Anderson, 2017] – generative stochastic networks [Bengio et al., 2014] – generative adversarial networks (GANs) [Goodfellow et al., 2014]
  • 153. Variational autoencoders (VAEs) 1 Geyer’s 1994 logistic 2 Links with bridge sampling 3 Noise contrastive estimation 4 Generative models 5 Variational autoencoders (VAEs) 6 Generative adversarial networks (GANs)
  • 154. Variational autoencoders “... provide a principled framework for learning deep latent-variable models and corresponding inference mod- els (...) can be viewed as two coupled, but indepen- dently parameterized models: the encoder or recogni- tion model, and the decoder or generative model. These two models support each other. The recognition model delivers to the generative model an approximation to its posterior over latent random variables, which it needs to update its parameters inside an iteration of “ex- pectation maximization” learning. Reversely, the gener- ative model is a scaffolding of sorts for the recognition model to learn meaningful representations of the data (...) The recognition model is the approximate inverse of the generative model according to Bayes rule.” [Kingma Welling, 2019]
  • 155. Autoencoders “An autoencoder is a neural network that is trained to attempt to copy its input x to its output r = g(h) via a hidden layer h = f(x) (...) [they] are designed to be unable to copy perfectly” – undercomplete autoencoders (with dim(h) < dim(x)) – regularised autoencoders, with objective L(x, g ◦ f(x)) + Ω(h), where the penalty is akin to a log-prior – denoising autoencoders (learning x from a noisy version x̃ of x) – stochastic autoencoders (learning pdecode(x|h) for a given pencode(h|x), w/o compatibility) [Goodfellow et al., 2016, p.496]
  • 157-158. Variational autoencoders (VAEs) “The key idea behind the variational autoencoder is to attempt to sample values of Z that are likely to have produced X = x, and compute p(x) just from those.” Representation of the (marginal) likelihood pϑ(x) based on a latent variable z, pϑ(x) = ∫ pϑ(x|z)pϑ(z) dz Machine learning is usually preoccupied only with maximising pϑ(x) (in ϑ) by simulating z efficiently (i.e., not from the prior): log pϑ(x) − D[qϕ(·|x)||pϑ(·|x)] = E_{qϕ(·|x)}[log pϑ(x|Z)] − D[qϕ(·|x)||pϑ(·)], since x is fixed (Bayesian analogy) [Kingma & Welling, 2019]
  • 160-161. Variational autoencoders (VAEs) log pϑ(x) − D[qϕ(·|x)||pϑ(·|x)] = E_{qϕ(·|x)}[log pϑ(x|Z)] − D[qϕ(·|x)||pϑ(·)] – the lhs is the quantity to maximise (plus an error term, small for a good approximation qϕ, or regularisation) – the rhs can be optimised by stochastic gradient descent when qϕ is manageable – link with autoencoders, as qϕ(z|x) “encodes” x into z, and pϑ(x|z) “decodes” z to reconstruct x [Doersch, 2021]
  • 162-163. Variational autoencoders (VAEs) “One major division in machine learning is generative versus discriminative modeling (...) To turn a generative model into a discriminator we need Bayes rule.” Representation of the (marginal) likelihood pϑ(x) based on a latent variable z Variational approximation qϕ(z|x) (also called the encoder) to the posterior distribution on the latent variable z, pϑ(z|x), associated with the conditional distribution pϑ(x|z) (also called the decoder) Example: qϕ(z|x) Normal distribution Nd(µ(x), Σ(x)) with (µ(x), Σ(x)) estimated – by a deep neural network, or – by ABC (synthetic likelihood) [Kingma & Welling, 2014]
  • 164. ELBO objective Since log pϑ(x) = E_{qϕ(z|x)}[log pϑ(x)] = E_{qϕ(z|x)}[log pϑ(x, z)/pϑ(z|x)] = E_{qϕ(z|x)}[log pϑ(x, z)/qϕ(z|x)] + E_{qϕ(z|x)}[log qϕ(z|x)/pϑ(z|x)], where the last term is a Kullback–Leibler divergence and hence ≥ 0, the evidence lower bound (ELBO) is defined by L_{ϑ,ϕ}(x) = E_{qϕ(z|x)}[log pϑ(x, z)] − E_{qϕ(z|x)}[log qϕ(z|x)] and used as the objective function to be maximised in (ϑ, ϕ)
  • 165-166. ELBO maximisation Stochastic gradient step, one parameter at a time In iid settings L_{ϑ,ϕ}(x) = Σ_{i=1}^n L_{ϑ,ϕ}(xi) and ∇ϑ L_{ϑ,ϕ}(xi) = E_{qϕ(z|xi)}[∇ϑ log pϑ(xi, z)] ≈ ∇ϑ log pϑ(xi, z̃(xi)) for one simulation z̃(xi) ∼ qϕ(z|xi), but ∇ϕ L_{ϑ,ϕ}(xi) is more difficult to compute
  • 167-177. ELBO maximisation (2) Reparameterisation (a form of normalising flow): if z = g(x, ϕ, ε) ∼ qϕ(z|x) when ε ∼ r(ε), then E_{qϕ(z|xi)}[h(Z)] = E_r[h(g(x, ϕ, ε))] and ∇ϕ E_{qϕ(z|xi)}[h(Z)] = E_r[∇ϕ h ◦ g(x, ϕ, ε)] ≈ ∇ϕ h ◦ g(x, ϕ, ε̃) for one simulation ε̃ ∼ r This leads to an unbiased estimator of the gradient of the ELBO, ∇_{ϑ,ϕ}{log pϑ(x, g(x, ϕ, ε)) − log qϕ(g(x, ϕ, ε)|x)}, equivalently ∇_{ϑ,ϕ}{log pϑ(x, g(x, ϕ, ε)) − log r(ε) + log|det ∂g(x, ϕ, ε)/∂ε|} [Kingma & Welling, 2014]
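A minimal R sketch of reparameterised stochastic gradient ascent on a one-sample ELBO (own toy model: z ∼ N(0,1), x|z ∼ N(z,1), qϕ(z|x) = N(m, e^{2s}); the fixed-ε gradient is taken numerically):
x=1.5
elbo1=function(phi,eps){                  # single-sample ELBO with z = m + exp(s)*eps
  z=phi[1]+exp(phi[2])*eps
  dnorm(x,z,1,log=TRUE)+dnorm(z,0,1,log=TRUE)-dnorm(z,phi[1],exp(phi[2]),log=TRUE)
}
phi=c(0,0); lr=0.02
for (t in 1:5000){
  eps=rnorm(1); g=numeric(2)
  for (k in 1:2){e=c(0,0); e[k]=1e-5
    g[k]=(elbo1(phi+e,eps)-elbo1(phi-e,eps))/2e-5}   # gradient at fixed eps
  phi=phi+lr*g                             # stochastic gradient ascent on the ELBO
}
phi                                        # should hover around (0.75, log(1/sqrt(2)))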
  • 179-180. Marginal likelihood estimation Since log pϑ(x) = log E_{qϕ(z|x)}[pϑ(x, Z)/qϕ(Z|x)], an importance sampling estimate of the log-marginal likelihood is log pϑ(x) ≈ log (1/T) Σ_{t=1}^T pϑ(x, zt)/qϕ(zt|x), zt ∼ qϕ(z|x) When T = 1, log pϑ(x) (the ideal objective) ≈ log pϑ(x, z1(x))/qϕ(z1(x)|x), the ELBO objective/estimator
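For the toy Gaussian model of the previous sketch, the importance sampling estimate of the log-evidence can be checked against its closed form (marginally x ∼ N(0, 2)):
x=1.5; m=0.75; sdq=0.71; T0=1e4            # q = N(m, sdq^2) close to the posterior
z=rnorm(T0,m,sdq)
logw=dnorm(x,z,1,log=TRUE)+dnorm(z,0,1,log=TRUE)-dnorm(z,m,sdq,log=TRUE)
log(mean(exp(logw))); dnorm(x,0,sqrt(2),log=TRUE)   # IS estimate vs. exact log-evidence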
  • 181. Generative adversarial networks 1 Geyer’s 1994 logistic 2 Links with bridge sampling 3 Noise contrastive estimation 4 Generative models 5 Variational autoencoders (VAEs) 6 Generative adversarial networks (GANs)
  • 182. Generative adversarial networks (GANs) “Generative adversarial networks (GANs) provide an algorithmic framework for constructing generative mod- els with several appealing properties: – they do not require a likelihood function to be specified, only a generating procedure; – they provide samples that are sharp and compelling; – they allow us to harness our knowledge of building highly accurate neural network classifiers.” [Mohamed Lakshminarayanan, 2016]
  • 183-184. Implicit generative models Representation of random variables as x = Gϑ(z), z ∼ µ(z), where µ(·) is a reference distribution and Gϑ a multi-layered and highly non-linear transform (as, e.g., in normalising flows) – more general and flexible than “prescriptive” models, if implicit (black box) – connected with pseudo-random variable generation – calls for likelihood-free inference on ϑ [Mohamed & Lakshminarayanan, 2016]
  • 185-186. Intractable likelihoods Cases when the likelihood function f(y|ϑ) is unavailable and when the completion step f(y|ϑ) = ∫_Z f(y, z|ϑ) dz is impossible or too costly because of the dimension of z ⇒ MCMC cannot be implemented!
  • 187-189. The ABC method Bayesian setting: target is π(ϑ)f(x|ϑ) When the likelihood f(x|ϑ) is not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|ϑ), under the prior π(ϑ), keep jointly simulating ϑ′ ∼ π(ϑ), z ∼ f(z|ϑ′), until the auxiliary variable z is equal to the observed value, z = y [Tavaré et al., 1997]
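A minimal R sketch of this exact-matching version on a discrete toy model (y ∼ Binomial(10, ϑ), ϑ ∼ U(0, 1)), where the kept draws follow the exact posterior Beta(1 + y, 11 − y):
y=7; N=1e5
theta=runif(N)                         # prior draws
z=rbinom(N,10,theta)                   # pseudo-data
post=theta[z==y]                       # accepted parameter values
c(mean(post),(1+y)/12)                 # ABC mean vs. exact posterior mean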
  • 190. Why does it work?! The proof is trivial: f(ϑi) ∝ Σ_{z∈D} π(ϑi)f(z|ϑi)I_y(z) ∝ π(ϑi)f(y|ϑi) ∝ π(ϑi|y) [Accept–Reject 101]
  • 191-192. ABC as A...pproximative When y is a continuous random variable, the equality z = y is replaced with a tolerance condition ρ{η(z), η(y)} ≤ ε, where ρ is a distance and η(y) defines a (not necessarily sufficient) statistic Output distributed from π(ϑ)Pϑ{ρ(y, z) < ε} ∝ π(ϑ|ρ(η(y), η(z)) < ε) [Pritchard et al., 1999]
  • 193-194. ABC posterior The likelihood-free algorithm samples from the marginal in z of πε(ϑ, z|y) = π(ϑ)f(z|ϑ)I_{Aε,y}(z) / ∫_{Aε,y×Θ} π(ϑ)f(z|ϑ) dz dϑ, where Aε,y = {z ∈ D : ρ(η(z), η(y)) < ε} The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution: πε(ϑ|y) = ∫ πε(ϑ, z|y) dz ≈ π(ϑ|η(y))
  • 195-196. MA example Back to the MA(2) model xt = εt + Σ_{i=1}^2 ϑi ε_{t−i} Simple prior: uniform over the inverse [real and complex] roots of Q(u) = 1 − Σ_{i=1}^2 ϑi u^i, i.e. uniform over the identifiability zone
  • 197-198. MA example (2) ABC algorithm thus made of 1. picking a new value (ϑ1, ϑ2) in the triangle 2. generating an iid sequence (εt)_{−2<t≤T} 3. producing a simulated series (x′t)_{1≤t≤T} Distance: basic distance between the series ρ((x′t)_{1≤t≤T}, (xt)_{1≤t≤T}) = Σ_{t=1}^T (xt − x′t)², or distance between summary statistics like the two autocorrelations τj = Σ_{t=j+1}^T xt x_{t−j}
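A minimal R sketch of this ABC scheme with the autocorrelation summaries (assumed true value ϑ = (0.6, 0.2) and a crude 1% quantile tolerance):
n=200; theta0=c(0.6,0.2)
eps=rnorm(n+2); xobs=eps[3:(n+2)]+theta0[1]*eps[2:(n+1)]+theta0[2]*eps[1:n]
tau=function(x) c(sum(x[-1]*x[-n]),sum(x[-(1:2)]*x[-((n-1):n)]))   # lag-1 and lag-2 summaries
sobs=tau(xobs)
N=1e4; prop=matrix(0,N,2); dist=rep(0,N)
for (i in 1:N){
  repeat{th=c(runif(1,-2,2),runif(1,-1,1))       # uniform over the MA(2) triangle
    if (th[1]+th[2] > -1 & th[2]-th[1] > -1) break}
  e=rnorm(n+2); z=e[3:(n+2)]+th[1]*e[2:(n+1)]+th[2]*e[1:n]
  prop[i,]=th; dist[i]=sum((tau(z)-sobs)^2)
}
post=prop[dist<=quantile(dist,0.01),]            # keep the 1% closest simulations
colMeans(post)                                   # crude ABC posterior means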
  • 199-201. Comparison of distance impact [figures omitted] Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
  • 202. Occurrence of simulation in Econometrics Simulation-based techniques in Econometrics: – simulated method of moments – method of simulated moments – simulated pseudo-maximum-likelihood – indirect inference [Gouriéroux & Monfort, 1996]
  • 203-204. Simulated method of moments Given observations y^o_{1:n} from a model yt = r(y_{1:(t−1)}, εt, ϑ), εt ∼ g(·), simulate ε⋆_{1:n}, derive y⋆t(ϑ) = r(y_{1:(t−1)}, ε⋆t, ϑ), and estimate ϑ by arg min_ϑ Σ_{t=1}^n (y^o_t − y⋆t(ϑ))², or by the moment version arg min_ϑ {Σ_{t=1}^n y^o_t − Σ_{t=1}^n y⋆t(ϑ)}²
  • 205. Indirect inference Minimise (in ϑ) the distance between estimators ^ β based on pseudo-models for genuine observations and for observations simulated under the true model and the parameter ϑ. [Gouriéroux, Monfort, Renault, 1993; Smith, 1993; Gallant Tauchen, 1996]
  • 206-207. Indirect inference (PML vs. PSE) Example of the pseudo-maximum-likelihood (PML), β̂(y) = arg max_β Σ_t log f⋆(yt|β, y_{1:(t−1)}), and of the pseudo-score estimator (PSE), β̂(y) = arg min_β {Σ_t ∂log f⋆/∂β (yt|β, y_{1:(t−1)})}², both leading to arg min_ϑ ||β̂(y^o) − β̂(y1(ϑ), . . . , yS(ϑ))||², when ys(ϑ) ∼ f(y|ϑ), s = 1, . . . , S
  • 208. AR(2) vs. MA(1) example True (MA) model yt = εt − ϑε_{t−1} and [wrong!] auxiliary (AR) model yt = β1y_{t−1} + β2y_{t−2} + ut R code:
x=eps=rnorm(250)
x[2:250]=x[2:250]-0.5*x[1:249]                      # simulate the MA(1) series
simeps=rnorm(250)                                   # innovations reused for every candidate
propeta=seq(-.99,.99,length.out=199)
dist=rep(0,199)
bethat=as.vector(arima(x,c(2,0,0),include.mean=FALSE)$coef)   # AR(2) fit on the data
for (t in 1:199)
  dist[t]=sum((as.vector(arima(c(simeps[1],simeps[2:250]-propeta[t]*simeps[1:249]),
    c(2,0,0),include.mean=FALSE)$coef)-bethat)^2)
  • 209-210. AR(2) vs. MA(1) example One sample: [plot of the distance as a function of ϑ, omitted] Many samples: [histogram of the resulting indirect-inference estimates, omitted]
  • 211-213. Bayesian synthetic likelihood Approach contemporary (?) of ABC where the distribution of the summary statistic s(·) is replaced with a parametric family, e.g. g(s|ϑ) = φ(s; µ(ϑ), Σ(ϑ)), when ϑ is the [true] parameter value behind the data The Normal parameters µ(ϑ), Σ(ϑ) are unknown in closed form and evaluated by simulation, based on a Monte Carlo sample of zi ∼ f(z|ϑ) The outcome is used as a substitute in the posterior updating [Wood, 2010; Drovandi et al., 2015; Price et al., 2018]
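A minimal R sketch of the synthetic (log-)likelihood evaluation for a toy model (own example: x ∼ N(ϑ, 1), summaries = sample mean and standard deviation):
xobs=rnorm(50,2,1); sobs=c(mean(xobs),sd(xobs))
synloglik=function(theta,M=500){
  S=t(replicate(M,{z=rnorm(50,theta,1); c(mean(z),sd(z))}))   # simulated summaries
  mu=colMeans(S); Sig=cov(S)                                  # plug-in Normal parameters
  -0.5*log(det(Sig))-0.5*drop(t(sobs-mu)%*%solve(Sig)%*%(sobs-mu))   # up to a constant
}
synloglik(2); synloglik(0)               # higher near the data-generating value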
  • 214-215. Asymptotics of BSL Based on three approximations: 1. representation of the data information by the summary statistic information 2. Normal substitute for the summary distribution 3. Monte Carlo versions of the mean and variance Existence of Bernstein-von Mises convergence under consistency of the selected covariance estimator [Frazier et al., 2021]
  • 216-217. Asymptotics of BSL Assumptions: – Central Limit Theorem on Sn = s(x_{1:n}) – identifiability of the parameter ϑ based on Sn – existence of some prior moment of Σ(ϑ) – sub-Gaussian tails of the simulated summaries – Monte Carlo effort in n^γ for γ > 0 Similarity with the ABC sufficient conditions, but BSL point estimators are generally asymptotically less efficient [Frazier et al., 2018; Li & Fearnhead, 2018]
  • 218. Asymptotics of ABC For a sample y = y(n) and a tolerance ε = εn, when n → +∞, assuming a parametric model ϑ ∈ Rk, k fixed I Concentration of summary η(z): there exists b(ϑ) such that η(z) − b(ϑ) = oPϑ (1) I Consistency: Πεn (kϑ − ϑ0k ≤ δ|y) = 1 + op(1) I Convergence rate: there exists δn = o(1) such that Πεn (kϑ − ϑ0k ≤ δn|y) = 1 + op(1) [Frazier al., 2018]
  • 219. Asymptotics of ABC Under assumptions (A1) ∃σn → +∞ Pϑ σ−1 n kη(z) − b(ϑ)k u ≤ c(ϑ)h(u), lim u→+∞ h(u) = 0 (A2) Π(kb(ϑ) − b(ϑ0)k ≤ u) uD , u ≈ 0 posterior consistency and posterior concentration rate λT that depends on the deviation control of d2{η(z), b(ϑ)} posterior concentration rate for b(ϑ) bounded from below by O(εT ) [Frazier al., 2018]
  • 220. Asymptotics of ABC Under assumptions (A1) ∃σn → +∞ Pϑ σ−1 n kη(z) − b(ϑ)k u ≤ c(ϑ)h(u), lim u→+∞ h(u) = 0 (A2) Π(kb(ϑ) − b(ϑ0)k ≤ u) uD , u ≈ 0 then Πεn kb(ϑ) − b(ϑ0)k . εn + σnh−1 (εD n )|y = 1 + op0 (1) If also kϑ − ϑ0k ≤ Lkb(ϑ) − c(ϑ0)kα, locally and ϑ → b(ϑ) 1-1 Πεn (kϑ − ϑ0k . εα n + σα n(h−1 (εD n ))α | {z } δn |y) = 1 + op0 (1) [Frazier al., 2018]
  • 221. Further ABC assumptions I (B1) Concentration of summary η: Σn(ϑ) ∈ Rk1×k1 is o(1) Σn(ϑ)−1 {η(z)−b(ϑ)} ⇒ Nk1 (0, Id), (Σn(ϑ)Σn(ϑ0)−1 )n = Co I (B2) b(ϑ) is C1 and kϑ − ϑ0k . kb(ϑ) − b(ϑ0)k I (B3) Dominated convergence and lim n Pϑ(Σn(ϑ)−1{η(z) − b(ϑ)} ∈ u + B(0, un)) Q j un(j) → ϕ(u) [Frazier al., 2018]
• 222. ABC asymptotic regime Set Σn(ϑ) = σn D(ϑ) for ϑ ≈ ϑ0 and Zo = Σn(ϑ0)^−1(η(y) − b(ϑ0)), then under (B1) and (B2) I when εn σn^−1 → +∞, Πεn[εn^−1(ϑ − ϑ0) ∈ A|y] ⇒ U_{B0}(A), B0 = {x ∈ R^k ; ‖b′(ϑ0)^T x‖ ≤ 1} I when εn σn^−1 → c, Πεn[Σn(ϑ0)^−1(ϑ − ϑ0) − Zo ∈ A|y] ⇒ Qc(A), Qc ≠ N I when εn σn^−1 → 0 and (B3) holds, set Vn = [b′(ϑ0)]^T Σn(ϑ0) b′(ϑ0), then Πεn[Vn^−1(ϑ − ϑ0) − Z̃o ∈ A|y] ⇒ Φ(A) [Frazier et al., 2018]
• 223. Conclusion on ABC consistency I asymptotic description of ABC: different regimes depending on εn/σn I no point in choosing εn arbitrarily small: just εn = o(σn) I no gain in iterative ABC I results under weak conditions by not studying g(η(z)|ϑ) [Frazier et al., 2018]
• 224. Back to Geyer’s 1994 Since we can easily draw samples from the model, we can use any method that compares two sets of samples—one from the true data distribution and one from the model distribution—to drive learning. This is a process of density estimation-by-comparison, comprising two steps: comparison and estimation. For comparison, we test the hypothesis that the true data distribution p∗(x) and our model distribution qϑ(x) are equal, using the density difference p∗(x) − qϑ(x), or the density ratio p∗(x)/qϑ(x) (...) The density ratio can be computed by building a classifier to distinguish observed data from that generated by the model. [Mohamed Lakshminarayanan, 2016]
• 225. Class-probability estimation Closest to Geyer’s (1994) idea: Data xobs ∼ p∗(x) and simulated sample xsim ∼ qϑ(x) with same size Classification indicators yobs = 1 and ysim = 0 Then p∗(x)/qϑ(x) = P(Y = 1|x)/P(Y = 0|x) means that the frequency ratio of allocations to models p∗ and qϑ is an estimator of the ratio p∗/qϑ Learning P(Y = 1|x) via statistical or machine-learning tools P(Y = 1|x) = D(x; ϕ) [Mohamed Lakshminarayanan, 2016]
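A toy R illustration of the classification trick (my own illustrative choices, not from the paper): p∗ = N(0, 1), qϑ = N(1, 1.5²), and a quadratic logistic regression plays the role of D(x; ϕ), so that D/(1 − D) estimates the ratio p∗/qϑ with equal sample sizes
n=1e4
xobs=rnorm(n,0,1)      #"observed" data from p*
xsim=rnorm(n,1,1.5)    #simulated data from q
dat=data.frame(x=c(xobs,xsim),y=rep(1:0,each=n))
fit=glm(y~x+I(x^2),family=binomial,data=dat)  #classifier D(x;phi)
D=function(x)predict(fit,data.frame(x=x),type="response")
ratio=function(x)D(x)/(1-D(x))                #estimate of p*(x)/q(x)
ratio(0);dnorm(0,0,1)/dnorm(0,1,1.5)          #compare with the exact ratio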
  • 226. Proper scoring rules Learning about [discriminator] parameter ϕ via proper scoring rule Advantage of proper scoring rule is that global optimum is achieved iff qϑ = p∗ albeit with no convergence guarantees since non-convex optimisation [Mohamed Lakshminarayanan, 2016]
  • 227. Proper scoring rules Learning about [discriminator] parameter ϕ via proper scoring rule For instance, L(ϕ, ϑ) = Ep(x,y)[−Y log D(X; ϕ) − (1 − Y) log(1 − D(X; ϕ))] [Mohamed Lakshminarayanan, 2016]
  • 228. Proper scoring rules Learning about [discriminator] parameter ϕ via proper scoring rule For instance, L(ϕ, ϑ) = Ep∗(x)[− log D(X; ϕ)] + Eµ(z)[− log(1 − D(Gϑ(Z); ϕ))] Principle of generative adversarial networks (GANs) with score minimised in ϕ AND maximised in ϑ [Mohamed Lakshminarayanan, 2016]
• 229. Divergence minimisation Use of f-divergence Df[p∗||qϑ] = ∫ qϑ(x) f{p∗(x)/qϑ(x)} dx = Eqϑ[f(rϕ(X))] ≥ sup_t Ep∗[t(X)] − Eqϑ[f†(t(X))] where f convex with derivative f′ and Fenchel conjugate f† defined by f†(x∗) := − inf {f(x) − ⟨x∗, x⟩ : x ∈ X} [Mohamed Lakshminarayanan, 2016]
• 230. Divergence minimisation Use of f-divergence Df[p∗||qϑ] = ∫ qϑ(x) f{p∗(x)/qϑ(x)} dx = Eqϑ[f(rϕ(X))] ≥ sup_t Ep∗[t(X)] − Eqϑ[f†(t(X))] where f convex with derivative f′ and Fenchel conjugate f† Includes Kullback–Leibler divergence KL(p, q) = Ep[log{p(X)/q(X)}] [Mohamed Lakshminarayanan, 2016]
• 231. Divergence minimisation Use of f-divergence Df[p∗||qϑ] = ∫ qϑ(x) f{p∗(x)/qϑ(x)} dx = Eqϑ[f(rϕ(X))] ≥ sup_t Ep∗[t(X)] − Eqϑ[f†(t(X))] where f convex with derivative f′ and Fenchel conjugate f† Includes Kullback–Leibler and Jensen–Shannon divergences JS(p, q) = 1/2 KL(p, (p+q)/2) + 1/2 KL(q, (p+q)/2) [Mohamed Lakshminarayanan, 2016]
• 232. Divergence minimisation Use of f-divergence Df[p∗||qϑ] = ∫ qϑ(x) f{p∗(x)/qϑ(x)} dx = Eqϑ[f(rϕ(X))] ≥ sup_t Ep∗[t(X)] − Eqϑ[f†(t(X))] where f convex with derivative f′ and Fenchel conjugate f† Turning into bi-level optimisation of L = Ep∗[−f′(rϕ(X))] + Eqϑ[f†(f′(rϕ(X)))] (g) Minimise in ϕ the ratio loss L, minimising the negative variational lower bound, and minimise in ϑ the generative loss (g) to drive the ratio to one [Mohamed Lakshminarayanan, 2016]
• 233. Ratio matching Minimise the error between the true density ratio r∗(x) and its estimate rϕ(x) L = 1/2 ∫ qϑ(x)(rϕ(x) − r∗(x))² dx = 1/2 Eqϑ[rϕ(X)²] − Ep∗[rϕ(X)] + const. Equivalence with ratio loss derived using divergence minimisation [Mohamed Lakshminarayanan, 2016]
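A hedged R sketch of this least-squares ratio matching with a linear-in-kernels model r(x) = Σk ak K(x, ck) (Gaussian kernels; centres, bandwidth and ridge penalty are arbitrary choices of mine): the loss 1/2 Eqϑ[r²] − Ep∗[r] is then quadratic in a and solved in closed form
x=rnorm(2e3)                 #draws from p*
y=rnorm(2e3,1,1.5)           #draws from q_theta
cent=quantile(x,(1:20)/21)   #kernel centres
K=function(a)exp(-outer(a,cent,"-")^2/2)
H=crossprod(K(y))/length(y)  #estimate of E_q[K K']
h=colMeans(K(x))             #estimate of E_p*[K]
a=solve(H+1e-3*diag(20),h)   #ridge-regularised minimiser of .5 a'Ha - h'a
r=function(z)pmax(K(z)%*%a,0)#estimated ratio p*(z)/q(z)
r(0);dnorm(0)/dnorm(0,1,1.5) #compare with the exact ratio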
  • 234. Moment matching Compare moments of both distributions by minimising distance, using test [summary] statistics s(x) that provide moments of interest L(ϕ, ϑ) = (Ep∗ [s(X)] − Eqϑ [s(X)])2 = (Ep∗ [s(X)] − Eµ[s(G(Z; ϑ))])2 choice of test statistics critical: case of statistics defined within reproducing kernel Hilbert space, leading to maximum mean discrepancy [Mohamed Lakshminarayanan, 2016]
• 235. Moment matching Compare moments of both distributions by minimising distance, using test [summary] statistics s(x) that provide moments of interest L(ϕ, ϑ) = (Ep∗[s(X)] − Eqϑ[s(X)])² = (Ep∗[s(X)] − Eµ[s(G(Z; ϑ))])² “There is a great deal of opportunity for exchange between GANs, ABC and ratio estimation in aspects of scalability, applications, and theoretical understanding” [Mohamed Lakshminarayanan, 2016]
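A small R sketch of moment matching in an RKHS, i.e. a (biased, V-statistic) estimate of the squared maximum mean discrepancy with a Gaussian kernel; the bandwidth and the toy location-scale generator are illustrative assumptions
mmd2=function(x,y,bw=1){
  k=function(a,b)exp(-outer(a,b,"-")^2/(2*bw^2))
  mean(k(x,x))+mean(k(y,y))-2*mean(k(x,y))}   #V-statistic estimate of MMD^2
x=rnorm(500)                       #data from p*
G=function(z,th)th[1]+th[2]*z      #toy location-scale generator
z=rnorm(500)
mmd2(x,G(z,c(0,1)))                #small: moments matched
mmd2(x,G(z,c(2,1)))                #larger: mismatched mean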
• 236. Generative adversarial networks (GANs) Adversarial setting opposing generator and discriminator: I generator G = G(z; ϑG) ∈ X as “best” guess to model data production x ∼ pmodel(x; ϑG) with z latent variable I discriminator D = D(x; ϑD) ∈ [0, 1] as measuring discrepancy between data (generation) and model (generation), ^P(x = G(z; ϑG); ϑD) and ^P(G(z) ≠ G(z; ϑG); ϑD) Antagonism due to objective function J(ϑG; ϑD) where G aims at confusing D, with equilibrium pG = pdata and D ≡ 1/2 Both G and D possibly modelled as deep neural networks [Goodfellow, 2016]
  • 239. Theoretical insights on GANs “The aim of GANs is to generate data that look ‘similar’ to samples collected from some unknown probability measure... [Biau, Sangnier Tanielian, 2021]
• 240. Theoretical insights on GANs “The aim of GANs is to generate data that look ‘similar’ to samples collected from some unknown probability measure... In the initial version, GANs were shown to reduce, under appropriate conditions, the Jensen–Shannon divergence between the true distribution and the class of parameterized distributions... [Biau, Sangnier Tanielian, 2021]
• 241. Theoretical insights on GANs In the initial version, GANs were shown to reduce, under appropriate conditions, the Jensen–Shannon divergence between the true distribution and the class of parameterized distributions... ...many empirical studies have described cases where the optimal generative distribution collapses to a few modes of the distribution µ∗. This phenomenon is known under the term of mode collapse (...) in cases where µ∗ and µϑ lie on disjoint supports, these authors proved the existence of a perfect discriminator with null gradient on both supports, which consequently does not convey meaningful information to the generator.” [Biau, Sangnier Tanielian, 2021]
• 242. Theoretical insights on GAN convergence (2) Objective to solve empirical minimaxisation of ^L(ϑ, D): inf_ϑ sup_{D∈D} ∑_{i=1}^n log D(xi) + ∑_{i=1}^n log{1 − D(Gϑ(zi))} [Biau, Cadre, Sangnier Tanielian, 2018]
• 243. Theoretical insights on GAN convergence (2) Objective to solve empirical minimaxisation of ^L(ϑ, D): inf_ϑ sup_{D∈D} ∑_{i=1}^n log D(xi) + ∑_{i=1}^n log{1 − D(Gϑ(zi))} With no constraint on D, unique optimal choice D∗ϑ = pdata/(pdata + pϑ) [Biau, Cadre, Sangnier Tanielian, 2018]
• 244. Theoretical insights on GAN convergence (2) Objective to solve empirical minimaxisation of ^L(ϑ, D): inf_ϑ sup_{D∈D} ∑_{i=1}^n log D(xi) + ∑_{i=1}^n log{1 − D(Gϑ(zi))} When pϑ identifiable, compact, and all densities involved are upper bounded, ϑ∗ = arg inf_ϑ L(ϑ, D∗ϑ) is unique [Biau, Cadre, Sangnier Tanielian, 2018]
• 245. Theoretical insights on GAN convergence (2) Objective to solve empirical minimaxisation of ^L(ϑ, D): inf_ϑ sup_{D∈D} ∑_{i=1}^n log D(xi) + ∑_{i=1}^n log{1 − D(Gϑ(zi))} For an ε-dense parameterised class D0, under some assumptions, 0 ≤ DJS(p∗, pϑ) − DJS(p∗, pϑ∗) ≤ cε² for ϑ = arg inf_ϑ sup_{D∈D0} L(ϑ, D) [Biau, Cadre, Sangnier Tanielian, 2018]
• 246. Theoretical insights on GAN convergence (2) Objective to solve empirical minimaxisation of ^L(ϑ, D): inf_ϑ sup_{D∈D} ∑_{i=1}^n log D(xi) + ∑_{i=1}^n log{1 − D(Gϑ(zi))} For an ε-dense parameterised class D0, under further assumptions, E[DJS(p∗, p^ϑ)] − DJS(p∗, pϑ∗) = O(ε² + n^−1/2) for ^ϑ = arg inf_ϑ sup_{D∈D0} ^L(ϑ, D) [Biau, Cadre, Sangnier Tanielian, 2018]
• 247. Theoretical insights on GAN convergence (2) Objective to solve empirical minimaxisation of ^L(ϑ, D): inf_ϑ sup_{D∈D} ∑_{i=1}^n log D(xi) + ∑_{i=1}^n log{1 − D(Gϑ(zi))} Under yet more assumptions, ^ϑ −→ ϑ and ^α −→ α [Biau, Cadre, Sangnier Tanielian, 2018]
• 248. GANs losses 1. discriminator cost JD(ϑD, ϑG) = −Epdata[log D(X)] − EZ[log{1 − D(G(Z))}] as expected misclassification error, trained in ϑD on both the dataset and a G-generated dataset 2. generator cost, several solutions 2.1 minimax loss, JG = −JD, leading to ^ϑG as minimax estimator [poor perf] 2.2 maximum confusion JG(ϑD, ϑG) = −EZ[log D(G(Z))] as probability of discriminator D being mistaken by generator G 2.3 maximum likelihood loss JG(ϑD, ϑG) = −EZ[exp{σ^−1(D{G(Z)})}] where σ logistic sigmoid function
  • 250. GANs implementation Recursive algorithm where at each iteration 1. minimise JD(ϑD, ϑG) in ϑD 2. minimise JG(ϑD, ϑG) in ϑG based on empirical versions of expectations Epdata and Epmodel
  • 251. GANs implementation Recursive algorithm where at each iteration 1. reduce JD(ϑD, ϑG) by a gradient step in ϑD 2. reduce JG(ϑD, ϑG) by a gradient step in ϑG based on empirical versions of expectations Epdata and Epmodel
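A toy, fully synthetic R sketch of this alternating scheme (not a deep network): location-scale generator G(z; ϑG) = ϑG[1] + ϑG[2]z, quadratic-logistic discriminator D(x; ϑD), empirical costs JD and JG (maximum confusion version), and crude finite-difference gradients standing in for backpropagation
numgrad=function(f,p,h=1e-4)
  sapply(seq_along(p),function(j){e=replace(numeric(length(p)),j,h)
    (f(p+e)-f(p-e))/(2*h)})          #central finite differences
D=function(x,thD)plogis(thD[1]+thD[2]*x+thD[3]*x^2)
G=function(z,thG)thG[1]+thG[2]*z
xdat=rnorm(1e3,2,.5)                 #"data" distribution
thG=c(0,1);thD=rep(0,3);lr=.05
for(t in 1:200){
  z=rnorm(1e3)
  JD=function(p)-mean(log(D(xdat,p)))-mean(log(1-D(G(z,thG),p)))
  thD=thD-lr*numgrad(JD,thD)         #gradient step on discriminator cost
  JG=function(p)-mean(log(D(G(z,p),thD)))  #maximum confusion generator cost
  thG=thG-lr*numgrad(JG,thG)}        #gradient step on generator cost
c(mean(G(rnorm(1e5),thG)),sd(G(rnorm(1e5),thG)))  #ideally drifts towards (2,.5)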
• 252. Rates of convergence Under proper assumptions (with d dimension and β Hilbert regularity) min_ϑ JS(p∗, qϑ) ≲ (log n/n)^{2β/(2β+d)} + log(1/δ)/n w.p. 1 − δ E[min_ϑ JS(p∗, qϑ)] ≲ (log n/n)^{2β/(4β+d)} sup_{p∗} E[min_ϑ JS(p∗, qϑ)] ≳ (log n/n)^{2β/(2β+d)} [Belomestny et al., 2021]
• 253. Contrastive vs. adversarial Connection with the noise contrastive approach: Estimation of parameters ϑ and Zϑ for the unnormalised model pmodel(x; ϑ, Zϑ) = ^pmodel(x; ϑ)/Zϑ and loss function JG(ϑ, Zϑ) = −Epmodel[log h(X)] − Eq[log{1 − h(Y)}] where q is an arbitrary generative distribution and h(x) = 1/{1 + q(x)/pmodel(x; ϑ, Zϑ)} Converging method when q everywhere positive, with side estimate of Zϑ [Gutmann Hyvärinen, 2010]
  • 255. Contrastive versus adversarial Connection with the noise contrastive approach: Earlier version of GANs where I learning from artificial method (as in Geyer, 1994) I only learning generator G while D is fixed I free choice of q (generative, available, close to pdata) I and no optimisation of q
• 256. Unfortunately... “... learning in GANs can be difficult in practice when G and D are represented by neural networks and max_D L(G, D) is not convex... In general, simultaneous gradient descent on two players’ costs is not guaranteed to reach an equilibrium... ... equilibria for a minimax game are not local minima of L. Instead (...) they are saddle points... ... the best-performing formulation of the GAN game is a different formulation that is neither zero-sum nor equivalent to maximum likelihood” [Goodfellow et al., 2016, Chap. 20]
• 257. Wasserstein GANs “WGANs cure the main training problems of GANs. In particular, training WGANs does not require maintaining a careful balance in training of the discriminator and the generator, or a careful design of the network architecture either.1 The mode dropping phenomenon that is typical in GANs is also drastically reduced.” Use of Wasserstein distance W(p, q) = inf_ω Eω[‖X − Y‖] where the infimum is over all joints ω with marginals p and q or W(p∗, pϑ) = inf_ω Eω[‖X − Gϑ(Z)‖] [Arjovsky et al., 2017] 1 Argument for normalising flows: provide solution for distributions restricted to manifold of smaller dimension
• 260. Wasserstein GANs Kantorovich–Rubinstein equivalence W(p∗, pϑ) = sup_{f; ‖f‖L≤1} Ep∗[f(X)] − Eµ[f(Gϑ(Z))] approximated into W(p∗, pϑ) = sup_{ϕ∈R^d} Ep∗[fϕ(X)] − Eµ[fϕ(Gϑ(Z))] (4) under the ‖fϕ‖L ≤ 1 constraint, with fϕ acting as parameterised discriminator [Arjovsky et al., 2017]
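In the univariate case the 1-Wasserstein distance between two empirical distributions of equal size reduces to the average absolute difference between order statistics, which gives a cheap R check of W(p∗, pϑ) for a toy generator (all choices below purely illustrative)
W1=function(x,y)mean(abs(sort(x)-sort(y)))  #equal-size empirical measures
x=rnorm(1e4)                     #draws from p*
z=runif(1e4)                     #latent draws
G=function(z,th)qnorm(z,th[1],th[2])  #toy generator G_theta
W1(x,G(z,c(0,1)))                #close to zero
W1(x,G(z,c(1,2)))                #larger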
• 262. WGANs optimisation “...question of finding the function fϕ that solves the maximization problem in (4). To roughly approximate this, something that we can do is train a neural network...” Gradients ∇ϕ ∑_{i=1}^n fϕ(xi) − ∇ϕ ∑_{i=1}^n fϕ(Gϑ(zi)) and −∇ϑ ∑_{i=1}^n fϕ(Gϑ(zi)) with x1, . . . , xn ∼ p∗(x), z1, . . . , zn ∼ µ(z) [Arjovsky et al., 2017]
• 263. WGAN in perspective “WGAN allows us to train the critic till optimality. When the critic is trained to completion, it simply provides a loss to the generator that we can train as any other neural network (...) we no longer need to balance generator and discriminator’s capacity properly. The better the critic, the higher quality the gradients we use to train the generator.” Connections with I energy-based GANs (Zhao et al., 2016) I maximum mean discrepancy (Gretton et al., 2012) I generative moment matching (Li et al., 2015) [Arjovsky et al., 2017]
• 265. WGAN optimality Under a compactness requirement on the Gϑ’s I there exist(s) ϕ such that |Ep∗[fϕ(X)] − Eµ[fϕ(Gϑ(Z))]| = dD(µ∗, µϑ) I the minimum min_ϑ dD(µ∗, µϑ) can be achieved I if µ∗ has compact support, dD(µ∗, ^µn) ≤ c/√n with guaranteed minimal probability [Stéphanovitch et al., 2022; Biau, Sangnier Tanielian, 2021]
  • 266. MLE vs. WGAN [Danihelka et al., 2017]
• 267. MLE vs. WGAN I MLE seeks to minimise the forward Kullback–Leibler divergence between model distribution µϑ and data distribution µ∗ I adversarial learning (GAN) minimises the Jensen–Shannon divergence between model distribution and data distribution, which relates to the inverse Kullback–Leibler divergence [Zhao et al., 2020]
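A hedged numerical illustration in R (a toy example of mine, not from the cited papers) of the contrast: fitting a single Normal to a two-component mixture p∗, the forward KL solution spreads over both modes while the reverse KL solution locks onto one mode, the mechanism behind mode seeking in adversarial-type objectives
pstar=function(x).5*dnorm(x,-3,.7)+.5*dnorm(x,3,.7)
kl=function(p,q,lo=-10,up=10)    #KL(p||q) by numerical quadrature
  integrate(function(x)p(x)*(log(p(x)+1e-300)-log(q(x)+1e-300)),lo,up)$value
obj=function(par,reverse=FALSE){
  q=function(x)dnorm(x,par[1],exp(par[2]))
  if(reverse)kl(q,pstar)else kl(pstar,q)}
optim(c(0,0),obj)$par                #forward KL: mean near 0, large sd
optim(c(2,0),obj,reverse=TRUE)$par   #reverse KL: one mode, sd near .7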
• 268. Overall WGAN minimiser For a univariate sample x1, . . . , xn ∼ p∗(x) ordered as x(1) ≤ · · · ≤ x(n) and with spacings ∆1 = x(2) − x(1), . . . , ∆n−1 = x(n) − x(n−1), the quantile function G∗K(u) = x(1) if u < 1/n − ∆1/2K, G∗K(u) = x(i) + K(u − i/n + ∆i/2K) if |u − i/n| ≤ ∆i/2K (i = 1, . . . , n − 1), . . . , G∗K(u) = x(n) if u > (n − 1)/n + ∆n−1/2K [Stéphanovitch et al., 2022]
• 269. Overall WGAN minimiser For a univariate sample x1, . . . , xn ∼ p∗(x) ordered as x(1) ≤ · · · ≤ x(n) and with spacings ∆1 = x(2) − x(1), . . . , ∆n−1 = x(n) − x(n−1), the quantile function G∗K(u) is optimal among all Lipschitz-K generators G for the 1-Wasserstein distance [Stéphanovitch et al., 2022]
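A small R sketch of G∗K, assuming K large enough that the slope-K ramps of width ∆i/K centred at u = i/n do not overlap; pushing U(0, 1) draws through G∗K then concentrates mass around the order statistics
GstarK=function(u,x,K){
  x=sort(x);n=length(x)
  sapply(u,function(v){
    i=pmin(pmax(round(v*n),1),n-1)    #nearest interior knot i/n
    d=x[i+1]-x[i]                     #spacing Delta_i
    if(abs(v-i/n)<=d/(2*K)) x[i]+K*(v-i/n+d/(2*K))  #on the ramp around i/n
    else if(v<i/n) x[i] else x[i+1]})}              #on a flat piece
x=rnorm(10)
hist(GstarK(runif(1e4),x,K=50),breaks=50)  #pushforward of U(0,1) through G*_K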
• 270. Overall WGAN minimiser G∗K(u) optimal among all Lipschitz-K generators G for the 1-Wasserstein distance [Stéphanovitch et al., 2022]
• 271. Overall WGAN minimiser Less efficient ReLU neural network solutions [Stéphanovitch et al., 2022]
• 273. Statistics vs. GANs Consider special setting with I MLP generator gϑ with depth L and width d, using leaky ReLU σa(·) activation I true distribution within family for ϑ = ϑ∗ and Z ∼ U([0, 1]^d) I discriminator fω feedforward neural network with depth L + 2, using dual leaky ReLU σ1/a(·) activations, with parameter ω ∈ Ω(d, L) error of GAN E[dTV(µ∗, µ^ϑ,^ϕ)] ≲ √{d²L log(dL) (log n/n + log m/m)} [Liang, 2021]
• 274. Statistics vs. GANs Consider special setting with I true distribution Nd(µ, Σ) I linear generator (zero layer neural network) I m generated realisations and n observations I input distribution Np(0, Ip) (p ≥ d) I discriminator one hidden layer neural network with quadratic activation error of GAN E[dTV(µ∗, µ^ϑ,^ϕ)] ≲ √{d² log d/n + (pd + d²) log(p + d)/m} [Liang, 2021]
  • 275. Bayesian GANs Bayesian version of generative adversarial networks, with priors on both model (generator) and discriminator parameters [Saatchi Wilson, 2016]
• 276. Bayesian GANs Somewhat remote from genuine statistical inference: “GANs transform white noise through a deep neural network to generate candidate samples from a data distribution. A discriminator learns, in a supervised manner, how to tune its parameters so as to correctly classify whether a given sample has come from the generator or the true data distribution. Meanwhile, the generator updates its parameters so as to fool the discriminator. As long as the generator has sufficient capacity, it can approximate the cdf inverse-cdf composition required to sample from a data distribution...” [Saatchi Wilson, 2016]
  • 277. Bayesian GANs Again I rephrasing statistical model as white noise transform x = G(z, ϑ) (or implicit generative model, cf. above) I resorting to prior distribution on ϑ still relevant (cf. probabilistic numerics) I use of discriminator function D(x; ϕ) “probability that x comes from data distribution” (versus parametric alternative associated with generator G(·; ϑ)) I difficulty with posterior distribution (cf. below) [Saatchi Wilson, 2016]
• 278. Partly Bayesian GANs Posterior distribution unorthodox in being associated with conditional posteriors π(ϑ|z, ϕ) ∝ ∏_{i=1}^n D(G(zi; ϑ); ϕ) π(ϑ|αg) (2) π(ϕ|z, X, ϑ) ∝ ∏_{i=1}^m D(xi; ϕ) × ∏_{i=1}^n (1 − D(G(zi; ϑ); ϕ)) × π(ϕ|αd) (3) 1. generative conditional posterior π(ϑ|z, ϕ) aims at fooling discriminator D, by favouring generative ϑ values helping wrong allocation of pseudo-data 2. discriminative conditional posterior π(ϕ|z, X, ϑ) [almost] standard Bayesian posterior based on the original and generated samples [Saatchi Wilson, 2016]
• 280. Partly Bayesian GANs Posterior distribution unorthodox in being associated with conditional posteriors π(ϑ|z, ϕ) ∝ ∏_{i=1}^n D(G(zi; ϑ); ϕ) π(ϑ|αg) (2) π(ϕ|z, X, ϑ) ∝ ∏_{i=1}^m D(xi; ϕ) × ∏_{i=1}^n (1 − D(G(zi; ϑ); ϕ)) × π(ϕ|αd) (3) Opens possibility for two-stage Gibbs sampler “By iteratively sampling from (2) and (3) (...) obtain samples from the approximate posteriors over [both sets of parameters].” [Hobert Casella, 1994; Saatchi Wilson, 2016]
• 281. Partly Bayesian GANs Difficulty with the concept is that (2) and (3) cannot be compatible conditionals: there is no joint distribution whose conditionals are (2) and (3) Reason: pseudo-data appears in D for (2) and in (1 − D) for (3) Convergence of Gibbs sampler delicate to ascertain (limit?) [Hobert Casella, 1994; Saatchi Wilson, 2016]
• 282. More Bayesian GANs? Difficulty later pointed out by Han et al. (2018): “the previous Bayesian method (Saatchi Wilson, 2017) for any minimax GAN objective induces incompatibility of its defined conditional distributions.” New approach uses Bayesian framework in prior feedback spirit (Robert, 1993) to converge to a single parameter value (and there are “two likelihoods” for the same data, one being the inverse of the other in the minimax GAN case)
  • 284. Further Bayesian GANs ProbGAN: distribution over generator parameter derived from objective (score) functions Ld(ϑ, ϕ) = Epdata [Φ1(D(X; ϕ))] + Epϑ [Φ2(D(X; ϕ))] and Lg(ϑ, ϕ) = Epϑ [Φ3(D(X; ϕ))] [He et al., ICLR 2019]
  • 285. Probabilistic GANS Iterative construct πt+1(ϑ) ∝ exp{Lg(ϑ, ϕt)}πt(ϑ) and πt+1(ϕ) ∝ exp{Ld(ϑt, ϕ)} with scores2 replacing likelihoods as in Bissiri et al. (2013) and no true prior [He et al., ICLR 2019] 2 in Lg and Ld discriminator and generator integrated in ϕt and ϑt, resp.
  • 286. Probabilistic GANS Iterative construct πt+1(ϑ) ∝ exp{Lg(ϑ, ϕt)}πt(ϑ) and πt+1(ϕ) ∝ exp{Ld(ϑt, ϕ)} with scores2 replacing likelihoods as in Bissiri et al. (2013) and no true prior Simulated by Gibbs-wise HMC (Neal, 2001; Chen et al. 2014)3 [He et al., ICLR 2019] 2 in Lg and Ld discriminator and generator integrated in ϕt and ϑt, resp. 3 Algorithm 1 sounds like unscented Langevin
  • 288. Probabilistic GANS Iterative construct πt+1(ϑ) ∝ exp{Lg(ϑ, ϕt)}πt(ϑ) and πt+1(ϕ) ∝ exp{Ld(ϑt, ϕ)} with scores2 replacing likelihoods as in Bissiri et al. (2013) and no true prior Stability of “ideal generator” distribution when it exists but no proof of convergence [He et al., ICLR 2019] 2 in Lg and Ld discriminator and generator integrated in ϕt and ϑt, resp.
  • 289. Probabilistic GANS vs Bayesian GANs For Bayesian GANS πt+1(ϑ) ∝ exp{Lg(ϑ, ϕt)}π0(ϑ) and πt+1(ϕ) ∝ exp{Ld(ϑ, ϕ)}π0(ϕ) are incompatible3 [with a joint] Probabilistic GANS evacuate compatibility issues by converging (?) to degenerate distribution on ϑ and neutral discriminator [He et al., ICLR 2019] 3 Lg and Ld integrated in ϕt and ϑt, resp.
  • 290. Metropolis-Hastings GANs “ With a perfect discriminator, this wrapped generator samples from the true distribution on the data exactly even when the generator is imperfect.” Goal: sample from distribution implicitly defined by GAN discriminator D learned for generator G Recall that “If D converges optimally for a fixed G, then D = pdata/(pdata + pG), and if both D and G converge then pG = pdata” [Turner et al., ICML 2019]
• 292. Metropolis-Hastings GANs In ideal setting, D produces pdata since pdata(x)/pG(x) = 1/{D(x)^−1 − 1} and using pG as Metropolis-Hastings proposal with acceptance probability α(x, x′) = 1 ∧ {D(x)^−1 − 1}/{D(x′)^−1 − 1} Quite naïve (e.g., does not account for D and G being updated on-line or D being imperfect) and lacking connections with density estimation [Turner et al., ICML 2019]
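A fully synthetic R sketch of the resulting independence sampler, with a closed-form “perfect” discriminator used for checking (pdata = N(0, 1), pG = N(0.5, 1.2²), D = pdata/(pdata + pG), all illustrative choices), so the chain corrects the imperfect generator
pdata=function(x)dnorm(x);pG=function(x)dnorm(x,.5,1.2)
D=function(x)pdata(x)/(pdata(x)+pG(x))  #ideal discriminator
xgen=rnorm(1e4,.5,1.2)                  #draws from the generator
x=xgen[1];out=numeric(1e4)
for(t in 1:1e4){
  xp=sample(xgen,1)                     #independent proposal from p_G
  a=(1/D(x)-1)/(1/D(xp)-1)              #(D(x)^-1 - 1)/(D(x')^-1 - 1)
  if(runif(1)<a)x=xp
  out[t]=x}
c(mean(out),sd(out))                    #close to (0,1) despite the imperfect G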
  • 294. Secret GANs “...evidence that it is beneficial to sample from the energy-based model defined both by the generator and the discriminator instead of from the generator only.” Post-processing of GAN generator G output, generating from mixture of both generator and discriminator, via unscented Langevin algorithm Same core idea: if pdata true data generating process, pG the estimated generator and D discriminator pdata(x) ≈ p0 (x) ∝ pG(x) exp(D(x)) Again approximation exact only when discriminator optimal (Theorem 1) [Che et al., NeurIPS 2020]
• 296. Secret GANs Difficulties with proposal I latent variable (white noise) z [such that x = G(z)] may imply huge increase in dimension I pG may be unavailable (contraposite of normalising flows) Generation from p0 seen as accept-reject with [prior] proposal µ(z) and acceptance probability proportional to d(z) := exp[D{G(z)}] In practice, unscented [i.e., non-Metropolised] Langevin move zt+1 = zt − (ε/2) ∇zE(z) + √ε η, η ∼ N(0, I), ε ≪ 1 [Che et al., NeurIPS 2020]
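A one-dimensional R sketch of this unadjusted Langevin move, with hypothetical stand-ins for G and D chosen so that the target p0(z) ∝ µ(z) exp{D(G(z))} is a known Gaussian (mean 1/2, sd ≈ 0.71), where E(z) = −log µ(z) − D(G(z))
G=function(z)2*z
D=function(x)-x^2/8+x/2            #stand-in for a fitted logit discriminator
gradE=function(z)2*z-1             #derivative of E(z)=z^2/2-D(G(z))=z^2-z
eps=1e-2;z=0;out=numeric(1e4)
for(t in 1:1e4){
  z=z-eps/2*gradE(z)+sqrt(eps)*rnorm(1)  #z_{t+1}=z_t-(eps/2)grad E+sqrt(eps) eta
  out[t]=z}
c(mean(out),sd(out))               #near (0.5,0.71), up to discretisation bias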
• 298. Secret GANs Alternative WGAN: at step t 1. Discriminator with parameter ϕ trained to match pt(x) ∝ pg(x) exp{Dϕ(x)} with data distribution pdata and gradient Ept[∇ϕD(X; ϕ)] − Epdata[∇ϕD(X; ϕ)] whose first expectation is approximated by Langevin 2. Generator with parameter ϑ trained to match pg(x) and pt(x) [not pdata] with gradient Eµ[∇ϑD(G(Z; ϑ))] [Che et al., NeurIPS 2020]
  • 299. Where are we now? (c.) Floydhub, 2019
  • 300. More questions? (c.) cross validated, 29 March 2021