3. An early entry
A standard issue in Bayesian inference is to approximate the marginal likelihood (or evidence)
$$E_k = \int_{\Theta_k} \pi_k(\vartheta_k)\,L_k(\vartheta_k)\,\mathrm{d}\vartheta_k$$
[Jeffreys, 1939]
4. Bayes factor
For testing hypotheses H0 : ϑ ∈ Θ0 vs. Ha : ϑ ∉ Θ0, under the prior
$$\pi(\Theta_0)\,\pi_0(\vartheta) + \pi(\Theta_0^c)\,\pi_1(\vartheta)\,,$$
the central quantity is
$$B_{01} = \frac{\pi(\Theta_0|x)}{\pi(\Theta_0^c|x)} \Big/ \frac{\pi(\Theta_0)}{\pi(\Theta_0^c)} = \frac{\int_{\Theta_0} f(x|\vartheta)\,\pi_0(\vartheta)\,\mathrm{d}\vartheta}{\int_{\Theta_0^c} f(x|\vartheta)\,\pi_1(\vartheta)\,\mathrm{d}\vartheta}$$
[Jeffreys, 1939; Kass & Raftery, 1995]
5. Bayes factor approximation
When approximating the Bayes factor
$$B_{01} = \frac{\int_{\Theta_0} f_0(x|\vartheta_0)\,\pi_0(\vartheta_0)\,\mathrm{d}\vartheta_0}{\int_{\Theta_1} f_1(x|\vartheta_1)\,\pi_1(\vartheta_1)\,\mathrm{d}\vartheta_1}\,,$$
use of importance functions ϖ0 and ϖ1 and of the estimator
$$\hat B_{01} = \frac{n_0^{-1}\sum_{i=1}^{n_0} f_0(x|\vartheta_0^i)\,\pi_0(\vartheta_0^i)/\varpi_0(\vartheta_0^i)}{n_1^{-1}\sum_{i=1}^{n_1} f_1(x|\vartheta_1^i)\,\pi_1(\vartheta_1^i)/\varpi_1(\vartheta_1^i)}$$
when $\vartheta_0^i \sim \varpi_0(\vartheta)$ and $\vartheta_1^i \sim \varpi_1(\vartheta)$
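A minimal R sketch of this estimator (not from the slides: the toy model x | µ ∼ N(µ, 1), the two priors, and the shared importance function below are illustrative assumptions):
set.seed(1)
x = rnorm(20, .3)                                     # hypothetical data
lik = function(mu) sapply(mu, function(m) prod(dnorm(x, m, 1)))
n0 = n1 = 1e4
varpi = function(mu) dnorm(mu, mean(x), 1/sqrt(length(x)))  # importance function
th0 = rnorm(n0, mean(x), 1/sqrt(length(x)))           # draws from varpi_0
th1 = rnorm(n1, mean(x), 1/sqrt(length(x)))           # draws from varpi_1
E0 = mean(lik(th0) * dnorm(th0, 0, 1) / varpi(th0))   # evidence under prior N(0,1)
E1 = mean(lik(th1) * dnorm(th1, 0, 10) / varpi(th1))  # evidence under prior N(0,10^2)
B01 = E0 / E1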
6. Forgetting and learning
Counterintuitive choice of importance function based on mixtures
If ϑit ∼ ϖi(ϑ) (i = 1, . . . , I, t = 1, . . . , Ti), the standard estimate
$$\mathbb{E}^\pi[h(\vartheta)] \approx \frac{1}{T_i}\sum_{t=1}^{T_i} h(\vartheta_{it})\,\frac{\pi(\vartheta_{it})}{\varpi_i(\vartheta_{it})}$$
is replaced with
$$\mathbb{E}^\pi[h(\vartheta)] \approx \sum_{i=1}^{I}\sum_{t=1}^{T_i} h(\vartheta_{it})\,\frac{\pi(\vartheta_{it})}{\sum_{j=1}^{I} T_j\,\varpi_j(\vartheta_{it})}$$
Preserves unbiasedness and brings stability (while forgetting about the original index)
[Geyer, 1991, unpublished; Owen & Zhou, 2000]
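A short R sketch of this mixture estimator (the target π = N(0, 1), the integrand h(ϑ) = ϑ², and the I = 2 normal components are assumptions for illustration):
set.seed(2)
T1 = T2 = 1e4
th1 = rnorm(T1, -1, 2); th2 = rnorm(T2, 1, 2)         # theta_it ~ varpi_i
th = c(th1, th2)
denom = T1 * dnorm(th, -1, 2) + T2 * dnorm(th, 1, 2)  # sum_j T_j varpi_j(theta_it)
est = sum(th^2 * dnorm(th) / denom)                   # estimate of E[theta^2] = 1
Compare with the per-sample version mean(th1^2 * dnorm(th1) / dnorm(th1, -1, 2)), which can be far less stable when the components have thin overlap with the target.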
7. Enters the logistic
If considering unnormalised ϖj's, i.e.
$$\varpi_j(\vartheta) = c_j\,\tilde\varpi_j(\vartheta)\,,\qquad j = 1, \ldots, I\,,$$
and realisations ϑit's from the mixture
$$\varpi(\vartheta) = \frac{1}{T}\sum_{i=1}^{I} T_i\,\varpi_i(\vartheta) = \frac{1}{T}\sum_{i=1}^{I} \tilde\varpi_i(\vartheta)\,e^{\eta_i}\,,\qquad \eta_j = \underbrace{\log(c_j) + \log(T_j)}_{\eta_j}\,,$$
Geyer (1994) introduces allocation probabilities for the mixture components
$$p_j(\vartheta, \eta) = \tilde\varpi_j(\vartheta)\,e^{\eta_j} \Big/ \sum_{m=1}^{I} \tilde\varpi_m(\vartheta)\,e^{\eta_m}$$
to construct a pseudo-loglikelihood
$$\ell(\eta) := \sum_{i=1}^{I}\sum_{t=1}^{T_i} \log p_i(\vartheta_{it}, \eta)$$
8. Enters the logistic (2)
Estimating η as
$$\hat\eta = \arg\max_\eta\ \ell(\eta)$$
produces the reverse logistic regression estimator of the constants cj:
- partial forgetting of the initial distribution
- objective function equivalent to a multinomial logistic regression with the log ϖ̃i(ϑit)'s as covariates
- randomness reversed from the Ti's to the ϑit's
- constants cj identifiable only up to a common constant
- resulting importance sampling estimator is biased
9. Illustration
Special case when I = 2, c1 = 1, T1 = T2 = T:
$$\ell(c_2) = \sum_{t=1}^{T} \log\{1 + c_2\tilde\varpi_2(\vartheta_{1t})/\varpi_1(\vartheta_{1t})\} + \sum_{t=1}^{T} \log\{1 + \varpi_1(\vartheta_{2t})/c_2\tilde\varpi_2(\vartheta_{2t})\}$$
and
$$\varpi_1(\vartheta) = \varphi(\vartheta; 0, 3^2)\qquad \tilde\varpi_2(\vartheta) = \exp\{-\vartheta^2/2\}\qquad c_2 = 1/\sqrt{2\pi}$$
10. Illustration
R implementation for this special case (with ϖ̃2 shifted to centre 5, keeping c2 = 1/√(2π)):
tg = function(x) exp(-(x - 5)^2 / 2)              # unnormalised varpi_2
pl = function(a)                                   # pseudo-likelihood, minimised in a = c2
  sum(log(1 + a * tg(x) / dnorm(x, 0, 3))) + sum(log(1 + dnorm(y, 0, 3) / a / tg(y)))
nrm = matrix(0, 3, 1e2)
for (i in 1:3)
  for (j in 1:1e2) {
    x = rnorm(10^(i + 1), 0, 3)                    # draws from varpi_1
    y = rnorm(10^(i + 1), 5, 1)                    # draws from normalised varpi_2
    nrm[i, j] = optimise(pl, c(.01, 1))$minimum    # estimate of c2
  }
11. Illustration
[Figure: boxplots of the resulting estimates of c2 = 1/√(2π) ≈ 0.399 against sample sizes 10², 10³, 10⁴]
12. Illustration
[Figure: same comparison for the full logistic version]
14. Bridge sampling
Approximation of Bayes factors (and other ratios of integrals)
Special case: if
$$\pi_1(\vartheta_1|x) \propto \tilde\pi_1(\vartheta_1|x)\qquad \pi_2(\vartheta_2|x) \propto \tilde\pi_2(\vartheta_2|x)$$
live on the same space (Θ1 = Θ2), then
$$B_{12} \approx \frac{1}{n}\sum_{i=1}^{n} \frac{\tilde\pi_1(\vartheta_i|x)}{\tilde\pi_2(\vartheta_i|x)}\qquad \vartheta_i \sim \pi_2(\vartheta|x)$$
[Bennett, 1976; Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
15. Bridge sampling variance
The bridge sampling estimator does poorly if
$$\frac{\mathrm{var}(\hat B_{12})}{B_{12}^2} \approx \frac{1}{n}\,\mathbb{E}\left[\left(\frac{\pi_1(\vartheta) - \pi_2(\vartheta)}{\pi_2(\vartheta)}\right)^2\right]$$
is large, i.e. if π1 and π2 have little overlap...
17. (Further) bridge sampling
General identity (consistent with the special case above):
$$B_{12} = \frac{\int \tilde\pi_1(\vartheta|x)\,\alpha(\vartheta)\,\pi_2(\vartheta|x)\,\mathrm{d}\vartheta}{\int \tilde\pi_2(\vartheta|x)\,\alpha(\vartheta)\,\pi_1(\vartheta|x)\,\mathrm{d}\vartheta}\qquad \forall\,\alpha(\cdot)$$
$$\approx \frac{\dfrac{1}{n_2}\sum_{i=1}^{n_2} \tilde\pi_1(\vartheta_{2i}|x)\,\alpha(\vartheta_{2i})}{\dfrac{1}{n_1}\sum_{i=1}^{n_1} \tilde\pi_2(\vartheta_{1i}|x)\,\alpha(\vartheta_{1i})}\qquad \vartheta_{ji} \sim \pi_j(\vartheta|x)$$
18. Optimal bridge sampling
The optimal choice of auxiliary function is
$$\alpha^\star = \frac{n_1 + n_2}{n_1\pi_1(\vartheta|x) + n_2\pi_2(\vartheta|x)}$$
leading to
$$\hat B_{12} \approx \frac{\dfrac{1}{n_2}\sum_{i=1}^{n_2} \dfrac{\tilde\pi_1(\vartheta_{2i}|x)}{n_1\pi_1(\vartheta_{2i}|x) + n_2\pi_2(\vartheta_{2i}|x)}}{\dfrac{1}{n_1}\sum_{i=1}^{n_1} \dfrac{\tilde\pi_2(\vartheta_{1i}|x)}{n_1\pi_1(\vartheta_{1i}|x) + n_2\pi_2(\vartheta_{1i}|x)}}$$
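A hedged R sketch of the resulting iterative scheme (the pair π̃1(ϑ) = exp(−ϑ²/2), π̃2(ϑ) = exp(−(ϑ−1)²/8) is an assumption for illustration, for which B12 = 1/2 exactly):
set.seed(3)
n1 = n2 = 1e4
t1 = rnorm(n1, 0, 1)              # draws from pi_1
t2 = rnorm(n2, 1, 2)              # draws from pi_2
pt1 = function(x) exp(-x^2 / 2)
pt2 = function(x) exp(-(x - 1)^2 / 8)
B = 1                             # starting value for B12
for (it in 1:50) {
  num = mean(pt1(t2) / (n1 * pt1(t2) + n2 * B * pt2(t2)))
  den = mean(pt2(t1) / (n1 * pt1(t1) + n2 * B * pt2(t1)))
  B = num / den                   # unknown constants replaced by current B
}
B                                 # close to 1/2
The normalising constants hidden in α⋆ only enter through the current value of B12 (the remaining constant cancels between numerator and denominator), which is why the update can be iterated.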
19. Optimal bridge sampling (2)
Reason:
$$\frac{\mathrm{Var}(\hat B_{12})}{B_{12}^2} \approx \frac{1}{n_1 n_2}\left[\frac{\int \pi_1(\vartheta)\pi_2(\vartheta)\,[n_1\pi_1(\vartheta) + n_2\pi_2(\vartheta)]\,\alpha(\vartheta)^2\,\mathrm{d}\vartheta}{\left(\int \pi_1(\vartheta)\pi_2(\vartheta)\,\alpha(\vartheta)\,\mathrm{d}\vartheta\right)^2} - 1\right]$$
(by the δ method)
Drawback: dependence on the unknown normalising constants, solved iteratively
21. Back to the logistic
When T1 = T2 = T, optimising
$$\ell(c_2) = \sum_{t=1}^{T} \log\{1 + c_2\tilde\varpi_2(\vartheta_{1t})/\varpi_1(\vartheta_{1t})\} + \sum_{t=1}^{T} \log\{1 + \varpi_1(\vartheta_{2t})/c_2\tilde\varpi_2(\vartheta_{2t})\}$$
by cancelling its derivative in c2,
$$\sum_{t=1}^{T} \frac{\tilde\varpi_2(\vartheta_{1t})}{c_2\tilde\varpi_2(\vartheta_{1t}) + \varpi_1(\vartheta_{1t})} - c_2^{-1}\sum_{t=1}^{T} \frac{\varpi_1(\vartheta_{2t})}{\varpi_1(\vartheta_{2t}) + c_2\tilde\varpi_2(\vartheta_{2t})} = 0\,,$$
leads to the fixed-point update
$$c_2' = \sum_{t=1}^{T} \frac{\varpi_1(\vartheta_{2t})}{\varpi_1(\vartheta_{2t}) + c_2\tilde\varpi_2(\vartheta_{2t})} \Bigg/ \sum_{t=1}^{T} \frac{\tilde\varpi_2(\vartheta_{1t})}{c_2\tilde\varpi_2(\vartheta_{1t}) + \varpi_1(\vartheta_{1t})}$$
an EM step for the maximum pseudo-likelihood estimation
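In R, a minimal sketch of this fixed-point iteration for the special case above (with ϖ̃2 centred at zero, as in the displayed maths):
set.seed(4)
T = 1e4
th1 = rnorm(T, 0, 3)              # draws from varpi_1 = N(0, 3^2)
th2 = rnorm(T, 0, 1)              # draws from normalised varpi_2 = N(0, 1)
w1 = function(x) dnorm(x, 0, 3)
w2 = function(x) exp(-x^2 / 2)    # unnormalised varpi_2, c2 = 1/sqrt(2*pi)
c2 = 1
for (it in 1:100) {
  num = sum(w1(th2) / (w1(th2) + c2 * w2(th2)))
  den = sum(w2(th1) / (c2 * w2(th1) + w1(th1)))
  c2 = num / den                  # EM-like update
}
c2                                # close to 1/sqrt(2*pi) = 0.3989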
24. Mixtures as proposals
Design a specific mixture for simulation purposes, with density
$$\tilde\varphi(\vartheta) \propto \omega_1\pi(\vartheta)L(\vartheta) + \varphi(\vartheta)\,,$$
where ϕ(ϑ) is arbitrary (but normalised)
Note: ω1 is not a probability weight
[Chopin & Robert, 2011]
26. Evidence approximation by mixtures
Rao-Blackwellised estimate
$$\hat\xi = \frac{1}{T}\sum_{t=1}^{T} \frac{\omega_1\pi(\vartheta^{(t)})L(\vartheta^{(t)})}{\omega_1\pi(\vartheta^{(t)})L(\vartheta^{(t)}) + \varphi(\vartheta^{(t)})}\,,$$
which converges to ω1Z/{ω1Z + 1}
Deduce Ẑ from ω1Ẑ/{ω1Ẑ + 1} = ξ̂, i.e. Ẑ = ξ̂/{ω1(1 − ξ̂)}
Back to the bridge sampling optimal estimate
[Chopin & Robert, 2011]
27. Non-parametric MLE
“At first glance, the problem appears to be an exercise in
calculus or numerical analysis, and not amenable to statistical
formulation” Kong et al. (JRSS B, 2002)
- use of Fisher information
- non-parametric MLE based on simulations
- comparison of sampling schemes through variances
- Rao–Blackwellised improvements by invariance constraints
[Meng, 2011, IRCEM]
30. NPMLE
Observing
$$Y_{ij} \sim F_i(t) = c_i^{-1}\int_{-\infty}^{t} \omega_i(x)\,\mathrm{d}F(x)$$
with ωi known and F unknown
"Maximum likelihood estimate" defined by the weighted empirical cdf
$$\sum_{i,j} \omega_i(y_{ij})\,p(y_{ij})\,\delta_{y_{ij}}\,,$$
maximising in p
$$\prod_{ij} c_i^{-1}\,\omega_i(y_{ij})\,p(y_{ij})$$
31. NPMLE
Result such that
$$\sum_{ij} \frac{\hat c_r^{-1}\,\omega_r(y_{ij})}{\sum_s n_s\,\hat c_s^{-1}\,\omega_s(y_{ij})} = 1$$
[Vardi, 1985]
32. NPMLE
The same fixed-point equation
$$\sum_{ij} \frac{\hat c_r^{-1}\,\omega_r(y_{ij})}{\sum_s n_s\,\hat c_s^{-1}\,\omega_s(y_{ij})} = 1$$
returns the bridge sampling estimator
[Gelman & Meng, 1998; Tan, 2004]
33. end of the Series B 2002 discussion
“...essentially every Monte Carlo activity may be interpreted as
parameter estimation by maximum likelihood in a statistical
model. We do not claim that this point of view is necessary; nor
do we seek to establish a working principle from it.”
- restriction to discrete support measures [may be] suboptimal
[Ritov & Bickel, 1990; Robins et al., 1997, 2000, 2003]
- group averaging versions in-between multiple mixture estimators and quasi-Monte Carlo versions
[Owen & Zhou, 2000; Cornuet et al., 2012; Owen, 2003]
- statistical analogy provides at best a narrative thread
34. end of the Series B 2002 discussion
“The hard part of the exercise is to construct a submodel such
that the gain in precision is sufficient to justify the additional
computational effort”
- garden of forking paths, with infinite possibilities
- no free lunch (variance, budget, time)
- Rao–Blackwellisation may be detrimental in Markov setups
35. end of the 2002 discussion
“The statistician can considerably improve the efficiency of the
estimator by using the known values of different functionals
such as moments and probabilities of different sets. The
algorithm becomes increasingly efficient as the number of
functionals becomes larger. The result, however, is an extremely
complicated algorithm, which is not necessarily faster.” Y. Ritov
“...the analyst must violate the likelihood principle and eschew
semiparametric, nonparametric or fully parametric maximum
likelihood estimation in favour of non-likelihood-based locally
efficient semiparametric estimators.” J. Robins
37. Noise contrastive estimation
New estimation principle for parameterised and unnormalised statistical models, also based on (nonlinear) logistic regression
Case of a parameterised model with density
$$p(x; \alpha) = \frac{\tilde p(x; \alpha)}{Z(\alpha)}$$
and intractable normalising constant Z(α)
Estimating Z(α) as an extra parameter is impossible via maximum likelihood methods
Use of estimation techniques bypassing the constant, like contrastive divergence (Hinton, 2002) and score matching (Hyvärinen, 2005)
[Gutmann & Hyvärinen, 2010]
38. NCE principle
As in Geyer's method, given a sample x1, . . . , xT from p(x; α),
- generate an artificial sample y1, . . . , yT from a known distribution q
- maximise the classification loglikelihood (where ϑ = (α, c))
$$\ell(\vartheta; x, y) := \sum_{i=1}^{T} \log h(x_i; \vartheta) + \sum_{i=1}^{T} \log\{1 - h(y_i; \vartheta)\}$$
of a logistic regression model which discriminates the observed data from the simulated data, where
$$h(z; \vartheta) = \frac{c\,\tilde p(z; \alpha)}{c\,\tilde p(z; \alpha) + q(z)}$$
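A minimal R sketch of this objective (the Gaussian model p̃(z; a) = exp{−(z − a)²/2}, the noise q, and the sample sizes are illustrative assumptions, not the slides' example):
set.seed(6)
Tn = 1e4
x = rnorm(Tn, 1, 1)                # data from p(.; a* = 1), so c* = 1/sqrt(2*pi)
y = rnorm(Tn, 0, 2)                # artificial sample from q = N(0, 2^2)
q = function(z) dnorm(z, 0, 2)
h = function(z, a, c) { cp = c * exp(-(z - a)^2 / 2); cp / (cp + q(z)) }
nce = function(par) {              # negative classification loglikelihood in (a, log c)
  a = par[1]; c = exp(par[2])
  -sum(log(h(x, a, c))) - sum(log(1 - h(y, a, c)))
}
fit = optim(c(0, 0), nce)
fit$par[1]                         # close to a* = 1
exp(fit$par[2])                    # close to c* = 0.3989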
39. NCE consistency
Objective function that converges (in T) to
$$J(\vartheta) = \mathbb{E}\left[\log h(x; \vartheta) + \log\{1 - h(y; \vartheta)\}\right]$$
Defining f(·) = log p(·; ϑ) and
$$\tilde J(f) = \mathbb{E}\left[\log r(f(x) - \log q(x)) + \log\{1 - r(f(y) - \log q(y))\}\right]$$
with r(·) the logistic function, and assuming q(·) positive everywhere,
- J̃(·) attains its maximum at f(·) = log p(·), the true distribution
- maximisation performed without any normalisation constraint
40. NCE consistency
Objective function that converges (in T) to
$$J(\vartheta) = \mathbb{E}\left[\log h(x; \vartheta) + \log\{1 - h(y; \vartheta)\}\right]$$
Under regularity conditions, assuming the true distribution belongs to the parametric family, the solution
$$\hat\vartheta_T = \arg\max_\vartheta\ \ell(\vartheta; x, y) \qquad (1)$$
converges to the true ϑ
Consequence: the log-normalisation constant is consistently estimated by the maximisation (1)
41. Convergence of noise contrastive estimation
Opposition of Monte Carlo MLE à la Geyer (1994, JASA)
$$L = \frac{1}{n}\sum_{i=1}^{n} \log\frac{\tilde p(x_i; \vartheta)}{\tilde p(x_i; \vartheta_0)} - \log\underbrace{\frac{1}{m}\sum_{j=1}^{m} \frac{\tilde p(z_j; \vartheta)}{\tilde p(z_j; \vartheta_0)}}_{\approx Z(\vartheta)/Z(\vartheta_0)}$$
$$x_1, \ldots, x_n \sim p^*\qquad z_1, \ldots, z_m \sim p(z; \vartheta_0)$$
[Riou-Durand & Chopin, 2018]
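A toy R sketch of this MC-MLE objective (the same illustrative Gaussian model p̃(u; a) = exp{−(u − a)²/2} as above, with reference ϑ0 = 0; an assumption, not the slides' example):
set.seed(7)
n = m = 1e4
x = rnorm(n, 1, 1)                 # data, true a* = 1
z = rnorm(m, 0, 1)                 # simulations from p(.; a0 = 0)
lr = function(u, a) a * u - a^2 / 2     # log ptilde(u; a) - log ptilde(u; 0)
mcmle = function(a) mean(lr(x, a)) - log(mean(exp(lr(z, a))))
optimise(mcmle, c(-5, 5), maximum = TRUE)$maximum   # close to 1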
42. Convergence of noise contrastive estimation
and of noise contrastive estimation à la Gutmann and Hyvärinen (2012)
$$L(\vartheta, \nu) = \frac{1}{n}\sum_{i=1}^{n} \log q_{\vartheta,\nu}(x_i) + \frac{1}{n}\sum_{j=1}^{m} \log[1 - q_{\vartheta,\nu}(z_j)]$$
where
$$\log\frac{q_{\vartheta,\nu}(z)}{1 - q_{\vartheta,\nu}(z)} = \log\frac{\tilde p(z; \vartheta)}{\tilde p(z; \vartheta_0)} + \nu + \log\frac{n}{m}$$
$$x_1, \ldots, x_n \sim p^*\qquad z_1, \ldots, z_m \sim p(z; \vartheta_0)$$
[Riou-Durand & Chopin, 2018]
43. Poisson transform
Equivalent likelihoods
$$L(\vartheta, \nu) = \frac{1}{n}\sum_{i=1}^{n} \log\frac{\tilde p(x_i; \vartheta)}{\tilde p(x_i; \vartheta_0)} + \nu - e^{\nu}\,\frac{Z(\vartheta)}{Z(\vartheta_0)}$$
and
$$L(\vartheta, \nu) = \frac{1}{n}\sum_{i=1}^{n} \log\frac{\tilde p(x_i; \vartheta)}{\tilde p(x_i; \vartheta_0)} + \nu - \frac{e^{\nu}}{m}\sum_{j=1}^{m} \frac{\tilde p(z_j; \vartheta)}{\tilde p(z_j; \vartheta_0)}$$
sharing the same ϑ̂ as the originals
44. NCE consistency
Under mild assumptions, almost surely
$$\hat\xi^{\mathrm{MCMLE}}_{n,m} \underset{m\to\infty}{\longrightarrow} \hat\xi_n \qquad\text{and}\qquad \hat\xi^{\mathrm{NCE}}_{n,m} \underset{m\to\infty}{\longrightarrow} \hat\xi_n\,,$$
the maximum likelihood estimator associated with
$$x_1, \ldots, x_n \sim p(\cdot; \vartheta)$$
and
$$\hat\nu = \frac{Z(\hat\vartheta)}{Z(\vartheta_0)}$$
[Geyer, 1994; Riou-Durand & Chopin, 2018]
45. NCE asymptotics
Under less mild assumptions (more robust for NCE), asymptotic normality of both NCE and MC-MLE estimates as n → +∞, m/n → τ:
$$\sqrt{n}\,(\hat\xi^{\mathrm{MCMLE}}_{n,m} - \xi^*) \approx \mathcal{N}_d(0, \Sigma^{\mathrm{MCMLE}}) \qquad\text{and}\qquad \sqrt{n}\,(\hat\xi^{\mathrm{NCE}}_{n,m} - \xi^*) \approx \mathcal{N}_d(0, \Sigma^{\mathrm{NCE}})$$
with the important ordering
$$\Sigma^{\mathrm{MCMLE}} \succeq \Sigma^{\mathrm{NCE}}$$
showing that NCE dominates MCMLE in terms of mean square error (for iid simulations)
[Geyer, 1994; Riou-Durand & Chopin, 2018]
46. NCE asymptotics
Under less mild assumptions (more robust for NCE), asymptotic normality of both NCE and MC-MLE estimates as n → +∞, m/n → τ:
$$\sqrt{n}\,(\hat\xi^{\mathrm{MCMLE}}_{n,m} - \xi^*) \approx \mathcal{N}_d(0, \Sigma^{\mathrm{MCMLE}}) \qquad\text{and}\qquad \sqrt{n}\,(\hat\xi^{\mathrm{NCE}}_{n,m} - \xi^*) \approx \mathcal{N}_d(0, \Sigma^{\mathrm{NCE}})$$
with the important ordering becoming an equality when ϑ0 = ϑ∗:
$$\Sigma^{\mathrm{MCMLE}} = \Sigma^{\mathrm{NCE}} = (1 + \tau^{-1})\,\Sigma^{\mathrm{RMLNCE}}$$
[Geyer, 1994; Riou-Durand & Chopin, 2018]
48. NCE contrast distribution
Choice of q(·) free but
- easy to sample from
- must allow for an analytical expression of its log-pdf
- must be close to the true density p(·), so that the mean squared error E[|ϑ̂T − ϑ⋆|²] is small
Learning an approximation q̂ to p(·), for instance via normalising flows
[Tabak & Turner, 2013; Jia & Seljak, 2019]
51. Density estimation by normalising flows
“A normalizing flow describes the transformation of a
probability density through a sequence of invertible map-
pings. By repeatedly applying the rule for change of
variables, the initial density ‘flows’ through the sequence
of invertible mappings. At the end of this sequence we
obtain a valid probability distribution and hence this type
of flow is referred to as a normalizing flow.”
[Rezende & Mohamed, 2015; Papamakarios et al., 2019]
52. Density estimation by normalising flows
Invertible and twice-differentiable transforms (diffeomorphisms) gi(·) = g(·; ηi) of a standard distribution ϕ(·)
Representation
$$z = g_p \circ \cdots \circ g_1(x)\,,\qquad x \sim \varphi(x)$$
Density of z by the Jacobian transform
$$q(z) = \varphi(x(z)) \prod_i \left|\mathrm{d}g_i/\mathrm{d}z_{i-1}\right|^{-1}\,,\qquad z_i = g_i(z_{i-1})$$
Flow defined as x, z1, . . . , zp = z
[Rezende & Mohamed, 2015; Papamakarios et al., 2019]
53. Density estimation by normalising flows
Composition of transforms:
$$(g_1 \circ g_2)^{-1} = g_2^{-1} \circ g_1^{-1}$$
$$\det J_{g_1 \circ g_2}(u) = \det J_{g_1}(g_2(u)) \times \det J_{g_2}(u)$$
[Rezende & Mohamed, 2015; Papamakarios et al., 2019]
55. Density estimation by normalising flows
Normalising flows are
- a flexible family of densities
- easy to train by optimisation (e.g., maximum likelihood estimation)
- a neural version of density estimation and generative modelling
- trained from observations
- natural tools for approximate Bayesian inference (variational inference, ABC, synthetic likelihood)
[Rezende & Mohamed, 2015; Papamakarios et al., 2019]
56. Invertible linear-time transformations
Family of transformations
$$g(z) = z + u\,h(w^{\mathsf{T}} z + b)\,,\qquad u, w \in \mathbb{R}^d,\ b \in \mathbb{R}\,,$$
with h a smooth element-wise non-linearity, with derivative h′
Jacobian term computed in O(d) time via
$$\psi(z) = h'(w^{\mathsf{T}} z + b)\,w$$
65. Invertible linear-time transformations
For the same family of transformations, the density q(z) is obtained by transforming an initial density ϕ(z) through the sequence of maps gi, i.e.
$$z = g_p \circ \cdots \circ g_1(x)$$
and
$$\log q(z) = \log \varphi(x) - \sum_{k=1}^{p} \log\left|1 + u_k^{\mathsf{T}}\psi_k(z_{k-1})\right|$$
[Rezende & Mohamed, 2015]
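A small R sketch of one planar layer and its O(d) log-Jacobian term (with h = tanh; the parameter values are arbitrary assumptions):
planar = function(z, u, w, b) {          # one layer g(z) = z + u h(w'z + b)
  a = sum(w * z) + b
  psi = (1 - tanh(a)^2) * w              # psi(z) = h'(w'z + b) w
  list(g = z + u * tanh(a), logdet = log(abs(1 + sum(u * psi))))
}
set.seed(8)
d = 2; x = rnorm(d)                      # draw from the base phi = N(0, I_2)
lay = planar(x, u = c(.5, -.3), w = c(1, 1), b = 0)
logq = sum(dnorm(x, log = TRUE)) - lay$logdet   # log-density of z_1 = g(x)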
66. General theory of normalising flows
”Normalizing flows provide a general mechanism for
defining expressive probability distributions, only requir-
ing the specification of a (usually simple) base distribu-
tion and a series of bijective transformations.”
$$T(u; \psi) = g_p(g_{p-1}(\ldots g_1(u; \eta_1)\ldots; \eta_{p-1}); \eta_p)$$
[Papamakarios et al., 2019]
67. General theory of normalising flows
“...how expressive are flow-based models? Can they rep-
resent any distribution p(x), even if the base distribution
is restricted to be simple? We show that this universal
representation is possible under reasonable conditions
on p(x).”
Obvious when considering the inverse conditional cdf
transforms, assuming differentiability
[Papamakarios et al., 2019]
68. General theory of normalising flows
“Minimizing the Monte Carlo approximation of the
Kullback–Leibler divergence [between the true and the
model densities] is equivalent to fitting the flow-based
model to the sample by maximum likelihood estimation.”
Estimate the flow-based model parameters by maximum likelihood:
$$\arg\max_\psi\ \sum_{i=1}^{n} \log\varphi(T^{-1}(x_i; \psi)) + \log\left|\det\{J_{T^{-1}}(x_i; \psi)\}\right|$$
Note the possible use of the reverse Kullback–Leibler divergence when learning an approximation (VA, IS, ABC) to a known [up to a constant] target p(x)
[Papamakarios et al., 2019]
90. [Table 1: multiple choices for the transformer τ(·; ϕ) and the conditioner c(·) (a neural network)]
[Papamakarios et al., 2019]
91. Practical considerations
“Implementing a flow often amounts to composing as
many transformations as computation and memory will
allow. Working with such deep flows introduces addi-
tional challenges of a practical nature.”
- the more the merrier?!
- batch normalisation for maintaining stable gradients (between layers)
- fighting the curse of dimensionality ("evaluating T incurs an increasing computational cost as dimensionality grows") with multiscale architectures (clamping: component-wise stopping rules)
[Papamakarios et al., 2019]
92. Applications
“Normalizing flows have two primitive operations: den-
sity calculation and sampling. In turn, flows are effec-
tive in any application requiring a probabilistic model
with either of those capabilities.”
- density estimation [speed of convergence?]
- proxy generative model
- importance sampling for integration, by minimising the distance to the integrand or the IS variance [finite?]
- MCMC flow substitute for HMC
[Papamakarios et al., 2019]
93. Applications
“Normalizing flows have two primitive operations: den-
sity calculation and sampling. In turn, flows are effec-
tive in any application requiring a probabilistic model
with either of those capabilities.”
- optimised reparameterisation of the target for MCMC [exact?]
- variational approximation by maximising the evidence lower bound (ELBO) to the posterior on the parameter η = T(u, ϕ),
$$\sum_{i=1}^{n} \underbrace{\log p(x^{\mathrm{obs}}, T(u_i; \phi))}_{\text{joint}} + \log\left|\det J_T(u_i; \phi)\right|$$
- substitutes for likelihood-free inference on either π(η|x^obs) or p(x^obs|η)
[Papamakarios et al., 2019]
94. A revolution in machine learning?
“One area where neural networks are being actively de-
veloped is density estimation in high dimensions: given
a set of points x ∼ p(x), the goal is to estimate the
probability density p(·). As there are no explicit la-
bels, this is usually considered an unsupervised learning
task. We have already discussed that classical methods
based for instance on histograms or kernel density esti-
mation do not scale well to high-dimensional data. In
this regime, density estimation techniques based on neu-
ral networks are becoming more and more popular. One
class of these neural density estimation techniques are
normalizing flows.”
[Cranmer et al., PNAS, 2020]
95. Reconnecting with Geyer (1994)
“...neural networks can be trained to learn the likelihood
ratio function p(x|ϑ0)/p(x|ϑ1) or p(x|ϑ0)/p(x), where in
the latter case the denominator is given by a marginal
model integrated over a proposal or the prior (...) The
key idea is closely related to the discriminator network
in GANs mentioned above: a classifier is trained us-
ing supervised learning to discriminate two sets of data,
though in this case both sets come from the simulator
and are generated for different parameter points ϑ0 and
ϑ1. The classifier output function can be converted into
an approximation of the likelihood ratio between ϑ0 and
ϑ1! This manifestation of the Neyman-Pearson lemma
in a machine learning setting is often called the likeli-
hood ratio trick.”
[Cranmer et al., PNAS, 2020]
97. Generative models
"Deep generative models that can learn via the principle of maximum likelihood differ with respect to how they represent or approximate the likelihood." I. Goodfellow
Likelihood function
$$L(\vartheta|x_1, \ldots, x_n) \propto \prod_{i=1}^{n} p_{\mathrm{model}}(x_i|\vartheta)$$
leading to the MLE estimate
$$\hat\vartheta(x_1, \ldots, x_n) = \arg\max_\vartheta\ \sum_{i=1}^{n} \log p_{\mathrm{model}}(x_i|\vartheta)$$
with
$$\hat\vartheta(x_1, \ldots, x_n) = \arg\min_\vartheta\ D_{\mathrm{KL}}(p_{\mathrm{data}}\,\|\,p_{\mathrm{model}}(\cdot|\vartheta))$$
98. Likelihood complexity
Explicit solutions:
- domino representation ("fully visible belief networks")
$$p_{\mathrm{model}}(x) = \prod_{t=1}^{T} p_{\mathrm{model}}(x_t|x_{1:t-1})$$
- "non-linear independent component analysis" (cf. normalizing flows)
$$p_{\mathrm{model}}(x) = p_z(g_\phi^{-1}(x))$$
141. Likelihood complexity (cont.)
- variational lower bound log pmodel(x; ϑ) ≥ L(x; ϑ), represented by variational autoencoders
- Markov chain Monte Carlo (MCMC) maximisation
142. Likelihood complexity
Implicit solutions involving sampling from the model pmodel
without computing density
- ABC algorithms for MLE derivation
[Piccini & Anderson, 2017]
- generative stochastic networks
[Bengio et al., 2014]
- generative adversarial networks (GANs)
[Goodfellow et al., 2014]
144. Autoencoders
“An autoencoder is a neural network that is trained to
attempt to copy its input x to its output r = g(h) via
a hidden layer h = f(x) (...) [they] are designed to be
unable to copy perfectly”
- undercomplete autoencoders (with dim(h) < dim(x))
- regularised autoencoders, with objective
$$L(x, g \circ f(x)) + \Omega(h)\,,$$
where the penalty Ω is akin to a log-prior
- denoising autoencoders (learning x from a noisy version x̃ of x)
- stochastic autoencoders (learning pdecode(x|h) for a given pencode(h|x) without compatibility)
[Goodfellow et al., 2016, p.496]
145. Variational autoencoders (VAEs)
"The key idea behind the variational autoencoder is to attempt to sample values of Z that are likely to have produced X = x, and compute p(x) just from those."
Representation of the (marginal) likelihood pϑ(x) based on a latent variable z,
$$p_\vartheta(x) = \int p_\vartheta(x|z)\,p_\vartheta(z)\,\mathrm{d}z$$
Machine learning usually preoccupied only with maximising pϑ(x) (in ϑ) by simulating z efficiently (i.e., not from the prior), based on
$$\log p_\vartheta(x) - D[q_\phi(\cdot|x)\,\|\,p_\vartheta(\cdot|x)] = \mathbb{E}_{q_\phi(\cdot|x)}[\log p_\vartheta(x|Z)] - D[q_\phi(\cdot|x)\,\|\,p_\vartheta(\cdot)]$$
since x is fixed (Bayesian analogy)
147. Variational autoencoders (VAEs)
$$\log p_\vartheta(x) - D[q_\phi(\cdot|x)\,\|\,p_\vartheta(\cdot|x)] = \mathbb{E}_{q_\phi(\cdot|x)}[\log p_\vartheta(x|Z)] - D[q_\phi(\cdot|x)\,\|\,p_\vartheta(\cdot)]$$
- lhs is the quantity to maximise (plus an error term, small for a good approximation qϕ, or a regularisation)
- rhs can be optimised by stochastic gradient descent when qϕ is manageable
- link with autoencoders, as qϕ(z|x) "encodes" x into z, and pϑ(x|z) "decodes" z to reconstruct x
[Doersch, 2021]
149. Variational autoencoders (VAEs)
“One major division in machine learning is generative
versus discriminative modeling (...) To turn a genera-
tive model into a discriminator we need Bayes rule.”
Representation of the (marginal) likelihood pϑ(x) based on a latent variable z
Variational approximation qϕ(z|x) (also called the encoder) to the posterior distribution on the latent variable z, pϑ(z|x), associated with the conditional distribution pϑ(x|z) (also called the decoder)
Example: qϕ(z|x) Normal distribution Nd(µ(x), Σ(x)) with
- (µ(x), Σ(x)) estimated by a deep neural network
- (µ(x), Σ(x)) estimated by ABC (synthetic likelihood)
[Kingma & Welling, 2014]
151. ELBO objective
Since
$$\log p_\vartheta(x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\vartheta(x)] = \mathbb{E}_{q_\phi(z|x)}\left[\log\frac{p_\vartheta(x, z)}{p_\vartheta(z|x)}\right] = \mathbb{E}_{q_\phi(z|x)}\left[\log\frac{p_\vartheta(x, z)}{q_\phi(z|x)}\right] + \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[\log\frac{q_\phi(z|x)}{p_\vartheta(z|x)}\right]}_{\geq 0}\,,$$
the evidence lower bound (ELBO) is defined by
$$\mathcal{L}_{\vartheta,\phi}(x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\vartheta(x, z)] - \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)]$$
and used as an objective function to be maximised in (ϑ, ϕ)
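The identity can be checked numerically in R on a conjugate toy model (an illustrative assumption): z ∼ N(0, 1), x|z ∼ N(z, 1), so that pϑ(x) = N(0, 2) and pϑ(z|x) = N(x/2, 1/2) are available in closed form:
set.seed(9)
x = 1.3
m = .4; s = .9                     # an arbitrary q_phi(z|x) = N(m, s^2)
z = rnorm(1e5, m, s)
elbo = mean(dnorm(x, z, 1, log = TRUE) + dnorm(z, 0, 1, log = TRUE) -
            dnorm(z, m, s, log = TRUE))
elbo - dnorm(x, 0, sqrt(2), log = TRUE)   # = -D[q_phi || p(.|x)], always <= 0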
152. ELBO maximisation
Stochastic gradient step, one parameter at a time
In iid settings
$$\mathcal{L}_{\vartheta,\phi}(x) = \sum_{i=1}^{n} \mathcal{L}_{\vartheta,\phi}(x_i)$$
and
$$\nabla_\vartheta \mathcal{L}_{\vartheta,\phi}(x_i) = \mathbb{E}_{q_\phi(z|x_i)}[\nabla_\vartheta \log p_\vartheta(x_i, z)] \approx \nabla_\vartheta \log p_\vartheta(x_i, \tilde z(x_i))$$
for one simulation z̃(xi) ∼ qϕ(z|xi), but ∇ϕLϑ,ϕ(xi) is more difficult to compute
154. ELBO maximisation (2)
Reparameterisation (also called normalising flow)
If z = g(x, ϕ, ε) ∼ qϕ(z|x) when ε ∼ r(ε),
$$\mathbb{E}_{q_\phi(z|x_i)}[h(Z)] = \mathbb{E}_r[h(g(x, \phi, \varepsilon))]$$
and
$$\nabla_\phi\,\mathbb{E}_{q_\phi(z|x_i)}[h(Z)] = \nabla_\phi\,\mathbb{E}_r[h \circ g(x, \phi, \varepsilon)] = \mathbb{E}_r[\nabla_\phi\,h \circ g(x, \phi, \varepsilon)] \approx \nabla_\phi\,h \circ g(x, \phi, \tilde\varepsilon)$$
for one simulation ε̃ ∼ r
[Kingma & Welling, 2014]
155. ELBO maximisation (2)
The reparameterisation leads to an unbiased estimator of the gradient of the ELBO:
$$\nabla_{\vartheta,\phi}\left\{\log p_\vartheta(x, g(x, \phi, \varepsilon)) - \log q_\phi(g(x, \phi, \varepsilon)|x)\right\}$$
[Kingma & Welling, 2014]
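A short R sketch of the reparameterisation trick on a case where the gradients are known in closed form (qϕ = N(m, s²) and h(z) = z², so ∂/∂m E[h] = 2m and ∂/∂s E[h] = 2s; illustrative assumptions):
set.seed(10)
m = .5; s = 1.2
eps = rnorm(1e5)                       # epsilon ~ r = N(0, 1), z = m + s * eps
hprime = function(z) 2 * z             # derivative of h(z) = z^2
mean(hprime(m + s * eps))              # pathwise estimate of d/dm E[h] = 2m = 1
mean(hprime(m + s * eps) * eps)        # pathwise estimate of d/ds E[h] = 2s = 2.4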
169. Generative adversarial networks (GANs)
Generative adversarial networks (GANs) provide an al-
gorithmic framework for constructing generative models
with several appealing properties:
– they do not require a likelihood function to be specified,
only a generating procedure;
– they provide samples that are sharp and compelling;
– they allow us to harness our knowledge of building
highly accurate neural network classifiers.
Shakir Mohamed & Balaji Lakshminarayanan, 2016
170. Implicit generative models
Representation of random variables as
$$x = G_\vartheta(z)\,,\qquad z \sim \mu(z)\,,$$
where µ(·) is a reference distribution and Gϑ a multi-layered and highly non-linear transform (as, e.g., in normalizing flows)
- more general and flexible than "prescriptive" models if implicit (black box)
- connected with pseudo-random variable generation
- calls for likelihood-free inference on ϑ
[Mohamed & Lakshminarayanan, 2016]
174. The ABC method
Bayesian setting: target is π(ϑ)f(x|ϑ)
When the likelihood f(x|ϑ) is not in closed form, likelihood-free rejection technique:
ABC algorithm
For an observation y ∼ f(y|ϑ), under the prior π(ϑ), keep jointly simulating
$$\vartheta' \sim \pi(\vartheta)\,,\qquad z \sim f(z|\vartheta')\,,$$
until the auxiliary variable z is equal to the observed value, z = y.
[Tavaré et al., 1997]
177. Why does it work?!
The proof is trivial (remember Rasmus' socks):
$$f(\vartheta_i) \propto \sum_{z\in D} \pi(\vartheta_i)\,f(z|\vartheta_i)\,\mathbb{I}_y(z) \propto \pi(\vartheta_i)\,f(y|\vartheta_i) = \pi(\vartheta_i|y)\,.$$
[Accept–Reject 101]
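A minimal R rendering of this exact-match sampler for a discrete toy model (binomial data with a uniform prior, assumed for illustration), where the ABC output provably matches the Beta posterior:
set.seed(11)
y = 7                              # observed value, y ~ Binomial(10, theta)
th = runif(1e5)                    # theta' ~ pi(theta) = U(0, 1)
z = rbinom(1e5, 10, th)            # z ~ f(z | theta')
post = th[z == y]                  # kept draws: exact posterior sample
mean(post); (y + 1) / (10 + 2)     # agrees with the Beta(y+1, 10-y+1) mean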
178. ABC as A...pproximative
When y is a continuous random variable, the equality z = y is replaced with a tolerance condition
$$\rho\{\eta(z), \eta(y)\} \leq \varepsilon\,,$$
where ρ is a distance and η(y) defines a (not necessarily sufficient) statistic
Output distributed from
$$\pi(\vartheta)\,P_\vartheta\{\rho(\eta(y), \eta(z)) < \varepsilon\} \propto \pi(\vartheta\,|\,\rho(\eta(y), \eta(z)) < \varepsilon)$$
[Pritchard et al., 1999]
180. ABC posterior
The likelihood-free algorithm samples from the marginal in z of
$$\pi_\varepsilon(\vartheta, z|y) = \frac{\pi(\vartheta)\,f(z|\vartheta)\,\mathbb{I}_{A_{\varepsilon,y}}(z)}{\int_{A_{\varepsilon,y}\times\Theta} \pi(\vartheta)\,f(z|\vartheta)\,\mathrm{d}z\,\mathrm{d}\vartheta}\,,$$
where $A_{\varepsilon,y} = \{z \in D\,|\,\rho(\eta(z), \eta(y)) < \varepsilon\}$.
The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:
$$\pi_\varepsilon(\vartheta|y) = \int \pi_\varepsilon(\vartheta, z|y)\,\mathrm{d}z \approx \pi(\vartheta|\eta(y))\,.$$
182. MA example
Back to the MA(2) model
$$x_t = \varepsilon_t + \sum_{i=1}^{2} \vartheta_i\,\varepsilon_{t-i}$$
Simple prior: uniform over the inverse [real and complex] roots in
$$Q(u) = 1 - \sum_{i=1}^{2} \vartheta_i u^i$$
under identifiability conditions
183. MA example
Equivalently, a uniform prior over the identifiability zone
184. MA example (2)
ABC algorithm thus made of
1. picking a new value (ϑ1, ϑ2) in the triangle
2. generating an iid sequence (εt)−2<t≤T
3. producing a simulated series (x′t)1≤t≤T
Distance: basic distance between the series
$$\rho\left((x'_t)_{1\leq t\leq T}, (x_t)_{1\leq t\leq T}\right) = \sum_{t=1}^{T} (x_t - x'_t)^2$$
or distance between summary statistics like the first 2 autocorrelations
$$\tau_j = \sum_{t=j+1}^{T} x_t x_{t-j}$$
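A hedged R sketch of the whole scheme (the true parameter, series length, tolerance, and the autocorrelation summaries are illustrative choices):
set.seed(12)
Tl = 100; th0 = c(.6, .2)                # hypothetical "true" (theta1, theta2)
ma2 = function(th) { e = rnorm(Tl + 2)
  e[-(1:2)] + th[1] * e[2:(Tl + 1)] + th[2] * e[1:Tl] }
stat = function(x)                       # lag-1 and lag-2 autocorrelation summaries
  c(sum(x[-1] * x[-Tl]), sum(x[-(1:2)] * x[1:(Tl - 2)]))
sobs = stat(ma2(th0))                    # observed summaries
N = 1e4
th1 = runif(N, -2, 2); th2 = runif(N, -1, 1)
keep = th1 + th2 > -1 & th1 - th2 < 1    # uniform prior on the triangle
th = cbind(th1, th2)[keep, ]
d = apply(th, 1, function(t) sum((stat(ma2(t)) - sobs)^2))
abc = th[d <= quantile(d, .01), ]        # 1% tolerance sample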
186. Comparison of distance impact
Evaluation of the tolerance on the ABC sample against both
distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
187. Comparison of distance impact
[Figure: ABC marginal samples of θ1 and θ2] Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
189. Back to Geyer’s 1994
Since we can easily draw samples from the model, we
can use any method that compares two sets of samples—
one from the true data distribution and one from the
model distribution—to drive learning. This is a pro-
cess of density estimation-by-comparison, comprising
two steps: comparison and estimation. For compari-
son, we test the hypothesis that the true data distribu-
tion p∗(x) and our model distribution qϑ(x) are equal,
using the density difference p∗(x) − qϑ(x), or the den-
sity ratio p∗(x)/qϑ(x) (...) The density ratio can be
computed by building a classifier to distinguish observed
data from that generated by the model.
[Mohamed & Lakshminarayanan, 2016]
190. Class-probability estimation
Closest to Geyer’s (1994) idea:
Data xobs ∼ p∗(x) and simulated sample xsim ∼ qϑ(x) with same
size
Classification indicators yobs = 1 and ysim = 0
Then
$$\frac{p^*(x)}{q_\vartheta(x)} = \frac{P(Y = 1|x)}{P(Y = 0|x)}$$
means that the frequency ratio of allocations to models p∗ and qϑ is an estimator of the ratio p∗/qϑ
Learning P(Y = 1|x) via statistical or machine-learning tools,
$$P(Y = 1|x) = D(x; \phi)$$
[Mohamed & Lakshminarayanan, 2016]
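A compact R check of this likelihood ratio trick (with assumed p∗ = N(1, 1) and qϑ = N(0, 1), so that log{p∗(x)/qϑ(x)} = x − 1/2):
set.seed(13)
n = 1e4
xs = c(rnorm(n, 1, 1), rnorm(n, 0, 1))
ys = rep(1:0, each = n)                  # 1 = "data", 0 = "simulated"
fit = glm(ys ~ xs, family = binomial)    # equal sizes: logit P(Y=1|x) = log ratio
coef(fit)                                # close to (-1/2, 1)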
191. Proper scoring rules
Learning about parameter ϕ via proper scoring rule
Advantage of a proper scoring rule: the global optimum is achieved iff qϑ = p∗, though with no convergence guarantees, the optimisation being non-convex
[Mohamed & Lakshminarayanan, 2016]
193. Proper scoring rules
Learning about parameter ϕ via proper scoring rule
For instance,
$$L(\phi, \vartheta) = \mathbb{E}_{p^*(x)}[-\log D(X; \phi)] + \mathbb{E}_{\mu(z)}[-\log(1 - D(G_\vartheta(Z); \phi))]$$
Principle of generative adversarial networks (GANs), with the score minimised in ϕ and maximised in ϑ
[Mohamed & Lakshminarayanan, 2016]
194. Divergence minimisation
Use of the f-divergence
$$D_f[p^*\,\|\,q_\vartheta] = \int q_\vartheta(x)\,f\!\left(\frac{p^*(x)}{q_\vartheta(x)}\right)\mathrm{d}x = \mathbb{E}_{q_\vartheta}[f(r_\phi(X))] \geq \sup_t\ \mathbb{E}_{p^*}[t(X)] - \mathbb{E}_{q_\vartheta}[f^\dagger(t(X))]\,,$$
where f is convex with derivative f′ and Fenchel conjugate f†
[Mohamed & Lakshminarayanan, 2016]
195. Divergence minimisation
Includes the Kullback–Leibler and Jensen–Shannon divergences
[Mohamed & Lakshminarayanan, 2016]
196. Divergence minimisation
Turning into the bi-level optimisation of
$$L = \mathbb{E}_{p^*}[-f'(r_\phi(X))] + \mathbb{E}_{q_\vartheta}[f^\dagger(f'(r_\phi(X)))] \qquad (g)$$
Minimise in ϕ the ratio loss L, minimising the negative variational lower bound, and minimise in ϑ the generative loss (g) to drive the ratio to one
[Mohamed & Lakshminarayanan, 2016]
197. Ratio matching
Minimise the error between the true density ratio r∗(x) and its estimate:
$$L = \frac{1}{2}\int q_\vartheta(x)\,(r_\phi(x) - r^*(x))^2\,\mathrm{d}x = \frac{1}{2}\,\mathbb{E}_{q_\vartheta}[r_\phi(X)^2] - \mathbb{E}_{p^*}[r_\phi(X)]\quad\text{(up to an additive constant)}$$
Equivalence with the ratio loss derived using divergence minimisation
[Mohamed & Lakshminarayanan, 2016]
198. Moment matching
Compare moments of both distributions by minimising a distance, using test [summary] statistics s(x) that provide the moments of interest:
$$L(\phi, \vartheta) = \left(\mathbb{E}_{p^*}[s(X)] - \mathbb{E}_{q_\vartheta}[s(X)]\right)^2 = \left(\mathbb{E}_{p^*}[s(X)] - \mathbb{E}_{\mu}[s(G(Z; \vartheta))]\right)^2$$
Choice of test statistics critical: case of statistics defined within a reproducing kernel Hilbert space, leading to the maximum mean discrepancy
[Mohamed & Lakshminarayanan, 2016]
199. Moment matching
"There is a great deal of opportunity for exchange between GANs, ABC and ratio estimation in aspects of scalability, applications, and theoretical understanding"
[Mohamed & Lakshminarayanan, 2016]
200. Generative adversarial networks (GANs)
Adversarial setting opposing a generator and a discriminator:
- generator G = G(z; ϑG) ∈ X as "best" guess of the data production mechanism, x ∼ pmodel(x; ϑG), with z a latent variable
- discriminator D = D(x; ϑD) ∈ [0, 1] as measuring the discrepancy between data (generation) and model (generation), P̂(x = G(z; ϑG); ϑD) and P̂(G(z) ≠ G(z; ϑG); ϑD)
Antagonism due to the objective function J(ϑG; ϑD), where G aims at confusing D, with equilibrium G = pdata and D ≡ 1/2
Both G and D possibly modelled as deep neural networks
[Goodfellow, 2016]
203. GANs losses
1. discriminator cost
$$J^D(\vartheta^D, \vartheta^G) = -\mathbb{E}_{p_{\mathrm{data}}}[\log D(X)] - \mathbb{E}_Z[\log\{1 - D(G(Z))\}]$$
as expected misclassification error, trained in ϑD on both the dataset and a G-generated dataset
2. generator cost, with several solutions:
2.1 minimax loss, JG = −JD, leading to ϑ̂G as a minimax estimator [poor performance]
2.2 maximum confusion
$$J^G(\vartheta^D, \vartheta^G) = -\mathbb{E}_Z[\log D(G(Z))]$$
as the probability of the discriminator D being mistaken by the generator G
2.3 maximum likelihood loss
$$J^G(\vartheta^D, \vartheta^G) = -\mathbb{E}_Z[\exp\{\sigma^{-1}(D\{G(Z)\})\}]$$
where σ is the logistic sigmoid function
205. GANs implementation
Recursive algorithm where at each iteration
1. minimise JD(ϑD, ϑG) in ϑD
2. minimise JG(ϑD, ϑG) in ϑG
based on empirical versions of the expectations E_{pdata} and E_{pmodel}
206. GANs implementation
Recursive algorithm where at each iteration
1. reduce JD(ϑD, ϑG) by a gradient step in ϑD
2. reduce JG(ϑD, ϑG) by a gradient step in ϑG
based on empirical versions of the expectations E_{pdata} and E_{pmodel}
207. noise contrastive versus adversarial
Connection with the noise contrastive approach:
Estimation of the parameters ϑ and Zϑ for the unnormalised model
$$p_{\mathrm{model}}(x; \vartheta, Z_\vartheta) = \frac{\hat p_{\mathrm{model}}(x; \vartheta)}{Z_\vartheta}$$
and loss function
$$J^G(\vartheta, Z_\vartheta) = -\mathbb{E}_{p_{\mathrm{model}}}[\log h(X)] - \mathbb{E}_q[\log\{1 - h(Y)\}]\,,$$
where q is an arbitrary generative distribution and
$$h(x) = 1 \Big/ \left(1 + \frac{q(x)}{p_{\mathrm{model}}(x; \vartheta, Z_\vartheta)}\right)$$
Converging method when q is everywhere positive, with a side estimate of Zϑ
[Gutmann & Hyvärinen, 2010]
209. noise contrastive versus adversarial
Connection with the noise contrastive approach:
Earlier version of GANs where
- learning from an artificial sample (as in Geyer, 1994)
- only learning the generator G while D is fixed
- free choice of q (generative, available, close to pdata)
- and no optimisation of q
210. Bayesian GANs
Bayesian version of generative adversarial networks, with priors
on both model (generator) and discriminator parameters
[Saatchi & Wilson, 2016]
211. Bayesian GANs
Somewhat remote from genuine statistical inference:
“GANs transform white noise through a deep neural net-
work to generate candidate samples from a data distri-
bution. A discriminator learns, in a supervised manner,
how to tune its parameters so as to correctly classify
whether a given sample has come from the generator
or the true data distribution. Meanwhile, the generator
updates its parameters so as to fool the discriminator.
As long as the generator has sufficient capacity, it can
approximate the cdf inverse-cdf composition required to
sample from a data distribution of interest.”
[Saatchi & Wilson, 2016]
212. Bayesian GANs
- rephrasing the statistical model as a white noise transform x = G(z, ϑ) (or an implicit generative model, see above)
- resorting to a prior distribution on ϑ still relevant (cf. probabilistic numerics)
- use of the discriminator function D(x; ϕ), the "probability that x comes from the data distribution" (versus the parametric alternative associated with the generator G(·; ϑ))
- difficulty with the posterior distribution (cf. below)
[Saatchi & Wilson, 2016]
213. Partly Bayesian GANs
Posterior distribution unorthodox in being associated with the conditional posteriors
$$\pi(\vartheta|z, \phi) \propto \prod_{i=1}^{n} D(G(z_i; \vartheta); \phi)\,\pi(\vartheta|\alpha_g) \qquad (2)$$
$$\pi(\phi|z, X, \vartheta) \propto \prod_{i=1}^{m} D(x_i; \phi) \times \prod_{i=1}^{n} (1 - D(G(z_i; \vartheta); \phi)) \times \pi(\phi|\alpha_d) \qquad (3)$$
1. the generative conditional posterior π(ϑ|z, ϕ) aims at fooling the discriminator D, by favouring generative ϑ values helping the wrong allocation of the pseudo-data
2. the discriminative conditional posterior π(ϕ|z, X, ϑ) is a standard Bayesian posterior based on the original and generated samples
[Saatchi & Wilson, 2016]
215. Partly Bayesian GANs
With the conditional posteriors (2) and (3) above, this opens the possibility of a two-stage Gibbs sampler:
"By iteratively sampling from (2) and (3) (...) obtain samples from the approximate posteriors over [both sets of parameters]."
[Hobert & Casella, 1994; Saatchi & Wilson, 2016]
216. Partly Bayesian GANs
Difficulty with the concept is that (2) and (3) cannot be compatible conditionals: there is no joint distribution for which (2) and (3) would be its conditionals
Reason: the pseudo-data appears in D for (2) and in (1 − D) for (3)
Convergence of the Gibbs sampler delicate to ascertain (limit?)
[Hobert & Casella, 1994; Saatchi & Wilson, 2016]
217. More Bayesian GANs?
Difficulty later pointed out by Han et al. (2018):
"the previous Bayesian method (Saatchi & Wilson, 2017) for any minimax GAN objective induces incompatibility of its defined conditional distributions."
New approach uses Bayesian framework in prior feedback spirit
(Robert, 1993) to converge to a single parameter value (and
there are “two likelihoods” for the same data, one being the
inverse of the other in the minimax GAN case)
219. Metropolis-Hastings GANs
“ With a perfect discriminator, this wrapped generator
samples from the true distribution on the data exactly
even when the generator is imperfect.”
Goal: sample from distribution implicitly defined by GAN
discriminator D learned for generator G
Recall that
“If D converges optimally for a fixed G, then D =
pdata/(pdata + pG), and if both D and G converge then
pG = pdata”
[Turner et al., ICML 2019]
221. Metropolis-Hastings GANs
In the ideal setting, D produces pdata since
$$\frac{p_{\mathrm{data}}(x)}{p_G(x)} = \frac{1}{D^{-1}(x) - 1}$$
and using pG as a Metropolis-Hastings proposal,
$$\alpha(x, x') = 1 \wedge \frac{D^{-1}(x) - 1}{D^{-1}(x') - 1}$$
Quite naïve (e.g., does not account for D and G being updated on-line or D being imperfect) and lacking connections with density estimation
[Turner et al., ICML 2019]
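A toy R version of this Metropolis-Hastings correction, with an assumed oracle discriminator for pdata = N(0, 1) and generator pG = N(0, 1.5²):
set.seed(14)
D = function(x) dnorm(x) / (dnorm(x) + dnorm(x, 0, 1.5))   # optimal D for this pair
N = 1e4; x = numeric(N); x[1] = rnorm(1, 0, 1.5)
for (t in 2:N) {
  prop = rnorm(1, 0, 1.5)              # independent proposal from p_G
  alpha = min(1, (1 / D(x[t - 1]) - 1) / (1 / D(prop) - 1))
  x[t] = if (runif(1) < alpha) prop else x[t - 1]
}
sd(x)                                   # close to 1: samples corrected towards p_data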
223. Secret GANs
“...evidence that it is beneficial to sample from the
energy-based model defined both by the generator and
the discriminator instead of from the generator only.”
Post-processing of the GAN generator G output, generating from a mixture of both generator and discriminator, via an unadjusted Langevin algorithm
Same core idea: if pdata is the true data generating process, pG the estimated generator and D the discriminator,
$$p_{\mathrm{data}}(x) \approx p^0(x) \propto p_G(x)\,\exp(D(x))$$
Again, the approximation is exact only when the discriminator is optimal (Theorem 1)
[Che et al., NeurIPS 2020]
225. Secret GANs
Difficulties with the proposal:
- the latent variable (white noise) z [such that x = G(z)] may imply a huge increase in dimension
- pG may be unavailable (in contrast with normalising flows)
Generation from p⁰ seen as accept-reject with [prior] proposal µ(z) and acceptance probability proportional to
$$d(z) = \exp[D\{G(z)\}]$$
In practice, an unadjusted [i.e., non-Metropolised] Langevin move
$$z_{t+1} = z_t - \frac{\varepsilon}{2}\,\nabla_z E(z_t) + \sqrt{\varepsilon}\,\eta\,,\qquad \eta \sim \mathcal{N}(0, I)\,,\quad \varepsilon \ll 1$$
[Che et al., NeurIPS 2020]
227. Secret GANs
Alternative WGAN: at step t,
1. the discriminator with parameter ϕ is trained to match
$$p_t(x) \propto p_g(x)\,\exp\{D_\phi(x)\}$$
with the data distribution pdata, using the gradient
$$\mathbb{E}_{p_t}[\nabla_\phi D(X; \phi)] - \mathbb{E}_{p_{\mathrm{data}}}[\nabla_\phi D(X; \phi)]\,,$$
whose first expectation is approximated by Langevin sampling
2. the generator with parameter ϑ is trained to match pg(x) to pt(x) [not pdata], using the gradient
$$\mathbb{E}_\mu[\nabla_\vartheta D(G(Z; \vartheta))]$$
[Che et al., NeurIPS 2020]