1. GANs from a statistical point of view
Maxime Sangnier
International workshop Machine Learning & Artificial Intelligence
September 17, 2018
Sorbonne Université, CNRS, LPSM, LIP6, Paris, France
Joint work with Gérard Biau¹, Benoît Cadre² and Ugo Tanielian¹,³
¹ Sorbonne Université, CNRS, LPSM, Paris, France
² ENS Rennes, Univ Rennes, CNRS, IRMAR, Rennes, France
³ Criteo, Paris, France
7. Painting
Interactive GAN.¹
¹ J.-Y. Zhu et al. "Generative Visual Manipulation on the Natural Image Manifold". In: European Conference on Computer Vision. 2016.
13. Motivation
Generative models aim at generating artificial content.
• Outstanding image generation and extrapolation⁵.
• And even more I'm not aware of...
Generative models are used for:
• exploring unseen realities;
• providing many answers to a single question.
⁵ T. Karras et al. "Progressive Growing of GANs for Improved Quality, Stability, and Variation". In: International Conference on Learning Representations. 2018.
16. Generate from data
X_1, ..., X_n i.i.d. according to an unknown density p⋆ on E ⊆ R^d.
How to sample according to p⋆?
Naive approach
1. estimate p⋆ by p̂;
2. sample according to p̂.
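A minimal sketch of this naive approach, assuming scikit-learn and a synthetic Gaussian sample standing in for the unknown p⋆ (all names and choices, such as the bandwidth, are illustrative only):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))        # stand-in for X_1, ..., X_n drawn from p*

# Step 1: estimate p* by a kernel density estimate p-hat.
p_hat = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X)

# Step 2: sample according to p-hat.
U = p_hat.sample(5, random_state=0)
print(U)
```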
Drawbacks
• both problems are difficult in themselves;
• we cannot define a realistic parametric statistical model;
• non-parametric density estimation is inefficient in high dimension;
• this approach violates Vapnik's principle:
  When solving a problem of interest, do not solve a more general problem as an intermediate step.
17. Some generative methods
Methods compared along three criteria: density-free learning, flexibility, and simple sampling.
• Autoregressive models (WaveNet⁶)
• Nonlinear independent components analysis (Real NVP⁷)
• Variational autoencoders⁸
• Boltzmann machines⁹
• Generative stochastic networks¹⁰
• Generative adversarial networks
⁶ A.v.d. Oord et al. "WaveNet: A Generative Model for Raw Audio". In: arXiv:1609.03499 [cs] (2016).
⁷ L. Dinh, J. Sohl-Dickstein, and S. Bengio. "Density estimation using Real NVP". In: arXiv:1605.08803 [cs, stat] (2016).
⁸ D.P. Kingma and M. Welling. "Auto-Encoding Variational Bayes". In: International Conference on Learning Representations. 2013.
⁹ S.E. Fahlman, G.E. Hinton, and T.J. Sejnowski. "Massively Parallel Architectures for AI: Netl, Thistle, and Boltzmann Machines". In: Proceedings of the Third AAAI Conference on Artificial Intelligence. 1983.
¹⁰ Y. Bengio et al. "Deep Generative Stochastic Networks Trainable by Backprop". In: International Conference on Machine Learning. 2014.
21. A direct approach
Cornerstone: don't estimate p⋆.
General procedure:
• sample U_1, ..., U_n i.i.d. from a parametric model;
• compare X_1, ..., X_n with U_1, ..., U_n and update the model.
GANs¹¹ follow this principle.
¹¹ I. Goodfellow et al. "Generative Adversarial Nets". In: Advances in Neural Information Processing Systems. 2014.
23. Generating a random sample
Inverse transform sampling
• S: scalar random variable;
• F_S: cumulative distribution function of S;
• Z ∼ U([0, 1]);
• F_S^{-1}(Z) =_d S (equality in distribution).
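A minimal sketch of inverse transform sampling, assuming an Exp(1) target (so F_S(s) = 1 − e^{−s} and F_S^{-1}(z) = −ln(1 − z)); purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.uniform(0.0, 1.0, size=100_000)   # Z ~ U([0, 1])
S = -np.log(1.0 - Z)                      # F_S^{-1}(Z), distributed as Exp(1)

print(S.mean(), S.var())                  # both should be close to 1
```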
Generators
• X_1, ..., X_n i.i.d. according to a density p⋆ on E ⊆ R^d, dominated by a known measure µ;
• G = {G_θ : R^{d′} → E}_{θ∈Θ}, Θ ⊂ R^p: parametric family of generators (d′ ≤ d);
• Z_1, ..., Z_n: random vectors of R^{d′} (typically U([0, 1]^{d′}));
• U_i = G_θ(Z_i): generated sample;
• P = {p_θ}_{θ∈Θ}: associated family of densities, with, by definition, G_θ(Z_1) ∼ p_θ dµ.
27. Generating a random sample
Remarks
• Each p_θ is a candidate to represent p⋆.
• The statistical model P = {p_θ}_{θ∈Θ} is just a mathematical tool for the analysis.
• It is not assumed that p⋆ belongs to P.
• In GANs, G_θ is a neural network with p weights, stored in θ ∈ R^p.
30. Comparing two samples
The next step
• The procedure should drive θ such that G_θ(Z_1) =_d X_1.
• Need to confront G_θ(Z_1), ..., G_θ(Z_n) with X_1, ..., X_n in order to update θ.
Supervised learning
• Both samples have the same distribution as soon as we cannot distinguish them.
• This is a classification problem:
  Class Y = 0: G_θ(Z_1), ..., G_θ(Z_n)
  Class Y = 1: X_1, ..., X_n
35. Adversarial principle
Discriminator
• D: a family of functions from E to [0, 1], the discriminators.
• Choose D ∈ D such that, for any x ∈ E,
  D(x) ≥ 1/2 ⟹ true observation,        (1)
  D(x) < 1/2 ⟹ fake (generated) point.   (2)
• Assume {(X_1, 1), ..., (X_n, 1), (G_θ(Z_1), 0), ..., (G_θ(Z_n), 0)} i.i.d. with the same distribution as (X, Y).
• Classification model: Y | X = x ∼ B(D(x)), i.e. P(Y = 1 | X = x) = D(x).
• Maximum (conditional) likelihood estimation:
  sup_{D∈D} ∏_{i=1}^n D(X_i) × ∏_{i=1}^n (1 − D(G_θ(Z_i)))   or   sup_{D∈D} L̂(θ, D),
  with
  L̂(θ, D) = (1/n) [ Σ_{i=1}^n ln D(X_i) + Σ_{i=1}^n ln(1 − D(G_θ(Z_i))) ].
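A minimal sketch of the empirical criterion L̂(θ, D), assuming NumPy and a discriminator passed as a plain function with values in (0, 1); the function names and toy data below are illustrative only:

```python
import numpy as np

def gan_criterion(D, X, G_theta_Z):
    """L-hat(theta, D) = (1/n) [ sum ln D(X_i) + sum ln(1 - D(G_theta(Z_i))) ]."""
    n = len(X)
    return (np.log(D(X)).sum() + np.log(1.0 - D(G_theta_Z)).sum()) / n

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=1000)      # observations from p*
U = rng.normal(loc=0.0, scale=1.0, size=1000)      # generated points G_theta(Z_i)
D = lambda x: 1.0 / (1.0 + np.exp(-(x - 1.0)))     # a hand-picked discriminator
print(gan_criterion(D, X, U))
```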
40. The GAN Zoo
Curbing the discriminator
• least squares¹²:
  inf_{D∈D} Σ_{i=1}^n (D(X_i) − 1)² + Σ_{i=1}^n D(G_θ(Z_i))²,   inf_{θ∈Θ} Σ_{i=1}^n (D(G_θ(Z_i)) − 1)².
• asymmetric hinge¹³:
  inf_{D∈D} − Σ_{i=1}^n D(X_i) + Σ_{i=1}^n max(0, 1 − D(G_θ(Z_i))),   inf_{θ∈Θ} − Σ_{i=1}^n D(G_θ(Z_i)).
¹² X. Mao et al. "Least Squares Generative Adversarial Networks". In: IEEE International Conference on Computer Vision. 2017.
¹³ J. Zhao, M. Mathieu, and Y. LeCun. "Energy-based Generative Adversarial Network". In: International Conference on Learning Representations. 2017.
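A minimal sketch of the least-squares criteria above, assuming NumPy arrays and a discriminator D given as a plain function; the names are illustrative only:

```python
import numpy as np

def lsgan_discriminator_loss(D, X, G_theta_Z):
    # inf over D of  sum (D(X_i) - 1)^2 + sum D(G_theta(Z_i))^2
    return ((D(X) - 1.0) ** 2).sum() + (D(G_theta_Z) ** 2).sum()

def lsgan_generator_loss(D, G_theta_Z):
    # inf over theta of  sum (D(G_theta(Z_i)) - 1)^2
    return ((D(G_theta_Z) - 1.0) ** 2).sum()
```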
41. The GAN Zoo
Metrics as minimax games
• Maximum mean discrepancy¹⁴ and Wasserstein¹⁵:
  inf_{θ∈Θ} sup_{T∈T} ∫ T p⋆ dµ − ∫ T p_θ dµ.
• f-divergences¹⁶:
  inf_{θ∈Θ} sup_{T∈T} ∫ T p⋆ dµ − ∫ (f⋆ ∘ T) p_θ dµ,
with T a prescribed class of functions and f⋆ the convex conjugate of a lower-semicontinuous function f.
¹⁴ G.K. Dziugaite, D.M. Roy, and Z. Ghahramani. "Training generative neural networks via Maximum Mean Discrepancy optimization". In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. 2015; Y. Li, K. Swersky, and R. Zemel. "Generative Moment Matching Networks". In: International Conference on Machine Learning. 2015.
¹⁵ M. Arjovsky, S. Chintala, and L. Bottou. "Wasserstein Generative Adversarial Networks". In: International Conference on Machine Learning. 2017.
¹⁶ S. Nowozin, B. Cseke, and R. Tomioka. "f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization". In: Neural Information Processing Systems. June 2016.
42. Roadmap
• Minimum divergence estimation: uniqueness of minimizers.
• Approximation properties: influence of the family of discriminators on the quality of the approximation.
• Statistical analysis: consistency and rate of convergence.
45. Kullback-Leibler and Jensen divergences
Kullback-Leibler
• For P ≪ Q probability measures on E:
  D_KL(P ‖ Q) = ∫ ln(dP/dQ) dP.
• Properties:
  D_KL(P ‖ Q) ≥ 0 and D_KL(P ‖ Q) = 0 ⟺ P = Q.
• If p = dP/dµ and q = dQ/dµ:
  D_KL(P ‖ Q) = ∫ p ln(p/q) dµ.
• D_KL is not symmetric and is defined only for P ≪ Q.
48. Kullback-Leibler and Jensen divergences
Jensen-Shannon
• For P and Q probability measures on E:
  D_JS(P, Q) = (1/2) D_KL(P ‖ (P + Q)/2) + (1/2) D_KL(Q ‖ (P + Q)/2).
• Property:
  0 ≤ D_JS(P, Q) ≤ ln 2.
• (P, Q) ↦ √D_JS(P, Q) is a distance (the JS distance).
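A minimal sketch computing D_JS between two discrete distributions with natural logarithms (so values lie in [0, ln 2]); plain NumPy, illustrative only:

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) for discrete distributions, with the convention 0 ln 0 = 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
print(js(p, q))   # ~0.347; the maximum ln 2 is reached only for disjoint supports
```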
50. GAN and Jensen-Shannon divergence
GANs
• Empirical criterion:
  L̂(θ, D) = (1/n) [ Σ_{i=1}^n ln D(X_i) + Σ_{i=1}^n ln(1 − D(G_θ(Z_i))) ].
• Problem:
  inf_{θ∈Θ} sup_{D∈D} L̂(θ, D).
Ideal GANs
• Population version of the criterion:
  L(θ, D) = ∫ ln(D) p⋆ dµ + ∫ ln(1 − D) p_θ dµ.
• No constraint: D = D_∞, the set of all functions from E to [0, 1].
• Problem:
  inf_{θ∈Θ} sup_{D∈D_∞} L(θ, D).
53. GAN and Jensen-Shannon divergence
From GAN to JS divergence
• Criterion:
  sup_{D∈D_∞} L(θ, D) = sup_{D∈D_∞} ∫ [ln(D) p⋆ + ln(1 − D) p_θ] dµ
                      ≤ ∫ sup_{D∈D_∞} [ln(D) p⋆ + ln(1 − D) p_θ] dµ.
• Optimal discriminator:
  D⋆_θ = p⋆ / (p⋆ + p_θ),   with the convention 0/0 = 0.
• Optimal criterion:
  sup_{D∈D_∞} L(θ, D) = L(θ, D⋆_θ) = 2 D_JS(p⋆, p_θ) − ln 4.
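For completeness, a short sketch (not spelled out on the slide) of the pointwise computation behind D⋆_θ and the JS identity:

```latex
\begin{align*}
  &\text{Fix } x \text{ and let } a = p^\star(x),\ b = p_\theta(x),\
   \varphi(t) = a \ln t + b \ln(1-t),\ t \in (0,1):\\
  &\varphi'(t) = \frac{a}{t} - \frac{b}{1-t} = 0 \iff t = \frac{a}{a+b},
   \qquad\text{hence } D^\star_\theta = \frac{p^\star}{p^\star + p_\theta}.\\
  &L(\theta, D^\star_\theta)
   = \int p^\star \ln\frac{p^\star}{p^\star + p_\theta}\,d\mu
   + \int p_\theta \ln\frac{p_\theta}{p^\star + p_\theta}\,d\mu
   = 2\, D_{\mathrm{JS}}(p^\star, p_\theta) - \ln 4.
\end{align*}
```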
56. The quest for D⋆_θ
Numerical approach
• Big n, big D: try to approximate D⋆_θ with arg max_{D∈D} L̂(θ, D).
• Close to divergence minimization: sup_{D∈D} L̂(θ, D) ≈ 2 D_JS(p⋆, p_θ) − ln 4.
Theorem
Let θ ∈ Θ and A_θ = {p⋆ = p_θ = 0}.
If µ(A_θ) = 0, then {D⋆_θ} = arg max_{D∈D_∞} L(θ, D).
If µ(A_θ) > 0, then D⋆_θ is unique only on E \ A_θ.
This completes Proposition 1 in Goodfellow et al.¹⁷
¹⁷ Goodfellow et al., "Generative Adversarial Nets".
58. Oracle parameter
• Oracle parameter regarding the Jensen-Shannon divergence:
  θ⋆ ∈ arg min_{θ∈Θ} L(θ, D⋆_θ) = arg min_{θ∈Θ} D_JS(p⋆, p_θ).
• G_θ⋆ is the ideal generator.
• If p⋆ ∈ P:
  p⋆ = p_θ⋆,   D_JS(p⋆, p_θ⋆) = 0,   D⋆_θ⋆ = 1/2.
• What if p⋆ ∉ P? Existence and uniqueness of θ⋆?
Theorem
Assume that P is a convex and compact set for the JS distance.
If p⋆ > 0 µ-almost everywhere, then there exists p̄ ∈ P such that
  {p̄} = arg min_{p∈P} D_JS(p⋆, p).
In addition, if the model P is identifiable, then there exists θ⋆ ∈ Θ such that
  {θ⋆} = arg min_{θ∈Θ} L(θ, D⋆_θ).
61. Oracle parameter
Existence and uniqueness
• Compactness of P and continuity of D_JS(p⋆, ·).
• p⋆ > 0 µ-a.e. ensures strict convexity of D_JS(p⋆, ·).
Compactness of P with respect to the JS distance
1. Θ compact and P convex.
2. For all x ∈ E, θ ∈ Θ ↦ p_θ(x) is continuous.
3. sup_{(θ,θ′)∈Θ²} |p_θ ln p_θ′| ∈ L¹(µ).
Identifiability
High-dimensional parametric settings are often misspecified ⟹ identifiability is not satisfied.
64. From JS divergence to likelihood
GAN ≠ JS divergence
• GANs don't minimize the Jensen-Shannon divergence.
• Considering sup_{D∈D_∞} L(θ, D) means knowing D⋆_θ = p⋆ / (p⋆ + p_θ), thus knowing p⋆.
Parametrized discriminators
• D = {D_α}_{α∈Λ}, Λ ⊂ R^q: parametric family of discriminators.
• Likelihood-type problem with two parametric families:
  inf_{θ∈Θ} sup_{α∈Λ} L(θ, D_α).
• Likelihood parameter:
  θ̄ ∈ arg min_{θ∈Θ} sup_{α∈Λ} L(θ, D_α).
• How close is the best candidate p_θ̄ to the ideal density p_θ⋆?
• How does it depend on the capability of D to approximate D⋆_θ⋆?
67. Approximation result
(Hε) There exist ε > 0, m ∈ (0, 1/2) and D ∈ D ∩ L²(µ) such that
  m ≤ D ≤ 1 − m and ‖D − D⋆_θ̄‖₂ ≤ ε.
Theorem
Assume that, for some M > 0, p⋆ ≤ M and p_θ̄ ≤ M.
Then, under Assumption (Hε) with ε < 1/(2M), there exists a constant c₁ > 0 (depending only upon m and M) such that
  D_JS(p⋆, p_θ̄) − min_{θ∈Θ} D_JS(p⋆, p_θ) ≤ c₁ ε².
Remarks
As soon as the class D becomes richer:
• minimizing sup_{α∈Λ} L(θ, D_α) over Θ helps minimize D_JS(p⋆, p_θ);
• since, under some assumptions, {p_θ⋆} = arg min_{p_θ : θ∈Θ} D_JS(p⋆, p_θ), p_θ̄ comes closer to p_θ⋆.
71. The estimation problem
Estimator
  θ̂ ∈ arg min_{θ∈Θ} sup_{α∈Λ} L̂(θ, α),
where
  L̂(θ, α) = (1/n) [ Σ_{i=1}^n ln D_α(X_i) + Σ_{i=1}^n ln(1 − D_α(G_θ(Z_i))) ].
(Hreg) Regularity conditions of order 1 on the models (G_θ, p_θ and D_α).
Existence
Under (Hreg), θ̂ exists (and so does θ̄).
Questions
• How far is D_JS(p⋆, p_θ̂) from min_{θ∈Θ} D_JS(p⋆, p_θ) = D_JS(p⋆, p_θ⋆)?
• Does θ̂ converge towards θ̄ as n → ∞?
• What is the asymptotic distribution of θ̂ − θ̄?
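A minimal sketch of the alternating optimization behind θ̂ (gradient ascent on L̂ in α, descent in θ), assuming PyTorch, a 1D Gaussian toy target, and illustrative architectures and step sizes that are not those of the slides:

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
d_prime = 1                                                                      # latent dimension
G = nn.Sequential(nn.Linear(d_prime, 16), nn.ReLU(), nn.Linear(16, 1))          # G_theta
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # D_alpha
opt_G, opt_D = optim.Adam(G.parameters(), 1e-3), optim.Adam(D.parameters(), 1e-3)
eps = 1e-8                                                                       # avoids log(0)

for step in range(2000):
    X = 2.0 + torch.randn(128, 1)              # minibatch from p* (a Gaussian stand-in)
    Z = torch.rand(128, d_prime)               # Z_i ~ U([0, 1]^{d'})
    # Discriminator: ascend L-hat(theta, alpha) in alpha (theta frozen via detach).
    L_hat = torch.log(D(X) + eps).mean() + torch.log(1 - D(G(Z).detach()) + eps).mean()
    opt_D.zero_grad(); (-L_hat).backward(); opt_D.step()
    # Generator: descend the second term of L-hat in theta (alpha held fixed).
    L_gen = torch.log(1 - D(G(Z)) + eps).mean()
    opt_G.zero_grad(); L_gen.backward(); opt_G.step()
```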
74. Non-asymptotic bound on the JS divergence
(Hε) There exist ε > 0 and m ∈ (0, 1/2) such that, for all θ ∈ Θ, there exists D ∈ D with m ≤ D ≤ 1 − m and ‖D − D⋆_θ‖₂ ≤ ε.
Theorem
Assume that, for some M > 0, p⋆ ≤ M and p_θ ≤ M for all θ ∈ Θ.
Then, under Assumptions (Hreg) and (Hε) with ε < 1/(2M), there exist two constants c₁ > 0 (depending only upon m and M) and c₂ such that
  E[D_JS(p⋆, p_θ̂)] − min_{θ∈Θ} D_JS(p⋆, p_θ) ≤ c₁ ε² + c₂ / √n.
Remarks
• Under (Hreg), {L̂(θ, α) − L(θ, α)}_{θ∈Θ, α∈Λ} is a subgaussian process for ‖·‖ / √n.
• Dudley's inequality: E sup_{θ∈Θ, α∈Λ} |L̂(θ, α) − L(θ, α)| = O(1/√n).
• c₂ scales as p + q ⟹ loose bound in the usual over-parametrized regime (LSUN, FACES: √n ≈ 1000 while p + q ≈ 1 500 000).
76. Illustration
Setting
• p⋆(x) = e^{−x/s} / (s (1 + e^{−x/s})²), x ∈ R: logistic density.
• G_θ and D_α are two fully connected neural networks.
• Z ∼ U([0, 1]): scalar noise.
• n = 100 000 (so that 1/√n is negligible) and 30 replications.
80. Convergence of θ̂
(Hreg) Regularity conditions of order 2 on the models (G_θ, p_θ and D_α).
Existence
Under (Hreg), θ̄ and ᾱ ∈ arg max_{α∈Λ} L(θ̄, α) exist.
(H1) The pair (θ̄, ᾱ) is unique and belongs to int(Θ) × int(Λ).
Theorem
Under Assumptions (Hreg) and (H1),
  θ̂ → θ̄ a.s.   and   α̂ → ᾱ a.s.
Remarks
• Convergence of θ̂ comes from sup_{θ∈Θ, α∈Λ} |L̂(θ, α) − L(θ, α)| → 0 a.s.
• It does not need uniqueness of ᾱ.
• Convergence of α̂ comes from that of θ̂.
81. Illustration
Setting
• Three models:
  1. Laplace: p⋆(x) = (1/3) e^{−2|x|/3} vs p_θ(x) = (1/(√(2π) θ)) e^{−x²/(2θ²)}.
  2. Claw: p⋆(x) = p_claw(x) vs p_θ(x) = (1/(√(2π) θ)) e^{−x²/(2θ²)}.
  3. Exponential: p⋆(x) = e^{−x} 1_{R₊}(x) vs p_θ(x) = (1/θ) 1_{[0,θ]}(x).
• G_θ: generalized inverse of the cdf of p_θ.
• Z ∼ U([0, 1]): scalar noise.
• D_α = p_{α₁} / (p_{α₁} + p_{α₀}).
• n = 10 to 10 000 and 200 replications.
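A minimal sketch of such generators obtained as generalized inverses of the cdf of p_θ, assuming SciPy for the Gaussian quantile function; purely illustrative:

```python
import numpy as np
from scipy.stats import norm

def G_gaussian(theta, z):
    # p_theta = N(0, theta^2): inverse cdf is theta * Phi^{-1}(z)
    return theta * norm.ppf(z)

def G_uniform(theta, z):
    # p_theta = U([0, theta]): inverse cdf is theta * z
    return theta * z

rng = np.random.default_rng(0)
Z = rng.uniform(size=5)                     # Z ~ U([0, 1]), scalar noise
print(G_gaussian(1.5, Z), G_uniform(1.5, Z))
```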
84. Central limit theorem
(Hloc) Local smoothness conditions around (θ̄, ᾱ) (such that the Hessians are invertible).
Theorem
Under Assumptions (Hreg), (H1) and (Hloc),
  √n (θ̂ − θ̄) →_d N(0, Σ).
Remark
One has ‖Σ‖₂ = O(p³ q⁴), which suggests that θ̂ has a large dispersion around θ̄ in the over-parametrized regime.
88. Take-home message
A first step for understanding GANs
• From data to sampling.
• The richness of the class of discriminators D controls the gap between GANs and the JS divergence.
• The generator parameters θ are asymptotically normal with rate √n.
Future investigations
1. Impact of the latent variable Z (dimension, distribution) and of the networks (number of layers in G_θ, dimensionality of Θ) on the performance of GANs (currently it is assumed that p⋆ ≪ µ and p_θ ≪ µ, which carries information on the supporting manifold of p⋆).
2. To what extent are the (Hε)-type assumptions satisfied for neural nets as discriminators?
3. Over-parametrized regime: convergence of distributions instead of parameters.