GANs from a statistical point of view
Maxime Sangnier
International workshop Machine Learning & Artificial Intelligence
September 17, 2018
Sorbonne Université, CNRS, LPSM, LIP6, Paris, France
Joint work with Gérard Biau¹, Benoît Cadre² and Ugo Tanielian¹,³
¹ Sorbonne Université, CNRS, LPSM, Paris, France
² ENS Rennes, Univ Rennes, CNRS, IRMAR, Rennes, France
³ Criteo, Paris, France
Contributors
Gérard Biau (Sorbonne Université) Benoît Cadre (ENS Rennes)
Ugo Tanielian (Sorbonne Université & Criteo) 1
Generative models
Motivation
Generative models aim at generating artificial contents.
• Images:
• merchandising;
• painting;
• art;
• super-resolution and denoising;
• text to image.
• Movies:
• pose to movie;
• Audio:
• speech synthesis;
• music.
2
Merchandising
vue.ai
3
Art
prisma-ai.com
4
Painting
Interactive GAN.¹
¹ J.-Y. Zhu et al. “Generative Visual Manipulation on the Natural Image Manifold”. In: European Conference on Computer Vision. 2016.
5
Superresolution
SuperResolution GAN.²
² C. Ledig et al. “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network”. In: arXiv:1609.04802 [cs, stat] (2016).
6
Text-to-image
Stacked GAN.³
³ H. Zhang et al. “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”. In: arXiv:1612.03242 [cs, stat] (2016).
7
Movies
Everybody Dance Now.⁴
⁴ C. Chan et al. “Everybody Dance Now”. In: arXiv:1808.07371 [cs] (2018).
8
Speech synthesis
WaveNet by DeepMind.
9
Motivation
Generative models aim at generating artificial contents.
• Outstanding image generation and extrapolation⁵.
• And even more I’m not aware of. . .
Generative models are used for:
• exploring unseen realities;
• providing many answers to a single question.
⁵ T. Karras et al. “Progressive Growing of GANs for Improved Quality, Stability, and Variation”. In: International Conference on Learning Representations. 2018.
10
Generate from data
X1, . . . , Xn i.i.d. according to an unknown density p* on E ⊆ R^d.
How to sample according to p*?
Naive approach
1. estimate p* by p̂;
2. sample according to p̂.
Drawbacks
• both problems are difficult in themselves;
• we cannot define a realistic parametric statistical model;
• non-parametric density estimation is inefficient in high dimension;
• this approach violates Vapnik’s principle:
When solving a problem of interest, do not solve a more general problem as an intermediate step.
11
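As a concrete illustration of the naive approach (a minimal one-dimensional sketch of my own, not part of the talk), one can take p̂ to be a Gaussian kernel density estimate; scipy supports both fitting and resampling:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X = rng.standard_normal(1000)     # X1, ..., Xn drawn from the unknown density p*

p_hat = gaussian_kde(X)           # step 1: estimate p* by a kernel density estimate p_hat
U = p_hat.resample(5).ravel()     # step 2: sample according to p_hat
print(U)
```

The sketch works in one dimension; in high dimension the kernel estimate degrades quickly, which is precisely the inefficiency pointed out above.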
Some generative methods
Methods compared (criteria: density-free, flexibility, simple sampling):
• Autoregressive models (WaveNet⁶)
• Nonlinear independent components analysis (Real NVP⁷)
• Variational autoencoders⁸
• Boltzmann machines⁹
• Generative stochastic networks¹⁰
• Generative adversarial networks
⁶ A.v.d. Oord et al. “WaveNet: A Generative Model for Raw Audio”. In: arXiv:1609.03499 [cs] (2016).
⁷ L. Dinh, J. Sohl-Dickstein, and S. Bengio. “Density estimation using Real NVP”. In: arXiv:1605.08803 [cs, stat] (2016).
⁸ D.P. Kingma and M. Welling. “Auto-Encoding Variational Bayes”. In: International Conference on Learning Representations. 2013.
⁹ S.E. Fahlman, G.E. Hinton, and T.J. Sejnowski. “Massively Parallel Architectures for AI: Netl, Thistle, and Boltzmann Machines”. In: Proceedings of the Third AAAI Conference on Artificial Intelligence. 1983.
¹⁰ Y. Bengio et al. “Deep Generative Stochastic Networks Trainable by Backprop”. In: International Conference on Machine Learning. 2014.
12
Generative adversarial models
A direct approach
Cornerstone: don’t estimate p*.
General procedure:
• sample U1, . . . , Un i.i.d. thanks to a parametric model;
• compare X1, . . . , Xn and U1, . . . , Un and update the model.
GANs¹¹ follow this principle.
¹¹ I. Goodfellow et al. “Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems. 2014.
13
Generating a random sample
Inverse transform sampling
• S: scalar random variable;
• FS: cumulative distribution function of S;
• Z ∼ U([0, 1]);
• FS^{-1}(Z) has the same distribution as S.
Generators
• X1, . . . , Xn i.i.d. according to a density p* on E ⊆ R^d, dominated by a known measure µ.
• G = {Gθ : R^d′ → E}θ∈Θ, Θ ⊂ R^p: parametric family of generators (d′ ≤ d);
• Z1, . . . , Zn random vectors from R^d′ (typically U([0, 1]^d′));
• Ui = Gθ(Zi): generated sample;
• P = {pθ}θ∈Θ: associated family of densities such that, by definition, Gθ(Z1) ∼ pθ dµ.
14
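A minimal sketch of inverse transform sampling (my own example with an exponential S, not from the slides): the quantile function of an Exp(1) variable is FS^{-1}(z) = −ln(1 − z), so pushing uniform noise through it reproduces the target distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def F_inv(z):
    """Quantile function of the Exp(1) distribution: F_S^{-1}(z) = -ln(1 - z)."""
    return -np.log(1.0 - z)

Z = rng.uniform(size=100_000)   # Z ~ U([0, 1])
S = F_inv(Z)                    # F_S^{-1}(Z) has the same distribution as S
print(S.mean(), S.var())        # both close to 1, the mean and variance of Exp(1)
```

A GAN generator Gθ plays exactly the role of FS^{-1} here, except that it is a neural network learned from data rather than a known quantile function.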
Generating a random sample
Remarks
• Each pθ is a candidate to represent p*.
• The statistical model P = {pθ}θ∈Θ is just a mathematical tool for the analysis.
• It is not assumed that p* belongs to P.
• In GANs: Gθ is a neural network with p weights, stored in θ ∈ R^p.
15
Comparing two samples
The next step
• The procedure should drive θ such that Gθ(Z1) has the same distribution as X1.
• Need to confront Gθ(Z1), . . . , Gθ(Zn) to X1, . . . , Xn in order to update θ.
Supervised learning
• Both samples have the same distribution as soon as we cannot distinguish them.
• This is a classification problem:
Class Y = 0: Gθ(Z1), . . . , Gθ(Zn)   |   Class Y = 1: X1, . . . , Xn
16
Adversarial principle
Discriminator
• D a family of functions from E to [0, 1]: the discriminators.
• Choose D ∈ D such that for any x ∈ E,
D(x) ≥ 1/2 =⇒ true observation, (1)
D(x) < 1/2 =⇒ fake (generated) point. (2)
• Assume {(X1, 1), . . . , (Xn, 1), (Gθ(Z1), 0), . . . , (Gθ(Zn), 0)} i.i.d. with same distribution as (X, Y).
• Classification model: Y | X = x ∼ B(D(x)), i.e. P(Y = 1 | X = x) = D(x).
• Maximum (conditional) likelihood estimation:
sup_{D∈D} ∏_{i=1}^n D(Xi) × ∏_{i=1}^n (1 − D(Gθ(Zi)))   or   sup_{D∈D} L̂(θ, D),
with
L̂(θ, D) = (1/n) [ Σ_{i=1}^n ln(D(Xi)) + Σ_{i=1}^n ln(1 − D(Gθ(Zi))) ].
17
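The empirical criterion L̂(θ, D) is straightforward to compute once the two samples and a discriminator are given; a small sketch (the logistic discriminator and the toy samples below are placeholders of my own, not the talk's):

```python
import numpy as np

def gan_criterion(D, X, U):
    """L_hat(theta, D) = (1/n) [sum ln D(X_i) + sum ln(1 - D(U_i))], with U_i = G_theta(Z_i)."""
    return np.mean(np.log(D(X))) + np.mean(np.log(1.0 - D(U)))

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, size=500)                 # real sample
U = rng.normal(loc=0.0, size=500)                 # generated sample for some theta
D = lambda x: 1.0 / (1.0 + np.exp(-(x - 1.0)))    # a fixed logistic discriminator
print(gan_criterion(D, X, U))
```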
Adversarial principle
Generator
• sup_{D∈D} L̂(θ, D) acts like a divergence between the distributions of Gθ(Z1), . . . , Gθ(Zn) and X1, . . . , Xn.
• Minimum divergence estimation:
inf_{θ∈Θ} sup_{D∈D} L̂(θ, D),
or
inf_{θ∈Θ} sup_{D∈D} Σ_{i=1}^n ln(D(Xi)) + Σ_{i=1}^n ln(1 − D(Gθ(Zi))).
• Adversarial, minimax or zero-sum game.
18
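In practice this zero-sum game is solved by alternating stochastic gradient steps: an ascent step on L̂ in D and a descent step in θ. A schematic PyTorch loop (the architectures, learning rates and toy real distribution are placeholders of mine, not the talk's setting):

```python
import torch
from torch import nn, optim

d, d_latent = 2, 2
G = nn.Sequential(nn.Linear(d_latent, 32), nn.ReLU(), nn.Linear(32, d))         # G_theta
D = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # D
opt_G = optim.Adam(G.parameters(), lr=1e-3)
opt_D = optim.Adam(D.parameters(), lr=1e-3)

def sample_real(n):                        # stand-in for the observations X1, ..., Xn
    return torch.randn(n, d) + 3.0

for _ in range(1000):
    X = sample_real(64)
    Z = torch.rand(64, d_latent)           # Z ~ U([0, 1]^d')
    # ascent step on L_hat over the discriminator
    loss_D = -(torch.log(D(X)).mean() + torch.log(1 - D(G(Z).detach())).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # descent step on L_hat over the generator parameters theta
    loss_G = torch.log(1 - D(G(Z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```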
The GAN Zoo
Avinash Hindupur’s Github. 19
The GAN Zoo
Curbing the discriminator
• least squares¹²:
inf_{D∈D} Σ_{i=1}^n (D(Xi) − 1)² + Σ_{i=1}^n D(Gθ(Zi))²,   inf_{θ∈Θ} Σ_{i=1}^n (D(Gθ(Zi)) − 1)².
• asymmetric hinge¹³:
inf_{D∈D} −Σ_{i=1}^n D(Xi) + Σ_{i=1}^n max(0, 1 − D(Gθ(Zi))),   inf_{θ∈Θ} −Σ_{i=1}^n D(Gθ(Zi)).
¹² X. Mao et al. “Least Squares Generative Adversarial Networks”. In: IEEE International Conference on Computer Vision. 2017.
¹³ J. Zhao, M. Mathieu, and Y. LeCun. “Energy-based Generative Adversarial Network”. In: International Conference on Learning Representations. 2017.
20
The GAN Zoo
Metrics as minimax games
• Maximum mean discrepancy¹⁴ and Wasserstein¹⁵:
inf_{θ∈Θ} sup_{T∈T} ∫ T p* dµ − ∫ T pθ dµ.
• f-divergences¹⁶:
inf_{θ∈Θ} sup_{T∈T} ∫ T p* dµ − ∫ (f* ∘ T) pθ dµ,
with T a prescribed class of functions and f* the convex conjugate of a lower-semicontinuous function f.
¹⁴ G.K. Dziugaite, D.M. Roy, and Z. Ghahramani. “Training generative neural networks via Maximum Mean Discrepancy optimization”. In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. 2015; Y. Li, K. Swersky, and R. Zemel. “Generative Moment Matching Networks”. In: International Conference on Machine Learning. 2015.
¹⁵ M. Arjovsky, S. Chintala, and L. Bottou. “Wasserstein Generative Adversarial Networks”. In: International Conference on Machine Learning. 2017.
¹⁶ S. Nowozin, B. Cseke, and R. Tomioka. “f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization”. In: Neural Information Processing Systems. June 2016.
21
Roadmap
• Minimum divergence estimation: uniqueness of minimizers.
• Approximation properties: importance of the family of discriminators for the quality of the approximation.
• Statistical analysis: consistency and rate of convergence.
22
Minimum divergence estimation
Kullback-Leibler and Jensen divergences
Kullback-Leibler
• For P ≪ Q probability measures on E:
DKL(P ‖ Q) = ∫ ln(dP/dQ) dP.
• Properties:
DKL(P ‖ Q) ≥ 0,   DKL(P ‖ Q) = 0 ⇐⇒ P = Q.
• If p = dP/dµ and q = dQ/dµ:
DKL(P ‖ Q) = ∫ p ln(p/q) dµ.
• DKL is not symmetric and defined only for P ≪ Q.
23
Kullback-Leibler and Jensen divergences
Jensen-Shannon
• For P and Q probability measures on E:
DJS(P, Q) = (1/2) DKL(P ‖ (P + Q)/2) + (1/2) DKL(Q ‖ (P + Q)/2).
• Property:
0 ≤ DJS(P, Q) ≤ ln 2.
• (P, Q) → √DJS(P, Q) is a distance.
24
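A small numerical check of these definitions on discrete distributions (my own illustration): with disjoint supports, DJS reaches its maximal value ln 2.

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) for discrete distributions, with convention 0 ln 0 = 0 (requires P << Q)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """D_JS(P, Q) = (1/2) D_KL(P || (P+Q)/2) + (1/2) D_KL(Q || (P+Q)/2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
print(js(p, q), np.log(2))   # disjoint supports: D_JS equals its maximum ln 2
print(js(p, p))              # identical distributions: D_JS = 0
```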
GAN and Jensen-Shannon divergence
GANs
• Empirical criterion:
L̂(θ, D) = (1/n) [ Σ_{i=1}^n ln(D(Xi)) + Σ_{i=1}^n ln(1 − D(Gθ(Zi))) ].
• Problem:
inf_{θ∈Θ} sup_{D∈D} L̂(θ, D).
Ideal GANs
• Population version of the criterion:
L(θ, D) = ∫ ln(D) p* dµ + ∫ ln(1 − D) pθ dµ.
• No constraint: D = D∞, set of all functions from E to [0, 1].
• Problem:
inf_{θ∈Θ} sup_{D∈D∞} L(θ, D).
25
GAN and Jensen-Shannon divergence
From GAN to JS divergence
• Criterion:
sup_{D∈D∞} L(θ, D) = sup_{D∈D∞} ∫ [ln(D) p* + ln(1 − D) pθ] dµ
≤ ∫ sup_{D∈D∞} [ln(D) p* + ln(1 − D) pθ] dµ.
• Optimal discriminator:
Dθ = p* / (p* + pθ),
with convention 0/0 = 0.
• Optimal criterion:
sup_{D∈D∞} L(θ, D) = L(θ, Dθ) = 2 DJS(p*, pθ) − ln 4.
• Problem:
inf_{θ∈Θ} sup_{D∈D∞} L(θ, D) = inf_{θ∈Θ} L(θ, Dθ) = 2 inf_{θ∈Θ} DJS(p*, pθ) − ln 4.
26
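These identities can be checked numerically on a toy discrete example where µ is the counting measure (the two densities below are arbitrary choices of mine):

```python
import numpy as np

p_star = np.array([0.6, 0.3, 0.1])   # stand-in for p*
p_th   = np.array([0.2, 0.5, 0.3])   # stand-in for p_theta

# optimal discriminator D_theta = p* / (p* + p_theta)
D = p_star / (p_star + p_th)

# L(theta, D_theta) = sum p* ln(D_theta) + sum p_theta ln(1 - D_theta)
L_opt = np.sum(p_star * np.log(D)) + np.sum(p_th * np.log(1.0 - D))

# 2 D_JS(p*, p_theta) - ln 4
m = 0.5 * (p_star + p_th)
d_js = 0.5 * np.sum(p_star * np.log(p_star / m)) + 0.5 * np.sum(p_th * np.log(p_th / m))
print(L_opt, 2.0 * d_js - np.log(4.0))   # the two values coincide
```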
The quest for Dθ
Numerical approach
• Big n, big D: try to approximate Dθ with arg max_{D∈D} L̂(θ, D).
• Close to divergence minimization: sup_{D∈D} L̂(θ, D) ≈ 2 DJS(p*, pθ) − ln 4.
Theorem
Let θ ∈ Θ and Aθ = {p* = pθ = 0}.
If µ(Aθ) = 0, then {Dθ} = arg max_{D∈D∞} L(θ, D).
If µ(Aθ) > 0, then Dθ is unique only on E \ Aθ.
Completes Proposition 1 in¹⁷.
¹⁷ Goodfellow et al., “Generative Adversarial Nets”.
27
Oracle parameter
• Oracle parameter regarding the Jensen-Shannon divergence:
θ* ∈ arg min_{θ∈Θ} L(θ, Dθ) = arg min_{θ∈Θ} DJS(p*, pθ).
• Gθ* is the ideal generator.
• If p* ∈ P,
p* = pθ*,   DJS(p*, pθ*) = 0,   Dθ* = 1/2.
• What if p* ∉ P? Existence and uniqueness of θ*?
Theorem
Assume that P is a convex and compact set for the JS distance.
If p* > 0 µ-almost everywhere, then there exists p̄ ∈ P such that
{p̄} = arg min_{p∈P} DJS(p*, p).
In addition, if the model P is identifiable, then there exists θ* ∈ Θ such that
{θ*} = arg min_{θ∈Θ} L(θ, Dθ).
28
Oracle parameter
Existence and uniqueness
• Compactness of P and continuity of DJS(p*, ·).
• p* > 0 µ-a.e. enables strict convexity of DJS(p*, ·).
Compactness of P with respect to the JS distance
1. Θ compact and P convex.
2. For all x ∈ E, θ ∈ Θ ↦ pθ(x) is continuous.
3. sup_{(θ,θ′)∈Θ²} |pθ ln pθ′| ∈ L¹(µ).
Identifiability
High-dimensional parametric settings are often misspecified =⇒ identifiability is not satisfied.
29
Approximation properties
From JS divergence to likelihood
GAN ≠ JS divergence
• GANs don’t minimize the Jensen-Shannon divergence.
• Considering sup_{D∈D∞} L(θ, D) means knowing Dθ = p* / (p* + pθ), thus knowing p*.
Parametrized discriminators
• D = {Dα}α∈Λ, Λ ⊂ R^q: parametric family of discriminators.
• Likelihood-type problem with two parametric families:
inf_{θ∈Θ} sup_{α∈Λ} L(θ, Dα).
• Likelihood parameter:
θ̄ ∈ arg min_{θ∈Θ} sup_{α∈Λ} L(θ, Dα).
• How close is the best candidate pθ̄ to the ideal density pθ*?
• How does it depend on the capability of D to approximate Dθ?
30
Approximation result
(Hε) There exist ε > 0, m ∈ (0, 1/2) and D ∈ D ∩ L²(µ) such that
m ≤ D ≤ 1 − m and ‖D − Dθ̄‖₂ ≤ ε.
Theorem
Assume that, for some M > 0, p* ≤ M and pθ̄ ≤ M.
Then, under Assumption (Hε) with ε < 1/(2M), there exists a constant c₁ > 0 (depending only upon m and M) such that
DJS(p*, pθ̄) − min_{θ∈Θ} DJS(p*, pθ) ≤ c₁ε².
Remarks
As soon as the class D becomes richer:
• minimizing sup_{α∈Λ} L(θ, Dα) over Θ helps minimizing DJS(p*, pθ);
• since, under some assumptions, {pθ*} = arg min_{pθ: θ∈Θ} DJS(p*, pθ), pθ̄ comes closer to pθ*.
31
Statistical analysis
The estimation problem
Estimator
θ̂ ∈ arg min_{θ∈Θ} sup_{α∈Λ} L̂(θ, α),
where
L̂(θ, α) = (1/n) [ Σ_{i=1}^n ln(Dα(Xi)) + Σ_{i=1}^n ln(1 − Dα(Gθ(Zi))) ].
(Hreg) Regularity conditions of order 1 on the models (Gθ, pθ and Dα).
Existence
Under (Hreg), θ̂ exists (and so does θ̄).
Questions
• How far is DJS(p*, pθ̂) from min_{θ∈Θ} DJS(p*, pθ) = DJS(p*, pθ*)?
• Does θ̂ converge towards θ̄ as n → ∞?
• What is the asymptotic distribution of θ̂ − θ̄?
32
Non-asymptotic bound on the JS divergence
(Hε) There exist ε > 0 and m ∈ (0, 1/2) such that for all θ ∈ Θ, there exists D ∈ D with m ≤ D ≤ 1 − m and ‖D − Dθ‖₂ ≤ ε.
Theorem
Assume that, for some M > 0, p* ≤ M and pθ ≤ M for all θ ∈ Θ.
Then, under Assumptions (Hreg) and (Hε) with ε < 1/(2M), there exist two constants c₁ > 0 (depending only upon m and M) and c₂ such that
E[DJS(p*, pθ̂)] − min_{θ∈Θ} DJS(p*, pθ) ≤ c₁ε² + c₂ / √n.
Remarks
• Under (Hreg), {L̂(θ, α) − L(θ, α)}θ∈Θ,α∈Λ is a subgaussian process for the metric ‖·‖/√n.
• Dudley’s inequality: E sup_{θ∈Θ,α∈Λ} |L̂(θ, α) − L(θ, α)| = O(1/√n).
• c₂ scales as p + q =⇒ loose bound in the usual over-parametrized regime (LSUN, FACES: √n ≈ 1000 while p + q ≈ 1 500 000).
33
Illustration
Setting
• p*(x) = e^{−x/s} / (s (1 + e^{−x/s})²), x ∈ R: logistic density.
• Gθ and Dα are two fully connected neural networks.
• Z ∼ U([0, 1]): scalar noise.
• n = 100000 (1/√n is negligible) and 30 replications.
34
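For reference, the target of this experiment can be simulated directly; a minimal sketch (the scale s = 1 is my assumption, the slide leaves it generic):

```python
import numpy as np

rng = np.random.default_rng(0)
s = 1.0   # scale of the logistic density (assumed value)

def p_star(x):
    """Logistic density p*(x) = e^{-x/s} / (s (1 + e^{-x/s})^2)."""
    e = np.exp(-x / s)
    return e / (s * (1.0 + e) ** 2)

X = rng.logistic(loc=0.0, scale=s, size=100_000)   # the observations X1, ..., Xn
print(p_star(0.0))                        # 1 / (4 s) = 0.25
print(X.std(), np.pi * s / np.sqrt(3))    # empirical vs theoretical standard deviation
```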
Illustration
Setting
• Generator depth: 3.
• Discriminator depth: 2 then 5.
35
Convergence of θ̂
(Hreg) Regularity conditions of order 2 on the models (Gθ, pθ and Dα).
Existence
Under (Hreg), θ̄ and ᾱ ∈ arg min_{α∈Λ} L(θ̄, α) exist.
(H1) The pair (θ̄, ᾱ) is unique and belongs to int(Θ) × int(Λ).
Theorem
Under Assumptions (Hreg) and (H1),
θ̂ → θ̄ a.s. and α̂ → ᾱ a.s.
Remarks
• Convergence of θ̂ comes from sup_{θ∈Θ,α∈Λ} |L̂(θ, α) − L(θ, α)| → 0 a.s.
• It does not need uniqueness of ᾱ.
• Convergence of α̂ comes from that of θ̂.
36
Illustration
Setting
• Three models:
1. Laplace: p*(x) = (1/3) e^{−2|x|/3} vs pθ(x) = (1/(√(2π) θ)) e^{−x²/(2θ²)}.
2. Claw: p*(x) = pclaw(x) vs pθ(x) = (1/(√(2π) θ)) e^{−x²/(2θ²)}.
3. Exponential: p*(x) = e^{−x} 1_{R+}(x) vs pθ(x) = (1/θ) 1_{[0,θ]}(x).
• Gθ: generalized inverse of the cdf of pθ.
• Z ∼ U([0, 1]): scalar noise.
• Dα = pα₁ / (pα₁ + pα₀).
• n = 10 to 10000 and 200 replications.
37
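For the Exponential vs Uniform model, the generator has a closed form: the cdf of pθ = (1/θ) 1_{[0,θ]} is x/θ on [0, θ], so its generalized inverse is Gθ(z) = θz. A minimal sketch of the resulting sampler (my own illustration; the candidate value θ = 2 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def G(theta, z):
    """Generalized inverse of the cdf of p_theta = (1/theta) 1_[0, theta]: G_theta(z) = theta * z."""
    return theta * z

X = rng.exponential(size=10_000)   # X_i ~ p*(x) = e^{-x} on R_+
Z = rng.uniform(size=10_000)       # Z_i ~ U([0, 1])
U = G(2.0, Z)                      # generated sample U_i = G_theta(Z_i) for theta = 2
print(X.mean(), U.mean())          # target mean 1 vs uniform-on-[0, 2] mean 1
```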
Illustration
Claw vs Gaussian Exponential vs Uniform
38
Central limit theorem
(Hloc) Local smoothness conditions around (θ̄, ᾱ) (such that Hessians are invertible).
Theorem
Under Assumptions (Hreg), (H1) and (Hloc),
√n (θ̂ − θ̄) → N(0, Σ) in distribution.
Remark
One has ‖Σ‖₂ = O(p³ q⁴), which suggests that θ̂ has a large dispersion around θ̄ in the over-parametrized regime.
39
Illustration
Histograms of √n (θ̂ − θ̄):
Claw vs Gaussian | Exponential vs Uniform
40
Conclusion
Take-home message
A first step for understanding GANs
• From data to sampling.
• The richness of the class of discriminators D controls the gap between GANs and the JS divergence.
• The generator parameters θ are asymptotically normal with rate √n.
Future investigations
1. Impact of the latent variable Z (dimension, distribution) and of the networks (number of layers in Gθ, dimensionality of Θ) on the performance of GANs (currently it is assumed p* ≪ µ, pθ ≪ µ =⇒ information on the supporting manifold of p*).
2. To what extent are assumptions of type (Hε) satisfied for neural nets as discriminators?
3. Over-parametrized regime: convergence of distributions instead of parameters.
41