This document summarizes methods for Bayesian posterior inference on large datasets. It introduces stochastic variational inference (SVI), which makes variational inference scalable by using only a mini-batch of the data at each iteration, and the reparameterization trick, which reduces the variance of stochastic gradient estimates of the variational objective. These ideas are then applied to deep generative models such as variational autoencoders, which can learn complex representations from big data in a distributed manner. The second half covers mini-batch MCMC, in particular stochastic gradient Langevin dynamics and approximate Metropolis-Hastings tests, together with the bias-variance trade-off they introduce.
2. Outline
• Introduction
• Stochastic Variational Inference
– Variational Inference 101
– Stochastic Variational Inference
– Deep Generative Models with SVB
• MCMC with mini-batches
– MCMC 101
– MCMC using noisy gradients
– MCMC using noisy Metropolis-Hastings
– Theoretical results
3. Big Data (mine is bigger than yours)
The Square Kilometer Array (SKA) will produce 1 exabyte per day by 2024… (interested in doing approximate inference on this data? Talk to me.)
4. Introduction
Why do we need posterior inference if the datasets are BIG?
5. p >> N
Big data may mean large p, small N: gene expression data, fMRI data.
7. Little data inside Big data
Not every data-case carries information about every model component.
New user with no ratings (cold start problem).
8. Big Models!
• 1943: first NN (± N = 10)
• 1988: NetTalk (± N = 20K)
• 2009: Hinton's Deep Belief Net (± N = 10M)
• 2013: Google/Y! (± N = 10B)
Models grow faster than useful information in data.
9. Two Ingredients for Big Data Bayes
Any big data posterior inference algorithm should:
1. easily run on a distributed architecture.
2. only use a small mini-batch of the data at every iteration.
10. Bayesian Posterior Inference
[Figure: the variational family Q as a subset of all probability distributions.]
Variational:
• Deterministic
• Biased
• Local minima
• Easy to assess convergence
Sampling:
• Stochastic (sample error)
• Unbiased
• Hard to mix between modes
• Hard to assess convergence
11. Variational Bayes
Hinton & van Camp (1993); Neal & Hinton (1999); Saul & Jordan (1996); Saul, Jaakkola & Jordan (1996); Attias (1999, 2000); Wiegerinck (2000); Ghahramani & Beal (2000, 2001)
Coordinate descent on Q.
[Figure: P and the variational family Q; from Bishop, Pattern Recognition and Machine Learning.]
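For reference, the objective behind "coordinate descent on Q" can be written out; the notation below is standard, not taken from the slide:

```latex
\[
\log p(X) \;=\; \underbrace{\mathbb{E}_{q(\theta)}\!\left[\log\frac{p(X,\theta)}{q(\theta)}\right]}_{\text{ELBO }\mathcal{L}(q)} \;+\; \mathrm{KL}\big(q(\theta)\,\|\,p(\theta\mid X)\big),
\qquad
q_j^\ast(\theta_j) \;\propto\; \exp\!\Big(\mathbb{E}_{q_{-j}}\big[\log p(X,\theta)\big]\Big).
\]
```

Maximizing the ELBO over q in Q minimizes the KL divergence to the posterior; for a mean-field family q(θ) = ∏_j q_j(θ_j), the right-hand update is applied to one factor at a time.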
12. Stochastic VB (Hoffman, Blei & Bach, 2010)
Stochastic natural gradient descent on Q.
• P and Q in the exponential family.
• Q factorized over global and local variables.
• At every iteration: subsample n << N data-cases, then
– solve the local updates analytically;
– update the global parameter using stochastic natural gradient descent.
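In the conjugate exponential-family setting, the global update takes the following form (a sketch reconstructed from the SVI literature, not from the slide):

```latex
\[
\hat{\lambda} \;=\; \alpha \;+\; \frac{N}{n}\sum_{i\in S_t} \mathbb{E}_{q(z_i\mid \phi_i^\ast)}\!\big[t(x_i, z_i)\big],
\qquad
\lambda_{t+1} \;=\; (1-\rho_t)\,\lambda_t \;+\; \rho_t\,\hat{\lambda},
\]
```

where λ are the global variational parameters, φ_i* the analytically solved local parameters, t(·) the sufficient statistics, and ρ_t a decaying step size; the N/n scaling makes the mini-batch term an unbiased estimate of the full-data statistics.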
13. General SVB
Outside the conjugate setting, the ELBO gradient must be estimated by sampling, and the naive Monte Carlo estimate has very high variance.
Independently, subsample X for a mini-batch estimate (ignoring latent variables Z).
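The high-variance estimator alluded to here is presumably the score-function estimator of the ELBO gradient (reconstructed; the notation is mine):

```latex
\[
\nabla_\phi\, \mathbb{E}_{q_\phi(\theta)}\big[f(\theta)\big]
\;=\; \mathbb{E}_{q_\phi(\theta)}\big[f(\theta)\,\nabla_\phi \log q_\phi(\theta)\big]
\;\approx\; \frac{1}{L}\sum_{l=1}^{L} f(\theta^{(l)})\,\nabla_\phi \log q_\phi(\theta^{(l)}),
\quad \theta^{(l)} \sim q_\phi,
\]
```

with f(θ) = log p(X, θ) − log q_φ(θ). The reparameterization trick on the next slide is one way to tame its variance.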
14. Reparameterization Trick
Kingma 2013; Bengio 2013; Kingma & W. 2014
Other solutions to the same "large variance problem":
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression [T. Salimans and D.A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Estimating or propagating gradients through stochastic neurons [Y. Bengio, 2013]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]
Talk Monday June 23, 15:20, in Track F (Deep Learning II).
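To make the trick concrete, here is a minimal numpy sketch for a diagonal Gaussian Q(Z|X) = N(μ, σ²): writing z = μ + σ⊙ε with ε ~ N(0, I) moves the randomness into ε, so a Monte Carlo gradient can differentiate through μ and σ. The function names and toy objective are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, log_sigma, n_samples):
    """Reparameterized samples from N(mu, sigma^2): z = mu + sigma * eps."""
    eps = rng.standard_normal((n_samples, mu.size))  # all randomness lives here
    return mu + np.exp(log_sigma) * eps, eps

def grad_estimate(mu, log_sigma, f_grad, n_samples=100):
    """Pathwise (reparameterized) gradient of E_q[f(z)] w.r.t. (mu, log_sigma)."""
    z, eps = sample_z(mu, log_sigma, n_samples)
    g = f_grad(z)                                     # df/dz at the samples
    g_mu = g.mean(axis=0)                             # dz/dmu = 1
    g_ls = (g * eps * np.exp(log_sigma)).mean(axis=0) # dz/dlog_sigma = sigma*eps
    return g_mu, g_ls

# Toy target: f(z) = -0.5 * ||z||^2, so df/dz = -z.
mu, log_sigma = np.ones(2), np.zeros(2)
print(grad_estimate(mu, log_sigma, lambda z: -z))  # both gradients near -1
```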
16. Auto-Encoding Variational Bayes (Kingma & W., 2013; Rezende et al., 2014)
Both P(X|Z) and Q(Z|X) are general models (e.g. deep neural nets).
[Diagram: latent Z and observed X, with recognition model Q(Z|X) and generative model P(X|Z)P(Z).]
The Helmholtz machine / Wake-Sleep algorithm: Dayan, Hinton, Neal, Zemel, 1995.
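A minimal sketch of such a model, assuming PyTorch; the architecture, layer sizes, and Bernoulli likelihood are illustrative choices, not the specific models of the paper.

```python
# Minimal VAE sketch (assumes PyTorch; sizes and likelihood are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        # Recognition model Q(Z|X): a net outputting mean and log-variance.
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        # Generative model P(X|Z): a net outputting Bernoulli logits.
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        logits = self.dec(z)
        # ELBO = E_Q[log P(X|Z)] - KL(Q(Z|X) || P(Z)), with P(Z) = N(0, I).
        rec = -F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return kl - rec  # negative ELBO, to be minimized

loss = VAE()(torch.rand(32, 784))  # one mini-batch of fake binarized images
loss.backward()
```

Minimizing this negative ELBO with mini-batch SGD satisfies both ingredients from slide 9: it touches only a mini-batch per iteration and parallelizes trivially.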
22. Semi-supervised Model (Kingma, Rezende, Mohamed, Wierstra, W., 2014)
[Diagram: latent Z and label Y generating X.]
Q(Y, Z|X) = Q(Z|Y, X) Q(Y|X)
P(X, Z, Y) = P(X|Z, Y) P(Y) P(Z)
Analogies: fix Z, vary Y, sample X|Z,Y.
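In the Kingma et al. (2014) paper, labeled and unlabeled cases contribute different bounds; sketched here from the paper rather than the slide, with L(x, y) the ELBO for a labeled pair, the unlabeled bound marginalizes the unknown label over the classifier Q(Y|X):

```latex
\[
\mathcal{U}(x) \;=\; \sum_{y} Q(y\mid x)\,\mathcal{L}(x,y) \;+\; \mathcal{H}\!\big(Q(y\mid x)\big),
\]
```

so the classifier is trained jointly with the generative model.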
23. REFERENCES
Lots of action at ICML 2014!

SVB:
- Practical Variational Inference for Neural Networks [Alex Graves, 2011]
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression. Bayesian Analysis [T. Salimans and D.A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Stochastic Structured Mean Field Variational Inference [Matthew Hoffman, 2013]
- Doubly Stochastic Variational Bayes for non-Conjugate Inference [M.K. Titsias and M. Lázaro-Gredilla, 2014]

STOCHASTIC BACKPROP AND DEEP GENERATIVE MODELS:
- Fast Gradient-Based Inference with Continuous Latent Variable Models in Auxiliary Form [D.P. Kingma, 2013]
- Estimating or propagating gradients through stochastic neurons [Y. Bengio, 2013]
- Auto-Encoding Variational Bayes [D.P. Kingma and M. W., 2013]
- Semi-supervised Learning with Deep Generative Models [D.P. Kingma, D.J. Rezende, S. Mohamed, M. W., 2014]
- Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets [D.P. Kingma and M. W., 2014]
- Deep Generative Stochastic Networks Trainable by Backprop [Y. Bengio, E. Laufer, G. Alain, J. Yosinski, 2014]
- Stochastic Back-propagation and Approximate Inference in Deep Generative Models [D.J. Rezende, S. Mohamed and D. Wierstra, 2014]
- Deep AutoRegressive Networks [K. Gregor, A. Mnih and D. Wierstra, 2014]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]
26. Sampling 101 – Why MCMC?
Generating Independent Samples: sample from a proposal g and suppress samples with low p(θ|X), e.g. (a) Rejection Sampling, (b) Importance Sampling. Does not scale to high dimensions.
Markov Chain Monte Carlo:
• Make steps by perturbing the previous sample.
• Probability of visiting a state is equal to P(θ|X).
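To make the first bullet concrete, here is a self-normalized importance-sampling sketch; the toy target, proposal, and function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_expectation(log_p, g_sample, log_g, h, n=10_000):
    """Estimate E_p[h(theta)] using samples from a proposal g.

    Weights w ~ p/g suppress samples landing where p is low; in high
    dimensions almost all weight concentrates on a few samples, which is
    why this does not scale."""
    theta = g_sample(n)
    log_w = log_p(theta) - log_g(theta)
    w = np.exp(log_w - log_w.max())   # stabilize before normalizing
    w /= w.sum()                      # self-normalization cancels constants
    return np.sum(w * h(theta))

# Toy: unnormalized p = N(1, 1), proposal g = N(0, 2^2), h(theta) = theta.
est = importance_expectation(
    log_p=lambda t: -0.5 * (t - 1.0) ** 2,
    g_sample=lambda n: rng.normal(0.0, 2.0, size=n),
    log_g=lambda t: -0.5 * (t / 2.0) ** 2 - np.log(2.0),
    h=lambda t: t)
print(est)  # close to the true mean, 1.0
```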
42. Sampling 101 – Metropolis-Hastings
Transition Kernel T(θt+1|θt): Propose → Accept/Reject Test.
The test asks: is the new state more probable, and is it easy to come back to the current state?
For Bayesian posterior inference:
1) Burn-in is unnecessarily slow.
2) The O(N) cost of each accept/reject test is too high.
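One MH step as a sketch; the random-walk proposal and toy posterior are illustrative. Note that the full-data log-posterior inside the test is what makes each step O(N):

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_step(theta, log_post, step=0.5):
    """One Metropolis-Hastings step with a symmetric random-walk proposal."""
    prop = theta + step * rng.standard_normal()   # propose
    # Accept with prob min(1, p(prop|X)/p(theta|X)); for a Bayesian posterior,
    # log_post sums over all N data cases: the O(N) bottleneck.
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        return prop                               # accept
    return theta                                  # reject

# Toy posterior: N = 1000 observations x_i ~ N(theta, 1), flat prior.
x = rng.normal(1.0, 1.0, size=1000)
log_post = lambda th: -0.5 * np.sum((x - th) ** 2)  # touches all N points
theta = 0.0
for _ in range(2000):
    theta = mh_step(theta, log_post)
print(theta)  # samples concentrate near the posterior mean, ~ x.mean()
```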
49. Approximate MCMC
[Figure: scatter of samples along a bias-variance trade-off. Left: low variance (fast) but high bias; right: high variance (slow) but low bias. Decreasing ϵ moves the sampler from the fast/biased regime toward the slow/unbiased one.]
51. Minimizing Risk
[Plot: x axis ϵ; y axis Bias², Variance, and Risk, at a fixed computational time.]
Risk = Bias² + Variance
Given finite sampling time, ϵ = 0 is not the optimal setting.
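For an estimator Î of a posterior expectation I = E[h(θ)|X], the risk plotted here is the expected squared error, which decomposes in the usual way (a standard identity, not from the slide):

```latex
\[
\mathbb{E}\big[(\hat{I}-I)^2\big] \;=\; \big(\mathbb{E}[\hat{I}]-I\big)^2 \;+\; \mathrm{Var}(\hat{I}) \;=\; \text{Bias}^2 + \text{Variance}.
\]
```

At a fixed computational budget, a larger ϵ yields more, cheaper samples (lower variance) at the price of higher bias, which is why the risk is minimized at some ϵ > 0.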
56. Designing fast MCMC samplers
Propose → Accept/Reject: O(N).
Method 1: develop an approximate accept/reject test that uses only a fraction of the data.
Method 2: develop a proposal with acceptance probability ≈ 1 and avoid the expensive accept/reject test.
59. Stochastic Gradient Langevin Dynamics (W. & Teh, 2011)
Langevin Dynamics:
θt+1 = θt + (ε/2) ∇θ ( log p(θt) + Σ_{i=1..N} log p(xi|θt) ) + ηt,  ηt ~ N(0, ε)
θt+1 is then accepted/rejected using a Metropolis-Hastings test.
Stochastic Gradient Langevin Dynamics (SGLD): replace the full-data gradient with a mini-batch estimate,
θt+1 = θt + (ε/2) ∇θ ( log p(θt) + (N/n) Σ_{i∈St} log p(xi|θt) ) + ηt,  ηt ~ N(0, ε)
Avoid the expensive Metropolis-Hastings test by keeping ε small.
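A sketch of the SGLD update on a toy model; the mini-batch size, step size, and model are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_i ~ N(theta, 1) with a flat prior, so
# grad_theta log p(x_i | theta) = x_i - theta.
x = rng.normal(1.0, 1.0, size=10_000)
N, n, eps = x.size, 100, 1e-6

theta, samples = 0.0, []
for t in range(20_000):
    batch = x[rng.integers(0, N, size=n)]     # mini-batch, n << N
    grad = (N / n) * np.sum(batch - theta)    # unbiased full-data gradient estimate
    theta += 0.5 * eps * grad + np.sqrt(eps) * rng.standard_normal()
    samples.append(theta)

# The posterior is N(x.mean(), 1/N); with small fixed eps the SGLD sample
# mean matches it closely (a small finite-eps bias remains in the variance).
print(np.mean(samples[5000:]), x.mean())
```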
66. The SGLD Knob
Decrease ϵ over time: burn-in → biased → exact.
[Figure: the bias-variance scatter from slide 49; decreasing ϵ moves from low variance / high bias (fast) to high variance / low bias (slow).]
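The decaying step size in W. & Teh (2011) is polynomial, chosen to satisfy the Robbins-Monro conditions Σt εt = ∞ and Σt εt² < ∞:

```latex
\[
\varepsilon_t \;=\; a\,(b+t)^{-\gamma}, \qquad \gamma \in (0.5, 1].
\]
```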