This document summarizes methods for Bayesian posterior inference on large datasets. It introduces stochastic variational inference (SVI), which makes variational inference scalable by using only a mini-batch of the data at each iteration, and the reparameterization trick, which reduces the variance of stochastic gradient estimates of the variational objective. These ideas are then applied to deep generative models such as variational autoencoders, which can learn complex representations from big data in a distributed manner. The second half covers mini-batch MCMC, in particular stochastic gradient Langevin dynamics and approximate Metropolis-Hastings tests, together with the bias-variance trade-off they introduce.
2. Outline
• Introduction
• Stochastic Variational Inference
– Variational Inference 101
– Stochastic Variational Inference
– Deep Generative Models with SVB
• MCMC with mini-batches
– MCMC 101
– MCMC using noisy gradients
– MCMC using noisy Metropolis-Hastings
– Theoretical results
3. Big Data (mine is bigger than yours)
The Square Kilometer Array (SKA) will produce 1 exabyte per day by 2024… (interested in doing approximate inference on this data? Talk to me.)
4. Introduction
Why do we need posterior inference if the datasets are BIG?
5. p >> N
Big data may mean large p, small N: gene expression data, fMRI data.
7. Little data inside Big data
Not every data-case carries information about every model component.
New user with no ratings (cold start problem).
8. Big Models!
• 1943: first NN (± N = 10)
• 1988: NetTalk (± N = 20K)
• 2009: Hinton's Deep Belief Net (± N = 10M)
• 2013: Google/Y! (± N = 10B)
Models grow faster than useful information in data.
9. Two Ingredients for Big Data Bayes
Any big data posterior inference algorithm should:
1. easily run on a distributed architecture.
2. only use a small mini-batch of the data at every iteration.
10. Bayesian Posterior Inference
[Figure: the variational family Q as a subset of all probability distributions.]
Variational:
• Deterministic
• Biased
• Local minima
• Easy to assess convergence
Sampling:
• Stochastic (sample error)
• Unbiased
• Hard to mix between modes
• Hard to assess convergence
11. Variational Bayes
Hinton & van Camp (1993); Neal & Hinton (1999); Saul & Jordan (1996); Saul, Jaakkola & Jordan (1996); Attias (1999, 2000); Wiegerinck (2000); Ghahramani & Beal (2000, 2001)
Coordinate descent on Q.
[Figure: P and the variational family Q; from Bishop, Pattern Recognition and Machine Learning.]
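For reference, the objective behind "coordinate descent on Q" can be written out; the notation below is standard, not taken from the slide:

```latex
\[
\log p(X) \;=\; \underbrace{\mathbb{E}_{q(\theta)}\!\left[\log\frac{p(X,\theta)}{q(\theta)}\right]}_{\text{ELBO }\mathcal{L}(q)} \;+\; \mathrm{KL}\big(q(\theta)\,\|\,p(\theta\mid X)\big),
\qquad
q_j^\ast(\theta_j) \;\propto\; \exp\!\Big(\mathbb{E}_{q_{-j}}\big[\log p(X,\theta)\big]\Big).
\]
```

Maximizing the ELBO over q in Q minimizes the KL divergence to the posterior; for a mean-field family q(θ) = ∏_j q_j(θ_j), the right-hand update is applied to one factor at a time.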
12. Stochastic VB (Hoffman, Blei & Bach, 2010)
Stochastic natural gradient descent on Q.
• P and Q in the exponential family.
• Q factorized over global and local variables.
• At every iteration: subsample n << N data-cases, then
– solve the local updates analytically;
– update the global parameter using stochastic natural gradient descent.
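In the conjugate exponential-family setting, the global update takes the following form (a sketch reconstructed from the SVI literature, not from the slide):

```latex
\[
\hat{\lambda} \;=\; \alpha \;+\; \frac{N}{n}\sum_{i\in S_t} \mathbb{E}_{q(z_i\mid \phi_i^\ast)}\!\big[t(x_i, z_i)\big],
\qquad
\lambda_{t+1} \;=\; (1-\rho_t)\,\lambda_t \;+\; \rho_t\,\hat{\lambda},
\]
```

where λ are the global variational parameters, φ_i* the analytically solved local parameters, t(·) the sufficient statistics, and ρ_t a decaying step size; the N/n scaling makes the mini-batch term an unbiased estimate of the full-data statistics.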
13. General SVB
Outside the conjugate setting, the ELBO gradient must be estimated by sampling, and the naive Monte Carlo estimate has very high variance.
Independently, subsample X for a mini-batch estimate (ignoring latent variables Z).
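The high-variance estimator alluded to here is presumably the score-function estimator of the ELBO gradient (reconstructed; the notation is mine):

```latex
\[
\nabla_\phi\, \mathbb{E}_{q_\phi(\theta)}\big[f(\theta)\big]
\;=\; \mathbb{E}_{q_\phi(\theta)}\big[f(\theta)\,\nabla_\phi \log q_\phi(\theta)\big]
\;\approx\; \frac{1}{L}\sum_{l=1}^{L} f(\theta^{(l)})\,\nabla_\phi \log q_\phi(\theta^{(l)}),
\quad \theta^{(l)} \sim q_\phi,
\]
```

with f(θ) = log p(X, θ) − log q_φ(θ). The reparameterization trick on the next slide is one way to tame its variance.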
14. Reparameterization Trick
Kingma 2013; Bengio 2013; Kingma & W. 2014
Other solutions to the same "large variance problem":
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression [T. Salimans and D.A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Estimating or propagating gradients through stochastic neurons [Y. Bengio, 2013]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]
Talk Monday June 23, 15:20, in Track F (Deep Learning II).
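To make the trick concrete, here is a minimal numpy sketch for a diagonal Gaussian Q(Z|X) = N(μ, σ²): writing z = μ + σ⊙ε with ε ~ N(0, I) moves the randomness into ε, so a Monte Carlo gradient can differentiate through μ and σ. The function names and toy objective are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, log_sigma, n_samples):
    """Reparameterized samples from N(mu, sigma^2): z = mu + sigma * eps."""
    eps = rng.standard_normal((n_samples, mu.size))  # all randomness lives here
    return mu + np.exp(log_sigma) * eps, eps

def grad_estimate(mu, log_sigma, f_grad, n_samples=100):
    """Pathwise (reparameterized) gradient of E_q[f(z)] w.r.t. (mu, log_sigma)."""
    z, eps = sample_z(mu, log_sigma, n_samples)
    g = f_grad(z)                                     # df/dz at the samples
    g_mu = g.mean(axis=0)                             # dz/dmu = 1
    g_ls = (g * eps * np.exp(log_sigma)).mean(axis=0) # dz/dlog_sigma = sigma*eps
    return g_mu, g_ls

# Toy target: f(z) = -0.5 * ||z||^2, so df/dz = -z.
mu, log_sigma = np.ones(2), np.zeros(2)
print(grad_estimate(mu, log_sigma, lambda z: -z))  # both gradients near -1
```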
16. Auto-Encoding Variational Bayes (Kingma & W., 2013; Rezende et al., 2014)
Both P(X|Z) and Q(Z|X) are general models (e.g. deep neural nets).
[Diagram: latent Z and observed X, with recognition model Q(Z|X) and generative model P(X|Z)P(Z).]
The Helmholtz machine / Wake-Sleep algorithm: Dayan, Hinton, Neal, Zemel, 1995.
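A minimal sketch of such a model, assuming PyTorch; the architecture, layer sizes, and Bernoulli likelihood are illustrative choices, not the specific models of the paper.

```python
# Minimal VAE sketch (assumes PyTorch; sizes and likelihood are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        # Recognition model Q(Z|X): a net outputting mean and log-variance.
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        # Generative model P(X|Z): a net outputting Bernoulli logits.
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        logits = self.dec(z)
        # ELBO = E_Q[log P(X|Z)] - KL(Q(Z|X) || P(Z)), with P(Z) = N(0, I).
        rec = -F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return kl - rec  # negative ELBO, to be minimized

loss = VAE()(torch.rand(32, 784))  # one mini-batch of fake binarized images
loss.backward()
```

Minimizing this negative ELBO with mini-batch SGD satisfies both ingredients from slide 9: it touches only a mini-batch per iteration and parallelizes trivially.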
22. Semi-supervised Model (Kingma, Rezende, Mohamed, Wierstra, W., 2014)
[Diagram: latent Z and label Y generating X.]
Q(Y, Z|X) = Q(Z|Y, X) Q(Y|X)
P(X, Z, Y) = P(X|Z, Y) P(Y) P(Z)
Analogies: fix Z, vary Y, sample X|Z,Y.
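In the Kingma et al. (2014) paper, labeled and unlabeled cases contribute different bounds; sketched here from the paper rather than the slide, with L(x, y) the ELBO for a labeled pair, the unlabeled bound marginalizes the unknown label over the classifier Q(Y|X):

```latex
\[
\mathcal{U}(x) \;=\; \sum_{y} Q(y\mid x)\,\mathcal{L}(x,y) \;+\; \mathcal{H}\!\big(Q(y\mid x)\big),
\]
```

so the classifier is trained jointly with the generative model.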
23. REFERENCES
Lots of action at ICML 2014!

SVB:
- Practical Variational Inference for Neural Networks [Alex Graves, 2011]
- Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012]
- Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression. Bayesian Analysis [T. Salimans and D.A. Knowles, 2013]
- Black Box Variational Inference [R. Ranganath, S. Gerrish and D.M. Blei, 2013]
- Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013]
- Stochastic Structured Mean Field Variational Inference [Matthew Hoffman, 2013]
- Doubly Stochastic Variational Bayes for non-Conjugate Inference [M.K. Titsias and M. Lázaro-Gredilla, 2014]

STOCHASTIC BACKPROP AND DEEP GENERATIVE MODELS:
- Fast Gradient-Based Inference with Continuous Latent Variable Models in Auxiliary Form [D.P. Kingma, 2013]
- Estimating or propagating gradients through stochastic neurons [Y. Bengio, 2013]
- Auto-Encoding Variational Bayes [D.P. Kingma and M. W., 2013]
- Semi-supervised Learning with Deep Generative Models [D.P. Kingma, D.J. Rezende, S. Mohamed, M. W., 2014]
- Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets [D.P. Kingma and M. W., 2014]
- Deep Generative Stochastic Networks Trainable by Backprop [Y. Bengio, E. Laufer, G. Alain, J. Yosinski, 2014]
- Stochastic Back-propagation and Approximate Inference in Deep Generative Models [D.J. Rezende, S. Mohamed and D. Wierstra, 2014]
- Deep AutoRegressive Networks [K. Gregor, A. Mnih and D. Wierstra, 2014]
- Neural Variational Inference and Learning in Belief Networks [A. Mnih and K. Gregor, 2014]
26. Sampling 101 – Why MCMC?
Generating Independent Samples: sample from a proposal g and suppress samples with low p(θ|X), e.g. (a) Rejection Sampling, (b) Importance Sampling. Does not scale to high dimensions.
Markov Chain Monte Carlo:
• Make steps by perturbing the previous sample.
• Probability of visiting a state is equal to P(θ|X).
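To make the first bullet concrete, here is a self-normalized importance-sampling sketch; the toy target, proposal, and function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_expectation(log_p, g_sample, log_g, h, n=10_000):
    """Estimate E_p[h(theta)] using samples from a proposal g.

    Weights w ~ p/g suppress samples landing where p is low; in high
    dimensions almost all weight concentrates on a few samples, which is
    why this does not scale."""
    theta = g_sample(n)
    log_w = log_p(theta) - log_g(theta)
    w = np.exp(log_w - log_w.max())   # stabilize before normalizing
    w /= w.sum()                      # self-normalization cancels constants
    return np.sum(w * h(theta))

# Toy: unnormalized p = N(1, 1), proposal g = N(0, 2^2), h(theta) = theta.
est = importance_expectation(
    log_p=lambda t: -0.5 * (t - 1.0) ** 2,
    g_sample=lambda n: rng.normal(0.0, 2.0, size=n),
    log_g=lambda t: -0.5 * (t / 2.0) ** 2 - np.log(2.0),
    h=lambda t: t)
print(est)  # close to the true mean, 1.0
```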
42. Sampling 101 – Metropolis-Hastings
Transition Kernel T(θt+1|θt): Propose → Accept/Reject Test.
The test asks: is the new state more probable, and is it easy to come back to the current state?
For Bayesian posterior inference:
1) Burn-in is unnecessarily slow.
2) The O(N) cost of each accept/reject test is too high.
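One MH step as a sketch; the random-walk proposal and toy posterior are illustrative. Note that the full-data log-posterior inside the test is what makes each step O(N):

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_step(theta, log_post, step=0.5):
    """One Metropolis-Hastings step with a symmetric random-walk proposal."""
    prop = theta + step * rng.standard_normal()   # propose
    # Accept with prob min(1, p(prop|X)/p(theta|X)); for a Bayesian posterior,
    # log_post sums over all N data cases: the O(N) bottleneck.
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        return prop                               # accept
    return theta                                  # reject

# Toy posterior: N = 1000 observations x_i ~ N(theta, 1), flat prior.
x = rng.normal(1.0, 1.0, size=1000)
log_post = lambda th: -0.5 * np.sum((x - th) ** 2)  # touches all N points
theta = 0.0
for _ in range(2000):
    theta = mh_step(theta, log_post)
print(theta)  # samples concentrate near the posterior mean, ~ x.mean()
```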
49. Approximate MCMC
[Figure: scatter of samples along a bias-variance trade-off. Left: low variance (fast) but high bias; right: high variance (slow) but low bias. Decreasing ϵ moves the sampler from the fast/biased regime toward the slow/unbiased one.]
51. Minimizing Risk
[Plot: x axis ϵ; y axis Bias², Variance, and Risk, at a fixed computational time.]
Risk = Bias² + Variance
Given finite sampling time, ϵ = 0 is not the optimal setting.
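For an estimator Î of a posterior expectation I = E[h(θ)|X], the risk plotted here is the expected squared error, which decomposes in the usual way (a standard identity, not from the slide):

```latex
\[
\mathbb{E}\big[(\hat{I}-I)^2\big] \;=\; \big(\mathbb{E}[\hat{I}]-I\big)^2 \;+\; \mathrm{Var}(\hat{I}) \;=\; \text{Bias}^2 + \text{Variance}.
\]
```

At a fixed computational budget, a larger ϵ yields more, cheaper samples (lower variance) at the price of higher bias, which is why the risk is minimized at some ϵ > 0.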
56. Designing fast MCMC samplers
Propose → Accept/Reject: O(N).
Method 1: develop an approximate accept/reject test that uses only a fraction of the data.
Method 2: develop a proposal with acceptance probability ≈ 1 and avoid the expensive accept/reject test.
59. Stochastic Gradient Langevin Dynamics (W. & Teh, 2011)
Langevin Dynamics:
θt+1 = θt + (ε/2) ∇θ ( log p(θt) + Σ_{i=1..N} log p(xi|θt) ) + ηt,  ηt ~ N(0, ε)
θt+1 is then accepted/rejected using a Metropolis-Hastings test.
Stochastic Gradient Langevin Dynamics (SGLD): replace the full-data gradient with a mini-batch estimate,
θt+1 = θt + (ε/2) ∇θ ( log p(θt) + (N/n) Σ_{i∈St} log p(xi|θt) ) + ηt,  ηt ~ N(0, ε)
Avoid the expensive Metropolis-Hastings test by keeping ε small.
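A sketch of the SGLD update on a toy model; the mini-batch size, step size, and model are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_i ~ N(theta, 1) with a flat prior, so
# grad_theta log p(x_i | theta) = x_i - theta.
x = rng.normal(1.0, 1.0, size=10_000)
N, n, eps = x.size, 100, 1e-6

theta, samples = 0.0, []
for t in range(20_000):
    batch = x[rng.integers(0, N, size=n)]     # mini-batch, n << N
    grad = (N / n) * np.sum(batch - theta)    # unbiased full-data gradient estimate
    theta += 0.5 * eps * grad + np.sqrt(eps) * rng.standard_normal()
    samples.append(theta)

# The posterior is N(x.mean(), 1/N); with small fixed eps the SGLD sample
# mean matches it closely (a small finite-eps bias remains in the variance).
print(np.mean(samples[5000:]), x.mean())
```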
66. The SGLD Knob
Decrease ϵ over time: burn-in → biased → exact.
[Figure: the bias-variance scatter from slide 49; decreasing ϵ moves from low variance / high bias (fast) to high variance / low bias (slow).]
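The decaying step size in W. & Teh (2011) is polynomial, chosen to satisfy the Robbins-Monro conditions Σt εt = ∞ and Σt εt² < ∞:

```latex
\[
\varepsilon_t \;=\; a\,(b+t)^{-\gamma}, \qquad \gamma \in (0.5, 1].
\]
```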