https://telecombcn-dl.github.io/2017-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
1. [course site]
Verónica Vilaplana
veronica.vilaplana@upc.edu
Associate Professor
Universitat Politecnica de Catalunya
Technical University of Catalonia
Optimization for neural network training
Day 4 Lecture 1
#DLUPC
2. Previously in DLAI…
• Multilayer perceptron
• Training: (stochastic / mini-batch) gradient descent
• Backpropagation
but…
What type of optimization problem? Do local minima and saddle points cause problems? Does gradient descent perform well? How to set the learning rate? How to initialize weights? How does batch size affect training?
3. Index
• Optimization for a machine learning task; difference between learning and pure optimization
  • Expected and empirical risk
  • Surrogate loss functions and early stopping
  • Batch and mini-batch algorithms
• Challenges
  • Local minima
  • Saddle points and other flat regions
  • Cliffs and exploding gradients
• Practical algorithms
  • Stochastic Gradient Descent
  • Momentum
  • Nesterov Momentum
  • Learning rate
  • Adaptive learning rates: AdaGrad, RMSProp, Adam
  • Approximate second-order methods
• Parameter initialization
• Batch Normalization
5. Optimization for NN training
• Goal: find the parameters that minimize the expected risk (generalization error)

  J(θ) = E_{(x,y)∼p_data} L(f(x;θ), y)

  • x input, f(x;θ) predicted output, y target output, E expectation
  • p_data true (unknown) data distribution
  • L loss function (how wrong predictions are)
• But we only have a training set of samples: we minimize the empirical risk, the average loss on a finite dataset D

  J(θ) = E_{(x,y)∼p̂_data} L(f(x;θ), y) = (1/|D|) Σ_{(x^(i),y^(i))∈D} L(f(x^(i);θ), y^(i))

  where p̂_data is the empirical distribution and |D| is the number of examples in D
6. Surrogate loss
• Often minimizing the real loss is intractable
  • e.g. the 0-1 loss (0 if correctly classified, 1 if not); intractable even for linear classifiers (Marcotte 1992)
• Minimize a surrogate loss instead
  • e.g. negative log-likelihood as a surrogate for the 0-1 loss
• Sometimes the surrogate loss may learn more
  • the 0-1 loss on the test set keeps decreasing even after the training 0-1 loss reaches zero
  • further pushing the classes apart from each other

Figures: 0-1 loss (blue) and surrogate losses (square, hinge, logistic); 0-1 loss (blue) and negative log-likelihood (red)
7. Surrogate loss functions

Binary classifier
• Probabilistic classifier: outputs the probability of class 1, g(x) ≈ P(y=1|x); the probability for class 0 is 1-g(x)
  • Binary cross-entropy loss: L(g(x),y) = -(y log(g(x)) + (1-y) log(1-g(x)))
  • Decision function: f(x) = I[g(x) > 0.5]
• Non-probabilistic classifier: outputs a «score» g(x) for class 1; the score for the other class is -g(x)
  • Hinge loss: L(g(x),t) = max(0, 1 - t·g(x)) where t = 2y-1
  • Decision function: f(x) = I[g(x) > 0]

Multiclass classifier
• Probabilistic classifier: outputs a vector of probabilities g(x) ≈ (P(y=0|x), ..., P(y=m-1|x))
  • Negative conditional log-likelihood loss: L(g(x),y) = -log g(x)_y
  • Decision function: f(x) = argmax(g(x))
• Non-probabilistic classifier: outputs a vector g(x) of real-valued scores for the m classes
  • Multiclass margin loss: L(g(x),y) = max(0, 1 + max_{k≠y} g(x)_k - g(x)_y)
  • Decision function: f(x) = argmax(g(x))
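A minimal NumPy sketch of three of these surrogate losses; the function names and the single-example interface are illustrative choices, not from the slides:

```python
import numpy as np

def binary_cross_entropy(g_x, y, eps=1e-12):
    """g_x ≈ P(y=1|x) from a probabilistic classifier, y in {0, 1}."""
    g_x = np.clip(g_x, eps, 1.0 - eps)                  # avoid log(0)
    return -(y * np.log(g_x) + (1 - y) * np.log(1 - g_x))

def hinge_loss(g_x, y):
    """g_x is a real-valued score for class 1; t = 2y - 1 maps {0,1} labels to {-1,+1}."""
    t = 2 * y - 1
    return np.maximum(0.0, 1.0 - t * g_x)

def multiclass_margin_loss(g_x, y):
    """g_x is a vector of m real-valued scores, y the integer target class."""
    margins = 1.0 + g_x - g_x[y]
    margins[y] = 0.0                                    # exclude k == y from the max
    return max(0.0, np.max(margins))
```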
8. Early stopping
• Training algorithms usually do not halt at a local minimum
• Early stopping:
  • based on the true underlying loss (e.g. 0-1 loss) measured on a validation set
  • the number of training steps becomes a hyperparameter controlling the effective capacity of the model
  • simple and effective; must keep a copy of the best parameters
  • acts as a regularizer (Bishop 1995, …)

Figure: training error decreases steadily while validation error begins to increase; return the parameters at the point with lowest validation error.
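A minimal early-stopping sketch; `train_one_epoch`, `validation_error`, `get_params`, and `set_params` are hypothetical stand-ins for whatever training interface is in use:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              get_params, set_params, patience=10, max_epochs=200):
    best_err, best_params, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)              # e.g. 0-1 loss on the validation set
        if err < best_err:
            best_err = err
            best_params = copy.deepcopy(get_params(model))  # keep a copy of the best parameters
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:             # validation error stopped improving
                break
    set_params(model, best_params)                 # return parameters at lowest validation error
    return model
```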
9. Batch and mini-batch algorithms
• In most optimization methods used in ML, the objective function decomposes as a sum over the training set
• Gradient descent: examples {x^(i)}_{i=1...m} from the training set with corresponding targets {y^(i)}_{i=1...m}

  ∇_θ J(θ) = E_{(x,y)∼p̂_data} ∇_θ L(f(x;θ), y) = (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))

• Using the complete training set can be very expensive (the gain of using more samples is less than linear, since the standard error of the mean drops proportionally to sqrt(m), and the training set may be redundant): use a subset of the training set
• How many samples in each update step?
  • Deterministic or batch gradient methods: process all training samples in a large batch
  • Stochastic methods: use a single example at a time
    • online methods: samples are drawn from a stream of continually created samples
  • Mini-batch stochastic methods: use several (not all) samples
10. Batch and mini-batch algorithms
Mini-batch size?
• Larger batches: more accurate estimate of the gradient, but with less than linear return
• Very small batches: multicore architectures are under-utilized
• If samples are processed in parallel, memory scales with batch size
• Smaller batches provide noisier gradient estimates
• Small batches may offer a regularizing effect (they add noise)
  • but may require a small learning rate
  • may increase the number of steps needed for convergence
• Mini-batches should be selected randomly (shuffle the samples); see the sketch below
  • unbiased estimate of the gradients
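A minimal sketch of random mini-batch sampling; X and Y are assumed NumPy arrays of inputs and targets:

```python
import numpy as np

def minibatches(X, Y, batch_size=128, seed=0):
    """Shuffle the training set and yield random mini-batches (unbiased gradient estimates)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])              # random shuffle of the sample indices
    for start in range(0, X.shape[0], batch_size):
        idx = perm[start:start + batch_size]
        yield X[idx], Y[idx]
```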
12. Local minima
• Convex optimization
  • any local minimum is a global minimum
  • there are several polynomial-time optimization algorithms
• Non-convex optimization
  • the objective function in deep networks is non-convex
  • deep models may have several local minima
  • but this is not necessarily a major problem!
13. Local minima and saddle points
• Critical points: for f: ℝ^n → ℝ, the points where ∇_x f(x) = 0
• For high-dimensional loss functions, local minima are rare compared to saddle points
• Hessian matrix: H_ij = ∂²f / (∂x_i ∂x_j), real and symmetric, with an eigenvector/eigenvalue decomposition
• Intuition: look at the eigenvalues of the Hessian at a critical point
  • local minimum / maximum: all positive / all negative eigenvalues, which is exponentially unlikely as n grows
  • saddle points: both positive and negative eigenvalues

Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
14. Local minima and saddle points
• It is believed that for many problems, including learning deep nets, almost all local minima have a function value very similar to that of the global optimum
• Finding a local minimum is therefore good enough
• For many random functions, local minima are more likely to have low cost than high cost

Figure: value of local minima found by running SGD for 200 iterations on a simplified version of MNIST from different initial starting points; as the number of parameters increases, local minima tend to cluster more tightly.

Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
15. Saddle points
• How to escape from saddle points?
• First-order methods
  • initially attracted to saddle points, but unless they hit one exactly, they are repelled when close
  • hitting a critical point exactly is unlikely (the estimated gradient is noisy)
  • saddle points are very unstable: the noise in stochastic gradient descent helps convergence, and the trajectory escapes quickly
  • SGD tends to oscillate between slowly approaching a saddle point and quickly escaping from it
• Second-order methods:
  • Newton's method can jump to saddle points (where the gradient is 0)

Slide credit: K. McGuinness
16. Other difficulties
• Cliffs and exploding gradients
  • Nets with many layers / recurrent nets can contain very steep regions (cliffs), resulting from the multiplication of several parameters: gradient descent can move the parameters too far, jumping off of the cliff (solution: gradient clipping)
• Long-term dependencies:
  • the computational graph becomes very deep: vanishing and exploding gradients
18. Stochastic Gradient Descent (SGD)
• The most used algorithm for deep learning
• Do not confuse with deterministic gradient descent: the stochastic version uses mini-batches

Algorithm
• Require: learning rate α, initial parameter θ
• while stopping criterion not met do
  • sample a mini-batch of m examples {x^(i)}_{i=1...m} from the training set with corresponding targets {y^(i)}_{i=1...m}
  • compute gradient estimate: ĝ ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
  • apply update: θ ← θ − α ĝ
• end while
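A minimal sketch of the SGD loop above; `grad_loss(θ, x_batch, y_batch)` and the `batches` iterable (e.g. the mini-batch generator sketched earlier) are assumed helpers, and θ is a flat NumPy parameter array for simplicity:

```python
def sgd(theta, grad_loss, batches, lr=0.1):
    """One pass of mini-batch SGD over an iterable of (x_batch, y_batch) pairs."""
    for xb, yb in batches:
        g_hat = grad_loss(theta, xb, yb)   # gradient estimate on the mini-batch
        theta = theta - lr * g_hat         # θ ← θ − α ĝ
    return theta
```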
19. Momentum
• Designed to accelerate learning, especially for high curvature, small but consistent gradients, or noisy gradients
• Momentum aims to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient

Figure: contour lines of a quadratic loss with a poorly conditioned Hessian; the path (red) followed by SGD (left) and by momentum (right).
20. Momentum
• New variable v (velocity): the direction and speed at which the parameters move, an exponentially decaying average of the negative gradient

Algorithm
• Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
• Update rule:
  • compute gradient estimate: g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
  • compute velocity update: v ← λv − αg
  • apply update: θ ← θ + v
• Typical values λ = 0.5, 0.9, 0.99 (λ ∈ [0,1))
• The size of the step depends on how large and how aligned a sequence of gradients is
• Read the physical analogy in the Deep Learning book (Goodfellow et al.)
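A minimal sketch of the momentum update, with the same assumed `grad_loss` and `batches` helpers as in the SGD sketch:

```python
import numpy as np

def sgd_momentum(theta, grad_loss, batches, lr=0.1, lam=0.9):
    v = np.zeros_like(theta)               # velocity, initialized to zero
    for xb, yb in batches:
        g = grad_loss(theta, xb, yb)
        v = lam * v - lr * g               # v ← λv − αg
        theta = theta + v                  # θ ← θ + v
    return theta
```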
21. Nesterov accelerated gradient (NAG)
• A variant of momentum, where the gradient is evaluated after the current velocity is applied:
  • approximate where the parameters will be on the next time step using the current velocity
  • update the velocity using the gradient at the point where we predict the parameters will be

Algorithm
• Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
• Update:
  • apply interim update: θ̃ ← θ + λv
  • compute gradient (at the interim point): g ← (1/m) Σ_i ∇_θ̃ L(f(x^(i); θ̃), y^(i))
  • compute velocity update: v ← λv − αg
  • apply update: θ ← θ + v
• Interpretation: add a correction factor to momentum
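A minimal Nesterov sketch; the only change from the momentum sketch is that the gradient is evaluated at the interim point θ + λv (same assumed helpers):

```python
import numpy as np

def sgd_nesterov(theta, grad_loss, batches, lr=0.1, lam=0.9):
    v = np.zeros_like(theta)
    for xb, yb in batches:
        interim = theta + lam * v          # θ̃ ← θ + λv, where we predict the parameters will be
        g = grad_loss(interim, xb, yb)     # gradient at the interim point
        v = lam * v - lr * g               # v ← λv − αg
        theta = theta + v                  # θ ← θ + v
    return theta
```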
22. Nesterov accelerated gradient (NAG)

Figure: momentum computes the gradient ∇L(w_t) at the current location w_t and combines it with the velocity v_t to obtain v_{t+1}; Nesterov instead computes the gradient ∇L(w_t + γv_t) at the location predicted by the velocity alone, w_t + γv_t, and uses it to obtain v_{t+1}.

Slide credit: K. McGuinness
23. SGD: learning rate
• The learning rate is a crucial parameter for SGD
  • Too large: overshoots the local minimum, loss increases
  • Too small: makes very slow progress, can get stuck
  • Good learning rate: makes steady progress toward a local minimum
• In practice it is necessary to gradually decrease the learning rate (t = iteration number)
  • step decay (e.g. decay by half every few epochs)
  • exponential decay: α = α₀ e^(−kt)
  • 1/t decay: α = α₀ / (1 + kt)
• Sufficient conditions for convergence: Σ_{t=1}^∞ α_t = ∞ and Σ_{t=1}^∞ α_t² < ∞
• Usually: adapt the learning rate by monitoring learning curves that plot the objective function as a function of time (more of an art than a science!)
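A minimal sketch of the three decay schedules above; α₀ and k are illustrative values, t is the iteration (or epoch) number:

```python
import numpy as np

def step_decay(alpha0, epoch, drop=0.5, every=10):
    return alpha0 * drop ** (epoch // every)   # e.g. decay by half every 10 epochs

def exponential_decay(alpha0, t, k=0.01):
    return alpha0 * np.exp(-k * t)             # α = α0 e^(−kt)

def inverse_time_decay(alpha0, t, k=0.01):
    return alpha0 / (1.0 + k * t)              # α = α0 / (1 + kt)
```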
24. Adaptive learning rates
• The learning rate is one of the hyperparameters that is most difficult to set, and it has a significant impact on model performance
• The cost is often sensitive to some directions in parameter space and insensitive to others
  • Momentum / Nesterov mitigate this issue but introduce another hyperparameter
• Solution: use a separate learning rate for each parameter and automatically adapt it through the course of learning
• Algorithms (mini-batch based)
  • AdaGrad
  • RMSProp
  • Adam
  • RMSProp with Nesterov momentum
25. AdaGrad
• Adapts the learning rate of each parameter based on the sizes of previous updates:
  • scales updates to be larger for parameters that are updated less
  • scales updates to be smaller for parameters that are updated more
• The net effect is greater progress in the more gently sloped directions of parameter space
• Desirable theoretical properties, but empirically (for deep models) it can result in a premature and excessive decrease in the effective learning rate

Algorithm
• Require: learning rate α, initial parameter θ, small constant δ (e.g. 10⁻⁷) for numerical stability
• Update:
  • compute gradient: g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
  • accumulate squared gradient: r ← r + g ⊙ g   (sum of all previous squared gradients)
  • compute update: Δθ ← − (α / (δ + √r)) ⊙ g   (updates inversely proportional to the square root of the sum)
  • apply update: θ ← θ + Δθ

Duchi et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011
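A minimal AdaGrad sketch, with the same assumed `grad_loss` and `batches` helpers as before:

```python
import numpy as np

def adagrad(theta, grad_loss, batches, lr=0.01, delta=1e-7):
    r = np.zeros_like(theta)               # sum of all previous squared gradients
    for xb, yb in batches:
        g = grad_loss(theta, xb, yb)
        r = r + g * g                      # r ← r + g ⊙ g
        theta = theta - lr * g / (delta + np.sqrt(r))
    return theta
```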
26. Root Mean Square Propagation (RMSProp)
• Modifies AdaGrad to perform better on non-convex surfaces, where AdaGrad's aggressively decaying learning rate is a problem
• Changes the gradient accumulation into an exponentially decaying average of the sum of squares of gradients
• It can be combined with Nesterov momentum

Algorithm
• Require: learning rate α, initial parameter θ, decay rate ρ, small constant δ (e.g. 10⁻⁷) for numerical stability
• Update:
  • compute gradient: g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
  • accumulate squared gradient: r ← ρr + (1−ρ) g ⊙ g
  • compute update: Δθ ← − (α / √(δ + r)) ⊙ g
  • apply update: θ ← θ + Δθ

Geoff Hinton, unpublished
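A minimal RMSProp sketch; ρ is the decay rate of the running average (same assumed helpers):

```python
import numpy as np

def rmsprop(theta, grad_loss, batches, lr=0.001, rho=0.9, delta=1e-7):
    r = np.zeros_like(theta)               # exponentially decaying average of squared gradients
    for xb, yb in batches:
        g = grad_loss(theta, xb, yb)
        r = rho * r + (1.0 - rho) * g * g  # r ← ρr + (1−ρ) g ⊙ g
        theta = theta - lr * g / np.sqrt(delta + r)
    return theta
```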
27. ADAptive Moments (Adam)
• Combination of RMSProp and momentum, but:
  • keeps a decaying average of both the first-order moment of the gradient (momentum) and the second-order moment (RMSProp)
  • includes bias corrections (for the first and second moments) to account for their initialization at the origin

Algorithm (typical values δ = 10⁻⁸, ρ₁ = 0.9, ρ₂ = 0.999)
• Update:
  • compute gradient: g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
  • update biased first moment estimate: s ← ρ₁ s + (1−ρ₁) g
  • update biased second moment estimate: r ← ρ₂ r + (1−ρ₂) g ⊙ g
  • correct biases: ŝ ← s / (1−ρ₁ᵗ),  r̂ ← r / (1−ρ₂ᵗ)
  • compute update (operations applied elementwise): Δθ ← −α ŝ / (√r̂ + δ)
  • apply update: θ ← θ + Δθ

Kingma et al. Adam: a Method for Stochastic Optimization. ICLR 2015
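A minimal Adam sketch with the default hyperparameters from the slide (same assumed helpers):

```python
import numpy as np

def adam(theta, grad_loss, batches, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    s = np.zeros_like(theta)               # first moment (momentum-like)
    r = np.zeros_like(theta)               # second moment (RMSProp-like)
    t = 0
    for xb, yb in batches:
        t += 1
        g = grad_loss(theta, xb, yb)
        s = rho1 * s + (1.0 - rho1) * g        # biased first moment estimate
        r = rho2 * r + (1.0 - rho2) * g * g    # biased second moment estimate
        s_hat = s / (1.0 - rho1 ** t)          # bias corrections
        r_hat = r / (1.0 - rho2 ** t)
        theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta
```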
31. Second order optimization
• Second order Taylor expansion:

  J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇_θ J(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀)

• Solving for the critical point, we obtain the Newton parameter update:

  θ* = θ₀ − H⁻¹ ∇_θ J(θ₀)

• Problem: the Hessian has O(N²) elements and inverting H is O(N³) (N ≈ millions of parameters)
• Alternatives:
  • Quasi-Newton methods (BFGS, Broyden-Fletcher-Goldfarb-Shanno): instead of inverting the Hessian, approximate the inverse Hessian with rank-1 updates over time, O(N²) each
  • L-BFGS (Limited-memory BFGS): does not form/store the full inverse Hessian
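A minimal sketch of a single Newton step on a toy quadratic J(θ) = ½θᵀAθ − bᵀθ; A and b are illustrative values, and for a quadratic one step lands exactly on the minimizer:

```python
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])                      # Hessian of J (symmetric positive definite)
b = np.array([1.0, -2.0])

theta0 = np.zeros(2)
grad = A @ theta0 - b                           # ∇J(θ0)
theta_star = theta0 - np.linalg.solve(A, grad)  # θ* = θ0 − H⁻¹ ∇J(θ0), solving instead of inverting

print(np.allclose(A @ theta_star, b))           # the minimizer satisfies Aθ = b → True
```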
32. Parameter initialization
• Weights
  • Can't initialize the weights to 0 (gradients would be 0)
  • Can't initialize all weights to the same value (all hidden units in a layer would always behave the same; we need to break symmetry)
  • Small random numbers, e.g. from a uniform or Gaussian distribution N(0, 10⁻²)
    • if the weights start too small, the signal shrinks as it passes through each layer until it is too tiny to be useful
  • Calibrating variances with 1/sqrt(n) (Xavier initialization)
    • each neuron: w = randn(n) / sqrt(n), with n inputs
  • He initialization (for ReLU activations): sqrt(2/n)
    • each neuron: w = randn(n) * sqrt(2.0/n), with n inputs
• Biases
  • initialize all to 0 (except for the output unit for skewed distributions, or 0.01 to avoid saturating ReLUs)
• Alternative: initialize using machine learning; parameters learned by an unsupervised model trained on the same inputs, or trained on an unrelated task
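A minimal sketch of the weight-initialization schemes above for a fully connected layer with n_in inputs and n_out units; the matrix shapes and layer sizes are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)        # w = randn(n) / sqrt(n)

def he_init(n_in, n_out):
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)  # w = randn(n) * sqrt(2/n), for ReLU

W1 = he_init(784, 256)          # e.g. the first layer of a ReLU network
b1 = np.zeros(256)              # biases initialized to 0
```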
33. Batch normalization
• As learning progresses, the distribution of the layer inputs changes due to parameter updates (internal covariate shift)
• This can result in most inputs being in the non-linear regime of the activation function, slowing down learning
• Batch normalization is a technique to reduce this effect
  • Explicitly force the layer activations to have zero mean and unit variance w.r.t. running batch estimates
  • Adds a learnable scale and bias term to allow the network to still use the nonlinearity

Typical placement: FC / Conv → Batch norm → ReLU → FC / Conv → Batch norm → ReLU

Ioffe and Szegedy, 2015. "Batch normalization: accelerating deep network training by reducing internal covariate shift"
34. Batch normalization
• Can be applied to any input or hidden layer
• For a mini-batch of N activations of the layer (an N × D matrix X):
  1. compute the empirical mean and variance for each of the D dimensions
  2. normalize:

     x̂^(k) = (x^(k) − E[x^(k)]) / sqrt(var(x^(k)))

• Note: normalization can reduce the expressive power of the network (e.g. normalizing the inputs of a sigmoid would constrain them to its linear regime)
• So let the network learn the identity if it needs to: scale and shift

     y^(k) = γ^(k) x̂^(k) + β^(k)

• To recover the identity mapping the network can learn γ^(k) = sqrt(var(x^(k))), β^(k) = E[x^(k)]
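A minimal sketch of the training-time batch-norm forward pass for a mini-batch X of shape (N, D); the small constant eps is a standard numerical-stability addition not shown on the slide:

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    mean = X.mean(axis=0)                      # empirical mean per dimension
    var = X.var(axis=0)                        # empirical variance per dimension
    X_hat = (X - mean) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    return gamma * X_hat + beta                # y = γ x̂ + β (learnable scale and shift)
```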
35. Batch normalization
1. Improves gradient flow through the network
2. Allows higher learning rates
3. Reduces the strong dependency on initialization
4. Reduces the need for regularization
36. Batch normalization
At test time BN layers function differently:
1. The mean and std are not computed on the batch.
2. Instead, a single fixed empirical mean and std of the activations, computed during training, is used (it can be estimated with running averages).
37. Summary
• Optimization for NN is different from pure optimization:
  • GD with mini-batches
  • early stopping
  • non-convex surface, local minima and saddle points
• The learning rate has a significant impact on model performance
• Several extensions to SGD can improve convergence
• Adaptive learning-rate methods are likely to achieve the best results
  • RMSProp, Adam
• Weight initialization: He, w = randn(n) * sqrt(2/n)
• Batch normalization to reduce the internal covariate shift
38. Bibliography
• Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
• Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.
• Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
• Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
• Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.
• Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.
• Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307.
• Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
• Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.