[course site]

Verónica Vilaplana
veronica.vilaplana@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Technical University of Catalonia

Optimization for neural network training
Day 4 Lecture 1
#DLUPC
Previously in DLAI…

•  Multilayer perceptron
•  Training: (stochastic / mini-batch) gradient descent
•  Backpropagation

but…

What type of optimization problem is it?
Do local minima and saddle points cause problems?
Does gradient descent perform well?
How to set the learning rate?
How to initialize weights?
How does batch size affect training?

2
Index

•  Optimization for a machine learning task; differences between learning and pure optimization
   •  Expected and empirical risk
   •  Surrogate loss functions and early stopping
   •  Batch and mini-batch algorithms
•  Challenges
   •  Local minima
   •  Saddle points and other flat regions
   •  Cliffs and exploding gradients
•  Practical algorithms
   •  Stochastic Gradient Descent
   •  Momentum
   •  Nesterov Momentum
   •  Learning rate
   •  Adaptive learning rates: AdaGrad, RMSProp, Adam
   •  Approximate second-order methods
•  Parameter initialization
•  Batch Normalization

3
Differences between learning and pure optimization

Optimization for NN training

•  Goal: find the parameters that minimize the expected risk (generalization error)

   J(θ) = E_(x,y)∼p_data L(f(x;θ), y)

   •  x input, f(x;θ) predicted output, y target output, E expectation
   •  p_data true (unknown) data distribution
   •  L loss function (how wrong predictions are)

•  But we only have a training set of samples: we minimize the empirical risk, the average loss on a finite dataset D

   J(θ) = E_(x,y)∼p̂_data L(f(x;θ), y) = (1/|D|) Σ_(x^(i),y^(i))∈D L(f(x^(i);θ), y^(i))

   where p̂_data is the empirical distribution and |D| is the number of examples in D

5
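As a concrete illustration of the empirical risk formula above, here is a minimal NumPy sketch that averages a per-example loss over a finite dataset; the linear model and squared-error loss are toy choices of mine, not from the slides.

import numpy as np

def empirical_risk(theta, X, Y, loss):
    # J(theta) = (1/|D|) * sum over (x, y) in D of L(f(x; theta), y)
    preds = X @ theta                                  # toy linear model f(x; theta)
    return np.mean([loss(p, y) for p, y in zip(preds, Y)])

squared_error = lambda p, y: (p - y) ** 2

X = np.random.randn(100, 3)                            # |D| = 100 examples, 3 features
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true
print(empirical_risk(theta_true, X, Y, squared_error)) # ≈ 0 at the true parameters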
Surrogate loss

•  Often minimizing the real loss is intractable
   •  e.g. 0-1 loss (0 if correctly classified, 1 if not)
      (intractable even for linear classifiers, Marcotte and Savard 1992)
•  Minimize a surrogate loss instead
   •  e.g. negative log-likelihood as a surrogate for the 0-1 loss
•  Sometimes the surrogate loss may learn more
   •  the test 0-1 loss keeps decreasing even after the training 0-1 loss is zero
   •  by further pushing the classes apart from each other

6

Figures: 0-1 loss (blue) and surrogate losses (square, hinge, logistic); 0-1 loss (blue) and negative log likelihood (red)
Surrogate loss functions

Binary classifier
•  Probabilistic classifier: outputs the probability of class 1, g(x) ≈ P(y=1 | x); the probability for class 0 is 1-g(x).
   Binary cross-entropy loss: L(g(x),y) = -(y log g(x) + (1-y) log(1-g(x)))
   Decision function: f(x) = I[g(x) > 0.5]
•  Non-probabilistic classifier: outputs a «score» g(x) for class 1; the score for the other class is -g(x).
   Hinge loss: L(g(x),t) = max(0, 1 - t·g(x)), where t = 2y-1
   Decision function: f(x) = I[g(x) > 0]

Multiclass classifier
•  Probabilistic classifier: outputs a vector of probabilities g(x) ≈ (P(y=0|x), ..., P(y=m-1|x)).
   Negative conditional log-likelihood loss: L(g(x),y) = -log g(x)_y
   Decision function: f(x) = argmax(g(x))
•  Non-probabilistic classifier: outputs a vector g(x) of real-valued scores for the m classes.
   Multiclass margin loss: L(g(x),y) = max(0, 1 + max_(k≠y) g(x)_k - g(x)_y)
   Decision function: f(x) = argmax(g(x))

7
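To make the losses above concrete, here is a minimal NumPy sketch of two of them (binary cross-entropy on probabilities and hinge loss on scores); the function names and the clipping epsilon are my own additions, not from the slides.

import numpy as np

def binary_cross_entropy(g, y, eps=1e-12):
    # g: predicted P(y=1|x) in (0,1); y: targets in {0,1}
    g = np.clip(g, eps, 1 - eps)                       # avoid log(0)
    return -(y * np.log(g) + (1 - y) * np.log(1 - g)).mean()

def hinge_loss(g, y):
    # g: real-valued score for class 1; y: targets in {0,1}
    t = 2 * y - 1                                      # map {0,1} -> {-1,+1}
    return np.maximum(0.0, 1.0 - t * g).mean()

# example: three samples
y = np.array([1, 0, 0])
print(binary_cross_entropy(np.array([0.9, 0.2, 0.6]), y))   # loss on probabilities
print(hinge_loss(np.array([2.0, -1.5, 0.3]), y))            # loss on raw scores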
Early stopping

•  Training algorithms usually do not halt at a local minimum
•  Early stopping:
   •  based on the true underlying loss (e.g. 0-1 loss) measured on a validation set
   •  # training steps = hyperparameter controlling the effective capacity of the model
   •  simple, effective; must keep a copy of the best parameters
   •  acts as a regularizer (Bishop 1995, …)

Training error decreases steadily; validation error begins to increase.
Return the parameters at the point with lowest validation error.

8
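A minimal sketch of the early-stopping loop described above, assuming hypothetical train_one_epoch and validation_error helpers and a model object with a parameters attribute (none of these names come from the slides):

import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=100, patience=10):
    best_err, best_params, epochs_since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                     # one pass of (mini-batch) SGD
        err = validation_error(model)              # e.g. 0-1 loss on the validation set
        if err < best_err:
            best_err = err
            best_params = copy.deepcopy(model.parameters)   # keep a copy of the best parameters
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:      # stop when validation error keeps increasing
                break
    model.parameters = best_params                 # return parameters with lowest validation error
    return model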
  
Batch and mini-batch algorithms

•  In most optimization methods used in ML the objective function decomposes as a sum over the training set

•  Gradient descent:

   ∇_θ J(θ) = E_(x,y)∼p̂_data ∇_θ L(f(x;θ), y) = (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))

   with examples {x^(i)}_(i=1...m) from the training set and corresponding targets {y^(i)}_(i=1...m)

•  Using the complete training set can be very expensive (the gain of using more samples is less than linear: the standard error of the mean drops only as 1/sqrt(m), and the training set may be redundant): use a subset of the training set

•  How many samples in each update step?
   •  Deterministic or batch gradient methods: process all training samples in a large batch
   •  Stochastic methods: use a single example at a time
      •  online methods: samples are drawn from a stream of continually created samples
   •  Mini-batch stochastic methods: use several (but not all) samples

9
Batch and mini-batch algorithms

Mini-batch size?
•  Larger batches: more accurate estimate of the gradient, but less than linear return
•  Very small batches: multicore architectures are under-utilized
•  If samples are processed in parallel: memory scales with batch size
•  Smaller batches provide noisier gradient estimates
•  Small batches may offer a regularizing effect (they add noise)
   •  but may require a small learning rate
   •  may increase the number of steps needed for convergence

Mini-batches should be selected randomly (shuffle samples), as in the sketch below
•  unbiased estimate of gradients

10
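A minimal sketch of random mini-batch selection with shuffling, assuming the dataset fits in two NumPy arrays X and Y; the generator name and the helpers in the usage comment are my own.

import numpy as np

def minibatches(X, Y, batch_size, rng=np.random.default_rng(0)):
    # shuffle once per epoch so each mini-batch is an unbiased sample of the data
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], Y[batch]

# usage: one epoch of mini-batch updates
# for xb, yb in minibatches(X_train, Y_train, batch_size=64):
#     grad = compute_gradient(model, xb, yb)   # hypothetical helpers
#     apply_update(model, grad)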
  
Challenges in NN optimization

Local minima

•  Convex optimization
   •  any local minimum is a global minimum
   •  there are several (polynomial-time) optimization algorithms

•  Non-convex optimization
   •  the objective function in deep networks is non-convex
   •  deep models may have several local minima
   •  but this is not necessarily a major problem!

12
Local minima and saddle points

•  Critical points: points where ∇_x f(x) = 0, for f: ℝ^n → ℝ
•  For high-dimensional loss functions, local minima are rare compared to saddle points
•  Hessian matrix:  H_ij = ∂²f / (∂x_i ∂x_j)
   •  real and symmetric: it admits an eigenvector/eigenvalue decomposition
•  Intuition: eigenvalues of the Hessian matrix
   •  local minimum/maximum: all positive / all negative eigenvalues; exponentially unlikely as n grows
   •  saddle points: both positive and negative eigenvalues

Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014

13
Local minima and saddle points

•  It is believed that for many problems, including learning deep nets, almost all local minima have a function value very similar to that of the global optimum
   •  Finding a local minimum is good enough
•  For many random functions, local minima are more likely to have low cost than high cost.

Figure: value of local minima found by running SGD for 200 iterations on a simplified version of MNIST from different initial starting points. As the number of parameters increases, local minima tend to cluster more tightly.

Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014

14
Saddle points

•  How to escape from saddle points?
•  First-order methods
   •  initially attracted to saddle points, but unless it hits one exactly, the trajectory is repelled once it gets close
   •  hitting a critical point exactly is unlikely (the estimated gradient is noisy)
   •  saddle points are very unstable: the noise in stochastic gradient descent helps convergence, and the trajectory escapes quickly
•  Second-order methods:
   •  Newton's method can jump to saddle points (where the gradient is 0)

SGD tends to oscillate between slowly approaching a saddle point and quickly escaping from it.

Slide credit: K. McGuinness

15
Other difficulties

•  Cliffs and exploding gradients
   •  Nets with many layers / recurrent nets can contain very steep regions (cliffs), resulting from the multiplication of several parameters: gradient descent can move the parameters too far, jumping off of the cliff (solution: gradient clipping)
•  Long-term dependencies:
   •  the computational graph becomes very deep: vanishing and exploding gradients

16
Algorithms

Stochastic Gradient Descent (SGD)

•  Most used algorithm for deep learning
•  Do not confuse with deterministic gradient descent: stochastic uses mini-batches

Algorithm
•  Require: learning rate α, initial parameter θ
•  while stopping criterion not met do
   •  sample a minibatch of m examples {x^(i)}_(i=1...m) from the training set with corresponding targets {y^(i)}_(i=1...m)
   •  compute gradient estimate   ĝ ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
   •  apply update                θ ← θ − α ĝ
•  end while

18
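A minimal NumPy sketch of the SGD loop above, treating the parameters as a flat vector; grad_fn is a hypothetical function returning the mini-batch gradient estimate, and minibatches is any iterable of (inputs, targets) pairs such as the generator sketched earlier.

import numpy as np

def sgd(theta, grad_fn, minibatches, alpha=0.1):
    # theta: parameter vector; grad_fn(theta, xb, yb) -> gradient estimate g_hat
    for xb, yb in minibatches:
        g_hat = grad_fn(theta, xb, yb)     # (1/m) * sum of per-example gradients
        theta = theta - alpha * g_hat      # theta <- theta - alpha * g_hat
    return theta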
Momentum

•  Designed to accelerate learning, especially for high curvature, small but consistent gradients, or noisy gradients
•  Momentum aims to solve: poor conditioning of the Hessian matrix and variance in the stochastic gradient

Figure: contour lines of a quadratic loss with a poorly conditioned Hessian; path (red) followed by SGD (left) and momentum (right).

19
Momentum

•  New variable v (velocity): the direction and speed at which the parameters move; an exponentially decaying average of the negative gradient

Algorithm
•  Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
•  Update rule:
   •  compute gradient estimate   g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
   •  compute velocity update     v ← λv − αg
   •  apply update                θ ← θ + v

•  Typical values λ = 0.5, 0.9, 0.99 (in [0,1))
•  The size of the step depends on how large and aligned a sequence of gradients is.
•  Read the physical analogy in the Deep Learning book (Goodfellow et al.)

20
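A minimal sketch of the momentum update, with the same hypothetical grad_fn convention as the SGD sketch above:

import numpy as np

def sgd_momentum(theta, grad_fn, minibatches, alpha=0.1, lam=0.9):
    v = np.zeros_like(theta)               # initial velocity
    for xb, yb in minibatches:
        g = grad_fn(theta, xb, yb)         # mini-batch gradient estimate
        v = lam * v - alpha * g            # v <- lambda*v - alpha*g
        theta = theta + v                  # theta <- theta + v
    return theta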
Nesterov accelerated gradient (NAG)

•  A variant of momentum, where the gradient is evaluated after the current velocity is applied:
   •  approximate where the parameters will be on the next time step using the current velocity
   •  update the velocity using the gradient at the point where we predict the parameters will be

Algorithm
•  Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
•  Update:
   •  apply interim update                  θ̃ ← θ + λv
   •  compute gradient (at interim point)   g ← (1/m) Σ_i ∇_θ̃ L(f(x^(i);θ̃), y^(i))
   •  compute velocity update               v ← λv − αg
   •  apply update                          θ ← θ + v

•  Interpretation: adds a correction factor to momentum

21
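A minimal sketch of the Nesterov update, evaluating the gradient at the interim point θ + λv (same hypothetical grad_fn convention):

import numpy as np

def nesterov(theta, grad_fn, minibatches, alpha=0.1, lam=0.9):
    v = np.zeros_like(theta)
    for xb, yb in minibatches:
        theta_interim = theta + lam * v        # interim (look-ahead) point
        g = grad_fn(theta_interim, xb, yb)     # gradient at the interim point
        v = lam * v - alpha * g                # velocity update
        theta = theta + v                      # apply update
    return theta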
Nesterov accelerated gradient (NAG)

Figure: momentum computes the gradient ∇L(w_t) at the current location w_t, while NAG computes ∇L(w_t + γv_t) at the location predicted from the velocity alone; both combine it with v_t to obtain v_(t+1).

Slide credit: K. McGuinness

22
SGD: learning rate

•  The learning rate is a crucial parameter for SGD
   •  Too large: overshoots the local minimum, loss increases
   •  Too small: makes very slow progress, can get stuck
   •  Good learning rate: makes steady progress toward the local minimum

•  In practice it is necessary to gradually decrease the learning rate
   •  step decay (e.g. decay by half every few epochs)
   •  exponential decay:  α = α0 e^(−kt)
   •  1/t decay:  α = α0 / (1 + kt)
   (t = iteration number)

•  Sufficient conditions for convergence:  Σ_(t=1..∞) αt = ∞  and  Σ_(t=1..∞) αt² < ∞

•  Usually: adapt the learning rate by monitoring learning curves that plot the objective function as a function of time (more of an art than a science!)

23
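A minimal sketch of the three decay schedules listed above as plain Python functions of the iteration number t; the parameter names and default constants are my own.

import math

def step_decay(alpha0, t, drop=0.5, every=10):
    # multiply the learning rate by `drop` every `every` epochs/iterations
    return alpha0 * (drop ** (t // every))

def exponential_decay(alpha0, t, k=0.01):
    return alpha0 * math.exp(-k * t)          # alpha = alpha0 * e^(-k t)

def inverse_time_decay(alpha0, t, k=0.01):
    return alpha0 / (1.0 + k * t)             # alpha = alpha0 / (1 + k t)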
  
Adaptive learning rates

•  The learning rate is one of the hyperparameters that is most difficult to set; it has a significant impact on model performance
•  The cost is often sensitive to some directions and insensitive to others
   •  Momentum/Nesterov mitigate this issue but introduce another hyperparameter
•  Solution: use a separate learning rate for each parameter and automatically adapt it through the course of learning
•  Algorithms (mini-batch based)
   •  AdaGrad
   •  RMSProp
   •  Adam
   •  RMSProp with Nesterov momentum

24
AdaGrad

•  Adapts the learning rate of each parameter based on the sizes of its previous updates:
   •  scales updates to be larger for parameters that are updated less
   •  scales updates to be smaller for parameters that are updated more
•  The net effect is greater progress in the more gently sloped directions of parameter space
•  Desirable theoretical properties, but empirically (for deep models) it can result in a premature and excessive decrease of the effective learning rate

•  Require: learning rate α, initial parameter θ, small constant δ (e.g. 10^-7) for numerical stability
•  Update:
   •  compute gradient              g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
   •  accumulate squared gradient   r ← r + g ⊙ g   (sum of all previous squared gradients)
   •  compute update                Δθ ← −(α / (δ + √r)) ⊙ g   (updates inversely proportional to the square root of the sum)
   •  apply update                  θ ← θ + Δθ

Duchi et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011

25
Root Mean Square Propagation (RMSProp)

•  Modifies AdaGrad to perform better on non-convex surfaces, where an aggressively decaying learning rate is harmful
•  Changes the gradient accumulation into an exponentially decaying average of the sum of squares of gradients

•  Require: learning rate α, initial parameter θ, decay rate ρ, small constant δ (e.g. 10^-7) for numerical stability
•  Update:
   •  compute gradient              g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
   •  accumulate squared gradient   r ← ρr + (1−ρ) g ⊙ g
   •  compute update                Δθ ← −(α / √(δ + r)) ⊙ g
   •  apply update                  θ ← θ + Δθ

It can be combined with Nesterov momentum.

Geoff Hinton, unpublished

26
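A minimal sketch of the RMSProp update with the exponentially decaying accumulator (hypothetical grad_fn):

import numpy as np

def rmsprop(theta, grad_fn, minibatches, alpha=0.001, rho=0.9, delta=1e-7):
    r = np.zeros_like(theta)                           # decaying average of squared gradients
    for xb, yb in minibatches:
        g = grad_fn(theta, xb, yb)
        r = rho * r + (1 - rho) * g * g                # r <- rho*r + (1-rho) g ⊙ g
        theta = theta - alpha / np.sqrt(delta + r) * g # per-parameter scaled step
    return theta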
  
ADAptive Moments (Adam)

•  Combination of RMSProp and momentum, but:
   •  keeps a decaying average of both the first-order moment of the gradient (momentum) and the second-order moment (RMSProp)
   •  includes bias corrections (first and second moments) to account for their initialization at the origin

Update:
•  compute gradient                       g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
•  update biased first moment estimate    s ← ρ1 s + (1−ρ1) g
•  update biased second moment estimate   r ← ρ2 r + (1−ρ2) g ⊙ g
•  correct biases                         ŝ ← s / (1−ρ1^t),   r̂ ← r / (1−ρ2^t)   (t = time step)
•  compute update                         Δθ ← −α ŝ / (δ + √r̂)   (operations applied elementwise)
•  apply update                           θ ← θ + Δθ

Typical values: δ = 10^-8, ρ1 = 0.9, ρ2 = 0.999

Kingma et al. Adam: a Method for Stochastic Optimization. ICLR 2015

27
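A minimal sketch of Adam with bias correction, using the hyperparameter values from the slide (hypothetical grad_fn):

import numpy as np

def adam(theta, grad_fn, minibatches, alpha=0.001,
         rho1=0.9, rho2=0.999, delta=1e-8):
    s = np.zeros_like(theta)                 # first-moment estimate
    r = np.zeros_like(theta)                 # second-moment estimate
    t = 0
    for xb, yb in minibatches:
        t += 1
        g = grad_fn(theta, xb, yb)
        s = rho1 * s + (1 - rho1) * g        # biased first moment
        r = rho2 * r + (1 - rho2) * g * g    # biased second moment
        s_hat = s / (1 - rho1 ** t)          # bias-corrected first moment
        r_hat = r / (1 - rho2 ** t)          # bias-corrected second moment
        theta = theta - alpha * s_hat / (delta + np.sqrt(r_hat))
    return theta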
  
Example: test function

Image credit: Alec Radford. Beale's function.

28

Example: saddle point

Image credit: Alec Radford.

29
Second order optimization

Figure: comparison of a first order method and a second order method.

30
Second order optimization

•  Second order Taylor expansion:

   J(θ) ≈ J(θ0) + (θ − θ0)ᵀ ∇_θ J(θ0) + ½ (θ − θ0)ᵀ H (θ − θ0)

•  Solving for the critical point we obtain the Newton parameter update:

   θ* = θ0 − H^-1 ∇_θ J(θ0)

•  Problem: the Hessian has O(N²) elements and inverting H is O(N³) (N parameters ≈ millions)
•  Alternatives:
   •  Quasi-Newton methods (BFGS, Broyden-Fletcher-Goldfarb-Shanno): instead of inverting the Hessian, approximate the inverse Hessian with rank-1 updates over time, O(N²) each
   •  L-BFGS (limited-memory BFGS): does not form/store the full inverse Hessian

31
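To make the Newton update concrete, a minimal NumPy sketch of a single Newton step on a toy quadratic; the quadratic itself is my own example, not from the slides.

import numpy as np

# toy quadratic J(theta) = 1/2 theta^T A theta - b^T theta, so H = A and grad = A theta - b
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

theta0 = np.zeros(2)
grad = A @ theta0 - b                              # ∇J(θ0)
H = A                                              # Hessian
theta_star = theta0 - np.linalg.solve(H, grad)     # θ* = θ0 − H^-1 ∇J(θ0), without an explicit inverse
print(theta_star)                                  # minimizer of the toy quadratic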
Parameter initialization

•  Weights
   •  Can't initialize weights to 0 (gradients would be 0)
   •  Can't initialize all weights to the same value (all hidden units in a layer would always behave the same; need to break symmetry)
   •  Small random numbers, e.g. from a uniform or Gaussian distribution N(0, 10^-2)
      •  if weights start too small, the signal shrinks as it passes through each layer until it is too tiny to be useful
   •  Calibrating variances with 1/sqrt(n) (Xavier initialization)
      •  each neuron: w = randn(n) / sqrt(n), n inputs
   •  He initialization (for ReLU activations): sqrt(2/n)
      •  each neuron: w = randn(n) * sqrt(2.0/n), n inputs
•  Biases
   •  initialize all to 0 (except for the output unit of skewed distributions, or 0.01 to avoid saturating ReLUs)
•  Alternative: initialize using machine learning; parameters learned by an unsupervised model trained on the same inputs / trained on an unrelated task

32
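A minimal NumPy sketch of the Xavier and He initializations for one fully connected layer; the function names and layer sizes are my own.

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # calibrate variance with 1/sqrt(n_in): w = randn / sqrt(n)
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def he_init(n_in, n_out):
    # for ReLU activations: w = randn * sqrt(2/n)
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

W1 = he_init(784, 256)          # e.g. first layer of an MNIST-sized MLP
b1 = np.zeros(256)              # biases initialized to 0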
Batch normalization

•  As learning progresses, the distribution of the layer inputs changes due to the parameter updates (internal covariate shift)
•  This can result in most inputs being in the non-linear regime of the activation function, slowing down learning
•  Batch normalization is a technique to reduce this effect
•  Explicitly force the layer activations to have zero mean and unit variance w.r.t. running batch estimates
•  Adds a learnable scale and bias term to allow the network to still use the nonlinearity

Typical placement: FC / Conv → Batch norm → ReLU → FC / Conv → Batch norm → ReLU

Ioffe and Szegedy, 2015. "Batch normalization: accelerating deep network training by reducing internal covariate shift"

33
Batch normalization

•  Can be applied to any input or hidden layer
•  For a mini-batch of N activations of the layer (an N×D matrix X):
   1.  compute the empirical mean and variance for each dimension D
   2.  normalize:  x̂^(k) = (x^(k) − E[x^(k)]) / √(var(x^(k)))

•  Note: normalization can reduce the expressive power of the network (e.g. normalizing the inputs of a sigmoid would constrain them to its linear regime)
•  So let the network learn the identity if needed: scale and shift
   y^(k) = γ^(k) x̂^(k) + β^(k)
•  To recover the identity mapping the network can learn γ^(k) = √(var(x^(k))), β^(k) = E[x^(k)]

34
Batch normalization

1.  Improves gradient flow through the network
2.  Allows higher learning rates
3.  Reduces the strong dependency on initialization
4.  Reduces the need for regularization

35
Batch normalization

At test time BN layers function differently:

1.  The mean and std are not computed on the batch.
2.  Instead, a single fixed empirical mean and std of the activations computed during training is used
    (can be estimated with running averages).

36
Summary

•  Optimization for NN is different from pure optimization:
   •  GD with mini-batches
   •  early stopping
   •  non-convex surface, local minima and saddle points
•  The learning rate has a significant impact on model performance
•  Several extensions to SGD can improve convergence
•  Adaptive learning-rate methods are likely to achieve the best results
   •  RMSProp, Adam
•  Weight initialization: He, w = randn(n) * sqrt(2/n)
•  Batch normalization to reduce the internal covariate shift

37
Bibliography

•  Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
•  Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.
•  Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
•  Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
•  Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.
•  Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.
•  Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307.
•  Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
•  Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.

38
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 

Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)

• 7. Surrogate loss functions
  Binary classifier
  • Probabilistic classifier: outputs the probability of class 1, g(x) ≈ P(y=1 | x); the probability of class 0 is 1 − g(x)
    • Binary cross-entropy loss: L(g(x), y) = −( y log g(x) + (1 − y) log(1 − g(x)) )
    • Decision function: f(x) = I[g(x) > 0.5]
  • Non-probabilistic classifier: outputs a «score» g(x) for class 1; the score for the other class is −g(x)
    • Hinge loss: L(g(x), t) = max(0, 1 − t·g(x)), where t = 2y − 1
    • Decision function: f(x) = I[g(x) > 0]
  Multiclass classifier
  • Probabilistic classifier: outputs a vector of probabilities g(x) ≈ ( P(y=0|x), ..., P(y=m−1|x) )
    • Negative conditional log-likelihood loss: L(g(x), y) = −log g(x)_y
    • Decision function: f(x) = argmax(g(x))
  • Non-probabilistic classifier: outputs a vector g(x) of real-valued scores for the m classes
    • Multiclass margin loss: L(g(x), y) = max(0, 1 + max_{k≠y} g(x)_k − g(x)_y)
    • Decision function: f(x) = argmax(g(x))
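A minimal NumPy sketch of two of these surrogate losses (binary cross-entropy for a probabilistic output and the hinge loss for a score); the function names and the small clipping constant are my own choices, not from the slides:

    import numpy as np

    def binary_cross_entropy(g, y, eps=1e-12):
        """Binary cross-entropy for probabilistic outputs g(x) ≈ P(y=1|x)."""
        g = np.clip(g, eps, 1.0 - eps)          # avoid log(0)
        return -(y * np.log(g) + (1.0 - y) * np.log(1.0 - g))

    def hinge_loss(g, y):
        """Hinge loss for a real-valued score g(x); labels y in {0,1} are mapped to t in {-1,+1}."""
        t = 2.0 * y - 1.0
        return np.maximum(0.0, 1.0 - t * g)

    # toy check: a confident correct prediction has low loss under both surrogates
    print(binary_cross_entropy(np.array([0.9]), np.array([1.0])))   # ~0.105
    print(hinge_loss(np.array([2.0]), np.array([1.0])))             # 0.0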
• 8. Early stopping
  • Training algorithms usually do not halt at a local minimum
  • Early stopping:
    • based on the true underlying loss (e.g. 0-1 loss) measured on a validation set
    • # training steps = hyperparameter controlling the effective capacity of the model
    • simple and effective; must keep a copy of the best parameters
    • acts as a regularizer (Bishop 1995, ...)
  • Training error decreases steadily while validation error begins to increase: return the parameters at the point with lowest validation error (see the sketch below)
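A minimal Python sketch of such an early-stopping loop with a patience counter; model, train_step and validation_error are placeholder names for caller-supplied objects, not anything defined in the slides:

    import copy

    def train_with_early_stopping(model, train_step, validation_error,
                                  patience=10, max_steps=10000):
        """Keep a copy of the best parameters and stop once validation error
        has not improved for `patience` evaluations."""
        best_err = float("inf")
        best_model = copy.deepcopy(model)
        steps_without_improvement = 0
        for step in range(max_steps):
            train_step(model)                      # one (mini-batch) update
            err = validation_error(model)          # e.g. 0-1 loss on a validation set
            if err < best_err:
                best_err, best_model = err, copy.deepcopy(model)
                steps_without_improvement = 0
            else:
                steps_without_improvement += 1
                if steps_without_improvement >= patience:
                    break                          # stop: validation error no longer improves
        return best_model, best_err                # parameters at lowest validation error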
• 9. Batch and mini-batch algorithms
  • In most optimization methods used in ML the objective function decomposes as a sum over a training set
  • Gradient descent, with examples {x^(i)}_{i=1...m} from the training set and corresponding targets {y^(i)}_{i=1...m}:
    ∇_θ J(θ) = E_{(x,y)∼p̂_data} ∇_θ L(f(x;θ), y) = (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
  • Using the complete training set can be very expensive: the gain from using more samples is less than linear (the standard error of the mean drops only as 1/sqrt(m)), and the training set may be redundant. Use a subset of the training set instead.
  • How many samples in each update step?
    • Deterministic or batch gradient methods: process all training samples in one large batch
    • Stochastic methods: use a single example at a time
      • online methods: samples are drawn from a stream of continually created samples
    • Mini-batch stochastic methods: use several (but not all) samples
• 10. Batch and mini-batch algorithms
  Mini-batch size?
  • Larger batches: more accurate estimate of the gradient, but less than linear return
  • Very small batches: multicore architectures are under-utilized
  • If samples are processed in parallel, memory scales with batch size
  • Smaller batches provide noisier gradient estimates
  • Small batches may offer a regularizing effect (they add noise)
    • but may require a small learning rate
    • and may increase the number of steps needed for convergence
  Mini-batches should be selected randomly (shuffle the samples) to obtain unbiased estimates of the gradients
• 11. Challenges in NN optimization
• 12. Local minima
  • Convex optimization
    • any local minimum is a global minimum
    • there are several (polynomial-time) optimization algorithms
  • Non-convex optimization
    • the objective function in deep networks is non-convex
    • deep models may have several local minima
    • but this is not necessarily a major problem!
• 13. Local minima and saddle points
  • Critical points: ∇_x f(x) = 0, for f: R^n → R
  • For high-dimensional loss functions, local minima are rare compared to saddle points
  • Hessian matrix: H_ij = ∂²f / (∂x_i ∂x_j); real and symmetric, so it has an eigenvector/eigenvalue decomposition
  • Intuition: eigenvalues of the Hessian matrix
    • local minimum/maximum: all positive / all negative eigenvalues, which becomes exponentially unlikely as n grows
    • saddle points: both positive and negative eigenvalues
  Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
• 14. Local minima and saddle points
  • It is believed that for many problems, including learning deep nets, almost all local minima have a function value very similar to the global optimum
  • Finding a local minimum is good enough
  • For many random functions, local minima are more likely to have low cost than high cost
  Figure: value of local minima found by running SGD for 200 iterations on a simplified version of MNIST from different initial starting points; as the number of parameters increases, local minima tend to cluster more tightly.
  Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
• 15. Saddle points
  • How to escape from saddle points?
  • First order methods
    • initially attracted to saddle points, but unless they hit one exactly they are repelled when close
    • hitting a critical point exactly is unlikely (the estimated gradient is noisy)
    • saddle points are very unstable: noise (stochastic gradient descent) helps convergence, and the trajectory escapes quickly
    • SGD tends to oscillate between slowly approaching a saddle point and quickly escaping from it
  • Second order methods:
    • Newton's method can jump to saddle points (where the gradient is 0)
  Slide credit: K. McGuinness
• 16. Other difficulties
  • Cliffs and exploding gradients
    • Nets with many layers / recurrent nets can contain very steep regions (cliffs), arising from the multiplication of several parameters: gradient descent can move the parameters too far, jumping off the cliff (solution: gradient clipping)
  • Long term dependencies:
    • the computational graph becomes very deep: vanishing and exploding gradients
• 18. Stochastic Gradient Descent (SGD)
  • Most used algorithm for deep learning
  • Do not confuse with deterministic gradient descent: the stochastic version uses mini-batches
  Algorithm
  • Require: learning rate α, initial parameter θ
  • while stopping criterion not met do
    • sample a minibatch of m examples {x^(i)}_{i=1...m} from the training set, with corresponding targets {y^(i)}_{i=1...m}
    • compute the gradient estimate: ĝ ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
    • apply the update: θ ← θ − α ĝ
  • end while
  A minimal sketch of this loop follows below.
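A minimal Python/NumPy sketch of mini-batch SGD, assuming a caller-supplied grad_fn(theta, X_batch, y_batch) that returns the averaged mini-batch gradient (a placeholder, not from the slides):

    import numpy as np

    def sgd(grad_fn, theta, data, lr=0.01, batch_size=32, n_epochs=10):
        """Plain mini-batch SGD: sample a mini-batch, estimate the gradient, step downhill."""
        X, y = data
        n = X.shape[0]
        for _ in range(n_epochs):
            idx = np.random.permutation(n)               # shuffle so mini-batches are unbiased
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]
                g = grad_fn(theta, X[batch], y[batch])   # (1/m) sum of per-example gradients
                theta = theta - lr * g                   # θ ← θ − α ĝ
        return theta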
• 19. Momentum
  • Designed to accelerate learning, especially for high curvature, small but consistent gradients, or noisy gradients
  • Momentum aims to address two issues: poor conditioning of the Hessian matrix and variance in the stochastic gradient
  Figure: contour lines of a quadratic loss with a poorly conditioned Hessian; the path (red) followed by SGD (left) and by momentum (right).
• 20. Momentum
  • New variable v (velocity): the direction and speed at which the parameters move, an exponentially decaying average of the negative gradient
  Algorithm
  • Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
  • Update rule:
    • compute the gradient estimate: g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
    • compute the velocity update: v ← λv − αg
    • apply the update: θ ← θ + v
  • Typical values λ = 0.5, 0.9, 0.99 (λ ∈ [0, 1))
  • The size of the step depends on how large and how aligned a sequence of gradients is
  • Read the physical analogy in the Deep Learning book (Goodfellow et al.); a one-step sketch follows below
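A minimal sketch of a single momentum step in Python, written as a pure function over NumPy arrays (the function name is mine, the update is the one on the slide):

    def momentum_update(theta, v, g, lr=0.01, lam=0.9):
        """One momentum step: v accumulates an exponentially decaying
        average of past negative gradients, then moves the parameters."""
        v = lam * v - lr * g       # velocity update: v ← λv − αg
        theta = theta + v          # parameter update: θ ← θ + v
        return theta, v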
• 21. Nesterov accelerated gradient (NAG)
  • A variant of momentum, where the gradient is evaluated after the current velocity is applied:
    • approximate where the parameters will be on the next time step using the current velocity
    • update the velocity using the gradient at the point where we predict the parameters will be
  Algorithm
  • Require: learning rate α, initial parameter θ, momentum parameter λ, initial velocity v
  • Update:
    • apply the interim update: θ̃ ← θ + λv
    • compute the gradient at the interim point: g ← (1/m) Σ_i ∇_θ̃ L(f(x^(i); θ̃), y^(i))
    • compute the velocity update: v ← λv − αg
    • apply the update: θ ← θ + v
  • Interpretation: add a correction factor to momentum (see the sketch below)
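A minimal sketch of one NAG step, again assuming a caller-supplied grad_fn(theta, X_batch, y_batch) placeholder for the mini-batch gradient:

    def nesterov_update(theta, v, grad_fn, X_batch, y_batch, lr=0.01, lam=0.9):
        """One Nesterov step: evaluate the gradient at the interim
        ('look-ahead') point theta + lam*v rather than at theta."""
        theta_ahead = theta + lam * v                 # interim update: θ̃ ← θ + λv
        g = grad_fn(theta_ahead, X_batch, y_batch)    # gradient at the predicted position
        v = lam * v - lr * g                          # velocity update
        theta = theta + v                             # parameter update
        return theta, v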
• 22. Nesterov accelerated gradient (NAG)
  Figure: at the current location w_t, momentum uses the gradient ∇L(w_t) to form v_{t+1}, while NAG evaluates the gradient ∇L(w_t + γv_t) at the location predicted from the velocity alone before computing the new velocity v_{t+1}.
  Slide credit: K. McGuinness
• 23. SGD: learning rate
  • The learning rate is a crucial parameter for SGD
    • Too large: overshoots the local minimum, the loss increases
    • Too small: makes very slow progress, can get stuck
    • Good learning rate: makes steady progress toward the local minimum
  • In practice it is necessary to gradually decrease the learning rate (t = iteration number):
    • step decay (e.g. decay by half every few epochs)
    • exponential decay: α = α₀ e^(−kt)
    • 1/t decay: α = α₀ / (1 + kt)
  • Sufficient conditions for convergence: Σ_{t=1}^∞ α_t = ∞ and Σ_{t=1}^∞ α_t² < ∞
  • Usually: adapt the learning rate by monitoring learning curves that plot the objective function as a function of time (more of an art than a science!)
  The decay schedules above can be written directly, as in the sketch below.
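A minimal sketch of the three decay schedules from the slide; the default constants (drop, every, k) are illustrative choices, not values from the slides:

    import numpy as np

    def step_decay(alpha0, epoch, drop=0.5, every=10):
        """Step decay: multiply the learning rate by `drop` every `every` epochs."""
        return alpha0 * (drop ** (epoch // every))

    def exp_decay(alpha0, t, k=0.01):
        """Exponential decay: α = α₀ exp(−kt)."""
        return alpha0 * np.exp(-k * t)

    def inv_decay(alpha0, t, k=0.01):
        """1/t decay: α = α₀ / (1 + kt)."""
        return alpha0 / (1.0 + k * t)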
• 24. Adaptive learning rates
  • The learning rate is one of the hyperparameters that is most difficult to set, and it has a significant impact on model performance
  • The cost is often sensitive to some directions in parameter space and insensitive to others
    • momentum/Nesterov mitigate this issue, but introduce another hyperparameter
  • Solution: use a separate learning rate for each parameter and automatically adapt it throughout the course of learning
  • Algorithms (mini-batch based):
    • AdaGrad
    • RMSProp
    • Adam
    • RMSProp with Nesterov momentum
• 25. AdaGrad
  • Adapts the learning rate of each parameter based on the sizes of previous updates:
    • scales updates to be larger for parameters that are updated less
    • scales updates to be smaller for parameters that are updated more
  • The net effect is greater progress in the more gently sloped directions of parameter space
  • Desirable theoretical properties, but empirically (for deep models) it can result in a premature and excessive decrease of the effective learning rate
  • Require: learning rate α, initial parameter θ, small constant δ (e.g. 10⁻⁷) for numerical stability
  • Update:
    • compute the gradient: g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
    • accumulate the squared gradient (sum of all previous squared gradients): r ← r + g ⊙ g
    • compute the update (inversely proportional to the square root of the sum): Δθ ← −(α / (δ + √r)) ⊙ g
    • apply the update: θ ← θ + Δθ
  Duchi et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011
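A minimal sketch of one AdaGrad step over NumPy parameter arrays (function name mine, update as on the slide):

    import numpy as np

    def adagrad_update(theta, r, g, lr=0.01, delta=1e-7):
        """One AdaGrad step: accumulate squared gradients in r and scale each
        coordinate's step inversely to the square root of its accumulated sum."""
        r = r + g * g                                   # r ← r + g ⊙ g
        theta = theta - lr * g / (delta + np.sqrt(r))   # Δθ ← −(α / (δ + √r)) ⊙ g
        return theta, r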
• 26. Root Mean Square Propagation (RMSProp)
  • Modifies AdaGrad to perform better on non-convex surfaces, where its aggressively decaying learning rate is a problem
  • Changes the gradient accumulation to an exponentially decaying average of the squared gradients
  • Requires: learning rate α, initial parameter θ, decay rate ρ, small constant δ (e.g. 10⁻⁷) for numerical stability
  • Update:
    • compute the gradient: g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
    • accumulate the squared gradient: r ← ρr + (1 − ρ) g ⊙ g
    • compute the update: Δθ ← −(α / √(δ + r)) ⊙ g
    • apply the update: θ ← θ + Δθ
  • It can be combined with Nesterov momentum
  Geoff Hinton, unpublished
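A minimal sketch of one RMSProp step, mirroring the AdaGrad sketch above but with the decaying average:

    import numpy as np

    def rmsprop_update(theta, r, g, lr=0.001, rho=0.9, delta=1e-7):
        """One RMSProp step: like AdaGrad, but with an exponentially decaying
        average of squared gradients instead of a full running sum."""
        r = rho * r + (1.0 - rho) * g * g               # r ← ρr + (1 − ρ) g ⊙ g
        theta = theta - lr * g / np.sqrt(delta + r)     # Δθ ← −(α / √(δ + r)) ⊙ g
        return theta, r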
• 27. ADAptive Moments (Adam)
  • A combination of RMSProp and momentum:
    • keeps a decaying average of both the first-order moment of the gradient (momentum) and the second-order moment (RMSProp)
    • includes bias corrections (first and second moments) to account for their initialization at the origin
  • Update (at iteration t):
    • compute the gradient: g ← (1/m) Σ_i ∇_θ L(f(x^(i);θ), y^(i))
    • update the biased first moment estimate: s ← ρ₁ s + (1 − ρ₁) g
    • update the biased second moment estimate: r ← ρ₂ r + (1 − ρ₂) g ⊙ g
    • correct the biases: ŝ ← s / (1 − ρ₁ᵗ),  r̂ ← r / (1 − ρ₂ᵗ)
    • compute the update (operations applied elementwise): Δθ ← −α ŝ / (δ + √r̂)
    • apply the update: θ ← θ + Δθ
  • Typical values: δ = 10⁻⁸, ρ₁ = 0.9, ρ₂ = 0.999
  Kingma et al. Adam: a Method for Stochastic Optimization. ICLR 2015
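A minimal sketch of one Adam step; t is the 1-based iteration counter used for the bias correction, and the parameter names follow the slide:

    import numpy as np

    def adam_update(theta, s, r, g, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
        """One Adam step: decaying first and second moment estimates with bias correction."""
        s = rho1 * s + (1.0 - rho1) * g          # first-moment (momentum-like) estimate
        r = rho2 * r + (1.0 - rho2) * g * g      # second-moment (RMSProp-like) estimate
        s_hat = s / (1.0 - rho1 ** t)            # bias correction of the first moment
        r_hat = r / (1.0 - rho2 ** t)            # bias correction of the second moment
        theta = theta - lr * s_hat / (delta + np.sqrt(r_hat))
        return theta, s, r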
• 28. Example: test function (Beale's function)
  Image credit: Alec Radford.
• 29. Example: saddle point
  Image credit: Alec Radford.
• 30. Second order optimization
  Figure: first order vs. second order update steps.
• 31. Second order optimization
  • Second order Taylor expansion:
    J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇_θ J(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀)
  • Solving for the critical point we obtain the Newton parameter update:
    θ* = θ₀ − H⁻¹ ∇_θ J(θ₀)
  • Problem: the Hessian has O(N²) elements and inverting H costs O(N³), with N (the number of parameters) in the millions
  • Alternatives:
    • Quasi-Newton methods (BFGS, Broyden-Fletcher-Goldfarb-Shanno): instead of inverting the Hessian, approximate the inverse Hessian with rank-1 updates over time, O(N²) each
    • L-BFGS (limited-memory BFGS): does not form/store the full inverse Hessian
  A toy sketch of the Newton step is given below.
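As a toy illustration of the Newton update above (only feasible for small N; the gradient and Hessian are assumed given here):

    import numpy as np

    def newton_step(theta, grad, hessian):
        """One Newton update θ* = θ − H⁻¹∇J(θ), solving the linear system
        instead of forming the inverse explicitly; the O(N³) solve is what
        makes this impractical for networks with millions of parameters."""
        return theta - np.linalg.solve(hessian, grad)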
• 32. Parameter initialization
  • Weights
    • Can't initialize the weights to 0 (the gradients would be 0)
    • Can't initialize all weights to the same value (all hidden units in a layer would always behave the same; we need to break symmetry)
    • Small random numbers, e.g. from a uniform or Gaussian distribution N(0, 10⁻²)
      • if the weights start too small, the signal shrinks as it passes through each layer until it is too tiny to be useful
    • Calibrating variances with 1/sqrt(n) (Xavier initialization)
      • each neuron: w = randn(n) / sqrt(n), with n inputs
    • He initialization (for ReLU activations): sqrt(2/n)
      • each neuron: w = randn(n) * sqrt(2.0/n), with n inputs
  • Biases
    • initialize all to 0 (except the output unit for skewed distributions, or 0.01 to avoid saturating ReLUs)
  • Alternative: initialize using machine learning; parameters learned by an unsupervised model trained on the same inputs, or trained on an unrelated task
  A short sketch of the two weight initializations follows below.
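A minimal NumPy sketch of the Xavier and He initializations from the slide; the layer shapes in the usage lines are illustrative only:

    import numpy as np

    def xavier_init(n_in, n_out):
        """Xavier initialization: Gaussian weights scaled by 1/sqrt(n_in)."""
        return np.random.randn(n_in, n_out) / np.sqrt(n_in)

    def he_init(n_in, n_out):
        """He initialization (for ReLU activations): scaled by sqrt(2/n_in)."""
        return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

    W1 = he_init(784, 256)    # example layer shapes
    b1 = np.zeros(256)        # biases initialized to 0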
• 33. Batch normalization
  • As learning progresses, the distribution of the layer inputs changes due to the parameter updates (internal covariate shift)
  • This can result in most inputs being in the non-linear regime of the activation function, slowing down learning
  • Batch normalization is a technique to reduce this effect
    • explicitly force the layer activations to have zero mean and unit variance w.r.t. running batch estimates
    • add a learnable scale and bias term to allow the network to still use the nonlinearity
  • Typical placement: FC/Conv, Batch norm, ReLU, repeated layer by layer
  Ioffe and Szegedy, 2015. "Batch normalization: accelerating deep network training by reducing internal covariate shift"
• 34. Batch normalization
  • Can be applied to any input or hidden layer
  • For a mini-batch of N activations of the layer (an N × D matrix X):
    1. compute the empirical mean and variance for each of the D dimensions
    2. normalize: x̂^(k) = (x^(k) − E[x^(k)]) / sqrt(Var[x^(k)])
  • Note: normalization can reduce the expressive power of the network (e.g. normalizing the inputs of a sigmoid would constrain them to its linear regime)
  • So let the network learn the identity if needed, via a scale and shift: y^(k) = γ^(k) x̂^(k) + β^(k)
  • To recover the identity mapping the network can learn γ^(k) = sqrt(Var[x^(k)]) and β^(k) = E[x^(k)]
  A minimal forward-pass sketch follows below.
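A minimal sketch of the batch-norm forward pass at training time for an (N, D) activation matrix; gamma and beta are the learnable scale and shift, and eps is my own choice of numerical-stability constant:

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        """Training-time batch norm: normalize each dimension with the
        mini-batch mean/variance, then apply the learnable scale and shift."""
        mu = x.mean(axis=0)                    # per-dimension empirical mean
        var = x.var(axis=0)                    # per-dimension empirical variance
        x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
        return gamma * x_hat + beta            # y = γ x̂ + β

    # at test time the same formula is applied, but with fixed running
    # estimates of mu and var collected during training (see slide 36)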
• 35. Batch normalization
  1. Improves gradient flow through the network
  2. Allows higher learning rates
  3. Reduces the strong dependency on initialization
  4. Reduces the need for regularization
• 36. Batch normalization
  At test time BN layers function differently:
  1. The mean and std are not computed on the batch.
  2. Instead, a single fixed empirical mean and std of the activations computed during training is used (they can be estimated with running averages).
• 37. Summary
  • Optimization for NNs is different from pure optimization:
    • GD with mini-batches
    • early stopping
    • non-convex surface, local minima and saddle points
  • The learning rate has a significant impact on model performance
  • Several extensions to SGD can improve convergence
  • Adaptive learning-rate methods are likely to achieve the best results
    • RMSProp, Adam
  • Weight initialization: He, w = randn(n) * sqrt(2/n)
  • Batch normalization to reduce the internal covariate shift
• 38. Bibliography
  • Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
  • Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.
  • Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
  • Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
  • Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.
  • Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.
  • Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307.
  • Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.