P06: Motion and Video (CVPR 2012 Tutorial on Deep Learning Methods for Vision)
1. LEARNING REPRESENTATIONS OF SEQUENCES
WITH APPLICATIONS TO MOTION CAPTURE AND VIDEO ANALYSIS
GRAHAM TAYLOR
SCHOOL OF ENGINEERING
UNIVERSITY OF GUELPH
Papers and software available at: http://www.uoguelph.ca/~gwtaylor
4. OVERVIEW: THIS TALK
• Learning representations of temporal data:
  - existing methods and challenges faced
  - recent methods inspired by "deep learning"
• Applications: in particular, modeling human pose and activity
  - highly structured data: e.g. motion capture
  - weakly structured data: e.g. video
[Figures: convolutional gated RBM architecture with input X, output Y, feature layer Z and pooling layer P; plot of component activations over time]
5. OUTLINE
Learning representations from sequences: existing methods, challenges
Composable, distributed-state models for sequences: Conditional Restricted Boltzmann Machines and their variants
Using learned representations to analyze video: a brief (and incomplete) survey of deep learning for activity recognition
[Figure: convolutional gated RBM architecture with input X, output Y, feature layer Z and pooling layer P]
7. TIME SERIES DATA
• Time is an integral part of many human behaviours (motion, reasoning)
• In building statistical models, time is sometimes ignored; this is often problematic
• Models that do incorporate dynamics fail to account for the fact that data is often high-dimensional and nonlinear, and contains long-range dependencies
[Figure: intensity (number of news stories) over 2000-2009; graphic: David McCandless, informationisbeautiful.net]
Today we will discuss a number of models that have been developed to address these challenges
8. VECTOR AUTOREGRESSIVE MODELS
$$v_t = b + \sum_{m=1}^{M} A_m v_{t-m} + e_t$$
• Have dominated statistical time-series analysis for approx. 50 years
• Can be fit easily by least-squares regression (see the sketch after this slide)
• Can fail even for simple nonlinearities present in the system
  - but many data sets can be modeled well by a linear system
• Well understood; many extensions exist
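Since the slide notes that a VAR model can be fit by least squares, here is a minimal NumPy sketch of that fit; the function and variable names are my own for illustration, not code from the talk. Lagged frames are stacked into a design matrix and the bias b and matrices A_m are solved for jointly.

```python
import numpy as np

def fit_var(V, M):
    """Least-squares fit of a VAR(M) model v_t = b + sum_m A_m v_{t-m} + e_t.

    V: (T, D) array of observations. Returns bias b (D,) and A of shape (M, D, D).
    """
    T, D = V.shape
    # Design matrix rows: [1, v_{t-1}, ..., v_{t-M}] for each t >= M
    X = np.hstack([np.ones((T - M, 1))] +
                  [V[M - m:T - m] for m in range(1, M + 1)])
    Y = V[M:]                                      # targets v_t
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)      # (1 + M*D, D)
    b = W[0]
    A = W[1:].reshape(M, D, D).transpose(0, 2, 1)  # A_m maps v_{t-m} to v_t
    return b, A

# Usage (illustrative): one-step prediction with the fitted model
# b, A = fit_var(V, M=3)
# v_pred = b + sum(A[m] @ V[-(m + 1)] for m in range(3))
```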
9. MARKOV ("N-GRAM") MODELS
[Figure: fully observed chain over v_{t-2}, v_{t-1}, v_t]
• Fully observable
• Sequential observations may have nonlinear dependence
• Derived by assuming sequences have the Markov property:
$$p(v_t \mid \{v_1^{t-1}\}) = p(v_t \mid \{v_{t-N}^{t-1}\})$$
• This leads to the joint distribution:
$$p(\{v_1^T\}) = p(\{v_1^N\}) \prod_{t=N+1}^{T} p(v_t \mid \{v_{t-N}^{t-1}\})$$
• Number of parameters exponential in N!
12. HIDDEN MARKOV MODELS (HMM)
[Figure: hidden chain h_{t-2}, h_{t-1}, h_t emitting v_{t-2}, v_{t-1}, v_t]
Introduces a hidden state that controls the dependence of the current observation on the past
• Successful in speech, language modeling, and biology
• Defined by 3 sets of parameters:
  - Initial state parameters, π
  - Transition matrix, A
  - Emission distribution, p(v_t | h_t)
• Factored joint distribution:
$$p(\{h_t\}, \{v_t\}) = p(h_1)\, p(v_1 \mid h_1) \prod_{t=2}^{T} p(h_t \mid h_{t-1})\, p(v_t \mid h_t)$$
13. HMM INFERENCE AND LEARNING
• Typically three tasks we want to perform in an HMM:
- Likelihood estimation
- Inference
- Learning
• All are exact and tractable due to the simple structure of the model
• Forward-backward algorithm for inference (belief propagation)
• Baum-Welch algorithm for learning (EM)
• Viterbi algorithm for state estimation (max-product)
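To make the forward-backward machinery above concrete, here is a minimal NumPy sketch of the forward (alpha) recursion used for likelihood estimation, with rescaling to avoid underflow; the names and discrete-emission setup are illustrative assumptions, not code from the talk.

```python
import numpy as np

def hmm_forward(pi, A, B, obs):
    """Forward (alpha) recursion for an HMM with K discrete states.

    pi:  (K,) initial state distribution
    A:   (K, K) transition matrix, A[i, j] = p(h_t = j | h_{t-1} = i)
    B:   (K, V) emission matrix, B[j, v] = p(v_t = v | h_t = j)
    obs: sequence of observed symbols. Returns log p(observations).
    """
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()                  # rescale to avoid numerical underflow
    log_lik = np.log(c)
    alpha /= c
    for v in obs[1:]:
        alpha = (alpha @ A) * B[:, v]
        c = alpha.sum()
        log_lik += np.log(c)
        alpha /= c
    return log_lik
```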
22. LIMITATIONS OF HMMS
• Many high-dimensional data sets contain rich componential structure
• Hidden Markov Models cannot model such data efficiently: a single, discrete K-state multinomial must represent the history of the time series
• To model K bits of information, they need 2^K hidden states
• We seek models with distributed hidden state:
  - capacity linear in the number of components
25. LINEAR DYNAMICAL SYSTEMS
[Figure: same chain-structured graphical model as the HMM, with hidden states h_{t-2}, h_{t-1}, h_t and observations v_{t-2}, v_{t-1}, v_t]
Graphical model is the same as the HMM but with real-valued state vectors
• Characterized by linear-Gaussian dynamics and observations:
$$p(h_t \mid h_{t-1}) = \mathcal{N}(h_t;\, A h_{t-1},\, Q) \qquad p(v_t \mid h_t) = \mathcal{N}(v_t;\, C h_t,\, R)$$
• Inference is performed using Kalman smoothing (belief propagation); a sketch of the filtering step follows this slide
• Learning can be done by EM
• Dynamics and observations may also depend on an observed input (control)
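As a companion to the linear-Gaussian equations above, here is a minimal NumPy sketch of one Kalman filter predict/update step (the smoother used for offline inference adds a backward pass); this is an illustrative sketch with assumed names, not code from the talk.

```python
import numpy as np

def kalman_step(mu, P, v, A, Q, C, R):
    """One predict/update step of the Kalman filter for the LDS above.

    mu, P: posterior mean/covariance of h_{t-1} given v_{1:t-1}
    v:     new observation v_t
    Returns the posterior mean/covariance of h_t given v_{1:t}.
    """
    # Predict: push the previous posterior through the linear dynamics
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Update: correct with the new observation via the Kalman gain
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (v - C @ mu_pred)
    P_new = P_pred - K @ C @ P_pred
    return mu_new, P_new
```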
26. LATENT REPRESENTATIONS FOR REAL-WORLD DATA
Data for many real-world problems (e.g. motion capture, finance) is high-dimensional, containing complex non-linear relationships between components
Hidden Markov Models
  - Pro: complex, nonlinear emission model
  - Con: a single K-state multinomial represents the entire history
Linear Dynamical Systems
  - Pro: state can convey much more information
  - Con: emission model constrained to be linear
27. LEARNING DISTRIBUTED REPRESENTATIONS
• Simple networks are capable of discovering useful and interesting internal representations of static data
• The parallel nature of computation in connectionist models may be at odds with the serial nature of temporal events
• Simple idea: a spatial representation of time
  - Needs a buffer; not biologically plausible
  - Cannot process inputs of differing length
  - Cannot distinguish between absolute and relative position
• This motivates an implicit representation of time in connectionist models, where time is represented by its effect on processing
30. RECURRENT NEURAL NETWORKS
[Figure: RNN unrolled in time, with inputs v_{t-1}, v_t, v_{t+1}, hidden states h_{t-1}, h_t, h_{t+1}, and predictions ŷ_{t-1}, ŷ_t, ŷ_{t+1}]
(Figure from Martens and Sutskever)
• Neural network replicated in time
• At each step, receives an input vector, updates its internal representation via nonlinear activation functions, and makes a prediction:
$$z_t = W^{hv} v_t + W^{hh} h_{t-1} + b_h \qquad h_{j,t} = e(z_{j,t})$$
$$s_t = W^{yh} h_t + b_y \qquad \hat{y}_{k,t} = g(s_{k,t})$$
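A minimal NumPy sketch of this forward pass, taking tanh for the hidden nonlinearity e(·) and the identity for the output function g(·); these choices and all names are assumptions for illustration only.

```python
import numpy as np

def rnn_forward(V, W_hv, W_hh, W_yh, b_h, b_y):
    """Forward pass of a simple RNN. V: (T, D_in). Returns predictions (T, D_out)."""
    h = np.zeros(W_hh.shape[0])         # initial hidden state
    preds = []
    for v in V:
        z = W_hv @ v + W_hh @ h + b_h   # pre-activation of the hidden state
        h = np.tanh(z)                  # e(.): hidden nonlinearity
        s = W_yh @ h + b_y              # output pre-activation
        preds.append(s)                 # g(.): identity output here
    return np.stack(preds)
```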
36. TRAINING RECURRENT NEURAL NETWORKS
• A possibly high-dimensional, distributed internal representation and nonlinear dynamics allow the model, in theory, to capture complex time series
• Exact gradients can be computed via Backpropagation Through Time
• It is an interesting and powerful model. What's the catch?
  - Training RNNs via gradient descent fails on simple problems
  - Attributed to "vanishing" or "exploding" gradients
  - Much work in the 1990s focused on identifying and addressing these issues; none of these methods were widely adopted
• Best-known attempts to resolve the problem of RNN training:
  - Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997)
  - Echo State Network (ESN) (Jaeger and Haas 2004)
39. FAILURE OF GRADIENT DESCENT
Two hypotheses for why gradient descent fails for neural networks:
• increased frequency and severity of bad local minima
• pathological curvature, like the type seen in the Rosenbrock function:
$$f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2$$
(Figures from James Martens)
40. SECOND ORDER METHODS
• Model the objective function by the local approximation:
$$f(\theta + p) \approx q_\theta(p) \equiv f(\theta) + \nabla f(\theta)^{\top} p + \tfrac{1}{2}\, p^{\top} B p$$
where p is the search direction and B is a matrix which quantifies curvature
• In Newton's method, B is the Hessian matrix, H
• By taking the curvature information into account, Newton's method "rescales" the gradient so it is a much more sensible direction to follow
• Not feasible for high-dimensional problems!
(Figure from James Martens)
44. HESSIAN-FREE OPTIMIZATION
Based on exploiting two simple ideas (and some additional "tricks"):
• For an n-dimensional vector d, the Hessian-vector product Hd can easily be computed using finite differences at the cost of a single extra gradient evaluation
  - In practice, the R-operator (Pearlmutter 1994) is used instead of finite differences
• There is a very effective algorithm for optimizing quadratic objectives which requires only Hessian-vector products: linear conjugate gradient (CG)
This method was shown to effectively train RNNs on the pathological long-term dependency problems they were previously unable to solve (Martens and Sutskever 2011)
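Below is a minimal sketch of the two ingredients named above: a finite-difference Hessian-vector product and a plain linear conjugate-gradient solver driven only by such products. Real Hessian-free optimizers use the R-operator, a Gauss-Newton curvature matrix and damping; this stripped-down version, with names of my own choosing, only illustrates the mechanics.

```python
import numpy as np

def hessian_vector_product(grad_fn, theta, d, eps=1e-6):
    """Approximate H d by finite differences of the gradient:
    Hd ~ (grad(theta + eps*d) - grad(theta)) / eps; one extra gradient evaluation."""
    return (grad_fn(theta + eps * d) - grad_fn(theta)) / eps

def conjugate_gradient(hvp, b, iters=50, tol=1e-10):
    """Solve H p = b (minimize the local quadratic) using only H-vector products."""
    p = np.zeros_like(b)
    r = b.copy()                   # residual b - Hp (p = 0 initially)
    d = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hd = hvp(d)
        alpha = rs / (d @ Hd)
        p += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        if rs_new < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

# Usage (illustrative): Newton-like direction p ~ -H^{-1} grad f
# g = grad_fn(theta)
# p = conjugate_gradient(lambda d: hessian_vector_product(grad_fn, theta, d), -g)
```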
48. GENERATIVE MODELS WITH DISTRIBUTED STATE
• Many sequences are high-dimensional and have complex structure
- RNNs simply predict the expected value at the next time step
- Cannot capture multi-modality of time series
• Generative models (like Restricted Boltzmann Machines) can express the
negative log-likelihood of a given configuration of the output, and can capture
complex distributions
• By using binary latent (hidden) state, we gain the best of both worlds:
- the nonlinear dynamics and observation model of the HMM without the
simple state
- the representationally powerful state of the LDS without the linear-Gaussian
restriction on dynamics and observations
49. DISTRIBUTED BINARY HIDDEN STATE
• Using distributed binary representations for hidden state in directed models of time series makes inference difficult. But we can:
  - Use a Restricted Boltzmann Machine (RBM) for the interactions between hidden and visible variables. A factorial posterior makes inference and sampling easy.
  - Treat the visible variables in the previous time slice as additional fixed inputs
[Figure: RBM with hidden variables (factors) and visible variables (observations) at time t]
One typically uses binary logistic units for both visibles and hiddens:
$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_i v_i W_{ij}\Big) \qquad p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(b_i + \sum_j h_j W_{ij}\Big)$$
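Because the posterior is factorial, sampling is just a pair of independent logistic draws. A minimal NumPy sketch of one block-Gibbs step using the conditionals above (names are illustrative, not from the talk):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_gibbs_step(v, W, b_h, b_v, rng=None):
    """One block-Gibbs step in a binary RBM using the factorial conditionals.

    v: (D,) binary visible vector; W: (D, H) weights; b_h, b_v: hidden/visible biases.
    """
    if rng is None:
        rng = np.random.default_rng()
    p_h = sigmoid(b_h + v @ W)                 # p(h_j = 1 | v), all units in parallel
    h = (rng.random(p_h.shape) < p_h) * 1.0    # sample hiddens
    p_v = sigmoid(b_v + W @ h)                 # p(v_i = 1 | h)
    v_new = (rng.random(p_v.shape) < p_v) * 1.0
    return h, v_new
```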
52. MODELING OBSERVATIONS WITH AN RBM
• So the distributed binary latent (hidden) state of an RBM lets us:
  - Model complex, nonlinear dynamics
  - Easily and exactly infer the latent binary state given the observations
• But RBMs treat data as static (i.i.d.)
[Figure: RBM with hidden variables (factors) and visible variables (joint angles) at time t]
58. CONDITIONAL RESTRICTED BOLTZMANN MACHINES
(Taylor, Hinton and Roweis NIPS 2006, JMLR 2011)
• Start with a Restricted Boltzmann Machine (RBM)
• Add two types of directed connections:
  - Autoregressive connections model short-term, linear structure
  - History can also influence dynamics through the hidden layer
• Conditioning does not change inference or learning
[Figure: CRBM with a hidden layer and a visible layer, plus directed connections from the recent history to both]
59. CONTRASTIVE DIVERGENCE LEARNING
[Figure: CD-1 schematic, showing pairwise statistics ⟨v_i h_j⟩ collected at the data (iter = 0) and at the one-step reconstruction (iter = 1), with the conditioning history held fixed]
• When updating visible and hidden units, we implement directed connections by treating data from previous time steps as a dynamically changing bias
• Inference and learning do not change
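A minimal sketch of one CD-1 update for a binary CRBM, with the history folded into dynamically changing visible and hidden biases as described above. The weight shapes, learning rate and binary visibles are illustrative assumptions (motion-capture CRBMs typically use real-valued, Gaussian visible units), so treat this as a sketch rather than the talk's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def crbm_cd1_update(v, v_hist, W, A, B, b_v, b_h, lr=1e-3, rng=None):
    """One CD-1 step for a binary CRBM. v: (D,) current frame; v_hist: concatenated past frames.

    W: (D, H) visible-hidden weights; A: (P, D) autoregressive weights;
    B: (P, H) past-to-hidden weights. The history only shifts the biases.
    """
    if rng is None:
        rng = np.random.default_rng()
    dyn_bv = b_v + v_hist @ A        # dynamic visible bias
    dyn_bh = b_h + v_hist @ B        # dynamic hidden bias

    # Positive phase: hidden probabilities given the data
    ph_data = sigmoid(dyn_bh + v @ W)
    h = (rng.random(ph_data.shape) < ph_data) * 1.0
    # Negative phase: one-step (mean-field) reconstruction, then hiddens again
    v_recon = sigmoid(dyn_bv + W @ h)
    ph_recon = sigmoid(dyn_bh + v_recon @ W)

    # Approximate gradient: <v h>_data - <v h>_recon (and likewise for the biases)
    W += lr * (np.outer(v, ph_data) - np.outer(v_recon, ph_recon))
    A += lr * np.outer(v_hist, v - v_recon)
    B += lr * np.outer(v_hist, ph_data - ph_recon)
    b_v += lr * (v - v_recon)
    b_h += lr * (ph_data - ph_recon)
```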
64. STACKING: THE CONDITIONAL DEEP BELIEF NETWORK
• Learn a CRBM
• Now, treat the sequence of hidden units as "fully observed" data and train a second CRBM
• The composition of CRBMs is a conditional deep belief net
• It can be fine-tuned generatively or discriminatively
[Figure: two-layer stack, with a visible layer, a first hidden layer h^0_t conditioned on its history h^0_{t-2}, h^0_{t-1}, and a second hidden layer h^1_t]
66. MOTION SYNTHESIS WITH A 2-LAYER CDBN
• Model is trained on ~8000 frames of 60fps data (49 dimensions)
• 10 styles of walking: cat, chicken, dinosaur, drunk, gangly, graceful, normal, old-man, sexy and strong
• 600 binary hidden units per layer
• 1 hour training on a modern workstation
68. MODELING CONTEXT
• A single model was trained on 10 "styled" walks from CMU subject 137
• The model can generate each style based on initialization
• We cannot prevent or control transitioning
• How to blend styles?
• Style or person labels can be provided as part of the input to the top layer
[Figure: the two-layer conditional DBN with a vector of labels feeding the top hidden layer]
69. MULTIPLICATIVE INTERACTIONS
• Let latent variables act like gates that dynamically change the connections between other variables
• This amounts to letting variables multiply connections between other variables: three-way multiplicative interactions (a naive sketch follows this slide)
• Recently used in the context of learning correspondences between images (Memisevic and Hinton 2007, 2010), but with a long history before that
[Figure: three-way interaction among units h_j, z_k and v_i]
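To make the three-way interaction concrete, here is a naive NumPy sketch (names are illustrative only) in which a set of gating units blends the effective connection matrix between two other groups of units through a full three-way weight tensor. The cubic parameter count this implies is exactly what the factored form, introduced a few slides later, addresses.

```python
import numpy as np

def gated_preactivation(W, x, gates):
    """Pre-activation of each output unit j: sum_{i,k} W[i, j, k] * x_i * gate_k.

    W is a full three-way weight tensor of shape (D_in, D_out, D_gate): the gating
    units select a weighted blend of D_gate input->output connection matrices.
    """
    return np.einsum('ijk,i,k->j', W, x, gates)
```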
70. GATED RESTRICTED BOLTZMANN MACHINES (GRBM)
Two views: Memisevic and Hinton (2007)
[Figure: two equivalent views of the GRBM, with latent variables gating the connection between an input image and an output image]
71. INFERRING OPTICAL FLOW: IMAGE "ANALOGIES"
• Toy images (Memisevic and Hinton 2006)
• No structure in these images, only how they change
• Can infer optical flow from a pair of images and apply it to a random image
[Figure 2, columns left to right: input images; output images; inferred flowfields; random target images; inferred transformation applied to target images. For the transformations (last column), gray values represent the probability that a pixel is "on" according to the model, ranging from black for 0 to white for 1.]
72. BACK TO MOTION STYLE
• Introduce a set of latent "context" variables whose value is known at training time
• In our example, these represent "motion style", but could also represent height, weight, gender, etc.
• The contextual variables gate every existing pairwise connection in our model
[Figure: context units gating the three-way connections among the model's existing units]
73. LEARNING AND INFERENCE
• Learning and inference remain almost the same as in the standard CRBM
• We can think of the context or style variables as "blending in" a whole "sub-network"
• This allows us to share parameters across styles but selectively adapt dynamics
80. SUPERVISED MODELING OF STYLE
(Taylor, Hinton and Roweis ICML 2009, JMLR 2011)
[Figure: the CRBM drawn with an input layer (data at time t-1:t-N), an output layer (data at time t) and a hidden layer, extended with style features (labels) that gate the connections]
81. OVERPARAMETERIZATION
• Note: the weight matrix W^{v,h} has been replaced by a tensor W^{v,h,z}! (Likewise for the other weights)
• The number of parameters is O(N^3) per group of weights
• More, if we want sparse, overcomplete hiddens
• However, there is a simple yet powerful solution!
[Figure: the style-gated model with its hidden layer, style features, input layer (data at time t-1:t-N) and output layer (data at time t)]
82. FACTORING
$$W^{vh}_{ijl} = \sum_f W^{v}_{if}\, W^{h}_{jf}\, W^{z}_{lf}$$
[Figure: the three-way tensor W^{vh}_{ijl} replaced by three factor matrices W^v, W^h and W^z that meet at a layer of factors f, connecting the output layer (data at time t), hidden layer and style features]
(Figure adapted from Roland Memisevic)
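A small NumPy sketch of the factorization above (names illustrative): three factor matrices of size N×F replace the O(N^3) tensor, and gated pre-activations can be computed without ever forming the full tensor.

```python
import numpy as np

def factored_tensor(Wv, Wh, Wz):
    """Reassemble W^{vh}_{ijl} = sum_f Wv[i,f] * Wh[j,f] * Wz[l,f] (for checking only)."""
    return np.einsum('if,jf,lf->ijl', Wv, Wh, Wz)

def factored_hidden_preactivation(Wv, Wh, Wz, v, z):
    """Hidden pre-activations under the factored model, without the full tensor.

    Each factor f computes (sum_i Wv[i,f] v_i) * (sum_l Wz[l,f] z_l); the products
    are then mixed back to the hidden units through Wh.
    """
    fac = (v @ Wv) * (z @ Wz)   # (F,) per-factor products
    return fac @ Wh.T           # (D_h,) hidden pre-activations
```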
90. SUPERVISED MODELING OF STYLE
(Taylor, Hinton and Roweis ICML 2009, JMLR 2011)
[Figure: the style-gated CRBM redrawn with a layer of factors connecting the input layer (data at time t-1:t-N), output layer (data at time t), hidden layer and style features]
91. PARAMETER SHARING
93. MOTION SYNTHESIS: FACTORED 3RD-ORDER CRBM
• Same 10-styles dataset
• 600 binary hidden units
• 3×200 deterministic factors
• 100 real-valued style features
• 1 hour training on a modern workstation
• Synthesis is real-time