LEARNING REPRESENTATIONS OF SEQUENCES
WITH APPLICATIONS TO MOTION CAPTURE AND VIDEO ANALYSIS

GRAHAM TAYLOR
SCHOOL OF ENGINEERING
UNIVERSITY OF GUELPH

Papers and software available at: http://www.uoguelph.ca/~gwtaylor

OVERVIEW: THIS TALK

• Learning representations of temporal data:
  - existing methods and challenges faced
  - recent methods inspired by “deep learning”

• Applications: in particular, modeling human pose and activity
  - highly structured data: e.g. motion capture
  - weakly structured data: e.g. video

[Figures: a convolutional feature extraction architecture (input X, output Y, feature layer Z, pooling layer P) and a plot of learned components (1-10) over time]

OUTLINE

Learning representations from sequences
  Existing methods, challenges

Composable, distributed-state models for sequences
  Conditional Restricted Boltzmann Machines and their variants

Using learned representations to analyze video

A brief (and incomplete) survey of deep learning for activity recognition

TIME SERIES DATA

• Time is an integral part of many human behaviours (motion, reasoning)
• In building statistical models, time is sometimes ignored, which is often problematic
• Models that do incorporate dynamics fail to account for the fact that the data is often
  high-dimensional and nonlinear, and contains long-range dependencies

[Figure: intensity (number of news stories) over time, 2000-2009. Graphic: David McCandless, informationisbeautiful.net]

Today we will discuss a number of models that have been developed to
address these challenges

VECTOR AUTOREGRESSIVE MODELS

    v_t = b + \sum_{m=1}^{M} A_m v_{t-m} + e_t

• Have dominated statistical time-series analysis for approx. 50 years
• Can be fit easily by least-squares regression
• Can fail even for simple nonlinearities present in the system
  - but many data sets can be modeled well by a linear system
• Well understood; many extensions exist

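Since the slide only states that a VAR model can be fit by least-squares regression, here is a minimal sketch of what that looks like (my own toy example, not code from the talk; all variable names are illustrative). It stacks the lagged observations into a design matrix and solves for b and the A_m jointly:

```python
import numpy as np

def fit_var(v, M):
    """Least-squares fit of a VAR(M) model: v_t = b + sum_m A_m v_{t-m} + e_t.

    v: (T, D) array of observations. Returns bias b (D,) and A, a list of M (D, D) matrices.
    """
    T, D = v.shape
    # Design matrix rows: [1, v_{t-1}, ..., v_{t-M}] for each t >= M
    X = np.hstack([np.ones((T - M, 1))] + [v[M - m:T - m] for m in range(1, M + 1)])
    Y = v[M:]                                   # targets v_t
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # (1 + M*D, D)
    b = W[0]
    A = [W[1 + (m - 1) * D: 1 + m * D].T for m in range(1, M + 1)]
    return b, A

# Toy usage: data generated from a known 2-lag linear system plus noise
rng = np.random.default_rng(0)
T, D = 500, 3
v = np.zeros((T, D))
A_true = [0.5 * np.eye(D), 0.3 * np.eye(D)]
for t in range(2, T):
    v[t] = A_true[0] @ v[t - 1] + A_true[1] @ v[t - 2] + 0.1 * rng.standard_normal(D)
b_hat, A_hat = fit_var(v, M=2)
print(np.round(A_hat[0], 2))   # should be close to 0.5 * I
```
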




MARKOV (“N-GRAM”) MODELS

[Graphical model: a fully observed chain v_{t-2} -> v_{t-1} -> v_t]

• Fully observable
• Sequential observations may have nonlinear dependence
• Derived by assuming sequences have the Markov property:

    p(v_t | \{v_1^{t-1}\}) = p(v_t | \{v_{t-N}^{t-1}\})

• This leads to the joint:

    p(\{v_1^T\}) = p(\{v_1^N\}) \prod_{t=N+1}^{T} p(v_t | \{v_{t-N}^{t-1}\})

• The number of parameters is exponential in N!

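To make the exponential blow-up concrete, a small back-of-the-envelope calculation (mine, not from the talk): a fully general table for p(v_t | v_{t-N}, ..., v_{t-1}) over a K-symbol alphabet has K^N contexts, each carrying K - 1 free probabilities.

```python
# Number of free parameters in a full order-N Markov ("N-gram") table
# over a K-symbol alphabet: one (K-1)-parameter distribution per K^N context.
def ngram_params(K, N):
    return K ** N * (K - 1)

for N in (1, 2, 3, 5, 10):
    print(f"K=10, N={N}: {ngram_params(10, N):,} parameters")
# K=10, N=10 already requires 90,000,000,000 parameters.
```
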


HIDDEN MARKOV MODELS (HMM)

Introduces a hidden state that controls the dependence of the current observation on the past

[Graphical model: hidden chain h_{t-2} -> h_{t-1} -> h_t, each h_t emitting an observation v_t]

• Successful in speech & language modeling, biology
• Defined by 3 sets of parameters:
  - Initial state parameters, π
  - Transition matrix, A
  - Emission distribution, p(v_t | h_t)
• Factored joint distribution:

    p(\{h_t\}, \{v_t\}) = p(h_1)\, p(v_1 | h_1) \prod_{t=2}^{T} p(h_t | h_{t-1})\, p(v_t | h_t)

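The factored joint above can be evaluated directly for a complete hidden/observed sequence. A minimal sketch in log space, assuming discrete emissions (the parameter values below are made up for illustration):

```python
import numpy as np

def hmm_log_joint(h, v, pi, A, B):
    """log p({h_t}, {v_t}) = log p(h_1) + log p(v_1|h_1) + sum_t [log p(h_t|h_{t-1}) + log p(v_t|h_t)].

    h, v: integer state / observation sequences; pi: (K,) initial distribution;
    A: (K, K) transitions A[i, j] = p(h_t = j | h_{t-1} = i);
    B: (K, V) emissions B[i, k] = p(v_t = k | h_t = i).
    """
    lp = np.log(pi[h[0]]) + np.log(B[h[0], v[0]])
    for t in range(1, len(v)):
        lp += np.log(A[h[t - 1], h[t]]) + np.log(B[h[t], v[t]])
    return lp

# Toy example with 2 hidden states and 2 observation symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
print(hmm_log_joint([0, 0, 1], [0, 1, 1], pi, A, B))
```
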
HMM INFERENCE AND LEARNING

• Typically three tasks we want to perform in an HMM:
  - Likelihood estimation
  - Inference
  - Learning
• All are exact and tractable due to the simple structure of the model
• Forward-backward algorithm for inference (belief propagation)
• Baum-Welch algorithm for learning (EM)
• Viterbi algorithm for state estimation (max-product)

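As a sketch of the first task, likelihood estimation, the forward pass of the forward-backward algorithm sums over all hidden paths in O(TK^2) time. The scaled version below is a standard textbook formulation, not code from the talk:

```python
import numpy as np

def hmm_log_likelihood(v, pi, A, B):
    """Scaled forward algorithm: returns log p(v_1, ..., v_T) for a discrete-emission HMM."""
    alpha = pi * B[:, v[0]]               # unnormalized p(h_1, v_1)
    c = alpha.sum()
    alpha /= c
    loglik = np.log(c)
    for t in range(1, len(v)):
        alpha = (alpha @ A) * B[:, v[t]]  # predict the next state, then weight by the emission
        c = alpha.sum()
        alpha /= c
        loglik += np.log(c)
    return loglik

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
print(hmm_log_likelihood([0, 1, 1, 0], pi, A, B))
```
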




LIMITATIONS OF HMMS

• Many high-dimensional data sets contain rich componential structure
• Hidden Markov Models cannot model such data efficiently: a single, discrete
  K-state multinomial must represent the history of the time series
• To model K bits of information, they need 2^K hidden states
• We seek models with distributed hidden state:
  - capacity linear in the number of components





LINEAR DYNAMICAL SYSTEMS

Graphical model is the same as the HMM but with real-valued state vectors

[Graphical model: hidden chain h_{t-2} -> h_{t-1} -> h_t, each h_t emitting an observation v_t]

• Characterized by linear-Gaussian dynamics and observations:

    p(h_t | h_{t-1}) = N(h_t; A h_{t-1}, Q)        p(v_t | h_t) = N(v_t; C h_t, R)

• Inference is performed using Kalman smoothing (belief propagation)
• Learning can be done by EM
• Dynamics and observations may also depend on an observed input (control)

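A minimal sketch of one step of Kalman filtering for the linear-Gaussian model above (filtering rather than the full Kalman smoother mentioned on the slide; the matrices in the usage example are placeholders):

```python
import numpy as np

def kalman_filter_step(mu, P, v, A, C, Q, R):
    """One predict/update step for h_t | h_{t-1} ~ N(A h_{t-1}, Q) and v_t | h_t ~ N(C h_t, R).

    mu, P: posterior mean/covariance of h_{t-1} given v_{1:t-1}.
    Returns the posterior mean/covariance of h_t given v_{1:t}.
    """
    # Predict
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Update with the observation v_t
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (v - C @ mu_pred)
    P_new = P_pred - K @ C @ P_pred
    return mu_new, P_new

# Toy usage: 2-d latent state (position, velocity), 1-d observation of position
A = np.array([[1.0, 1.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q, R = 0.01 * np.eye(2), np.array([[0.1]])
mu, P = np.zeros(2), np.eye(2)
mu, P = kalman_filter_step(mu, P, np.array([1.2]), A, C, Q, R)
print(mu)
```
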

LATENT REPRESENTATIONS FOR REAL-WORLD DATA

Data for many real-world problems (e.g. motion capture, finance) is high-dimensional,
containing complex nonlinear relationships between components

Hidden Markov Models
Pro: complex, nonlinear emission model
Con: a single K-state multinomial represents the entire history

Linear Dynamical Systems
Pro: state can convey much more information
Con: emission model constrained to be linear





LEARNING DISTRIBUTED REPRESENTATIONS

• Simple networks are capable of discovering useful and interesting internal
  representations of static data
• The parallel nature of computation in connectionist models may be at odds with
  the serial nature of temporal events
• Simple idea: a spatial representation of time
  - Needs a buffer; not biologically plausible
  - Cannot process inputs of differing length
  - Cannot distinguish between absolute and relative position
• This motivates an implicit representation of time in connectionist models,
  where time is represented by its effect on processing

RECURRENT NEURAL NETWORKS

[Figure: an RNN unrolled in time, with inputs v_t, hidden states h_t, and predictions ŷ_t. (Figure from Martens and Sutskever)]

• Neural network replicated in time
• At each step, it receives an input vector, updates its internal representation via
  nonlinear activation functions, and makes a prediction:

    a_t = W^{hv} v_t + W^{hh} h_{t-1} + b^h
    h_{j,t} = e(a_{j,t})
    s_t = W^{yh} h_t + b^y
    \hat{y}_{k,t} = g(s_{k,t})

  where e(·) is the hidden activation function and g(·) is the output nonlinearity

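A minimal sketch of the forward pass these equations describe, assuming tanh for the hidden nonlinearity e(·) and the identity for the output function g(·); the weights are random placeholders rather than a trained model:

```python
import numpy as np

def rnn_forward(V, W_hv, W_hh, W_yh, b_h, b_y):
    """Run an RNN over a sequence V of shape (T, n_in); returns predictions of shape (T, n_out)."""
    h = np.zeros(W_hh.shape[0])
    preds = []
    for v_t in V:
        a_t = W_hv @ v_t + W_hh @ h + b_h   # pre-activation
        h = np.tanh(a_t)                    # e(.): hidden nonlinearity
        s_t = W_yh @ h + b_y
        preds.append(s_t)                   # g(.): identity output here
    return np.array(preds)

rng = np.random.default_rng(0)
n_in, n_h, n_out, T = 4, 8, 2, 10
params = dict(
    W_hv=rng.standard_normal((n_h, n_in)) / np.sqrt(n_in),
    W_hh=rng.standard_normal((n_h, n_h)) / np.sqrt(n_h),
    W_yh=rng.standard_normal((n_out, n_h)) / np.sqrt(n_h),
    b_h=np.zeros(n_h), b_y=np.zeros(n_out))
Y = rnn_forward(rng.standard_normal((T, n_in)), **params)
print(Y.shape)  # (10, 2)
```
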
TRAINING RECURRENT NEURAL NETWORKS

• A possibly high-dimensional, distributed internal representation and nonlinear
  dynamics allow the model, in theory, to model complex time series
• Gradients can be computed exactly via Backpropagation Through Time (BPTT)
• It is an interesting and powerful model. What’s the catch?
  - Training RNNs via gradient descent fails on simple problems
  - Attributed to “vanishing” or “exploding” gradients (a small numerical
    illustration follows below)
  - Much work in the 1990s focused on identifying and addressing these
    issues: none of these methods were widely adopted
• Best-known attempts to resolve the problem of RNN training:
  - Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997)
  - Echo-State Network (ESN) (Jaeger and Haas 2004)

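As a rough numerical illustration of the vanishing-gradient problem (my own toy example, not from the talk): BPTT multiplies the error signal by W_hh^T diag(e'(a_t)) once per time step, so with a stable recurrent matrix (spectral radius below 1) the long-range terms of the gradient decay geometrically.

```python
import numpy as np

rng = np.random.default_rng(1)
n_h, T = 50, 100
W_hh = 0.8 * rng.standard_normal((n_h, n_h)) / np.sqrt(n_h)  # stable recurrent weights (spectral radius < 1)

# Forward pass (external inputs omitted for simplicity), storing the hidden states
h = [0.1 * rng.standard_normal(n_h)]
for t in range(T):
    h.append(np.tanh(W_hh @ h[-1]))

# Backward pass: propagate an error signal from the last step back through time
grad = rng.standard_normal(n_h)                   # dL/dh_T
norms = []
for t in range(T, 0, -1):
    grad = W_hh.T @ (grad * (1.0 - h[t] ** 2))    # multiply by W_hh^T diag(tanh'(a_t))
    norms.append(np.linalg.norm(grad))

print(f"after 1 step: {norms[0]:.2e}   after {T} steps: {norms[-1]:.2e}")
# The norm typically shrinks by many orders of magnitude, so error signals from distant
# time steps contribute almost nothing; with a larger recurrent gain it explodes instead.
```
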




FAILURE OF GRADIENT DESCENT

Two hypotheses for why gradient descent fails for neural networks:
• increased frequency and severity of bad local minima
• pathological curvature, like the type seen in the Rosenbrock function:

    f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2

(Figures from James Martens)

SECOND ORDER METHODS

• Model the objective function by the local quadratic approximation:

    f(\theta + p) \approx q_\theta(p) \equiv f(\theta) + \nabla f(\theta)^\top p + \tfrac{1}{2} p^\top B p

  where p is the search direction and B is a matrix which quantifies curvature
• In Newton’s method, B is the Hessian matrix, H
• By taking the curvature information into account, Newton’s method “rescales”
  the gradient so it is a much more sensible direction to follow
• Not feasible for high-dimensional problems!

(Figure from James Martens)

HESSIAN-FREE OPTIMIZATION

Based on exploiting two simple ideas (and some additional “tricks”):
• For an n-dimensional vector d, the Hessian-vector product Hd can easily be
  computed using finite differences at the cost of a single extra gradient evaluation
  - In practice, the R-operator (Pearlmutter 1994) is used instead of finite differences
• There is a very effective algorithm for optimizing quadratic objectives which
  requires only Hessian-vector products: linear conjugate gradient (CG)



This method was shown to effectively train RNNs on the pathological
long-term dependency problems they were previously unable to solve
(Martens and Sutskever 2011)

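A minimal sketch of the two ideas above, combining a finite-difference Hessian-vector product with linear CG on a toy quadratic objective. This is only an illustration of the principle, not Martens' algorithm: it omits the R-operator, damping, preconditioning, and the other "tricks":

```python
import numpy as np

def hessian_vector_product(grad_fn, theta, d, eps=1e-6):
    """Finite-difference approximation: H d ≈ (∇f(θ + εd) − ∇f(θ)) / ε."""
    return (grad_fn(theta + eps * d) - grad_fn(theta)) / eps

def conjugate_gradient(hvp, b, n_iters=50, tol=1e-10):
    """Solve H p = b using only Hessian-vector products (linear CG)."""
    p = np.zeros_like(b)
    r = b.copy()                      # residual b - H p (p = 0 initially)
    d = r.copy()
    rs = r @ r
    for _ in range(n_iters):
        Hd = hvp(d)
        alpha = rs / (d @ Hd)
        p += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        if rs_new < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

# Toy objective: f(θ) = 0.5 θᵀAθ − bᵀθ, so ∇f(θ) = Aθ − b and the Newton step solves A p = −∇f
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 20))
A = M @ M.T + np.eye(20)              # positive-definite curvature
b = rng.standard_normal(20)
grad_fn = lambda th: A @ th - b
theta = rng.standard_normal(20)
step = conjugate_gradient(lambda d: hessian_vector_product(grad_fn, theta, d), -grad_fn(theta))
print(np.linalg.norm(grad_fn(theta + step)))   # near zero: one Newton-like step solves the quadratic
```
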



GENERATIVE MODELS WITH DISTRIBUTED STATE

• Many sequences are high-dimensional and have complex structure
  - RNNs simply predict the expected value at the next time step
  - Cannot capture multi-modality of time series
• Generative models (like Restricted Boltzmann Machines) can express the
  negative log-likelihood of a given configuration of the output, and can capture
  complex distributions
• By using binary latent (hidden) state, we gain the best of both worlds:
  - the nonlinear dynamics and observation model of the HMM without the
    simple state
  - the representationally powerful state of the LDS without the linear-Gaussian
    restriction on dynamics and observations

DISTRIBUTED BINARY HIDDEN STATE

• Using distributed binary representations for hidden state in directed models of
  time series makes inference difficult. But we can:
  - Use a Restricted Boltzmann Machine (RBM) for the interactions between
    hidden and visible variables. A factorial posterior makes inference and
    sampling easy.
  - Treat the visible variables in the previous time slice as additional fixed inputs

[Figure: an RBM connecting hidden variables (factors) at time t to visible variables (observations) at time t]

One typically uses binary logistic units for both visibles and hiddens:

    p(h_j = 1 | \mathbf{v}) = \sigma(b_j + \sum_i v_i W_{ij})
    p(v_i = 1 | \mathbf{h}) = \sigma(b_i + \sum_j h_j W_{ij})

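A minimal sketch of these two conditionals for a binary RBM (random weights and my own variable names): inferring the hiddens is a single affine map followed by a sigmoid, and alternating the two conditionals gives one step of block Gibbs sampling.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, b_h, rng):
    """p(h_j = 1 | v) = sigma(b_j + sum_i v_i W_ij); returns probabilities and a binary sample."""
    p = sigmoid(b_h + v @ W)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h, W, b_v, rng):
    """p(v_i = 1 | h) = sigma(b_i + sum_j h_j W_ij)."""
    p = sigmoid(b_v + h @ W.T)
    return p, (rng.random(p.shape) < p).astype(float)

rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = 0.1 * rng.standard_normal((n_v, n_h))
b_v, b_h = np.zeros(n_v), np.zeros(n_h)

v0 = (rng.random(n_v) < 0.5).astype(float)
ph, h0 = sample_h_given_v(v0, W, b_h, rng)   # exact, factorial posterior over hiddens
pv, v1 = sample_v_given_h(h0, W, b_v, rng)   # one block-Gibbs reconstruction
print(ph, pv)
```
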


MODELING OBSERVATIONS WITH AN RBM

• So the distributed binary latent (hidden) state of an RBM lets us:
  - Model complex, nonlinear dynamics
  - Easily and exactly infer the latent binary state given the observations
• But RBMs treat data as static (i.i.d.)

[Figure: an RBM connecting hidden variables (factors) at time t to visible variables (joint angles) at time t]

CONDITIONAL RESTRICTED BOLTZMANN MACHINES
(Taylor, Hinton and Roweis NIPS 2006, JMLR 2011)

• Start with a Restricted Boltzmann Machine (RBM)
• Add two types of directed connections:
  - Autoregressive connections model short-term, linear structure
  - History can also influence dynamics through the hidden layer
• Conditioning changes neither inference nor learning

[Figure: a CRBM with a hidden layer and a visible layer, plus directed connections from the recent history of the visible variables to both layers]



CONTRASTIVE DIVERGENCE LEARNING

[Figure: one step of contrastive divergence: ⟨v_i h_j⟩ is measured on the data (iter = 0) and on the one-step reconstruction (iter = 1), with the conditioning history held fixed]

• When updating visible and hidden units, we implement directed connections
  by treating data from previous time steps as a dynamically changing bias
• Inference and learning do not change

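A minimal sketch of one CD-1 update for a CRBM under this "dynamically changing bias" view, assuming Gaussian visible units with unit variance (as used for motion capture) and binary hiddens. All names, shapes, and learning rates here are illustrative, not the reference implementation:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def crbm_cd1_update(v_t, v_hist, W, A, B, b_v, b_h, lr=1e-3, rng=np.random.default_rng(0)):
    """One contrastive-divergence step for a CRBM with Gaussian visibles and binary hiddens.

    v_t: (n_v,) current frame; v_hist: (n_v * order,) concatenated previous frames.
    A: autoregressive weights (visible dynamic bias); B: history-to-hidden weights.
    """
    bv_dyn = b_v + A @ v_hist            # dynamically changing visible bias
    bh_dyn = b_h + B @ v_hist            # dynamically changing hidden bias

    # Positive phase: exact factorial posterior given the data
    ph_data = sigmoid(bh_dyn + W @ v_t)
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Negative phase: one step of block Gibbs (reconstruction)
    v_recon = bv_dyn + W.T @ h_data      # mean of the Gaussian visibles
    ph_recon = sigmoid(bh_dyn + W @ v_recon)

    # Parameter updates: <.>_data - <.>_recon
    W += lr * (np.outer(ph_data, v_t) - np.outer(ph_recon, v_recon))
    A += lr * np.outer(v_t - v_recon, v_hist)
    B += lr * np.outer(ph_data - ph_recon, v_hist)
    b_v += lr * (v_t - v_recon)
    b_h += lr * (ph_data - ph_recon)
    return W, A, B, b_v, b_h

# Toy usage with random data (49-dim frames, order-2 history)
n_v, n_h, order = 49, 100, 2
rng = np.random.default_rng(1)
W = 0.01 * rng.standard_normal((n_h, n_v))
A = 0.01 * rng.standard_normal((n_v, n_v * order))
B = 0.01 * rng.standard_normal((n_h, n_v * order))
b_v, b_h = np.zeros(n_v), np.zeros(n_h)
crbm_cd1_update(rng.standard_normal(n_v), rng.standard_normal(n_v * order), W, A, B, b_v, b_h)
print(np.linalg.norm(W))
```
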


STACKING: THE CONDITIONAL DEEP BELIEF NETWORK

• Learn a CRBM
• Now, treat the sequence of hidden units as “fully observed” data and train a
  second CRBM
• The composition of CRBMs is a conditional deep belief net
• It can be fine-tuned generatively or discriminatively

[Figure: a two-layer conditional deep belief network with a visible layer, a first hidden layer h^0, and a second hidden layer h^1_t, where the past states h^0_{t-1}, h^0_{t-2} condition the top layer]

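A minimal sketch of the greedy stacking step, under the assumption that a first-layer CRBM has already been trained: its mean-field hidden activations are computed for every frame and then treated as fully observed data for the next layer. The helper and parameter names below are hypothetical stand-ins:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def infer_hiddens(V, W, B, b_h, order):
    """Mean-field activations p(h_t = 1 | v_t, history) of a trained first-layer CRBM."""
    H = []
    for t in range(order, V.shape[0]):
        v_hist = V[t - order:t].reshape(-1)        # concatenated recent frames
        H.append(sigmoid(b_h + B @ v_hist + W @ V[t]))
    return np.array(H)

# Toy shapes (random stand-in parameters; a real model would come from CRBM training)
rng = np.random.default_rng(0)
n_v, n_h, order, T = 49, 20, 2, 200
V = rng.standard_normal((T, n_v))
W = 0.01 * rng.standard_normal((n_h, n_v))
B = 0.01 * rng.standard_normal((n_h, n_v * order))
b_h = np.zeros(n_h)

H1 = infer_hiddens(V, W, B, b_h, order)            # (T - order, n_h)
# Greedy stacking: treat H1 as "fully observed" data and train a second CRBM on it,
# reusing the same CD-based training step sketched earlier.
print(H1.shape)
```
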
MOTION SYNTHESIS WITH A 2-LAYER CDBN

• Model is trained on ~8000 frames of 60fps data (49 dimensions)
• 10 styles of walking: cat, chicken, dinosaur, drunk, gangly, graceful,
  normal, old-man, sexy and strong
• 600 binary hidden units per layer
• < 1 hour training on a modern workstation

MODELING CONTEXT

• A single model was trained on 10 “styled” walks from CMU subject 137
• The model can generate each style based on initialization
• We can neither prevent nor control transitioning between styles
• How to blend styles?
• Style or person labels can be provided as part of the input to the top layer

[Figure: the two-layer conditional model with style labels feeding into the top hidden layer]





MULTIPLICATIVE INTERACTIONS

• Let latent variables act like gates that dynamically change the connections
  between other variables
• This amounts to letting variables multiply connections between other variables:
  three-way multiplicative interactions
• Recently used in the context of learning correspondence between images
  (Memisevic & Hinton 2007, 2010), but with a long history before that

[Figure: a three-way interaction among a visible unit v_i, a hidden unit h_j, and a gating unit z_k]

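As a sketch of what gating means computationally (my own toy example): with a three-way weight tensor W[i, j, k], fixing the gating variables z collapses the tensor into an effective pairwise weight matrix between the other two groups of variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h, n_z = 5, 4, 3
W = rng.standard_normal((n_v, n_h, n_z))   # three-way interaction tensor W_ijk

def effective_weights(W, z):
    """Fixing the gating units z yields pairwise v-h weights: W_eff[i, j] = sum_k W[i, j, k] z[k]."""
    return np.tensordot(W, z, axes=([2], [0]))

z_style_a = np.array([1.0, 0.0, 0.0])      # e.g. a one-hot "style" or context label
z_style_b = np.array([0.0, 1.0, 0.0])

# Each setting of z blends in a different v-h sub-network:
print(effective_weights(W, z_style_a).shape)                      # (5, 4)
print(np.allclose(effective_weights(W, z_style_a), W[:, :, 0]))   # True for a one-hot z
```
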
GATED RESTRICTED BOLTZMANN MACHINES (GRBM)

Two views: Memisevic & Hinton (2007)

[Figure: two equivalent views of the GRBM: left, latent variables h_j modulate the weights between the input z_k and the output v_i; right, the same model drawn with explicit three-way connections among input, output, and latent variables]

INFERRING OPTICAL FLOW: IMAGE “ANALOGIES”

• Toy images (Memisevic & Hinton 2006)
• No structure in these images, only in how they change
• Can infer optical flow from a pair of images and apply it to a random image

[Figure 2: Columns (left to right): input images; output images; inferred flow fields;
random target images; inferred transformation applied to target images. For the
transformations (last column), gray values represent the probability that a pixel is
“on” according to the model, ranging from black for 0 to white for 1.]

BACK TO MOTION STYLE

                 • Introduce a set of latent “context” variables whose values are known at training time
                 • In our example these represent “motion style”, but they could also represent height, weight, gender, etc.
                 • The contextual variables gate every existing pairwise connection in our model
                 (Diagram: three-way interactions among units v_i, h_j, and z_k.)



                18 May 2012 / 29
                Learning Representations of Sequences / G Taylor




Saturday, June 16, 2012
LEARNING AND INFERENCE

                 • Learning and inference remain almost the same as in the standard CRBM
                 • We can think of the context or style variables as “blending in” a whole “sub-network” (see the sketch below)
                 • This allows us to share parameters across styles but selectively adapt the dynamics
                 (Diagram: three-way interactions among units v_i, h_j, and z_k.)
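A minimal sketch of that blending view (NumPy; sizes and names are illustrative, and the 108-dimensional frame size is an assumption, not stated on the slide): a one-hot style label selects one style-specific sub-network from a gated weight group, while soft style values blend several sub-networks.

```python
# Hedged sketch: style variables "blend in" sub-networks (illustrative sizes/names).
import numpy as np

rng = np.random.default_rng(0)
I, J, L = 108, 600, 10                         # visibles, hiddens, styles (108 is assumed)
W = 0.01 * rng.standard_normal((I, J, L))      # one gated weight group W^{v,h,z}

one_hot = np.zeros(L)
one_hot[3] = 1.0                               # a known style label...
W_style3 = np.einsum('ijl,l->ij', W, one_hot)  # ...selects that style's sub-network

soft = np.zeros(L)
soft[[3, 7]] = 0.5                             # soft labels blend two sub-networks
W_blend = np.einsum('ijl,l->ij', W, soft)
assert np.allclose(W_blend, 0.5 * (W[:, :, 3] + W[:, :, 7]))
```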



                18 May 2012 / 30
                Learning Representations of Sequences / G Taylor




Saturday, June 16, 2012
SUPERVISED MODELING OF STYLE
                                                                                (Taylor, Hinton and Roweis ICML 2009, JMLR 2011)



                 (Diagrams: left, a CRBM with hidden layer j, an input layer (data at time t-1:t-N) and an output layer (data at time t); right, the same model with style features l gating its connections.)

                18 May 2012 / 31
                Learning Representations of Sequences / G Taylor




Saturday, June 16, 2012
OVERPARAMETERIZATION
                 • Note: the weight matrix W^{v,h} has been replaced by a tensor W^{v,h,z}! (Likewise for the other weight groups)
                 • The number of parameters is O(N³) per group of weights (see the quick count below)
                 • More, if we want sparse, overcomplete hiddens
                 • However, there is a simple yet powerful solution!
                 (Diagram: the style-gated model, with style features l, hidden layer j, an input layer (data at time t-1:t-N) and an output layer (data at time t).)
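A quick back-of-the-envelope count of that blow-up (Python; the 108-dimensional mocap frame size is an assumption for illustration, since the slide does not state it; the hidden and style sizes are the ones quoted later in the talk).

```python
# Hedged sketch: unfactored gated weights grow as a product of three layer sizes.
n_visible, n_hidden, n_style = 108, 600, 100   # 108 is an assumed frame dimensionality
unfactored = n_visible * n_hidden * n_style    # one W^{v,h,z} tensor
print(unfactored)                              # 6,480,000 weights for a single group,
                                               # and the model has several such groups
```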




                18 May 2012 / 32
                Learning Representations of Sequences / G Taylor




Saturday, June 16, 2012
                 FACTORING

                 (Diagram: the three-way weight W^{vh}_{ijl} connecting the output layer (data at time t), the hidden layer j, and the style features l is replaced by three pairwise weight matrices W^v_{if}, W^h_{jf}, W^z_{lf} that meet at a layer of deterministic factors f.)

                 W^{vh}_{ijl} = Σ_f W^v_{if} W^h_{jf} W^z_{lf}
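A small numerical check of this factorization (NumPy; tiny illustrative sizes so the full tensor fits in memory, and the names are not taken from the released code): the factored evaluation matches what the full three-way tensor would give, while never materializing the O(N³) tensor.

```python
# Hedged sketch: factored three-way weights vs. the full tensor (illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
I, J, L, F = 6, 5, 4, 7                       # visibles, hiddens, style features, factors
Wv = rng.standard_normal((I, F))              # W^v_{if}
Wh = rng.standard_normal((J, F))              # W^h_{jf}
Wz = rng.standard_normal((L, F))              # W^z_{lf}

v = rng.standard_normal(I)
h = rng.integers(0, 2, size=J)
z = rng.standard_normal(L)

# Factored evaluation: project each layer onto the factors, multiply factor-wise, sum.
# Cost is O(F * (I + J + L)) instead of O(I * J * L).
factored = np.sum((v @ Wv) * (h @ Wh) * (z @ Wz))

# Reference: explicitly build W^{vh}_{ijl} = sum_f Wv_if * Wh_jf * Wz_lf.
W_full = np.einsum('if,jf,lf->ijl', Wv, Wh, Wz)
reference = np.einsum('ijl,i,j,l->', W_full, v, h, z)
assert np.isclose(factored, reference)
```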


                18 May 2012 / 33
                Learning Representations of Sequences / G Taylor
                                                                               (Figure adapted from Roland Memisevic)



Saturday, June 16, 2012
SUPERVISED MODELING OF STYLE
                                                                                (Taylor, Hinton and Roweis ICML 2009, JMLR 2011)



                 (Diagrams: left, a CRBM with hidden layer j, an input layer (data at time t-1:t-N) and an output layer (data at time t); right, the factored model, in which the style features l and the layers are connected through a set of factors.)

                18 May 2012 / 34
                Learning Representations of Sequences / G Taylor




Saturday, June 16, 2012
PARAMETER SHARING




                18 May 2012 / 35
                Learning Representations of Sequences / G Taylor




Saturday, June 16, 2012
                 MOTION SYNTHESIS:
                FACTORED 3RD-ORDER CRBM



                • Same 10-styles dataset
                • 600 binary hidden units
                • 3×200 deterministic factors
                • 100 real-valued style features
                 • < 1 hour training on a modern workstation
                • Synthesis is real-time
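For concreteness, a hypothetical configuration object collecting the sizes quoted on this slide; the class itself, the field names, and the frame size / history length are assumptions for illustration and are not taken from the released code.

```python
# Hedged sketch: hyperparameters for a factored third-order CRBM of the quoted size.
from dataclasses import dataclass

@dataclass
class FactoredCRBMConfig:
    n_hidden: int = 600           # binary hidden units
    n_factors: int = 200          # per weight group; three groups -> 3 x 200 factors
    n_style_features: int = 100   # real-valued style features
    n_styles: int = 10            # styled walks in the training set
    n_visible: int = 108          # assumed mocap frame dimensionality (not on the slide)
    order: int = 12               # assumed autoregressive history (not on the slide)

cfg = FactoredCRBMConfig()
```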




                18 May 2012 / 36
                Learning Representations of Sequences / G Taylor

                                                                       Summary

Saturday, June 16, 2012


P06 motion and video cvpr2012 deep learning methods for vision

  • 1. LEARNING REPRESENTATIONS OF SEQUENCES WITH APPLICATIONS TO MOTION CAPTURE AND VIDEO ANALYSIS GRAHAM TAYLOR SCHOOL OF ENGINEERING UNIVERSITY OF GUELPH Papers and software available at: http://www.uoguelph.ca/~gwtaylor Saturday, June 16, 2012
  • 2. OVERVIEW: THIS TALK 18 May 2012 / 2 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 3. OVERVIEW: THIS TALK • Learning representations of temporal data: X (Input) x Np Nx Nw Nz pk - existing methods and challenges faced x Nw Np zk m,n - recent methods inspired by “deep learning” Nx Nz k P Pooling Y (Output) layer k Ny y Z Nw Feature y Nw layer Ny 18 May 2012 / 2 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 4. OVERVIEW: THIS TALK • Learning representations of temporal data: X (Input) x Np Nx Nw Nz pk - existing methods and challenges faced x Nw Np zk m,n - recent methods inspired by “deep learning” Nx Nz k P Pooling Y (Output) layer k Ny y Z Nw Feature y Nw layer Ny • Applications: in particular, modeling human pose and activity - highly structured data: e.g. motion capture - weakly structured data: e.g. video w Component j 1 2 3 4 5 6 7 8 9 10 18 May 2012 / 2 100 200 300 400 500 600 700 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 5. OUTLINE Learning representations from sequences Existing methods, challenges yt−2 yt−1 yt Composable, distributed-state models for sequences Conditional Restricted Boltzmann Machines and their variants X (Input) Np Using learned representations to analyze video x Nx Nw Nz x pk Nw Np zk m,n A brief and (incomplete survey of deep learning for activity recognition Nx k Nz P Pooling Y (Output) layer k Ny y Z Nw Feature y Nw layer Ny 18 May 2012 / 3 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 6. TIME SERIES DATA • Time is an integral part of many human behaviours (motion, reasoning) • In building statistical models, time is sometimes ignored, often problematic • Models that do incorporate dynamics fail to account for the fact that data is often high-dimensional, nonlinear, and contains long-range dependencies INTENSITY (No of stories) 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 18 May 2012 / 4 Learning Representations of Sequences / G Taylor Graphic: David McCandless, informationisbeautiful.net Saturday, June 16, 2012
  • 7. TIME SERIES DATA • Time is an integral part of many human behaviours (motion, reasoning) • In building statistical models, time is sometimes ignored, often problematic • Models that do incorporate dynamics fail to account for the fact that data is often high-dimensional, nonlinear, and contains long-range dependencies INTENSITY (No of stories) 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 Today we will discuss a number of models that have been developed to address these challenges 18 May 2012 / 4 Learning Representations of Sequences / G Taylor Graphic: David McCandless, informationisbeautiful.net Saturday, June 16, 2012
  • 8. VECTOR AUTOREGRESSIVE MODELS M vt = b + Am vt−m + et m=1 • Have dominated statistical time-series analysis for approx. 50 years • Can be fit easily by least-squares regression • Can fail even for simple nonlinearities present in the system - but many data sets can be modeled well by a linear system • Well understood; many extensions exist 18 May 2012 / 5 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 9. MARKOV (“N-GRAM”) MODELS vt−2 vt−1 vt • Fully observable • Sequential observations may have nonlinear dependence • Derived by assuming sequences have Markov property: t−1 t−1 p(vt |{v1 }) = p(vt |{vt−N }) • This leads to joint: T N T t−1 p({v1 }) = p({v1 }) p(vt |{vt−N }) t=N +1 • Number of parameters exponential in N ! 18 May 2012 / 6 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 10. HIDDEN MARKOV MODELS (HMM) ht−2 ht−1 ht vt−2 vt−1 vt 18 May 2012 / 7 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 11. HIDDEN MARKOV MODELS (HMM) ht−2 ht−1 ht Introduces a hidden state that controls the dependence of the current observation on the past vt−2 vt−1 vt 18 May 2012 / 7 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 12. HIDDEN MARKOV MODELS (HMM) ht−2 ht−1 ht Introduces a hidden state that controls the dependence of the current observation on the past vt−2 vt−1 vt • Successful in speech language modeling, biology • Defined by 3 sets of parameters: - Initial state parameters, π - Transition matrix, A - Emission distribution, p(vt |ht ) T • Factored joint distribution: p({ht }, {vt }) = p(h1 )p(v1 |h1 ) p(ht |ht−1 )p(vt |ht ) t=2 18 May 2012 / 7 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 13. HMM INFERENCE AND LEARNING • Typically three tasks we want to perform in an HMM: - Likelihood estimation - Inference - Learning • All are exact and tractable due to the simple structure of the model • Forward-backward algorithm for inference (belief propagation) • Baum-Welch algorithm for learning (EM) • Viterbi algorithm for state estimation (max-product) 18 May 2012 / 8 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 14. LIMITATIONS OF HMMS 18 May 2012 / 9 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 15. LIMITATIONS OF HMMS • Many high-dimensional data sets contain rich componential structure 18 May 2012 / 9 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 16. LIMITATIONS OF HMMS • Many high-dimensional data sets contain rich componential structure • Hidden Markov Models cannot model such data efficiently: a single, discrete K-state multinomial must represent the history of the time series 18 May 2012 / 9 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 17. LIMITATIONS OF HMMS • Many high-dimensional data sets contain rich componential structure • Hidden Markov Models cannot model such data efficiently: a single, discrete K-state multinomial must represent the history of the time series • To model K bits of information, they need 2K hidden states 18 May 2012 / 9 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 18. LIMITATIONS OF HMMS • Many high-dimensional data sets contain rich componential structure • Hidden Markov Models cannot model such data efficiently: a single, discrete K-state multinomial must represent the history of the time series • To model K bits of information, they need 2K hidden states 18 May 2012 / 9 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 19. LIMITATIONS OF HMMS • Many high-dimensional data sets contain rich componential structure • Hidden Markov Models cannot model such data efficiently: a single, discrete K-state multinomial must represent the history of the time series • To model K bits of information, they need 2K hidden states 18 May 2012 / 9 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 20. LIMITATIONS OF HMMS • Many high-dimensional data sets contain rich componential structure • Hidden Markov Models cannot model such data efficiently: a single, discrete K-state multinomial must represent the history of the time series • To model K bits of information, they need 2K hidden states • We seek models with distributed hidden state: 18 May 2012 / 9 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 21. LIMITATIONS OF HMMS • Many high-dimensional data sets contain rich componential structure • Hidden Markov Models cannot model such data efficiently: a single, discrete K-state multinomial must represent the history of the time series • To model K bits of information, they need 2K hidden states • We seek models with distributed hidden state: - capacity linear in the number of components 18 May 2012 / 9 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 22. LIMITATIONS OF HMMS • Many high-dimensional data sets contain rich componential structure • Hidden Markov Models cannot model such data efficiently: a single, discrete K-state multinomial must represent the history of the time series • To model K bits of information, they need 2K hidden states • We seek models with distributed hidden state: - capacity linear in the number of components 18 May 2012 / 9 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 23. LINEAR DYNAMICAL SYSTEMS ht−2 ht−1 ht vt−2 vt−1 vt 18 May 2012 / 10 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 24. LINEAR DYNAMICAL SYSTEMS ht−2 ht−1 ht Graphical model is the same as HMM but with real-valued state vectors vt−2 vt−1 vt 18 May 2012 / 10 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 25. LINEAR DYNAMICAL SYSTEMS ht−2 ht−1 ht Graphical model is the same as HMM but with real-valued state vectors vt−2 vt−1 vt • Characterized by linear-Gaussian dynamics and observations: p(ht |ht − 1) = N (ht ; Aht−1 , Q) p(vt |ht ) = N (vt ; Cht , R) • Inference is performed using Kalman smoothing (belief propagation) • Learning can be done by EM • Dynamics, observations may also depend on an observed input (control) 18 May 2012 / 10 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 26. LATENT REPRESENTATIONS FOR REAL-WORLD DATA Data for many real-world problems (e.g. motion capture, finance) is high- dimensional, containing complex non-linear relationships between components Hidden Markov Models Pro: complex, nonlinear emission model Con: single K -state multinomial represents entire history Linear Dynamical Systems Pro: state can convey much more information Con: emission model constrained to be linear 18 May 2012 / 11 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 27. LEARNING DISTRIBUTED REPRESENTATIONS • Simple networks are capable of discovering useful and interesting internal representations of static data • Perhaps the parallel nature of computation in connectionist models may be at odds with the serial nature of temporal events • Simple idea: spatial representation of time - Need a buffer; not biologically plausible - Cannot process inputs of differing length - Cannot distinguish between absolute and relative position • This motivates an implicit representation of time in connectionist models where time is represented by its effect on processing 18 May 2012 / 12 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 28. RECURRENT NEURAL NETWORKS ˆ yt−1 ˆ yt ˆ yt+1 ht−1 ht ht+1 vt−1 vt vt+1 18 May 2012 / 13 Learning Representations of Sequences / G Taylor (Figure from Martens and Sutskever) Saturday, June 16, 2012
  • 29. RECURRENT NEURAL NETWORKS ˆ yt−1 ˆ yt ˆ yt+1 ht−1 ht ht+1 vt−1 vt vt+1 • Neural network replicated in time 18 May 2012 / 13 Learning Representations of Sequences / G Taylor (Figure from Martens and Sutskever) Saturday, June 16, 2012
  • 30. RECURRENT NEURAL NETWORKS ˆ yt−1 ˆ yt ˆ yt+1 ht−1 ht ht+1 vt−1 vt vt+1 • Neural network replicated in time • At each step, receives input vector, updates its internal representation via nonlinear activation functions, and makes a prediction: vt = W hv vt−1 + W hh ht−1 + bh hj,t = e(vj,t ) st = W yh ht + by yk,t ˆ = g(yk,t ) 18 May 2012 / 13 Learning Representations of Sequences / G Taylor (Figure from Martens and Sutskever) Saturday, June 16, 2012
  • 31. TRAINING RECURRENT NEURAL NETWORKS 18 May 2012 / 14 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 32. TRAINING RECURRENT NEURAL NETWORKS • Possibly high-dimensional, distributed, internal representation and nonlinear dynamics allow model, in theory, model complex time series 18 May 2012 / 14 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 33. TRAINING RECURRENT NEURAL NETWORKS • Possibly high-dimensional, distributed, internal representation and nonlinear dynamics allow model, in theory, model complex time series • Exact gradients can be computed exactly via Backpropagation Through Time 18 May 2012 / 14 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 34. TRAINING RECURRENT NEURAL NETWORKS • Possibly high-dimensional, distributed, internal representation and nonlinear dynamics allow model, in theory, model complex time series • Exact gradients can be computed exactly via Backpropagation Through Time • It is an interesting and powerful model. What’s the catch? - Training RNNs via gradient descent fails on simple problems - Attributed to “vanishing” or “exploding” gradients - Much work in the 1990’s focused on identifying and addressing these issues: none of these methods were widely adopted 18 May 2012 / 14 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 35. TRAINING RECURRENT NEURAL NETWORKS • Possibly high-dimensional, distributed, internal representation and nonlinear dynamics allow model, in theory, model complex time series • Exact gradients can be computed exactly via Backpropagation Through Time • It is an interesting and powerful model. What’s the catch? - Training RNNs via gradient descent fails on simple problems - Attributed to “vanishing” or “exploding” gradients - Much work in the 1990’s focused on identifying and addressing these issues: none of these methods were widely adopted 18 May 2012 / 14 Learning Representations of Sequences / G Taylor (Figure adapted from James Martens) Saturday, June 16, 2012
  • 36. TRAINING RECURRENT NEURAL NETWORKS • Possibly high-dimensional, distributed, internal representation and nonlinear dynamics allow model, in theory, model complex time series • Exact gradients can be computed exactly via Backpropagation Through Time • It is an interesting and powerful model. What’s the catch? - Training RNNs via gradient descent fails on simple problems - Attributed to “vanishing” or “exploding” gradients - Much work in the 1990’s focused on identifying and addressing these issues: none of these methods were widely adopted • Best-known attempts to resolve the problem of RNN training: - Long Short-term Memory (LSTM) (Hochreiter and Schmidhuber 1997) - Echo-State Network (ESN) (Jaeger and Haas 2004) 18 May 2012 / 14 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 37. FAILURE OF GRADIENT DESCENT Two hypotheses for why gradient descent fails for NN: 18 May 2012 / 15 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 38. FAILURE OF GRADIENT DESCENT Two hypotheses for why gradient descent fails for NN: • increased frequency and severity of bad local minima 18 May 2012 / 15 Learning Representations of Sequences / G Taylor (Figures from James Martens) Saturday, June 16, 2012
  • 39. FAILURE OF GRADIENT DESCENT Two hypotheses for why gradient descent fails for NN: • increased frequency and severity of bad local minima • pathological curvature, like the type seen in the Rosenbrock function: f (x, y) = (1 − x)2 + 100(y − x2 )2 18 May 2012 / 15 Learning Representations of Sequences / G Taylor (Figures from James Martens) Saturday, June 16, 2012
  • 40. SECOND ORDER METHODS • Model the objective function by the local approximation: T 1 T f (θ + p) ≈ qθ (p) ≡ f (θ) + ∆f (θ) p + p Bp 2 where p is the search direction and B is a matrix which quantifies curvature • In Newton’s method, B is the Hessian matrix, H • By taking the curvature information into account, Newton’s method “rescales” the gradient so it is a much more sensible direction to follow • Not feasible for high-dimensional problems! 18 May 2012 / 16 Learning Representations of Sequences / G Taylor (Figure from James Martens) Saturday, June 16, 2012
  • 41. HESSIAN-FREE OPTIMIZATION Based on exploiting two simple ideas (and some additional “tricks”): 18 May 2012 / 17 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 42. HESSIAN-FREE OPTIMIZATION Based on exploiting two simple ideas (and some additional “tricks”): • For an n-dimensional vector d , the Hessian-vector product Hd can easily be computed using finite differences at the cost of a single extra gradient evaluation - In practice, the R-operator (Perlmutter 1994) is used instead of finite differences 18 May 2012 / 17 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 43. HESSIAN-FREE OPTIMIZATION Based on exploiting two simple ideas (and some additional “tricks”): • For an n-dimensional vector d , the Hessian-vector product Hd can easily be computed using finite differences at the cost of a single extra gradient evaluation - In practice, the R-operator (Perlmutter 1994) is used instead of finite differences • There is a very effective algorithm for optimizing quadratic objectives which requires only Hessian-vector products: linear conjugate-gradient (CG) 18 May 2012 / 17 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 44. HESSIAN-FREE OPTIMIZATION Based on exploiting two simple ideas (and some additional “tricks”): • For an n-dimensional vector d , the Hessian-vector product Hd can easily be computed using finite differences at the cost of a single extra gradient evaluation - In practice, the R-operator (Perlmutter 1994) is used instead of finite differences • There is a very effective algorithm for optimizing quadratic objectives which requires only Hessian-vector products: linear conjugate-gradient (CG) This method was shown to effectively train RNNs in the pathological long-term dependency problems they were previously not able to solve (Martens and Sutskever 2011) 18 May 2012 / 17 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 45. GENERATIVE MODELS WITH DISTRIBUTED STATE 18 May 2012 / 18 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 46. GENERATIVE MODELS WITH DISTRIBUTED STATE • Many sequences are high-dimensional and have complex structure - RNNs simply predict the expected value at the next time step - Cannot capture multi-modality of time series 18 May 2012 / 18 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 47. GENERATIVE MODELS WITH DISTRIBUTED STATE • Many sequences are high-dimensional and have complex structure - RNNs simply predict the expected value at the next time step - Cannot capture multi-modality of time series • Generative models (like Restricted Boltzmann Machines) can express the negative log-likelihood of a given configuration of the output, and can capture complex distributions 18 May 2012 / 18 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
  • 48. GENERATIVE MODELS WITH DISTRIBUTED STATE • Many sequences are high-dimensional and have complex structure - RNNs simply predict the expected value at the next time step - Cannot capture multi-modality of time series • Generative models (like Restricted Boltzmann Machines) can express the negative log-likelihood of a given configuration of the output, and can capture complex distributions • By using binary latent (hidden) state, we gain the best of both worlds: - the nonlinear dynamics and observation model of the HMM without the simple state - the representationally powerful state of the LDS without the linear-Gaussian restriction on dynamics and observations 18 May 2012 / 18 Learning Representations of Sequences / G Taylor Saturday, June 16, 2012
• 49. DISTRIBUTED BINARY HIDDEN STATE
• Using distributed binary representations for hidden state in directed models of time series makes inference difficult. But we can:
  - Use a Restricted Boltzmann Machine (RBM) for the interactions between hidden and visible variables. A factorial posterior makes inference and sampling easy. One typically uses binary logistic units for both visibles and hiddens
  - Treat the visible variables in the previous time slice as additional fixed inputs
[Figure: hidden variables (factors) and visible variables (observations) at time t]
p(h_j = 1 | v) = σ(b_j + Σ_i v_i W_ij)
p(v_i = 1 | h) = σ(b_i + Σ_j h_j W_ij)
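As a small illustration of why the factorial posterior makes inference easy, here is a NumPy sketch of the two conditionals above; the shapes and variable names are illustrative assumptions, not the talk's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_infer_hidden(v, W, b_h):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W_ij); factorial over j."""
    return sigmoid(b_h + v @ W)

def rbm_infer_visible(h, W, b_v):
    """p(v_i = 1 | h) = sigmoid(b_i + sum_j h_j W_ij); factorial over i."""
    return sigmoid(b_v + h @ W.T)

# Toy shapes: 49 visibles (e.g. joint angles), 600 hiddens
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((49, 600))
b_v, b_h = np.zeros(49), np.zeros(600)

v = rng.random(49)                          # one frame of (scaled) data
p_h = rbm_infer_hidden(v, W, b_h)           # exact posterior, no iteration needed
h = (rng.random(600) < p_h).astype(float)   # sample the binary hidden state
```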
• 50. MODELING OBSERVATIONS WITH AN RBM
• So the distributed binary latent (hidden) state of an RBM lets us:
  - Model complex, nonlinear dynamics
  - Easily and exactly infer the latent binary state given the observations
• But RBMs treat data as static (i.i.d.)
[Figure: hidden variables (factors) and visible variables (joint angles) at time t]
• 58. CONDITIONAL RESTRICTED BOLTZMANN MACHINES (Taylor, Hinton and Roweis NIPS 2006, JMLR 2011)
• Start with a Restricted Boltzmann Machine (RBM)
• Add two types of directed connections:
  - Autoregressive connections model short-term, linear structure
  - History can also influence dynamics through the hidden layer
• Conditioning changes neither inference nor learning
[Figure: hidden layer and visible layer, with directed connections from the recent history]
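A minimal NumPy sketch of how the directed connections act as dynamically changing biases; the weight names (A, B, W) and shapes are assumptions made for illustration, not the released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def crbm_dynamic_biases(v_hist, A, B, b_v, b_h):
    """Past frames shift the visible and hidden biases.
    v_hist: concatenated frames t-1..t-N, shape (N*D,)
    A: autoregressive weights, shape (N*D, D); B: past-to-hidden weights, shape (N*D, H)."""
    return b_v + v_hist @ A, b_h + v_hist @ B

def crbm_infer_hidden(v_t, v_hist, W, A, B, b_v, b_h):
    """Given the history and the current frame, the posterior over hiddens is still factorial and exact."""
    _, b_h_dyn = crbm_dynamic_biases(v_hist, A, B, b_v, b_h)
    return sigmoid(b_h_dyn + v_t @ W)      # W has shape (D, H)
```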
• 59. CONTRASTIVE DIVERGENCE LEARNING
[Figure: one step of alternating Gibbs sampling, from the data (iter = 0) to a reconstruction (iter = 1), with the history units held fixed; the weight update compares ⟨v_i h_j⟩_data with ⟨v_i h_j⟩_recon]
• When updating visible and hidden units, we implement directed connections by treating data from previous time steps as a dynamically changing bias
• Inference and learning do not change
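Continuing the hypothetical helpers from the previous sketch, a single CD-1 update might look like the following. This is a sketch under stated assumptions: binary/logistic visibles are used for brevity (real-valued mocap data would use a Gaussian visible layer), and all names are illustrative.

```python
import numpy as np  # reuses sigmoid() and crbm_dynamic_biases() from the sketch above

def crbm_cd1_step(v_t, v_hist, W, A, B, b_v, b_h, lr=1e-3, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    b_v_dyn, b_h_dyn = crbm_dynamic_biases(v_hist, A, B, b_v, b_h)

    # Positive phase: hiddens driven by the data (history acts as a fixed, dynamic bias)
    p_h_data = sigmoid(b_h_dyn + v_t @ W)
    h_sample = (rng.random(p_h_data.shape) < p_h_data).astype(float)

    # Negative phase: one reconstruction and the hiddens it drives
    v_recon = sigmoid(b_v_dyn + h_sample @ W.T)
    p_h_recon = sigmoid(b_h_dyn + v_recon @ W)

    # Updates ~ <v h>_data - <v h>_recon; the directed weights use the same statistics
    W += lr * (np.outer(v_t, p_h_data) - np.outer(v_recon, p_h_recon))
    A += lr * np.outer(v_hist, v_t - v_recon)
    B += lr * np.outer(v_hist, p_h_data - p_h_recon)
    b_v += lr * (v_t - v_recon)
    b_h += lr * (p_h_data - p_h_recon)
    return W, A, B, b_v, b_h
```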
• 64. STACKING: THE CONDITIONAL DEEP BELIEF NETWORK
• Learn a CRBM
• Now, treat the sequence of hidden units as "fully observed" data and train a second CRBM
• The composition of CRBMs is a conditional deep belief net
• It can be fine-tuned generatively or discriminatively
[Figure: visible layer, first hidden layer (h^0_{t-2}, h^0_{t-1}, h^0_t) and second hidden layer h^1_t]
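The stacking recipe is essentially a short loop: train a CRBM, infer its hidden sequence, and feed that sequence to the next CRBM. A schematic sketch, assuming a hypothetical CRBM class with fit/infer methods (not the talk's actual code):

```python
def train_conditional_dbn(frames, layer_sizes, order=2):
    """frames: array (T, D) of observations; layer_sizes: hidden-layer sizes, bottom to top."""
    data, layers = frames, []
    for n_hidden in layer_sizes:
        crbm = CRBM(n_visible=data.shape[1], n_hidden=n_hidden, order=order)
        crbm.fit(data)              # CD training, conditioning on `order` past frames
        data = crbm.infer(data)     # hidden sequence becomes "fully observed" data for the next layer
        layers.append(crbm)
    return layers                   # the stack of CRBMs is a conditional deep belief net
```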
• 65. MOTION SYNTHESIS WITH A 2-LAYER CDBN
• Model is trained on ~8000 frames of 60 fps data (49 dimensions)
• 10 styles of walking: cat, chicken, dinosaur, drunk, gangly, graceful, normal, old-man, sexy and strong
• 600 binary hidden units per layer
• 1 hour of training on a modern workstation
[Figure: the two-layer conditional DBN used for synthesis]
• 68. MODELING CONTEXT
• A single model was trained on 10 "styled" walks from CMU subject 137
• The model can generate each style based on initialization
• We can neither prevent nor control transitions between styles
• How to blend styles?
• Style or person labels can be provided as part of the input to the top layer
[Figure: labels feeding into the top hidden layer of the two-layer conditional model]
• 69. MULTIPLICATIVE INTERACTIONS
• Let latent variables act like gates that dynamically change the connections between other variables
• This amounts to letting variables multiply connections between other variables: three-way multiplicative interactions
• Recently used in the context of learning correspondences between images (Memisevic & Hinton 2007, 2010), but with a long history before that
[Figure: a three-way interaction among v_i, h_j and z_k]
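One way to read a three-way interaction is that the gating variables select an effective pairwise weight matrix. A small NumPy sketch of that reading (the tensor W and all sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 20, 30, 10                  # sizes of v, h, and the gating variables z
W = 0.01 * rng.standard_normal((I, J, K))

z = rng.random(K)                     # gating / context variables
W_eff = np.einsum('ijk,k->ij', W, z)  # z "blends in" an effective pairwise v-h weight matrix

v = rng.random(I)
h_input = v @ W_eff                   # net input to each h_j under this setting of z
# Equivalently, the three-way term in the energy is -sum_ijk W_ijk v_i h_j z_k
```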
• 70. GATED RESTRICTED BOLTZMANN MACHINES (GRBM)
• Two views (Memisevic & Hinton 2007)
[Figure: two equivalent views of the GRBM, with latent variables h_j gating the mapping from the input v_i to the output z_k]
• 71. INFERRING OPTICAL FLOW: IMAGE "ANALOGIES"
• Toy images (Memisevic & Hinton 2006)
• No structure in these images, only in how they change
• Can infer optical flow from a pair of images and apply it to a random image
[Figure: columns (left to right): input images; output images; inferred flowfields; random target images; inferred transformation applied to the target images. For the transformations (last column), gray values represent the probability that a pixel is "on" according to the model, from black (0) to white (1)]
• 72. BACK TO MOTION STYLE
• Introduce a set of latent "context" variables whose value is known at training time
• In our example, these represent "motion style" but could also represent height, weight, gender, etc.
• The contextual variables gate every existing pairwise connection in our model
[Figure: the context variables gating the pairwise connections among v_i, h_j and z_k]
• 73. LEARNING AND INFERENCE
• Learning and inference remain almost the same as in the standard CRBM
• We can think of the context or style variables as "blending in" a whole "sub-network"
• This allows us to share parameters across styles but selectively adapt dynamics
[Figure: the gated model, with style variables modulating the v-h connections]
• 77. SUPERVISED MODELING OF STYLE (Taylor, Hinton and Roweis ICML 2009, JMLR 2011)
[Figure: the CRBM, with an input layer (data at time t-1:t-N), an output layer (data at time t) and a hidden layer, augmented with a layer of style features that gate the connections]
• 81. OVERPARAMETERIZATION
• Note: the weight matrix W^{v,h} has been replaced by a tensor W^{v,h,z}! (Likewise for the other weights)
• The number of parameters is O(N^3) per group of weights
• More, if we want sparse, overcomplete hiddens
• However, there is a simple yet powerful solution!
[Figure: hidden layer, style features, input layer (data at time t-1:t-N) and output layer (data at time t)]
• 82. FACTORING
W^{vh}_{ijl} = Σ_f W^v_{if} W^h_{jf} W^z_{lf}
[Figure: the hidden layer (j), style features (l) and output layer (data at time t, i) connect through a set of factors f rather than a full three-way tensor. Figure adapted from Roland Memisevic]
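A NumPy sketch of the factored form and the parameter savings it buys. The sizes are illustrative, borrowed from the motion-synthesis slide below (49 visibles, 600 hiddens, 100 style features, 200 factors); the names are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, L, F = 49, 600, 100, 200        # visibles, hiddens, style features, factors

W_v = 0.01 * rng.standard_normal((I, F))
W_h = 0.01 * rng.standard_normal((J, F))
W_z = 0.01 * rng.standard_normal((L, F))

# Full tensor (never formed in practice): W_ijl = sum_f Wv_if Wh_jf Wz_lf
# W = np.einsum('if,jf,lf->ijl', W_v, W_h, W_z)   # I*J*L = 2,940,000 parameters

# The factored form stores only (I + J + L) * F = 149,800 parameters.
z = rng.random(L)                                  # style features for this sequence
v = rng.random(I)
h_input = ((v @ W_v) * (z @ W_z)) @ W_h.T          # = sum_{i,l} v_i z_l W_ijl, computed per factor
```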
• 90. SUPERVISED MODELING OF STYLE (Taylor, Hinton and Roweis ICML 2009, JMLR 2011)
[Figure: the same model with its three-way connections factored: hidden layer, style features, input layer (data at time t-1:t-N) and output layer (data at time t) now connect through a layer of factors]
• 91. PARAMETER SHARING
[Figure-only slide]
• 92. MOTION SYNTHESIS: FACTORED 3RD-ORDER CRBM
• Same 10-styles dataset
• 600 binary hidden units
• 3×200 deterministic factors
• 100 real-valued style features
• 1 hour of training on a modern workstation
• Synthesis is real-time