Sebastian Nowozin and Christoph Lampert   –   Structured Models in Computer Vision   –   Part 4. Conditional Random Fields




                        Part 4: Conditional Random Fields

                       Sebastian Nowozin and Christoph H. Lampert


                                 Colorado Springs, 25th June 2011




                                                                                                                    1 / 39



 Problem (Probabilistic Learning)
 Let d(y|x) be the (unknown) true conditional distribution.
 Let D = {(x1 , y 1 ), . . . , (xN , y N )} be i.i.d. samples from d(x, y).
        Find a distribution p(y|x) that we can use as a proxy for d(y|x).
 or
        Given a parametrized family of distributions, p(y|x, w), find the
        parameter w∗ making p(y|x, w) closest to d(y|x).


 Open questions:
        What do we mean by closest?
        What’s a good candidate for p(y|x, w)?
        How to actually find w∗ ?
                conceptually, and
                numerically
                                                                                                                    2 / 39




 Principle of Parsimony (aka Occam’s razor)
 “Pluralitas non est ponenda sine neccesitate.”
                                                                                           William of Ockham

 “We are to admit no more causes of natural things than such as are both
 true and sufficient to explain their appearances.”
                                                                                                    Isaac Newton

 “Make everything as simple as possible, but not simpler.”
                                                                                                  Albert Einstein

 “Use the simplest explanation that covers all the facts.”
                                                                                                   what we’ll use




                                                                                                                    3 / 39


        1) Define what aspects we consider relevant facts about the data.
        2) Pick the simplest distribution reflecting that.

 Definition (Simplicity ≡ Entropy)
 The simplicity of a distribution p is given by its entropy:

                                      H(p) = − Σ_{z∈Z} p(z) log p(z)


 Definition (Relevant Facts ≡ Feature Functions)
 By φi : Z → R for i = 1, . . . , D we denote a set of feature functions that
 express everything we want to be able to model about our data.

 For example:                   the grayvalue of a pixel,
                                a bag-of-words histogram of an image,
                                the time of day an image was taken,
                                a flag if a pixel is darker than half of its neighbors.
                                                                                                                    4 / 39



 Principle (Maximum Entropy Principle)
 Let z^1, . . . , z^N be samples from a distribution d(z). Let φ_1, . . . , φ_D be
 feature functions, and denote by µ_i := (1/N) Σ_n φ_i(z^n) their means over the
 sample set.
 The maximum entropy distribution, p, is the solution to

                 max_{p is a prob. distr.}  H(p)      subject to      E_{z∼p(z)}[φ_i(z)] = µ_i .
                 (be as simple as possible)                           (be faithful to what we know)

 Theorem (Exponential Family Distribution)
 Under some very reasonable conditions, the maximum entropy distribution
 has the form
                       p(z) = (1/Z) exp( Σ_i w_i φ_i(z) )

 for some parameter vector w = (w1 , . . . , wD ) and constant Z.

                                                                                                                    5 / 39
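
 To make the theorem concrete: the weights w of such a maximum-entropy / exponential-family model over a small finite Z can be fitted by plain gradient ascent until the model's feature expectations match the empirical means µ_i. A minimal sketch, assuming NumPy; the toy sample set and the two feature functions are illustrative choices, not from the slides:

```python
import numpy as np

# Toy setup: Z = {0, 1, 2, 3}, two feature functions phi_1(z) = z, phi_2(z) = [z is even].
Z = np.arange(4)
phi = np.stack([Z.astype(float), (Z % 2 == 0).astype(float)], axis=1)  # shape (|Z|, D)

samples = np.array([0, 0, 1, 2, 2, 2, 3])           # i.i.d. samples z^1, ..., z^N
mu = phi[samples].mean(axis=0)                      # empirical feature means mu_i

w = np.zeros(2)
for _ in range(5000):
    logits = phi @ w
    p = np.exp(logits - logits.max())
    p /= p.sum()                                    # p(z) = exp(<w, phi(z)>) / Z(w)
    grad = mu - phi.T @ p                           # gradient of the log-likelihood: mu - E_p[phi]
    w += 0.1 * grad                                 # gradient ascent step

print("fitted w:", w)
print("model feature means:", phi.T @ p, "target:", mu)
```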




 Example:
        Let Z = R, φ1 (z) = z, φ2 (z) = z 2 .
        The exponential family distribution is
                    p(z) = (1/Z(w)) exp( w_1 z + w_2 z² )
                         = (1/Z(a, b)) exp( a (z − b)² )      for a = w_2 ,  b = −w_1/(2 w_2).


         It’s a Gaussian!

        Given examples z 1 , . . . , z N , we can compute a and b, and derive w.



                                                                                                                    6 / 39



 Example:
        Let Z = {1, . . . , K}, φ_k(z) = ⟦z = k⟧, for k = 1, . . . , K.
        The exponential family distribution is

                               p(z) = (1/Z(w)) exp( Σ_k w_k φ_k(z) )

                                    = exp(w_z)/Z ,   i.e.  exp(w_1)/Z for z = 1,  exp(w_2)/Z for z = 2,  . . . ,  exp(w_K)/Z for z = K,

                                      with Z = exp(w_1) + · · · + exp(w_K).


         It’s a Multinomial!

                                                                                                                     7 / 39




 Example:
        Let Z = {0, 1}^{N×M} (an image grid),
        φ_i(y) := y_i for each pixel i,
        φ_{NM+1}(y) := Σ_{i∼j} ⟦y_i = y_j⟧ (summing over all 4-neighbor pairs).
        The exponential family distribution is

                          p(y) = (1/Z(w)) exp( ⟨w, φ(y)⟩ + w̃ Σ_{i∼j} ⟦y_i = y_j⟧ )



         It’s a (binary) Markov Random Field!




                                                                                                                      8 / 39


Conditional Random Field Learning

 Assume:
     a set of i.i.d. samples D = {(xn , y n )}n=1,...,N , (xn , y n ) ∼ d(x, y)
     feature functions ( φ1 (x, y), . . . , φD (x, y) ) ≡: φ(x, y)
      parametrized family p(y|x, w) = (1/Z(x, w)) exp( ⟨w, φ(x, y)⟩ )

 Task:
     adjust w of p(y|x, w) based on D.

 Many possible techniques to do so:
    Expectation Matching
    Maximum Likelihood
    Best Approximation
    MAP estimation of w

 Punchline: they all turn out to be (almost) the same!
                                                                                                                    9 / 39




 Maximum Likelihood Parameter Estimation
 Idea: maximize the conditional likelihood of observing outputs y^1, . . . , y^N for inputs x^1, . . . , x^N

                         w∗ = argmax_{w∈R^D}  p(y^1, . . . , y^N | x^1, . . . , x^N, w)

                            = argmax_{w∈R^D}  Π_{n=1}^N p(y^n | x^n, w)                         (i.i.d.)

                            = argmin_{w∈R^D}  − Σ_{n=1}^N log p(y^n | x^n, w)                   (take − log(·))

                                               ↑ negative conditional log-likelihood (of D)




                                                                                                                   10 / 39



 Best Approximation
 Idea: find p(y|x, w) that is closest to d(y|x)

 Definition (Similarity between conditional distributions)
 For fixed x ∈ X, the KL divergence measures the similarity between the conditional distributions:

                          KL_cond(p‖d)(x) := Σ_{y∈Y} d(y|x) log [ d(y|x) / p(y|x, w) ]

 For x ∼ d(x), compute the expectation:

                          KL_tot(p‖d) := E_{x∼d(x)} KL_cond(p‖d)(x)

                                       = Σ_{x∈X} Σ_{y∈Y} d(x, y) log [ d(y|x) / p(y|x, w) ]




                                                                                                                   11 / 39




 Best Approximation
 Idea: find p(y|x, w) of minimal KLtot -distance to d(y|x)


                               w∗ = argmin_{w∈R^D}  Σ_{x∈X} Σ_{y∈Y} d(x, y) log [ d(y|x) / p(y|x, w) ]

                                  = argmin_{w∈R^D}  − Σ_{(x,y)∈X×Y} d(x, y) log p(y|x, w)          (drop const.)

                                  ≈ argmin_{w∈R^D}  − Σ_{n=1}^N log p(y^n | x^n, w)                (using (x^n, y^n) ∼ d(x, y))

                                     ↑ negative conditional log-likelihood (of D)




                                                                                                                   12 / 39

 MAP Estimation of w
 Idea: treat w as a random variable; maximize the posterior probability p(w|D)

         p(w|D)  (Bayes)  =  p(x^1, y^1, . . . , x^N, y^N | w) p(w) / p(D)

                 (i.i.d.) =  p(w) Π_{n=1}^N [ p(y^n | x^n, w) / p(y^n | x^n) ]

 p(w): prior belief on w (cannot be estimated from data).


             w∗ = argmax_{w∈R^D} p(w|D) = argmin_{w∈R^D} − log p(w|D)

                = argmin_{w∈R^D}  − log p(w) − Σ_{n=1}^N log p(y^n | x^n, w) + Σ_{n=1}^N log p(y^n | x^n)
                                                                               (last term indep. of w)

                = argmin_{w∈R^D}  − log p(w) − Σ_{n=1}^N log p(y^n | x^n, w)

                                                                                                                   13 / 39



                      w∗ = argmin_{w∈R^D}  − log p(w) − Σ_{n=1}^N log p(y^n | x^n, w)


 Choices for p(w):

        p(w) :≡ const.           (uniform; on R^D not really a distribution)

                        w∗ = argmin_{w∈R^D}  − Σ_{n=1}^N log p(y^n | x^n, w)  + const.

                              ↑ negative conditional log-likelihood

        p(w) := const. · e^{−‖w‖² / (2σ²)}        (Gaussian)

                        w∗ = argmin_{w∈R^D}  (1/(2σ²)) ‖w‖² − Σ_{n=1}^N log p(y^n | x^n, w)  + const.

                              ↑ regularized negative conditional log-likelihood
                                                                                                                       14 / 39


 Probabilistic Models for Structured Prediction - Summary
 Negative (Regularized) Conditional Log-Likelihood (of D)


            L(w) = (1/(2σ²)) ‖w‖²  −  Σ_{n=1}^N [ ⟨w, φ(x^n, y^n)⟩ − log Σ_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]

 (σ² → ∞ makes it unregularized)

 Probabilistic parameter estimation or training means solving

                                              w∗ = argmin_{w∈R^D} L(w).



  Same optimization problem as for multi-class logistic regression.

                                                                                                                    15 / 39
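
 For a tiny output space Y, the objective L(w) above can be evaluated by brute-force enumeration, which is a useful sanity check before any structured inference is involved. A minimal sketch, assuming NumPy/SciPy; the toy feature map `phi` and the data are hypothetical:

```python
import numpy as np
from scipy.special import logsumexp

def neg_cond_log_lik(w, data, phi, Y, sigma2=1.0):
    """L(w) = ||w||^2/(2 sigma^2) - sum_n [ <w, phi(x_n, y_n)> - log sum_y exp <w, phi(x_n, y)> ]."""
    L = np.dot(w, w) / (2.0 * sigma2)
    for x, y in data:
        scores = np.array([np.dot(w, phi(x, yy)) for yy in Y])   # <w, phi(x, y')> for all y'
        L -= np.dot(w, phi(x, y)) - logsumexp(scores)            # log Z(x, w) via log-sum-exp
    return L

# Toy example: x in R^2, Y = {0, 1}, phi(x, y) = x if y == 1 else -x (illustrative choice).
phi = lambda x, y: x if y == 1 else -x
Y = [0, 1]
data = [(np.array([1.0, 0.5]), 1), (np.array([-0.3, 1.2]), 0)]
print(neg_cond_log_lik(np.array([0.2, -0.1]), data, phi, Y))
```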


 Negative Conditional Log-Likelihood (Toy Example)

 [Figure: contour plots of the negative conditional log-likelihood L(w) over a 2D parameter space,
  for four regularization settings: σ² = 0.01, σ² = 0.10, σ² = 1.00, and σ² → ∞ (unregularized).]

                                                                                                                                                                                                                            16 / 39




 Steepest Descent Minimization – minimize L(w)
 input tolerance ε > 0
   1: w_cur ← 0
   2: repeat
   3:   v ← ∇_w L(w_cur)
   4:   η ← argmin_{η∈R} L(w_cur − ηv)
   5:   w_cur ← w_cur − ηv
   6: until ‖v‖ < ε
 output w_cur


 Alternatives:
        L-BFGS (second-order descent without explicit Hessian)
        Conjugate Gradient
 We always need (at least) the gradient of L.

                                                                                                                   17 / 39
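
 A minimal sketch of this descent loop in Python, assuming NumPy; a backtracking line search stands in for the exact argmin over η, and the convex quadratic objective is only a stand-in for L(w):

```python
import numpy as np

def steepest_descent(L, grad_L, w0, tol=1e-6, max_iter=1000):
    w = w0.copy()
    for _ in range(max_iter):
        v = grad_L(w)
        if np.linalg.norm(v) < tol:               # "until ||v|| < epsilon"
            break
        eta = 1.0
        while L(w - eta * v) > L(w) - 0.5 * eta * np.dot(v, v):
            eta *= 0.5                            # backtracking instead of the exact argmin_eta
        w = w - eta * v
    return w

# Toy objective: L(w) = 0.5 * w^T A w - b^T w (convex quadratic), minimum at A w = b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
L = lambda w: 0.5 * w @ A @ w - b @ w
grad_L = lambda w: A @ w - b
print(steepest_descent(L, grad_L, np.zeros(2)))   # approx. np.linalg.solve(A, b)
# In practice one could instead call e.g. scipy.optimize.minimize(L, w0, jac=grad_L, method="L-BFGS-B").
```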




            L(w) = (1/(2σ²)) ‖w‖²  −  Σ_{n=1}^N [ ⟨w, φ(x^n, y^n)⟩ − log Σ_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]


       ∇_w L(w) = (1/σ²) w − Σ_{n=1}^N [ φ(x^n, y^n) − ( Σ_{y∈Y} e^{⟨w, φ(x^n, y)⟩} φ(x^n, y) ) / ( Σ_{ȳ∈Y} e^{⟨w, φ(x^n, ȳ)⟩} ) ]

                = (1/σ²) w − Σ_{n=1}^N [ φ(x^n, y^n) − Σ_{y∈Y} p(y|x^n, w) φ(x^n, y) ]

                = (1/σ²) w − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]


      ∇²_w L(w) = (1/σ²) Id_{D×D} + Σ_{n=1}^N ( E_{y∼p(y|x^n,w)}[ φ(x^n, y) φ(x^n, y)ᵀ ] − E_{y∼p(y|x^n,w)}[ φ(x^n, y) ] E_{y∼p(y|x^n,w)}[ φ(x^n, y) ]ᵀ )
                                                                                                                    18 / 39
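
 On toy problems where Y can be enumerated, the gradient formula above can be implemented directly and verified against finite differences of L; a minimal sketch, reusing the hypothetical setup from the previous snippet:

```python
import numpy as np
from scipy.special import logsumexp

phi = lambda x, y: x if y == 1 else -x            # toy joint feature map, Y = {0, 1}
Y = [0, 1]
data = [(np.array([1.0, 0.5]), 1), (np.array([-0.3, 1.2]), 0)]
sigma2 = 1.0

def L(w):
    val = np.dot(w, w) / (2 * sigma2)
    for x, y in data:
        scores = np.array([np.dot(w, phi(x, yy)) for yy in Y])
        val -= np.dot(w, phi(x, y)) - logsumexp(scores)
    return val

def grad_L(w):
    g = w / sigma2
    for x, y in data:
        scores = np.array([np.dot(w, phi(x, yy)) for yy in Y])
        p = np.exp(scores - logsumexp(scores))                        # p(y'|x, w)
        expected_phi = sum(p[i] * phi(x, yy) for i, yy in enumerate(Y))
        g -= phi(x, y) - expected_phi                                 # observed minus expected features
    return g

w = np.array([0.2, -0.1])
eps, e0 = 1e-6, np.eye(2)[0]
print(grad_L(w)[0], (L(w + eps * e0) - L(w - eps * e0)) / (2 * eps))  # the two numbers should match
```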




            L(w) = (1/(2σ²)) ‖w‖²  −  Σ_{n=1}^N [ ⟨w, φ(x^n, y^n)⟩ − log Σ_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]




        L is C∞-differentiable on all of R^D.




                                                                                                                    19 / 39




       ∇_w L(w) = (1/σ²) w − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]


        For σ → ∞ (no regularization):

                  E_{y∼p(y|x^n,w)} φ(x^n, y) = φ(x^n, y^n)  for n = 1, . . . , N      ⇒      ∇_w L(w) = 0,

        i.e. a critical point of L (local minimum/maximum/saddle point).


 Interpretation:
        We aim for expectation matching: E_{y∼p} φ(x, y) = φ(x, y^obs),
        but discriminatively: only for x ∈ {x^1, . . . , x^N}.


                                                                                                                   20 / 39




      ∇²_w L(w) = (1/σ²) Id_{D×D} + Σ_{n=1}^N ( E_{y∼p(y|x^n,w)}[ φ(x^n, y) φ(x^n, y)ᵀ ] − E_{y∼p(y|x^n,w)}[ φ(x^n, y) ] E_{y∼p(y|x^n,w)}[ φ(x^n, y) ]ᵀ )

        (the second term is a sum of covariance matrices, hence positive semi-definite)

        positive definite Hessian matrix → L(w) is convex
        → ∇_w L(w) = 0 implies a global minimum.




                                                                                                                     21 / 39




 Milestone I: Probabilistic Training (Conditional Random Fields)
        p(y|x, w) log-linear in w ∈ RD .
        Training: many probabilistic derivations lead to same optimization
        problem → minimize negative conditional log-likelihood, L(w)
        L(w) is differentiable and convex,
        → gradient descent will find the global optimum, where ∇_w L(w) = 0
        Same structure as multi-class logistic regression.


  For logistic regression: this is where the textbook ends. We're done.

  For conditional random fields: we’re not in safe waters, yet!




                                                                                                                   22 / 39




  Task: compute v = ∇_w L(w_cur), evaluate L(w_cur + ηv):

            L(w) = (1/(2σ²)) ‖w‖²  −  Σ_{n=1}^N [ ⟨w, φ(x^n, y^n)⟩ − log Σ_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]

       ∇_w L(w) = (1/σ²) w − Σ_{n=1}^N [ φ(x^n, y^n) − Σ_{y∈Y} p(y|x^n, w) φ(x^n, y) ]

  Problem: Y typically is very (exponentially) large:
         binary image segmentation: |Y| = 2^{640×480} ≈ 10^{92475}
         ranking N images: |Y| = N!, e.g. N = 1000: |Y| ≈ 10^{2568}.

   We must use the structure in Y, or we're lost.



                                                                                                                       23 / 39

 Solving the Training Optimization Problem Numerically

       ∇_w L(w) = (1/σ²) w − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]

 Computing the Gradient (naive): O(K^M N D)

            L(w) = (1/(2σ²)) ‖w‖² − Σ_{n=1}^N [ ⟨w, φ(x^n, y^n)⟩ − log Z(x^n, w) ]

 Line Search (naive): O(K^M N D) per evaluation of L

        N : number of samples
        D : dimension of feature space
        M : number of output nodes ≈ 100s to 1,000,000s
        K : number of possible labels of each output node ≈ 2 to 100s
                                                                                                                   24 / 39

 Probabilistic Inference to the Rescue
 Remember: in a graphical model with factors F, the features decompose over the factors:

                            φ(x, y) = ( φ_F(x, y_F) )_{F∈F}

                  E_{y∼p(y|x,w)} φ(x, y) = ( E_{y∼p(y|x,w)} φ_F(x, y_F) )_{F∈F}
                                         = ( E_{y_F∼p(y_F|x,w)} φ_F(x, y_F) )_{F∈F}

             E_{y_F∼p(y_F|x,w)} φ_F(x, y_F) = Σ_{y_F∈Y_F} p(y_F|x, w) φ_F(x, y_F)
                                              (K^{|F|} terms; the p(y_F|x, w) are the factor marginals)

 Factor marginals µ_F = p(y_F|x, w)
     are much smaller than the complete joint distribution p(y|x, w),
     can be computed/approximated, e.g., with (loopy) belief propagation.
                                                                                                                    25 / 39
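
 For chain-structured models the factor marginals can be computed exactly with the forward-backward algorithm (sum-product belief propagation on a chain), at cost O(M K²) per example. A minimal sketch, assuming the log-potentials ⟨w, φ_F(x, y_F)⟩ have already been evaluated into NumPy arrays:

```python
import numpy as np
from scipy.special import logsumexp

def chain_marginals(unary, pairwise):
    """Exact factor marginals of a chain CRF via forward-backward (sum-product BP).

    unary:    (M, K) log-potentials for the node factors
    pairwise: (M-1, K, K) log-potentials for the edge factors (i, i+1)
    Returns node marginals (M, K), edge marginals (M-1, K, K), and log Z.
    """
    M, K = unary.shape
    alpha = np.zeros((M, K))                  # forward log-messages
    beta = np.zeros((M, K))                   # backward log-messages
    alpha[0] = unary[0]
    for i in range(1, M):
        alpha[i] = unary[i] + logsumexp(alpha[i - 1][:, None] + pairwise[i - 1], axis=0)
    for i in range(M - 2, -1, -1):
        beta[i] = logsumexp(pairwise[i] + (unary[i + 1] + beta[i + 1])[None, :], axis=1)
    log_Z = logsumexp(alpha[-1])
    node = np.exp(alpha + beta - log_Z)       # p(y_i = k | x, w)
    edge = np.exp(alpha[:-1, :, None] + pairwise
                  + (unary[1:] + beta[1:])[:, None, :] - log_Z)   # p(y_i, y_{i+1} | x, w)
    return node, edge, log_Z

# Toy check: 3 nodes, 2 labels; every marginal must sum to 1.
rng = np.random.default_rng(0)
node, edge, log_Z = chain_marginals(rng.normal(size=(3, 2)), rng.normal(size=(2, 2, 2)))
print(node.sum(axis=1), edge.reshape(2, -1).sum(axis=1))
```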

 Solving the Training Optimization Problem Numerically

       ∇_w L(w) = (1/σ²) w − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]

 Computing the Gradient: O(M K^{|F_max|} N D)   (instead of O(K^M N D))

            L(w) = (1/(2σ²)) ‖w‖² − Σ_{n=1}^N [ ⟨w, φ(x^n, y^n)⟩ − log Σ_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]

 Line Search: O(M K^{|F_max|} N D) per evaluation of L   (instead of O(K^M N D))

        N : number of samples ≈ 10s to 1,000,000s
        D : dimension of feature space
        M : number of output nodes
        K : number of possible labels of each output node
                                                                                                                    26 / 39

 What if the training set D is too large (e.g. millions of examples)?

 Simplify Model

        Train a simpler model (smaller factors).
 Less expressive power ⇒ results might get worse rather than better.

 Subsampling

        Create a random subset D′ ⊂ D. Train the model using D′.
 Ignores all information in D \ D′.

 Parallelize
        Train multiple models in parallel. Merge the models.
 Follows the multi-core/GPU trend.
 How to actually merge?
 Doesn't really save computation.
                                                                                                                   27 / 39

 What if the training set D is too large (e.g. millions of examples)?

 Stochastic Gradient Descent (SGD)

        Keep maximizing p(w|D).
        In each gradient descent step:
                Create a random subset D′ ⊂ D,                      ← often just 1–3 elements!
                Follow the approximate gradient

                    ∇̃_w L(w) = (1/σ²) w − Σ_{(x^n,y^n)∈D′} [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]


        Line search is no longer possible. Extra parameter: stepsize η.
        SGD converges to argmin_w L(w)! (if η is chosen right)
        SGD needs more iterations, but each one is much faster.

 more: see L. Bottou, O. Bousquet: "The Tradeoffs of Large Scale Learning", NIPS 2008.
 also: http://leon.bottou.org/research/largescale
                                                                                                                   28 / 39
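
 A minimal sketch of such an SGD loop, assuming NumPy; the fixed step size η, the batch size, and the toy per-example gradient `grad_single` are illustrative assumptions (in practice the data term is often rescaled by N/|D′| and η is decayed over time):

```python
import numpy as np

def sgd_train(data, grad_single, dim, sigma2=1.0, eta=0.01, batch_size=1, epochs=10, seed=0):
    """Minimize L(w) with stochastic gradients over random mini-batches D' of D."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    n = len(data)
    for _ in range(epochs):
        for _ in range(n // batch_size):
            batch = [data[i] for i in rng.integers(0, n, size=batch_size)]
            g = w / sigma2                       # gradient of the regularizer
            for x, y in batch:
                g -= grad_single(w, x, y)        # phi(x, y) - E_{y ~ p(y|x,w)} phi(x, y)
            w -= eta * g                         # fixed step size instead of a line search
    return w

# Tiny usage with the toy model from the earlier snippets: Y = {0, 1}, phi(x, y) = x if y == 1 else -x.
def grad_single(w, x, y, phi=lambda x, y: x if y == 1 else -x, Y=(0, 1)):
    scores = np.array([np.dot(w, phi(x, yy)) for yy in Y])
    p = np.exp(scores - scores.max()); p /= p.sum()
    return phi(x, y) - sum(p[i] * phi(x, yy) for i, yy in enumerate(Y))

data = [(np.array([1.0, 0.5]), 1), (np.array([-0.3, 1.2]), 0)]
print(sgd_train(data, grad_single, dim=2, epochs=200))
```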

 Solving the Training Optimization Problem Numerically

       ∇_w L(w) = (1/σ²) w − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]

 Computing the Gradient: O(M K² N D) (if BP is possible)   (instead of O(K^M N D))

            L(w) = (1/(2σ²)) ‖w‖² − Σ_{n=1}^N [ ⟨w, φ(x^n, y^n)⟩ − log Σ_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]

 Line Search: O(M K² N D) per evaluation of L   (instead of O(K^M N D))

        N : number of samples
        D : dimension of feature space: φ_{i,j} ≈ 1–10s, φ_i ≈ 100s to 10,000s
        M : number of output nodes
        K : number of possible labels of each output node
                                                                                                                    29 / 39



  Typical feature functions in image segmentation:


        φ_i(y_i, x) ∈ R^{≈1000}: local image features, e.g. bag-of-words
           → ⟨w_i, φ_i(y_i, x)⟩: local classifier (like logistic regression)

        φ_{i,j}(y_i, y_j) = ⟦y_i = y_j⟧ ∈ R: test for same label
           → ⟨w_{ij}, φ_{ij}(y_i, y_j)⟩: penalizer for label changes (if w_{ij} > 0)

        combined: argmax_y p(y|x) is a smoothed version of the local cues




                   original                   local classification             local + smoothness

                                                                                                                   30 / 39
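
 A minimal sketch of how such a score ⟨w, φ(x, y)⟩ decomposes over a 4-connected grid with these two kinds of feature functions, assuming NumPy; the feature dimensions and the random data are purely illustrative:

```python
import numpy as np

def grid_score(unary_feats, labels, w_unary, w_pair):
    """<w, phi(x, y)> for a 4-connected grid with per-pixel features and a same-label pairwise term.

    unary_feats: (H, W, F) local features phi_i(x)       (e.g. bag-of-words descriptors)
    labels:      (H, W) integer labeling y with values in {0, ..., K-1}
    w_unary:     (K, F) one weight vector per label
    w_pair:      scalar weight on the indicator [y_i = y_j]
    """
    H, W, _ = unary_feats.shape
    score = 0.0
    for i in range(H):
        for j in range(W):
            score += w_unary[labels[i, j]] @ unary_feats[i, j]        # <w_i, phi_i(y_i, x)>
    same_right = labels[:, :-1] == labels[:, 1:]                      # horizontal neighbor pairs
    same_down = labels[:-1, :] == labels[1:, :]                       # vertical neighbor pairs
    score += w_pair * (same_right.sum() + same_down.sum())            # w_ij * [y_i = y_j]
    return score

# Tiny usage example with random data: a 4x5 grid, 3 labels, 16-dimensional local features.
rng = np.random.default_rng(0)
print(grid_score(rng.normal(size=(4, 5, 16)), rng.integers(0, 3, size=(4, 5)),
                 rng.normal(size=(3, 16)), w_pair=0.5))
```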




 Typical feature functions in handwriting recognition:


        φ_i(y_i, x) ∈ R^{≈1000}: image representation (pixels, gradients)
           → ⟨w_i, φ_i(y_i, x)⟩: local classifier for whether x_i is the letter y_i

        φ_{i,j}(y_i, y_j) = e_{y_i} ⊗ e_{y_j} ∈ R^{26·26}: letter/letter indicator
           → ⟨w_{ij}, φ_{ij}(y_i, y_j)⟩: encourage/suppress letter combinations

        combined: argmax_y p(y|x) is a "corrected" version of the local cues




                    local classification                              local + ”correction”


                                                                                                                   31 / 39


 Typical feature functions in pose estimation:


        φ_i(y_i, x) ∈ R^{≈1000}: local image representation, e.g. HoG
           → ⟨w_i, φ_i(y_i, x)⟩: local confidence map

        φ_{i,j}(y_i, y_j) = goodfit(y_i, y_j) ∈ R: test for geometric fit
           → ⟨w_{ij}, φ_{ij}(y_i, y_j)⟩: penalizer for unrealistic poses

        together: argmax_y p(y|x) is a sanitized version of the local cues




                    original                   local classification              local + geometry
 [V. Ferrari, M. Marin-Jimenez, A. Zisserman: ”Progressive Search Space Reduction for Human Pose Estimation”, CVPR 2008.]

                                                                                                                     32 / 39




 Typical feature functions for CRFs in Computer Vision:


        φ_i(y_i, x): local representation, high-dimensional
           → ⟨w_i, φ_i(y_i, x)⟩: local classifier

        φ_{i,j}(y_i, y_j): prior knowledge, low-dimensional
           → ⟨w_{ij}, φ_{ij}(y_i, y_j)⟩: penalize outliers

        learning adjusts parameters:
                unary wi : learn local classifiers and their importance
                binary wij : learn importance of smoothing/penalization

        argmaxy p(y|x) is cleaned up version of local prediction



                                                                                                                   33 / 39

 Solving the Training Optimization Problem Numerically
 Idea: split the learning of the unary potentials into two parts:
      local classifiers,
      their importance.

 Two-Stage Training

        pre-train f_i^y(x) = log p̂(y_i = y | x)
        use φ̃_i(y_i, x) := f_i^y(x) ∈ R^K (low-dimensional)
        keep φ_{ij}(y_i, y_j) as before
        perform CRF learning with φ̃_i and φ_{ij}

 Advantage:
     lower-dimensional feature space during inference → faster
     f_i^y(x) can be stronger classifiers, e.g. non-linear SVMs
 Disadvantage:
     if the local classifiers are bad, CRF training cannot fix that.
                                                                                                                   34 / 39
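
 A minimal sketch of the pre-training stage, assuming scikit-learn; the data here is random and purely illustrative. The resulting per-node log-probabilities would then serve as the low-dimensional unary features φ̃_i in the subsequent CRF training, which is not repeated here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stage 1: pre-train a local classifier f_i^y(x) = log p_hat(y_i = y | x)
# on per-node feature vectors (hypothetical toy data: 200 nodes, 1000-dim features, K = 3 labels).
rng = np.random.default_rng(0)
X_local = rng.normal(size=(200, 1000))      # high-dimensional local features phi_i(x)
y_local = rng.integers(0, 3, size=200)      # observed node labels

clf = LogisticRegression(max_iter=1000).fit(X_local, y_local)

# Stage 2 input: replace each phi_i by the K log-probabilities of the pretrained classifier.
phi_tilde = clf.predict_log_proba(X_local)  # shape (200, K): low-dimensional unary features
print(phi_tilde.shape)
# CRF training then proceeds as before, but with phi_tilde_i and the unchanged phi_ij.
```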



  Solving the Training Optimization Problem Numerically
  CRF training is based on gradient-descent optimization.
  The faster we can do it, the better (more realistic) models we can use:


       ∇_w L(w) = (1/σ²) w − Σ_{n=1}^N [ φ(x^n, y^n) − Σ_{y∈Y} p(y|x^n, w) φ(x^n, y) ]        ∈ R^D


  A lot of research on accelerating CRF training:

       problem                    ”solution”                               method(s)
     |Y| too large             exploit structure                  (loopy) belief propagation
                                smart sampling                      contrastive divergence
                              use approximate L                      e.g. pseudo-likelihood
     N too large                  mini-batches                   stochastic gradient descent
     D too large                 trained φ_unary                       two-stage training
                                                                                                                    35 / 39




  Summary – CRF Learning
  Given:
         training set {(x^1, y^1), . . . , (x^N, y^N)} ⊂ X × Y,     (x^n, y^n) i.i.d. ∼ d(x, y)
         feature function φ : X × Y → R^D.

  Task: find a parameter vector w such that (1/Z) exp( ⟨w, φ(x, y)⟩ ) ≈ d(y|x).

  The CRF solution is derived by minimizing the negative conditional log-likelihood:

       w∗ = argmin_w  (1/(2σ²)) ‖w‖² − Σ_{n=1}^N [ ⟨w, φ(x^n, y^n)⟩ − log Σ_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]


         convex optimization problem → gradient descent works
         training needs repeated runs of probabilistic inference



                                                                                                                    36 / 39



  Extra I: Beyond Fully Supervised Learning
  So far, training was fully supervised: all variables were observed.
  In real life, some variables can be unobserved even during training.




       missing labels in training data                        latent variables, e.g. part location




   latent variables, e.g. part occlusion                        latent variables, e.g. viewpoint

                                                                                                                   37 / 39



 Graphical model: three types of variables
        x ∈ X always observed,
        y ∈ Y observed only in training,
        z ∈ Z never observed (latent).

  Marginalization over Latent Variables
  Construct the conditional likelihood as usual:

                           p(y, z|x, w) = (1/Z(x, w)) exp( ⟨w, φ(x, y, z)⟩ )

  Derive p(y|x, w) by marginalizing over z:

                           p(y|x, w) = Σ_{z∈Z} p(y, z|x, w) = (1/Z(x, w)) Σ_{z∈Z} exp( ⟨w, φ(x, y, z)⟩ )



                                                                                                                   38 / 39
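
  Numerically, this marginalization is again a log-sum-exp, now over the latent configurations z; a minimal sketch by explicit enumeration (only feasible for tiny Y and Z, chosen here purely for illustration):

```python
import numpy as np
from scipy.special import logsumexp

def log_p_y_given_x(w, x, y, phi, Y, Z_latent):
    """log p(y|x,w) = log sum_z exp<w, phi(x,y,z)> - log sum_{y',z} exp<w, phi(x,y',z)>."""
    score_y = logsumexp([np.dot(w, phi(x, y, z)) for z in Z_latent])
    score_all = logsumexp([np.dot(w, phi(x, yy, z)) for yy in Y for z in Z_latent])
    return score_y - score_all

# Toy example: scalar x, Y = {0, 1}, latent z in {0, 1}, phi(x, y, z) = [x*y, x*z] (illustrative choice).
phi = lambda x, y, z: np.array([x * y, x * z])
print(log_p_y_given_x(np.array([0.5, -0.2]), 1.0, 1, phi, [0, 1], [0, 1]))
```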


  Negative regularized conditional log-likelihood:

                    L(w) = λ ‖w‖² − Σ_{n=1}^N log p(y^n | x^n, w)

                         = λ ‖w‖² − Σ_{n=1}^N log Σ_{z∈Z} p(y^n, z | x^n, w)

                         = λ ‖w‖² − Σ_{n=1}^N log Σ_{z∈Z} exp( ⟨w, φ(x^n, y^n, z)⟩ )

                                  + Σ_{n=1}^N log Σ_{y∈Y} Σ_{z∈Z} exp( ⟨w, φ(x^n, y, z)⟩ )


        L is not convex in w → it can have local minima
        there is no agreed-upon best way of treating latent variables
                                                                                                                   39 / 39

Basics of probability in statistical simulation and stochastic programming
 
Cd Simon
Cd SimonCd Simon
Cd Simon
 
UMAP - Mathematics and implementational details
UMAP - Mathematics and implementational detailsUMAP - Mathematics and implementational details
UMAP - Mathematics and implementational details
 
Meta-learning and the ELBO
Meta-learning and the ELBOMeta-learning and the ELBO
Meta-learning and the ELBO
 
RSA Cryptosystem
RSA CryptosystemRSA Cryptosystem
RSA Cryptosystem
 
RSA Cryptosystem
RSA CryptosystemRSA Cryptosystem
RSA Cryptosystem
 
RSA Cryptosystem
RSA CryptosystemRSA Cryptosystem
RSA Cryptosystem
 
Topological Inference via Meshing
Topological Inference via MeshingTopological Inference via Meshing
Topological Inference via Meshing
 
Amirim Project - Threshold Functions in Random Simplicial Complexes - Avichai...
Amirim Project - Threshold Functions in Random Simplicial Complexes - Avichai...Amirim Project - Threshold Functions in Random Simplicial Complexes - Avichai...
Amirim Project - Threshold Functions in Random Simplicial Complexes - Avichai...
 
Mgm
MgmMgm
Mgm
 
Polya recurrence
Polya recurrencePolya recurrence
Polya recurrence
 
simplex.pdf
simplex.pdfsimplex.pdf
simplex.pdf
 
simplex.pdf
simplex.pdfsimplex.pdf
simplex.pdf
 
simplex.pdf
simplex.pdfsimplex.pdf
simplex.pdf
 
SIMPLEX VOLUME ANALYSIS BASED ON TRIANGULAR FACTORIZATION: A FRAMEWORK FOR HY...
SIMPLEX VOLUME ANALYSIS BASED ON TRIANGULAR FACTORIZATION: A FRAMEWORK FOR HY...SIMPLEX VOLUME ANALYSIS BASED ON TRIANGULAR FACTORIZATION: A FRAMEWORK FOR HY...
SIMPLEX VOLUME ANALYSIS BASED ON TRIANGULAR FACTORIZATION: A FRAMEWORK FOR HY...
 
從 VAE 走向深度學習新理論
從 VAE 走向深度學習新理論從 VAE 走向深度學習新理論
從 VAE 走向深度學習新理論
 
Lagrange
LagrangeLagrange
Lagrange
 
clock_theorems
clock_theoremsclock_theorems
clock_theorems
 

Plus de zukun

My lyn tutorial 2009
My lyn tutorial 2009My lyn tutorial 2009
My lyn tutorial 2009zukun
 
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVzukun
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Informationzukun
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statisticszukun
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibrationzukun
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionzukun
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluationzukun
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-softwarezukun
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptorszukun
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectorszukun
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-introzukun
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video searchzukun
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video searchzukun
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video searchzukun
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learningzukun
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionzukun
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick startzukun
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysiszukun
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structureszukun
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities zukun
 

Plus de zukun (20)

My lyn tutorial 2009
My lyn tutorial 2009My lyn tutorial 2009
My lyn tutorial 2009
 
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCV
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Information
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statistics
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibration
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer vision
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluation
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-software
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptors
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-intro
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video search
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video search
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learning
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer vision
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick start
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysis
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structures
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities
 

Dernier

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Dernier (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

03 conditional random field

  • 5. Principle (Maximum Entropy Principle): Let z^1, ..., z^N be samples from a distribution d(z). Let \phi_1, ..., \phi_D be feature functions, and denote by \mu_i := \frac{1}{N}\sum_n \phi_i(z^n) their means over the sample set. The maximum entropy distribution p is the solution to \max_{p \text{ a prob. distr.}} H(p) ("be as simple as possible") subject to E_{z\sim p(z)}[\phi_i(z)] = \mu_i ("be faithful to what we know"). Theorem (Exponential Family Distribution): Under some very reasonable conditions, the maximum entropy distribution has the form p(z) = \frac{1}{Z}\exp\big(\sum_i w_i \phi_i(z)\big) for some parameter vector w = (w_1, ..., w_D) and constant Z.
  • 6. Example: Let Z = \mathbb{R}, \phi_1(z) = z, \phi_2(z) = z^2. The exponential family distribution is p(z) = \frac{1}{Z(w)} \exp(w_1 z + w_2 z^2) = \frac{1}{Z(a,b)} \exp\big(a\,(z - b)^2\big) for a = w_2 and b = -\frac{w_1}{2 w_2}. It's a Gaussian! Given examples z^1, ..., z^N, we can compute a and b, and derive w. (A small numeric sketch follows below.)
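As a small check of this slide (not part of the slides; the sample data are synthetic): the natural parameters w_1, w_2 of p(z) proportional to exp(w_1 z + w_2 z^2) can be recovered from the sample mean and variance, i.e. by fitting a Gaussian.

    import numpy as np

    # recover the natural parameters of p(z) ~ exp(w1*z + w2*z^2)
    # by matching the first two moments of a synthetic sample set
    z = np.random.default_rng(0).normal(loc=1.5, scale=0.7, size=10000)

    mu = z.mean()        # matches feature phi_1(z) = z
    var = z.var()        # with mu, determines E[z^2] (feature phi_2)

    # Gaussian N(mu, var) in exponential-family form:
    #   p(z) ~ exp( (mu/var) * z  -  1/(2*var) * z^2 )
    w1 = mu / var
    w2 = -1.0 / (2.0 * var)
    print(w1, w2)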
  • 7. Example: Let Z = {1, ..., K}, \phi_k(z) = [[z = k]] for k = 1, ..., K. The exponential family distribution is p(z) = \frac{1}{Z(w)} \exp\big(\sum_k w_k \phi_k(z)\big), i.e. p(z) = \exp(w_1)/Z for z = 1, \exp(w_2)/Z for z = 2, ..., \exp(w_K)/Z for z = K, with Z = \exp(w_1) + \dots + \exp(w_K). It's a Multinomial!
  • 8. Example: Let Z = \{0,1\}^{N\times M} (binary labelings y of an N x M image grid), \phi_i(y) := y_i for each pixel i, and one additional feature \phi_{NM+1}(y) := \sum_{i\sim j} [[y_i = y_j]] (summing over all 4-neighbor pairs). The exponential family distribution is p(y) = \frac{1}{Z(w)} \exp\big( \langle w, \phi(y) \rangle + \tilde w \sum_{i\sim j} [[y_i = y_j]] \big). It's a (binary) Markov Random Field!
  • 9. Conditional Random Field Learning. Assume: a set of i.i.d. samples D = {(x^n, y^n)}_{n=1,...,N} with (x^n, y^n) \sim d(x, y); feature functions (\phi_1(x,y), ..., \phi_D(x,y)) =: \phi(x,y); a parametrized family p(y|x,w) = \frac{1}{Z(x,w)} \exp(\langle w, \phi(x,y)\rangle). Task: adjust w of p(y|x,w) based on D. Many possible techniques to do so: Expectation Matching, Maximum Likelihood, Best Approximation, MAP estimation of w. Punchline: they all turn out to be (almost) the same!
  • 10. Maximum Likelihood Parameter Estimation. Idea: maximize the conditional likelihood of observing outputs y^1, ..., y^N for inputs x^1, ..., x^N: w^* = argmax_{w\in\mathbb{R}^D} p(y^1, ..., y^N | x^1, ..., x^N, w) \overset{\text{i.i.d.}}{=} argmax_w \prod_{n=1}^N p(y^n|x^n, w) \overset{-\log(\cdot)}{=} argmin_w -\sum_{n=1}^N \log p(y^n|x^n, w): the negative conditional log-likelihood (of D).
  • 11. Best Approximation. Idea: find p(y|x,w) that is closest to d(y|x). Definition (Similarity between conditional distributions): for fixed x \in X, the KL-divergence measures similarity: KL_{cond}(p||d)(x) := \sum_{y\in Y} d(y|x) \log \frac{d(y|x)}{p(y|x,w)}. For x \sim d(x), compute the expectation: KL_{tot}(p||d) := E_{x\sim d(x)}\big[ KL_{cond}(p||d)(x) \big] = \sum_{x\in X}\sum_{y\in Y} d(x,y) \log \frac{d(y|x)}{p(y|x,w)}. (A tiny numeric sketch follows below.)
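A tiny numpy sketch of the two KL quantities above for hand-made discrete tables d(x), d(y|x) and p(y|x, w); all numbers are invented for illustration.

    import numpy as np

    # toy example, X = {0,1}, Y = {0,1,2}; all numbers made up
    d_x = np.array([0.4, 0.6])                    # d(x)
    d_y_given_x = np.array([[0.7, 0.2, 0.1],      # d(y|x=0)
                            [0.1, 0.3, 0.6]])     # d(y|x=1)
    p_y_given_x = np.array([[0.6, 0.3, 0.1],      # p(y|x=0, w)
                            [0.2, 0.3, 0.5]])     # p(y|x=1, w)

    # KL_cond(p||d)(x) = sum_y d(y|x) * log( d(y|x) / p(y|x, w) )
    kl_cond = (d_y_given_x * np.log(d_y_given_x / p_y_given_x)).sum(axis=1)

    # KL_tot = E_{x ~ d(x)} [ KL_cond(x) ]
    kl_tot = (d_x * kl_cond).sum()
    print(kl_cond, kl_tot)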
  • 12. Best Approximation. Idea: find the p(y|x,w) of minimal KL_{tot}-distance to d(y|x): w^* = argmin_w \sum_{x\in X}\sum_{y\in Y} d(x,y)\log\frac{d(y|x)}{p(y|x,w)} \overset{\text{drop const.}}{=} argmin_w -\sum_{(x,y)\in X\times Y} d(x,y)\log p(y|x,w) \overset{(x^n,y^n)\sim d(x,y)}{\approx} argmin_w -\sum_{n=1}^N \log p(y^n|x^n,w): again the negative conditional log-likelihood (of D).
  • 13. MAP Estimation of w. Idea: treat w as a random variable and maximize the posterior probability p(w|D). By Bayes' rule and i.i.d.-ness: p(w|D) = \frac{p(x^1,y^1,...,x^N,y^N|w)\,p(w)}{p(D)} = p(w)\prod_{n=1}^N \frac{p(y^n|x^n,w)}{p(y^n|x^n)}, where p(w) is the prior belief on w (it cannot be estimated from data). Then w^* = argmax_w p(w|D) = argmin_w -\log p(w|D) = argmin_w \big[ -\log p(w) - \sum_{n=1}^N \log p(y^n|x^n,w) + \sum_{n=1}^N \log p(y^n|x^n) \big]; the last term is independent of w, so w^* = argmin_w \big[ -\log p(w) - \sum_{n=1}^N \log p(y^n|x^n,w) \big].
  • 14. w^* = argmin_w \big[ -\log p(w) - \sum_{n=1}^N \log p(y^n|x^n,w) \big]. Choices for p(w): (a) p(w) :\equiv \text{const.} (uniform; in \mathbb{R}^D not really a distribution): w^* = argmin_w -\sum_{n=1}^N \log p(y^n|x^n,w) + \text{const.}, the negative conditional log-likelihood. (b) p(w) := \text{const.} \cdot e^{-\frac{1}{2\sigma^2}\|w\|^2} (Gaussian): w^* = argmin_w \big[ \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \log p(y^n|x^n,w) \big] + \text{const.}, the regularized negative conditional log-likelihood.
  • 15. Probabilistic Models for Structured Prediction – Summary. Negative (regularized) conditional log-likelihood (of D): L(w) = \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \big[ \langle w, \phi(x^n,y^n)\rangle - \log \sum_{y\in Y} e^{\langle w, \phi(x^n,y)\rangle} \big] (\sigma^2 \to \infty makes it unregularized). Probabilistic parameter estimation or training means solving w^* = argmin_{w\in\mathbb{R}^D} L(w). Same optimization problem as for multi-class logistic regression. (A brute-force evaluation sketch follows below.)
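As an illustration of this objective, a minimal brute-force sketch that evaluates L(w) by enumerating a tiny label set Y; the feature map phi, the data, and all dimensions are invented for the example and are not part of the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    D, N = 5, 20                      # feature dimension, number of samples
    Y = [0, 1, 2]                     # small output space, enumerable
    sigma2 = 1.0

    def phi(x, y):                    # toy joint feature map phi(x, y)
        f = np.zeros(D)
        f[:3] = x * (y + 1)           # some x/y dependent features
        f[3] = y
        f[4] = 1.0
        return f

    X = rng.normal(size=(N, 3))
    Ys = rng.integers(0, len(Y), size=N)

    def L(w):
        val = np.dot(w, w) / (2 * sigma2)
        for x, y in zip(X, Ys):
            scores = np.array([np.dot(w, phi(x, yy)) for yy in Y])
            # log Z(x, w) via log-sum-exp, minus the score of the true label
            val += np.logaddexp.reduce(scores) - np.dot(w, phi(x, y))
        return val

    print(L(np.zeros(D)))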
  • 16. Negative Conditional Log-Likelihood (Toy Example). [Figure: contour plots of the negative log-likelihood over w for \sigma^2 = 0.01, \sigma^2 = 0.10, \sigma^2 = 1.00, and \sigma^2 \to \infty.]
  • 17. Steepest Descent Minimization – minimize L(w). Input: tolerance \epsilon > 0. 1: w_{cur} \leftarrow 0. 2: repeat. 3: v \leftarrow \nabla_w L(w_{cur}). 4: \eta \leftarrow argmin_{\eta\in\mathbb{R}} L(w_{cur} - \eta v). 5: w_{cur} \leftarrow w_{cur} - \eta v. 6: until \|v\| < \epsilon. Output: w_{cur}. Alternatives: L-BFGS (second-order descent without explicit Hessian), conjugate gradient. We always need (at least) the gradient of L. (A runnable sketch follows below.)
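A runnable sketch of the steepest-descent loop above; since an exact line search over eta is rarely available, the sketch substitutes a backtracking line search (an assumption, not part of the slide) and is demonstrated on a simple convex quadratic.

    import numpy as np

    def steepest_descent(L, grad_L, w0, tol=1e-6, max_iter=1000):
        """Sketch of the slide's algorithm with a backtracking line search."""
        w = w0.copy()
        for _ in range(max_iter):
            v = grad_L(w)
            if np.linalg.norm(v) < tol:
                break
            eta = 1.0
            # backtracking (Armijo) instead of an exact argmin over eta
            while L(w - eta * v) > L(w) - 0.5 * eta * np.dot(v, v):
                eta *= 0.5
            w = w - eta * v
        return w

    # toy convex objective: L(w) = 0.5 * ||w - c||^2
    c = np.array([1.0, -2.0, 0.5])
    w_star = steepest_descent(lambda w: 0.5 * np.sum((w - c) ** 2),
                              lambda w: w - c,
                              np.zeros(3))
    print(w_star)   # approximately c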
  • 18. L(w) = \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \langle w, \phi(x^n,y^n)\rangle + \sum_{n=1}^N \log \sum_{y\in Y} e^{\langle w, \phi(x^n,y)\rangle}. Gradient: \nabla_w L(w) = \frac{1}{\sigma^2} w - \sum_{n=1}^N \Big[ \phi(x^n,y^n) - \frac{\sum_{y\in Y} e^{\langle w,\phi(x^n,y)\rangle}\,\phi(x^n,y)}{\sum_{\bar y\in Y} e^{\langle w,\phi(x^n,\bar y)\rangle}} \Big] = \frac{1}{\sigma^2} w - \sum_{n=1}^N \Big[ \phi(x^n,y^n) - \sum_{y\in Y} p(y|x^n,w)\,\phi(x^n,y) \Big] = \frac{1}{\sigma^2} w - \sum_{n=1}^N \Big[ \phi(x^n,y^n) - E_{y\sim p(y|x^n,w)}\phi(x^n,y) \Big]. Hessian: \nabla^2_w L(w) = \frac{1}{\sigma^2}\mathrm{Id}_{D\times D} + \sum_{n=1}^N \Big( E_{y\sim p(y|x^n,w)}\big[\phi(x^n,y)\phi(x^n,y)^\top\big] - \big[E_{y\sim p(y|x^n,w)}\phi(x^n,y)\big]\big[E_{y\sim p(y|x^n,w)}\phi(x^n,y)\big]^\top \Big), i.e. the regularizer plus the per-sample covariance of the feature vector. (A gradient-check sketch follows below.)
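A small sketch that implements the expected-feature form of the gradient by brute force over a tiny Y and checks one coordinate against a finite-difference approximation; the toy feature map and data are made up.

    import numpy as np

    rng = np.random.default_rng(1)
    D, N, K, sigma2 = 4, 10, 3, 1.0
    X = rng.normal(size=(N, 2))
    Ys = rng.integers(0, K, size=N)

    def phi(x, y):                          # toy joint feature map
        f = np.zeros(D)
        f[:2] = x * (y + 1)
        f[2] = float(y == 1)
        f[3] = 1.0
        return f

    def L_and_grad(w):
        L = np.dot(w, w) / (2 * sigma2)
        g = w / sigma2
        for x, y in zip(X, Ys):
            feats = np.stack([phi(x, yy) for yy in range(K)])   # K x D
            scores = feats @ w
            p = np.exp(scores - np.logaddexp.reduce(scores))    # p(y | x, w)
            L += np.logaddexp.reduce(scores) - scores[y]
            g -= phi(x, y) - p @ feats                          # phi - E[phi]
        return L, g

    w = rng.normal(size=D)
    L0, g = L_and_grad(w)
    eps = 1e-6
    e0 = np.zeros(D); e0[0] = eps
    # analytic vs. numerical derivative of the first coordinate
    print(g[0], (L_and_grad(w + e0)[0] - L0) / eps)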
  • 19. L(w) = \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \langle w, \phi(x^n,y^n)\rangle + \sum_{n=1}^N \log \sum_{y\in Y} e^{\langle w,\phi(x^n,y)\rangle} is C^\infty-differentiable on all of \mathbb{R}^D.
  • 20. \nabla_w L(w) = \frac{1}{\sigma^2} w - \sum_{n=1}^N \big[ \phi(x^n,y^n) - E_{y\sim p(y|x^n,w)}\phi(x^n,y) \big]. For \sigma \to \infty: E_{y\sim p(y|x^n,w)}\phi(x^n,y) = \phi(x^n,y^n) \Rightarrow \nabla_w L(w) = 0, a critical point of L (local minimum/maximum/saddle point). Interpretation: we aim for expectation matching, E_{y\sim p}\phi(x,y) = \phi(x, y^{obs}), but discriminatively: only for x \in \{x^1, ..., x^N\}.
  • 21. \nabla^2_w L(w) = \frac{1}{\sigma^2}\mathrm{Id}_{D\times D} + \sum_{n=1}^N \mathrm{Cov}_{y\sim p(y|x^n,w)}\big[\phi(x^n,y)\big] is a positive definite Hessian matrix \rightarrow L(w) is convex \rightarrow \nabla_w L(w) = 0 implies a global minimum.
  • 22. Milestone I: Probabilistic Training (Conditional Random Fields). p(y|x,w) is log-linear in w \in \mathbb{R}^D. Training: many probabilistic derivations lead to the same optimization problem, minimizing the negative conditional log-likelihood L(w). L(w) is differentiable and convex, so gradient descent will find the global optimum with \nabla_w L(w) = 0. Same structure as multi-class logistic regression. For logistic regression: this is where the textbook ends, we're done. For conditional random fields: we're not in safe waters yet!
  • 23. Task: compute v = \nabla_w L(w_{cur}) and evaluate L(w_{cur} + \eta v), with L(w) = \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \langle w,\phi(x^n,y^n)\rangle + \sum_{n=1}^N \log\sum_{y\in Y} e^{\langle w,\phi(x^n,y)\rangle} and \nabla_w L(w) = \frac{1}{\sigma^2}w - \sum_{n=1}^N \big[\phi(x^n,y^n) - \sum_{y\in Y} p(y|x^n,w)\,\phi(x^n,y)\big]. Problem: Y typically is very (exponentially) large: binary image segmentation: |Y| = 2^{640\times 480} \approx 10^{92475}; ranking N images: |Y| = N!, e.g. N = 1000: |Y| \approx 10^{2568}. We must use the structure in Y, or we're lost.
  • 24. Solving the Training Optimization Problem Numerically. Computing the gradient \nabla_w L(w) = \frac{1}{\sigma^2}w - \sum_{n=1}^N \big[\phi(x^n,y^n) - E_{y\sim p(y|x^n,w)}\phi(x^n,y)\big] naively: O(K^M N D). Line search on L(w) = \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \langle w,\phi(x^n,y^n)\rangle + \sum_{n=1}^N \log Z(x^n,w) naively: O(K^M N D) per evaluation of L. Here N: number of samples; D: dimension of the feature space; M: number of output nodes, \approx 100s to 1,000,000s; K: number of possible labels of each output node, \approx 2 to 100s.
  • 25. Probabilistic Inference to the Rescue. Remember: in a graphical model with factors \mathcal F, the features decompose: \phi(x,y) = \big(\phi_F(x,y_F)\big)_{F\in\mathcal F}, so E_{y\sim p(y|x,w)}\phi(x,y) = \big(E_{y\sim p(y|x,w)}\phi_F(x,y_F)\big)_{F\in\mathcal F} = \big(E_{y_F\sim p(y_F|x,w)}\phi_F(x,y_F)\big)_{F\in\mathcal F}, and E_{y_F\sim p(y_F|x,w)}\phi_F(x,y_F) = \sum_{y_F\in Y_F} p(y_F|x,w)\,\phi_F(x,y_F): a sum over factor marginals with only K^{|F|} terms. The factor marginals \mu_F = p(y_F|x,w) are much smaller objects than the complete joint distribution p(y|x,w) and can be computed or approximated, e.g., with (loopy) belief propagation. (A sketch on a small chain follows below.)
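A small sketch of this point, under the assumption of a plain pairwise chain with one-hot pairwise features (so the expected factor features are exactly the factor marginals): the marginals obtained from forward/backward message passing over the chain agree with the ones from brute-force enumeration of the joint, which is done here only as a check.

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(0)
    K, M = 3, 4                                             # labels, chain nodes
    psi = [np.exp(rng.normal(size=(K, K))) for _ in range(M - 1)]   # pairwise potentials

    # exact pairwise marginals via forward/backward messages (chain sum-product)
    alpha = [np.ones(K)]
    for t in range(M - 1):
        alpha.append(alpha[-1] @ psi[t])
    beta = [np.ones(K)] * M
    for t in range(M - 2, -1, -1):
        beta[t] = psi[t] @ beta[t + 1]
    mu = []
    for t in range(M - 1):
        m = alpha[t][:, None] * psi[t] * beta[t + 1][None, :]
        mu.append(m / m.sum())

    # brute-force check: enumerate all K**M labelings of the chain
    states = list(product(range(K), repeat=M))
    p = np.array([np.prod([psi[t][y[t], y[t + 1]] for t in range(M - 1)]) for y in states])
    p /= p.sum()
    mu_bf = [np.zeros((K, K)) for _ in range(M - 1)]
    for y, py in zip(states, p):
        for t in range(M - 1):
            mu_bf[t][y[t], y[t + 1]] += py
    print(all(np.allclose(a, b) for a, b in zip(mu, mu_bf)))   # True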
  • 26. Solving the Training Optimization Problem Numerically. With the factor decomposition, computing the gradient costs O(M K^{|F_{max}|} N D) instead of O(K^M N D), and line search costs O(M K^{|F_{max}|} N D) per evaluation of L instead of O(K^M N D). Here N: number of samples, \approx 10s to 1,000,000s; D: dimension of the feature space; M: number of output nodes; K: number of possible labels of each output node.
  • 27. What if the training set D is too large (e.g. millions of examples)? Simplify the model: train a simpler model (smaller factors); less expressive power means results might get worse rather than better. Subsampling: create a random subset D' \subset D and train the model using D'; ignores all information in D \setminus D'. Parallelize: train multiple models in parallel and merge them; follows the multi-core/GPU trend; but how to actually merge? and it doesn't really save computation.
  • 28. What if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD): keep maximizing p(w|D), but in each gradient descent step create a random subset D' \subset D (often just 1–3 elements!) and follow the approximate gradient \tilde\nabla_w L(w) = \frac{1}{\sigma^2}w - \sum_{(x^n,y^n)\in D'}\big[\phi(x^n,y^n) - E_{y\sim p(y|x^n,w)}\phi(x^n,y)\big]. Line search is no longer possible; the stepsize \eta becomes an extra parameter. SGD converges to argmin_w L(w) (if \eta is chosen right). SGD needs more iterations, but each one is much faster. More: see L. Bottou, O. Bousquet: "The Tradeoffs of Large Scale Learning", NIPS 2008; also http://leon.bottou.org/research/largescale. (A sketch of the update loop follows below.)
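A sketch of the SGD update loop described above. The function grad_term is assumed to return phi(x^n, y^n) - E_{y ~ p(y|x^n, w)} phi(x^n, y) for one example (e.g. computed via factor marginals); the decaying stepsize schedule is one common choice, not something prescribed by the slides.

    import numpy as np

    def sgd(grad_term, data, D, sigma2=1.0, eta=0.1, batch_size=2, steps=1000, seed=0):
        """Approximate-gradient training loop; data is a list of (x, y) pairs."""
        rng = np.random.default_rng(seed)
        w = np.zeros(D)
        for t in range(steps):
            # tiny random subset D' of the training data
            batch = [data[i] for i in rng.choice(len(data), size=batch_size)]
            g = w / sigma2
            for x, y in batch:
                g -= grad_term(w, x, y)
            # no line search: a decaying stepsize replaces it
            w -= (eta / (1 + 0.01 * t)) * g
        return w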
  • 29. Solving the Training Optimization Problem Numerically. If belief propagation is possible (e.g. pairwise factors), computing the gradient costs O(M K^2 N D) instead of O(K^M N D), and line search costs O(M K^2 N D) per evaluation of L. Here N: number of samples; D: dimension of the feature space, \approx 1–10s for \phi_{i,j} and 100s to 10,000s for \phi_i; M: number of output nodes; K: number of possible labels of each output node.
  • 30. Typical feature functions in image segmentation: \phi_i(y_i, x) \in \mathbb{R}^{\approx 1000}: local image features, e.g. a bag-of-words histogram; \langle w_i, \phi_i(y_i,x)\rangle acts as a local classifier (like logistic regression). \phi_{i,j}(y_i,y_j) = [[y_i = y_j]] \in \mathbb{R}^1: test for same label; \langle w_{ij}, \phi_{ij}(y_i,y_j)\rangle acts as a penalizer for label changes (if w_{ij} > 0). Combined: argmax_y p(y|x) is a smoothed version of the local cues. [Figure: original image / local classification / local + smoothness.] (A scoring sketch follows below.)
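A sketch of how these two feature types combine into the score <w, phi(x, y)> of a grid labeling: per-pixel classifier scores plus a single weight on the count of 4-neighbor label agreements. Array names, shapes and numbers are invented for the example.

    import numpy as np

    def segmentation_score(unary, labels, w_pair):
        """Toy score <w, phi(x, y)> for a grid labeling.

        unary[h, c, k]: local classifier score for label k at pixel (h, c)
        labels[h, c]:   label y_i of each pixel
        w_pair:         weight on the 'same label' indicator of 4-neighbor pairs
        """
        H, W = labels.shape
        # unary part: sum of the scores of the chosen labels
        score = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
        # pairwise part: count agreements between horizontal / vertical neighbors
        agree = (labels[:, :-1] == labels[:, 1:]).sum() + (labels[:-1, :] == labels[1:, :]).sum()
        return score + w_pair * agree

    # tiny usage example with random numbers
    rng = np.random.default_rng(0)
    unary = rng.normal(size=(4, 5, 3))            # 4x5 image, 3 labels
    labels = rng.integers(0, 3, size=(4, 5))
    print(segmentation_score(unary, labels, w_pair=0.5))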
  • 31. Typical feature functions in handwriting recognition: \phi_i(y_i, x) \in \mathbb{R}^{\approx 1000}: image representation (pixels, gradients); \langle w_i, \phi_i(y_i,x)\rangle acts as a local classifier for whether x_i is the letter y_i. \phi_{i,j}(y_i,y_j) = e_{y_i} \otimes e_{y_j} \in \mathbb{R}^{26\cdot 26}: letter/letter indicator; \langle w_{ij}, \phi_{ij}(y_i,y_j)\rangle encourages or suppresses letter combinations. Combined: argmax_y p(y|x) is a "corrected" version of the local cues. [Figure: local classification / local + "correction".]
  • 32. Typical feature functions in pose estimation: \phi_i(y_i, x) \in \mathbb{R}^{\approx 1000}: local image representation, e.g. HoG; \langle w_i, \phi_i(y_i,x)\rangle gives a local confidence map. \phi_{i,j}(y_i,y_j) = \text{good\_fit}(y_i,y_j) \in \mathbb{R}^1: test for geometric fit; \langle w_{ij}, \phi_{ij}(y_i,y_j)\rangle penalizes unrealistic poses. Together: argmax_y p(y|x) is a sanitized version of the local cues. [Figure: original / local classification / local + geometry.] [V. Ferrari, M. Marin-Jimenez, A. Zisserman: "Progressive Search Space Reduction for Human Pose Estimation", CVPR 2008.]
  • 33. Typical feature functions for CRFs in computer vision: \phi_i(y_i, x): local representation, high-dimensional; \langle w_i, \phi_i(y_i,x)\rangle: local classifier. \phi_{i,j}(y_i,y_j): prior knowledge, low-dimensional; \langle w_{ij}, \phi_{ij}(y_i,y_j)\rangle: penalize outliers. Learning adjusts the parameters: unary w_i: learn the local classifiers and their importance; binary w_{ij}: learn the importance of smoothing/penalization. argmax_y p(y|x) is a cleaned-up version of the local prediction.
  • 34. Solving the Training Optimization Problem Numerically. Idea: split the learning of the unary potentials into two parts: the local classifiers, and their importance. Two-stage training: pre-train f_i^y(x) = \log \hat p(y_i|x); use \tilde\phi_i(y_i, x) := f_i^y(x) \in \mathbb{R}^K (low-dimensional); keep \phi_{ij}(y_i,y_j) as before; perform CRF learning with \tilde\phi_i and \phi_{ij}. Advantage: a lower-dimensional feature space during inference, hence faster; the f_i^y(x) can be stronger classifiers, e.g. non-linear SVMs. Disadvantage: if the local classifiers are bad, CRF training cannot fix that. (A sketch of the reduced unary feature follows below.)
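One possible reading of the reduced unary feature on this slide, sketched in numpy: the pre-trained classifier's log-probabilities replace the high-dimensional image features, and the CRF only learns one weight per label on top of them. The construction below is an assumption about the intended encoding, not taken verbatim from the slides.

    import numpy as np

    def reduced_unary_feature(log_probs_i, y_i):
        """Reduced unary feature for one node (one possible encoding).

        log_probs_i: length-K vector log p_hat(k | x) from a pre-trained
                     local classifier for node i (assumed to be given).
        y_i:         candidate label for node i.
        Returns a K-dimensional feature: the classifier's log-probability
        for y_i placed at position y_i, zero elsewhere, so CRF learning
        only has to fit one weight per label on top of the classifier.
        """
        K = len(log_probs_i)
        f = np.zeros(K)
        f[y_i] = log_probs_i[y_i]
        return f

    # usage: a made-up 3-label example
    log_probs = np.log(np.array([0.7, 0.2, 0.1]))
    print(reduced_unary_feature(log_probs, y_i=0))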
  • 35. Solving the Training Optimization Problem Numerically. CRF training methods are based on gradient-descent optimization. The faster we can do it, the better (more realistic) models we can use: \nabla_w L(w) = \frac{1}{\sigma^2}w - \sum_{n=1}^N \big[\phi(x^n,y^n) - \sum_{y\in Y} p(y|x^n,w)\,\phi(x^n,y)\big] \in \mathbb{R}^D. A lot of research has gone into accelerating CRF training: |Y| too large: exploit structure via (loopy) belief propagation, smart sampling via contrastive divergence, or an approximate L such as pseudo-likelihood; N too large: mini-batches via stochastic gradient descent; D too large: pre-trained \phi_{unary} via two-stage training.
  • 36. Summary – CRF Learning. Given: an i.i.d. training set {(x^1,y^1), ..., (x^N,y^N)} \subset X \times Y with (x^n,y^n) \sim d(x,y), and a feature function \phi: X \times Y \to \mathbb{R}^D. Task: find a parameter vector w such that \frac{1}{Z}\exp(\langle w, \phi(x,y)\rangle) \approx d(y|x). The CRF solution is derived by minimizing the negative conditional log-likelihood: w^* = argmin_w \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \big[\langle w,\phi(x^n,y^n)\rangle - \log\sum_{y\in Y} e^{\langle w,\phi(x^n,y)\rangle}\big]. This is a convex optimization problem, so gradient descent works; training needs repeated runs of probabilistic inference.
  • 37. Extra I: Beyond Fully Supervised Learning. So far, training was fully supervised: all variables were observed. In real life, some variables can be unobserved even during training: missing labels in the training data; latent variables, e.g. part location; latent variables, e.g. part occlusion; latent variables, e.g. viewpoint.
  • 38. Graphical model: three types of variables: x \in X is always observed, y \in Y is observed only during training, z \in Z is never observed (latent). Marginalization over latent variables: construct the conditional likelihood as usual, p(y, z|x, w) = \frac{1}{Z(x,w)}\exp(\langle w, \phi(x,y,z)\rangle), and derive p(y|x,w) by marginalizing over z: p(y|x,w) = \sum_{z\in Z} p(y,z|x,w) = \frac{1}{Z(x,w)}\sum_{z\in Z}\exp(\langle w, \phi(x,y,z)\rangle). (A brute-force sketch follows below.)
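A brute-force sketch of this marginalization for small discrete Y and Z; the score table stands in for <w, phi(x, y, z)> and is filled with made-up numbers.

    import numpy as np

    rng = np.random.default_rng(0)
    score = rng.normal(size=(4, 3))                 # |Y| = 4, |Z| = 3; score[y, z] = <w, phi(x, y, z)>

    logZ = np.logaddexp.reduce(score.ravel())       # log Z(x, w) over all (y, z)
    p_yz = np.exp(score - logZ)                     # p(y, z | x, w)
    p_y = p_yz.sum(axis=1)                          # p(y | x, w) = sum_z p(y, z | x, w)
    print(p_y, p_y.sum())                           # sums to 1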
  • 39. Negative regularized conditional log-likelihood: L(w) = \lambda\|w\|^2 - \sum_{n=1}^N \log p(y^n|x^n,w) = \lambda\|w\|^2 - \sum_{n=1}^N \log \sum_{z\in Z} p(y^n,z|x^n,w) = \lambda\|w\|^2 - \sum_{n=1}^N \log \sum_{z\in Z} \exp(\langle w, \phi(x^n,y^n,z)\rangle) + \sum_{n=1}^N \log \sum_{z\in Z}\sum_{y\in Y} \exp(\langle w, \phi(x^n,y,z)\rangle). L is not convex in w, so it can have local minima; there is no agreed-upon best way of treating latent variables.