G: Worst-case Complexity                             G: Integrality/Relaxations   Determinism   End




                      Part 7: Structured Prediction and Energy
                                 Minimization (2/2)

                                 Sebastian Nowozin and Christoph H. Lampert



                                             Colorado Springs, 25th June 2011




Sebastian Nowozin and Christoph H. Lampert
Part 7: Structured Prediction and Energy Minimization (2/2)

G: Worst-case Complexity




                      [Diagram: a hard problem can be attacked by giving up one of:
                       generality, optimality, worst-case complexity, integrality, determinism]







Giving up Worst-case Complexity


                 Worst-case complexity is an asymptotic statement
                 Worst-case complexity quantifies only the worst case
                 The practical case might be very different
                           Issue: what is the distribution over inputs?


        Popular methods with bad or unknown worst-case complexity
                 Simplex Method for Linear Programming
                 Hash tables
                 Branch-and-bound search
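As an illustration of how far practical behaviour can drift from the worst case, the sketch below (my own toy example, not from the slides) forces every key of a Python dict into a single hash bucket; the nominally O(1) average-case lookups degrade to O(n) scans:

```python
import time

class BadHash:
    """Adversarial key: every instance hashes to the same bucket."""
    def __init__(self, x):
        self.x = x
    def __hash__(self):
        return 0                      # all keys collide
    def __eq__(self, other):
        return self.x == other.x

def build_and_probe(keys):
    """Insert all keys into a dict, then look each one up; return elapsed time."""
    t0 = time.perf_counter()
    d = {k: None for k in keys}
    for k in keys:
        _ = d[k]
    return time.perf_counter() - t0

n = 1000
t_avg = build_and_probe(list(range(n)))                    # well-spread hashes: ~O(n) total
t_worst = build_and_probe([BadHash(i) for i in range(n)])  # all collide: ~O(n^2) total
assert t_worst > t_avg    # the "practical case" depends entirely on the input distribution
```

The same gap explains the other entries in the list: the Simplex method and branch-and-bound are exponential on contrived inputs yet fast on the inputs that actually occur.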






Branch-and-bound Search


                 Implicit enumeration: globally optimal
                 Choose
                    1. Partitioning of solution space
                    2. Branching scheme
                    3. Upper and lower bounds over partitions


        Branch and bound
                 is very flexible: many tuning possibilities in partitioning, branching
                 schemes, and bounding functions,
                 can be very efficient in practice,
                 typically has worst-case complexity equal to exhaustive enumeration





Branch-and-Bound (cont)

        [Figure: partition of the solution space Y into active nodes A1-A5 (white)
         and closed nodes C1-C3 (gray)]
        Work with partitioning of solution space Y
                 Active nodes (white)
                 Closed nodes (gray)





Branch-and-Bound (cont)




        [Figure: left, a single active partition A covering Y;
         right, everything closed: a single closed partition C covering Y]
                 Initially: everything active
                 Goal: everything closed





Branch-and-Bound (cont)


        [Figure: current partition of Y into active nodes A1-A5 and closed nodes C1-C3]






Branch-and-Bound (cont)




        [Figure: the selected active element A2]




                 Take an active element (A2 )






Branch-and-Bound (cont)




        [Figure: A2 is split into subsets]




                 Partition into two or more subsets of Y






Branch-and-Bound (cont)


        [Figure: the subsets of A2 after bounding: closed nodes C4, C5 and active node A6]




                 Evaluate bounds and set each node active or closed
                 Closing is possible if we can prove that no solution in a partition
                 can be better than a known solution of value L:
                           ḡ(A) ≤ L  →  close A
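The loop above can be sketched generically; this is a minimal illustration of my own (interval partitions over the integers, a user-supplied bound ḡ), not the authors' implementation:

```python
import heapq

def branch_and_bound(lo, hi, f, upper_bound):
    """Maximize f over the integers {lo, ..., hi}.

    upper_bound(a, b) must satisfy upper_bound(a, b) >= f(y) for all
    y in [a, b] -- the bound ḡ(A) from the slide.
    """
    best_y, best_val = lo, f(lo)                 # incumbent solution of value L
    heap = [(-upper_bound(lo, hi), lo, hi)]      # active partitions (white nodes)
    while heap:
        neg_bound, a, b = heapq.heappop(heap)
        if -neg_bound <= best_val:               # ḡ(A) <= L  →  close A
            continue
        if a == b:                               # singleton: evaluate exactly
            if f(a) > best_val:
                best_y, best_val = a, f(a)
            continue
        m = (a + b) // 2                         # branch: split the partition
        for part in ((a, m), (m + 1, b)):
            heapq.heappush(heap, (-upper_bound(*part), *part))
    return best_y, best_val

# Toy problem: maximize a concave score over {0, ..., 10**6}.
f = lambda y: -(y - 123456) ** 2
# Valid interval bound: evaluate f at the point of [a, b] closest to the peak.
ub = lambda a, b: f(min(max(123456, a), b))
print(branch_and_bound(0, 10**6, f, ub))   # (123456, 0)
```

With a tight bound the search closes all but one branch per level; with looser bounds it degrades toward exhaustive enumeration, which is exactly the worst-case remark above.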




Example 1: Efficient Subwindow Search
        Efficient Subwindow Search (Lampert and Blaschko, 2008)




        Find the bounding box that maximizes a linear scoring function
                                                    y* = argmax_{y∈Y} ⟨w, φ(y, x)⟩




Example 1: Efficient Subwindow Search
        Efficient Subwindow Search (Lampert and Blaschko, 2008)




                                                g(x, y) = β + Σ_{x_i within y} w(x_i)





Example 1: Efficient Subwindow Search
        Efficient Subwindow Search (Lampert and Blaschko, 2008)




        Subsets B of bounding boxes specified by interval coordinates, B ∈ 2^Y




Example 1: Efficient Subwindow Search


        Efficient Subwindow Search (Lampert and Blaschko, 2008)
        Upper bound: ḡ(x, B) ≥ g(x, y) for all y ∈ B

             ḡ(x, B) = β + Σ_{x_i within B_max} max{w(x_i), 0} + Σ_{x_i within B_min} min{w(x_i), 0}
                     ≥ max_{y∈B} g(x, y)
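A small numeric check of this bound (my toy example with made-up point weights, not the ESS code): B_max is the union and B_min the intersection of all boxes in B, and the quantity above dominates the score of every box in B:

```python
from itertools import product

# Hypothetical point features: (x, y, weight w(x_i)); bias beta = 0.
points = [(1, 1, 2.0), (3, 2, -1.5), (4, 4, 0.7), (2, 3, -0.3), (5, 1, 1.1)]

def in_box(box, px, py):
    l, t, r, b = box
    return l <= px <= r and t <= py <= b

def score(box):
    """g(x, y): summed weights of the points inside box = (l, t, r, b)."""
    return sum(w for (px, py, w) in points if in_box(box, px, py))

def bound(L, T, R, B):
    """ḡ(x, B) for the box set with left in L, top in T, right in R, bottom in B."""
    box_max = (min(L), min(T), max(R), max(B))   # union of all boxes in B
    box_min = (max(L), max(T), min(R), min(B))   # intersection of all boxes in B
    return (sum(max(w, 0) for (px, py, w) in points if in_box(box_max, px, py))
            + sum(min(w, 0) for (px, py, w) in points if in_box(box_min, px, py)))

# Interval coordinates describing the box set B.
L_, T_, R_, B_ = [1, 2], [1, 2], [3, 4, 5], [2, 3, 4]
ub = bound(L_, T_, R_, B_)
assert all(score(box) <= ub for box in product(L_, T_, R_, B_))
```

The positive weights are counted over a superset of every box and the negative weights over a subset, which is why the bound can never be violated.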







Example 1: Efficient Subwindow Search


        Efficient Subwindow Search (Lampert and Blaschko, 2008)




        PASCAL VOC 2007 detection challenge bounding boxes found using ESS







Example 2: Branch-and-Mincut
        Branch-and-Mincut (Lempitsky et al., 2008)
        Binary image segmentation with non-local interaction




                                                      [Figure: two example segmentations y1, y2 ∈ Y]






Example 2: Branch-and-Mincut
        Branch-and-Mincut (Lempitsky et al., 2008)
        Binary image segmentation with non-local interaction

        E(z, y) = C(y) + Σ_{p∈V} F^p(y) z_p + Σ_{p∈V} B^p(y) (1 − z_p) + Σ_{{p,q}∈E} P^{pq}(y) |z_p − z_q|,

                                                     g(x, y) = max_{z∈2^V} −E(z, y)

             Here: z ∈ {0, 1}^V is a binary pixel mask
             F^p(y), B^p(y) are foreground/background unary energies
             P^{pq}(y) is a standard pairwise energy
             Global dependencies on y





Example 2: Branch-and-Mincut
        Branch-and-Mincut (Lempitsky et al., 2008)
        Upper bound for any subset A ⊆ Y:

                 max_{y∈A} g(x, y) = max_{y∈A} max_{z∈2^V} −E(z, y)

                 = max_{y∈A} max_{z∈2^V} −[ C(y) + Σ_{p∈V} F^p(y) z_p + Σ_{p∈V} B^p(y) (1 − z_p)
                                            + Σ_{{p,q}∈E} P^{pq}(y) |z_p − z_q| ]

                 ≤ max_{z∈2^V} [ (max_{y∈A} −C(y)) + Σ_{p∈V} (max_{y∈A} −F^p(y)) z_p
                                 + Σ_{p∈V} (max_{y∈A} −B^p(y)) (1 − z_p)
                                 + Σ_{{p,q}∈E} (max_{y∈A} −P^{pq}(y)) |z_p − z_q| ].
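At toy scale this inequality can be verified by brute force (my sketch with made-up potentials; the real method evaluates the aggregated energy with a single graph cut rather than by enumeration):

```python
from itertools import product

V = range(3)                     # three "pixels"
EDGES = [(0, 1), (1, 2)]         # a tiny chain as the pairwise structure
A = [0, 1, 2]                    # current partition A: a small set of labels y

# Made-up potentials indexed by y (hypothetical numbers, chosen arbitrarily).
C = {0: 1.0, 1: 0.5, 2: 2.0}
F = {y: [0.3 * y + 0.1 * p for p in V] for y in A}
B = {y: [0.8 - 0.2 * y + 0.05 * p for p in V] for y in A}
P = {y: {e: 0.4 + 0.1 * y for e in EDGES} for y in A}

def energy(z, y):
    return (C[y]
            + sum(F[y][p] * z[p] for p in V)
            + sum(B[y][p] * (1 - z[p]) for p in V)
            + sum(P[y][e] * abs(z[e[0]] - z[e[1]]) for e in EDGES))

# Exact value over the partition: max over y in A and all masks z of -E(z, y).
exact = max(-energy(z, y) for y in A for z in product((0, 1), repeat=len(V)))

def aggregated(z):
    """Coefficient-wise max_{y in A}(-.) pulled inside the sum, evaluated at z."""
    return (max(-C[y] for y in A)
            + sum(max(-F[y][p] for y in A) * z[p] for p in V)
            + sum(max(-B[y][p] for y in A) * (1 - z[p]) for p in V)
            + sum(max(-P[y][e] for y in A) * abs(z[e[0]] - z[e[1]]) for e in EDGES))

bound = max(aggregated(z) for z in product((0, 1), repeat=len(V)))
assert bound >= exact            # the aggregated problem upper-bounds the partition
```

The inequality holds term by term because z_p, (1 − z_p) and |z_p − z_q| are all non-negative, so replacing each coefficient by its maximum over A can only increase the value.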



G: Integrality/Relaxations




                      [Diagram: a hard problem can be attacked by giving up one of:
                       generality, optimality, worst-case complexity, integrality, determinism]







Problem Relaxations

                  Optimization problems (maximizing g : G → R over Y ⊆ G) can
                  become easier if
                             the feasible set is enlarged, and/or
                             the objective function is replaced with a bound.

                 [Plot: objective g over the feasible set Y and bound h over the
                  enlarged set Z ⊇ Y, with the bound value h(z) dominating g(y)]






         Definition (Relaxation (Geoffrion, 1974))
         Given two optimization problems (g , Y, G) and (h, Z, G), the problem
         (h, Z, G) is said to be a relaxation of (g , Y, G) if,
            1. Z ⊇ Y, i.e. the feasible set of the relaxation contains the feasible
               set of the original problem, and
            2. ∀y ∈ Y : h(y ) ≥ g (y ), i.e. over the original feasible set the objective
               function h achieves no smaller values than the objective function g .





Relaxations

                  Relaxed solution z ∗ provides a bound:

                                        h(z ∗ ) ≥ g (y ∗ ), (maximization: upper bound)
                                        h(z ∗ ) ≤ g (y ∗ ), (minimization: lower bound)

                  Relaxation is typically tractable
                  Evidence that relaxations are “better” for learning (Kulesza and
                  Pereira, 2007), (Finley and Joachims, 2008), (Martins et al., 2009)
                  Drawback: z* ∈ Z \ Y possible

         Are there principled methods to construct relaxations?
                  Linear Programming Relaxations
                  Lagrangian relaxation
                  Lagrangian/Dual decomposition



Linear Programming Relaxation

                                  [Figure: integer points inside the polyhedron Ay ≤ b
                                   in the (y1, y2) plane]


                                                              max_y  c⊤y
                                                              sb.t.  Ay ≤ b,
                                                                     y is integer.






Linear Programming Relaxation

                                  [Figure: left, the integer program over the lattice points
                                   in Ay ≤ b; right, its relaxation over the whole polyhedron]


                                                                  max_y  c⊤y
                                                                  sb.t.  Ay ≤ b.







MAP-MRF Linear Programming Relaxation



                             [Figure: factor graph with variable nodes Y1, Y2, unary factors
                              θ1, θ2 and a pairwise factor θ1,2 over Y1 × Y2]







MAP-MRF Linear Programming Relaxation

                             [Figure: the same factor graph with an example labeling
                              y1 = 2, y2 = 3, i.e. (y1, y2) = (2, 3)]

                  Energy E(y; x) = θ1(y1; x) + θ1,2(y1, y2; x) + θ2(y2; x)
                  Probability p(y|x) = (1/Z(x)) exp{−E(y; x)}
                  MAP prediction: argmax_{y∈Y} p(y|x) = argmin_{y∈Y} E(y; x)







MAP-MRF Linear Programming Relaxation

                        [Figure: the factor graph with indicator variables µ1, µ1,2, µ2
                         attached to the factors]

               µ1 ∈ {0, 1}^{Y1},    µ1,2 ∈ {0, 1}^{Y1×Y2},    µ2 ∈ {0, 1}^{Y2}
               µ1(y1) = Σ_{y2∈Y2} µ1,2(y1, y2),    Σ_{y1∈Y1} µ1,2(y1, y2) = µ2(y2)
               Σ_{y1∈Y1} µ1(y1) = 1,    Σ_{(y1,y2)∈Y1×Y2} µ1,2(y1, y2) = 1,    Σ_{y2∈Y2} µ2(y2) = 1

                  Energy is now linear: E(y; x) = ⟨θ, µ⟩ =: −g(y, x)





MAP-MRF Linear Programming Relaxation (cont)



                      max_µ   Σ_{i∈V} Σ_{yi∈Yi} θi(yi) µi(yi) + Σ_{{i,j}∈E} Σ_{(yi,yj)∈Yi×Yj} θi,j(yi, yj) µi,j(yi, yj)
                      sb.t.   Σ_{yi∈Yi} µi(yi) = 1,   ∀i ∈ V,
                              Σ_{yj∈Yj} µi,j(yi, yj) = µi(yi),   ∀{i,j} ∈ E, ∀yi ∈ Yi,
                              µi(yi) ∈ {0, 1},   ∀i ∈ V, ∀yi ∈ Yi,
                              µi,j(yi, yj) ∈ {0, 1},   ∀{i,j} ∈ E, ∀(yi, yj) ∈ Yi × Yj.




                  → NP-hard integer linear program





MAP-MRF Linear Programming Relaxation (cont)


                      max_µ   Σ_{i∈V} Σ_{yi∈Yi} θi(yi) µi(yi) + Σ_{{i,j}∈E} Σ_{(yi,yj)∈Yi×Yj} θi,j(yi, yj) µi,j(yi, yj)
                      sb.t.   Σ_{yi∈Yi} µi(yi) = 1,   ∀i ∈ V,
                              Σ_{yj∈Yj} µi,j(yi, yj) = µi(yi),   ∀{i,j} ∈ E, ∀yi ∈ Yi,
                              µi(yi) ∈ [0, 1],   ∀i ∈ V, ∀yi ∈ Yi,
                              µi,j(yi, yj) ∈ [0, 1],   ∀{i,j} ∈ E, ∀(yi, yj) ∈ Yi × Yj.


                  → linear program





MAP-MRF LP Relaxation, Works

         MAP-MRF LP Analysis
            (Wainwright et al., Trans. Inf. Theory, 2005), (Weiss et al., UAI 2007),
            (Werner, PAMI 2005), (Kolmogorov, PAMI 2006)

         Improving tightness
                  (Komodakis and Paragios, ECCV 2008), (Kumar and Torr, ICML
                  2008), (Sontag and Jaakkola, NIPS 2007), (Sontag et al., UAI
                  2008), (Werner, CVPR 2008), (Kumar et al., NIPS 2007)

         Algorithms based on the LP
                  (Globerson and Jaakkola, NIPS 2007), (Kumar and Torr, ICML
                  2008), (Sontag et al., UAI 2008)




MAP-MRF LP Relaxation, Known results



            1. LOCAL is tight iff the factor graph is a forest (acyclic)
            2. All labelings are vertices of LOCAL (Wainwright and Jordan, 2008)
            3. For cyclic graphs there are additional fractional vertices.
            4. If all factors have regular energies, the fractional solutions are never
               optimal (Wainwright and Jordan, 2008)
            5. For models with only binary states: half-integrality, integral variables
               are certain
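Result 3 can be seen concretely on a frustrated 3-cycle (a standard constructed example, not from the slides): a half-integral µ is feasible for all LOCAL constraints and strictly beats every integral labeling:

```python
from itertools import product

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]                   # a 3-cycle, binary states
theta = lambda yi, yj: 1.0 if yi != yj else 0.0    # reward disagreement; unaries zero

# Best integral labeling: at most 2 of the 3 cycle edges can disagree.
best_int = max(sum(theta(y[i], y[j]) for (i, j) in edges)
               for y in product((0, 1), repeat=3))

# Half-integral point: mu_i = (1/2, 1/2); mu_ij puts mass 1/2 on each
# disagreeing pair of states.
mu_i = {i: {0: 0.5, 1: 0.5} for i in nodes}
mu_ij = {e: {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.0} for e in edges}

for (i, j) in edges:                               # LOCAL marginalization constraints
    for s in (0, 1):
        assert sum(mu_ij[(i, j)][(s, yj)] for yj in (0, 1)) == mu_i[i][s]
        assert sum(mu_ij[(i, j)][(yi, s)] for yi in (0, 1)) == mu_i[j][s]

frac_val = sum(theta(yi, yj) * mu_ij[e][(yi, yj)]
               for e in edges for (yi, yj) in product((0, 1), repeat=2))
assert best_int == 2.0 and frac_val == 3.0         # the relaxation is loose here
```

On a chain (drop the edge (0, 2)) the same construction no longer beats the integral optimum, consistent with result 1: LOCAL is tight on forests.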







Lagrangian Relaxation

                                                                max_y  g(y)
                                                                sb.t.  y ∈ D,
                                                                       y ∈ C.

         Assumption
             Optimizing g (y ) over y ∈ D is “easy”.
                  Optimizing over y ∈ D ∩ C is hard.

         High-level idea
             Incorporate y ∈ C constraint into objective function
         For an excellent introduction, see (Guignard, “Lagrangean Relaxation”,
         TOP 2003) and (Lemaréchal, “Lagrangian Relaxation”, 2001).




Lagrangian Relaxation (cont)

         Recipe
            1. Express C in terms of equalities and inequalities

                      C = {y ∈ G : ui (y ) = 0, ∀i = 1, . . . , I , vj (y ) ≤ 0, ∀j = 1, . . . , J},

                             ui : G → R differentiable, typically affine,
                             vj : G → R differentiable, typically convex.
            2. Introduce Lagrange multipliers, yielding

                                                 max_y  g(y)
                                                 sb.t.  y ∈ D,
                                                        ui(y) = 0 : λ,    i = 1, . . . , I,
                                                        vj(y) ≤ 0 : µ,    j = 1, . . . , J.




Lagrangian Relaxation (cont)



            3. Build partial Lagrangian

                                                    max_y  g(y) + λ⊤u(y) + µ⊤v(y)                (1)
                                                    sb.t.  y ∈ D.






Lagrangian Relaxation (cont)



         Theorem (Weak Duality of Lagrangean Relaxation)
         For differentiable functions ui : G → R and vj : G → R, and for any
         λ ∈ R^I and any non-negative µ ∈ R^J, µ ≥ 0, problem (1) is a relaxation
         of the original problem: its value is larger than or equal to the optimal
         value of the original problem.






Lagrangian Relaxation (cont)


            3. Build partial Lagrangian

                                      q(λ, µ) :=  max_y  g(y) + λ⊤u(y) + µ⊤v(y)           (1)
                                                  sb.t.  y ∈ D.

            4. Minimize upper bound wrt λ, µ

                                                                      min         q(λ, µ)
                                                                       λ,µ
                                                                     sb.t. µ ≥ 0


                             → efficiently solvable if q(λ, µ) can be evaluated






Optimizing Lagrangian Dual Functions
            4. Minimize upper bound wrt λ, µ

                                                   min_{λ,µ}   q(λ, µ)                 (2)
                                                   s.t.        µ ≥ 0

         Theorem (Lagrangian Dual Function)
            1. q is convex in λ and µ, hence Problem (2) is a convex minimization
               problem.
            2. If q is unbounded below, then the original problem is infeasible.
            3. For any λ, µ ≥ 0, let

                                   y (λ, µ) = argmax_{y ∈D}  g (y ) + λᵀu(y ) + µᵀv (y ).

                  Then a subgradient of q can be constructed by evaluating the
                  constraint functions at y (λ, µ) as

                      u(y (λ, µ)) ∈ ∂q(λ, µ)/∂λ,    and    v (y (λ, µ)) ∈ ∂q(λ, µ)/∂µ.



Example: MAP-MRF Message Passing

                      max_µ   Σ_{i∈V} Σ_{yi ∈Yi } θi (yi ) µi (yi )  +  Σ_{{i,j}∈E} Σ_{(yi ,yj )∈Yi ×Yj } θi,j (yi , yj ) µi,j (yi , yj )
                      s.t.    Σ_{yi ∈Yi } µi (yi ) = 1,                          ∀i ∈ V ,
                              Σ_{(yi ,yj )∈Yi ×Yj } µi,j (yi , yj ) = 1,         ∀{i, j} ∈ E ,
                              Σ_{yj ∈Yj } µi,j (yi , yj ) = µi (yi ),            ∀{i, j} ∈ E , ∀yi ∈ Yi ,
                              µi (yi ) ∈ {0, 1},                                 ∀i ∈ V , ∀yi ∈ Yi ,
                              µi,j (yi , yj ) ∈ {0, 1},                          ∀{i, j} ∈ E , ∀(yi , yj ) ∈ Yi × Yj .





Example: MAP-MRF Message Passing

                      max_µ   Σ_{i∈V} Σ_{yi ∈Yi } θi (yi ) µi (yi )  +  Σ_{{i,j}∈E} Σ_{(yi ,yj )∈Yi ×Yj } θi,j (yi , yj ) µi,j (yi , yj )
                      s.t.    Σ_{yi ∈Yi } µi (yi ) = 1,                          ∀i ∈ V ,
                              Σ_{(yi ,yj )∈Yi ×Yj } µi,j (yi , yj ) = 1,         ∀{i, j} ∈ E ,
                              Σ_{yj ∈Yj } µi,j (yi , yj ) = µi (yi ) : φi,j (yi ),   ∀{i, j} ∈ E , ∀yi ∈ Yi ,
                              µi (yi ) ∈ {0, 1},                                 ∀i ∈ V , ∀yi ∈ Yi ,
                              µi,j (yi , yj ) ∈ {0, 1},                          ∀{i, j} ∈ E , ∀(yi , yj ) ∈ Yi × Yj .

                  If the marginalization constraint were not there, the problem would be trivial!
                  (Wainwright and Jordan, 2008, section 4.1.3)




Example: MAP-MRF Message Passing

          q(φ) :=     max_µ   Σ_{i∈V} Σ_{yi ∈Yi } θi (yi ) µi (yi )  +  Σ_{{i,j}∈E} Σ_{(yi ,yj )∈Yi ×Yj } θi,j (yi , yj ) µi,j (yi , yj )
                              +  Σ_{{i,j}∈E} Σ_{yi ∈Yi } φi,j (yi ) ( Σ_{yj ∈Yj } µi,j (yi , yj ) − µi (yi ) )
                      s.t.    Σ_{yi ∈Yi } µi (yi ) = 1,                          ∀i ∈ V ,
                              Σ_{(yi ,yj )∈Yi ×Yj } µi,j (yi , yj ) = 1,         ∀{i, j} ∈ E ,
                              µi (yi ) ∈ {0, 1},                                 ∀i ∈ V , ∀yi ∈ Yi ,
                              µi,j (yi , yj ) ∈ {0, 1},                          ∀{i, j} ∈ E , ∀(yi , yj ) ∈ Yi × Yj .







Example: MAP-MRF Message Passing

                 q(φ) :=     max_µ   Σ_{i∈V} Σ_{yi ∈Yi } ( θi (yi ) − Σ_{j∈V :{i,j}∈E} φi,j (yi ) ) µi (yi )
                                     +  Σ_{{i,j}∈E} Σ_{(yi ,yj )∈Yi ×Yj } ( θi,j (yi , yj ) + φi,j (yi ) ) µi,j (yi , yj )
                             s.t.    Σ_{yi ∈Yi } µi (yi ) = 1,                          ∀i ∈ V ,
                                     Σ_{(yi ,yj )∈Yi ×Yj } µi,j (yi , yj ) = 1,         ∀{i, j} ∈ E ,
                                     µi (yi ) ∈ {0, 1},                                 ∀i ∈ V , ∀yi ∈ Yi ,
                                     µi,j (yi , yj ) ∈ {0, 1},                          ∀{i, j} ∈ E , ∀(yi , yj ) ∈ Yi × Yj .
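With the marginalization constraints dualized, q(φ) decomposes into independent maximizations per node and per edge. A minimal sketch for a hypothetical two-node, one-edge model, dualizing the constraint in both directions as in the reparametrization picture; the potentials are made-up numbers:

```python
import numpy as np

# Hypothetical two-node model; all potentials are illustrative numbers.
theta_1 = np.array([0.0, 2.0])            # unary of node 1, |Y1| = 2
theta_2 = np.array([1.0, 0.0])            # unary of node 2, |Y2| = 2
theta_12 = np.array([[0.0, 1.0],
                     [3.0, 0.0]])         # pairwise potential on edge {1,2}

def q(phi_12, phi_21):
    """Decomposed dual objective: each reparametrized block is
    maximized independently (one label per node, one pair per edge)."""
    val = np.max(theta_1 - phi_12)                                # node 1 block
    val += np.max(theta_2 - phi_21)                               # node 2 block
    val += np.max(theta_12 + phi_12[:, None] + phi_21[None, :])   # edge block
    return val

# For any messages phi, q upper-bounds the MAP value (weak duality).
print(q(np.zeros(2), np.zeros(2)))
```

Evaluating q this way costs only one pass over the nodes and edges, which is what makes subgradient updates on φ cheap.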







Example: MAP-MRF Message Passing

                  [Figure: reparametrization view for a two-node model. The messages
                  φ1,2 ∈ R^{Y1 } and φ2,1 ∈ R^{Y2 } are subtracted from the unaries θ1 and θ2
                  and added to the pairwise potential θ1,2 .]

                  Parent-to-child region-graph messages (Meltzer et al., UAI 2009)
                  Max-sum diffusion reparametrization (Werner, PAMI 2007)





Example behaviour of objectives

                      [Plot: dual objective and primal objective versus iteration (0 to 250);
                      the dual objective upper-bounds the primal objective throughout]




Primal Solution Recovery

         Assume we solved the dual problem for (λ∗ , µ∗ )
            1. Can we obtain a primal solution y ∗ ?
            2. Can we say something about the relaxation quality?

         Theorem (Sufficient Optimality Conditions)
         If for a given λ, µ ≥ 0, we have u(y (λ, µ)) = 0 and v (y (λ, µ)) ≤ 0
         (primal feasibility) and further we have

                                λᵀu(y (λ, µ)) = 0,                      and       µᵀv (y (λ, µ)) = 0,

         (complementary slackness), then
                  y (λ, µ) is an optimal primal solution to the original problem, and
                  (λ, µ) is an optimal solution to the dual problem.


Primal Solution Recovery (cont)
         But,
                  we might never see a solution satisfying the optimality conditions,
                  such a solution exists only if there is no duality gap, i.e. q(λ∗ , µ∗ ) = g (x, y ∗ )

         Special case: integer linear programs
             we can always reconstruct a primal solution to
                                                               min_y   g (y )                         (3)
                                                               s.t.    y ∈ conv(D),
                                                                       y ∈ C.
                  For example, for subgradient method updates it is known that
                  (Anstreicher and Wolsey, MathProg 2009)

                                     lim_{T →∞}  (1/T ) Σ_{t=1}^{T } y (λt , µt )  →  y ∗ of (3).
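A sketch of this averaging scheme on a tiny hypothetical integer program: minimize g(y) = −y1 − 2·y2 over D = {0, 1}² with the dualized inequality v(y) = y1 + y2 − 1 ≤ 0; all numbers are illustrative:

```python
import itertools

# Toy integer program (illustrative numbers): minimize g over D = {0,1}^2
# subject to the dualized constraint v(y) <= 0 with multiplier mu >= 0.
D = list(itertools.product([0, 1], repeat=2))

def g(y):
    return -y[0] - 2 * y[1]

def v(y):
    return y[0] + y[1] - 1

mu, iterates = 0.0, []
for t in range(1, 201):
    y_mu = min(D, key=lambda y: g(y) + mu * v(y))   # minimize the partial Lagrangian
    iterates.append(y_mu)
    mu = max(0.0, mu + (1.0 / t) * v(y_mu))         # projected subgradient ascent on mu

# Averaging the subproblem solutions approaches a primal solution of the
# relaxed problem over conv(D), as in (Anstreicher and Wolsey, 2009).
T = len(iterates)
y_bar = tuple(sum(y[i] for y in iterates) / T for i in range(2))
print(y_bar)
```

Here the running average settles near (0, 1), which is feasible and optimal for the relaxation of this instance.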




Dual/Lagrangian Decomposition
         Additive structure
                                           max_y   Σ_{k=1}^{K } gk (y )
                                           s.t.    y ∈ Y,

         such that Σ_{k=1}^{K } gk (y ) is hard, but for any k,

                                           max_y   gk (y )
                                           s.t.    y ∈ Y

         is tractable.
               (Guignard, “Lagrangean Decomposition”, MathProg 1987),
               the original paper for “dual decomposition”
               (Conejo et al., “Decomposition Techniques in Mathematical
               Programming”, 2006), continuous variable problems

Dual/Lagrangian Decomposition

         Idea
            1. Selectively duplicate variables (“variable splitting”)

                                     max_{y1 ,...,yK ,y }   Σ_{k=1}^{K } gk (yk )
                                     s.t.    y ∈ Y,
                                             yk ∈ Y,          k = 1, . . . , K ,
                                             y = yk : λk ,    k = 1, . . . , K .    (4)

            2. Add coupling equality constraints between duplicated variables
            3. Apply Lagrangian relaxation to the coupling constraints

         Also known as dual decomposition and variable splitting.


Dual/Lagrangian Decomposition (cont)




                                     max_{y1 ,...,yK ,y }   Σ_{k=1}^{K } gk (yk ) + Σ_{k=1}^{K } λkᵀ(y − yk )
                                     s.t.    y ∈ Y,
                                             yk ∈ Y,          k = 1, . . . , K .







Dual/Lagrangian Decomposition (cont)




                                     max_{y1 ,...,yK ,y }   Σ_{k=1}^{K } ( gk (yk ) − λkᵀyk ) + ( Σ_{k=1}^{K } λk )ᵀ y
                                     s.t.    y ∈ Y,
                                             yk ∈ Y,          k = 1, . . . , K .


                  Problem is decomposed







Dual/Lagrangian Decomposition (cont)
                        q(λ) :=     max_{y1 ,...,yK ,y }   Σ_{k=1}^{K } ( gk (yk ) − λkᵀyk ) + ( Σ_{k=1}^{K } λk )ᵀ y
                                    s.t.    y ∈ Y,
                                            yk ∈ Y,          k = 1, . . . , K .

                  Dual optimization problem is

                                    min_λ   q(λ)
                                    s.t.    Σ_{k=1}^{K } λk = 0,

                  where {λ | Σ_{k=1}^{K } λk = 0} is the domain on which q(λ) > −∞
                  Projected subgradient method, using the projected component

                                    (y − yk ) − (1/K ) Σ_{ℓ=1}^{K } (y − yℓ )  =  (1/K ) Σ_{ℓ=1}^{K } yℓ − yk .
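The resulting scheme can be sketched on a hypothetical additive problem: two score tables over five states, where each gk alone is easy to maximize and the multiplier vectors act on indicator representations of the states. All numbers are illustrative:

```python
import numpy as np

# Toy additive problem: maximize g1(y) + g2(y) over y in {0,...,4};
# each gk alone is easy, the coupled sum is treated as "hard".
states = np.arange(5)
g = [-(states - 1.0) ** 2, -(states - 3.0) ** 2]   # gk tabulated over the states
K = len(g)

lam = np.zeros((K, len(states)))    # multipliers for the couplings y = yk

for t in range(1, 301):
    # Decoupled subproblems: yk = argmax_y gk(y) - lam_k(y)
    yk = np.array([np.argmax(g[k] - lam[k]) for k in range(K)])
    ind = np.eye(len(states))[yk]   # indicator vectors of the chosen yk
    # Projected subgradient step; the projection keeps sum_k lam_k = 0
    lam += (1.0 / t) * (ind - ind.mean(axis=0))

dual = sum((g[k] - lam[k]).max() for k in range(K))
primal = (g[0] + g[1]).max()
print(dual, primal)                 # weak duality: dual >= primal
```

On this instance the subproblem maximizers eventually agree, the subgradient vanishes, and the dual value matches the primal optimum.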




Primal Interpretation


                  Primal interpretation for the linear case due to (Guignard, 1987)

         Theorem (Lagrangian Decomposition Primal)
         Let g (x, y ) = c(x)ᵀy be linear. Then the optimal dual value equals the
         optimal value of the following relaxed primal optimization problem.

                                    min_y   Σ_{k=1}^{K } gk (x, y )
                                    s.t.    y ∈ conv(Yk ),     ∀k = 1, . . . , K .







Primal Interpretation (cont)

          [Figure: the sets conv(Y1 ) and conv(Y2 ) and their intersection
          conv(Y1 ) ∩ conv(Y2 ), with the relaxed solution yD and the optimum y ∗
          of the linear objective c ᵀy ]

                                    min_y   Σ_{k=1}^{K } gk (x, y )
                                    s.t.    y ∈ conv(Yk ),     ∀k = 1, . . . , K .



Applications in Computer Vision

         Very broadly applicable
                  (Komodakis et al., ICCV 2007)
                  (Woodford et al., ICCV 2009)
                  (Strandmark and Kahl, CVPR 2010)
                  (Vicente et al., ICCV 2009)
                  (Torresani et al., ECCV 2008)
                  (Werner, TechReport 2009)

         Further reading
             (Sontag et al., “Introduction to Dual Decomposition for Inference”,
             OfML Vol, 2011)



Determinism




                                                              Hard problem

                     Generality   Optimality   Worst-case complexity   Integrality   Determinism







Giving up Determinism

        Algorithms that use randomness, and
                 are non-deterministic: the result does not depend exclusively on
                 the input, and
                 are allowed to return a wrong result (with low probability), and/or
                 have a random runtime.

        Is it a good idea?
                 for some problems randomized algorithms are the only known
                 efficient algorithms,
                 using randomness for hard problems has a long tradition in
                 sampling-based algorithms, physics, computational chemistry, etc.
                 such algorithms can be simple and effective, but proving theorems
                 about them can be much harder




Simulated Annealing


        Basic idea (Kirkpatrick et al., 1983)
           1. Define a distribution p(y ; T ) that concentrates mass on states y
              with high values g (x, y )
           2. Simulate p(y ; T )
           3. Increase concentration and repeat

        Defining p(y ; T )
                 Boltzmann distribution

        Simulating p(y ; T )
            MCMC: usually done using a Metropolis chain or a Gibbs sampler
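The three steps can be sketched as follows; the objective g, the cooling schedule, and all constants below are illustrative choices, not the tutorial's:

```python
import math
import random

random.seed(0)                               # fix the randomness for reproducibility

def g(y):
    """Made-up score over states y in {0,...,39}: two bumps, global max near y = 12."""
    return 30 * math.exp(-(y - 12) ** 2 / 20.0) + 25 * math.exp(-(y - 30) ** 2 / 8.0)

states = list(range(40))
y = random.choice(states)                    # random initial state

T = 100.0
while T > 0.05:
    for _ in range(50):                      # Metropolis steps at fixed temperature
        y_new = random.choice(states)        # symmetric proposal
        # Accept uphill moves always, downhill moves with prob exp(dg / T)
        if random.random() < math.exp(min(0.0, (g(y_new) - g(y)) / T)):
            y = y_new
    T *= 0.9                                 # increase concentration: cool down

print(y, g(y))                               # a state with (near-)maximal g
```

Because the runtime and the result both depend on the random draws, this is exactly the kind of non-deterministic algorithm described above.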






Simulated Annealing (cont)



        Definition (Boltzmann Distribution)
        For a finite set Y, a function g : X × Y → R and a temperature
        parameter T > 0, let

                                             p(y ; T ) = (1/Z (T )) exp( g (x, y ) / T ),                 (5)

        with Z (T ) = Σ_{y ∈Y} exp( g (x, y ) / T ), be the Boltzmann distribution for g
        over Y at temperature T .
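A quick numeric check of the definition on a made-up score table shows the concentration effect that annealing exploits; the table g is purely illustrative:

```python
import math

# Toy score function over a small state space Y = {0,...,9};
# the peak is at y = 3 (illustrative numbers).
g = [1.0, 2.0, 5.0, 9.0, 6.0, 2.0, 1.0, 0.5, 0.2, 0.1]

def boltzmann(g, T):
    """p(y; T) proportional to exp(g(y)/T), normalized by Z(T)."""
    w = [math.exp(v / T) for v in g]
    Z = sum(w)
    return [x / Z for x in w]

for T in (100.0, 10.0, 4.0, 1.0):
    p = boltzmann(g, T)
    print(T, round(max(p), 3))   # mass on the argmax grows as T decreases
```

At high T the distribution is nearly uniform; as T shrinks, almost all probability mass sits on the maximizer of g.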







Simulated Annealing (cont)
                                                                      Function g
                                            40

                                            35

                                            30
                           Function value




                                            25

                                            20

                                            15

                                            10

                                            5

                                            0
                                             0   5    10          15          20    25   30      35         40
                                                                          State y
Sebastian Nowozin and Christoph H. Lampert
Part 7: Structured Prediction and Energy Minimization (2/2)
G: Worst-case Complexity                                G: Integrality/Relaxations                Determinism        End

Determinism



Simulated Annealing (cont)
                                                            Distribution P(T=100.0)
                                               1

                                              0.9

                                              0.8

                                              0.7
                           Probability mass




                                              0.6

                                              0.5

                                              0.4

                                              0.3

                                              0.2

                                              0.1

                                               0
                                                0   5     10          15         20     25   30      35         40
                                                                              State y
Sebastian Nowozin and Christoph H. Lampert
Part 7: Structured Prediction and Energy Minimization (2/2)
G: Worst-case Complexity                                G: Integrality/Relaxations                Determinism        End

Determinism



Simulated Annealing (cont)
                                                              Distribution P(T=10.0)
                                               1

                                              0.9

                                              0.8

                                              0.7
                           Probability mass




                                              0.6

                                              0.5

                                              0.4

                                              0.3

                                              0.2

                                              0.1

                                               0
                                                0   5     10          15         20     25   30      35         40
                                                                              State y
[Figure: distribution p(y; T = 4.0), probability mass over states y = 0, ..., 40]
[Figure: distribution p(y; T = 1.0), probability mass over states y = 0, ..., 40]
[Figure: distribution p(y; T = 0.1), probability mass over states y = 0, ..., 40]
[Figure: the objective function g, function value over states y = 0, ..., 40]
Simulated Annealing (cont)


1:  y* = SimulatedAnnealing(Y, g, T, N)
2:  Input:
3:      Y finite feasible set
4:      g: X × Y → R objective function
5:      T ∈ R^K sequence of K decreasing temperatures
6:      N ∈ N^K sequence of K step lengths
7:  (y, y*) ← (y0, y0)
8:  for k = 1, ..., K do
9:      y ← simulate Markov chain p(y; T(k)) starting from y for N(k) steps
10: end for
11: y* ← y
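The procedure can be sketched in code. This is an illustrative sketch only: the slides leave the Markov chain p(y; T) and the input x unspecified, so a uniform-proposal Metropolis chain and a toy objective are assumed here:

```python
import math
import random

def simulated_annealing(states, g, temperatures, steps, y0, seed=0):
    """Sketch of the SimulatedAnnealing procedure for maximizing g over a
    finite set.  The uniform-proposal Metropolis chain is an assumption."""
    rng = random.Random(seed)
    y = y_best = y0
    for T, n in zip(temperatures, steps):      # T(k) and N(k) for k = 1..K
        for _ in range(n):
            y_prop = rng.choice(states)        # propose a uniform random state
            delta = g(y_prop) - g(y)
            # Metropolis rule: accept improvements, else accept w.p. exp(delta/T)
            if delta >= 0 or rng.random() < math.exp(delta / T):
                y = y_prop
            if g(y) > g(y_best):               # track the best state seen
                y_best = y
    return y_best

# toy run: maximize g(y) = -|y - 12| over y in {0, ..., 40}
states = list(range(41))
g = lambda y: -abs(y - 12)
T0, alpha, K = 10.0, 0.8, 30                   # geometric schedule T(k) = T0 * alpha^k
temps = [T0 * alpha ** k for k in range(1, K + 1)]
y_star = simulated_annealing(states, g, temps, [50] * K, y0=0)
```

Returning the best state seen (rather than only the final state, as in line 11 above) is a common practical variant that never hurts.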



Guarantees

Theorem (Guaranteed Optimality (Geman and Geman, 1984))
If there exists a k0 ≥ 2 such that for all k ≥ k0 the temperature T(k) satisfies

    T(k) ≥ |Y| · (max_{y∈Y} g(x, y) − min_{y∈Y} g(x, y)) / log k,

then the probability of seeing the maximizer y* of g tends to one as k → ∞.

        This logarithmic schedule is too slow in practice.
        Faster schedules are used in practice, such as T(k) = T0 · α^k.
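The gap between the guaranteed schedule and the practical one can be seen numerically; T0 and α below are illustrative constants (the theorem only lower-bounds T(k), it does not fix them):

```python
import math

T0, alpha, K = 10.0, 0.9, 50   # illustrative constants, not from the slides

# Geman-style logarithmic schedule: slow cooling, but carries the guarantee
log_schedule = [T0 / math.log(1 + k) for k in range(1, K + 1)]

# geometric schedule T(k) = T0 * alpha^k: fast cooling, no optimality guarantee
geo_schedule = [T0 * alpha ** k for k in range(1, K + 1)]
```

After 50 steps the geometric schedule is already near zero while the logarithmic one is still warm, which is exactly why the guaranteed schedule is considered impractically slow.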


Example: Simulated Annealing
        (Geman and Geman, 1984)


[Figures: factor graph model; approximate sample after 200 sweeps]
                 2D 8-neighbor 128-by-128 grid, 5 possible labels
                 Pairwise Potts potentials
                 Task: restoration
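A single Gibbs-sampling sweep for such a grid Potts model might look as follows. This is a sketch under assumed conventions: energies are unary terms plus a penalty β per disagreeing 8-neighbor pair, and the grid size and constants are illustrative, not the 128-by-128 setup of the slides:

```python
import math
import random

def gibbs_sweep(grid, labels, unary, beta, T, rng):
    """One Gibbs sweep over a 2D Potts model at temperature T.
    Energy: unary[i][j][l] per pixel plus beta per disagreeing 8-neighbor pair."""
    H, W = len(grid), len(grid[0])
    nbrs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]
    for i in range(H):
        for j in range(W):
            energies = []
            for l in labels:
                e = unary[i][j][l]
                for di, dj in nbrs:
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W and grid[ni][nj] != l:
                        e += beta                    # Potts disagreement penalty
                energies.append(e)
            # sample from the local conditional p(l) proportional to exp(-E_l / T)
            m = min(energies)
            w = [math.exp(-(e - m) / T) for e in energies]
            r = rng.random() * sum(w)
            acc = 0.0
            for l, wl in zip(labels, w):
                acc += wl
                if r <= acc:
                    grid[i][j] = l
                    break
            else:
                grid[i][j] = labels[-1]              # numerical guard
    return grid

# toy demo: strong unaries pull a 6x6 grid to the pattern (i + j) % 3
labels = [0, 1, 2]
target = [[(i + j) % 3 for j in range(6)] for i in range(6)]
unary = [[[0.0 if l == target[i][j] else 100.0 for l in labels]
          for j in range(6)] for i in range(6)]
grid = [[0] * 6 for _ in range(6)]
gibbs_sweep(grid, labels, unary, beta=0.5, T=0.1, rng=random.Random(0))
```

Annealing then simply repeats such sweeps while lowering T according to the chosen schedule.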
[Figures: approximate sample after 200 sweeps; the same sample corrupted with Gaussian noise]
[Figures: corrupted input image; restoration after 25 sweeps]
                 Derive unary energies from the corrupted input image (optimal)
                 Simulated annealing, 25 Gibbs sampling sweeps
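Deriving unary energies from a corrupted image is straightforward under a Gaussian noise model: the unary term is the negative log-likelihood of the observation given the label. A sketch with illustrative label means and σ (these values are assumptions, not from the slides):

```python
def unary_energy(pixel_value, label_mean, sigma):
    """Negative log-likelihood (up to a constant) of observing pixel_value
    when the true label has intensity label_mean, under Gaussian noise."""
    return (pixel_value - label_mean) ** 2 / (2.0 * sigma ** 2)

label_means = [0.0, 0.25, 0.5, 0.75, 1.0]   # 5 labels, illustrative intensities
sigma = 0.1                                  # assumed known noise level
x_i = 0.27                                   # one observed (noisy) pixel
energies = [unary_energy(x_i, mu, sigma) for mu in label_means]
best = min(range(len(label_means)), key=lambda l: energies[l])
```

With the true noise model the resulting unaries are optimal in the sense that they encode exactly the per-pixel likelihood used to corrupt the image.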
[Figures: corrupted input image; restoration after 300 sweeps]
                 Simulated annealing, 300 Gibbs sampling sweeps
                 Schedule T(k) = 4.0 / log(1 + k)
[Figures: noise-free sample; restoration after 300 sweeps]
                 Simulated annealing, 300 Gibbs sampling sweeps
                 Schedule T(k) = 4.0 / log(1 + k)

End



The End...


             Tutorial available in written form in
             Now Publishers' Foundations and Trends (FnT) in
             Computer Graphics and Vision series
             http://www.nowpublishers.com/
             PDF available on the authors' homepages




                                                              Thank you!



Contenu connexe

Plus de zukun

Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluationzukun
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-softwarezukun
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptorszukun
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectorszukun
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-introzukun
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video searchzukun
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video searchzukun
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video searchzukun
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learningzukun
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionzukun
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick startzukun
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysiszukun
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structureszukun
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities zukun
 
Icml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant featuresIcml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant featureszukun
 
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...zukun
 
Quoc le tera-scale deep learning
Quoc le   tera-scale deep learningQuoc le   tera-scale deep learning
Quoc le tera-scale deep learningzukun
 
Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Featur...
Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Featur...Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Featur...
Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Featur...zukun
 
Lecun 20060816-ciar-01-energy based learning
Lecun 20060816-ciar-01-energy based learningLecun 20060816-ciar-01-energy based learning
Lecun 20060816-ciar-01-energy based learningzukun
 
Lecun 20060816-ciar-02-deep learning for generic object recognition
Lecun 20060816-ciar-02-deep learning for generic object recognitionLecun 20060816-ciar-02-deep learning for generic object recognition
Lecun 20060816-ciar-02-deep learning for generic object recognitionzukun
 

Plus de zukun (20)

Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluation
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-software
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptors
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-intro
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video search
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video search
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learning
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer vision
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick start
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysis
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structures
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities
 
Icml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant featuresIcml2012 learning hierarchies of invariant features
Icml2012 learning hierarchies of invariant features
 
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Act...
 
Quoc le tera-scale deep learning
Quoc le   tera-scale deep learningQuoc le   tera-scale deep learning
Quoc le tera-scale deep learning
 
Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Featur...
Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Featur...Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Featur...
Deep Learning workshop 2010: Deep Learning of Invariant Spatiotemporal Featur...
 
Lecun 20060816-ciar-01-energy based learning
Lecun 20060816-ciar-01-energy based learningLecun 20060816-ciar-01-energy based learning
Lecun 20060816-ciar-01-energy based learning
 
Lecun 20060816-ciar-02-deep learning for generic object recognition
Lecun 20060816-ciar-02-deep learning for generic object recognitionLecun 20060816-ciar-02-deep learning for generic object recognition
Lecun 20060816-ciar-02-deep learning for generic object recognition
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Dernier (20)

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

05 structured prediction and energy minimization part 2

  • 1. G: Worst-case Complexity G: Integrality/Relaxations Determinism End Part 7: Structured Prediction and Energy Minimization (2/2) Sebastian Nowozin and Christoph H. Lampert Colorado Springs, 25th June 2011 Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 2. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Hard problem Generality Optimality Worst-case Integrality Determinism complexity Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 3. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Giving up Worst-case Complexity Worst-case complexity is an asymptotic behaviour Worst-case complexity quantifies worst case Practical case might be very different Issue: what is the distribution over inputs? Popular methods with bad or unknown worst-case complexity Simplex Method for Linear Programming Hash tables Branch-and-bound search Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 4. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Branch-and-bound Search Implicit enumeration: globally optimal. Choose 1. a partitioning of the solution space, 2. a branching scheme, 3. upper and lower bounds over partitions. Branch-and-bound is very flexible: many tuning possibilities in partitioning, branching schemes, and bounding functions; it can be very efficient in practice, but typically has worst-case complexity equal to exhaustive enumeration. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
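The scheme above can be sketched in code. A minimal best-first branch-and-bound on a deliberately simple toy problem of our choosing (maximizing a linear score over {0,1}^n, where the bound of a partial assignment adds max(0, w) for every undecided weight); the problem and all names are illustrative, not from the slides:

```python
import heapq

def branch_and_bound(weights):
    """Best-first branch-and-bound maximizing g(y) = sum_i weights[i]*y[i]
    over y in {0,1}^n.  A node is a partial assignment (a prefix); its
    upper bound assumes every undecided weight contributes max(0, w)."""
    n = len(weights)

    def upper_bound(prefix):
        fixed = sum(w * b for w, b in zip(weights, prefix))
        free = sum(max(0.0, w) for w in weights[len(prefix):])
        return fixed + free

    best_val, best_y = float("-inf"), None
    heap = [(-upper_bound(()), ())]        # max-heap via negated bounds
    while heap:
        neg_bound, prefix = heapq.heappop(heap)
        if -neg_bound <= best_val:         # close: cannot beat incumbent
            continue
        if len(prefix) == n:               # leaf: a complete labeling
            val = upper_bound(prefix)      # no undecided part remains
            if val > best_val:
                best_val, best_y = val, list(prefix)
            continue
        for bit in (0, 1):                 # branch on the next variable
            child = prefix + (bit,)
            b = upper_bound(child)
            if b > best_val:               # keep the node active
                heapq.heappush(heap, (-b, child))
    return best_val, best_y
```

Nodes whose bound cannot beat the incumbent are closed without being expanded — but in the worst case the loop still enumerates all 2^n leaves, matching the slide's caveat.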
  • 5. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Branch-and-Bound (cont) A4 A5 C2 C1 A2 A3 A1 C3 Y Work with partitioning of solution space Y Active nodes (white) Closed nodes (gray) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 6. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Branch-and-Bound (cont) A C Y Y Initially: everything active Goal: everything closed Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 7. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Branch-and-Bound (cont) A4 A5 C2 C1 A2 A3 A1 C3 Y Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 8. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Branch-and-Bound (cont) A2 Take an active element (A2 ) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 9. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Branch-and-Bound (cont) A2 Partition into two or more subsets of Y Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 10. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Branch-and-Bound (cont) C4 A6 C5 Evaluate bounds and set each node active or closed. Closing a partition A is possible if we can prove that no solution in A can be better than a known solution of value L: if the upper bound satisfies ḡ(A) ≤ L, close A. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 11. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Example 1: Efficient Subwindow Search Efficient Subwindow Search (Lampert and Blaschko, 2008) Find the bounding box that maximizes a linear scoring function: y∗ = argmax_{y∈Y} ⟨w, φ(y, x)⟩ Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 12. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Example 1: Efficient Subwindow Search Efficient Subwindow Search (Lampert and Blaschko, 2008) g(x, y) = β + Σ_{xi within y} w(xi) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 13. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Example 1: Efficient Subwindow Search Efficient Subwindow Search (Lampert and Blaschko, 2008) Subsets B of bounding boxes specified by interval coordinates, B ⊂ 2Y Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 14. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Example 1: Efficient Subwindow Search Efficient Subwindow Search (Lampert and Blaschko, 2008) Upper bound ḡ(x, B) ≥ g(x, y) for all y ∈ B: ḡ(x, B) = β + Σ_{xi within Bmax} max{w(xi), 0} + Σ_{xi within Bmin} min{w(xi), 0} ≥ max_{y∈B} g(x, y) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
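A minimal sketch of this bound; the paper evaluates the sums in constant time with integral images, while this illustrative version (function and argument names are ours) sums pixel scores directly. Boxes are (top, left, bottom, right) tuples with inclusive coordinates:

```python
def ess_upper_bound(w, b_max, b_min, beta=0.0):
    """Upper bound ḡ(x, B) for a set of boxes B whose largest member is
    b_max and smallest member is b_min.  Positive per-pixel scores are
    summed over b_max, negative ones over b_min, so no box in B can
    score higher.  `w` is a 2D list of per-pixel scores."""
    pos = sum(max(0.0, w[i][j])
              for i in range(b_max[0], b_max[2] + 1)
              for j in range(b_max[1], b_max[3] + 1))
    neg = sum(min(0.0, w[i][j])
              for i in range(b_min[0], b_min[2] + 1)
              for j in range(b_min[1], b_min[3] + 1))
    return beta + pos + neg
```

When B shrinks to a single box, b_max = b_min and the bound collapses to the exact score of that box, which is what lets branch-and-bound terminate with the optimum.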
  • 15. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Example 1: Efficient Subwindow Search Efficient Subwindow Search (Lampert and Blaschko, 2008) PASCAL VOC 2007 detection challenge bounding boxes found using ESS Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 16. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Example 2: Branch-and-Mincut Branch-and-Mincut (Lempitsky et al., 2008) Binary image segmentation with non-local interaction y1 , y2 ∈ Y Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 17. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Example 2: Branch-and-Mincut Branch-and-Mincut (Lempitsky et al., 2008) Binary image segmentation with non-local interaction: E(z, y) = C(y) + Σ_{p∈V} F^p(y) z_p + Σ_{p∈V} B^p(y)(1 − z_p) + Σ_{{p,q}∈E} P^{pq}(y)|z_p − z_q|, g(x, y) = max_{z∈2^V} −E(z, y). Here: z ∈ {0,1}^V is a binary pixel mask, F^p(y), B^p(y) are foreground/background unary energies, P^{pq}(y) is a standard pairwise energy, and there are global dependencies on y. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 18. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Worst-case Complexity Example 2: Branch-and-Mincut Branch-and-Mincut (Lempitsky et al., 2008) Upper bound for any subset A ⊆ Y: max_{y∈A} g(x, y) = max_{y∈A} max_{z∈2^V} −E(z, y) = max_{y∈A} max_{z∈2^V} −[C(y) + Σ_{p∈V} F^p(y) z_p + Σ_{p∈V} B^p(y)(1 − z_p) + Σ_{{p,q}∈E} P^{pq}(y)|z_p − z_q|] ≤ max_{z∈2^V} [(max_{y∈A} −C(y)) + Σ_{p∈V} (max_{y∈A} −F^p(y)) z_p + Σ_{p∈V} (max_{y∈A} −B^p(y))(1 − z_p) + Σ_{{p,q}∈E} (max_{y∈A} −P^{pq}(y))|z_p − z_q|]. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 19. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Hard problem Generality Optimality Worst-case Integrality Determinism complexity Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 20. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Problem Relaxations Optimization problems (minimizing g : G → R over Y ⊆ G) can become easier if the feasible set is enlarged, and/or the objective function is replaced with a bound. (Figure: objective g over Y and relaxed objective h over the enlarged set Z ⊇ Y.) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 21. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Problem Relaxations Optimization problems (minimizing g : G → R over Y ⊆ G) can become easier if feasible set is enlarged, and/or objective function is replaced with a bound. Definition (Relaxation (Geoffrion, 1974)) Given two optimization problems (g , Y, G) and (h, Z, G), the problem (h, Z, G) is said to be a relaxation of (g , Y, G) if, 1. Z ⊇ Y, i.e. the feasible set of the relaxation contains the feasible set of the original problem, and 2. ∀y ∈ Y : h(y ) ≥ g (y ), i.e. over the original feasible set the objective function h achieves no smaller values than the objective function g . Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 22. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Relaxations Relaxed solution z∗ provides a bound: h(z∗) ≥ g(y∗) (maximization: upper bound), h(z∗) ≤ g(y∗) (minimization: lower bound). Relaxation is typically tractable. Evidence that relaxations are “better” for learning (Kulesza and Pereira, 2007), (Finley and Joachims, 2008), (Martins et al., 2009). Drawback: z∗ ∈ Z \ Y possible. Are there principled methods to construct relaxations? Linear programming relaxations, Lagrangian relaxation, Lagrangian/dual decomposition. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 24. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Linear Programming Relaxation (Figure: integer points inside the polytope Ay ≤ b in the (y1, y2) plane.) max_y c⊤y sb.t. Ay ≤ b, y is integer. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 26. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Linear Programming Relaxation (Figure: dropping integrality enlarges the feasible set from the integer points to the full polytope Ay ≤ b.) max_y c⊤y sb.t. Ay ≤ b. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 27. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations MAP-MRF Linear Programming Relaxation Y1 Y2 Y1 Y1 × Y2 Y2 θ1 θ1,2 θ2 Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 28. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations MAP-MRF Linear Programming Relaxation Y1 Y2 Y1 Y1 × Y2 Y2 y1 = 2 (y1, y2) = (2, 3) y2 = 3 Energy E(y; x) = θ1(y1; x) + θ1,2(y1, y2; x) + θ2(y2; x). Probability p(y|x) = (1/Z(x)) exp{−E(y; x)}. MAP prediction: argmax_{y∈Y} p(y|x) = argmin_{y∈Y} E(y; x) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 29. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations MAP-MRF Linear Programming Relaxation μ1 ∈ {0,1}^{Y1}, μ1,2 ∈ {0,1}^{Y1×Y2}, μ2 ∈ {0,1}^{Y2}. μ1(y1) = Σ_{y2∈Y2} μ1,2(y1, y2), Σ_{y1∈Y1} μ1,2(y1, y2) = μ2(y2), Σ_{y1∈Y1} μ1(y1) = 1, Σ_{(y1,y2)∈Y1×Y2} μ1,2(y1, y2) = 1, Σ_{y2∈Y2} μ2(y2) = 1. The energy is now linear: E(y; x) = ⟨θ, μ⟩ =: −g(y, x) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 30. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations MAP-MRF Linear Programming Relaxation (cont) max_μ Σ_{i∈V} Σ_{yi∈Yi} θi(yi) μi(yi) + Σ_{{i,j}∈E} Σ_{(yi,yj)∈Yi×Yj} θi,j(yi, yj) μi,j(yi, yj) sb.t. Σ_{yi∈Yi} μi(yi) = 1, ∀i ∈ V; Σ_{yj∈Yj} μi,j(yi, yj) = μi(yi), ∀{i,j} ∈ E, ∀yi ∈ Yi; μi(yi) ∈ {0,1}, ∀i ∈ V, ∀yi ∈ Yi; μi,j(yi, yj) ∈ {0,1}, ∀{i,j} ∈ E, ∀(yi, yj) ∈ Yi × Yj. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 31. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations MAP-MRF Linear Programming Relaxation (cont) max_μ Σ_{i∈V} Σ_{yi∈Yi} θi(yi) μi(yi) + Σ_{{i,j}∈E} Σ_{(yi,yj)∈Yi×Yj} θi,j(yi, yj) μi,j(yi, yj) sb.t. Σ_{yi∈Yi} μi(yi) = 1, ∀i ∈ V; Σ_{yj∈Yj} μi,j(yi, yj) = μi(yi), ∀{i,j} ∈ E, ∀yi ∈ Yi; μi(yi) ∈ {0,1}, ∀i ∈ V, ∀yi ∈ Yi; μi,j(yi, yj) ∈ {0,1}, ∀{i,j} ∈ E, ∀(yi, yj) ∈ Yi × Yj. → NP-hard integer linear program Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 32. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations MAP-MRF Linear Programming Relaxation (cont) max_μ Σ_{i∈V} Σ_{yi∈Yi} θi(yi) μi(yi) + Σ_{{i,j}∈E} Σ_{(yi,yj)∈Yi×Yj} θi,j(yi, yj) μi,j(yi, yj) sb.t. Σ_{yi∈Yi} μi(yi) = 1, ∀i ∈ V; Σ_{yj∈Yj} μi,j(yi, yj) = μi(yi), ∀{i,j} ∈ E, ∀yi ∈ Yi; μi(yi) ∈ [0,1], ∀i ∈ V, ∀yi ∈ Yi; μi,j(yi, yj) ∈ [0,1], ∀{i,j} ∈ E, ∀(yi, yj) ∈ Yi × Yj. → linear program Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 33. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations MAP-MRF LP Relaxation, Works MAP-MRF LP Analysis (Wainwright et al., Trans. Inf. Tech., 2005), (Weiss et al., UAI 2007), (Werner, PAMI 2005) (Kolmogorov, PAMI 2006) Improving tightness (Komodakis and Paragios, ECCV 2008), (Kumar and Torr, ICML 2008), (Sontag and Jaakkola, NIPS 2007), (Sontag et al., UAI 2008), (Werner, CVPR 2008), (Kumar et al., NIPS 2007) Algorithms based on the LP (Globerson and Jaakkola, NIPS 2007), (Kumar and Torr, ICML 2008), (Sontag et al., UAI 2008) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 34. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations MAP-MRF LP Relaxation, Known results 1. LOCAL is tight iff the factor graph is a forest (acyclic) 2. All labelings are vertices of LOCAL (Wainwright and Jordan, 2008) 3. For cyclic graphs there are additional fractional vertices. 4. If all factors have regular energies, the fractional solutions are never optimal (Wainwright and Jordan, 2008) 5. For models with only binary states: half-integrality, integral variables are certain Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 35. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Lagrangian Relaxation max_y g(y) sb.t. y ∈ D, y ∈ C. Assumption: optimizing g(y) over y ∈ D is “easy”; optimizing over y ∈ D ∩ C is hard. High-level idea: incorporate the y ∈ C constraint into the objective function. For an excellent introduction, see (Guignard, “Lagrangean Relaxation”, TOP 2003) and (Lemaréchal, “Lagrangian Relaxation”, 2001). Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 36. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Lagrangian Relaxation (cont) Recipe 1. Express C in terms of equalities and inequalities C = {y ∈ G : ui (y ) = 0, ∀i = 1, . . . , I , vj (y ) ≤ 0, ∀j = 1, . . . , J}, ui : G → R differentiable, typically affine, vj : G → R differentiable, typically convex. 2. Introduce Lagrange multipliers, yielding max g (y ) y sb.t. y ∈ D, ui (y ) = 0 : λ, i = 1, . . . , I , vj (y ) ≤ 0 : µ, j = 1, . . . , J. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 37. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Lagrangian Relaxation (cont) 2. Introduce Lagrange multipliers, yielding max g (y ) y sb.t. y ∈ D, ui (y ) = 0 : λ, i = 1, . . . , I , vj (y ) ≤ 0 : µ, j = 1, . . . , J. 3. Build partial Lagrangian max g (y ) + λ u(y ) + µ v (y ) (1) y sb.t. y ∈ D. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 38. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Lagrangian Relaxation (cont) 3. Build partial Lagrangian max g (y ) + λ u(y ) + µ v (y ) (1) y sb.t. y ∈ D. Theorem (Weak Duality of Lagrangean Relaxation) For differentiable functions ui : G → R and vj : G → R, and for any λ ∈ RI and any non-negative µ ∈ RJ , µ ≥ 0, problem (1) is a relaxation of the original problem: its value is larger than or equal to the optimal value of the original problem. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 39. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Lagrangian Relaxation (cont) 3. Build the partial Lagrangian: q(λ, µ) := max_y g(y) + λ⊤u(y) + µ⊤v(y) (1) sb.t. y ∈ D. 4. Minimize the upper bound w.r.t. λ, µ: min_{λ,µ} q(λ, µ) sb.t. µ ≥ 0 → efficiently solvable if q(λ, µ) can be evaluated Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 40. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Optimizing Lagrangian Dual Functions 4. Minimize the upper bound w.r.t. λ, µ: min_{λ,µ} q(λ, µ) sb.t. µ ≥ 0 (2) Theorem (Lagrangean Dual Function) 1. q is convex in λ and µ; Problem (2) is a convex minimization problem. 2. If q is unbounded below, then the original problem is infeasible. 3. For any λ and µ ≥ 0, let y(λ, µ) = argmax_{y∈D} g(y) + λ⊤u(y) + µ⊤v(y). Then a subgradient of q can be constructed by evaluating the constraint functions at y(λ, µ): u(y(λ, µ)) ∈ ∂q/∂λ (λ, µ), and v(y(λ, µ)) ∈ ∂q/∂µ (λ, µ). Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
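The recipe and the subgradient construction can be sketched on a toy instance — an assumption of ours, not from the slides: D = {0,1}^n, g(y) = c⊤y, and a single equality constraint u(y) = Σ_i y_i − 1 = 0, so the inner maximization decomposes per coordinate:

```python
def solve_inner(c, lam):
    """y(lambda): maximize sum_i (c_i + lam) * y_i independently over {0,1}."""
    return [1 if ci + lam > 0 else 0 for ci in c]

def lagrangian_dual(c, steps=100, step0=1.0):
    """Minimize q(lam) = max_{y in {0,1}^n} c.y + lam*(sum(y) - 1) by
    subgradient descent; the subgradient at lam is u(y(lam)) = sum(y(lam)) - 1."""
    lam = 0.0
    for t in range(1, steps + 1):
        y = solve_inner(c, lam)
        u = sum(y) - 1               # constraint violation = subgradient
        lam -= (step0 / t) * u       # move against the subgradient
    y = solve_inner(c, lam)
    q = sum((ci + lam) * yi for ci, yi in zip(c, y)) - lam
    return lam, y, q
```

With c = (3, 2) the iteration settles where u(y(λ)) = 0, and q equals the optimal primal value 3, so this toy instance has no duality gap.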
  • 41. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Example: MAP-MRF Message Passing max_μ Σ_{i∈V} Σ_{yi∈Yi} θi(yi) μi(yi) + Σ_{{i,j}∈E} Σ_{(yi,yj)∈Yi×Yj} θi,j(yi, yj) μi,j(yi, yj) sb.t. Σ_{yi∈Yi} μi(yi) = 1, ∀i ∈ V; Σ_{(yi,yj)∈Yi×Yj} μi,j(yi, yj) = 1, ∀{i,j} ∈ E; Σ_{yj∈Yj} μi,j(yi, yj) = μi(yi), ∀{i,j} ∈ E, ∀yi ∈ Yi; μi(yi) ∈ {0,1}; μi,j(yi, yj) ∈ {0,1}. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 42. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Example: MAP-MRF Message Passing max_μ Σ_{i∈V} Σ_{yi∈Yi} θi(yi) μi(yi) + Σ_{{i,j}∈E} Σ_{(yi,yj)∈Yi×Yj} θi,j(yi, yj) μi,j(yi, yj) sb.t. Σ_{yi∈Yi} μi(yi) = 1, ∀i ∈ V; Σ_{(yi,yj)∈Yi×Yj} μi,j(yi, yj) = 1, ∀{i,j} ∈ E; Σ_{yj∈Yj} μi,j(yi, yj) = μi(yi), ∀{i,j} ∈ E, ∀yi ∈ Yi; μi(yi) ∈ {0,1}; μi,j(yi, yj) ∈ {0,1}. If the marginalization constraint were not there, the problem would be trivial! (Wainwright and Jordan, 2008, section 4.1.3) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 43. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Example: MAP-MRF Message Passing max_μ Σ_{i∈V} Σ_{yi∈Yi} θi(yi) μi(yi) + Σ_{{i,j}∈E} Σ_{(yi,yj)∈Yi×Yj} θi,j(yi, yj) μi,j(yi, yj) sb.t. Σ_{yi∈Yi} μi(yi) = 1, ∀i ∈ V; Σ_{(yi,yj)∈Yi×Yj} μi,j(yi, yj) = 1, ∀{i,j} ∈ E; Σ_{yj∈Yj} μi,j(yi, yj) = μi(yi) : φi,j(yi), ∀{i,j} ∈ E, ∀yi ∈ Yi; μi(yi) ∈ {0,1}; μi,j(yi, yj) ∈ {0,1}. If the marginalization constraint were not there, the problem would be trivial! (Wainwright and Jordan, 2008, section 4.1.3) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 44. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Example: MAP-MRF Message Passing q(φ) := max_μ Σ_{i∈V} Σ_{yi∈Yi} θi(yi) μi(yi) + Σ_{{i,j}∈E} Σ_{(yi,yj)∈Yi×Yj} θi,j(yi, yj) μi,j(yi, yj) + Σ_{{i,j}∈E} Σ_{yi∈Yi} φi,j(yi) (Σ_{yj∈Yj} μi,j(yi, yj) − μi(yi)) sb.t. Σ_{yi∈Yi} μi(yi) = 1, ∀i ∈ V; Σ_{(yi,yj)∈Yi×Yj} μi,j(yi, yj) = 1, ∀{i,j} ∈ E; μi(yi) ∈ {0,1}; μi,j(yi, yj) ∈ {0,1}. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 45. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Example: MAP-MRF Message Passing q(φ) := max_μ Σ_{i∈V} Σ_{yi∈Yi} (θi(yi) − Σ_{j∈V:{i,j}∈E} φi,j(yi)) μi(yi) + Σ_{{i,j}∈E} Σ_{(yi,yj)∈Yi×Yj} (θi,j(yi, yj) + φi,j(yi)) μi,j(yi, yj) sb.t. Σ_{yi∈Yi} μi(yi) = 1, ∀i ∈ V; Σ_{(yi,yj)∈Yi×Yj} μi,j(yi, yj) = 1, ∀{i,j} ∈ E; μi(yi) ∈ {0,1}; μi,j(yi, yj) ∈ {0,1}. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 46. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Example: MAP-MRF Message Passing (Figure: reparametrization — messages φ1,2 ∈ R^{Y1} and φ2,1 ∈ R^{Y2} shift mass between the unary factors θ1, θ2 and the pairwise factor θ1,2 via ±φ1,2 and ±φ2,1.) Parent-to-child region-graph messages (Meltzer et al., UAI 2009); max-sum diffusion reparametrization (Werner, PAMI 2007) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 47. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Example behaviour of objectives (Plot: dual and primal objective values over 250 iterations — the dual objective approaches the primal objective from above as the bound tightens.) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 48. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Primal Solution Recovery Assume we solved the dual problem for (λ∗ , µ∗ ) 1. Can we obtain a primal solution y ∗ ? 2. Can we say something about the relaxation quality? Theorem (Sufficient Optimality Conditions) If for a given λ, µ ≥ 0, we have u(y (λ, µ)) = 0 and v (y (λ, µ)) ≤ 0 (primal feasibility) and further we have λ u(y (λ, µ)) = 0, and µ v (y (λ, µ)) = 0, (complementary slackness), then y (λ, µ) is an optimal primal solution to the original problem, and (λ, µ) is an optimal solution to the dual problem. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 50. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Primal Solution Recovery (cont) But we might never see a solution satisfying the optimality conditions; that happens only if there is no duality gap, i.e. q(λ∗, µ∗) = g(x, y∗). Special case: for integer linear programs we can always reconstruct a primal solution to min_y g(y) (3) sb.t. y ∈ conv(D), y ∈ C. For example, for subgradient method updates it is known that (Anstreicher and Wolsey, MathProg 2009) lim_{T→∞} (1/T) Σ_{t=1}^{T} y(λt, µt) → y∗ of (3). Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 51. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Dual/Lagrangian Decomposition Additive structure K max gk (y ) y k=1 sb.t. y ∈ Y, K such that k=1 gk (y ) is hard, but for any k, max gk (y ) y sb.t. y ∈ Y is tractable. (Guignard, “Lagrangean Decomposition”, MathProg 1987), the original paper for “dual decomposition” (Conejo et al., “Decomposition Techniques in Mathematical Programming”, 2006), continuous variable problems Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 52. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Dual/Lagrangian Decomposition Additive structure K max gk (y ) y k=1 sb.t. y ∈ Y, K such that k=1 gk (y ) is hard, but for any k, max gk (y ) y sb.t. y ∈ Y is tractable. (Guignard, “Lagrangean Decomposition”, MathProg 1987), the original paper for “dual decomposition” (Conejo et al., “Decomposition Techniques in Mathematical Programming”, 2006), continuous variable problems Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 53. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Dual/Lagrangian Decomposition Idea 1. Selectively duplicate variables (“variable splitting”): max_{y1,...,yK,y} Σ_{k=1}^{K} gk(yk) sb.t. y ∈ Y; yk ∈ Y, k = 1, . . . , K; y = yk : λk, k = 1, . . . , K. (4) 2. Add coupling equality constraints between the duplicated variables. 3. Apply Lagrangian relaxation to the coupling constraints. Also known as dual decomposition and variable splitting. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 54. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Dual/Lagrangian Decomposition (cont) max_{y1,...,yK,y} Σ_{k=1}^{K} gk(yk) sb.t. y ∈ Y; yk ∈ Y, k = 1, . . . , K; y = yk : λk, k = 1, . . . , K. (5) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 55. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Dual/Lagrangian Decomposition (cont) max_{y1,...,yK,y} Σ_{k=1}^{K} gk(yk) + Σ_{k=1}^{K} λk⊤(y − yk) sb.t. y ∈ Y; yk ∈ Y, k = 1, . . . , K. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 56. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Dual/Lagrangian Decomposition (cont) max_{y1,...,yK,y} Σ_{k=1}^{K} (gk(yk) − λk⊤yk) + (Σ_{k=1}^{K} λk)⊤y sb.t. y ∈ Y; yk ∈ Y, k = 1, . . . , K. Problem is decomposed Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 57. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Dual/Lagrangian Decomposition (cont) q(λ) := max_{y1,...,yK,y} Σ_{k=1}^{K} (gk(yk) − λk⊤yk) + (Σ_{k=1}^{K} λk)⊤y sb.t. y ∈ Y; yk ∈ Y, k = 1, . . . , K. The dual optimization problem is min_λ q(λ) sb.t. Σ_{k=1}^{K} λk = 0, where {λ | Σ_{k=1}^{K} λk = 0} is the domain on which q(λ) > −∞. Projected subgradient method, using ∂/∂λk: (y − yk) − (1/K) Σ_{ℓ=1}^{K} (y − yℓ) = (1/K) Σ_{ℓ=1}^{K} yℓ − yk. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
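A minimal sketch of the decomposition for K = 2 on a toy problem of our choosing: y ranges over a small finite set, g = g1 + g2, and since λ1 + λ2 = 0 a single multiplier vector λ (one entry per state) suffices. The copies are re-solved and λ updated by subgradient steps until they agree:

```python
def dual_decomposition(g1, g2, steps=200):
    """Maximize g1(y) + g2(y) over y in {0, ..., n-1} by variable
    splitting: two copies y1, y2 coupled through multipliers lam,
    q(lam) = max_y [g1(y) + lam[y]] + max_y [g2(y) - lam[y]],
    minimized by subgradient steps until the copies agree."""
    n = len(g1)
    lam = [0.0] * n
    y1 = 0
    for t in range(1, steps + 1):
        y1 = max(range(n), key=lambda y: g1[y] + lam[y])
        y2 = max(range(n), key=lambda y: g2[y] - lam[y])
        if y1 == y2:                 # copies agree: primal feasible, optimal
            return y1, g1[y1] + g2[y1]
        step = 1.0 / t               # diminishing step size
        lam[y1] -= step              # subgradient is e_{y1} - e_{y2}
        lam[y2] += step
    return y1, g1[y1] + g2[y1]      # fallback: report the g1-copy
```

Each inner `max` is a subproblem that is trivial on its own; the multipliers do all the coordination, which is the essence of the decomposition.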
  • 58. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Primal Interpretation Primal interpretation for linear case due to (Guignard, 1987) Theorem (Lagrangian Decomposition Primal) Let g (x, y ) = c(x) y be linear, then the solution of the dual obtains the value of the following relaxed primal optimization problem. K min gk (x, y ) y k=1 sb.t. y ∈ conv(Yk ), ∀k = 1, . . . , K . Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 59. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Primal Interpretation (cont) (Figure: conv(Y1), conv(Y2) and their intersection, with the decomposition solution yD, the optimum y∗, and the objective direction c.) min_y Σ_{k=1}^{K} gk(x, y) sb.t. y ∈ conv(Yk), ∀k = 1, . . . , K. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 60. G: Worst-case Complexity G: Integrality/Relaxations Determinism End G: Integrality/Relaxations Applications in Computer Vision Very broadly applicable (Komodakis et al., ICCV 2007) (Woodford et al., ICCV 2009) (Strandmark and Kahl, CVPR 2010) (Vicente et al., ICCV 2009) (Torresani et al., ECCV 2008) (Werner, TechReport 2009) Further reading (Sontag et al., “Introduction to Dual Decomposition for Inference”, OfML Vol, 2011) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 61. G: Worst-case Complexity G: Integrality/Relaxations Determinism End Determinism Hard problem Generality Optimality Worst-case Integrality Determinism complexity Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 62. G: Worst-case Complexity G: Integrality/Relaxations Determinism End Determinism Giving up Determinism Algorithms that use randomness are non-deterministic: the result does not depend exclusively on the input, and they are allowed to return a wrong result (with low probability), or/and to have a random runtime. Is it a good idea? For some problems randomized algorithms are the only known efficient algorithms; using randomness for hard problems has a long tradition in sampling-based algorithms, physics, computational chemistry, etc.; such algorithms can be simple and effective, but proving theorems about them can be much harder. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 63. G: Worst-case Complexity G: Integrality/Relaxations Determinism End Determinism Simulated Annealing Basic idea (Kirkpatrick et al., 1983) 1. Define a distribution p(y ; T ) that concentrates mass on states y with high values g (x, y ) 2. Simulate p(y ; T ) 3. Increase concentration and repeat Defining p(y ; T ) Boltzmann distribution Simulating p(y ; T ) MCMC: usually done using a Metropolis chain or a Gibbs sampler Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
  • 64. G: Worst-case Complexity G: Integrality/Relaxations Determinism End Determinism Simulated Annealing (cont) Definition (Boltzmann Distribution) For a finite set Y, a function g : X × Y → R and a temperature parameter T > 0, let p(y; T) = (1/Z(T)) exp(g(x, y)/T), (5) with Z(T) = Σ_{y∈Y} exp(g(x, y)/T), be the Boltzmann distribution for g over Y at temperature T. Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
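The definition is easy to evaluate for a small finite Y; a sketch with made-up scores (shifting by the maximum before exponentiating keeps the computation numerically stable without changing the distribution):

```python
import math

def boltzmann(g, T):
    """Boltzmann distribution p(y; T) ∝ exp(g(y)/T) over a finite set,
    given as a list of scores g(y); Z(T) normalizes."""
    m = max(g)
    w = [math.exp((gy - m) / T) for gy in g]  # shift by max for stability
    Z = sum(w)
    return [wy / Z for wy in w]
```

At high T the distribution is nearly uniform; at low T almost all mass sits on the maximizer — the concentration behaviour that annealing exploits.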
  • 65.–71. G: Worst-case Complexity G: Integrality/Relaxations Determinism End Determinism Simulated Annealing (cont) (Plots: the objective g over states y, followed by the Boltzmann distribution p(y; T) for T = 100, 10, 4, 1, 0.1 — as T decreases, the probability mass concentrates on the maximizing states.) Sebastian Nowozin and Christoph H. Lampert Part 7: Structured Prediction and Energy Minimization (2/2)
• 72. Determinism: Simulated Annealing (cont)

y* = SimulatedAnnealing(Y, g, T, N)
Input:
    Y               finite feasible set
    g : X × Y → R   objective function
    T ∈ R^K         sequence of K decreasing temperatures
    N ∈ N^K         sequence of K step lengths

(y, y*) ← (y0, y0)
for k = 1, ..., K do
    y ← simulate Markov chain p(y; T(k)) starting from y for N(k) steps
end for
y* ← y
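The routine above can be sketched concretely for a toy 1-D problem. The Metropolis proposal (random walk on a cycle of n states), the objective g, and the schedules below are illustrative assumptions, not prescribed by the slides.

```python
import math
import random

def simulated_annealing(g, n, temperatures, steps, seed=0):
    """Maximize g over Y = {0, ..., n-1} with a Metropolis chain per temperature."""
    rng = random.Random(seed)
    y = rng.randrange(n)  # y0: random initial state
    for T, N in zip(temperatures, steps):
        for _ in range(N):
            # Propose a neighboring state (random-walk proposal on a cycle).
            y_prop = (y + rng.choice([-1, 1])) % n
            # Metropolis acceptance for maximizing g at temperature T:
            # uphill moves always accepted, downhill with prob exp(delta/T).
            delta = g(y_prop) - g(y)
            if delta >= 0 or rng.random() < math.exp(delta / T):
                y = y_prop
    return y

# Toy objective with a unique maximum at y = 30.
g = lambda y: -abs(y - 30)
T_sched = [10.0 * 0.8 ** k for k in range(40)]  # geometric cooling (assumed)
N_sched = [50] * 40
print(simulated_annealing(g, 40, T_sched, N_sched))
```

Note that this sketch returns the final state, matching the pseudocode's y* ← y; a common practical variant instead tracks the best state seen so far.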
• 73. Determinism: Guarantees
Theorem (Guaranteed Optimality; Geman and Geman, 1984). If there exists a k0 ≥ 2 such that for all k ≥ k0 the temperature T(k) satisfies

    T(k) ≥ |Y| · (max_{y ∈ Y} g(x, y) − min_{y ∈ Y} g(x, y)) / log k,

then the probability of seeing the maximizer y* of g tends to one as k → ∞.
- Too slow in practice
- Faster schedules are used in practice, such as T(k) = T0 · α^k
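The gap between the theoretically safe logarithmic schedule and a faster geometric schedule can be made concrete numerically; T0 = 4.0 and α = 0.9 are illustrative choices (the 4.0 matches the constant used in the restoration example later).

```python
import math

def log_schedule(k, c=4.0):
    """Slow logarithmic schedule of the form c / log(1 + k)."""
    return c / math.log(1 + k)

def geometric_schedule(k, T0=4.0, alpha=0.9):
    """Faster geometric schedule T(k) = T0 * alpha^k used in practice."""
    return T0 * alpha ** k

for k in [1, 10, 100, 1000]:
    print(k, round(log_schedule(k), 4), round(geometric_schedule(k), 4))
```

The geometric schedule drops below the logarithmic one quickly, so it trades the optimality guarantee for speed.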
• 75. Determinism: Example: Simulated Annealing (Geman and Geman, 1984)
Figures: factor graph model; approximate sample (200 sweeps).
- 2D 8-neighbor 128-by-128 grid, 5 possible labels
- Pairwise Potts potentials
- Task: restoration
• 76. Determinism: Example: Simulated Annealing (Geman and Geman, 1984)
Figures: approximate sample (200 sweeps); sample corrupted with Gaussian noise.
- 2D 8-neighbor 128-by-128 grid, 5 possible labels
- Pairwise Potts potentials
- Task: restoration
• 77. Determinism: Example: Simulated Annealing (Geman and Geman, 1984)
Figures: corrupted input image; restoration after 25 sweeps.
- Derive unary energies from the corrupted input image (optimal)
- Simulated annealing, 25 Gibbs sampling sweeps
• 78. Determinism: Example: Simulated Annealing (Geman and Geman, 1984)
Figures: corrupted input image; restoration after 300 sweeps.
- Simulated annealing, 300 Gibbs sampling sweeps
- Schedule T(k) = 4.0 / log(1 + k)
• 79. Determinism: Example: Simulated Annealing (Geman and Geman, 1984)
Figures: noise-free sample; restoration after 300 sweeps.
- Simulated annealing, 300 Gibbs sampling sweeps
- Schedule T(k) = 4.0 / log(1 + k)
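The restoration experiment can be sketched in miniature: annealed Gibbs sampling on a Potts grid, using the T(k) = 4.0 / log(1 + k) schedule from the slides. For brevity this sketch uses a 4-neighbor grid, two labels, and assumed potential weights, whereas the slides use an 8-neighbor 128-by-128 grid with 5 labels; all of those choices are illustrative assumptions.

```python
import math
import random

def gibbs_restore(noisy, n_labels, sweeps, beta=1.0, unary_w=2.0, seed=0):
    """Annealed Gibbs sampling for restoration on a 4-neighbor Potts grid."""
    rng = random.Random(seed)
    H, W = len(noisy), len(noisy[0])
    labels = [row[:] for row in noisy]  # initialize at the noisy observation
    for k in range(1, sweeps + 1):
        T = 4.0 / math.log(1 + k)       # schedule from the slides
        for i in range(H):
            for j in range(W):
                # Unnormalized log-probability of each label at (i, j):
                # unary term ties to the observation, Potts term rewards
                # agreement with the four grid neighbors.
                scores = []
                for l in range(n_labels):
                    s = unary_w * (l == noisy[i][j])
                    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < H and 0 <= nj < W:
                            s += beta * (l == labels[ni][nj])
                    scores.append(s / T)
                # Sample a label from the local conditional distribution.
                m = max(scores)
                w = [math.exp(s - m) for s in scores]
                r, acc = rng.random() * sum(w), 0.0
                for l in range(n_labels):
                    acc += w[l]
                    if r <= acc:
                        labels[i][j] = l
                        break
    return labels

# Tiny demo: a 6x6 two-label image with one flipped pixel as "noise".
img = [[0] * 3 + [1] * 3 for _ in range(6)]
img[2][1] = 1
restored = gibbs_restore(img, n_labels=2, sweeps=30)
```

With the fixed seed the run is reproducible; as in the slides, more sweeps at lower temperatures yield smoother, more coherent restorations.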
• 80. End: The End...
Tutorial available in written form in the now publishers Foundations and Trends (FnT) in Computer Graphics and Vision series: http://www.nowpublishers.com/
PDF available on the authors' homepages.
Thank you!