Introduction
        Harmony potential 2.0: fusing across scale
                                Action recognition
                                        Discussion



                              PASCAL VOC 2010
Semantic object segmentation and action recognition in still images


                                Andrew D. Bagdanov
                              bagdanov@cvc.uab.es

                       Departamento de Ciencias de la Computación
                            Universidad Autónoma de Barcelona




          Xavier              Pep            Nataliya      Wenjuan          Fahad

                    The CVC PASCAL VOC Team           CVC PASCAL VOC 2010


Overview
     On 03/05/2010 the PASCAL VOC competition was announced
     and the training and validation sets published.
     20 semantic categories for the competition remain the same:
aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable,
dog, horse, motorbike, person, potted plant, sheep, sofa, train, and tv/monitor.






Old competitions, new competitions

   There are two (+ 1/2) main challenges in PASCAL.
   Image classification is the prediction of the presence/absence of
   an instance of each class in a test image.
   Object detection is the prediction of the bounding box and label
   of each object from the twenty target classes in a test image.
   Semantic image segmentation is the assignment of one of the
   twenty class labels to every pixel in a test image.
   Image segmentation is becoming a mainstream competition.
   Action recognition in still images was included as a new “taster
   challenge” this year.
   Taster competitions are used to measure interest in new problems.



Our contributions to PASCAL VOC 2010


  Last year we participated in the Detection, Classification and
  Segmentation challenges.
  This year we decided to concentrate on Classification and
  Segmentation. Our segmentation technique relies heavily on
  classification.
  We also fielded a team in Action Recognition this year to see
  what that’s all about.
  As always, success in PASCAL VOC challenges is approximately
  85% engineering, 10% inspiration and 5% luck (if you’re lucky).





Outline


 1   Introduction
         Overview of the challenges
         Our contribution and main ideas
 2   The harmony potential 2.0: fusing across scale
         Building on last year’s submission
         Fusing across scales and learning
 3   Action recognition
         A torrent of features
         Exploiting the size of the problem
 4   Discussion





Giving semantics to pixels




   [Figure panels: Image, Object, Class]
   Semantic image segmentation is not object segmentation
   Only for simple cases are they the same.


Turning a hard problem into a harder one




   [Figure panels: Image, Object, Class]
   The objective is to assign semantic labels to every pixel
   Fine distinctions must be made


Make that a very hard one




   [Figure panels: Image, Object, Class]
   The objective is to assign semantic labels to every pixel
   Fine distinctions must be made
   Occlusions, varying viewpoint and size complicate things





Action recognition in still images


   New competition this year: human action recognition in still
   images.
   Individual images sampled from the Flickr dataset.
   The bounding box of the human in each image is provided.
   Very important: we don’t have to solve the detection problem.
   Action recognition is offered as a “taster challenge” in order to
   gauge interest in the general problem.
   It was difficult to hypothesize about what would succeed and what
   would not in this challenge.





Action classes






Segmentation: the role of context




   Context provides very important cues for making fine
   discriminations at the (super-)pixel scale.
   We can exploit three levels of scale: local, mid-level and global
   [Zhu, NIPS2008].
   Existing techniques apply overly-simplified models of context that
   do not generalize upward from local to global scales.


Segmentation: global constraints on label
combinations
   Our principal idea is to use global Classification to enhance
   segmentation results.
   Global image classification results tend to be less noisy than local ones.
   We will use them to constrain the combinations of semantic labels
   we are likely to encounter during segmentation.
   We showed last year how a tractable inference technique can be
   devised for this labeling problem (our PASCAL 2009 entry).
   This year we also show how mid-level context can be incorporated
   in the form of object detections.
   We also show how position priors can be similarly incorporated
   into the framework to provide class-specific location information.
   Finally, we devised a stochastic steepest ascent technique for
   optimizing the many parameters in a class-specific way.


Action recognition: driven by data limitations

   Initial experiments confirmed our intuition about the limitations of
   the data.
       Structural learning: sampling of pose space not dense enough.
       Latent SVM: object interactions under-sampled as well.
       Multiple kernel learning: converges to simple selection.
   From a very early stage, we decided to treat action recognition as
   an image classification problem.
   We exploit the small size of the dataset by performing extensive
   cross-validation.
   Features are one of our strong points, and we had to get the
   feature pipeline running for Classification in any case.
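
   As a rough illustration of the cross-validation strategy, the sketch
   below splits a small dataset into folds. The helper name
   `k_fold_indices`, the five folds and the ten samples are assumptions
   made for illustration; they are not details of the actual pipeline.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train, validation) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold f takes every n_folds-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        validation = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, validation


# Hypothetical usage: 10 samples, 5 folds.
splits = list(k_fold_indices(n_samples=10, n_folds=5))
```

   Every sample appears in exactly one validation fold, so each
   parameter setting can be scored on data it was not trained on.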




HCRFs for labeling problems

  We represent our segmentation problem as a graph: $G = (V, E)$.
  $V$ indexes the random variables, and $E$ is the set of
  undirected edges representing compatibility relationships between
  random variables.
  $X = \{X_i\}$ denotes the set of random variables or nodes, for $i \in V$.
  An energy function is defined over configurations of the
  random variables.
  By the Hammersley-Clifford theorem, the probability of a configuration
  $x = \{x_i\}$ can be written as the negative exponential of an
  energy function, $P(x) \propto \exp(-E(x))$ with
  $E(x) = \sum_{c \in C} \varphi_c(x_c)$, where $\varphi_c$ is the potential
  function of clique $c \in C$.




Consistency potentials for labeling problems
   The energy function of G can be written as:

   $$E(x) = \sum_{i \in V} \phi(x_i) + \sum_{(i,j) \in E_L} \psi_L(x_i, x_j) + \sum_{(i,g) \in E_G} \psi_G(x_i, x_g).$$

   The unary term $\phi(x_i)$ depends on a single probability
   $P(X_i = x_i \mid O_i)$, where $O_i$ is the observation that affects $X_i$ in the
   model.
   The smoothness potential $\psi_L(x_i, x_j)$ determines the pairwise
   relationship between two local nodes.
   The consistency potential $\psi_G(x_i, x_g)$ expresses the dependency
   between local nodes and a global node.
   The Maximum a Posteriori (MAP) estimate of the optimal
   labeling is:

   $$x^* = \arg\min_x E(x).$$
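
   To make the three terms concrete, here is a minimal sketch that
   evaluates the energy on a toy chain of three local nodes plus one
   global node, and finds the MAP labeling by exhaustive search. The
   label set, the unary costs, the edge weights and the Potts-style
   potentials are all illustrative assumptions; real images need an
   approximate solver rather than brute force.

```python
from itertools import product

LABELS = ("cow", "sheep")           # hypothetical label set
UNARY = [                           # phi(x_i): cost of each label at each local node
    {"cow": 0.2, "sheep": 1.0},
    {"cow": 0.9, "sheep": 0.6},
    {"cow": 0.3, "sheep": 0.8},
]
LOCAL_EDGES = [(0, 1), (1, 2)]      # E_L: pairs of neighbouring local nodes


def smoothness(xi, xj, weight=0.5):
    """Potts-style psi_L: pay `weight` when two neighbours disagree."""
    return 0.0 if xi == xj else weight


def consistency(xi, xg, weight=0.4):
    """psi_G: pay `weight` when a local label differs from the global label."""
    return 0.0 if xi == xg else weight


def energy(x, xg):
    """E(x) = unary terms + smoothness terms + consistency terms."""
    total = sum(UNARY[i][xi] for i, xi in enumerate(x))
    total += sum(smoothness(x[i], x[j]) for i, j in LOCAL_EDGES)
    total += sum(consistency(xi, xg) for xi in x)
    return total


def map_estimate():
    """Exhaustive arg min over all local labelings and global labels."""
    candidates = (
        (x, xg)
        for x in product(LABELS, repeat=len(UNARY))
        for xg in LABELS
    )
    return min(candidates, key=lambda c: energy(*c))


best_x, best_xg = map_estimate()
```

   In this toy case the middle node prefers "sheep" on its own, but the
   smoothness and consistency terms pull it to "cow", which is the
   behaviour the pairwise and global potentials are meant to produce.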


HCRF models of image segmentation

   [Figure: three HCRF models of image segmentation.
   Smoothness (Shotton et al., CVPR 2008); Potts (Plath et al., ICML 2009);
   Robust $P^N$, annotated "Free" (Ladicky et al., ICCV 2009).]


  Colored nodes represent (hidden) semantic labels.
  Dark nodes represent image measurements.
  Red edges represent penalties imposed by the potential.




Different features for discriminations


   The previously mentioned approaches all try to make global
   distinctions using local information.
   Either by voting of local observations (Potts),
   or by penalizing rampantly discordant local label assignments
   (Robust $P^N$).
   None of these techniques try to exploit truly global information to
   constrain local labels.
   And none incorporate the notion of encoding combinations of
   primitive node labels at the global level.





The harmony potential: selective subsets




   Only labels that do not agree with the subset are penalized.
   Can represent more diverse combinations.
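
   The difference from a single-label global node can be sketched in a
   few lines. The function names, the unit penalty and the example
   labels below are assumptions for illustration only.

```python
def harmony_potential(local_label, global_subset, penalty=1.0):
    """Penalize a local label only when it lies outside the global subset."""
    return 0.0 if local_label in global_subset else penalty


def single_label_consistency(local_label, global_label, penalty=1.0):
    """Baseline for comparison: the global node holds exactly one label."""
    return 0.0 if local_label == global_label else penalty


# Hypothetical image with four superpixels.
superpixel_labels = ["cow", "cow", "grass", "sheep"]
global_subset = frozenset({"cow", "grass"})

harmony_cost = sum(harmony_potential(x, global_subset) for x in superpixel_labels)
single_cost = sum(single_label_consistency(x, "cow") for x in superpixel_labels)
```

   With a subset as the global label, "grass" is not penalized for
   co-occurring with "cow"; only the discordant "sheep" superpixel is.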



The harmony potential: overview






Ranked subsampling of P(L)

  We can do this using the following posterior, where $\ell$ is a
  candidate subset of labels:

  $$P(\ell \subseteq x_g^* \mid O) \propto P(\ell \subseteq x_g^*)\, P(O \mid \ell \subseteq x_g^*).$$

  This allows us to effectively rank possible global node labels, and
  thus to prioritize candidates in the search for the optimal label $x_g^*$.
  $P(\ell \subseteq x_g^* \mid O)$ establishes an order on subsets of the (unknown)
  optimal labeling of the global node $x_g^*$ that guides the
  consideration of global labels.
  We may not be able to exhaustively consider all labels in the power set
  $\mathcal{P}(L)$, but at least we consider the most likely candidates for $x_g^*$.
  And image classification can give us an estimate of this posterior.
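
  A minimal sketch of such a ranking follows. It assumes, purely for
  illustration, that the prior is the fraction of training images whose
  label set contains the candidate subset, and that the likelihood can
  be approximated by a product of per-class classifier probabilities.
  The function name, the size limit and the example numbers are
  hypothetical.

```python
from itertools import combinations


def rank_label_subsets(class_probs, training_label_sets, max_size=2, top_k=5):
    """Rank candidate subsets by prior * likelihood and keep the top_k."""
    labels = sorted(class_probs)
    n_train = len(training_label_sets)
    scored = []
    for size in range(1, max_size + 1):
        for combo in combinations(labels, size):
            subset = frozenset(combo)
            # Prior: how often the subset is contained in a training label set.
            prior = sum(1 for t in training_label_sets if subset <= t) / n_train
            # Likelihood stand-in (assumption): product of classifier outputs.
            likelihood = 1.0
            for label in combo:
                likelihood *= class_probs[label]
            scored.append((prior * likelihood, combo))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]


training_label_sets = [
    frozenset({"cow"}),
    frozenset({"cow", "person"}),
    frozenset({"sheep"}),
    frozenset({"person"}),
]
class_probs = {"cow": 0.8, "person": 0.5, "sheep": 0.1}
ranking = rank_label_subsets(class_probs, training_label_sets)
```

  Only the highest-ranked subsets are then passed to inference as
  candidate labels for the global node, which avoids enumerating the
  whole power set.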



PASCAL 2010: pushing the limit

  The previous slides describe our approach used for the PASCAL
  2009 submission.
  The discriminative model was based only on SVMs trained to
  discriminate object classes from their own backgrounds.
  Starting with the harmony potential approach, this year we
  concentrated on adding cues derived from different levels of
  mid-level context.
  We found the HCRF model with harmony potential to be very
  useful for performing this fusion.
  Our hypothesis at the end of the 2009 competition was that
  detection would be essential for pushing forward the
  state-of-the-art.



PASCAL 2010: fusing across scales
 1   FG/BG: 20 SVMs trained to discriminate classes from their own
     background. The same discriminative model used last year,
     essential for localizing object boundaries.
 2   CLASS: 20 SVMs trained to discriminate each object class from
     the other object classes. Essential for distinguishing objects with similar
     backgrounds (e.g. cows from sheep, birds from planes).
     Incorporated directly into the unary potential.
 3   LOC: 20 class-specific location priors. Computed from ground
     truth segmentations by simple, spatial averaging. A form of
     top-down mid-level context.
 4   OBJ: 20 class-specific object detectors [Felzenszwalb 2010] are
     converted to superpixel scores by selecting the highest scoring
     detection intersecting each pixel of the superpixel. A type of
     bottom-up mid-level context.
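
     The LOC and OBJ cues can be sketched as follows. This is an
     illustration under stated assumptions, not our implementation:
     masks are taken to be already resampled to a common grid, boxes are
     given as inclusive (row0, col0, row1, col1) corners, uncovered
     pixels receive a background score of 0, and per-pixel detection
     scores are averaged inside each superpixel. All function names are
     hypothetical.

```python
def location_prior(masks):
    """Average binary ground-truth masks of one class into a spatial prior."""
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[sum(m[r][c] for m in masks) / n for c in range(cols)]
            for r in range(rows)]


def detection_score_map(detections, rows, cols, background=0.0):
    """Per-pixel score: the highest-scoring detection box covering the pixel."""
    best = [[float("-inf")] * cols for _ in range(rows)]
    for (r0, c0, r1, c1), score in detections:
        for r in range(max(r0, 0), min(r1, rows - 1) + 1):
            for c in range(max(c0, 0), min(c1, cols - 1) + 1):
                best[r][c] = max(best[r][c], score)
    # Pixels not covered by any detection fall back to the background score.
    return [[background if v == float("-inf") else v for v in row] for row in best]


def superpixel_scores(score_map, superpixel_ids):
    """Mean per-pixel score inside each superpixel (averaging is an assumption)."""
    totals, counts = {}, {}
    for score_row, id_row in zip(score_map, superpixel_ids):
        for score, sp in zip(score_row, id_row):
            totals[sp] = totals.get(sp, 0.0) + score
            counts[sp] = counts.get(sp, 0) + 1
    return {sp: totals[sp] / counts[sp] for sp in totals}


# Toy 2x2 masks for the LOC cue.
prior = location_prior([[[1, 0], [0, 0]], [[1, 1], [0, 0]]])

# Toy 2x4 image with two overlapping detections for the OBJ cue.
detections = [((0, 0, 1, 1), 0.9), ((0, 1, 1, 2), 0.4)]
score_map = detection_score_map(detections, rows=2, cols=4)
obj_scores = superpixel_scores(score_map, [[0, 0, 1, 1], [0, 0, 1, 1]])
```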


PASCAL 2010: learning unary potentials

  We compute the unary potential by weighting the classification
  scores $\{s_i(k, x_i)\}_{k \in F}$ through a sigmoid function. The unary
  potential becomes:

  $$\phi_i^L(x_i) = -\mu_L K_i \log \prod_{k \in F} \frac{1}{1 + \exp(f_i(k, x_i))}$$

  $$f_i(k, x_i) = a(k, x_i)\, s_i(k, x_i) + b(k, x_i)$$

  $\mu_L$ is the weighting factor of the local unary potential, and
  $K_i$ normalizes over the number of pixels inside the superpixel.
  We have two sigmoid parameters for each class/cue pair: $a(k, x_i)$
  and $b(k, x_i)$.
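
  A small numerical sketch of this potential for one superpixel and one
  candidate label is given below. It assumes that the sigmoid outputs of
  the individual cues are combined as a product inside the logarithm;
  the cue names, scores and parameter values are invented for
  illustration.

```python
import math


def unary_potential(scores, a, b, mu_local, k_i):
    """-mu_L * K_i * log prod_k 1 / (1 + exp(f_k)), with f_k = a_k * s_k + b_k."""
    log_prob = 0.0
    for cue, s in scores.items():
        f = a[cue] * s + b[cue]
        # log(1 / (1 + exp(f))) = -log(1 + exp(f)), evaluated without overflow.
        log_prob -= max(f, 0.0) + math.log1p(math.exp(-abs(f)))
    return -mu_local * k_i * log_prob


# Hypothetical classifier scores for one superpixel and one label.
scores = {"FG/BG": 2.0, "CLASS": 1.0}
a = {"FG/BG": -1.0, "CLASS": -1.0}   # negative slope: higher score, lower cost
b = {"FG/BG": 0.0, "CLASS": 0.0}
phi = unary_potential(scores, a, b, mu_local=1.0, k_i=1.0)
```

  With this sign convention a cue contributes a low cost when $f$ is
  strongly negative, so a learned slope $a$ would typically be negative
  for scores where larger means more confident.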



Datasets


  We have evaluated the harmony potential approach on two
  standard, publicly available datasets.
  The Pascal VOC 2010 Segmentation Challenge dataset contains
  2250 color images of 20 different semantic classes.
  This set is split into 750 images for training, 750 images for
  testing, and 750 for validation.
  The Microsoft MSRC-21 dataset contains 591 color images of 21
  object classes.
  We do our own splits for cross-validation on MSRC-21.





Unsupervised segmentation
  Images are first over-segmented with quick-shift to derive
  super-pixels [Fulkerson, ICCV 2009].
  This preserves object boundaries while simplifying the
  representation.
  Working at the super-pixel level reduces the number of nodes in
  the CRF by $10^2$ to $10^5$ per image.






Local classification scores: P(Xi = xi |Oi )

   We extract patches with 50% overlap on a regular grid at several
   resolutions (12, 24, 36 and 48 pixels in diameter).
   Patches are described with SIFT, color and, for MSRC-21, location
   features.
   A vocabulary is constructed using k-means to quantize to 1000
   SIFT words and 400 color words.
   An SVM classifier using an intersection kernel is built for each
   semantic category.
   A similar number of positive and negative examples are used:
   a total of around 8,000 superpixel samples for MSRC-21, and
   20,000 for VOC 2010, for each class.
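
   The quantization and kernel steps can be sketched as follows. The
   two-word vocabulary, the 2-D descriptors and the L1 normalization of
   the histograms are assumptions for illustration; the actual
   vocabularies contain 1000 SIFT words and 400 color words learned
   with k-means.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the closest visual word under squared Euclidean distance."""
    def dist2(word):
        return sum((d - w) ** 2 for d, w in zip(descriptor, word))
    return min(range(len(vocabulary)), key=lambda k: dist2(vocabulary[k]))


def bag_of_words(descriptors, vocabulary):
    """L1-normalized histogram of visual-word counts."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def intersection_kernel(h1, h2):
    """Histogram intersection: the sum of bin-wise minima."""
    return sum(min(u, v) for u, v in zip(h1, h2))


vocabulary = [(0.0, 0.0), (1.0, 1.0)]   # toy stand-in for k-means centers
hist_a = bag_of_words([(0.1, 0.0), (0.9, 1.0), (1.0, 0.8)], vocabulary)
hist_b = bag_of_words([(0.0, 0.2), (0.1, 0.1)], vocabulary)
similarity = intersection_kernel(hist_a, hist_b)
```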




Global potential and general approach
   For the PASCAL 2010 dataset we use our entry to the 2010 VOC
   Classification Challenge:
   [Khan, IJCV2010 (submitted)].
   It uses a bag-of-words representation based on SIFT and color
   SIFT, plus spatial pyramids and color attention
   [Khan, ICCV 2009].
   An SVM classifier with a χ2 kernel is trained for each semantic
   category in the dataset.
   The FG/BG and CLASS cues are computed by training a
   discriminative model using an SVM with histogram intersection
   kernel.
   Except for the additional cues and optimization strategy, the
   architecture is the same as our approach described at CVPR
   [Gonfaus, CVPR2010].
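
   For reference, one common form of the χ2 kernel is sketched below.
   The exponential form and the `gamma` parameter are assumptions; the
   slides do not state which variant was used.

```python
import math


def chi2_kernel(h1, h2, gamma=1.0):
    """exp(-gamma * sum((u - v)^2 / (u + v))), skipping empty bin pairs."""
    dist = sum((u - v) ** 2 / (u + v) for u, v in zip(h1, h2) if u + v > 0)
    return math.exp(-gamma * dist)


k_same = chi2_kernel([0.5, 0.5], [0.5, 0.5])
k_diff = chi2_kernel([0.5, 0.5], [1.0, 0.0])
```

   Identical histograms give a kernel value of 1, and the value decays
   towards 0 as the histograms diverge.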


Learning the HCRF parameters
  We found it to be essential to train the per-class sigmoid
  parameters through cross validation.
  Classification scores are learned independently, are unbalanced
  and are effectively incomparable in many cases.
  The sigmoid functions weight the importance of each cue for each
  class.
  In addition to these (180) sigmoid parameters, we also must learn
  the weighting factors for each potential.
  We use a stochastic, steepest ascent technique to optimize these
  parameters on a validation set.
  In each step we randomly generate new instances of parameters.
  New parameter instances are generated using a Gibbs-like
  sampling strategy.
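
  One way such a search could look is sketched below. It is an
  illustration only: the Gaussian proposals, the round-robin choice of
  which parameter to resample, the number of candidates per step and the
  toy objective are all assumptions. In practice `score` would run
  segmentation on the validation set and return its accuracy.

```python
import random


def stochastic_steepest_ascent(score, params, n_steps=200, n_candidates=8,
                               step=0.5, seed=0):
    """Resample one parameter at a time; keep the best candidate if it improves."""
    rng = random.Random(seed)
    best = list(params)
    best_score = score(best)
    for it in range(n_steps):
        coord = it % len(best)            # Gibbs-like: one parameter per step
        candidates = []
        for _ in range(n_candidates):
            cand = list(best)
            cand[coord] += rng.gauss(0.0, step)
            candidates.append(cand)
        top = max(candidates, key=score)  # steepest among the sampled moves
        top_score = score(top)
        if top_score > best_score:
            best, best_score = top, top_score
    return best, best_score


def toy_validation_score(p):
    """Stand-in objective with its maximum (0.0) at p = (1, -2)."""
    return -((p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2)


best_params, best_value = stochastic_steepest_ascent(toy_validation_score, [0.0, 0.0])
```

  Because a move is accepted only when it improves the score, the
  validation score never decreases from one step to the next.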


History: PASCAL VOC 2009


  Class           BONN   BROOKES   Harmony potential
  Background      83.9    79.6      80.5
  Aeroplane       64.3    48.3      62.3
  Bicycle         21.8     6.7      24.1
  Bird            21.7    19.1      28.3
  Boat            32.0    10.0      30.5
  Bottle          40.2    16.6      32.7
  Bus             57.3    32.7      42.2
  Car             49.4    38.1      48.1
  Cat             38.8    25.3      22.8
  Chair            5.2     5.5       9.1
  Cow             28.5     9.4      30.1
  Dining Table    22.0    25.1       7.9
  Dog             19.6    13.3      21.5
  Horse           33.6    12.3      41.9
  Motorbike       45.5    35.5      49.6
  Person          33.6    20.7      31.5
  Potted Plant    27.3    13.4      26.1
  Sheep           40.4    17.1      37.0
  Sofa            18.1    18.4      20.1
  Train           33.6    37.5      39.4
  TV/Monitor      46.1    36.4      31.1
  Average         36.3    24.8      34.1





Qualitative results: MSRC-21






Quantitative results: MSRC-21




   MSRC-21 contains more multi-class images than PASCAL.
   Our performance demonstrates the benefits of incorporating
   global scale when making local decisions.






Qualitative results: PASCAL 2010






Quantitative results: PASCAL 2010




   FG/BG shows the performance of our baseline (PASCAL 2009)
   approach.
   At the top, performance on the validation set (i.e. how well we
   thought we were doing).
   Image tags indicate how well the technique can perform with
   perfect global information.


The cost of segmentation
   The optimal MAP label configuration x∗ is inferred using
   α-expansion graph cuts [Kolmogorov, PAMI2004].
   The global node uses the 100 most probable label subsets
   obtained from ranked subsampling.

   [Figure: mAP on MSRC-21 (left axis, 50 to 85) and mAP on PASCAL VOC 2010
   (right axis, 30 to 50) as a function of the number of labels selected
   (1 to 200).]
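
   As a rough illustration of how a global node could be restricted to a
   small candidate set, the sketch below samples label subsets from
   per-class posteriors, ranks the distinct subsets by probability and keeps
   the top few. The independence assumption, the names `subset_probability`
   and `ranked_subsample`, and the toy posteriors are all assumptions made
   for illustration; the slides do not spell out the actual procedure.

```python
import random


def subset_probability(subset, posteriors):
    """Probability of a label subset, ASSUMING independent per-class posteriors."""
    p = 1.0
    for label, prob in posteriors.items():
        p *= prob if label in subset else (1.0 - prob)
    return p


def ranked_subsample(posteriors, num_samples=2000, keep=100, seed=0):
    """Sample label subsets, drop duplicates, and keep the `keep` most probable."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(num_samples):
        # Each class is switched on with its own posterior probability.
        subset = frozenset(l for l, p in posteriors.items() if rng.random() < p)
        seen.add(subset)
    ranked = sorted(seen, key=lambda s: subset_probability(s, posteriors), reverse=True)
    return ranked[:keep]


# Toy posteriors (hypothetical values, not taken from the slides).
posteriors = {"person": 0.9, "horse": 0.7, "dog": 0.2, "sofa": 0.05}
candidates = ranked_subsample(posteriors, keep=5)
```

   Only the subsets in `candidates` would then need to be considered as
   states of the global node, which is what keeps inference cheap.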


Qualitative results: PASCAL 2010 failures




   Context is sometimes weighted too much.
   When the global classifier fails, little can be done.





Every little bit helps






A photo finish
      [Bar chart, scale 15 to 40:]
           FG-BG            33.9
           CLASS            23.4
           LOC              20.1
           OBJ              26.2
           FG-BG + CLASS    36.6
           All              40.4

      [Plot: mAP on PASCAL VOC 2010 (axis 30 to 42) versus #iterations
      (0 to 3000).]



      The final results are tough to call between BONN and CVC.
      In the end, fusion over many scales and per-class, per-feature
      parameter optimization won.





The action recognition taster


   Images were collected from Flickr using action queries. A set of nine
   actions was chosen in the end.
   They are disjoint from the main challenge dataset.
   Only a subset of people is annotated (bounding box + action).
   Each person in this subset is labelled with exactly one action class.
   Important point: we don’t have to solve the detection problem.
   Most action classes in the challenge contain either large variation
   in scale or large variations in pose (or both).






Dataset breakdown

                          train       val      trainval     test
                       img  obj   img  obj   img  obj   img  obj
   Phoning              25   25    25   26    50   51     -    -
   Playinginstrument    27   38    27   38    54   76     -    -
   Reading              25   26    26   27    51   53     -    -
   Ridingbike           25   33    25   33    50   66     -    -
   Ridinghorse          27   35    26   36    53   71     -    -
   Running              26   47    25   47    51   94     -    -
   Takingphoto          25   27    26   28    51   55     -    -
   Usingcomputer        26   29    26   30    52   59     -    -
   Walking              25   41    26   42    51   83     -    -
   Total               226  301   228  307   454  608     -    -




Grouplets and poselets
   Two state-of-the-art techniques for action recognition in still
   images. The grouplets of Fei-Fei Li [Yao et al, CVPR2010]:




   And the latent poses of Greg Mori [Yang et al, CVPR2010]:






Treat it like image classification

   Initial experiments confirmed our intuition about the limitations of
   the data.
       Structural learning: sampling of pose space not dense enough.
       Latent SVM: complexity of object interactions problematic.
       Multiple kernel learning: converges to simple selection.
   State-of-the-art techniques rely on learning complex structural
   models of pose-variations over many
   From a very early stage, we decided to treat action recognition as
   an image classification problem.
   We exploit the small size of the dataset by performing extensive
   cross validation.




The classification pipeline






Action recognition: features


   SIFT, color SIFT (normalized R/G and opponent), self-similarity,
   SURF, PHOG (good for capturing pose), and color attention
   (focuses on interesting color features).
   Sparse and dense variations of most of these.
   Plus a range of pyramid configurations (1, 2 × 2, 3 × 3, 4 × 4).
   Object detectors also incorporated using a simple occurrence
   histogram [Felzenszwalb 2010].
   The goal was to incorporate all of this into a BoVW classifier and
   push the limits of what is possible using classical BoW on actions.
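
   To make the pyramid configurations concrete, here is a minimal sketch of
   a spatial-pyramid bag-of-visual-words descriptor over 1, 2 × 2, 3 × 3 and
   4 × 4 grids. The function name `pyramid_histogram`, the normalized
   coordinates and the final L1 normalization are assumptions chosen for
   illustration, not details confirmed by the slides.

```python
def pyramid_histogram(points, words, vocab_size, grids=(1, 2, 3, 4)):
    """Concatenate per-cell visual-word histograms over several grid levels.

    points: (x, y) feature positions normalized to [0, 1]
    words:  visual-word index (0 .. vocab_size - 1) of each feature
    """
    descriptor = []
    for g in grids:
        cells = [[0.0] * vocab_size for _ in range(g * g)]
        for (x, y), w in zip(points, words):
            col = min(int(x * g), g - 1)  # clamp so x == 1.0 stays in the last cell
            row = min(int(y * g), g - 1)
            cells[row * g + col][w] += 1.0
        for hist in cells:
            descriptor.extend(hist)
    total = sum(descriptor)
    # L1-normalize so descriptors from images with different feature counts compare.
    return [v / total for v in descriptor] if total > 0 else descriptor


# Three toy features and a two-word vocabulary, using only the 1 and 2 x 2 levels.
desc = pyramid_histogram(
    points=[(0.1, 0.1), (0.9, 0.9), (0.6, 0.2)],
    words=[0, 1, 1],
    vocab_size=2,
    grids=(1, 2),
)
```

   With the default grids, the descriptor has 30 times the vocabulary size
   entries (1 + 4 + 9 + 16 cells), which is one reason the pool of feature
   configurations grows so quickly.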





Action recognition: contextual pyramids


   Context was also important for most object classes.
   We used a type of foreground/background pyramid decomposition
   that split features into object or background.
   This was done using a type of spatial soft-assign based on the
   distance to the boundary of the object.
   For some classes, we also assigned contextual object regions that
   model the appearance of objects associated with them (the “horsy
   box”).
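
   One way such a spatial soft-assign could look is sketched below: each
   feature gets a foreground and a background weight from its signed
   distance to the bounding box boundary. The sigmoid weighting, the
   `sigma` bandwidth and both function names are assumptions; the slides
   only say that the assignment depends on the distance to the boundary.

```python
import math


def signed_distance_to_box(x, y, box):
    """Positive inside the box (distance to nearest edge), negative outside."""
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0.0, x - x1)
    dy = max(y0 - y, 0.0, y - y1)
    if dx > 0.0 or dy > 0.0:
        return -math.hypot(dx, dy)
    return min(x - x0, x1 - x, y - y0, y1 - y)


def soft_assign(x, y, box, sigma=10.0):
    """Return (foreground, background) weights that sum to one."""
    d = signed_distance_to_box(x, y, box)
    z = max(-50.0, min(50.0, d / sigma))  # clamp to avoid overflow in exp
    fg = 1.0 / (1.0 + math.exp(-z))
    return fg, 1.0 - fg


box = (100.0, 100.0, 200.0, 200.0)          # hypothetical person bounding box
fg_center, bg_center = soft_assign(150.0, 150.0, box)
fg_edge, bg_edge = soft_assign(100.0, 150.0, box)
fg_outside, bg_outside = soft_assign(50.0, 150.0, box)
```

   A feature would then contribute `fg` to the object histogram and `bg` to
   the background histogram, so features near the boundary are shared
   between the two rather than assigned by a hard cut.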






Action recognition: learning in the design space

   In the end, after all of the combinatorics introduced by pyramids
   and other variations, we had about 100 feature configurations in a
   big pool.
   Most attempts to automatically learn the parameters of these
   features were total failures.
   Except one. Initial experiments with multiple kernel learning
   showed that MKL starts converging quickly towards class-specific
   feature selection rather than mixing.
   With such a small dataset, and a little heuristic trimming, we were
   able to exhaustively explore a part of the design space.
   This resulted in the best per-class feature combinations.
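
   The exhaustive exploration can be pictured as a brute-force search over
   small feature subsets, scored separately for each class. In the sketch
   below, `best_combination`, the subset size limit and the toy scoring
   table are hypothetical; in practice the score would be something like a
   cross-validated average precision, and the numbers here are made up
   rather than taken from our experiments.

```python
from itertools import combinations


def best_combination(features, score, max_size=3):
    """Score every feature subset up to `max_size` and return the best one."""
    best_subset, best_score = None, float("-inf")
    for k in range(1, max_size + 1):
        for subset in combinations(features, k):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score


# Made-up per-class usefulness of each feature (illustration only).
toy_gain = {
    "ridinghorse": {"sift": 0.30, "phog": 0.10, "detector": 0.25, "color": 0.02},
    "running": {"sift": 0.20, "phog": 0.30, "detector": 0.01, "color": 0.02},
}


def make_score(cls):
    # Stand-in for a cross-validated score: summed gain minus a cost per feature.
    return lambda subset: sum(toy_gain[cls][f] for f in subset) - 0.05 * len(subset)


per_class = {
    cls: best_combination(sorted(toy_gain[cls]), make_score(cls))
    for cls in toy_gain
}
```

   The search is exponential in the pool size, which is why it is only
   feasible with a small dataset and some heuristic trimming of the pool.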




Action recognition: classification



   We experimented with a number of kernels (histogram
   intersection, χ², bin-ratio distance).
   There wasn’t a huge difference among these kernels.
   In the end, we chose histogram intersection for our submission as
   it appeared to generalize better.
   In addition to over-fitting less, there are no parameters to tune and
   it is very fast.
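
   For reference, two of these kernels are easy to write down. The sketch
   below uses the standard definitions of the histogram intersection kernel
   and the exponential χ² kernel; the function names and the `gamma`
   default are assumptions, and the bin-ratio distance is not sketched.

```python
import math


def histogram_intersection(h1, h2):
    """Histogram intersection kernel: sum of bin-wise minima, no free parameters."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def chi2_kernel(h1, h2, gamma=1.0):
    """Exponential chi-squared kernel; `gamma` is a bandwidth that must be tuned."""
    d = sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)
    return math.exp(-gamma * d)


h1 = [0.5, 0.5, 0.0]
h2 = [0.25, 0.25, 0.5]
k_hi = histogram_intersection(h1, h2)
k_chi2 = chi2_kernel(h1, h2)
```

   Note that the χ² kernel carries a bandwidth parameter while histogram
   intersection has none, which matches the "no parameters to tune" point
   above.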






Overall results: average precision






Per-class AP






Per technique median average precision






Qualitative results




   When the horsey box and detectors fail, context dominates.
   Classifier still surprisingly robust.





Qualitative results




   Some fine discriminations very difficult to make.
   Probably difficult even for humans.





Qualitative results




   People taking photos should be banned.
   Classes with large pose variations were the most difficult.





Discussion: semantic image segmentation

  The harmony potential works well for fusing global information into
  local segmentations.
  This year we showed that the harmony potential framework is also
  appropriate for incorporating different types of mid-level cues.
  Ranked sub-sampling, driven by the same posterior as used to
  define the global potential function, renders the optimization
  problem tractable.
  Most useful when multiple semantic classes co-occur frequently.
  Per-class learning of parameters essential (about +5% in final
  results).




Discussion: action recognition


   This year’s taster challenge on action recognition was little more
   than a toy.
   However, we have demonstrated what is possible using proven
   techniques from image classification.
   We feel that object context, in particular object interaction context,
   is the way forward.
   The PASCAL data set is the right direction to go (more general),
   but we need more samples.






The future: segmentation

   Semantic image segmentation has come a long way, but still has a
   long way to go.
   It is becoming a mainstream event in PASCAL.
   This year we arrived at a sort of three-way détente between the
   CVC (winner 2010), BONN (winner 2009) and OXFORD (best
   paper award ECCV 2010) in segmentation.
   Each has its own approach, with its own advantages and
   disadvantages.
   Engineering can probably maximize results.
   It is becoming mature, and we can begin thinking about what new
   applications are enabled by such technologies.



The future: action recognition


   It seems that action recognition in still images is a popular
   challenge.
   The PASCAL organizers are keen to promote it for the future.
   The focus will remain on still images, perhaps with more
   emphasis on incorporating user interaction as well.
   It seems that the community is becoming more interested in the
   “alternative” PASCAL challenges.
   The multimedia community probably has an important role to play
   here.




PASCAL VOC 2010: semantic object segmentation and action recognition in still images

  • 1. Introduction Harmony potential 2.0: fusing across scale Action recognition Discussion PASCAL VOC 2010 Semantic object segmentation and action recognition in still images Andrew D. Bagdanov bagdanov@cvc.uab.es ´ Departamento de Ciencias de la Computacion ´ Universidad Autnoma de Barcelona Xavier Pep Nataliya Wenjuan Fahad The CVC PASCAL VOC Team CVC PASCAL VOC 2010
  • 2. Introduction PASCAL VOC 2010 Harmony potential 2.0: fusing across scale Semantic image segmentation Action recognition Action recognition Discussion Our main ideas Overview On 03/05/2010 the PASCAL VOC competition was announced and the training and validation sets published. 20 semantic categories for the competition remain the same: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and tv/monitor. The CVC PASCAL VOC Team CVC PASCAL VOC 2010
  • 3. Introduction PASCAL VOC 2010 Harmony potential 2.0: fusing across scale Semantic image segmentation Action recognition Action recognition Discussion Our main ideas Old competitions, new competitions There are two (+ 1/2) main challenges in PASCAL. Image classification is the prediction of the presence/absence of an instance of class in a test image. Object detection is the prediction of the bounding box and label of each object from the twenty target classes in a test image. Semantic image segmentation is the assignment of one of the twenty class labels to every pixel in a test image. Image segmentation is becoming a mainstream competition. Action recognition in still images was included as a new “taster challenge” this year. Taster competitions are used to measure interest in new problems. The CVC PASCAL VOC Team CVC PASCAL VOC 2010
  • 4. Introduction PASCAL VOC 2010 Harmony potential 2.0: fusing across scale Semantic image segmentation Action recognition Action recognition Discussion Our main ideas Our contributions to PASCAL VOC 2010 Last year we participated in the Detection, Classification and Segmentation challenges. This year we decided to concentrate on Classification and Segmentation. Our segmentation technique relies heavily on classification. We also fielded a team in Action Recognition this year to see what that’s all about. As always, success in PASCAL VOC challenges is approximately 85% engineering, 10% inspiration and 5% luck (if you’re lucky). The CVC PASCAL VOC Team CVC PASCAL VOC 2010
  • 5. Introduction PASCAL VOC 2010 Harmony potential 2.0: fusing across scale Semantic image segmentation Action recognition Action recognition Discussion Our main ideas Outline 1 Introduction Overview of the challenges Our contribution and main ideas 2 The harmony potential 2.0: fusing across scale Building on last year’s submission Fusing across scales and learning 3 Action recognition A torrent of features Exploiting the size of the problem 4 Discussion The CVC PASCAL VOC Team CVC PASCAL VOC 2010
  • 6. Introduction PASCAL VOC 2010 Harmony potential 2.0: fusing across scale Semantic image segmentation Action recognition Action recognition Discussion Our main ideas Giving semantics to pixels Image Object Class Semantic image segmentation is not object segmentation Only for simple cases are they the same. The CVC PASCAL VOC Team CVC PASCAL VOC 2010
  • 7. Introduction PASCAL VOC 2010 Harmony potential 2.0: fusing across scale Semantic image segmentation Action recognition Action recognition Discussion Our main ideas Turning a hard problem into a harder one Image Object Class The object is to assign semantic labels to every pixel Fine distinctions must be made The CVC PASCAL VOC Team CVC PASCAL VOC 2010
  • 8. Introduction PASCAL VOC 2010 Harmony potential 2.0: fusing across scale Semantic image segmentation Action recognition Action recognition Discussion Our main ideas Make that a very hard one Image Object Class The objective is to assign semantic labels to every pixel Fine distinctions must be made Occlusions, varying viewpoint and size complicate things The CVC PASCAL VOC Team CVC PASCAL VOC 2010
  • 9. Introduction PASCAL VOC 2010 Harmony potential 2.0: fusing across scale Semantic image segmentation Action recognition Action recognition Discussion Our main ideas Action recognition in still images New competition this year: human action recognition in still images. Individual images sampled from the Flikr dataset. Bounding boxes of the human in each image is provided. Very important: we don’t have to solve the detection problem. Action recognition is offered as a “taster challenge” in order to gauge interest in the general problem. It was difficult to hypothesize about what would succeed and what would not in this challenge. The CVC PASCAL VOC Team CVC PASCAL VOC 2010
  • 10. Introduction PASCAL VOC 2010 Harmony potential 2.0: fusing across scale Semantic image segmentation Action recognition Action recognition Discussion Our main ideas Action classes The CVC PASCAL VOC Team CVC PASCAL VOC 2010
  • 11. Introduction PASCAL VOC 2010 Harmony potential 2.0: fusing across scale Semantic image segmentation Action recognition Action recognition Discussion Our main ideas Segmentation: the role of context Context provides very important cues for make fine discriminations at the (super-) pixel scale. We can exploit three levels of scale: local, mid-level and global [Zhu, NIPS2008]. Existing techniques apply overly-simplified models of context that do not generalize upward from local to global scales. The CVC PASCAL VOC Team CVC PASCAL VOC 2010
  • 12. Introduction PASCAL VOC 2010 Harmony potential 2.0: fusing across scale Semantic image segmentation Action recognition Action recognition Discussion Our main ideas Segmentation: global constraints on label combinations Our principal idea is to use global Classification to enhance segmentation results. Global image classification results tend to be less noisy than ones. We will use them to constrain the combinations of semantic labels we are likely to encounter during segmentation. We showed last year how a tractable inference technique can be devised for this labeling problem (our PASCAL 2009 entry). This year we also show how mid-level context can be incorporated in the form of object detections. We also show how position priors cam be similarly incorporated into the framework to provide class specific location information. Finally, we devised a stochastic steepest ascent technique for optimizing the many parameters in a class-specific way. The CVC PASCAL VOC Team CVC PASCAL VOC 2010
  • 13. Introduction PASCAL VOC 2010 Harmony potential 2.0: fusing across scale Semantic image segmentation Action recognition Action recognition Discussion Our main ideas Action recognition: driven by data limitations Initial experiments confirmed our intuition about the limitations of the data. Structural learning: sampling of pose space not dense enough. Latent SVM: object interactions under-sampled as well. Multiple kernel learning: converges to simple selection. From a very early stage, we decided to treat action recognition as an image classification problem. We exploit the small size dataset by performing extensive cross validation. Features are one of our string points, and we had to get the feature pipeline running for Classification in any case. The CVC PASCAL VOC Team CVC PASCAL VOC 2010
  • 14. HCRFs for labeling problem. We represent our segmentation problem as a graph $G = (V, E)$. $V$ is used for indexing random variables, and $E$ is the set of undirected edges representing compatibility relationships between random variables. $X = \{X_i\}$ denotes the set of random variables or nodes, for $i \in V$. An energy function is defined over configurations of the random variables. By the Hammersley-Clifford theorem, the probability of a configuration $x = \{x_i\}$ can be written as the (normalized) negative exponential of an energy function $E(x) = \sum_{c \in C} \varphi_c(x_c)$, where $\varphi_c$ is the potential function of clique $c \in C$.
  • 15. Consistency potentials for labeling problems. The energy function of $G$ can be written as: $E(x) = \sum_{i \in V} \phi(x_i) + \sum_{(i,j) \in E_L} \psi_L(x_i, x_j) + \sum_{(i,g) \in E_G} \psi_G(x_i, x_g)$. The unary term $\phi(x_i)$ depends on a single probability $P(X_i = x_i \mid O_i)$, where $O_i$ is the observation that affects $X_i$ in the model. The smoothness potential $\psi_L(x_i, x_j)$ determines the pairwise relationship between two local nodes. The consistency potential $\psi_G(x_i, x_g)$ expresses the dependency between local nodes and a global node. The Maximum a Posteriori (MAP) estimate of the optimal labeling is: $x^* = \arg\min_x E(x)$.
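The three-term energy on this slide can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: a single global node whose label is a set of classes, a Potts-style smoothness term, a constant consistency penalty, and brute-force minimization that only works on toy problems.

```python
from itertools import product

def energy(labels, global_label, unary, local_edges, pairwise, consistency):
    """E(x) = sum_i phi(x_i) + sum_(i,j) psi_L(x_i, x_j) + sum_i psi_G(x_i, x_g)."""
    e = sum(unary[i][labels[i]] for i in range(len(labels)))
    e += sum(pairwise(labels[i], labels[j]) for i, j in local_edges)
    # Assumption: every local node is connected to the single global node.
    e += sum(consistency(x_i, global_label) for x_i in labels)
    return e

def brute_force_map(n_labels, global_label, unary, local_edges, pairwise, consistency):
    """x* = argmin_x E(x), by exhaustive enumeration (toy sizes only)."""
    configs = product(range(n_labels), repeat=len(unary))
    return min(configs, key=lambda x: energy(x, global_label, unary,
                                             local_edges, pairwise, consistency))

# Hypothetical toy problem: 3 superpixels, 2 labels.
unary = [[0.1, 2.0], [1.5, 0.2], [0.3, 0.4]]
local_edges = [(0, 1), (1, 2)]
potts = lambda a, b: 0.0 if a == b else 1.0
consistency = lambda x_i, x_g: 0.0 if x_i in x_g else 5.0

x_star = brute_force_map(2, {0, 1}, unary, local_edges, potts, consistency)
```

In practice the minimization is done with graph cuts rather than enumeration; the sketch only makes the structure of the energy explicit.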
  • 16. HCRF models of image segmentation. [Figure: four graphical models labelled Smoothness, Potts, Robust $P^N$ and Free; references shown: (Shotton et al, CVPR2008), (Plath et al, ICML2009), (Ladicky et al, ICCV2009).] Colored nodes represent (hidden) semantic labels. Dark nodes represent image measurements. Red edges represent penalties imposed by the potential.
  • 17. Different features for discriminations. The previously mentioned approaches all try to make global distinctions using local information: either by voting of local observations (Potts), or by penalizing rampantly discordant local label assignments (Robust $P^N$). None of these techniques tries to exploit truly global information to constrain local labels. And none incorporates the notion of encoding combinations of primitive node labels at the global level.
  • 18. The harmony potential: selective subsets. Only labels that do not agree with the subset are penalized. Can represent more diverse combinations.
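A minimal sketch of the idea on this slide, contrasted with a single-label (Potts-style) global node. The constant penalty `gamma` and the function names are illustrative assumptions, not the exact form used in the submission.

```python
def harmony_penalty(local_label, global_subset, gamma=1.0):
    """Penalize a local label only if it is absent from the global label subset."""
    return 0.0 if local_label in global_subset else gamma

def potts_global_penalty(local_label, global_label, gamma=1.0):
    """Single-label global node: every disagreeing local label is penalized."""
    return 0.0 if local_label == global_label else gamma

# Hypothetical superpixel labeling of one image.
local_labels = ["cow", "grass", "cow", "sheep"]

harmony_cost = sum(harmony_penalty(l, {"cow", "grass"}) for l in local_labels)
potts_cost = sum(potts_global_penalty(l, "cow") for l in local_labels)
```

With the subset `{"cow", "grass"}` only the `"sheep"` superpixel is penalized, whereas the single-label global node also penalizes `"grass"`, which is why the subset form can represent more diverse label combinations.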
  • 19. The harmony potential: overview
  • 20. Ranked subsampling of $\mathcal{P}(L)$. We can do this using the following posterior: $P(\ell \subseteq x_g^* \mid O) \propto P(\ell \subseteq x_g^*)\, P(O \mid \ell \subseteq x_g^*)$. This allows us to effectively rank possible global node labels, and thus to prioritize candidates in the search for the optimal label $x_g^*$. $P(\ell \subseteq x_g^* \mid O)$ establishes an order on subsets of the (unknown) optimal labeling of the global node $x_g^*$ that guides the consideration of global labels. We may not be able to exhaustively consider all labels in the power set $\mathcal{P}(L)$, but at least we consider the most likely candidates for $x_g^*$. And image classification can give us an estimate of this posterior.
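The ranking step can be sketched as follows. This is a simplified stand-in, not the posterior from the slide: it assumes independent per-class probabilities from an image classifier and scores each candidate subset by the probability that it is exactly the set of classes present. The function name, `max_size` cap and example probabilities are hypothetical.

```python
from itertools import combinations
from math import prod

def rank_label_subsets(class_probs, top_k, max_size=3):
    """Return the top_k (score, subset) pairs among subsets of up to max_size classes."""
    classes = sorted(class_probs)
    scored = []
    for size in range(1, max_size + 1):
        for subset in combinations(classes, size):
            chosen = set(subset)
            # Independence assumption: product of per-class presence/absence terms.
            score = prod(class_probs[c] if c in chosen else 1.0 - class_probs[c]
                         for c in classes)
            scored.append((score, subset))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical global classifier outputs for one image.
top = rank_label_subsets({"cow": 0.9, "grass": 0.8, "sheep": 0.1}, top_k=2)
```

Only the highest-ranked subsets are then offered to the global node as candidate labels, which keeps inference tractable without enumerating the whole power set.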
  • 21. PASCAL 2010: pushing the limit. The previous slides describe the approach used for our PASCAL 2009 submission. The discriminative model was based only on SVMs trained to discriminate object classes from their own backgrounds. Starting with the harmony potential approach, this year we concentrated on adding cues derived from different levels of mid-level context. We found the HCRF model with harmony potential to be very useful for performing this fusion. Our hypothesis at the end of the 2009 competition was that detection would be essential for pushing forward the state of the art.
  • 22. PASCAL 2010: fusing across scales.
        1. FG/BG: 20 SVMs trained to discriminate classes from their own background. The same discriminative model used last year, essential for localizing object boundaries.
        2. CLASS: 20 SVMs trained to discriminate each object class from the other object classes. Essential for distinguishing objects with similar backgrounds (e.g. cows from sheep, birds from planes). Incorporated directly into the unary potential.
        3. LOC: 20 class-specific location priors. Computed from ground truth segmentations by simple spatial averaging. A form of top-down mid-level context.
        4. OBJ: 20 class-specific object detectors [Felzenszwalb 2010] are converted to superpixel scores by selecting the highest scoring detection intersecting each pixel of the superpixel. A type of bottom-up mid-level context.
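The OBJ cue can be sketched as below for one class. Several details are assumptions because the slide does not state them: boxes are half-open `(x0, y0, x1, y1, score)` tuples, pixels covered by no detection receive a `floor` score, and the per-pixel maxima are averaged over each superpixel.

```python
def detection_score_map(width, height, detections, floor=-1.0):
    """Per pixel, keep the highest score among the detections covering it."""
    score = [[floor] * width for _ in range(height)]
    for x0, y0, x1, y1, s in detections:
        for y in range(y0, y1):
            for x in range(x0, x1):
                score[y][x] = max(score[y][x], s)
    return score

def superpixel_scores(score_map, superpixels):
    """Average the per-pixel scores inside each superpixel (aggregation is assumed)."""
    sums, counts = {}, {}
    for score_row, label_row in zip(score_map, superpixels):
        for s, sp in zip(score_row, label_row):
            sums[sp] = sums.get(sp, 0.0) + s
            counts[sp] = counts.get(sp, 0) + 1
    return {sp: sums[sp] / counts[sp] for sp in sums}

# Hypothetical 4x4 image: two detections, two superpixels (left and right halves).
score_map = detection_score_map(4, 4, [(0, 0, 2, 4, 0.5), (1, 0, 4, 2, 0.9)])
superpixels = [[0, 0, 1, 1]] * 4
obj_scores = superpixel_scores(score_map, superpixels)
```

The resulting per-superpixel scores would then be passed through the class-specific sigmoid like any other cue.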
  • 23. PASCAL 2010: learning unary potentials. We compute the unary potential by weighting the classification scores $\{s_i(k, x_i)\}_{k \in F}$ through a sigmoid function. The unary potential becomes: $\phi_L(x_i) = -\mu_L K_i \sum_{k \in F} \log \frac{1}{1 + \exp(f_i(k, x_i))}$, with $f_i(k, x_i) = a(k, x_i)\, s_i(k, x_i) + b(k, x_i)$. $\mu_L$ is the weighting factor of the local unary potential, and $K_i$ normalizes over the number of pixels inside the superpixel. We have two sigmoid parameters for each class/cue pair: $a(k, x_i)$ and $b(k, x_i)$.
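A small sketch of this unary term for one superpixel and one candidate label, reading the slide's formula as a sum over cues of the log of a sigmoid. The cue names, parameter values and dictionary layout are hypothetical.

```python
import math

def unary_potential(scores, a, b, mu_l, k_i):
    """phi_L(x_i) = -mu_L * K_i * sum_k log(1 / (1 + exp(a_k * s_k + b_k)))."""
    total = 0.0
    for cue, s in scores.items():
        f = a[cue] * s + b[cue]            # per class/cue affine map of the raw score
        total += math.log(1.0 / (1.0 + math.exp(f)))
    return -mu_l * k_i * total

# Hypothetical cue scores for one superpixel and one class.
a = {"FG/BG": -1.0, "CLASS": -2.0}
b = {"FG/BG": 0.0, "CLASS": 0.5}
low = unary_potential({"FG/BG": 2.0, "CLASS": 1.0}, a, b, mu_l=1.0, k_i=1.0)
high = unary_potential({"FG/BG": -2.0, "CLASS": -1.0}, a, b, mu_l=1.0, k_i=1.0)
```

With negative slopes `a`, confident classifier scores give a lower energy (`low < high`), so the MAP labeling prefers them; the slopes and offsets are what the per-class optimization tunes.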
  • 24. Datasets. We have evaluated the harmony potential approach on two standard, publicly available datasets. The PASCAL VOC 2010 Segmentation Challenge dataset contains 2250 color images of 20 different semantic classes. This set is split into 750 images for training, 750 images for testing, and 750 for validation. The Microsoft MSRC-21 dataset contains 591 color images of 21 object classes. We do our own splits for cross-validation on MSRC-21.
  • 25. Unsupervised segmentation. Images are first over-segmented with quick-shift to derive superpixels [Fulkerson, ICCV 2009]. This preserves object boundaries while simplifying the representation. Working at the superpixel level reduces the number of nodes in the CRF by 10^2 to 10^5 per image.
  • 26. Local classification scores: $P(X_i = x_i \mid O_i)$. We extract patches with 50% overlap on a regular grid at several resolutions (12, 24, 36 and 48 pixels in diameter). Patches are described with SIFT, color and, for MSRC-21, location features. A vocabulary is constructed using k-means to quantize to 1000 SIFT words and 400 color words. An SVM classifier using an intersection kernel is built for each semantic category. A similar number of positive and negative examples are used: around 8,000 superpixel samples in total for MSRC-21, and 20,000 for VOC 2010, for each class.
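The dense sampling described on this slide can be sketched as a grid generator. It assumes square patches whose side equals the stated diameter and a stride of half the patch size for the 50% overlap; the function name and the example image size are illustrative.

```python
def patch_grid(width, height, sizes=(12, 24, 36, 48)):
    """List (x, y, size) for patches on a regular grid with 50% overlap per scale."""
    patches = []
    for size in sizes:
        step = size // 2                   # 50% overlap between neighbouring patches
        for y in range(0, height - size + 1, step):
            for x in range(0, width - size + 1, step):
                patches.append((x, y, size))
    return patches

# Hypothetical 48x48 image.
patches = patch_grid(48, 48)
```

Each patch would then be described (SIFT, color) and quantized against the k-means vocabulary before being pooled per superpixel.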
  • 27. Global potential and general approach. For the PASCAL 2010 dataset we use our entry to the 2010 VOC Classification Challenge [Khan, IJCV2010 (submitted)]. It uses a bag-of-words representation based on SIFT and color SIFT, plus spatial pyramids and color attention [Khan, ICCV 2009]. An SVM classifier with a χ2 kernel is trained for each semantic category in the dataset. The FG/BG and CLASS cues are computed by training a discriminative model using an SVM with a histogram intersection kernel. Except for the additional cues and the optimization strategy, the architecture is the same as our approach described at CVPR [Gonfaus, CVPR2010].
  • 28. Learning the HCRF parameters. We found it essential to train the per-class sigmoid parameters through cross-validation. Classification scores are learned independently, are unbalanced and are effectively incomparable in many cases. The sigmoid functions weight the importance of each cue for each class. In addition to these (180) sigmoid parameters, we must also learn the weighting factors for each potential. We use a stochastic steepest ascent technique to optimize these parameters on a validation set. In each step we randomly generate new instances of parameters, using a Gibbs-like sampling strategy.
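One possible reading of this optimization loop is sketched below. It is an interpretation, not the authors' procedure: "Gibbs-like" is taken to mean that each candidate resamples a single randomly chosen parameter while the others stay fixed, and the best candidate is kept only if it improves the validation score. The step counts, Gaussian proposal and toy objective are assumptions.

```python
import random

def stochastic_ascent(score_fn, init_params, n_steps=200, n_candidates=8,
                      scale=0.5, seed=0):
    """Keep the best of several one-coordinate perturbations whenever it improves."""
    rng = random.Random(seed)
    best = list(init_params)
    best_score = score_fn(best)
    for _ in range(n_steps):
        candidates = []
        for _ in range(n_candidates):
            cand = list(best)
            idx = rng.randrange(len(cand))     # resample one coordinate at a time
            cand[idx] += rng.gauss(0.0, scale)
            candidates.append(cand)
        top = max(candidates, key=score_fn)    # steepest step among the samples
        top_score = score_fn(top)
        if top_score > best_score:
            best, best_score = top, top_score
    return best, best_score

# Toy stand-in for "segmentation accuracy on the validation set".
toy_score = lambda p: -((p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2)
params, final_score = stochastic_ascent(toy_score, [0.0, 0.0])
```

Because a step is accepted only when the score improves, the validation score never decreases; in the real setting `score_fn` would run inference on the validation set for each candidate.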
  • 29. History: PASCAL VOC 2009
        Class          BONN   BROOKES   Harmony potential
        Background     83.9   79.6      80.5
        Aeroplane      64.3   48.3      62.3
        Bicycle        21.8    6.7      24.1
        Bird           21.7   19.1      28.3
        Boat           32.0   10.0      30.5
        Bottle         40.2   16.6      32.7
        Bus            57.3   32.7      42.2
        Car            49.4   38.1      48.1
        Cat            38.8   25.3      22.8
        Chair           5.2    5.5       9.1
        Cow            28.5    9.4      30.1
        Dining Table   22.0   25.1       7.9
        Dog            19.6   13.3      21.5
        Horse          33.6   12.3      41.9
        Motorbike      45.5   35.5      49.6
        Person         33.6   20.7      31.5
        Potted Plant   27.3   13.4      26.1
        Sheep          40.4   17.1      37.0
        Sofa           18.1   18.4      20.1
        Train          33.6   37.5      39.4
        TV/Monitor     46.1   36.4      31.1
        Average        36.3   24.8      34.1
  • 30. Qualitative results: MSRC-21
  • 31. Quantitative results: MSRC-21. MSRC-21 contains more multi-class images than PASCAL. Our performance demonstrates the benefits of incorporating global scale when making local decisions.
  • 32. Qualitative results: PASCAL 2010
  • 33. Quantitative results: PASCAL 2010. FG/BG shows the performance of our baseline (PASCAL 2009) approach. At the top, performance on the validation set (i.e. how well we thought we were doing). Image tags indicate how well the technique can perform with perfect global information.
  • 34. The cost of segmentation. The optimal MAP label configuration $x^*$ is inferred using α-expansion graph cuts [Kolmogorov, PAMI2004]. The global node uses the 100 most probable label subsets obtained from ranked subsampling. [Plot: mAP on MSRC-21 and mAP on PASCAL VOC 2010 versus # labels selected (1 to 200).]
  • 35. Qualitative results: PASCAL 2010 failures. Context is sometimes weighted too much. When the global classifier fails, little can be done.
  • 36. Every little bit helps
  • 37. A photo finish. [Chart: mAP on PASCAL VOC 2010 per cue: FG-BG 33.9, CLASS 23.4, LOC 20.1, OBJ 26.2, FG-BG + CLASS 36.6, All 40.4.] [Plot: mAP on PASCAL VOC 2010 versus #iterations (0 to 3000).] The final results are tough to call between BONN and CVC. In the end, fusion over many scales and per-class, per-feature parameter optimization won.
  • 38. The action recognition taster. Images were collected from Flickr using action queries. A set of nine actions was chosen in the end. They are disjoint from the main challenge dataset. Only a subset of people are annotated (bounding box + action). Each annotated person is labelled with exactly one action class. Important point: we don't have to solve the detection problem. Most action classes in the challenge contain either large variations in scale or large variations in pose (or both).
  • 39. Dataset breakdown
                             train       val         trainval    test
                             img  obj    img  obj    img  obj    img  obj
        Phoning               25   25     25   26     50   51     -    -
        Playinginstrument     27   38     27   38     54   76     -    -
        Reading               25   26     26   27     51   53     -    -
        Ridingbike            25   33     25   33     50   66     -    -
        Ridinghorse           27   35     26   36     53   71     -    -
        Running               26   47     25   47     51   94     -    -
        Takingphoto           25   27     26   28     51   55     -    -
        Usingcomputer         26   29     26   30     52   59     -    -
        Walking               25   41     26   42     51   83     -    -
        Total                226  301    228  307    454  608     -    -
  • 40. Grouplets and poselets. Two state-of-the-art techniques for action recognition in still images: the grouplets of Fei-Fei Li [Yao et al, CVPR2010], and the latent poses of Greg Mori [Yang et al, CVPR2010].
  • 41. Treat it like image classification. Initial experiments confirmed our intuition about the limitations of the data. Structural learning: sampling of pose space not dense enough. Latent SVM: complexity of object interactions problematic. Multiple kernel learning: converges to simple selection. State-of-the-art techniques rely on learning complex structural models of pose variations over many ... From a very early stage, we decided to treat action recognition as an image classification problem. We exploit the small size of the dataset by performing extensive cross-validation.
  • 42. The classification pipeline
  • 43. Action recognition: features. SIFT, color SIFT (normalized RG and opponent), self-similarity, SURF, PHOG (good for capturing pose), and color attention (focuses on interesting color features). Sparse and dense variations of most of these. Plus a range of pyramid configurations (1, 2 × 2, 3 × 3, 4 × 4). Object detectors are also incorporated using a simple occurrence histogram [Felzenszwalb 2010]. The goal was to incorporate all of this into a BoVW classifier and push the limits of what is possible using classical BoW on actions.
  • 44. Action recognition: contextual pyramids. Context was also important for most object classes. We used a type of foreground/background pyramid decomposition that split features into object or background. This was done using a type of spatial soft-assign based on the distance to the boundary of the object. For some classes, we also assigned contextual object regions that model the appearance of objects associated with them (the "horsey box").
  • 45. Action recognition: learning in the design space. In the end, after all of the combinatorics introduced by pyramids and other variations, we had about 100 feature configurations in a big pool. Most attempts to automatically learn the parameters of these features were total failures, except one: initial experiments with multiple kernel learning showed that MKL quickly converges towards class-specific feature selection rather than mixing. With such a small dataset, and a little heuristic trimming, we were able to exhaustively explore a part of the design space. This resulted in the best per-class feature combinations.
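The exhaustive per-class search can be sketched as follows. The `evaluate` callback stands in for cross-validated average precision of a classifier trained on a given feature combination; it, the size cap `max_size` and the toy scores are hypothetical.

```python
from itertools import combinations

def best_feature_combination(features, evaluate, max_size=2):
    """Try every combination of up to max_size features and keep the best score."""
    best_combo, best_score = None, float("-inf")
    for size in range(1, max_size + 1):
        for combo in combinations(features, size):
            score = evaluate(combo)
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score

# Toy evaluator: per-feature gains minus a small penalty for each extra feature.
gains = {"SIFT": 0.5, "PHOG": 0.4, "SURF": 0.05}
toy_eval = lambda combo: sum(gains[f] for f in combo) - 0.1 * (len(combo) - 1)
combo, score = best_feature_combination(["SIFT", "PHOG", "SURF"], toy_eval)
```

Running this once per action class yields a class-specific feature selection, which is only affordable here because the dataset is small and the pool is trimmed first.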
  • 46. Action recognition: classification. We experimented with a number of kernels (histogram intersection, χ2, bin-ratio distance). There wasn't a huge difference among these kernels. In the end, we chose histogram intersection for our submission as it appeared to generalize better. In addition to over-fitting less, there are no parameters to tune and it is very fast.
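For reference, the histogram intersection kernel is simple enough to state in a few lines. This is a plain-Python sketch on hypothetical normalized bag-of-words histograms, not the implementation used for the submission.

```python
def histogram_intersection(h1, h2):
    """K(h1, h2) = sum over bins of min(h1[b], h2[b])."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def kernel_matrix(X, Y):
    """Gram matrix of histogram intersection values between two sets of histograms."""
    return [[histogram_intersection(x, y) for y in Y] for x in X]

# Hypothetical L1-normalized histograms.
X = [[0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
K = kernel_matrix(X, X)
```

A Gram matrix like `K` can be passed to an SVM that accepts precomputed kernels; unlike the χ2 kernel there is no bandwidth parameter to tune, which matches the slide's remark.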
  • 47. Overall results: average precision
  • 48. Per-class AP
  • 49. Per technique median average precision
  • 50. Qualitative results. When the horsey box and detectors fail, context dominates. The classifier is still surprisingly robust.
  • 51. Qualitative results. Some fine discriminations are very difficult to make, probably even for humans.
  • 52. Qualitative results. People taking photos should be banned. Classes with large pose variations were the most difficult.
  • 53. Discussion: semantic image segmentation. The harmony potential works well for fusing global information into local segmentations. This year we showed that the harmony potential framework is also appropriate for incorporating different types of mid-level cues. Ranked sub-sampling, driven by the same posterior used to define the global potential function, renders the optimization problem tractable. It is most useful when multiple semantic classes co-occur frequently. Per-class learning of parameters is essential (about +5% in final results).
  • 54. Discussion: action recognition. This year's taster challenge on action recognition was little more than a toy. However, we have demonstrated what is possible using proven techniques from image classification. We feel that object context, in particular object interaction context, is the way forward. The PASCAL data set is the right direction to go (more general), but we need more samples.
  • 55. The future: segmentation. Semantic image segmentation has come a long way, but still has a long way to go. It is becoming a mainstream event in PASCAL. This year we arrived at a sort of three-way detente between the CVC (winner 2010), BONN (winner 2009) and OXFORD (best paper award ECCV 2010) in segmentation. Each has its own approach, with its own advantages and disadvantages. Engineering can probably maximize results. The field is becoming mature, and we can begin thinking about what new applications are enabled by such technologies.
  • 56. The future: action recognition. It seems that action recognition in still images is a popular challenge. The PASCAL organizers are keen to promote it for the future. The focus will remain on still images, but perhaps with more emphasis on incorporating user interaction as well. It seems that the community is becoming more interested in the "alternative" PASCAL challenges. The multimedia community probably has an important role to play here.