Object Recognition and Scene Understanding
MIT 6.870
student presentation

Template matching and histograms
Nicolas Pinto
Introduction
Hosts
   a guy... (who has big arms)
   Antonio T... (who knows a lot about vision)
   a frog... (who has big eyes and thus should know a lot about vision...)
3 papers
    yey!!
Lowe (1999)

Object Recognition from Local Scale-Invariant Features
David G. Lowe
Computer Science Department, University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
lowe@cs.ubc.ca
Proc. of the International Conference on Computer Vision, Corfu (Sept. 1999)

Abstract

An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images with a computation time of under 2 seconds.

1. Introduction

Object recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion. The features must be at least partially invariant to illumination, 3D projective transforms, and common object variations. On the other hand, the features must also be sufficiently distinctive to identify specific objects among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in finding such image features. However, recent research on the use of dense local features (e.g., Schmid & Mohr [19]) has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.

This paper presents a new method for image feature generation called the Scale Invariant Feature Transform (SIFT). This approach transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal (IT) cortex in primate vision. This paper also describes improved approaches to indexing and model verification.

The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations. This approach is based on a model of the behavior of complex cells in the cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT keys. In the current implementation, each image generates on the order of 1000 SIFT keys, a process that requires less than 1 second of computation time.

The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final estimate of model parameters. When at least 3 keys agree on the model parameters with low residual, there is strong evidence for the presence of the object. Since there may be dozens of SIFT keys in the image of a typical object, it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.

The current object models are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60 degree rotation away from the camera or to allow up to a 20 degree rotation of a 3D object.
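
The first stage described in Lowe's introduction (finding maxima or minima of a difference-of-Gaussian function in scale space) can be illustrated with a rough sketch. This is not Lowe's implementation: it works on a 1D signal instead of an image, and the function names (`gaussian_kernel`, `blur`, `dog_extrema`), the 3-sigma kernel truncation, the border handling, and the scale values in the usage lines are all assumptions made for illustration.

```python
import math


def gaussian_kernel(sigma):
    """Return a normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, sigma):
    """Convolve a 1D signal with a Gaussian, replicating the border samples."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out


def dog_extrema(signal, sigmas):
    """Return (position, level) pairs that are strict extrema of the
    difference-of-Gaussian stack among their neighbours in position and scale."""
    blurred = [blur(signal, s) for s in sigmas]
    # Each DoG level is the difference between two adjacent blur levels.
    dog = [[b - a for a, b in zip(blurred[k], blurred[k + 1])]
           for k in range(len(sigmas) - 1)]
    keys = []
    # Only interior levels and positions have a full set of neighbours.
    for k in range(1, len(dog) - 1):
        for i in range(1, len(signal) - 1):
            value = dog[k][i]
            neighbours = [dog[kk][ii]
                          for kk in (k - 1, k, k + 1)
                          for ii in (i - 1, i, i + 1)
                          if (kk, ii) != (k, i)]
            if value > max(neighbours) or value < min(neighbours):
                keys.append((i, k))
    return keys


# Usage: a small bump centred at index 22 (illustrative scales only).
signal = [0.0] * 20 + [1.0] * 5 + [0.0] * 20
sigmas = [1.6 ** k for k in range(5)]
keys = dog_extrema(signal, sigmas)
```

Comparing each sample with its neighbours in both position and scale is what ties a key location to a characteristic scale; a 2D version would do the same comparison over an image pyramid.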

Dalal and Triggs (2005)

Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
INRIA Rhône-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

Abstract

We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

1 Introduction

Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17,22]. The proposed descriptors are reminiscent of edge orientation histograms [4,5], SIFT descriptors [12] and shape contexts [1], but they are computed on a dense grid of uniformly spaced cells and they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects of various implementation choices on detector performance, taking “pedestrian detection” (the detection of mostly visible ...

We briefly discuss previous work on human detection in §2, give an overview of our method §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.

2 Previous Work

There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth [9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholded gradient magnitudes to build a parts based method containing detectors for faces, heads, and front and side profiles of upper and lower body parts. In contrast, our detector uses a simpler architecture with a single detection window, but appears to give significantly higher performance on pedestrian images.

3 Overview of the Method
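
The stages named in the Dalal and Triggs abstract (gradients, orientation binning, spatial binning into cells, contrast normalization in overlapping blocks) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the function names (`cell_histograms`, `block_descriptor`), the cell size, the number of bins, the block size, the hard (non-interpolated) bin assignment, and the plain L2 normalization are all assumptions chosen to keep the example short.

```python
import math


def cell_histograms(image, cell=4, bins=9):
    """Magnitude-weighted histograms of unsigned gradient orientation,
    one histogram per cell of `cell` x `cell` pixels."""
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            # Centered differences, replicating border pixels.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(angle / (180.0 / bins)), bins - 1)
            hists[y // cell][x // cell][b] += magnitude
    return hists


def block_descriptor(hists, block=2, eps=1e-6):
    """Concatenate L2-normalized blocks of cell histograms.
    Blocks advance by one cell, so neighbouring blocks overlap."""
    rows, cols = len(hists), len(hists[0])
    descriptor = []
    for r in range(rows - block + 1):
        for c in range(cols - block + 1):
            v = [value
                 for dr in range(block)
                 for dc in range(block)
                 for value in hists[r + dr][c + dc]]
            norm = math.sqrt(sum(value * value for value in v) + eps * eps)
            descriptor.extend(value / norm for value in v)
    return descriptor


# Usage: an 8x8 image with a vertical edge between columns 3 and 4.
image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
descriptor = block_descriptor(cell_histograms(image))
```

Because each cell appears in several blocks, its histogram contributes to the final vector several times, each time normalized against a different neighbourhood; this is the "overlapping" contrast normalization the abstract refers to.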
Felzenszwalb et al.

A Discriminatively Trained, Multiscale, Deformable Part Model
Pedro Felzenszwalb, University of Chicago, pff@cs.uchicago.edu
David McAllester, Toyota Technological Institute at Chicago, mcallester@tti-c.org
Deva Ramanan, UC Irvine, dramanan@ics.uci.edu

                 (2008)
                               theyThis paper presents aAbstract robust feature set that
                                                             need is a
                               allows the human form to be discriminated cleanly, even in
                                eration called the Scale Invariant Feature Transform (SIFT).     al [19] build anprojection of planar shapesby incorporating
                                                                                                     perspective articulated body detectornd at up to a 60 degree
                                                                                                                                      st
                                                               difficult illumination. collection SVM based away classifierscamera or to allow up to a 20 degree
                               cluttered backgrounds underan image into a largeWe study
                                This approach describes a discriminatively trained, multi-           rotation limb from the over 1 and 2 order Gaussian
                                   This paper transforms
                               the issue of feature sets foreach of which is invariant to image filters in a dynamic programming framework similar to those
                                of local feature vectors, human detection, showing that lo-
                                scale, deformable part model for object detection. Our sys- of Felzenszwalb3D object.
                               cally normalized Histogram of Oriented Gradient (HOG) de-
                                                                                                     rotation of a
                                                                                                                    & Huttenlocher [3] and Ioffe & Forsyth
                               scriptors providetwo-fold improvement relative to other ex-
                                tem achieves a excellent performance in average precision [9]. Mikolajczyk et al [16] use combinations of orientation-
                               isting thethe International Conference 2006 PASCAL person de- position histograms with binary-thresholded gradient magni-
                                over feature sets including wavelets [17,22]. The proposed 1
                                Proc. of best performance in the on
                                tection challenge. It also outperforms the best results in the tudes to build a parts based method containing detectors for
                               descriptorsVision, Corfu (Sept. 1999) orientation histograms
                                Computer are reminiscent of edge
                               [4,5], SIFT descriptors [12]of twenty categories. The system faces, heads, and front and side profiles of upper and lower
                                2007 challenge in ten out and shape contexts [1], but they
    yey!!                      are computed on adeformableof uniformly spaced cells and
                                relies heavily on dense grid parts. While deformable part body parts. In contrast, our detector uses a simpler archi-
                                models overlapping quite contrast their value had not been tecture with a single detection obtained with the person model. The
                                                                                                     Figure 1. Example detection window, but appears to give
                               they usehave become local popular, normalizations for im-
                                                                                                     model is defined by a coarse template, several higher resolution
                               proved performance. We make a detailedsuch as the PASCAL significantly higher performance on for the location of each part.
                                demonstrated on difficult benchmarks study of the effects
                                                                                                     part templates and a spatial model
                                                                                                                                         pedestrian images.
                                challenge. Our system also relies heavily on new methods
                               of various implementation choices on detector performance,
                                for discriminative training. We detection of mostly visible 3 Overview of the Method
                               taking “pedestrian detection” (thecombine a margin-sensitive
Lowe (1999)
Object Recognition from Local Scale-Invariant Features
David G. Lowe
Computer Science Department, University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
lowe@cs.ubc.ca
Proc. of the International Conference on Computer Vision, Corfu (Sept. 1999)

Dalal and Triggs (2005)
Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
INRIA Rhône-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

Felzenszwalb et al. (2008)
A Discriminatively Trained, Multiscale, Deformable Part Model
Pedro Felzenszwalb (University of Chicago), pff@cs.uchicago.edu
David McAllester (Toyota Technological Institute at Chicago), mcallester@tti-c.org
Deva Ramanan (UC Irvine), dramanan@ics.uci.edu
Scale-Invariant Feature Transform (SIFT)

                      adapted from Kucuktunc
                      adapted from Brown, ICCV 2003
SIFT local features are
invariant...




                  adapted from David Lee
like me they are robust...



... to changes in illumination,
noise, viewpoint, occlusion, etc.
I am sure you want to know
        how to build them

1. find interest points or “keypoints”

2. find their dominant orientation

3. compute their descriptor

4. match them on other images
1. find interest points or “keypoints”
keypoints are taken as maxima/minima
of a DoG pyramid

                  in this setting, the extrema are invariant to scale...
a DoG (Difference of Gaussians) pyramid
is simple to compute...   even he can do it!




    before            after




              adapted from Pallus and Fleishman
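To make "simple to compute" concrete, here is a minimal sketch of one octave of a DoG stack: blur the image at increasing sigmas and subtract neighboring levels. It uses plain Python lists to stay dependency-free. The names (`dog_stack`, `gaussian_blur`) and the defaults `sigma0=1.6`, `k=sqrt(2)`, `levels=4` are assumptions made for illustration, and a real pyramid would also downsample between octaves.

```python
import math

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, truncated at 3 sigma and normalized to sum to 1."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [sum(k * row[min(max(x + i - radius, 0), width - 1)]
             for i, k in enumerate(kernel))
         for x in range(width)]
        for row in image
    ]

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, sigma):
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))

def dog_stack(image, sigma0=1.6, k=2 ** 0.5, levels=4):
    """Blur at sigma0 * k**i, then subtract adjacent levels (one octave only).

    sigma0, k and levels are illustrative assumptions, not values from the slides.
    """
    blurred = [gaussian_blur(image, sigma0 * k ** i) for i in range(levels)]
    return [
        [[b - a for a, b in zip(row_lo, row_hi)]
         for row_lo, row_hi in zip(lo, hi)]
        for lo, hi in zip(blurred, blurred[1:])
    ]
```

With `levels=4` the function returns three difference images, indexed as `dog[scale][row][col]`.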
then we just have to find
neighborhood extrema
in this 3D DoG space

                           if a pixel is an extremum
                           in its neighboring region
                           it becomes a candidate
                           keypoint
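The neighborhood test can be sketched as follows: a sample in the DoG stack is a candidate if it is strictly larger or strictly smaller than its 26 neighbors (8 at the same scale, 9 at the scale above, 9 at the scale below). The layout `dog[scale][row][col]` and the function names are assumptions for this sketch.

```python
def is_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly above or strictly below all 26 neighbors."""
    center = dog[s][y][x]
    neighbors = [
        dog[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)
    ]
    return center > max(neighbors) or center < min(neighbors)

def find_extrema(dog):
    """Scan interior positions of a [scale][row][col] stack; return (s, y, x) candidates."""
    scales, height, width = len(dog), len(dog[0]), len(dog[0][0])
    return [
        (s, y, x)
        for s in range(1, scales - 1)
        for y in range(1, height - 1)
        for x in range(1, width - 1)
        if is_extremum(dog, s, y, x)
    ]
```

Only interior scales are scanned, so a stack of three DoG images yields candidates from the middle one only.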
too many
keypoints?




1. remove
low-contrast keypoints

2. remove
keypoints lying on edges




               adapted from wikipedia
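A rough sketch of the two pruning steps, assuming a DoG stack indexed as `dog[scale][row][col]`. The contrast test drops weak responses; the edge test compares the two principal curvatures through the trace and determinant of a finite-difference 2x2 Hessian. The defaults `threshold=0.03` and `r=10.0` are commonly quoted values and are assumptions here, not numbers taken from the slides.

```python
def passes_contrast(dog, s, y, x, threshold=0.03):
    """Keep a candidate only if its DoG response is strong enough (threshold is assumed)."""
    return abs(dog[s][y][x]) >= threshold

def passes_edge_test(dog, s, y, x, r=10.0):
    """Reject edge-like points: one large and one small curvature gives a large ratio."""
    d = dog[s]
    dxx = d[y][x + 1] - 2.0 * d[y][x] + d[y][x - 1]
    dyy = d[y + 1][x] - 2.0 * d[y][x] + d[y - 1][x]
    dxy = (d[y + 1][x + 1] - d[y + 1][x - 1]
           - d[y - 1][x + 1] + d[y - 1][x - 1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:  # flat direction or curvatures of opposite sign: not a blob
        return False
    return trace * trace / det < (r + 1.0) ** 2 / r

def filter_keypoints(dog, candidates, threshold=0.03, r=10.0):
    """Apply both tests to a list of (s, y, x) candidates."""
    return [
        (s, y, x)
        for s, y, x in candidates
        if passes_contrast(dog, s, y, x, threshold)
        and passes_edge_test(dog, s, y, x, r)
    ]
```

A round blob has similar curvature in both directions and passes; a ridge has nearly zero curvature along its length and is rejected.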
2. find their dominant orientation
each selected keypoint is
assigned one or more
“dominant” orientations...

... this step is important to
achieve rotation invariance
How?
using the DoG pyramid to achieve
scale invariance:

a. compute image gradient
magnitude and orientation

b. build an orientation histogram

c. keypoint’s orientation(s) = peak(s)
a. compute image gradient
magnitude and orientation
b. build an orientation histogram




                                    adapted from Ofir Pele
c. keypoint’s orientation(s) = peak(s)


                            *




                                   * the peak ;-)
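Steps a, b and c can be sketched in a few lines: central differences give gradient magnitude and orientation, each pixel votes into an orientation histogram with its magnitude, and the peaks are the keypoint's orientation(s). This is a simplified sketch: it skips the Gaussian weighting of votes and the interpolation of peak positions, and the defaults `radius=4`, `bins=36` and `peak_ratio=0.8` are assumptions chosen for illustration.

```python
import math

def gradient(image, y, x):
    """Central-difference gradient: (magnitude, orientation in radians, [0, 2*pi))."""
    dx = image[y][x + 1] - image[y][x - 1]
    dy = image[y + 1][x] - image[y - 1][x]
    return math.hypot(dx, dy), math.atan2(dy, dx) % (2.0 * math.pi)

def orientation_histogram(image, cy, cx, radius=4, bins=36):
    """Magnitude-weighted histogram of gradient orientations around (cy, cx)."""
    hist = [0.0] * bins
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            # Skip pixels whose central difference would fall outside the image.
            if 1 <= y < len(image) - 1 and 1 <= x < len(image[0]) - 1:
                magnitude, angle = gradient(image, y, x)
                hist[int(angle / (2.0 * math.pi) * bins) % bins] += magnitude
    return hist

def dominant_orientations(hist, peak_ratio=0.8):
    """Bin centers (degrees) of local peaks within peak_ratio of the highest bin."""
    bins = len(hist)
    top = max(hist)
    if top == 0:
        return []
    width = 360.0 / bins
    return [
        (i + 0.5) * width
        for i, value in enumerate(hist)
        if value >= peak_ratio * top
        and value > hist[i - 1]
        and value > hist[(i + 1) % bins]
    ]
```

Because more than one peak can pass the `peak_ratio` test, a single location can produce several keypoints with different orientations, which matches the "one or more" wording on the slide.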
3. compute their descriptor
SIFT descriptor
= a set of orientation histograms

   16x16 neighborhood of pixel gradients
   4x4 array x 8 bins = 128 dimensions (normalized)
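The 16x16 to 128-dimension bookkeeping can be sketched like this: split the patch into a 4x4 grid of cells, build an 8-bin orientation histogram per cell, concatenate, and normalize to unit length. The function name and input layout are assumptions, and the sketch leaves out refinements such as Gaussian weighting and interpolation between bins.

```python
import math

def sift_like_descriptor(magnitudes, angles, cells=4, bins=8):
    """Build a cells*cells*bins vector from a square patch of gradients.

    magnitudes, angles: 16x16 nested lists (angles in radians, assumed to be
    measured relative to the keypoint's dominant orientation already).
    """
    size = len(magnitudes)   # 16 pixels per side
    cell = size // cells     # 4 pixels per cell side
    desc = [0.0] * (cells * cells * bins)
    for y in range(size):
        for x in range(size):
            angle = angles[y][x] % (2.0 * math.pi)
            b = int(angle / (2.0 * math.pi) * bins) % bins
            index = ((y // cell) * cells + (x // cell)) * bins + b
            desc[index] += magnitudes[y][x]
    norm = math.sqrt(sum(v * v for v in desc))
    return [v / norm for v in desc] if norm > 0 else desc
```

For a 16x16 patch this returns 4 x 4 x 8 = 128 values with unit Euclidean norm.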
4. match them on other images
How to match?

nearest neighbor
Hough transform voting
least-squares fit
etc.
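The first option, nearest neighbor, is the easiest to sketch: for every descriptor in one image, pick the closest descriptor in the other. The brute-force search below is only an illustration (the names and the optional `max_distance` cutoff are assumptions); Hough voting and the least-squares fit would then check that the matches agree on a single object pose.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length descriptors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest_neighbor_matches(query, reference, max_distance=None):
    """Return (query_index, reference_index, squared_distance) for each query descriptor.

    max_distance is an optional cutoff on the squared distance (an assumption
    of this sketch) that discards matches that are too far apart.
    """
    matches = []
    for qi, q in enumerate(query):
        ri, dist = min(
            ((i, squared_distance(q, r)) for i, r in enumerate(reference)),
            key=lambda item: item[1],
        )
        if max_distance is None or dist <= max_distance:
            matches.append((qi, ri, dist))
    return matches
```

Brute force costs one distance computation per pair of descriptors, which is why indexing structures are used when there are many keys.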
SIFT is great!




 invariant to scale and rotation, partially invariant to affine transformations

 easy to understand

 fast to compute
Extension example:
Spatial Pyramid Matching using SIFT
            Beyond Bags of Features: Spatial Pyramid Matching
                for Recognizing Natural Scene Categories

    Svetlana Lazebnik (1), Cordelia Schmid (2), Jean Ponce (1,3)
    slazebni@uiuc.edu, Cordelia.Schmid@inrialpes.fr, ponce@cs.uiuc.edu
    (1) Beckman Institute, University of Illinois
    (2) INRIA Rhône-Alpes, Montbonnot, France
    (3) Ecole Normale Supérieure, Paris, France

                                                                                CVPR 2006
Lowe (1999): Object Recognition from Local Scale-Invariant Features
Dalal and Triggs (2005): Histograms of Oriented Gradients for Human Detection
Felzenszwalb et al. (2008): A Discriminatively Trained, Multiscale, Deformable Part Model
Histograms of Oriented Gradients for Human Detection
                                      Navneet Dalal and Bill Triggs
                   INRIA Rhône-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
                      {Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr


                         Abstract                                    We briefly discuss previous work on human detection in
   We study the question of feature sets for robust visual ob-    §2, give an overview of our method §3, describe our data
ject recognition, adopting linear SVM based human detec-          sets in §4 and give a detailed description and experimental
tion as a test case. After reviewing existing edge and gra-       evaluation of each stage of the process in §5–6. The main
dient based descriptors, we show experimentally that grids        conclusions are summarized in §7.
of Histograms of Oriented Gradient (HOG) descriptors sig-         2   Previous Work
nificantly outperform existing feature sets for human detec-
tion. We study the influence of each stage of the computation          There is an extensive literature on object detection, but
on performance, concluding that fine-scale gradients, fine          here we mention just a few relevant papers on human detec-
orientation binning, relatively coarse spatial binning, and       tion [18,17,22,16,20]. See [6] for a survey. Papageorgiou et
high-quality local contrast normalization in overlapping de-      al [18] describe a pedestrian detector based on a polynomial
scriptor blocks are all important for good results. The new       SVM using rectified Haar wavelets as input descriptors, with
approach gives near-perfect separation on the original MIT        a parts (subwindow) based variant in [17]. Depoortere et al
pedestrian database, so we introduce a more challenging           give an optimized version of this [2]. Gavrila & Philomen
dataset containing over 1800 annotated human images with          [8] take a more direct approach, extracting edge images and
a large range of pose variations and backgrounds.                 matching them to a set of learned exemplars using chamfer
                                                                  distance. This has been used in a practical real-time pedes-
1 Introduction                                                    trian detection system [7]. Viola et al [22] build an efficient
    Detecting humans in images is a challenging task owing        moving person detector, using AdaBoost to train a chain of
to their variable appearance and the wide range of poses that     progressively more complex region rejection rules based on
they can adopt. The first need is a robust feature set that        Haar-like wavelets and space-time differences. Ronfard et
                                 first of all, let me put this paper in
allows the human form to be discriminated cleanly, even in
cluttered backgrounds under difficult illumination. We study
                                                                  al [19] build an articulated body detector by incorporating
                                                                  SVM based limb classifiers over 1st and 2nd order Gaussian
                                 context
the issue of feature sets for human detection, showing that lo-
cally normalized Histogram of Oriented Gradient (HOG) de-
                                                                  filters in a dynamic programming framework similar to those
                                                                  of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth
histograms of local image measurements have been quite successful

• Swain & Ballard 1991 - Color Histograms
• Schiele & Crowley 1996 - Receptive Fields Histograms
• Lowe 1999 - SIFT
• Schneiderman & Kanade 2000 - Localized Histograms of Wavelets
• Leung & Malik 2001 - Texton Histograms
• Belongie et al. 2002 - Shape Context
• Dalal & Triggs 2005 - Dense Orientation Histograms
• ...
features: tons of “feature sets” have been proposed

• Gavrila & Philomen 1999 - Edge Templates + Nearest Neighbor
• Papageorgiou & Poggio 2000, Mohan et al. 2001, DePoortere et al. 2002 - Haar Wavelets + SVM
• Viola & Jones 2001 - Rectangular Differential Features + AdaBoost
• Mikolajczyk et al. 2004 - Parts Based Histograms + AdaBoost
• Ke & Sukthankar 2004 - PCA-SIFT
• ...
difficult! localizing humans in images is a challenging task...

• Wide variety of articulated poses
• Variable appearance/clothing
• Complex backgrounds
• Unconstrained illuminations
• Occlusions
• Different scales
• ...
Approach

• robust feature set (HOG)
• simple classifier (linear SVM)
• fast detection (sliding window)
adapted from Bill Triggs
• Gamma normalization
• Space: RGB, LAB or Gray
• Method: SQRT or LOG
• Filtering with simple masks
• Masks compared: centered *, uncentered, cubic-corrected, diagonal, Sobel

* centered performs the best
remember SIFT ?




...after filtering, each “pixel” represents
an oriented gradient...
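A minimal sketch of the gamma normalization and filtering steps, assuming a grayscale image stored as a list of rows, the SQRT method, and the centered [-1, 0, 1] mask that the slide marks as the best performer. The function names, the clamped border handling, and the unsigned 0-180 degree orientation range are illustrative assumptions, not details taken from the slides.

```python
import math

def sqrt_gamma(image):
    """Square-root gamma compression of a grayscale image (list of rows)."""
    return [[math.sqrt(v) for v in row] for row in image]

def centered_gradients(image):
    """Apply the centered [-1, 0, 1] mask along x and y.

    Returns (magnitude, orientation). Orientation is unsigned, folded into
    0-180 degrees. Indices are clamped at the image border (an assumption
    made here for simplicity).
    """
    h, w = len(image), len(image[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang

# A vertical step edge: dark on the left, bright on the right.
img = [[0.0, 0.0, 16.0, 16.0] for _ in range(4)]
mag, ang = centered_gradients(sqrt_gamma(img))
```

On this toy edge the pixels next to the step get a horizontal gradient, while flat regions get zero magnitude.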
...pixels are regrouped in “cells”,
they cast a weighted vote for an
orientation histogram...




           HOG (Histogram of Oriented Gradients)
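The voting step can be sketched as follows. This assumes 8x8-pixel cells and 9 unsigned orientation bins (values reported in the Dalal & Triggs paper, not stated on the slide) and uses hard binning with the gradient magnitude as the vote weight. The paper interpolates votes between neighbouring bins, which is left out here to keep the example short. The function name is hypothetical.

```python
def cell_histograms(mag, ang, cell=8, bins=9):
    """Magnitude-weighted orientation histograms, one per cell.

    mag, ang: 2-D lists of equal size; ang in degrees, 0 <= ang < 180.
    Returns a grid hist[cy][cx] of `bins`-long lists. Pixels outside the
    last full cell are ignored.
    """
    h, w = len(mag), len(mag[0])
    rows, cols = h // cell, w // cell
    bin_width = 180.0 / bins
    hist = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            b = int(ang[y][x] // bin_width) % bins
            hist[y // cell][x // cell][b] += mag[y][x]
    return hist

# Toy input: one 8x8 cell, every pixel has magnitude 1 and orientation 90 degrees.
mag = [[1.0] * 8 for _ in range(8)]
ang = [[90.0] * 8 for _ in range(8)]
hist = cell_histograms(mag, ang)
```

All 64 votes land in the single bin that covers 80-100 degrees.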
a window can be
represented like
that
then, cells are locally normalized
using overlapping “blocks”
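A rough sketch of the overlapping-block idea, assuming 2x2-cell blocks that slide one cell at a time and a plain L2 normalization of each concatenated block vector. The block size, stride, and epsilon are assumptions for illustration.

```python
import math

def normalize_blocks(hist, block=2, eps=1e-6):
    """Group cells into overlapping block x block neighbourhoods (stride of
    one cell) and L2-normalize each concatenated block vector."""
    rows, cols = len(hist), len(hist[0])
    out = []
    for by in range(rows - block + 1):
        for bx in range(cols - block + 1):
            v = [val
                 for cy in range(by, by + block)
                 for cx in range(bx, bx + block)
                 for val in hist[cy][cx]]
            norm = math.sqrt(sum(val * val for val in v) + eps * eps)
            out.append([val / norm for val in v])
    return out

# 3x3 grid of cells with 2 bins each -> 4 overlapping 2x2 blocks of length 8.
hist = [[[1.0, 0.0] for _ in range(3)] for _ in range(3)]
blocks = normalize_blocks(hist)
```

Because the blocks overlap, each interior cell contributes to several block vectors, each time normalized against a different neighbourhood.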
they used two types of blocks

• rectangular: similar to SIFT (but dense)
• circular: similar to Shape Context
and four different types of block
normalization
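The slide does not name the four schemes. The sketch below assumes they are the four evaluated in the Dalal & Triggs paper (L2-norm, L2-Hys, L1-norm, L1-sqrt); the epsilon value and the function names are illustrative, and the inputs are assumed to be non-negative histogram entries.

```python
import math

EPS = 1e-6  # small constant to avoid division by zero (assumed value)

def l2_norm(v):
    """Divide by the Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v) + EPS ** 2)
    return [x / n for x in v]

def l2_hys(v, clip=0.2):
    """L2-norm, clip each component at `clip`, then renormalize."""
    clipped = [min(x, clip) for x in l2_norm(v)]
    return l2_norm(clipped)

def l1_norm(v):
    """Divide by the sum of absolute values."""
    n = sum(abs(x) for x in v) + EPS
    return [x / n for x in v]

def l1_sqrt(v):
    """L1-norm followed by a square root (assumes non-negative entries)."""
    return [math.sqrt(x) for x in l1_norm(v)]

v = [3.0, 4.0]
```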
like SIFT, they gain invariance to illumination changes, small deformations, etc.
finally, a sliding window is
classified by a simple linear SVM
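A toy sketch of sliding-window classification with a linear decision function, w . x + b. The feature map here is a small grid of scalars standing in for HOG features; the weights, bias, and function names are made up for illustration.

```python
def linear_svm_score(weights, bias, features):
    """Decision value of a linear SVM: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def sliding_window_detect(feature_map, win_h, win_w, weights, bias, threshold=0.0):
    """Score every window position on a 2-D grid of scalar features.

    Each window is flattened row by row before scoring.
    Returns a list of (row, col, score) for windows scoring above threshold.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    detections = []
    for r in range(rows - win_h + 1):
        for c in range(cols - win_w + 1):
            window = [feature_map[r + i][c + j]
                      for i in range(win_h) for j in range(win_w)]
            score = linear_svm_score(weights, bias, window)
            if score > threshold:
                detections.append((r, c, score))
    return detections

# Toy example: the "template" responds to a bright 2x2 patch.
fmap = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
dets = sliding_window_detect(fmap, 2, 2, weights=[1, 1, 1, 1], bias=-3.5)
```

Only the window that fully covers the bright patch scores above zero. A real detector would also scan over scales and merge overlapping detections, which this sketch omits.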
during the learning phase, the
algorithm “looked” for hard examples

      Training




                 adapted from Martial Hebert
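The search for hard examples can be sketched as a two-round loop: train, collect the negatives that the current model wrongly accepts, add them to the training set, and retrain. The tiny SGD trainer below is only a stand-in for a real SVM solver, and the data, learning rate, and function names are all hypothetical.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Tiny SGD trainer for a linear SVM (hinge loss + L2 penalty)."""
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1.0 - lr * reg) for wi in w]   # L2 shrinkage
            if margin < 1.0:                           # hinge loss is active
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mine_hard_negatives(w, b, negative_pool):
    """Negatives that the current model wrongly scores as positive."""
    return [x for x in negative_pool if score(w, b, x) > 0.0]

positives = [[2.0, 2.0], [3.0, 2.5]]
initial_negatives = [[-2.0, -2.0], [-3.0, -1.0]]
negative_pool = initial_negatives + [[1.0, 0.2], [0.5, 0.5], [-1.0, 0.0]]

# Round 1: train on the initial set.
X = positives + initial_negatives
y = [1] * len(positives) + [-1] * len(initial_negatives)
w, b = train_linear_svm(X, y)

# Round 2: add the false positives found in the pool and retrain.
hard = mine_hard_negatives(w, b, negative_pool)
w2, b2 = train_linear_svm(X + hard, y + [-1] * len(hard))
```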
average gradients




positive weights                       negative weights
Example




          adapted from Bill Triggs
Example




          adapted from Martial Hebert
Further
Development

 • Detection on Pascal VOC (2006)
 • Human Detection in Movies (ECCV 2006)
 • US Patent by MERL (2006)
 • Stereo Vision HoG (ICVES 2008)
Extension example:
Pyramid HoG++
A simple demo...




               VIDEO HERE
so, it doesn’t work ?!?



          no no, it works...



    ...it just doesn’t work well...
Lowe (1999)
Object Recognition from Local Scale-Invariant Features
David G. Lowe
Computer Science Department, University of British Columbia, Vancouver, B.C., V6T 1Z4, Canada
lowe@cs.ubc.ca

Dalal and Triggs (2005)
Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
INRIA Rhône-Alps, 655 avenue de l’Europe, Montbonnot 38334, France

Felzenszwalb et al. (2008)
A Discriminatively Trained, Multiscale, Deformable Part Model
Pedro Felzenszwalb, University of Chicago, pff@cs.uchicago.edu
David McAllester, Toyota Technological Institute at Chicago, mcallester@tti-c.org
Deva Ramanan, UC Irvine, dramanan@ics.uci.edu

Abstract
This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive [...]

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.
This paper describes one
of the best algorithms in
object detection...
They used the following methods:

• HOG Features: introduced by Dalal & Triggs (2005)
• Deformable Part Model: introduced by Fischler & Elschlager (1973)
• Latent SVM: introduced by the authors
Model Overview
detection, root filter, part filters, deformation models

HOG Features

      // 8x8 pixel blocks window

      // features computed at different
      resolutions (pyramid)

HOG Pyramid
Deformable Part Model

      // each part is a local property

      // springs capture spatial relationships

      // here, the springs can be “negative”
Deformable Part Model

detection score =
sum of filter responses - deformation cost

[figure panels: root filter, part filters, deformable model]
Deformable Part Model

score of a placement

[equation labels: filters; feature vector (at position p in the pyramid H); coefficients of a quadratic function on the placement; position relative to the root location]
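A minimal sketch of the placement score as the slides describe it: the sum of the filter responses minus a deformation cost that is a quadratic function of each part's displacement. The argument names, the coefficient layout, and the numbers are assumptions for illustration, and the filter responses are taken as given rather than computed from a HOG pyramid.

```python
def placement_score(root_response, part_responses, displacements, deform_coeffs):
    """Score of one placement: filter responses minus quadratic deformation cost.

    root_response: response of the root filter at its location.
    part_responses: one filter response per part, at that part's position.
    displacements: (dx, dy) of each part relative to its anchor.
    deform_coeffs: (a_x, a_y, b_x, b_y) per part; the cost is
        a_x*dx + a_y*dy + b_x*dx**2 + b_y*dy**2.
    """
    total = root_response
    for resp, (dx, dy), (ax, ay, bx, by) in zip(part_responses, displacements, deform_coeffs):
        cost = ax * dx + ay * dy + bx * dx * dx + by * dy * dy
        total += resp - cost
    return total

score = placement_score(
    root_response=1.0,
    part_responses=[0.5, 0.8],
    displacements=[(0, 0), (2, -1)],
    deform_coeffs=[(0.0, 0.0, 0.1, 0.1), (0.0, 0.0, 0.1, 0.1)],
)
```

The first part sits on its anchor and pays no cost; the second is displaced by (2, -1) and pays 0.5, so the total is about 1.8.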
Latent SVM

[equation labels: filters and deformation parameters; features; part displacements]
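One way to read the three labels: the model parameters (filters and deformation parameters) are dotted with features that depend on latent part displacements, and the score of an example is the maximum over those latent values. The sketch below assumes that form, plus the standard hinge loss; the candidate placements and feature values are hypothetical.

```python
def latent_score(beta, candidate_features):
    """Latent SVM score: maximize beta . Phi(x, z) over latent values z.

    candidate_features maps each latent value z (e.g. a part placement)
    to its feature vector Phi(x, z). Returns (best score, best z).
    """
    best_z, best = None, float("-inf")
    for z, phi in candidate_features.items():
        s = sum(b * f for b, f in zip(beta, phi))
        if s > best:
            best_z, best = z, s
    return best, best_z

def hinge_loss(beta, candidate_features, label):
    """max(0, 1 - y * f(x)) with f(x) the latent score above."""
    score, _ = latent_score(beta, candidate_features)
    return max(0.0, 1.0 - label * score)

beta = [1.0, -0.5]
candidates = {
    "part_left": [0.2, 0.4],    # hypothetical placements and features
    "part_center": [1.5, 0.2],
    "part_right": [0.9, 1.0],
}
score, best_z = latent_score(beta, candidates)
```

During training the latent values of the positive examples are not annotated, which is why the score has to search over them.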
Bonus

// Data Mining Hard Negatives

// Model Initialization
Results

Pascal VOC 2006

Results

Models learned
Experiments

                ~ Dalal’s model
                ~ Dalal’s + LSVM
Examples

            errors
A simple demo...
Conclusions

      so, it doesn’t work ?!?

      no no, it works...

      ...it just doesn’t work well...

      ...or there is a problem with the
      seat-computer interface...
Conclusion
Mit6870 template matching and histograms

Contenu connexe

Tendances

A Textural Approach to Palmprint Identification
A Textural Approach to Palmprint IdentificationA Textural Approach to Palmprint Identification
A Textural Approach to Palmprint IdentificationIJASCSE
 
A Simple Robust Digital Image Watermarking against Salt and Pepper Noise usin...
A Simple Robust Digital Image Watermarking against Salt and Pepper Noise usin...A Simple Robust Digital Image Watermarking against Salt and Pepper Noise usin...
A Simple Robust Digital Image Watermarking against Salt and Pepper Noise usin...IDES Editor
 
object recognition for robots
object recognition for robotsobject recognition for robots
object recognition for robotss1240148
 
A NOVAL ARTECHTURE FOR 3D MODEL IN VIRTUAL COMMUNITIES FROM FACE DETECTION
A NOVAL ARTECHTURE FOR 3D MODEL IN VIRTUAL COMMUNITIES FROM FACE DETECTIONA NOVAL ARTECHTURE FOR 3D MODEL IN VIRTUAL COMMUNITIES FROM FACE DETECTION
A NOVAL ARTECHTURE FOR 3D MODEL IN VIRTUAL COMMUNITIES FROM FACE DETECTIONIJASCSE
 
Retrieving Informations from Satellite Images by Detecting and Removing Shadow
Retrieving Informations from Satellite Images by Detecting and Removing ShadowRetrieving Informations from Satellite Images by Detecting and Removing Shadow
Retrieving Informations from Satellite Images by Detecting and Removing ShadowIJTET Journal
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionzukun
 
Extreme Spatio-Temporal Data Analysis
Extreme Spatio-Temporal Data AnalysisExtreme Spatio-Temporal Data Analysis
Extreme Spatio-Temporal Data AnalysisJoel Saltz
 
Keynote taiwan
Keynote taiwanKeynote taiwan
Keynote taiwanArif Altun
 
ICCV2009: MAP Inference in Discrete Models: Part 1: Introduction
ICCV2009: MAP Inference in Discrete Models: Part 1: IntroductionICCV2009: MAP Inference in Discrete Models: Part 1: Introduction
ICCV2009: MAP Inference in Discrete Models: Part 1: Introductionzukun
 
Design of Shadow Detection and Removal System
Design of Shadow Detection and Removal SystemDesign of Shadow Detection and Removal System
Design of Shadow Detection and Removal Systemijsrd.com
 
Image semantic coding using OTB
Image semantic coding using OTBImage semantic coding using OTB
Image semantic coding using OTBmelaneum
 
ACM ICMI Workshop 2012
ACM ICMI Workshop 2012ACM ICMI Workshop 2012
ACM ICMI Workshop 2012Lê Anh
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 

Mit6870 template matching and histograms

  • 1. MIT 6.870 Object Recognition and Scene Understanding: student presentation
  • 2. 6.870 Template matching and histograms Nicolas Pinto
  • 8. Hosts: a guy... (who has big arms); Antonio T... (who knows a lot about vision); a frog... (who has big eyes and thus should know a lot about vision...)
  • 9. 3 papers, yey!!
  • 10. Lowe (1999): Object Recognition from Local Scale-Invariant Features. David G. Lowe, Computer Science Department, University of British Columbia, Vancouver, B.C., V6T 1Z4, Canada, lowe@cs.ubc.ca. Proc. of the International Conference on Computer Vision, Corfu (Sept. 1999).
    Abstract: An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images with a computation time of under 2 seconds.
    1. Introduction: Object recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion. The features must be at least partially invariant to illumination, 3D projective transforms, and common object variations. On the other hand, the features must also be sufficiently distinctive to identify specific objects among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in finding such image features. However, recent research on the use of dense local features (e.g., Schmid & Mohr [19]) has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.
    This paper presents a new method for image feature generation called the Scale Invariant Feature Transform (SIFT). This approach transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal (IT) cortex in primate vision. This paper also describes improved approaches to indexing and model verification.
    The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations. This approach is based on a model of the behavior of complex cells in the cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT keys. In the current implementation, each image generates on the order of 1000 SIFT keys, a process that requires less than 1 second of computation time.
    The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final estimate of model parameters. When at least 3 keys agree on the model parameters with low residual, there is strong evidence for the presence of the object. Since there may be dozens of SIFT keys in the image of a typical object, it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.
    The current object models are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60 degree rotation away from the camera or to allow up to a 20 degree rotation of a 3D object.
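The first stage described on the Lowe (1999) slide, finding maxima or minima of a difference-of-Gaussian (DoG) function in scale space, can be sketched roughly as follows. This is a simplified illustration and not the paper's implementation: it uses a fixed-size stack of blurred images instead of a resampled pyramid, compares each sample against all 26 neighbours in position and scale, and works on plain Python lists to stay self-contained. The function names, the 3-sigma kernel truncation, and the contrast threshold are all assumptions made for this example.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel, truncated at roughly 3 sigma (assumed)."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, sigma):
    """Separable Gaussian blur of a 2D list, replicating edge pixels."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])

    def clamp(i, n):
        return min(max(i, 0), n - 1)

    # Horizontal pass first, then vertical pass on the result.
    rows = [[sum(k * image[y][clamp(x + d - radius, width)]
                 for d, k in enumerate(kernel))
             for x in range(width)] for y in range(height)]
    return [[sum(k * rows[clamp(y + d - radius, height)][x]
                 for d, k in enumerate(kernel))
             for x in range(width)] for y in range(height)]


def dog_extrema(image, sigmas, threshold=0.01):
    """Return (x, y, sigma) for strict local extrema of the DoG stack.

    The reported sigma is the smaller of the two blur levels that form
    the DoG level where the extremum was found.
    """
    height, width = len(image), len(image[0])
    blurred = [blur(image, s) for s in sigmas]
    # Each DoG level is the difference of two adjacent blur levels.
    dogs = [[[b[y][x] - a[y][x] for x in range(width)] for y in range(height)]
            for a, b in zip(blurred, blurred[1:])]
    offsets = [(ds, dy, dx) for ds in (-1, 0, 1) for dy in (-1, 0, 1)
               for dx in (-1, 0, 1) if (ds, dy, dx) != (0, 0, 0)]
    keys = []
    for s in range(1, len(dogs) - 1):
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                value = dogs[s][y][x]
                if abs(value) < threshold:
                    continue  # skip weak responses (threshold is an assumption)
                neighbours = [dogs[s + ds][y + dy][x + dx]
                              for ds, dy, dx in offsets]
                if value > max(neighbours) or value < min(neighbours):
                    keys.append((x, y, sigmas[s]))
    return keys


def make_blob(size, variance):
    """Synthetic test image: one bright Gaussian blob at the centre."""
    c = size // 2
    return [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2.0 * variance))
             for x in range(size)] for y in range(size)]


if __name__ == "__main__":
    sigmas = [2 ** (i / 2) for i in range(6)]
    print(dog_extrema(make_blob(41, 2.0 * 2.0 ** 1.5), sigmas, threshold=0.1))
```

On a synthetic image with a single bright blob, the detector should report a key at the blob centre at the scale level that best matches the blob size. A practical version would use an array library and the pyramid resampling described in the paper.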
  • 11. Dalal and Triggs (2005): Histograms of Oriented Gradients for Human Detection. Navneet Dalal and Bill Triggs, INRIA Rhône-Alps, 655 avenue de l'Europe, Montbonnot 38334, France, {Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr
    Abstract: We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
    1 Introduction: Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that lo-
    We briefly discuss previous work on human detection in §2, give an overview of our method §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.
    2 Previous Work: There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those
cally normalized Histogram of Oriented Gradient (HOG) de- of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth scriptors provide excellent performance relative to other ex- [9]. Mikolajczyk et al [16] use combinations of orientation- isting feature sets including wavelets [17,22]. The proposed Proc. of the International Conference on position histograms with binary-thresholded gradient magni- 1 descriptorsVision, Corfu (Sept. 1999) orientation histograms tudes to build a parts based method containing detectors for Computer are reminiscent of edge [4,5], SIFT descriptors [12] and shape contexts [1], but they faces, heads, and front and side profiles of upper and lower yey!! are computed on a dense grid of uniformly spaced cells and body parts. In contrast, our detector uses a simpler archi- tecture with a single detection window, but appears to give they use overlapping local contrast normalizations for im- proved performance. We make a detailed study of the effects significantly higher performance on pedestrian images. of various implementation choices on detector performance, taking “pedestrian detection” (the detection of mostly visible 3 Overview of the Method
  • 12. 3 papers, yey!!
Lowe (1999): Object Recognition from Local Scale-Invariant Features.
Dalal and Triggs (2005): Histograms of Oriented Gradients for Human Detection.
Felzenszwalb et al. (2008): A Discriminatively Trained, Multiscale, Deformable Part Model. Pedro Felzenszwalb (University of Chicago, pff@cs.uchicago.edu), David McAllester (Toyota Technological Institute at Chicago, mcallester@tti-c.org), Deva Ramanan (UC Irvine, dramanan@ics.uci.edu). Abstract: This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.
  • 13. Lowe (1999): Object Recognition from Local Scale-Invariant Features, David G. Lowe, University of British Columbia. Shown alongside Dalal and Triggs (2005), Histograms of Oriented Gradients for Human Detection, and Felzenszwalb et al. (2008), A Discriminatively Trained, Multiscale, Deformable Part Model.
  • 14. Scale-Invariant Feature Transform (SIFT) adapted from Kucuktunc
  • 15. Scale-Invariant Feature Transform (SIFT) adapted from Brown, ICCV 2003
  • 16. SIFT local features are invariant... adapted from David Lee
  • 17. like me they are robust...
  • 18. like me they are robust... to changes in illumination, noise, viewpoint, occlusion, etc.
  • 19. I am sure you want to know how to build them
  • 20. I am sure you want to know how to build them 1. find interest points or “keypoints”
  • 21. I am sure you want to know how to build them 1. find interest points or “keypoints” 2. find their dominant orientation
  • 22. I am sure you want to know how to build them 1. find interest points or “keypoints” 2. find their dominant orientation 3. compute their descriptor
  • 23. I am sure you want to know how to build them 1. find interest points or “keypoints” 2. find their dominant orientation 3. compute their descriptor 4. match them on other images
  • 24. 1. find interest points or “keypoints”
  • 25. keypoints are taken as maxima/minima of a DoG pyramid; in this setting, extrema are invariant to scale...
  • 26. a DoG (Difference of Gaussians) pyramid is simple to compute... even he can do it! before / after (adapted from Pallus and Fleishman)
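To make "simple to compute" concrete, here is a minimal sketch of one DoG octave in plain Python. The base scale `sigma0=1.6`, the number of blur levels, the 3-sigma kernel truncation, and the edge clamping are all assumptions chosen for illustration, not values taken from the slides. Downsampling between octaves is left out.

```python
import math

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, truncated at 3 sigma and normalized to sum to 1."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(image, sigma):
    """Separable Gaussian blur of a 2-D list of floats (borders are clamped)."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])

    def clamp(v, size):
        return min(max(v, 0), size - 1)

    # Horizontal pass, then vertical pass.
    rows = [[sum(k * image[y][clamp(x + i - radius, width)]
                 for i, k in enumerate(kernel))
             for x in range(width)] for y in range(height)]
    return [[sum(k * rows[clamp(y + i - radius, height)][x]
                 for i, k in enumerate(kernel))
             for x in range(width)] for y in range(height)]

def dog_octave(image, sigma0=1.6, levels=4):
    """Blur at sigma0 * k**i, then subtract adjacent levels (more blurred minus less blurred)."""
    k = 2 ** (1.0 / (levels - 1))
    blurred = [blur(image, sigma0 * k ** i) for i in range(levels)]
    return [[[b - a for a, b in zip(row_lo, row_hi)]
             for row_lo, row_hi in zip(lo, hi)]
            for lo, hi in zip(blurred, blurred[1:])]
```

With `levels=4` this returns three DoG images stacked along the scale axis, which is the 3D space searched for extrema.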
  • 27. then we just have to find neighborhood extrema in this 3D DoG space
  • 28. then we just have to find neighborhood extrema in this 3D DoG space: if a pixel is an extremum in its neighboring region, it becomes a candidate keypoint
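The extremum search can be sketched as below, assuming the DoG space is stored as a nested list `dog[scale][y][x]` (a hypothetical layout chosen for this example). A point is kept when it is strictly larger or strictly smaller than all 26 neighbors in the surrounding 3x3x3 cube.

```python
def local_extrema(dog):
    """Return (scale, y, x) triples that are strict extrema of their 3x3x3 neighborhood."""
    keypoints = []
    for s in range(1, len(dog) - 1):
        for y in range(1, len(dog[s]) - 1):
            for x in range(1, len(dog[s][y]) - 1):
                value = dog[s][y][x]
                neighbors = [dog[s + ds][y + dy][x + dx]
                             for ds in (-1, 0, 1)
                             for dy in (-1, 0, 1)
                             for dx in (-1, 0, 1)
                             if (ds, dy, dx) != (0, 0, 0)]
                if value > max(neighbors) or value < min(neighbors):
                    keypoints.append((s, y, x))
    return keypoints
```

Border pixels and the first and last scale are skipped because they do not have a full set of neighbors.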
  • 29. too many keypoints? adapted from wikipedia
  • 30. too many keypoints? 1. remove low contrast adapted from wikipedia
  • 31. too many keypoints? 1. remove low contrast adapted from wikipedia
  • 32. too many keypoints? 1. remove low contrast 2. remove edges adapted from wikipedia
  • 33. too many keypoints? 1. remove low contrast 2. remove edges adapted from wikipedia
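The two pruning steps can be sketched as a single test per candidate. The thresholds are assumptions borrowed from Lowe's later description of SIFT (contrast below 0.03 for images scaled to [0, 1], principal-curvature ratio above 10), not numbers given on these slides. The edge test compares trace squared over determinant of the 2x2 Hessian against `(r + 1)**2 / r`.

```python
def keep_keypoint(dog_level, y, x, contrast_threshold=0.03, edge_ratio=10.0):
    """Return True if the candidate at (y, x) survives both pruning steps."""
    d = dog_level
    # 1. remove low contrast
    if abs(d[y][x]) < contrast_threshold:
        return False
    # 2. remove edges: finite-difference Hessian of the DoG image
    dxx = d[y][x + 1] - 2 * d[y][x] + d[y][x - 1]
    dyy = d[y + 1][x] - 2 * d[y][x] + d[y - 1][x]
    dxy = (d[y + 1][x + 1] - d[y + 1][x - 1]
           - d[y - 1][x + 1] + d[y - 1][x - 1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures of opposite sign, or one of them zero: not a blob.
        return False
    return trace * trace / det < (edge_ratio + 1) ** 2 / edge_ratio
```

An isolated blob has two similar curvatures and passes; a ridge has one curvature near zero, so its determinant is tiny and it is rejected.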
  • 34. 2. find their dominant orientation
  • 35. each selected keypoint is assigned to one or more “dominant” orientations...
  • 36. each selected keypoint is assigned to one or more “dominant” orientations... ... this step is important to achieve rotation invariance
  • 37. How?
  • 38. How? using the DoG pyramid to achieve scale invariance:
  • 39. How? using the DoG pyramid to achieve scale invariance: a. compute image gradient magnitude and orientation
  • 40. How? using the DoG pyramid to achieve scale invariance: a. compute image gradient magnitude and orientation b. build an orientation histogram
  • 41. How? using the DoG pyramid to achieve scale invariance: a. compute image gradient magnitude and orientation b. build an orientation histogram c. keypoint’s orientation(s) = peak(s)
  • 42. a. compute image gradient magnitude and orientation
  • 43. a. compute image gradient magnitude and orientation
  • 44. b. build an orientation histogram adapted from Ofir Pele
  • 45. c. keypoint’s orientation(s) = peak(s) * * the peak ;-)
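Steps a to c can be sketched as follows. The 36 bins, the window radius, and the 80% peak ratio are assumptions for illustration. This simplified version also skips the Gaussian weighting and the peak interpolation that a full implementation would normally add, and it reports the lower edge of each winning bin.

```python
import math

def gradient(image, y, x):
    """a. central-difference gradient: (magnitude, orientation in degrees, 0..360)."""
    dx = image[y][x + 1] - image[y][x - 1]
    dy = image[y + 1][x] - image[y - 1][x]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx)) % 360.0

def dominant_orientations(image, cy, cx, radius=4, bins=36, peak_ratio=0.8):
    """b. build a magnitude-weighted orientation histogram; c. return its peak(s)."""
    hist = [0.0] * bins
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            if 1 <= y < len(image) - 1 and 1 <= x < len(image[0]) - 1:
                magnitude, angle = gradient(image, y, x)
                hist[int(angle / (360.0 / bins)) % bins] += magnitude
    peak = max(hist)
    if peak == 0:
        return []
    return [b * 360.0 / bins for b, h in enumerate(hist) if h >= peak_ratio * peak]
```

Returning every bin close to the maximum is what lets one keypoint carry more than one "dominant" orientation.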
  • 46. 3. compute their descriptor
  • 47. SIFT descriptor = a set of orientation histograms: 16x16 neighborhood of pixel gradients, 4x4 array x 8 bins = 128 dimensions (normalized)
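The arithmetic on this slide (4x4 cells x 8 bins = 128) can be checked with a stripped-down sketch. It assumes an upright patch, so it does not rotate to the dominant orientation, and it omits the Gaussian weighting and interpolation between bins. The function name is hypothetical.

```python
import math

def sift_like_descriptor(image, cy, cx):
    """128-D descriptor from the 16x16 neighborhood around (cy, cx).

    Needs at least 9 pixels of margin above/left and 8 below/right of the center.
    """
    descriptor = [0.0] * 128
    for j in range(16):
        for i in range(16):
            y, x = cy - 8 + j, cx - 8 + i
            dx = image[y][x + 1] - image[y][x - 1]
            dy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            cell = (j // 4) * 4 + (i // 4)       # which of the 4x4 cells
            bin_index = int(angle / 45.0) % 8    # which of the 8 orientation bins
            descriptor[cell * 8 + bin_index] += magnitude
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0 else descriptor
```

Normalizing to unit length is what makes the descriptor insensitive to a uniform change in contrast.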
  • 48. 4. match them on other images
  • 49. How to match?
  • 50. How to match? nearest neighbor
  • 51. How to match? nearest neighbor, Hough transform voting
  • 52. How to match? nearest neighbor, Hough transform voting, least-squares fit
  • 53. How to match? nearest neighbor, Hough transform voting, least-squares fit, etc.
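The first item, nearest-neighbor matching, can be sketched as a brute-force search. The distance-ratio check (`max_ratio=0.8`) is an assumption borrowed from later SIFT work to drop ambiguous matches, and it is not mentioned on the slide. Hough voting and the least-squares fit would then operate on the pairs returned here.

```python
def match_descriptors(query, train, max_ratio=0.8):
    """Return (query_index, train_index) pairs for unambiguous nearest neighbors."""
    matches = []
    for qi, q in enumerate(query):
        # Squared Euclidean distance to every training descriptor, closest first.
        distances = sorted((sum((a - b) ** 2 for a, b in zip(q, t)), ti)
                           for ti, t in enumerate(train))
        # Ratio test on squared distances, hence max_ratio squared.
        if len(distances) >= 2 and distances[0][0] < max_ratio ** 2 * distances[1][0]:
            matches.append((qi, distances[0][1]))
    return matches
```

A real system would replace the exhaustive search with an approximate index, since each image can produce on the order of a thousand keys.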
  • 55. SIFT is great! partially invariant to affine transformations
  • 56. SIFT is great! partially invariant to affine transformations, easy to understand
  • 57. SIFT is great! partially invariant to affine transformations, easy to understand, fast to compute
  • 58. Extension example: Spatial Pyramid Matching using SIFT. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. Svetlana Lazebnik (Beckman Institute, University of Illinois, slazebni@uiuc.edu), Cordelia Schmid (INRIA Rhône-Alpes, Montbonnot, France, Cordelia.Schmid@inrialpes.fr), Jean Ponce (Beckman Institute, University of Illinois, and Ecole Normale Supérieure, Paris, France, ponce@cs.uiuc.edu). CVPR 2006
  • 59. Dalal and Triggs (2005): Histograms of Oriented Gradients for Human Detection, Navneet Dalal and Bill Triggs, INRIA Rhône-Alps. Shown alongside Lowe (1999), Object Recognition from Local Scale-Invariant Features, and Felzenszwalb et al. (2008), A Discriminatively Trained, Multiscale, Deformable Part Model.
  • 60. first of all, let me put this paper in context
Histograms of Oriented Gradients for Human Detection. Navneet Dalal and Bill Triggs, INRIA Rhône-Alps, 655 avenue de l’Europe, Montbonnot 38334, France, {Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr. Abstract: We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
  • 61. • Swain & Ballard 1991 - Color Histograms • Schiele & Crowley 1996 - Receptive Fields Histograms • Lowe 1999 - SIFT • Schneiderman & Kanade 2000 - Localized Histograms of Wavelets • Leung & Malik 2001 - Texton Histograms • Belongie et al. 2002 - Shape Context • Dalal & Triggs 2005 - Dense Orientation Histograms • ... histograms of local image measurements have been quite successful
  • 62. features • Gavrila & Philomen 1999 - Edge Templates + Nearest Neighbor • Papageorgiou & Poggio 2000, Mohan et al. 2001, DePoortere et al. 2002 - Haar Wavelets + SVM • Viola & Jones 2001 - Rectangular Differential Features + AdaBoost • Mikolajczyk et al. 2004 - Parts Based Histograms + AdaBoost • Ke & Sukthankar 2004 - PCA-SIFT • ... tons of “feature sets” have been proposed
  • 63. difficult! • Wide variety of articulated poses • Variable appearance/clothing • Complex backgrounds • Unconstrained illuminations • Occlusions • Different scales • ... localizing humans in images is a challenging task...
  • 67. Approach • robust feature set (HOG) • simple classifier (linear SVM)
  • 68. Approach • robust feature set (HOG) • simple classifier (linear SVM) • fast detection (sliding window)
  • 70. • Gamma normalization • Space: RGB, LAB or Gray • Method: SQRT or LOG
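To make the gamma normalization step a bit more concrete, here is a minimal sketch of the two compression methods named on the slide, applied to one pixel. The function name `gamma_normalize` and the 0..255 input range are assumptions made for illustration; they do not come from the slides.

```python
import math

def gamma_normalize(pixel, method="sqrt"):
    """Compress the channel values of one pixel (hypothetical helper).

    pixel: tuple of non-negative channel intensities (assumed 0..255).
    method: "sqrt" for square-root compression, "log" for log compression.
    """
    if method == "sqrt":
        return tuple(math.sqrt(c) for c in pixel)
    if method == "log":
        # log(1 + c) keeps zero intensities finite
        return tuple(math.log1p(c) for c in pixel)
    raise ValueError("method must be 'sqrt' or 'log'")
```

The same function would be mapped over every pixel of an RGB, LAB or grayscale image before the gradients are computed.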
  • 71. • Filtering with simple masks: uncentered, centered *, cubic-corrected, diagonal, Sobel (* centered performs the best)
  • 72. remember SIFT ? • Filtering with simple masks: centered, uncentered, cubic-corrected
  • 73. ...after filtering, each “pixel” represents an oriented gradient...
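A small sketch of what "each pixel represents an oriented gradient" could look like in code, using the centered [-1, 0, 1] mask. The function name `gradients`, the border handling (replicating edge pixels) and the use of unsigned orientations in degrees are assumptions for this illustration.

```python
import math

def gradients(image):
    """Return (magnitude, orientation) grids for a 2-D list of gray levels.

    Uses the centered [-1, 0, 1] mask along x and y. Border pixels are
    replicated (an assumption made here so the output keeps the input size).
    Orientations are unsigned, in degrees, folded into the range 0..180.
    """
    h, w = len(image), len(image[0])
    mag = [[0.0] * w for _ in range(h)]
    ori = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(gx, gy)
            ori[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ori
```

A vertical edge gives a horizontal gradient (orientation near 0 degrees), and a horizontal edge gives an orientation near 90 degrees.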
  • 74. ...pixels are regrouped in “cells”, they cast a weighted vote for an orientation histogram... HOG (Histogram of Oriented Gradients)
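The weighted vote can be sketched as follows. This is a simplified version: each pixel votes for a single orientation bin with its gradient magnitude as weight, without the interpolation between neighbouring bins that a full implementation would likely use. The name `cell_histogram` and the default of 9 bins over 0..180 degrees are assumptions for the example.

```python
def cell_histogram(mags, oris, n_bins=9):
    """Orientation histogram of one cell (hypothetical helper).

    mags, oris: flat lists with the gradient magnitude and the unsigned
    orientation (degrees, 0..180) of every pixel in the cell.
    Each pixel adds its magnitude to the bin its orientation falls in.
    """
    bin_width = 180.0 / n_bins
    hist = [0.0] * n_bins
    for m, o in zip(mags, oris):
        hist[int(o // bin_width) % n_bins] += m
    return hist
```

With 9 bins each bin covers 20 degrees, so a pixel at 95 degrees votes for bin 4 and one at 100 degrees votes for bin 5.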
  • 75. a window can be represented like that
  • 76. then, cells are locally normalized using overlapping “blocks”
  • 77. they used two types of blocks
  • 78. they used two types of blocks • rectangular • similar to SIFT (but dense)
  • 79. they used two types of blocks • rectangular • circular • similar to SIFT (but dense) • similar to Shape Context
  • 80. and four different types of block normalization
  • 81. and four different types of block normalization
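The transcript does not name the four schemes; in the Dalal & Triggs paper they are L2-norm, L2-Hys, L1-norm and L1-sqrt. The sketch below applies each of them to one block descriptor (the concatenated cell histograms of a block). The function name `normalize_block` and the values of `eps` and `clip` are assumptions chosen for illustration.

```python
import math

def normalize_block(v, scheme="L2-Hys", eps=1e-6, clip=0.2):
    """Normalize one block descriptor v (a list of non-negative votes)."""
    if scheme == "L1-norm":
        s = sum(abs(x) for x in v) + eps
        return [x / s for x in v]
    if scheme == "L1-sqrt":
        # square root of the L1-normalized values (needs non-negative input)
        s = sum(abs(x) for x in v) + eps
        return [math.sqrt(x / s) for x in v]
    if scheme in ("L2-norm", "L2-Hys"):
        n = math.sqrt(sum(x * x for x in v) + eps ** 2)
        out = [x / n for x in v]
        if scheme == "L2-Hys":
            # clip large values, then renormalize
            out = [min(x, clip) for x in out]
            n = math.sqrt(sum(x * x for x in out) + eps ** 2)
            out = [x / n for x in out]
        return out
    raise ValueError("unknown scheme: " + scheme)
```

Because blocks overlap, every cell ends up in the final descriptor several times, each time normalized against a different neighbourhood.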
  • 82. like SIFT, they gain invariance... ...to illuminations, small deformations, etc.
  • 83. finally, a sliding window is classified by a simple linear SVM
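As a rough sketch of the detection step: slide a window over a grid of cell features, concatenate the cells inside the window into one descriptor, and score it with a linear classifier. The names `detect`, `weights` and `bias` are hypothetical; the weights stand in for a trained linear SVM, and the sketch works at a single scale only.

```python
def detect(cells, win_h, win_w, weights, bias, threshold=0.0):
    """Score every window position with a linear classifier.

    cells: 2-D grid (list of rows) of per-cell feature lists.
    win_h, win_w: window size measured in cells.
    Returns (top, left, score) for each window scoring above threshold.
    """
    rows, cols = len(cells), len(cells[0])
    hits = []
    for top in range(rows - win_h + 1):
        for left in range(cols - win_w + 1):
            # concatenate the features of all cells inside the window
            descriptor = [value
                          for r in range(top, top + win_h)
                          for c in range(left, left + win_w)
                          for value in cells[r][c]]
            score = sum(w * f for w, f in zip(weights, descriptor)) + bias
            if score > threshold:
                hits.append((top, left, score))
    return hits
```

Running the same scan on rescaled copies of the image would give detections at different scales.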
  • 84. Training • during the learning phase, the algorithm “looked” for hard examples (adapted from Martial Hebert)
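The "look for hard examples" idea can be sketched as a two-round training loop: train once, collect the negatives the first model wrongly accepts, then retrain with those added. To keep the example self-contained, a tiny perceptron stands in for the linear SVM; that swap, and all function names here, are assumptions for illustration and not the authors' training code.

```python
def train_linear(samples, labels, epochs=50, lr=1.0):
    """Perceptron stand-in for a linear SVM (an assumption for brevity)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def score(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_with_hard_negatives(positives, negatives, negative_pool):
    """Train, mine false positives from negative_pool, then retrain."""
    samples = positives + negatives
    labels = [1] * len(positives) + [-1] * len(negatives)
    model = train_linear(samples, labels)
    # hard examples: negatives that the first model scores as positive
    hard = [x for x in negative_pool if score(model, x) > 0]
    model = train_linear(samples + hard, labels + [-1] * len(hard))
    return model, hard
```

In a detector, `negative_pool` would be the descriptors of all windows scanned from person-free images.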
  • 87. Example adapted from Bill Triggs
  • 88. Example adapted from Martial Hebert
  • 90. Further Development • Detection on Pascal VOC (2006)
  • 91. Further Development • Detection on Pascal VOC (2006) • Human Detection in Movies (ECCV 2006)
  • 92. Further Development • Detection on Pascal VOC (2006) • Human Detection in Movies (ECCV 2006) • US Patent by MERL (2006)
  • 93. Further Development • Detection on Pascal VOC (2006) • Human Detection in Movies (ECCV 2006) • US Patent by MERL (2006) • Stereo Vision HoG (ICVES 2008)
  • 99. A simple demo... VIDEO HERE
  • 100. A simple demo... VIDEO HERE
  • 101.
  • 102. so, it doesn’t work ?!?
  • 103. so, it doesn’t work ?!? no no, it works...
  • 104. so, it doesn’t work ?!? no no, it works... ...it just doesn’t work well...
  • 105. • Lowe (1999): Object Recognition from Local Scale-Invariant Features • Dalal and Triggs (2005): Histograms of Oriented Gradients for Human Detection • Felzenszwalb et al. (2008): A Discriminatively Trained, Multiscale, Deformable Part Model
    Pedro Felzenszwalb (University of Chicago, pff@cs.uchicago.edu), David McAllester (Toyota Technological Institute at Chicago, mcallester@tti-c.org), Deva Ramanan (UC Irvine, dramanan@ics.uci.edu)
    Abstract: This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive ...
    Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.
  • 106. This paper describes one of the best algorithms in object detection...
  • 107. They used the following methods: • HOG Features • Deformable Part Model • Latent SVM
  • 108. They used the following methods: • HOG Features: introduced by Dalal & Triggs (2005)
  • 109. They used the following methods: • Deformable Part Model: introduced by Fischler & Elschlager (1973)
  • 110. They used the following methods: • Latent SVM: introduced by the authors
  • 111. HOG Features
  • 112. Model Overview • detection • root filter • part filters • deformation models
  • 113. HOG Features // 8x8 pixel blocks window // features computed at different resolutions (pyramid)
  • 114. HOG Pyramid
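One way to picture the pyramid: a list of scale factors at which the image is resized before HOG features are computed on each copy. In this sketch the resolution halves every `levels_per_octave` levels; the function name and both parameters are assumptions for illustration, and the slides do not say how many levels were used.

```python
def pyramid_scales(levels_per_octave, n_octaves):
    """Scale factors of a feature pyramid, from full size downwards.

    The resolution halves every `levels_per_octave` levels, so the list
    holds levels_per_octave * n_octaves + 1 factors starting at 1.0.
    """
    step = 2.0 ** (-1.0 / levels_per_octave)
    return [step ** i for i in range(levels_per_octave * n_octaves + 1)]
```

With this layout, a filter applied one octave lower in the list sees the image at twice the resolution, which lets part filters look at finer detail than a root filter.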
  • 115. Deformable Part Model
  • 116. Deformable Part Model // each part is a local property // springs capture spatial relationships // here, the springs can be “negative”
  • 117. Deformable Part Model • detection score = sum of filter responses - deformation cost
  • 118. Deformable Part Model • detection score = sum of filter responses - deformation cost • root filter
  • 119. Deformable Part Model • detection score = sum of filter responses - deformation cost • root filter • part filters
  • 120. Deformable Part Model • detection score = sum of filter responses - deformation cost • root filter • part filters • deformable model
  • 121. Deformable Part Model • score of a placement • filters • feature vector (at position p in the pyramid H) • coefficients of a quadratic function on the placement • position relative to the root location
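Reading the labels on this slide, the score of a placement adds up the filter responses and a quadratic function of each part's position relative to the root. The sketch below follows that reading; the function name, the argument layout and the choice to fold the sign of the deformation cost into the coefficients (so negative quadratic coefficients act as a cost) are assumptions for illustration.

```python
def placement_score(filter_responses, displacements, a, b):
    """Score of one placement of the root and its parts.

    filter_responses: responses of the root filter and each part filter
        at their chosen positions in the feature pyramid.
    displacements: (dx, dy) of each part relative to its expected
        location with respect to the root.
    a, b: per-part coefficient pairs for the linear and quadratic terms.
    """
    data_term = sum(filter_responses)
    spatial_term = 0.0
    for (dx, dy), (ax, ay), (bx, by) in zip(displacements, a, b):
        spatial_term += ax * dx + ay * dy + bx * dx * dx + by * dy * dy
    return data_term + spatial_term
```

For example, a root response of 1.0 and a part response of 0.5, with the part displaced by (1, 2) and quadratic coefficients of -0.1, lose 0.5 to deformation and score about 1.0.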
  • 122. Latent SVM
  • 123. Latent SVM • filters and deformation parameters • features • part displacements
  • 124. Latent SVM
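The idea behind a latent SVM can be sketched like this: the part displacements are not labeled, so an example is scored by the best value of the latent variable, and the usual hinge loss is applied to that score. The names `latent_score`, `hinge_loss`, `beta` and `candidate_features` are hypothetical, and enumerating the candidates in a dictionary is a simplification for illustration.

```python
def latent_score(beta, candidate_features):
    """Best score over the latent placements of one example.

    beta: parameter vector (filters and deformation parameters).
    candidate_features: dict mapping each latent value z (for instance a
        tuple of part displacements) to its feature vector.
    Returns (best score, best z).
    """
    best_z, best = None, float("-inf")
    for z, phi in candidate_features.items():
        s = sum(b * f for b, f in zip(beta, phi))
        if s > best:
            best_z, best = z, s
    return best, best_z

def hinge_loss(beta, candidate_features, label):
    """Hinge loss on the latent score; label is +1 or -1."""
    score, _ = latent_score(beta, candidate_features)
    return max(0.0, 1.0 - label * score)
```

Once the latent values of the positive examples are fixed, the problem looks like an ordinary linear SVM again, which is what makes alternating training schemes possible.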
  • 125. Bonus // Data Mining Hard Negatives // Model Initialization
  • 128. Experiments ~ Dalal’s model ~ Dalal’s + LSVM
  • 129. Examples • errors
  • 130. A simple demo...
  • 131. A simple demo...
  • 132. A simple demo...
  • 133. A simple demo...
  • 134. Conclusions
  • 135. Conclusions so, it doesn’t work ?!?
  • 136. Conclusions so, it doesn’t work ?!? no no, it works...
  • 137. Conclusions so, it doesn’t work ?!? no no, it works... ...it just doesn’t work well...
  • 138. Conclusions so, it doesn’t work ?!? no no, it works... ...it just doesn’t work well... ...or there is a problem with the seat-computer interface...