6. Hosts
a guy... (who has big arms)
Antonio T... (who knows a lot about vision)
7. Hosts
a guy... (who has big arms)
Antonio T... (who knows a lot about vision)
a frog... (who has big eyes)
8. Hosts
a guy... (who has big arms)
Antonio T... (who knows a lot about vision)
a frog... (who has big eyes) and thus should know a lot about vision...
10. Object Recognition from Local Scale-Invariant Features
3 papers
Lowe (1999)
David G. Lowe
Computer Science Department
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
lowe@cs.ubc.ca
Proc. of the International Conference on Computer Vision, Corfu (Sept. 1999)
Abstract
An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images with a computation time of under 2 seconds.
1. Introduction
Object recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion. The features must be at least partially invariant to illumination, 3D projective transforms, and common object variations. On the other hand, the features must also be sufficiently distinctive to identify specific objects among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in finding such image features. However, recent research on the use of dense local features (e.g., Schmid & Mohr [19]) has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.
This paper presents a new method for image feature generation called the Scale Invariant Feature Transform (SIFT). This approach transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal (IT) cortex in primate vision. This paper also describes improved approaches to indexing and model verification.
The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations. This approach is based on a model of the behavior of complex cells in the cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT keys. In the current implementation, each image generates on the order of 1000 SIFT keys, a process that requires less than 1 second of computation time.
The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final estimate of model parameters. When at least 3 keys agree on the model parameters with low residual, there is strong evidence for the presence of the object. Since there may be dozens of SIFT keys in the image of a typical object, it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.
The current object models are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60 degree rotation away from the camera or to allow up to a 20 degree rotation of a 3D object.
yey!!
11. Object Recognition from Local Scale-Invariant Features
3 papers
Lowe (1999)
Object Recognition from Local Scale-Invariant Features, David G. Lowe
yey!!
Dalal and Triggs (2005)
Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
INRIA Rhône-Alps, 655 avenue de l'Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr
Abstract
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
12. Object Recognition from Local Scale-Invariant Features
3 papers
Lowe (1999)
Object Recognition from Local Scale-Invariant Features, David G. Lowe
yey!!
Dalal and Triggs (2005)
Histograms of Oriented Gradients for Human Detection, Navneet Dalal and Bill Triggs
Felzenszwalb et al. (2008)
A Discriminatively Trained, Multiscale, Deformable Part Model
Pedro Felzenszwalb, University of Chicago, pff@cs.uchicago.edu
David McAllester, Toyota Technological Institute at Chicago, mcallester@tti-c.org
Deva Ramanan, UC Irvine, dramanan@ics.uci.edu
Abstract
This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive ...
Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.
13. Object Recognition from Local Scale-Invariant Features
Lowe (1999)
Object Recognition from Local Scale-Invariant Features, David G. Lowe
Dalal and Triggs (2005)
Histograms of Oriented Gradients for Human Detection, Navneet Dalal and Bill Triggs
Felzenszwalb et al. (2008)
A Discriminatively Trained, Multiscale, Deformable Part Model, Pedro Felzenszwalb, David McAllester, Deva Ramanan
18. like me they are robust...
... to changes in illumination,
noise, viewpoint, occlusion, etc.
19. I am sure you want to know
how to build them
20. I am sure you want to know
how to build them
1. find interest points or “keypoints”
21. I am sure you want to know
how to build them
1. find interest points or “keypoints”
2. find their dominant orientation
22. I am sure you want to know
how to build them
1. find interest points or “keypoints”
2. find their dominant orientation
3. compute their descriptor
23. I am sure you want to know
how to build them
1. find interest points or “keypoints”
2. find their dominant orientation
3. compute their descriptor
4. match them on other images
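Step 4 can be sketched as a brute-force nearest-neighbour search over descriptors. This is a minimal illustration only: the function name `match_descriptors` and the optional `max_distance` cut-off are assumptions made for this example, and a real system would use an index structure (the Lowe abstract mentions nearest-neighbor indexing) rather than a double loop.

```python
import math


def match_descriptors(query, train, max_distance=None):
    """Match each query descriptor to its nearest train descriptor.

    Descriptors are plain sequences of floats of equal length.
    Returns a list of (query_index, train_index, distance) tuples.
    `max_distance` is a hypothetical cut-off to drop weak matches.
    """
    matches = []
    for qi, q in enumerate(query):
        best_index, best_dist = -1, math.inf
        for ti, t in enumerate(train):
            d = math.dist(q, t)  # Euclidean distance between descriptors
            if d < best_dist:
                best_index, best_dist = ti, d
        if best_index < 0:
            continue  # no train descriptors at all
        if max_distance is None or best_dist <= max_distance:
            matches.append((qi, best_index, best_dist))
    return matches
```

Each returned pair links a keypoint in one image to its most similar keypoint in the other; checking that several pairs agree on a pose is a separate step.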
25. keypoints are taken as maxima/minima
of a DoG pyramid
in this setting, extrema are invariant to scale...
26. a DoG (Difference of Gaussians) pyramid
is simple to compute... even he can do it!
before after
adapted from Pallus and Fleishman
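To show how simple it is, here is a rough sketch of one octave of a DoG stack in plain Python: blur the image at increasing scales and subtract neighbouring blurs. The starting scale `sigma0`, the scale step `k`, the number of levels and the border handling are assumptions picked for illustration, not values taken from the slides, and a full pyramid would also downsample between octaves.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, sigma):
    """Separable Gaussian blur of a 2D list, with replicated borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])

    def clamp(i, n):
        return min(max(i, 0), n - 1)

    # Horizontal pass, then vertical pass (the Gaussian is separable).
    horizontal = [[sum(w * image[y][clamp(x + j - radius, width)]
                       for j, w in enumerate(kernel))
                   for x in range(width)]
                  for y in range(height)]
    return [[sum(w * horizontal[clamp(y + j - radius, height)][x]
                 for j, w in enumerate(kernel))
             for x in range(width)]
            for y in range(height)]


def dog_stack(image, sigma0=1.0, k=math.sqrt(2), levels=4):
    """Return levels - 1 DoG layers: blur(k * sigma) minus blur(sigma)."""
    blurred = [blur(image, sigma0 * k ** i) for i in range(levels)]
    layers = []
    for fine, coarse in zip(blurred, blurred[1:]):
        layers.append([[c - f for f, c in zip(fine_row, coarse_row)]
                       for fine_row, coarse_row in zip(fine, coarse)])
    return layers
```

A flat image gives a DoG of zero everywhere, while a small bright blob gives a strong negative response at its centre, which is the kind of extremum the next slides look for.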
27. then we just have to find
neighborhood extrema
in this 3D DoG space
28. then we just have to find
neighborhood extrema
in this 3D DoG space
if a pixel is an extremum
in its neighboring region,
it becomes a candidate
keypoint
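The neighbourhood test can be sketched as a comparison against the 26 neighbours of a pixel in (scale, y, x). The sketch assumes the DoG space is a list of equally sized 2D layers and that the "neighboring region" is the 3x3x3 cube around the pixel; the function names are made up for this example.

```python
def is_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly above or strictly below all 26 neighbours."""
    value = dog[s][y][x]
    neighbours = [dog[s + ds][y + dy][x + dx]
                  for ds in (-1, 0, 1)
                  for dy in (-1, 0, 1)
                  for dx in (-1, 0, 1)
                  if (ds, dy, dx) != (0, 0, 0)]
    return value > max(neighbours) or value < min(neighbours)


def find_extrema(dog):
    """Return (scale, y, x) for every interior extremum of a DoG stack.

    Border pixels and the first and last layers are skipped because
    they do not have a full set of 26 neighbours.
    """
    candidates = []
    for s in range(1, len(dog) - 1):
        for y in range(1, len(dog[s]) - 1):
            for x in range(1, len(dog[s][y]) - 1):
                if is_extremum(dog, s, y, x):
                    candidates.append((s, y, x))
    return candidates
```

The points returned here are only candidates; weak or unstable ones are usually filtered out afterwards.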
39. How?
using the DoG pyramid to achieve
scale invariance:
a. compute image gradient
magnitude and orientation
40. How?
using the DoG pyramid to achieve
scale invariance:
a. compute image gradient
magnitude and orientation
b. build an orientation histogram
41. How?
using the DoG pyramid to achieve
scale invariance:
a. compute image gradient
magnitude and orientation
b. build an orientation histogram
c. keypoint’s orientation(s) = peak(s)
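Steps a to c can be sketched on a small patch around a keypoint. The 36 bins and the 80% threshold for secondary peaks are assumptions borrowed from common SIFT descriptions, not values stated on these slides, and the sketch leaves out the Gaussian weighting and peak interpolation a full implementation would add.

```python
import math


def dominant_orientations(patch, num_bins=36, peak_ratio=0.8):
    """Return the dominant gradient orientation(s) of a 2D patch.

    Angles are bin centres in degrees in [0, 360), measured from the
    +x axis towards the +y axis (y grows downwards in image rows).
    """
    hist = [0.0] * num_bins
    bin_width = 360.0 / num_bins
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            # a. gradient magnitude and orientation from central differences
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(dx, dy)
            if magnitude == 0.0:
                continue
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            # b. magnitude-weighted orientation histogram
            hist[int(angle // bin_width) % num_bins] += magnitude
    strongest = max(hist)
    if strongest == 0.0:
        return []  # flat patch: no orientation to report
    # c. keep local peaks that reach peak_ratio of the strongest bin
    peaks = []
    for i, value in enumerate(hist):
        left, right = hist[i - 1], hist[(i + 1) % num_bins]
        if value >= peak_ratio * strongest and value >= left and value >= right:
            peaks.append((i + 0.5) * bin_width)
    return peaks
```

Returning more than one angle matches the "orientation(s)" on the slide: a keypoint with two strong peaks would get one descriptor per orientation.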
56. SIFT is great!
partially invariant to affine transformations
easy to understand
57. SIFT is great!
partially invariant to affine transformations
easy to understand
fast to compute
58. Extension example:
Spatial Pyramid Matching using SIFT
Beyond Bags of Features: Spatial Pyramid Matching
for Recognizing Natural Scene Categories
Svetlana Lazebnik (1), Cordelia Schmid (2), Jean Ponce (1,3)
slazebni@uiuc.edu, Cordelia.Schmid@inrialpes.fr, ponce@cs.uiuc.edu
(1) Beckman Institute, University of Illinois
(2) INRIA Rhône-Alpes, Montbonnot, France
(3) Ecole Normale Supérieure, Paris, France
CVPR 2006
59. Object Recognition from Local Scale-Invariant Features
David G. Lowe
Lowe (1999)
Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
INRIA Rhône-Alps, 655 avenue de l'Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr
Dalal and Triggs
Abstract
We study the question of feature sets for robust visual ob-
We briefly discuss previous work on human detection in
§2, give an overview of our method §3, describe our data
(2005) ject recognition, adopting linear SVM based human detec-
tion as a test case. After reviewing existing edge and gra-
dient based descriptors, we show experimentally that grids
sets in §4 and give a detailed description and experimental
evaluation of each stage of the process in §5–6. The main
conclusions are summarized in §7.
of Histograms of Oriented Gradient (HOG) descriptors sig- 2 Previous Work
nificantly outperform existing feature sets for human detec-
tion. We study the influence of each stage of the computation There is an extensive literature on object detection, but
on performance, concluding that fine-scale gradients, fine here we mention just a few relevant papers on human detec-
orientation binning, relatively coarse spatial binning, and tion [18,17,22,16,20]. See [6] for a survey. Papageorgiou et
high-quality local contrast normalization in overlapping de- al [18] describe a pedestrian detector based on a polynomial
A Discriminatively Trained, Multiscale, Deformable Part Model with
scriptor blocks are all important for good results. The new SVM using rectified Haar wavelets as input descriptors,
approach gives near-perfect separation on the original MIT a parts (subwindow) based variant in [17]. Depoortere et al
pedestrian database, so we introduce a more challenging give an optimized version of this [2]. Gavrila & Philomen
[8] take a more direct approach, extracting edge images and
dataset containing over 1800 annotated human images with McAllester
Pedro Felzenszwalb David matching them to a set of learned exemplarsRamanan
Deva using chamfer
a large range of pose variations and backgrounds.
University of Chicago Toyota Technological Institute athas been used in a practical real-time pedes-
distance. This Chicago UC Irvine
1 Introduction
Felzenszwalb et al.
pff@cs.uchicago.edu mcallester@tti-c.org system [7]. Viola dramanan@ics.uci.edu
trian detection et al [22] build an efficient
Detecting humans in images is a challenging task owing moving person detector, using AdaBoost to train a chain of
to their variable appearance and the wide range of poses that progressively more complex region rejection rules based on
Haar-like wavelets and space-time differences. Ronfard et
(2008)
they can adopt. The first need is a robust feature set that
Abstract
allows the human form to be discriminated cleanly, even in al [19] build an articulated body detector by incorporating
cluttered backgrounds under difficult illumination. We study SVM based limb classifiers over 1st and 2nd order Gaussian
This paper describes a discriminatively trained, multi- filters in a dynamic programming framework similar to those
the issue of feature sets for human detection, showing that lo-
scale, deformable part model for object detection. Our sys- of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth
cally normalized Histogram of Oriented Gradient (HOG) de-
scriptors providetwo-fold improvement relative to other ex-
tem achieves a excellent performance in average precision [9]. Mikolajczyk et al [16] use combinations of orientation-
isting the bestsets including wavelets [17,22]. The person de- position histograms with binary-thresholded gradient magni-
over feature performance in the 2006 PASCAL proposed
descriptors are reminiscent of edge orientation results in the tudes to build a parts based method containing detectors for
tection challenge. It also outperforms the best histograms
[4,5], SIFT descriptors [12]of twenty categories. The system faces, heads, and front and side profiles of upper and lower
2007 challenge in ten out and shape contexts [1], but they
are computed on adeformableof uniformly spaced cells and
relies heavily on dense grid parts. While deformable part body parts. In contrast, our detector uses a simpler archi-
models overlapping quite contrast their value had not been tecture with a single detection obtained with the person model. The
they usehave become local popular, normalizations for im-
Figure 1. Example detection window, but appears to give
model is defined by a coarse template, several higher resolution
proved performance. We make a detailedsuch as the PASCAL significantly higher performance on pedestrian images.
demonstrated on difficult benchmarks study of the effects
of various implementation choices on detector performance,
taking “pedestrian detection” (the detection of mostly visible
3 Overview of the Method
60. Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
INRIA Rhône-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr
Abstract
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
1 Introduction
Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) ...
We briefly discuss previous work on human detection in §2, give an overview of our method §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.
2 Previous Work
There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth ...
first of all, let me put this paper in context
61. Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
Swain & Ballard 1991 - Color Histograms
Schiele & Crowley 1996 - Receptive Fields Histograms
Lowe 1999 - SIFT
Schneiderman & Kanade 2000 - Localized Histograms of Wavelets
Leung & Malik 2001 - Texton Histograms
Belongie et al. 2002 - Shape Context
Dalal & Triggs 2005 - Dense Orientation Histograms
...
histograms of local image measurements
have been quite successful
62. Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
features
Gavrila & Philomen 1999 - Edge Templates + Nearest Neighbor
Papageorgiou & Poggio 2000, Mohan et al. 2001, DePoortere et al. 2002 - Haar Wavelets + SVM
Viola & Jones 2001 - Rectangular Differential Features + AdaBoost
Mikolajczyk et al. 2004 - Parts Based Histograms + AdaBoost
Ke & Sukthankar 2004 - PCA-SIFT
...
tons of “feature sets” have been proposed
63. Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
difficult!
Wide variety of articulated poses
Variable appearance/clothing
Complex backgrounds
Unconstrained illuminations
Occlusions
Different scales
...
localizing humans in images is a
challenging task...
93. Further Development
• Detection on Pascal VOC (2006)
• Human Detection in Movies (ECCV 2006)
• US Patent by MERL (2006)
• Stereo Vision HoG (ICVES 2008)
104. so, it doesn’t work ?!?
no no, it works...
...it just doesn’t work well...
105. Object Recognition from Local Scale-Invariant Features
David G. Lowe (1999)
Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs (2005)
A Discriminatively Trained, Multiscale, Deformable Part Model
Pedro Felzenszwalb (University of Chicago, pff@cs.uchicago.edu)
David McAllester (Toyota Technological Institute at Chicago, mcallester@tti-c.org)
Deva Ramanan (UC Irvine, dramanan@ics.uci.edu)
Felzenszwalb et al. (2008)
Abstract
This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive ...
Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.
116. Deformable Part Model
// each part is a local property
// springs capture spatial relationships
// here, the springs can be “negative”
117. Deformable Part Model
detection score =
sum of filter responses - deformation cost
118. Deformable Part Model
detection score =
sum of filter responses - deformation cost
root filter
119. Deformable Part Model
detection score =
sum of filter responses - deformation cost
root filter
part filters
120. Deformable Part Model
detection score =
sum of filter responses - deformation cost
root filter
part filters
deformable model
121. Deformable Part Model
score of a placement
filters
feature vector (at position p in the pyramid H)
position relative to the root location
coefficients of a quadratic function on the placement
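The score of a placement can be sketched as below, following the slides' wording "sum of filter responses - deformation cost". The dictionary keys and function names are hypothetical. The deformation cost is assumed to be a linear plus quadratic function of each part's offset from its anchor on the root, and filters and feature vectors are assumed to be flattened into equal-length lists of floats.

```python
def dot(u, v):
    """Filter response: dot product of filter weights and a feature vector."""
    return sum(a * b for a, b in zip(u, v))

def placement_score(root_filter, root_features, parts):
    """Detection score = sum of filter responses - deformation cost.

    parts is a list of dicts with hypothetical keys:
      'filter', 'features': equal-length lists of floats
      'offset': (dx, dy), part position relative to its anchor on the root
      'a', 'b': (ax, ay) and (bx, by), linear and quadratic cost coefficients
    """
    score = dot(root_filter, root_features)
    for part in parts:
        dx, dy = part["offset"]
        ax, ay = part["a"]
        bx, by = part["b"]
        deformation_cost = ax * dx + ay * dy + bx * dx * dx + by * dy * dy
        score += dot(part["filter"], part["features"]) - deformation_cost
    return score
```

With learned coefficients nothing forces the cost to be positive, which is one way to read the earlier remark that the springs can be "negative".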
136. Conclusions
so, it doesn’t work ?!?
no no, it works...
137. Conclusions
so, it doesn’t work ?!?
no no, it works...
...it just doesn’t work well...
138. Conclusions
so, it doesn’t work ?!?
no no, it works...
...it just doesn’t work well...
...or there is a problem with the
seat-computer interface...