MIT6.870 Grounding Object Recognition and Scene Understanding: lecture 1
1. 6.870 Grounding object recognition and scene understanding
Wednesdays 1-4pm
Room 13-1143
Instructor: Antonio Torralba
Email: torralba@csail.mit.edu
http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm
Some slides are borrowed from other classes (see links on the course web site). Let me know if I forget to give credit to the right people.
4. Course project
• Topics: the project can derive from one of the papers studied or from your own research.
• Work individually or in pairs.
• Results described as a 4-page CVPR-style paper.
• Short presentation at the end of the semester.
5. Paper presentations (40%)
Email me at the end of class to schedule the next week; we will first decide together how to structure the week.
• Presenter:
– Present the key ideas, background material, and technical details.
– Show me the slides two days before the class.
– Test the basic ideas of the paper(s), using code available online or by writing toy code.
– Create toy test problems that reveal something about the algorithm.
– Offer constructive criticism.
7. 6.870 Grounding object recognition and scene understanding
Lecture 1
Class goals and a short introduction
8. What is vision?
• What does it mean, to see? "To know what is where by looking."
• How to discover from images what is present in the world, where things are, what actions are taking place.
from Marr, 1982
9. The importance of images
Some images are more important than others.
"Dora Maar au Chat", Pablo Picasso, 1941: $100 million.
13. The Plenoptic Function
Adelson & Bergen, 1991
The intensity P can be parameterized as:
P(θ, φ, λ, t, X, Y, Z)
"The complete set of all convergence points constitutes the permanent possibilities of vision." Gibson
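The parameterization above can be made concrete with a toy sketch (my illustration, not from the lecture): a grayscale photograph is a 2-D slice of the 7-D plenoptic function, obtained by fixing the viewpoint (X, Y, Z), the time t, and the wavelength λ, and letting only the viewing direction (θ, φ) vary. The `plenoptic` function here is an arbitrary stand-in scalar field, not a real scene model.

```python
import numpy as np

def plenoptic(theta, phi, wavelength, t, x, y, z):
    """Stand-in for the true plenoptic function: any scalar intensity field."""
    return np.sin(3 * theta) * np.cos(2 * phi) + 0.1 * wavelength + 0.01 * t

def take_photo(x, y, z, t=0.0, wavelength=0.5, n=64):
    """A camera samples P over a grid of ray directions from one viewpoint."""
    thetas = np.linspace(-0.5, 0.5, n)
    phis = np.linspace(-0.5, 0.5, n)
    T, P = np.meshgrid(thetas, phis, indexing="ij")
    return plenoptic(T, P, wavelength, t, x, y, z)

# Fixing five of the seven arguments leaves a 2-D image.
photo = take_photo(x=0.0, y=0.0, z=0.0)
print(photo.shape)  # (64, 64)
```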
51. So, let's make the problem simpler: Block world
Nice framework to develop fancy math, but too far from reality…
Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006
52. Binford and generalized cylinders
Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006
54. Recognition by components
Irving Biederman
Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review, 1987.
55. Recognition by components
The fundamental assumption of the proposed theory, recognition-by-components (RBC), is that a modest set of generalized-cone components, called geons (N = 36), can be derived from contrasts of five readily detectable properties of edges in a two-dimensional image: curvature, collinearity, symmetry, parallelism, and cotermination.
The "contribution lies in its proposal for a particular vocabulary of components derived from perceptual mechanisms and its account of how an arrangement of these components can access a representation of an object in memory."
56. A do-it-yourself example
1) We know that this object is nothing we know.
2) We can split this object into parts that everybody will agree on.
3) We can see how it resembles something familiar: "a hot dog cart".
"The naive realism that emerges in descriptions of nonsense objects may be reflecting the workings of a representational system by which objects are identified."
57. Stages of processing
"Parsing is performed, primarily at concave regions, simultaneously with a detection of nonaccidental properties."
58. Nonaccidental properties
Certain properties of edges in a two-dimensional image are taken by the visual system as strong evidence that the edges in the three-dimensional world contain those same properties.
Nonaccidental properties (Witkin & Tenenbaum, 1983) are rarely produced by accidental alignments of viewpoint and object features, and consequently are generally unaffected by slight variations in viewpoint.
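As a toy illustration of one nonaccidental property (my sketch, not Biederman's or Witkin & Tenenbaum's code): collinearity of edge points in the image is strong evidence of collinearity in 3-D, since only an accidental viewpoint would make a curved 3-D edge project to a straight line. A simple SVD-based test:

```python
import numpy as np

def collinearity_residual(points):
    """Smallest singular value of the centered points: ~0 iff collinear."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    return np.linalg.svd(centered, compute_uv=False)[-1]

def is_collinear(points, tol=1e-6):
    """Flag a set of 2-D edge samples as (non)accidentally collinear."""
    return collinearity_residual(points) < tol

line = [(0, 0), (1, 1), (2, 2), (3, 3)]       # samples from a straight edge
curve = [(0, 0), (1, 1), (2, 1.5), (3, 1.8)]  # samples from a curved edge
print(is_collinear(line), is_collinear(curve))  # True False
```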
60. From generalized cylinders to GEONS
"From variation over only two or three levels in the nonaccidental relations of four attributes of generalized cylinders, a set of 36 GEONS can be generated."
Geons represent a restricted form of generalized cylinders.
64. Parts and Structure approaches
With a different perspective, these models focused more on the
geometry than on defining the constituent elements:
• Fischler & Elschlager 1973
• Yuille ‘91
• Brunelli & Poggio ‘93
• Lades, v.d. Malsburg et al. ‘93
• Cootes, Lanitis, Taylor et al. ‘95
• Amit & Geman ‘95, ‘99
• Perona et al. ‘95, ‘96, ’98, ’00, ’03, ‘04, ‘05
• Felzenszwalb & Huttenlocher ’00, ’04
• Crandall & Huttenlocher ’05, ’06
• Leibe & Schiele ’03, ’04
• Many papers since 2000
Figure from [Fischler & Elschlager 73]
65. But, despite promising initial results, things did not work out so well (lack of data, lack of processing power, and lack of reliable methods for low-level and mid-level vision).
Instead, a different way of thinking about object detection started making some progress: learning-based approaches and classifiers, which ignored low- and mid-level vision.
Maybe the time has come to return to some of the earlier models, more grounded in intuitions about visual perception.
67. Neocognitron
Fukushima (1980). Hierarchical multilayered neural network.
S-cells work as feature-extracting cells. They resemble simple cells of the primary visual cortex in their response.
C-cells, which resemble complex cells in the visual cortex, are inserted in the network to allow for positional errors in the features of the stimulus. The input connections of C-cells, which come from S-cells of the preceding layer, are fixed and invariable. Each C-cell receives excitatory input connections from a group of S-cells that extract the same feature, but from slightly different positions. The C-cell responds if at least one of these S-cells yields an output.
68. Neocognitron
Learning is done greedily for each layer
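The S-cell/C-cell pair can be sketched in a few lines (a toy of mine, not Fukushima's implementation): an S-layer extracts a feature by template correlation, and a C-layer max-pools the S-responses over a neighborhood, which is what gives tolerance to small positional shifts.

```python
import numpy as np

def s_layer(image, template):
    """S-cells: valid cross-correlation of the image with one feature template."""
    th, tw = template.shape
    H, W = image.shape
    out = np.zeros((H - th + 1, W - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+th, j:j+tw] * template)
    return out

def c_layer(responses, pool=2):
    """C-cells: max-pool S-cell responses; a C-cell fires if any S-cell in its window fires."""
    H, W = responses.shape
    out = np.zeros((H // pool, W // pool))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = responses[i*pool:(i+1)*pool, j*pool:(j+1)*pool].max()
    return out

# A vertical edge shifted by one pixel (within a pooling window) produces
# the same C-layer output: positional tolerance.
img = np.zeros((8, 8)); img[:, 2] = 1.0
shifted = np.zeros((8, 8)); shifted[:, 3] = 1.0
template = np.array([[1.0], [1.0]])  # vertical-edge feature
a = c_layer(s_layer(img, template))
b = c_layer(s_layer(shifted, template))
print(np.array_equal(a, b))  # True
```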
70. Face detection and the success of learning based approaches
• The representation and matching of pictorial structures - Fischler, Elschlager (1973)
• Face recognition using eigenfaces - Turk, Pentland (1991)
• Human Face Detection in Visual Scenes - Rowley, Baluja, Kanade (1995)
• Graded Learning for Object Detection - Fleuret, Geman (1999)
• Robust Real-time Object Detection - Viola, Jones (2001)
• Feature Reduction and Hierarchy of Classifiers for Fast Object Detection in Video Images - Heisele, Serre, Mukherjee, Poggio (2001)
• …
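The eigenfaces idea in the list above reduces to standard PCA. A minimal sketch in the spirit of Turk and Pentland (1991), on synthetic stand-in data (the "faces" here are random vectors, not a real dataset): subtract the mean face, take the top principal directions via SVD, and describe each face by a handful of projection coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
faces = rng.random((50, 16 * 16))   # 50 synthetic "faces", each a 16x16 image

mean_face = faces.mean(axis=0)
centered = faces - mean_face
# Rows of Vt are the eigenfaces: principal directions in pixel space.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
k = 10
eigenfaces = Vt[:k]                 # top-k eigenfaces

coeffs = centered @ eigenfaces.T    # k numbers now describe each face
reconstruction = coeffs @ eigenfaces + mean_face
err_k = np.linalg.norm(reconstruction - faces)
err_0 = np.linalg.norm(mean_face - faces)
print(coeffs.shape, err_k < err_0)  # (50, 10) True
```

Recognition then amounts to nearest-neighbor matching in the k-dimensional coefficient space rather than in raw pixel space.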
73. The face age
Feret dataset, 1996 DARPA
74. Rapid Object Detection Using a Boosted
Cascade of Simple Features
Paul Viola Michael J. Jones
Mitsubishi Electric Research Laboratories (MERL)
Cambridge, MA
Most of this work was done at Compaq CRL before the authors moved to MERL
Manuscript available on web:
http://citeseer.ist.psu.edu/cache/papers/cs/23183/http:zSzzSzwww.ai.mit.eduzSzpeoplezSzviolazSzresearchzSzpublicationszSzICCV01-Viola-Jones.pdf/viola01robust.pdf
75. Haar-like filters and cascades
Viola and Jones, ICCV 2001
Using an integral image, the sum (and hence the average intensity) of any rectangular block is computed with just four lookups, independently of the block size.
Also Fleuret and Geman, 2001
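The integral-image trick is simple enough to show directly (a sketch of the standard technique, not the authors' code): one cumulative-sum pass builds the table, after which any rectangle sum costs exactly four array references, so Haar-like features are equally cheap at every scale.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; padded so ii[0, :] = ii[:, 0] = 0."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def block_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] using four lookups in the integral image."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

rng = np.random.default_rng(0)
img = rng.random((20, 20))
ii = integral_image(img)
# Matches the direct sum for an arbitrary block, at constant cost:
print(np.isclose(block_sum(ii, 3, 5, 10, 12), img[3:10, 5:12].sum()))  # True
```

A two-rectangle Haar feature is then just the difference of two such `block_sum` calls.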
78. Families of recognition algorithms
• Rigid template models: Sirovich and Kirby 1987; Turk, Pentland 1991; Heisele, Poggio, et al., NIPS 01; Dalal & Triggs 2006
• Voting models: Viola and Jones, ICCV 2001; Schneiderman, Kanade 2004
• Bag of words models: Csurka, Dance, Fan, Willamowski, Bray 2004; Vidal-Naquet, Ullman 2003; Sivic, Russell, Freeman, Zisserman, ICCV 2005
• Shape matching: Berg, Berg, Malik 2005
• Deformable models: Cootes, Edwards, Taylor 2001
• Constellation models: Fischler and Elschlager 1973; Burl, Leung, and Perona 1995; Weber, Welling, and Perona 2000; Fergus, Perona, & Zisserman, CVPR 2003
79. Scene understanding
Torralba, Sinha (2001); Torralba, Murphy, Freeman (2004); Carbonetto, de Freitas & Barnard (2004); Fink & Perona (2003); Rabinovich et al. (2007); Sudderth, Torralba, Freeman, Willsky (2005); Hoiem, Efros, Hebert (2005); Kumar, Hebert (2005); Desai, Ramanan, and Fowlkes (2009); Choi, Lim, Torralba, Willsky (2010); Heitz and Koller (2008)
98. Data
Human vision
• Many input modalities
• Active
• Supervised, unsupervised, and semi-supervised learning; it can also seek out supervision.
Robot vision
• Many (but poor) input modalities
• Active, but it does not go far
Internet vision
• Many input modalities
• It can reach everywhere
• Tons of data
100. Active stereo with structured light
Li Zhang's one-shot stereo
[Figure: two cameras and a projector viewing the object]
Project "structured" light patterns onto the object
• simplifies the correspondence problem
Li Zhang, Brian Curless, and Steven M. Seitz. Rapid Shape Acquisition Using Color Structured Light and Multi-pass Dynamic Programming. In Proceedings of the 1st International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT), Padova, Italy, June 19-21, 2002, pp. 24-36.
Slide credit: Rick Szeliski, CSE 576, Spring 2008
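Why structured light simplifies correspondence can be shown with a toy binary-coded variant (my sketch; Zhang et al.'s one-shot method instead uses color stripes and dynamic programming): project log2(W) binary stripe patterns so that every projector column receives a unique bit code. Each camera pixel then decodes the column it sees directly, and correspondence becomes a table lookup instead of a search along the epipolar line.

```python
import numpy as np

def stripe_patterns(width, n_bits):
    """Pattern k is bright where bit k of the projector column index is 1."""
    cols = np.arange(width)
    return np.array([(cols >> k) & 1 for k in range(n_bits)])  # (n_bits, width)

def decode(observed_bits):
    """Recover the projector column seen by a pixel from its observed bits."""
    return sum(int(b) << k for k, b in enumerate(observed_bits))

width, n_bits = 256, 8
patterns = stripe_patterns(width, n_bits)

# Simulate a camera pixel that images projector column 137: across the 8
# projected patterns it observes exactly the bits of 137.
seen = patterns[:, 137]
print(decode(seen))  # 137
```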
107. Class goals
• Vision and language
• Vision and robotics
• Vision and others
The strategies our visual system uses are tuned to our visual world.
To provide the right vision tools to non-experts in vision.
Thinking about the tasks to find new representations.