2. Deformable models
• Can take us a long way...
• But not all the way
3. Structure variation
• Objects in rich categories have variable structure
• These are NOT deformations
• There is always something you never saw before
• Mixture of deformable models? too many combined choices
• Bag of words? not enough structure
• Non-parametric? doesn’t generalize
4. Structure variation
• Objects in rich categories have variable structure
• These are NOT deformations
• There is always something you never saw before
• Mixture of deformable models? too many combined choices
• Bag of words? not enough structure
• Non-parametric? doesn’t generalize
→ Compositional model
5. Object detection grammars
• Pictorial structure model with variable structure
• Stochastic context-free grammar
- Generates tree-structured model
- Springs connect symbols along derivation tree
- Appearance model associated with each terminal
6. - person -> face, trunk, arms, lower-part
- face -> hat, eyes, nose, mouth
- face -> eyes, nose, mouth
- hat -> baseball-cap
- hat -> sombrero
- lower-part -> shoe, shoe, legs
- lower-part -> bare-foot, bare-foot, legs
- legs -> pants
- legs -> skirt
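To make the production list concrete, here is a minimal sketch (my own illustration, not the authors' code) that represents these rules as a stochastic context-free grammar in Python and samples one derivation tree. The RULES dictionary and the uniform rule choice are assumptions made only for this example; a real model learns rule scores.

import random

# The productions above as a stochastic context-free grammar (illustration
# only). Rule choice is uniform here; a learned model would score each rule.
RULES = {
    "person":     [["face", "trunk", "arms", "lower-part"]],
    "face":       [["hat", "eyes", "nose", "mouth"], ["eyes", "nose", "mouth"]],
    "hat":        [["baseball-cap"], ["sombrero"]],
    "lower-part": [["shoe", "shoe", "legs"], ["bare-foot", "bare-foot", "legs"]],
    "legs":       [["pants"], ["skirt"]],
}

def sample(symbol):
    """Expand a symbol into a derivation tree, choosing productions at random."""
    if symbol not in RULES:              # terminal symbol: nothing rewrites it
        return symbol
    body = random.choice(RULES[symbol])  # pick one production for this symbol
    return (symbol, [sample(s) for s in body])

print(sample("person"))
# e.g. ('person', [('face', ['eyes', 'nose', 'mouth']), 'trunk', 'arms',
#                  ('lower-part', ['shoe', 'shoe', ('legs', ['skirt'])])])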
7. Person detection grammar
[Figure: Subtype 1 / Subtype 2 filters for Parts 1-6 and the occluder; example detections and derived filters for "Parts 1-6 (no occlusion)", "Parts 1-4 & occluder", and "Parts 1-2 & occluder"]
Figure 1: Shallow grammar model. This figure illustrates a shallow version of our grammar model (Section 2.1). The model has six parts and an occlusion model ("occluder"), each of which comes in one of two subtypes. A detection places one subtype of each visible part at a location and scale in the image. If a derivation does not place all parts it must place the occluder. Parts are allowed to move relative to each other, but their displacements are constrained by deformation penalties.
• Instantiation includes a variable number of parts
• Parts can translate relative to each other
• Parts have subtypes
• Parts have deformable sub-parts (not shown)
• Beats all other methods on PASCAL 2010 (49.5 AP)

We consider models with productions specified by two kinds of schemas (a schema is a template for generating productions). A structure schema specifies one production for each placement ω ∈ Ω,

  X(ω) →_s { Y1(ω ⊕ δ1), ..., Yn(ω ⊕ δn) }   (3)

Here the δi specify constant displacements within the feature map pyramid. Structure schemas can be used to define decompositions of objects into other objects. Let Δ be the set of possible displacements within a single scale of a feature map pyramid. A deformation schema specifies one production for each placement ω ∈ Ω and displacement δ ∈ Δ,

  X(ω) →_{α·φ(δ)} { Y(ω ⊕ δ) }   (4)
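As a small illustration of the deformation schema in (4), the following sketch (my own notation, under the common assumption that φ(δ) = (dx, dy, dx², dy²)) computes the penalty α·φ(δ) for displacing a symbol by δ = (dx, dy). The weights used are made up.

def phi(dx, dy):
    # quadratic deformation features for a displacement (dx, dy)
    return (dx, dy, dx * dx, dy * dy)

def deformation_cost(alpha, dx, dy):
    """Penalty alpha . phi(delta) for displacing a symbol by (dx, dy)."""
    return sum(a * f for a, f in zip(alpha, phi(dx, dy)))

# Example: a nearly isotropic quadratic "spring" (weights are invented).
alpha = (0.0, 0.0, 0.05, 0.05)
print(deformation_cost(alpha, 3, -2))  # 0.05*9 + 0.05*4 = 0.65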
8. Building the model

The shallow person grammar model is defined by the following grammar. The indices p (for part), t (for subtype), and k have the ranges p ∈ {1, ..., 6}, t ∈ {L, R} and k ∈ {1, ..., 5}.

  Q(ω) →_{s_k} { Y1(ω ⊕ δ1), ..., Yk(ω ⊕ δk), O(ω ⊕ δ_{k+1}) }
  Q(ω) →_{s_6} { Y1(ω ⊕ δ1), ..., Y6(ω ⊕ δ6) }
  Yp(ω) →_0 { Yp,t(ω) }        Yp,t(ω) →_{αp,t·φ(δ)} { Ap,t(ω ⊕ δ) }
  O(ω) →_0 { Ot(ω) }           Ot(ω) →_{αt·φ(δ)} { At(ω ⊕ δ) }

The grammar has a start symbol Q with six alternate choices that derive people under varying degrees of visibility (occlusion). Each part has a corresponding nonterminal Yp that is placed at some ideal position relative to Q. Derivations with occlusion include the occlusion symbol O. A derivation selects a subtype and displacement for each visible part. The parameters of the grammar (production scores, deformation parameters and filters) are learned with the discriminative procedure described in Section 4. Figure 1 illustrates the filters in the resulting model and some example detections.

Part subtypes: The mixture model from [12] has two subtypes for each mixture component. The subtypes are forced to be mirror images of each other and correspond roughly to left-facing and right-facing people. Our grammar model has two subtypes for each part, which are also forced to be mirror images of each other. But in the case of our grammar model, the decision of which part subtype to instantiate at detection time is independent for each part.

Deeper model: We extend the shallow model by adding deformable subparts at two scales: (1) the same as, and (2) twice the resolution of the start symbol Q. When detecting large objects, high-resolution subparts capture fine image details. However, when detecting small objects, high-resolution subparts cannot be used because they "fall off the bottom" of the feature map pyramid. The model uses derivations with low-resolution subparts when detecting small objects.

We begin by replacing the productions from Yp,t in the grammar above, and then adding new productions. Recall that p indexes the top-level parts and t indexes subtypes. In the following schemas, the indices r (for resolution) and u (for subpart) have the ranges r ∈ {H, L} and u ∈ {1, ..., Np}, where Np is the number of subparts in a top-level part Yp.

  Yp,t(ω) →_{αp,t·φ(δ)} { Zp,t(ω ⊕ δ) }
  Zp,t(ω) →_0 { Ap,t(ω), Wp,t,r,1(ω ⊕ δp,t,r,1), ..., Wp,t,r,Np(ω ⊕ δp,t,r,Np) }
  Wp,t,r,u(ω) →_{αp,t,r,u·φ(δ)} { Ap,t,r,u(ω ⊕ δ) }

As in [22], the model has hierarchical deformations: the part terminal Ap,t can move relative to Q, and the subpart terminals Ap,t,r,u can move relative to Ap,t. The displacements δp,t,H,u place the symbols Wp,t,H,u one octave below Zp,t in the feature map pyramid, while the displacements δp,t,L,u place the symbols Wp,t,L,u at the same scale as Zp,t. We add subparts to the first two top-level parts (p = 1 and 2), with the number of subparts set to N1 = 3 and N2 = 2. We find that adding additional subparts does not improve detection performance.

Inference involves finding high-scoring derivations. At test time, because images may contain multiple instances of an object class, we compute the maximum scoring derivation rooted at Q(ω) for each ω ∈ Ω. This can be done efficiently using a standard dynamic programming algorithm [11].
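The dynamic program mentioned above can be sketched on a toy 1-D version of the problem. The code below is my own sketch, not the released detection code: it brute-forces the max over displacements, assumes an acyclic grammar processed children-first, and uses a simple quadratic deformation penalty.

NEG = float("-inf")

def infer(placements, appearance, structure_rules, deformation_rules,
          displacements, order):
    """appearance[A][w]: filter score of terminal A at placement w.
    structure_rules: (head, rule_score, [(child, offset), ...]).
    deformation_rules: (head, alpha, child), quadratic penalty alpha * d * d.
    order: non-terminals listed children-first (the grammar is acyclic)."""
    score = {(A, w): s for A, ws in appearance.items() for w, s in ws.items()}
    for X in order:
        for w in placements:
            best = NEG
            for head, s, body in structure_rules:
                if head == X:
                    best = max(best, s + sum(score.get((c, w + off), NEG)
                                             for c, off in body))
            for head, alpha, child in deformation_rules:
                if head == X:
                    for d in displacements:
                        best = max(best,
                                   score.get((child, w + d), NEG) - alpha * d * d)
            score[(X, w)] = best
    return score

# Toy example: one part Y derived from the root Q with a quadratic spring.
apps = {"A": {w: 1.0 - 0.2 * abs(w - 2) for w in range(5)}}
scores = infer(range(5), apps, [("Q", 0.0, [("Y", 0)])],
               [("Y", 0.1, "A")], range(-2, 3), ["Y", "Q"])
print(max(scores[("Q", w)] for w in range(5)))  # best root score: 1.0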
9. Salient contours
[Figure 15: Running time of different search algorithms as a function of the problem size. Each sample point indicates the average running time taken over 200 random inputs. In each case N = 20 and [...] = 100.]
• Curve(a,b) + Curve(b,c) --> Curve(a,c)
Felzenszwalb & McAllester
Figure 16: A curve with endpoints (a, c) is formed by composing curves with endpoints (a, b) and (b, c); t is the angle formed at b. We assume that t ≥ π/2. The cost of the composition is proportional to sin²(t). This cost is scale invariant and encourages curves to be relatively straight.
We assume that these short curves are straight, and their weight depends only on the image data along the line segment from a to b. We use a data term, seg(a, b), that is zero if the image gradient along pixels in ab is perpendicular to ab, and higher otherwise.
Figure 17 gives a formal definition of the two rules in our model. The constants k1 and k2 specify the minimum and maximum length of the base case curves, while L is a constant [...]
Figure 20: An example where the most salient curve goes over locations with essentially no local evidence for the curve at those locations.
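To illustrate the composition cost, here is a small helper (my own, not the paper's code) that charges sin²(t) for joining Curve(a, b) and Curve(b, c) at b, taking t as the turning angle at b (0 for a straight continuation); the figure's angle convention may differ.

import math

def turn_angle(a, b, c):
    # angle between the direction of segment ab and the direction of segment bc
    ux, uy = b[0] - a[0], b[1] - a[1]
    vx, vy = c[0] - b[0], c[1] - b[1]
    return abs(math.atan2(ux * vy - uy * vx, ux * vx + uy * vy))

def composition_cost(a, b, c, weight=1.0):
    # cost of joining Curve(a, b) and Curve(b, c): proportional to sin^2 of the turn
    return weight * math.sin(turn_angle(a, b, c)) ** 2

print(composition_cost((0, 0), (1, 0), (2, 0)))  # 0.0  straight continuation
print(composition_cost((0, 0), (1, 0), (1, 1)))  # 1.0  right-angle turn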
10. Shapes / Regions
Random shapes
Samples from stochastic context-free shape grammar
Example results
“Matching” to images
(samples from posterior)
11. Processing pipeline
[Diagram: processing stages — Pixels → Edges → Contours → Regions → Objects]
• Vision systems have multiple processing stages
• Compositional model: each stage builds structures by grouping
structures from previous stages
- Single parsing problem
- Avoids intermediate decisions
(high-level information influences low-level interpretations)
12. Computation
• Context-free or Context-sensitive?
• Even context-free models lead to hard parsing problem
- Too many constituents!
GETIKDSWOWZQE
- A string of length n has O(n²) substrings
- An image with n pixels has O(2^n) regions
13. Alternative parsing problems
1. Whole image parsing
- Explains every pixel exactly once
- Hard
2. Find light derivations within an image
- Expansion of start symbol into terminals
- Explains part of the image
- May explain the same pixel more than once
- Efficient
[Example parse with labels: room, wall, floor, chest, shelves, pictures, book, book, ..., book]
14. Computation
• Bottom-up
- Repeated grouping of structures (KLD / A*LD)
• Top-down
- Repeated refinement with backtracking (AO*)
• Bottom-up + Top-down
- Bottom-up computation guided by top-down influence
- Coarse derivations provide heuristic guidance
for finding finer structures (HA*LD)
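A compact sketch of the bottom-up strategy, in the spirit of KLD (Knuth's lightest derivation): statements are derived by rules with nonnegative costs, and a priority queue expands the lightest derivable statement first, as in Dijkstra's algorithm. This is my own simplified illustration; the rule set and costs below are invented for the example.

import heapq

def lightest_derivation(rules, goal):
    """rules: (cost, [antecedents], conclusion); axioms have no antecedents."""
    best = {}                                      # lightest cost found so far
    queue = [(c, concl) for c, ants, concl in rules if not ants]
    heapq.heapify(queue)
    while queue:
        cost, stmt = heapq.heappop(queue)
        if stmt in best:
            continue                               # already expanded with a lighter cost
        best[stmt] = cost
        if stmt == goal:
            return cost
        for c, ants, concl in rules:               # fire any newly enabled rule
            if concl not in best and all(a in best for a in ants):
                heapq.heappush(queue, (c + sum(best[a] for a in ants), concl))
    return None

rules = [(1.0, [], "edge1"), (2.0, [], "edge2"),
         (0.5, ["edge1", "edge2"], "contour"),
         (0.2, ["contour"], "object")]
print(lightest_derivation(rules, "object"))        # 3.7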
15. Coarse-to-fine
• Model abstraction f : Si --> Si+1
- lower resolution
- coarsen labels
horse --> animal --> piecewise smooth object
• Coarse computation guides finer computation
[Figure 8: A vision system with several levels of processing (Edges → Contours → Recognition) repeated at abstraction levels 0, 1, ..., m. Forward arrows represent the [...]]
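As a toy illustration of such an abstraction, the sketch below coarsens labels along the lines of "horse → animal → piecewise smooth object" and takes the minimum fine-level cost covered by an abstract label, so the coarse cost never overestimates the fine cost and can serve as heuristic guidance. The label table and costs are made up for the example.

ABSTRACT = {"horse": "animal", "cow": "animal", "car": "rigid object",
            "animal": "piecewise smooth object",
            "rigid object": "piecewise smooth object"}

def coarsen(label, levels=1):
    # apply the abstraction function f the given number of times
    for _ in range(levels):
        label = ABSTRACT.get(label, label)
    return label

fine_costs = {"horse": 4.0, "cow": 5.5, "car": 7.0}   # invented fine-level costs

def coarse_cost(abstract_label):
    # cost of an abstract label = min over the fine labels it covers,
    # so it never overestimates any fine cost (usable as a heuristic)
    covered = [c for lbl, c in fine_costs.items()
               if coarsen(lbl) == abstract_label]
    return min(covered) if covered else float("inf")

print(coarsen("horse", 2))      # piecewise smooth object
print(coarse_cost("animal"))    # 4.0, a lower bound for any animal hypothesis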
16. Challenges
• Whole image parsing (with context-free grammars)
- Restrict possible constituents
- LP relaxation
- DDMCMC
• Learn object grammars from weakly labeled data
- PASCAL VOC
• Build a complete processing pipeline unifying
segmentation and recognition