IEEE 2013 Conference Paper on Fine-Grained Visual Categorization Using Polynomial Embedding
1. IEEE International Conference on Multimedia & Expo 2013
Augmenting Descriptors for Fine-grained Visual
Categorization Using Polynomial Embedding
Hideki Nakayama
Graduate School of Information Science and Technology
The University of Tokyo
3. Fine-grained visual categorization (FGVC)
Distinguish hundreds of fine-grained objects
under a certain domain
(e.g., species of animals and plants)
Complement to traditional object recognition problems
Caltech-256
[Griffin et al., 2007]
Caltech-Bird-200
[Welinder et al., 2010]
(Example images: generic object recognition (Airplane, Monitor, Dog) vs. FGVC (Yellow Warbler, Prairie Warbler, Pine Warbler))
4. Motivation
We need highly discriminative features to distinguish visually very similar categories, especially at the local level.
5. Two basic ideas
1. Co-occurrence (correlation) of neighboring local descriptors
   Shapelet [Sabzmeydani et al., 2007], Covariance feature [Tuzel et al., 2006], GGV [Harada et al., 2012]
   ☺ Expected to capture mid-level local information
   ☹ Results in high-dimensional local features
2. State-of-the-art bag-of-words representations based on higher-order statistics of local features
   Fisher vector [Perronnin et al., 2010]: 2ND dim; VLAD [Jegou et al., 2010]: ND dim
   (N: number of visual words, D: size of local features)
   ☺ Remarkably high performance; enables linear classification
   ☹ Dimensionality increases linearly with the size of the local features
These two ideas conflict: richer co-occurrence features enlarge D, which blows up the bag-of-words dimensionality.
6. Our approach
Compress polynomial vectors of neighboring local descriptors with supervised dimensionality reduction into a discriminative latent descriptor, encode it by means of a bag-of-words representation (Fisher vector), and classify with logistic regression.

Pipeline: descriptor (e.g. SIFT), densely sampled
→ polynomial vectors (1,000-10,000 dim)
→ CCA (trained with category labels)
→ latent descriptor (64 dim)
→ Fisher vector
→ logistic regression classifier
7. Exploit co-occurrence information

Take the descriptor at the target position, v(x,y) (e.g. SIFT), together with its left neighbor v(x-δ,y) and right neighbor v(x+δ,y), and build the polynomial vector from their outer products:

p^2(x,y) = [ v(x,y);
             upperVec( v(x,y) v(x,y)^T );
             Vec( v(x,y) v(x-δ,y)^T );
             Vec( v(x,y) v(x+δ,y)^T ) ]

Vec(A): flattened vector of matrix A
upperVec(A): flattened upper-triangular part of A
8. Exploit co-occurrence information

More spatial information can be integrated with more neighbors (but the vector becomes high-dimensional):

0-neighbor (2,144 dim):
p^0(x,y) = [ v(x,y); upperVec( v(x,y) v(x,y)^T ) ]

2-neighbors (10,336 dim):
p^2(x,y) = [ v(x,y); upperVec( v(x,y) v(x,y)^T );
             Vec( v(x,y) v(x-δ,y)^T ); Vec( v(x,y) v(x+δ,y)^T ) ]

4-neighbors (18,528 dim), adding the vertical neighbors:
p^4(x,y) = [ v(x,y); upperVec( v(x,y) v(x,y)^T );
             Vec( v(x,y) v(x-δ,y)^T ); Vec( v(x,y) v(x+δ,y)^T );
             Vec( v(x,y) v(x,y-δ)^T ); Vec( v(x,y) v(x,y+δ)^T ) ]

(dimensions shown for 64-dim base descriptors)
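The vectors above can be assembled directly with numpy; `upper_vec` and `polynomial_vector` below are hypothetical helpers, a minimal sketch assuming 64-dim (PCA-compressed) base descriptors:

```python
import numpy as np

def upper_vec(M):
    """Flatten the upper-triangular part (incl. diagonal) of a square matrix."""
    i, j = np.triu_indices(M.shape[0])
    return M[i, j]

def polynomial_vector(v, neighbors=()):
    """Build the polynomial vector for descriptor v at the target position.

    v         : (D,) descriptor at (x, y), e.g. a PCA-compressed SIFT
    neighbors : 0, 2, or 4 descriptors at offsets +/-delta around (x, y)
    """
    parts = [v, upper_vec(np.outer(v, v))]                # v and upperVec(v v^T)
    parts += [np.outer(v, n).ravel() for n in neighbors]  # Vec(v n^T) per neighbor
    return np.concatenate(parts)

D = 64
v = np.random.randn(D)
nbrs = [np.random.randn(D) for _ in range(4)]
print(len(polynomial_vector(v)))            # 0-neighbor  -> 2144
print(len(polynomial_vector(v, nbrs[:2])))  # 2-neighbors -> 10336
print(len(polynomial_vector(v, nbrs)))      # 4-neighbors -> 18528
```

The printed lengths reproduce the slide's 2,144 / 10,336 / 18,528 dimensionalities: D + D(D+1)/2 for the self term, plus D² per neighbor.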
10. Patch feature and label pairs

Supervised dimensionality reduction needs a label for every patch.
Category label: binary occurrence (one-hot) vector; every patch densely sampled from an image of, e.g., Allium triquetrum receives the same label vector (... 0 0 1 0 ...).
Strong supervision assumption: most patches should be related to the content (category).
(Somewhat) justified for FGVC considering the applications: users will target the object, and can sometimes give a segmentation.
(We do not perform manual segmentation in this work, though.)
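A toy sketch of this patch-label pairing, with a hypothetical three-class list standing in for the full category set:

```python
import numpy as np

# Hypothetical class list; in practice the label spans all dataset categories.
classes = ["Allium triquetrum", "Pine Warbler", "Yellow Warbler"]

def label_vector(category):
    """Every patch sampled from an image inherits the image's category
    as a one-hot (binary occurrence) label vector."""
    l = np.zeros(len(classes))
    l[classes.index(category)] = 1.0
    return l

print(label_vector("Allium triquetrum"))  # [1. 0. 0.]
```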
11. Canonical Correlation Analysis (CCA) [Hotelling, 1936]

Supervised dimensionality reduction. CCA finds linear transformations

s = A^T (p - p̄),  t = B^T (l - l̄)

that maximize the correlation between s and t
(p: patch feature (polynomial), l: label feature).

A and B are obtained from the generalized eigenvalue problems

C_pl C_ll^{-1} C_lp A = C_pp A Λ^2,  A^T C_pp A = I
C_lp C_pp^{-1} C_pl B = C_ll B Λ^2,  B^T C_ll B = I

(C: covariance matrices, Λ: canonical correlations).

Latent descriptor: s = A^T (p - p̄) maps the 1,000-10,000-dim polynomial vector into a discriminative 64-dim canonical space shared with the label feature.
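One standard way to solve these CCA equations is to whiten both views and take an SVD of the whitened cross-covariance; a minimal numpy sketch (the `reg` ridge term is my addition for numerical stability, not part of the slide):

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(C)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

def cca(P, L, ndim, reg=1e-6):
    """Fit CCA between patch features P (n x dp) and label features L (n x dl).
    Returns the projection A (p -> latent s) and the canonical correlations."""
    P = P - P.mean(0)
    L = L - L.mean(0)
    n = P.shape[0]
    Cpp = P.T @ P / n + reg * np.eye(P.shape[1])  # ridge for stability
    Cll = L.T @ L / n + reg * np.eye(L.shape[1])
    Cpl = P.T @ L / n
    # Whiten both views; the singular values of the whitened cross-covariance
    # are the canonical correlations (the diagonal of Lambda).
    Wp, Wl = inv_sqrt(Cpp), inv_sqrt(Cll)
    U, s, Vt = np.linalg.svd(Wp @ Cpl @ Wl)
    A = Wp @ U[:, :ndim]  # latent descriptor: s = A^T (p - mean)
    return A, s[:ndim]

# Toy check: two views driven by a shared 3-dim latent signal.
rng = np.random.default_rng(0)
z = rng.standard_normal((500, 3))
P = z @ rng.standard_normal((3, 10)) + 0.1 * rng.standard_normal((500, 10))
L = z @ rng.standard_normal((3, 5)) + 0.1 * rng.standard_normal((500, 5))
A, corrs = cca(P, L, ndim=3)
print(A.shape)  # (10, 3): compresses the 10-dim view to 3 dims
```

On the toy data the leading canonical correlations come out close to 1, since both views share the same latent signal.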
15. Experimental setup

FGVC datasets: Oxford-Flowers-102, Caltech-Bird-200
Descriptors: SIFT, C-SIFT, Opponent-SIFT, Self-Similarity (SSIM), each compressed into 64 dim using several methods
Fisher vector: 64 Gaussians (visual words); global + 3 horizontal spatial regions
Classifier: logistic regression
Evaluation: mean classification accuracy

(Example images from the Flowers and Birds datasets.)
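For reference, a compact numpy sketch of the (improved) Fisher vector encoding used downstream: gradients of the GMM log-likelihood with respect to means and standard deviations, followed by power and L2 normalization. The GMM parameters and descriptors here are random placeholders, not trained values:

```python
import numpy as np

def fisher_vector(X, w, mu, sigma):
    """Encode local descriptors X (T x D) under a diagonal-covariance GMM
    (w: N mixture weights, mu/sigma: N x D) into a 2ND-dim Fisher vector."""
    T = X.shape[0]
    diff = (X[:, None, :] - mu) / sigma                      # T x N x D
    logp = -0.5 * (diff ** 2 + 2 * np.log(sigma)
                   + np.log(2 * np.pi)).sum(-1) + np.log(w)  # T x N
    logp -= logp.max(1, keepdims=True)                       # stabilize exp
    g = np.exp(logp)
    g /= g.sum(1, keepdims=True)                             # posteriors gamma_t(k)
    G_mu = (g[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    G_sig = (g[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([G_mu.ravel(), G_sig.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                   # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                 # L2 normalization

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))                           # 200 latent descriptors
N, D = 64, 64
fv = fisher_vector(X, np.full(N, 1.0 / N),
                   rng.standard_normal((N, D)), np.ones((N, D)))
print(fv.shape)  # (8192,) = 2 * N * D
```

With N = 64 visual words and D = 64 latent dims this gives the 2ND = 8,192-dim code per spatial region mentioned on slide 5.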
16. Results: comparison with PCA and CCA

Our method substantially improves performance for all descriptors.
Just applying CCA to the concatenated neighbors does not improve performance: the polynomial embedding is what matters, acting as a non-linear convolution of neighboring descriptors.

(Bar charts: classification performance (%) on Flower and Bird with different 64-dim embedding methods: PCA (baseline), CCA on 4-neighbors without polynomials, and PolEmb (4-neighbors, ours), for each of SIFT, C-SIFT, Opp.-SIFT, SSIM.)
17. Results: number of neighbors

Including more neighbors improves performance.

(Bar charts: classification performance (%) of our method with 0, 2, and 4 neighbors, on Flower and Bird, for each of SIFT, C-SIFT, Opp.-SIFT, SSIM.)
19. Our final system

Combine four descriptors (SIFT, C-SIFT, Opp.-SIFT, SSIM) in a late-fusion approach: sum the log-likelihoods output by each descriptor's classifier, weighted by its individual confidence.

(Pipeline diagram: each descriptor k = 1..K goes through the same chain as before, dense sampling → polynomial vectors → CCA (trained with category labels) → latent descriptor → Fisher vector → logistic regression classifier; the K classifier outputs are summed to predict the category, e.g. Allium triquetrum.)
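A toy sketch of the late-fusion rule; the weights stand in for the per-classifier confidences, which the slide does not specify, so uniform weighting is used by default:

```python
import numpy as np

def late_fusion(log_likelihoods, weights=None):
    """Combine per-descriptor classifier outputs by a (weighted) sum of
    log-likelihoods and return the index of the predicted category."""
    L = np.stack(log_likelihoods)            # K x n_classes
    if weights is None:
        weights = np.ones(len(L)) / len(L)   # uniform confidence (placeholder)
    fused = (np.asarray(weights)[:, None] * L).sum(0)
    return int(np.argmax(fused))

# Toy example: three descriptor channels scoring 4 classes.
scores = [np.log([0.1, 0.6, 0.2, 0.1]),
          np.log([0.2, 0.5, 0.2, 0.1]),
          np.log([0.3, 0.3, 0.3, 0.1])]
print(late_fusion(scores))  # -> 1
```

Summing log-likelihoods corresponds to multiplying the per-channel class posteriors, i.e. treating the K descriptor channels as independent evidence.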
20. Comparison on FGVC datasets

Our method outperforms previous work on the bird and flower datasets.
For the bird dataset, [32] uses the bounding box only for training images; its result is therefore not directly comparable to ours.

(Table: mean classification accuracy (%) for the PCA baseline, PolEmb, and PCA+PolEmb.)
21. ImageCLEF 2013 Plant Identification

Identify 250 plant species from images of different organs (Leaf, Flower, Fruit, Stem, Entire).
We took 1st place in the Natural Background task, and in 4 of the 5 subtasks.
(Official results coming in Sept. 2013.)

(Example images: Kaki persimmon, Silver birch, Boxelder maple.)
22. Conclusion

A simple but effective method for FGVC:
Embedding co-occurrence patterns of neighboring descriptors yields a discriminative, low-dimensional latent descriptor to use together with the Fisher vector.
Polynomial embedding greatly improves performance, indicating the importance of non-linearity.
The patch-level strong-supervision approximation is not always perfect, but reasonable for FGVC problems.
Future work: theoretical analysis (probabilistic interpretation); multiple-instance dimensionality reduction.