This document summarizes two papers on large-scale object recognition and improving the Fisher kernel method.
The first paper discusses how people can recognize tens of thousands of objects but computers have struggled with this task. It evaluates several datasets and algorithms for classifying over 10,000 image categories. The second paper revisits the Fisher vector representation and proposes normalization and spatial pyramid techniques to improve its performance for large-scale image classification. It evaluates the improved Fisher kernel on several datasets containing thousands to tens of thousands of categories.
3. "What does classifying more than 10,000 image categories tell us?"
• This paper tries to answer the question in its title
Deng, Jia, et al. "What does classifying more than 10,000 image categories tell us?" Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 71-84.
12. From large scale image categorization to entry-level categories
Ordonez, Vicente, et al. "From large scale image categorization to entry-level categories." Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.
13. Motivation
• One image has many possible labels; what should I actually call it?
• The answer: its entry-level category
14. Definition of entry-level category
• The name that most people tend to use
– e.g. Yuan Zai (an individual's name), panda, mammal, Ailuropoda melanoleuca (the scientific name): most people say "panda"
15. To achieve entry-level recognition
• By hypernym?
– Just replace the given output with its hypernym
[Tree diagram: bird, with children sparrow and penguin]
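A toy sketch of this hypernym replacement (the tiny hand-made hierarchy below is hypothetical; the paper works over WordNet):

```python
# Hypothetical mini-hierarchy standing in for WordNet.
HYPERNYM = {
    "sparrow": "bird",
    "penguin": "bird",
    "bird": "animal",
}

def to_hypernym(label, levels=1):
    """Replace a label by walking up the hierarchy a fixed number of levels."""
    for _ in range(levels):
        label = HYPERNYM.get(label, label)  # stop at the root
    return label

print(to_hypernym("sparrow"))            # bird
print(to_hypernym("sparrow", levels=2))  # animal
```

This is exactly the naive strategy the next two slides poke holes in: it treats every child of a node the same way.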
16. Problem 1
• You may call a sparrow a bird, but you may not call a penguin a bird
[Tree diagram: bird, with children sparrow and penguin]
17. Problem 2
• Encyclopedic knowledge vs. common-sense knowledge
– The encyclopedia (WordNet) view: "Tulip is not a kind of flower." The common-sense view: "What a beautiful flower!"
18. Two methods
• Translate the result to an entry-level category
– Image → Classifier → Tulip → Flower
• Directly learn an entry-level classifier
– Image → Classifier → Flower
19. Method 1
• Use a metric for scoring each node in the hierarchy
[Tree diagram: bird, with children sparrow and penguin; the outputs of linear SVMs (0.8, 0.1, 0.9) are attached to the nodes]
20. Method 1
• Add the concept of naturalness
• We want v to be natural, but not so high-level that it loses specificity
– Naturalness: the more often v appears in the Google 1T corpus, the higher φ(v) gets
– Specificity: the max height of the tree under v
21. Method 1
• Combine the two scores
• The experiments are skipped here since there are too many details...
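A minimal sketch of combining the two scores (the corpus counts, subtree heights, and the linear trade-off below are all made up for illustration; the paper's actual scoring is more involved):

```python
import math

# Hypothetical corpus counts and subtree heights; the paper derives
# naturalness from Google 1T n-gram frequencies and specificity from
# the WordNet hierarchy.
CORPUS_COUNT = {"bird": 5_000_000, "sparrow": 200_000, "vertebrate": 50_000}
SUBTREE_HEIGHT = {"bird": 2, "sparrow": 0, "vertebrate": 5}

def naturalness(v):
    # The more often v appears in the corpus, the higher phi(v).
    return math.log(CORPUS_COUNT[v] + 1)

def score(v, lam=1.0):
    # Trade naturalness off against specificity: penalize nodes that
    # sit too high in the tree (large subtree height).
    return naturalness(v) - lam * SUBTREE_HEIGHT[v]

best = max(CORPUS_COUNT, key=score)
print(best)  # bird
```

With these made-up numbers, "bird" beats both the over-specific "sparrow" and the over-general "vertebrate", which is the behavior the trade-off is designed to produce.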
24. An interesting perspective...
• Why can we (as humans) recognize tens of thousands of objects in a really short time?
• Because we have simplified the world; otherwise
– we would process things slowly (computation cost), or
– we would have to take in too much information (memory cost)
(Just my own observation XD)
25. An explanation for the paper
• Different kinds of dolphins have similar properties
– So why bother distinguishing every kind of dolphin?
• Dolphins have properties similar to fish
– So people think they are a kind of fish
26. How do we simplify?
• Hierarchy matters
• But do we follow WordNet?
27. Probably No
• Natural objects
– We identify them by properties
• Artifacts
– We identify them by functionalities
29. Support from the paper
• Even if the result is incorrect, animals tend to be miscategorized as other animals
30. Maybe it's because the logic of making things is different (God vs. human)
• Artifacts are made for humans to use
• Natural objects are made to live their own lives
31. How to implement?
• It is still an open question.
Yao, Bangpeng, Jiayuan Ma, and Li Fei-Fei. "Discovering object functionality." Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.
Woods, Kevin, et al. "Learning membership functions in a function-based object recognition system." J. Artif. Intell. Res. (JAIR) 3 (1995): 187-222.
Weng, Juyang, and Matthew Luciw. "Brain-like emergent spatial processing." Autonomous Mental Development, IEEE Transactions on 4.2 (2012): 161-185.
32. Improving the Fisher Kernel for Large-Scale Image Classification
Perronnin, Florent, Jorge Sánchez, and Thomas Mensink. "Improving the fisher kernel for large-scale image classification." Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 143-156.
33. Fisher vector revisited
• A kind of representation of an image
– Input: a set of local descriptors
– Output: a fixed-length Fisher vector
36. Fisher vector revisited
• For each image, N = 2
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on.
37. Fisher vector revisited
• Since we already know the GMM of that image, we can take derivatives
• Derivatives: how a change of the parameters changes the fit of the GMM to the image
38. Fisher vector revisited
• Concatenating these derivatives, we get the Fisher vector!
• The number of parameters is the same for every image
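The derive-and-concatenate step can be sketched with a toy diagonal GMM (the random descriptors and GMM parameters below are made up; in practice the GMM is trained on real local descriptors such as SIFT; the gradient formulas follow the Fisher kernel of Perronnin & Dance):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, K = 2, 4, 3            # descriptors per image, descriptor dim, Gaussians
X = rng.normal(size=(T, D))  # local descriptors of one image (made up)
w = np.full(K, 1.0 / K)      # mixture weights
mu = rng.normal(size=(K, D))
sigma = np.ones((K, D))      # diagonal standard deviations

def fisher_vector(X, w, mu, sigma):
    T = X.shape[0]
    # Posteriors gamma[t, k]: responsibility of Gaussian k for descriptor t.
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]
    log_p = -0.5 * (diff ** 2).sum(-1) - np.log(sigma).sum(-1) + np.log(w)
    gamma = np.exp(log_p - log_p.max(1, keepdims=True))
    gamma /= gamma.sum(1, keepdims=True)
    # Gradients with respect to the means and standard deviations.
    g_mu = (gamma[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_sig = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    # Concatenate: 2*K*D values, the same length for every image.
    return np.concatenate([g_mu.ravel(), g_sig.ravel()])

fv = fisher_vector(X, w, mu, sigma)
print(fv.shape)  # (24,)
```

Note the key property the slide emphasizes: however many descriptors an image has, the output length depends only on the GMM size (here 2 × 3 × 4 = 24).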
39. Fisher vector revisited
• The form of the Fisher vector
– Local descriptors
– Fisher vector (not normalized)
40. Improvement - L2 Normalization
• Assume the descriptors of a given image follow a distribution p
• p has two parts
– a background part u_λ (image-independent)
– an image-specific part q
– roughly, p = ω q + (1 − ω) u_λ, where ω is the image-specific proportion
41. Improvement - L2 Normalization
• Decompose the vector
42. Improvement - L2 Normalization
• The learning process minimizes the image-independent part
43. Improvement - L2 Normalization
• To remove the dependence on ω, we can L2-normalize the vector
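As a sketch (plain NumPy, not the paper's code): dividing the Fisher vector by its L2 norm cancels any global scale factor such as ω.

```python
import numpy as np

def l2_normalize(fv, eps=1e-12):
    # Divide by the L2 norm; eps guards against an all-zero vector.
    return fv / (np.linalg.norm(fv) + eps)

fv = np.array([3.0, 4.0])
print(l2_normalize(fv))        # [0.6 0.8]
print(l2_normalize(0.5 * fv))  # same result: the scale omega is gone
```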
44. Improvement - Power Normalization
• As the number of Gaussians increases, the Fisher vector becomes sparser
[Histograms of Fisher vector values for 16, 64, and 256 Gaussians]
45. Improvement - Power Normalization
• Apply power normalization to each dimension of the Fisher vector
• α = 0.5 is reasonable for 256 Gaussians
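A sketch of the elementwise signed power normalization (with α = 0.5 this is the "signed square root"; in the improved pipeline it is applied before the L2 step):

```python
import numpy as np

def power_normalize(fv, alpha=0.5):
    # Elementwise f(z) = sign(z) * |z|^alpha; compresses large values
    # and un-sparsifies the vector.
    return np.sign(fv) * np.abs(fv) ** alpha

fv = np.array([4.0, -9.0, 0.0])
print(power_normalize(fv))  # [ 2. -3.  0.]
```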
50. Another thing I want to share
• Deep Learning can be used in robotics!
Lenz, Ian, Honglak Lee, and Ashutosh Saxena. "Deep Learning for Detecting Robotic Grasps." To appear in International Journal of Robotics Research (IJRR), 2014.