This document summarizes two papers on large-scale object recognition and improving the Fisher kernel method.
The first paper discusses how people can recognize tens of thousands of objects but computers have struggled with this task. It evaluates several datasets and algorithms for classifying over 10,000 image categories. The second paper revisits the Fisher vector representation and proposes normalization and spatial pyramid techniques to improve its performance for large-scale image classification. It evaluates the improved Fisher kernel on several datasets containing thousands to tens of thousands of categories.
3. "What does classifying more than 10,000 image categories tell us?"
• This paper tries to answer the question in its title
Deng, Jia, et al. "What does classifying more than 10,000 image categories tell us?" Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 71-84.
12. From large scale image categorization to entry-level categories
Ordonez, Vicente, et al. "From large scale image categorization to entry-level categories." Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.
13. Motivation
• One image has many possible labels; what should I actually call it?
• The answer: its entry-level category
14. Definition of entry-level category
• The name that most people tend to use
– e.g. Yuan Zai (an individual's name), panda, mammal, Ailuropoda melanoleuca (the scientific name): most people say "panda"
15. To achieve entry-level recognition
• By hypernym?
– Just replace the given output with its hypernym
[Tree diagram: bird, with children sparrow and penguin]
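A toy sketch of this hypernym replacement (the tiny hand-made hierarchy below is hypothetical; the paper works over WordNet):

```python
# Hypothetical mini-hierarchy standing in for WordNet.
HYPERNYM = {
    "sparrow": "bird",
    "penguin": "bird",
    "bird": "animal",
}

def to_hypernym(label, levels=1):
    """Replace a label by walking up the hierarchy a fixed number of levels."""
    for _ in range(levels):
        label = HYPERNYM.get(label, label)  # stop at the root
    return label

print(to_hypernym("sparrow"))            # bird
print(to_hypernym("sparrow", levels=2))  # animal
```

This is exactly the naive strategy the next two slides poke holes in: it treats every child of a node the same way.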
16. Problem 1
• You may call a sparrow a bird, but you may not call a penguin a bird
[Tree diagram: bird, with children sparrow and penguin]
17. Problem 2
• Encyclopedic knowledge vs. common-sense knowledge
– The encyclopedia (WordNet) view: "Tulip is not a kind of flower." The common-sense view: "What a beautiful flower!"
18. Two methods
• Translate the result to an entry-level category
– Image → Classifier → Tulip → Flower
• Directly learn an entry-level classifier
– Image → Classifier → Flower
19. Method 1
• Use a metric for scoring each node in the hierarchy
[Tree diagram: bird, with children sparrow and penguin; the outputs of linear SVMs (0.8, 0.1, 0.9) are attached to the nodes]
20. Method 1
• Add the concept of naturalness
• We want v to be natural, but not so high-level that it loses specificity
– Naturalness: the more often v appears in the Google 1T corpus, the higher φ(v) gets
– Specificity: the max height of the tree under v
21. Method 1
• Combine the two scores
• The experiments are skipped here since there are too many details...
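A minimal sketch of combining the two scores (the corpus counts, subtree heights, and the linear trade-off below are all made up for illustration; the paper's actual scoring is more involved):

```python
import math

# Hypothetical corpus counts and subtree heights; the paper derives
# naturalness from Google 1T n-gram frequencies and specificity from
# the WordNet hierarchy.
CORPUS_COUNT = {"bird": 5_000_000, "sparrow": 200_000, "vertebrate": 50_000}
SUBTREE_HEIGHT = {"bird": 2, "sparrow": 0, "vertebrate": 5}

def naturalness(v):
    # The more often v appears in the corpus, the higher phi(v).
    return math.log(CORPUS_COUNT[v] + 1)

def score(v, lam=1.0):
    # Trade naturalness off against specificity: penalize nodes that
    # sit too high in the tree (large subtree height).
    return naturalness(v) - lam * SUBTREE_HEIGHT[v]

best = max(CORPUS_COUNT, key=score)
print(best)  # bird
```

With these made-up numbers, "bird" beats both the over-specific "sparrow" and the over-general "vertebrate", which is the behavior the trade-off is designed to produce.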
24. An interesting perspective...
• Why can we (as humans) recognize tens of thousands of objects in a really short time?
• Because we have simplified the world; otherwise
– we would process things slowly (computation cost), or
– we would have to take in too much information (memory cost)
(Just my own observation XD)
25. An explanation for the paper
• Different kinds of dolphins have similar properties
– So why bother distinguishing every kind of dolphin?
• Dolphins have properties similar to fish
– So people think they are a kind of fish
26. How do we simplify?
• Hierarchy matters
• But do we follow WordNet?
27. Probably No
• Natural objects
– We identify them by properties
• Artifacts
– We identify them by functionalities
29. Support from the paper
• Even if the result is incorrect, animals tend to be miscategorized as other animals
30. Maybe it's because the logic of making things is different (God vs. human)
• Artifacts are made for humans to use
• Natural objects are made to live their own lives
31. How to implement?
• It is still an open question.
Yao, Bangpeng, Jiayuan Ma, and Li Fei-Fei. "Discovering object functionality." Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.
Woods, Kevin, et al. "Learning membership functions in a function-based object recognition system." J. Artif. Intell. Res. (JAIR) 3 (1995): 187-222.
Weng, Juyang, and Matthew Luciw. "Brain-like emergent spatial processing." Autonomous Mental Development, IEEE Transactions on 4.2 (2012): 161-185.
32. Improving the Fisher Kernel for Large-Scale Image Classification
Perronnin, Florent, Jorge Sánchez, and Thomas Mensink. "Improving the fisher kernel for large-scale image classification." Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 143-156.
33. Fisher vector revisited
• A kind of representation of an image
– Input: a set of local descriptors
– Output: a fixed-length Fisher vector
36. Fisher vector revisited
• For each image, N = 2
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on.
37. Fisher vector revisited
• Since we already know the GMM of that image, we can take derivatives
• Derivatives: how a change of the parameters changes the fit of the GMM to the image
38. Fisher vector revisited
• Concatenating these derivatives, we get the Fisher vector!
• The number of parameters is the same for every image
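The derive-and-concatenate step can be sketched with a toy diagonal GMM (the random descriptors and GMM parameters below are made up; in practice the GMM is trained on real local descriptors such as SIFT; the gradient formulas follow the Fisher kernel of Perronnin & Dance):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, K = 2, 4, 3            # descriptors per image, descriptor dim, Gaussians
X = rng.normal(size=(T, D))  # local descriptors of one image (made up)
w = np.full(K, 1.0 / K)      # mixture weights
mu = rng.normal(size=(K, D))
sigma = np.ones((K, D))      # diagonal standard deviations

def fisher_vector(X, w, mu, sigma):
    T = X.shape[0]
    # Posteriors gamma[t, k]: responsibility of Gaussian k for descriptor t.
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]
    log_p = -0.5 * (diff ** 2).sum(-1) - np.log(sigma).sum(-1) + np.log(w)
    gamma = np.exp(log_p - log_p.max(1, keepdims=True))
    gamma /= gamma.sum(1, keepdims=True)
    # Gradients with respect to the means and standard deviations.
    g_mu = (gamma[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_sig = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    # Concatenate: 2*K*D values, the same length for every image.
    return np.concatenate([g_mu.ravel(), g_sig.ravel()])

fv = fisher_vector(X, w, mu, sigma)
print(fv.shape)  # (24,)
```

Note the key property the slide emphasizes: however many descriptors an image has, the output length depends only on the GMM size (here 2 × 3 × 4 = 24).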
39. Fisher vector revisited
• The form of the Fisher vector
– Local descriptors
– Fisher vector (not normalized)
40. Improvement - L2 Normalization
• Assume the descriptors of a given image follow a distribution p
• p has two parts
– a background part u_λ (image-independent)
– an image-specific part q
– roughly, p = ω q + (1 − ω) u_λ, where ω is the image-specific proportion
41. Improvement - L2 Normalization
• Decompose the vector
42. Improvement - L2 Normalization
• The learning process minimizes the image-independent part
43. Improvement - L2 Normalization
• To remove the dependence on ω, we can L2-normalize the vector
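As a sketch (plain NumPy, not the paper's code): dividing the Fisher vector by its L2 norm cancels any global scale factor such as ω.

```python
import numpy as np

def l2_normalize(fv, eps=1e-12):
    # Divide by the L2 norm; eps guards against an all-zero vector.
    return fv / (np.linalg.norm(fv) + eps)

fv = np.array([3.0, 4.0])
print(l2_normalize(fv))        # [0.6 0.8]
print(l2_normalize(0.5 * fv))  # same result: the scale omega is gone
```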
44. Improvement - Power Normalization
• As the number of Gaussians increases, the Fisher vector becomes sparser
[Histograms of Fisher vector values for 16, 64, and 256 Gaussians]
45. Improvement - Power Normalization
• Apply power normalization to each dimension of the Fisher vector
• α = 0.5 is reasonable for 256 Gaussians
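A sketch of the elementwise signed power normalization (with α = 0.5 this is the "signed square root"; in the improved pipeline it is applied before the L2 step):

```python
import numpy as np

def power_normalize(fv, alpha=0.5):
    # Elementwise f(z) = sign(z) * |z|^alpha; compresses large values
    # and un-sparsifies the vector.
    return np.sign(fv) * np.abs(fv) ** alpha

fv = np.array([4.0, -9.0, 0.0])
print(power_normalize(fv))  # [ 2. -3.  0.]
```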
50. Another thing I want to share
• Deep Learning can be used in robotics!
Lenz, Ian, Honglak Lee, and Ashutosh Saxena. "Deep Learning for Detecting Robotic Grasps." To appear in International Journal of Robotics Research (IJRR), 2014.