Fine-Grained Image Analysis
ICME Tutorial
Jul. 8, 2019 Shanghai, China
Xiu-Shen WEI
Megvii Research Nanjing, Megvii Inc.
IEEE ICME2019
Outline
Part 1
Part 2
Background about CV, DL, and image analysis
Introduction of fine-grained image analysis
Fine-grained image retrieval
Fine-grained image recognition
Other computer vision tasks related to fine-grained image analysis
New developments of fine-grained image analysis
http://www.weixiushen.com/
Part 1
Background
A brief introduction of computer vision
Traditional image recognition and retrieval
Deep learning and convolutional neural networks
Introduction
Fine-grained images vs. generic images
Various real-world applications of fine-grained images
Challenges of fine-grained image analysis
Fine-grained benchmark datasets
Fine-grained image retrieval
Fine-grained image retrieval based on hand-crafted features
Fine-grained image retrieval based on deep learning
http://www.weixiushen.com/
Background
What is computer vision?
The goal of computer vision
• To extract “meaning” from pixels
What we see vs. what a computer sees
Source: S. Narasimhan
http://www.weixiushen.com/
Background (con’t)
Why study computer vision?
CV is useful
CV is interesting
CV is difficult
…
Finger reader
Image captioning
Crowds and occlusions
http://www.weixiushen.com/
Background (con’t)
Successes of computer vision to date
Optical character recognition Biometric systems
Face recognition Self-driving cars
http://www.weixiushen.com/
Top-tier CV conferences/journals and prizes
Marr Prize
Background (con’t)
http://www.weixiushen.com/
Traditional image recognition and image retrieval
Traditional approaches:
Image → vision features → Recognition
Audio → audio features → Speaker identification
Data → Feature extraction → Learning algorithm
Recognition: SVM, Random Forest, …
Retrieval: k-NN, …
Background (con’t)
http://www.weixiushen.com/
Computer vision features
SIFT, HoG, RIFT, Textons, GIST
Deep learning
Background (con’t)
http://www.weixiushen.com/
Deep learning and convolutional neural networks
Before:
Now:
input layer → hidden layer → output layer
Figures are courtesy of L. Fei-Fei.
“Rome was not built in one day!”
Background (con’t)
http://www.weixiushen.com/
All neural net activations arranged in 3-dimension:
Similarly, if the input tensor of a 3-D convolutional layer l is x^l ∈ R^{H^l × W^l × D^l}, then the convolution kernel of this layer is f^l ∈ R^{H × W × D^l}. In the 3-D case, the convolution operation simply extends the 2-D convolution to all channels at the corresponding positions (i.e., D^l), and the H·W·D^l elements covered by one convolution are summed to give the result at that position, as illustrated in Figure 2-5 (left: a 3 × 4 × 3 convolution kernel over the input data; right: the 1 × 1 × 1 output obtained at that position).
Furthermore, if there are D kernels like f^l, then at the same position we obtain a convolution output of dimension 1 × 1 × 1 × D, and D is exactly the number of channels D^{l+1} of the feature x^{l+1} of layer l + 1. Formally, the convolution operation can be written as

y_{i^{l+1}, j^{l+1}, d} = \sum_{i=0}^{H} \sum_{j=0}^{W} \sum_{d^l=0}^{D^l} f_{i, j, d^l, d} \times x^l_{i^{l+1}+i,\, j^{l+1}+j,\, d^l} ,   (2.1)

where (i^{l+1}, j^{l+1}) are the spatial coordinates of the convolution result, satisfying 0 ≤ i^{l+1} < H^l − H + 1 = H^{l+1} and 0 ≤ j^{l+1} < W^l − W + 1 = W^{l+1}.
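To make Eq. (2.1) concrete, here is a minimal NumPy sketch (not from the tutorial) that computes one output channel of a stride-1, valid 3-D convolution by summing the H × W × D^l products at every spatial position; the toy sizes and array names are made up for illustration.

```python
import numpy as np

# Input tensor x^l of size H_l x W_l x D_l and one kernel f of size H x W x D_l
# (toy sizes chosen only for the example).
Hl, Wl, Dl = 5, 5, 3
H, W = 3, 3
x = np.random.rand(Hl, Wl, Dl)
f = np.random.rand(H, W, Dl)

# Valid convolution with stride 1: the output is (Hl-H+1) x (Wl-W+1).
y = np.zeros((Hl - H + 1, Wl - W + 1))
for i_out in range(y.shape[0]):
    for j_out in range(y.shape[1]):
        # Eq. (2.1): sum over the H x W x D_l window at this position.
        window = x[i_out:i_out + H, j_out:j_out + W, :]
        y[i_out, j_out] = np.sum(window * f)

print(y.shape)  # (3, 3): one output channel; D kernels would give D channels
```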
Background (con’t)
http://www.weixiushen.com/
Unit processing of CNNs
Figure: step-by-step example of the convolution operation on a 5 × 5 input with a 3 × 3 kernel, showing (b) the 2nd and (d) the 9th convolution steps and the resulting convolved features.
Convolution
Max-pooling: y_{i^{l+1}, j^{l+1}, d} = \max_{0 \le i < H,\, 0 \le j < W} x^l_{i^{l+1} \times H + i,\, j^{l+1} \times W + j,\, d^l} ,   (2.6)
where 0 ≤ i^{l+1} < H^{l+1}, 0 ≤ j^{l+1} < W^{l+1}, and 0 ≤ d < D^{l+1} = D^l.
Figure 2-7 shows an example of the max-pooling operation with a 2 × 2 window and stride 1.
Figure: example of the max-pooling operation, showing (a) the 1st and (b) the 16th pooling steps and the resulting pooled features.
Besides the two most common pooling operations above, stochastic pooling [94] lies in between. Stochastic pooling is very simple: elements of the input are randomly selected with probabilities proportional to their values, rather than always taking the maximum element as max-pooling does. Under stochastic pooling, elements with larger responses (activations) are more likely to be selected, and vice versa. In a global sense, stochastic pooling is approximately equivalent to average-pooling; in a local sense …
Pooling
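As a small illustration of the max-pooling in Eq. (2.6), the sketch below pools non-overlapping H × W windows of a toy activation tensor; the function name and sizes are illustrative only.

```python
import numpy as np

def max_pool(x, H=2, W=2):
    """Non-overlapping max-pooling per Eq. (2.6): each output element is the
    maximum of an H x W window of the corresponding input channel."""
    h, w, d = x.shape
    out = np.zeros((h // H, w // W, d))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j, :] = x[i * H:(i + 1) * H, j * W:(j + 1) * W, :].max(axis=(0, 1))
    return out

x = np.random.rand(4, 4, 3)     # toy activation tensor
print(max_pool(x).shape)        # (2, 2, 3): channels are pooled independently
```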
ReLU(x) = max{0, x}   (8.3)
        = x if x ≥ 0, and 0 if x < 0.   (8.4)
Compared with the previous two activation functions, the gradient of ReLU is 1 for x ≥ 0 and 0 otherwise (as shown in Figure 8-3); for x ≥ 0 it completely eliminates the gradient-saturation effect of sigmoid-type functions. ReLU is also computationally simpler than the exponential functions of the former two. Moreover, experiments show that ReLU helps stochastic gradient descent converge, roughly 6× faster [52]. However, ReLU has its own drawback: for x < 0 the gradient is 0. In other words, once a convolution response becomes negative, it can no longer affect network training; this phenomenon is known as the "dead zone".
Figure 8-3: (a) the ReLU function and (b) its gradient.
Non-linear activation function
Background (con’t)
http://www.weixiushen.com/
CNN architecture
CONV → ReLU → CONV → ReLU → POOL → CONV → ReLU → CONV → ReLU → POOL → CONV → ReLU → CONV → ReLU → POOL → FC (Fully-connected)
Figures are courtesy of L. Fei-Fei.
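The CONV–ReLU–POOL–FC pattern above can be sketched, for example, as a small PyTorch module; the channel widths and the 10-way classifier below are arbitrary placeholders rather than the network shown in the figure.

```python
import torch
import torch.nn as nn

# A small CONV-ReLU-(POOL) stack ending in a fully-connected classifier,
# mirroring the CONV/ReLU/POOL/FC pattern above (sizes are arbitrary).
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(128 * 28 * 28, 10),   # FC layer; assumes 224x224 inputs
)

x = torch.randn(1, 3, 224, 224)
print(model(x).shape)  # torch.Size([1, 10])
```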
Background (con’t)
http://www.weixiushen.com/
Traditional image recognition (coarse-grained): meta classes such as Bird, Dog, Orange, …
Fine-grained image recognition: sub-classes within "Dog" such as Husky, Satsuma, Alaska, …
Introduction
Fine-grained images vs. generic images
http://www.weixiushen.com/
Various real-world applications
Introduction (con’t)
http://www.weixiushen.com/
Various real-world applications
Introduction (con’t)
http://www.weixiushen.com/
Various real-world applications
Introduction (con’t)
http://www.weixiushen.com/
Results
1. Megvii Research Nanjing
a. Error = 0.10267
2. Alibaba Machine Intelligence Technology Lab
a. Error = 0.11315
3. General Dynamics Mission Systems
a. Error = 0.12678
FGVC6
Sponsored by
This certificate is awarded to
Bo-Yan Zhou, Bo-Rui Zhao, Quan Cui, Yan-Ping Xie,
Zhao-Min Chen, Ren-Jie Song, and Xiu-Shen Wei
Megvii Research Nanjing
iNat2019
winners of the iNaturalist 2019 image
classification challenge held in conjunction with
the FGVC workshop at CVPR 2019.
Genera with at least 10 species
Herbarium Challenge 2019 (top teams)
● #1 Megvii Research Nanjing (89.8%)
○ Boyan Zhou, Quan Cui, Borui Zhao, Yanping Xie,
Renjie Song
● #2 PEAK (89.1%)
○ Chunqiao Xu, Shao Zeng, Qiule Sun, Shuyu Ge,
Peihua Li (Dalian University of Technology)
● #3 Miroslav Valan (89.0%)
○ Swedish Museum of Natural History
● #4 Hugo Touvron (88.9%)
○ Hugo Touvron and Andrea Vedaldi (Facebook AI
Research)
Various real-world applications
Introduction (con’t)
http://www.weixiushen.com/
Herbarium Challenge 2019
Kiat Chuan Tan, Yulong Liu, Barbara Ambrose, Melissa Tulig, Serge Belongie
Google Research; New York Botanical Garden; Cornell Tech
Various real-world applications
Introduction (con’t)
http://www.weixiushen.com/
Various real-world applications
Introduction (con’t)
http://www.weixiushen.com/
Various real-world applications
Introduction (con’t)
http://www.weixiushen.com/
Challenge of fine-grained
image analysis
Small inter-class variance
Large intra-class variance
Figure 1. Examples from CUB-200-2011 (Heermann Gull, Herring Gull, Slaty-backed Gull, Western Gull), illustrating the large intra-class variance and small inter-class variance of fine-grained sub-categories. Note that it is a technically challenging task even for humans to categorize them.
Introduction (con’t)
Figures are courtesy of [X. He and Y. Peng, CVPR 2017]. http://www.weixiushen.com/
The key of fine-grained image analysis
Introduction (con’t)
http://www.weixiushen.com/
Fine-grained benchmark datasets
The CUB200-2011 dataset contains 200 bird sub-classes and 11,788 images (5,994 training images and 5,794 test images); each image provides an image-level label, an object bounding box, part key points, and bird attribute annotations. Sample images are shown in Figure 2-12. In addition, Birdsnap [6] is a large-scale fine-grained bird dataset proposed by researchers at Columbia University in 2014; it contains 500 bird sub-classes and 49,829 images (47,386 training images and 2,443 test images) and, similar to CUB200-2011, provides image-level labels, object bounding boxes, part key points, and attribute annotations for each image.
Figure 2-12: sample images from the CUB200-2011 [75] dataset.
CUB200-2011
[C. Wah et al., CNS-TR-2011-001, 2011]
11,788 images, 200 fine-grained classes
Introduction (con’t)
http://www.weixiushen.com/
Fine-grained benchmark datasets
2.3.2 Fine-grained dog dataset
The Stanford Dogs [38] dataset is a fine-grained image dataset released by Stanford University in 2011. It contains 20,580 images of 120 dog sub-classes; each image provides an image-level label and an object bounding box, but no fine-level supervision such as part key points or attribute annotations. Sample images are shown in Figure 2-13.
[A. Khosla et al., CVPR Workshop 2011]
Stanford Dogs
20,580 images
120 fine-grained classes
Introduction (con’t)
http://www.weixiushen.com/
Fine-grained benchmark datasets
Oxford Flowers 8,189 images, 102 fine-grained classes
[M.-E. Nilsback and A. Zisserman, CVGIP 2008]
Introduction (con’t)
http://www.weixiushen.com/
Fine-grained benchmark datasets
Figure 2-15: sample images from the Aircrafts [51] dataset.
Aircrafts
10,200 images
100 fine-grained classes
[S. Maji et al., arXiv: 1306.5151, 2013]
Introduction (con’t)
http://www.weixiushen.com/
Fine-grained benchmark datasets
Stanford Cars
Figure 2-16: sample images from the Stanford Cars [40] dataset.
16,185 images, 196 fine-grained classes
[J. Krause et al., ICCV Workshop 2013]
Introduction (con’t)
http://www.weixiushen.com/
Fine-grained image analysis is hot …
Introduction (con’t)
Many papers published on top-tier conf./journals
CVPR, ICCV, ECCV, IJCAI, etc.
TPAMI, IJCV, TIP, etc.
Many frequently held workshops
Workshop on Fine-Grained Visual Categorization
…
Many academic challenges about fine-grained tasks
The Nature Conservancy Fisheries Monitoring
iFood Classification Challenge
iNaturalist Classification Challenge
…
http://www.weixiushen.com/
Image Retrieval (IR)
What is image retrieval?
• Description Based Image Retrieval (DBIR): textual query (text)
• Content Based Image Retrieval (CBIR): query by example (image, sketch)
Fine-grained image retrieval
http://www.weixiushen.com/
Fine-grained image retrieval (con’t)
Common components of CBIR systems
• Database images → Feature extraction → Feature database
• Query → Feature extraction → Comparison → Return results
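A minimal sketch of this CBIR loop, assuming a placeholder `extract_feature` function standing in for any hand-crafted or deep feature extractor; retrieval is done by cosine similarity between the query feature and the pre-computed database features.

```python
import numpy as np

def extract_feature(image):
    """Placeholder feature extractor: returns an L2-normalized vector."""
    feat = image.reshape(-1).astype(np.float32)
    return feat / (np.linalg.norm(feat) + 1e-12)

def retrieve(query_image, database_images, top_k=5):
    # Offline: feature extraction over the database.
    db_feats = np.stack([extract_feature(img) for img in database_images])
    # Online: extract the query feature and compare (cosine similarity).
    q = extract_feature(query_image)
    sims = db_feats @ q
    # Return the indices of the top-k most similar database images.
    return np.argsort(-sims)[:top_k]

database = [np.random.rand(32, 32, 3) for _ in range(100)]  # toy images
print(retrieve(np.random.rand(32, 32, 3), database))
```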
http://www.weixiushen.com/
Deep learning for image retrieval
Figure 9: Sample search results using CroW features compressed to just d = 32 dimensions. The query image is shown at the leftmost side with the query bounding box marked in a red rectangle.
Fine-grained image retrieval (con’t)
http://www.weixiushen.com/
FGIR vs. General-purposed IR
Fine-grained image retrieval (con’t)
Paper: "Selective Convolutional Descriptor Aggregation for Fine-Grained Image Retrieval", Xiu-Shen Wei, Jian-Hao Luo, Jianxin Wu, Zhi-Hua Zhou, IEEE TIP.
(a) Fine-grained image retrieval. Two examples ("Mallard" and "Rolls-Royce Phantom Sedan 2012") from the CUB200-2011 [10] and Cars [11] datasets, respectively.
(b) General image retrieval. Two examples from the Oxford Building [12] dataset.
Figure 1. Fine-grained image retrieval vs. general image retrieval. Fine-grained image retrieval (FGIR) processes visually similar objects as the probe and gallery. For example, given an image of Mallard (or Rolls-Royce Phantom Sedan 2012) as the query, the FGIR system should return images of the same
http://www.weixiushen.com/
FGIR based on hand-crafted features
Fine-grained image retrieval (con’t)
Fine-grained image retrieval is usually performed within a single object category; for example, one retrieves only among different bird classes, and usually does not perform fine-grained retrieval on a mixed dataset of birds, dogs, and other categories. In contrast, the retrieval method proposed in Chapter 3 of this thesis is the first deep-learning-based fine-grained image retrieval method; meanwhile, the fine-grained image retrieval studied in this thesis …
Figure 2-11: illustration of the fine-grained image "search" task proposed by Xie et al. [83]. (Figure reproduced with IEEE permission.)
[Xie et al., IEEE TMM 2015] http://www.weixiushen.com/
Figure 1. Pipeline of the proposed SCDA method: (a) input image, (b) convolutional activation tensor, (c) descriptor selection, (d) descriptor aggregation, yielding an SCDA feature. (Best viewed in color.)
annotations are expensive and unrealistic in many real applications. In answer to this difficulty, there are attempts to categorize fine-grained images with only image-level labels, e.g., [6–9]. In this paper, we handle a more challenging but more realistic task, i.e., Fine-Grained Image Retrieval.
[Wei et al., IEEE TIP 2017]
Fine-grained image retrieval (con’t)
Selective Convolutional Descriptor Aggregation (SCDA)
http://www.weixiushen.com/
3 Selective Convolutional Descriptor Aggregation
In this paper, we follow the notations in [24]. The term "feature map" indicates the convolution results of one channel; the term "activations" indicates feature maps of all channels in a convolution layer; and the term "descriptor" indicates the d-dimensional component vector of activations. "pool5" refers to the activations of the max-pooled last convolution layer, and "fc8" refers to the activations of the last fully connected layer.
Given an input image I of size H × W, the activations of a convolution layer are formulated as an order-3 tensor T with h × w × d elements, which include a set of 2-D feature maps S = {S_n} (n = 1, ..., d). S_n of size h × w is the n-th feature map of the corresponding channel. From another point of view, T can be also considered as having h × w cells and each cell contains one d-dimensional deep descriptor. We denote the deep descriptors as X = {x_{(i,j)}}, where (i, j) is a particular cell (i ∈ {1, ..., h}, j ∈ {1, ..., w}, x_{(i,j)} ∈ R^d). For instance, by employing the popular pre-trained VGG-16 model [25] to extract deep descriptors, we can get a 7 × 7 × 512 activation tensor in pool5 if the input image is 224 × 224. Thus, on one hand, for this image, we have 512 feature maps (i.e., S_n) of size 7 × 7; on the other hand, 49 deep descriptors of 512-d are also obtained.
3.1 Selecting Convolutional Descriptors
What distinguishes SCDA from existing deep learning-based image retrieval methods is: using only the pre-trained model, SCDA is able to find useful …
Feature maps:
Descriptors:
Fine-grained image retrieval (con’t)
Notations
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
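To make the notation concrete, the sketch below (an illustration, not the authors' code) extracts the pool5 activation tensor of a torchvision pre-trained VGG-16 and views it both as 512 feature maps S_n of size 7 × 7 and as 49 deep descriptors of 512-d for a 224 × 224 input.

```python
import torch
from torchvision import models

# Pre-trained VGG-16; `features` ends with the max-pooled last conv layer (pool5).
vgg16 = models.vgg16(pretrained=True).eval()

image = torch.randn(1, 3, 224, 224)          # stand-in for a preprocessed image
with torch.no_grad():
    pool5 = vgg16.features(image)[0]         # activation tensor T: d x h x w = 512 x 7 x 7

d, h, w = pool5.shape
descriptors = pool5.permute(1, 2, 0).reshape(h * w, d)  # 49 deep descriptors of 512-d
feature_maps = pool5                                    # 512 feature maps S_n of size 7 x 7
print(descriptors.shape, feature_maps.shape)
```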
Panels: the 468-th, 375-th, 284-th, 108-th, 481-th, 245-th, 6-th, and 163-th channels.
Figure 2. Sampled feature maps of four fine-grained images. Although we resize the
images for better visualization, our method can deal with images of any resolution.
Fine-grained image retrieval (con’t)
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
region, we could expect this region to be an object rather than the background. Therefore, in the proposed method, we add up the obtained pool5 activation tensor through the depth direction. Thus, the h × w × d 3-D tensor becomes an h × w 2-D tensor, which we call the "activation map", i.e., A = Σ_n S_n (where S_n is the n-th feature map in pool5). For the activation map A, there are h × w summed activation responses, corresponding to h × w positions. Based on the aforementioned observation, it is straightforward to say that the larger the activation response of a particular position (i, j) is, the more possible its corresponding region is part of the object. Then, we calculate the mean value ā of all the positions in A as the threshold to decide which positions localize objects: a position (i, j) whose activation response is higher than ā indicates that the main object, e.g., a bird or a dog, might appear in that position. The mask map M of the same size as A can be obtained as

M_{i,j} = 1 if A_{i,j} > ā, and 0 otherwise,   (1)

where (i, j) is a particular position in these h × w positions.
The figures in the first row of Fig. 3 show some examples of the mask map for birds and dogs. In these figures, we first resize the mask …
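A minimal sketch of this step, assuming the pool5 activations are given as a (d, h, w) tensor: the feature maps are summed over the depth direction to form the activation map A, which is thresholded at its mean ā to give the mask map M of Eq. (1).

```python
import torch

def activation_map_and_mask(pool5):
    """pool5: tensor of shape (d, h, w). Returns the activation map A (h x w)
    and the mask M with M[i, j] = 1 where A[i, j] > mean(A)."""
    A = pool5.sum(dim=0)              # add up feature maps through the depth direction
    a_bar = A.mean()                  # threshold: mean activation response
    M = (A > a_bar).to(torch.uint8)   # mask map of the same size as A
    return A, M

pool5 = torch.rand(512, 7, 7)         # stand-in for VGG-16 pool5 activations
A, M = activation_map_and_mask(pool5)
print(A.shape, M.sum().item())
```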
Fine-grained image retrieval (con’t)
Obtaining the activation map by summarizing feature maps
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
(a) Visualization of the mask map M. (b) Visualization of the mask map M̃.
Figure 3. Visualization of the mask map M and the corresponding largest connected component M̃. The selected regions are highlighted in red. (Best viewed in color.)
Fine-grained image retrieval (con’t)
Visualization of the mask map M
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
Figure 4. Selecting useful deep convolutional descriptors: (a) input image, (b) feature maps, (c) activation map, (d) mask map, (e) the largest connected component of the mask map. (Best viewed in color.)
Therefore, we use M̃ to select useful and meaningful deep convolutional descriptors. The descriptor x_{(i,j)} should be kept when M̃_{i,j} = 1, while M̃_{i,j} = 0 means the position (i, j) might have background or noisy parts:

F = { x_{(i,j)} | M̃_{i,j} = 1 } ,   (2)
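A sketch of the selection in Eq. (2), under the assumption that SciPy's connected-component labelling is an acceptable stand-in for extracting the largest connected component M̃ of the mask; array layouts follow the previous sketch.

```python
import numpy as np
from scipy.ndimage import label

def select_descriptors(pool5, mask):
    """pool5: (d, h, w) activations; mask: (h, w) binary mask M.
    Returns the descriptors x_(i,j) kept where the largest connected
    component M~ of the mask equals 1 (the set F in Eq. (2))."""
    labeled, num = label(mask)                      # connected components of M
    if num == 0:
        keep = np.zeros_like(mask, dtype=bool)
    else:
        sizes = np.bincount(labeled.ravel())[1:]    # component sizes (skip background)
        keep = labeled == (np.argmax(sizes) + 1)    # M~: largest connected component
    d, h, w = pool5.shape
    descriptors = pool5.transpose(1, 2, 0).reshape(h * w, d)
    return descriptors[keep.ravel()]                # F = {x_(i,j) | M~_(i,j) = 1}

pool5 = np.random.rand(512, 7, 7)
mask = (pool5.sum(axis=0) > pool5.sum(axis=0).mean()).astype(np.uint8)
print(select_descriptors(pool5, mask).shape)
```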
Fine-grained image retrieval (con’t)
Selecting useful deep convolutional descriptors
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
Qualitative evaluation
Figure 6. Samples of predicted object localization bounding box on CUB200-2011 and
Stanford Dogs. The ground-truth bounding box is marked as the red dashed rectangle,
Fine-grained image retrieval (con’t)
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
Table 2. Comparison of different encoding or pooling approaches for FGIR.
Approach        Dimension   CUB200-2011 (top1 / top5)   Stanford Dogs (top1 / top5)
VLAD            1,024       55.92% / 62.51%             69.28% / 74.43%
Fisher Vector   2,048       52.04% / 59.19%             68.37% / 73.74%
avgPool         512         56.42% / 63.14%             73.76% / 78.47%
maxPool         512         58.35% / 64.18%             70.37% / 75.59%
avg&maxPool     1,024       59.72% / 65.79%             74.86% / 79.24%
– Pooling approaches. We also try two traditional pooling approaches, i.e., max-pooling and average-pooling, to aggregate the deep descriptors.
After encoding or pooling into a single vector, for VLAD and FV, the square root normalization and ℓ2-normalization are followed; for max- and average-pooling methods, we just do ℓ2-normalization (the square root normalization did not work well). Finally, the cosine similarity is used for nearest neighbor search. We use two datasets to demonstrate which type of aggregation method is optimal
whole-object in CUB200-2011. By comparing the results of PCP for torso and our whole-object, we find that, even though our method is unsupervised, the localization performance is just slightly lower or even comparable to that of these algorithms using strong supervisions, e.g., ground-truth bounding box and parts annotations (even in the test phase).
3.2 Aggregating Convolutional Descriptors
After selecting F = { x_{(i,j)} | M̃_{i,j} = 1 }, we compare several encoding or pooling approaches to aggregate these convolutional features, and then give our proposal.
– VLAD [14] uses k-means to find a codebook of K centroids {c_1, ..., c_K} and maps x_{(i,j)} into a single vector v_{(i,j)} = [0 ... 0 (x_{(i,j)} − c_k) ... 0] ∈ R^{K×d}, where c_k is the closest centroid to x_{(i,j)}. The final representation is Σ_{i,j} v_{(i,j)}.
– Fisher Vector [15]: FV is similar to VLAD, but uses a soft assignment (i.e., Gaussian Mixture Model) instead of using k-means. Moreover, FV also includes second-order statistics.²
² For the parameter choice of VLAD/FV, we follow the suggestions reported in [30]. The number of clusters in VLAD and the number of Gaussian components in FV are both set to 2. Larger values lead to lower accuracy.
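A minimal sketch of the "avg&maxPool" aggregation that Table 2 favours: average-pool and max-pool the selected descriptors F, concatenate, and ℓ2-normalize; cosine similarity would then be used for nearest-neighbor search. Names and sizes are illustrative.

```python
import numpy as np

def scda_aggregate(selected_descriptors):
    """selected_descriptors: array of shape (num_kept, d), the set F.
    Returns the 2d-dimensional avg&maxPool feature, L2-normalized."""
    avg_pool = selected_descriptors.mean(axis=0)
    max_pool = selected_descriptors.max(axis=0)
    feat = np.concatenate([avg_pool, max_pool])        # 'avg&maxPool', 2 x d dims
    return feat / (np.linalg.norm(feat) + 1e-12)       # L2 normalization

F = np.random.rand(30, 512)        # toy: 30 selected 512-d descriptors
print(scda_aggregate(F).shape)     # (1024,)
```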
Fine-grained image retrieval (con’t)
Aggregating convolutional descriptors
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
SCDA
Fine-grained image retrieval (con’t)
Comparing different encoding or pooling methods
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
Figure 6. The mask map M and its corresponding largest connected component M̃ of different CNN layers: (a) M of pool5, (b) M̃ of pool5, (c) M of relu5_2, (d) M̃ of relu5_2. (The figure is best viewed in color.)
our aggregation scheme. Its performance is significantly and consistently higher than the others. We use the "avg&maxPool" aggregation as the "SCDA feature" to represent the whole fine-grained image.
3.3 Multiple Layer Ensemble
As studied in [31, 32], the ensemble of multiple layers boosts the final performance. Thus, we also incorporate another SCDA feature produced from the relu5_2 layer, which is three layers in front of pool5 in the VGG-16 model [25]. Following pool5, we get the mask map M_relu5_2 from relu5_2. Its activations are less related to the semantic meaning than those of pool5. As shown in Fig. 6(c), there are many noisy parts. However, the bird is more accurately detected than in pool5. Therefore, we combine M̃_pool5 and M_relu5_2 together to get the final mask map of relu5_2. M̃_pool5 is firstly upsampled to the size of M_relu5_2. We keep the descriptors when their positions in both M̃_pool5 and M_relu5_2 are 1, which are the final selected relu5_2 descriptors. The aggregation process remains the same. Finally, we concatenate the SCDA features of relu5_2 and pool5 into a single representation, denoted by "SCDA+":

SCDA+ = [ SCDA_pool5 , α × SCDA_relu5_2 ] ,   (3)
where α is the coefficient for SCDA_relu5_2. It is set to 0.5 for FGIR. The ℓ2 normalization is followed. In addition, another SCDA+ of the horizontal flip of the original image is incorporated, which is denoted as "SCDA_flip+" (4,096-d).
4 Experiments and Results
In this section, we report the fine-grained image retrieval results. In addition,
Fine-grained image retrieval (con’t)
Multiple layer ensemble
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
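Given SCDA features already computed from pool5 and relu5_2 as above, Eq. (3) amounts to a weighted concatenation with α = 0.5; the sketch below also shows the 4,096-d "SCDA_flip+" shape, using the same vector as a stand-in for the flipped image's feature.

```python
import numpy as np

def scda_plus(scda_pool5, scda_relu5_2, alpha=0.5):
    """Eq. (3): SCDA+ = [SCDA_pool5, alpha * SCDA_relu5_2], L2-normalized."""
    feat = np.concatenate([scda_pool5, alpha * scda_relu5_2])
    return feat / (np.linalg.norm(feat) + 1e-12)

scda_pool5 = np.random.rand(1024)      # avg&maxPool feature from pool5
scda_relu5_2 = np.random.rand(1024)    # avg&maxPool feature from relu5_2
scda_p = scda_plus(scda_pool5, scda_relu5_2)   # 2,048-d SCDA+
# 'SCDA_flip+' concatenates the SCDA+ of the original and of the horizontally
# flipped image (here the same vector is reused purely as a stand-in).
scda_flip = np.concatenate([scda_p, scda_p])
print(scda_p.shape, scda_flip.shape)           # (2048,) (4096,)
```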
Figure 7. Some retrieval results of four fine-grained datasets. On the left, there are
Fine-grained image retrieval (con’t)
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
Figure 8. Quality demonstrations of the SCDA feature. From the top to bottom of
each column, there are six returned original images in the order of one sorted dimension
of “256-d SVD+whitening”. (Best viewed in color and zoomed in.)
Fine-grained image retrieval (con’t)
Quality demonstration of the SCDA feature
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
Deep Descriptor Transforming
CNN pre-trained models → Deep Descriptor Transforming
Image co-localization (a.k.a. unsupervised object discovery) is a
fundamental computer vision problem, which simultaneously localizes
objects of the same category across a set of distinct images.
CNN pre-trained models
Deep Descriptor Transforming
Input images
PCA converts the data into a set of linearly uncorrelated variables (i.e., the principal components). This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to all the preceding components.
PCA is widely used in machine learning and computer vision for dimension reduction [Chen et al., 2013; Gu et al., 2011; Zhang et al., 2009; Davidson, 2009], noise reduction [Zhang et al., 2013; Nie et al., 2011] and so on. Specifically, in this paper, we utilize PCA as projection directions for transforming these deep descriptors {x_(i,j)} to evaluate their correlations. Then, on each projection direction, the corresponding principal component's values are treated as the cues for image co-localization, especially the first principal component. Thanks to the property of this kind of transforming, DDT is also able to handle data noise.
In DDT, for a set of N images containing objects from the same category, we first collect the corresponding convolutional descriptors (X^1, ..., X^N) by feeding them into a pre-trained CNN model. Then, the mean vector of all the descriptors is calculated by

\bar{x} = \frac{1}{K} \sum_{n} \sum_{i,j} x^n_{(i,j)} ,   (1)

where K = h × w × N. Note that, here we assume each image has the same number of deep descriptors (i.e., h × w) for presentation clarity. Our proposed method, however, can handle input images with arbitrary resolutions.
Then, after obtaining the covariance matrix

\mathrm{Cov}(x) = \frac{1}{K} \sum_{n} \sum_{i,j} (x^n_{(i,j)} - \bar{x})(x^n_{(i,j)} - \bar{x})^{\top} ,   (2)

we can get the eigenvectors ξ_1, ..., ξ_d of Cov(x), which correspond to the sorted eigenvalues λ_1 ≥ ··· ≥ λ_d ≥ 0.
Figure 2: Examples from twelve randomly sampled classes of VOC 2007 (e.g., bike, diningtable, boat, bottle, chair, dog, horse, motorbike, …). The first column of each class are by SCDA, the second column are by our DDT. The red vertical lines in the histogram plots indicate the thresholds for localizing objects. The selected regions in images are highlighted in red. (Best viewed in color and zoomed in.)
"diningtable" image region. In Fig. 2, more examples are presented. In that figure, some failure cases can be also found, e.g., the chair class in Fig. 2 (g).
In addition, the normalized P^1 can be also used as localization probability scores. Combining it with conditional random field techniques might produce more accurate object boundaries. Thus, DDT can be modified slightly in that way, and then perform the co-segmentation problem. More importantly, different from other co-segmentation methods, DDT can detect noisy images while other methods can not.
4 Experiments
In this section, we first introduce the evaluation metric and datasets used in image co-localization. Then, we compare the empirical results of our DDT with other state-of-the-arts on these datasets. The computational cost of DDT is reported too. Moreover, the results in Sec. 4.4 and Sec. 4.5 illustrate the generalization ability and robustness of the proposed method. Finally, our further study in Sec. 4.6 reveals DDT might deal with part-based image co-localization, which is a novel and challenging problem.
Table 4: Comparisons of the CorLoc metric with weakly supervised object localization (WSOL) methods on VOC 2007. An "X" in the "Neg." column indicates that these WSOL methods require access to negative images.
Methods               Neg.  aero  bike  bird  boat  bottle  bus   car   cat   chair  cow   table  …
[Shi et al., 2013]    X     67.3  54.4  34.3  17.8   1.3    46.6  60.7  68.9   2.5   32.4  16.2   …
[Cinbis et al., 2015] X     56.6  58.3  28.4  20.7   6.8    54.9  69.1  20.8   9.2   50.5  10.2   …
[Wang et al., 2015]   X     37.7  58.8  39.0   4.7   4.0    48.4  70.0  63.7   9.0   54.2  33.3   …
[Bilen et al., 2015]  X     66.4  59.3  42.7  20.4  21.3    63.4  74.3  59.6  21.1   58.2  14.0   …
[Ren et al., 2016]    X     79.2  56.9  46.0  12.2  15.7    58.4  71.4  48.6   7.2   69.9  16.7   …
[Wang et al., 2014]   X     80.1  63.9  51.5  14.9  21.0    55.7  74.2  43.5  26.2   53.4  16.3   …
Our DDT                     67.3  63.3  61.3  22.7   8.5    64.8  57.0  80.5   9.4   49.0  22.5   …
…different from their original purposes (e.g., for describing texture or finding object proposals [Ghodrati et al., 2015]). However, for such adaptations of pre-trained models, they still require further annotations in the new domain (e.g., image labels). While, DDT deals with the image co-localization problem in an unsupervised setting.
Coincidentally, several recent works also shed lights on CNN pre-trained model reuse in the unsupervised setting, e.g., SCDA [Wei et al., 2017]. SCDA is proposed for handling the fine-grained image retrieval task, where it uses pre-trained models (from ImageNet, which is not fine-grained) to locate main objects in fine-grained images. It is the most related work to ours, even though SCDA is not for image co-localization. Different from our DDT, SCDA assumes only an object of interest in each image, and meanwhile objects from other categories does not exist. Thus, SCDA locates the object using cues from this single image assumption. Apparently, it can not work well for images containing diverse objects (cf. Table 2 and Table 3), and also can not handle data noise (cf. Sec. 4.5).
2.2 Image Co-Localization
Image co-localization is a fundamental problem in computer vision, where it needs to discover the common object emerging in only positive sets of example images (without any negative examples or further supervisions). Image co-localization shares some similarities with image co-segmentation [Zhao and Fu, 2015; Kim et al., 2011; Joulin et al., 2012]. Instead of generating a precise segmentation of the related objects in
3 The Proposed Method
3.1 Preliminary
The following notations are used in the rest of this paper. The term "feature map" indicates the convolution results of one channel; the term "activations" indicates feature maps of all channels in a convolution layer; and the term "descriptor" indicates the d-dimensional component vector of activations.
Given an input image I of size H × W, the activations of a convolution layer are formulated as an order-3 tensor T with h × w × d elements. T can be considered as having h × w cells and each cell contains one d-dimensional deep descriptor. For the n-th image, we denote its corresponding deep descriptors as X^n = { x^n_{(i,j)} ∈ R^d }, where (i, j) is a particular cell (i ∈ {1, ..., h}, j ∈ {1, ..., w}) and …
3.2 SCDA Recap
Since SCDA [Wei et al., 2017] is the most related work to ours, we hereby present a recap of this method. SCDA is proposed for dealing with the fine-grained image retrieval task. It employs pre-trained models to select the meaningful deep descriptors by localizing the main object in fine-grained images unsupervisedly. In SCDA, it assumes that each image contains only one main object of interest and without other categories' objects. Thus, the object localization strategy is …
Fine-grained image retrieval (con’t)
[Wei et al., IJCAI 2017] http://www.weixiushen.com/
CNN pre-trained models
Deep Descriptor Transforming
2 – The proposed method: DDT
4 – Experiments: four benchmark datasets (Object Discovery, PASCAL VOC, …), co-localization results, detecting noisy images
5 – Further study
CNN pre-trained models
Deep Descriptor Transforming
Figure 1: Pipeline of the proposed DDT method for image
co-localization. In this instance, the goal is to localize the
airplane within each image. Note that, there might be few
noisy images in the image set. (Best viewed in color.)
Input images
Then, after obtaining the covariance matrix

\mathrm{Cov}(x) = \frac{1}{K} \sum_{n} \sum_{i,j} (x^n_{(i,j)} - \bar{x})(x^n_{(i,j)} - \bar{x})^{\top} ,   (2)

we can get the eigenvectors ξ_1, ..., ξ_d of Cov(x), which correspond to the sorted eigenvalues λ_1 ≥ ··· ≥ λ_d ≥ 0.
Deep Descriptor Transforming for Image Co-Localization. Xiu-Shen Wei, Chen-Lin Zhang, Yao Li, Chen-Wei Xie, Jianxin Wu, Chunhua Shen, Zhi-Hua Zhou (National Key Laboratory for Novel Software Technology, Nanjing University; The University of Adelaide).
As aforementioned, since the first principal component has the largest variance, we take the eigenvector ξ_1 corresponding to the largest eigenvalue as the main projection direction. For the deep descriptor at a particular position (i, j) of an image, its first principal component p^1 is calculated as follows:
p^1_{(i,j)} = \xi_1^{\top} (x_{(i,j)} - \bar{x}) .   (3)

According to their spatial locations, all p^1_{(i,j)} from an image are combined into a 2-D matrix whose dimensions are h × w. We call that matrix the indicator matrix:

P^1 = [ p^1_{(1,1)}  p^1_{(1,2)}  ...  p^1_{(1,w)} ; ... ; p^1_{(h,1)}  p^1_{(h,2)}  ...  p^1_{(h,w)} ]
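A minimal sketch of the DDT computation in Eqs. (1)–(3), assuming each image's deep descriptors are given as an (h·w, d) array: the mean vector and covariance are computed over the whole image set, the eigenvector ξ_1 of the largest eigenvalue is taken as the projection direction, and projecting each descriptor gives the h × w indicator matrix P^1 per image.

```python
import numpy as np

def ddt_indicator_matrices(descriptor_sets, h, w):
    """descriptor_sets: list of arrays, each of shape (h*w, d), the deep
    descriptors X^n of the N images. Returns one h x w indicator matrix
    P^1 per image (Eqs. (1)-(3))."""
    all_desc = np.concatenate(descriptor_sets, axis=0)      # K = h*w*N descriptors
    x_bar = all_desc.mean(axis=0)                           # Eq. (1): mean vector
    centered = all_desc - x_bar
    cov = centered.T @ centered / all_desc.shape[0]         # Eq. (2): covariance
    eigvals, eigvecs = np.linalg.eigh(cov)                  # ascending eigenvalues
    xi1 = eigvecs[:, -1]                                    # eigenvector of the largest eigenvalue
    # Eq. (3): first principal component of every descriptor, reshaped to h x w.
    return [((X - x_bar) @ xi1).reshape(h, w) for X in descriptor_sets]

h, w, d, N = 7, 7, 512, 4
descriptor_sets = [np.random.rand(h * w, d) for _ in range(N)]   # toy descriptor sets
P1_maps = ddt_indicator_matrices(descriptor_sets, h, w)
print(P1_maps[0].shape)   # (7, 7); positive entries indicate the common object
```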
The bounding box which contains the largest connected component of positive regions is returned as our object co-localization prediction. (If P^1 of an image contains no positive value, it indicates that no common object appears, which can be used for detecting noisy images.)
3.4 Discussions and Analyses
In this section, we investigate the effectiveness of DDT by comparing with SCDA.
As shown in Fig. 2, the object localization regions of SCDA and DDT are highlighted in red. Because SCDA only considers the information from a single image, "person" and even "guide-board" are also detected as main objects. Furthermore, we normalize the values of the aggregation map of SCDA into [0, 1] and calculate the mean value (which is the object localization threshold in SCDA). The histogram of the values in the aggregation map is also shown; the red vertical line corresponds to the threshold. Beyond the threshold, there are still many values, which gives an explanation about why SCDA highlights more regions.
Whilst, for DDT, it leverages the information of the whole image set to transform these deep descriptors into P^1. For this class, DDT can accurately locate the common object. The corresponding histogram is also drawn, but P^1 has both positive and negative values; we normalize P^1 into [−1, 1]. Apparently, few values are larger than the DDT threshold (i.e., 0). More importantly, many values are close to −1, which indicates a strong negative correlation. This validates the effectiveness of DDT. As another example shown in Fig. 2, SCDA wrongly locates "person" in an image belonging to the diningtable class, while DDT can correctly locate the "diningtable" image region.
Deep Descriptor Transforming for Image Co-Localization
Xiu-Shen Wei, Chen-Lin Zhang, Yao Li, Chen-Wei Xie, Jianxin Wu, Chunhua Shen, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; The University of Adelaide, Adelaide, Australia

The key idea of Deep Descriptor Transforming (DDT) is that we can leverage the whole image set, instead of a single image, when reusing CNN pre-trained models for image co-localization.
Figure 2: Examples from twelve randomly sampled classes of VOC 2007 (e.g., chair, dog, horse, motorbike, plane). The first column of each class shows the result by SCDA, the second column by our DDT. The red vertical lines in the histogram plots indicate the thresholds used for localizing objects. The selected regions in images are highlighted in red. (Best viewed in color and zoomed in.)
In Fig. 2, more examples are presented. In that figure, some failure cases can also be found, e.g., the chair class in Fig. 2 (g).

In addition, the normalized P^1 can also be used as localization probability scores. Combining it with conditional random field techniques might produce more accurate object boundaries. Thus, DDT can be modified slightly in that way and then perform co-segmentation. More importantly, different from other co-segmentation methods, DDT can detect noisy images while other methods cannot.
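Since the exact noise-detection rule is not spelled out in this excerpt, the following tiny sketch only illustrates the idea suggested above, namely that an image whose indicator matrix P^1 contains (almost) no positive entries is unlikely to contain the common object. The function name and the 1% ratio are hypothetical choices, not values from the paper.

```python
def looks_noisy(p1, min_positive_ratio=0.01):
    """Flag an image as a noise candidate when (almost) none of its P^1
    entries are positive, i.e., nothing correlates with the common object."""
    return float((p1 > 0).mean()) < min_positive_ratio
```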
4 Experiments

In this section, we first introduce the evaluation metric and datasets used in image co-localization. Then, we compare the empirical results of our DDT with the state-of-the-art methods on these datasets. The computational cost of DDT is reported too. Moreover, the results in Sec. 4.4 and Sec. 4.5 illustrate the generalization ability and robustness of the proposed method. Finally, our further study in Sec. 4.6 reveals that DDT might deal with part-based image co-localization, which is a novel and challenging problem.

In our experiments, the images keep their original resolutions. For the pre-trained deep model, the publicly available VGG-19 model [Simonyan and Zisserman, 2015] is employed to extract deep convolutional descriptors from the last convolution layer (before pool5). We use the open-source library MatConvNet [Vedaldi and Lenc, 2015] for conducting experiments. All the experiments are run on a computer with an Intel Xeon E5-2660 v3 CPU, 500 GB main memory, and a K80 GPU.
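The paper's experiments use MatConvNet; a rough PyTorch/torchvision equivalent of the descriptor-extraction step is sketched below, under the assumption that an ImageNet pre-trained VGG-19 truncated right before pool5 yields the required h x w x 512 activation tensor. The helper name extract_descriptors and the preprocessing details are illustrative, not taken from the paper.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# VGG-19 truncated before pool5: the output is the activation tensor of the
# last convolution (ReLU) layer, with 512 channels.
vgg = models.vgg19(pretrained=True).eval()
backbone = torch.nn.Sequential(*list(vgg.features.children())[:-1])

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


def extract_descriptors(image_path):
    """Return an (h, w, 512) NumPy array of deep descriptors for one image,
    keeping the image at its original resolution as in the experiments."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feats = backbone(img)          # shape: (1, 512, h, w)
    return feats.squeeze(0).permute(1, 2, 0).cpu().numpy()
```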
4.1 Evaluation Metric and Datasets

Following previous image co-localization works [Li et al., 2016; Cho et al., 2015; Tang et al., 2014], we take the correct localization (CorLoc) metric for evaluating the proposed method. CorLoc is defined as the percentage of images correctly localized according to the PASCAL criterion [Everingham et al., 2015]:

$$\frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})} > 0.5,$$

where $B_p$ is the predicted bounding box and $B_{gt}$ is the ground-truth bounding box. All CorLoc results are reported in percentages.

Table 1: Comparisons of CorLoc on Object Discovery (Airplane column).
[Joulin et al., 2010]       32.93
[Joulin et al., 2012]       57.32
[Rubinstein et al., 2013]   74.39
[Tang et al., 2014]         71.95
SCDA                        87.80
[Cho et al., 2015]          82.93
Our DDT                     91.46
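A minimal sketch of the CorLoc computation follows, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples and ignoring the +1 pixel convention sometimes used for PASCAL boxes; the function name is illustrative.

```python
def corloc(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Percentage of images whose predicted box overlaps the ground-truth box
    with IoU greater than the threshold (the PASCAL criterion)."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    hits = sum(iou(p, g) > iou_threshold for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)
```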
Our experiments are conducted on three datasets commonly used in image co-localization, i.e., the Object Discovery dataset [Rubinstein et al., 2013], the PASCAL VOC 2007 / VOC 2012 datasets [Everingham et al., 2015], and the ImageNet Subsets [Li et al., 2016].

For experiments on the VOC datasets, we follow [Cho et al., 2015; Li et al., 2016; Joulin et al., 2014] and use all images in the trainval set (excluding images that only contain object instances annotated as difficult). For Object Discovery, we use the 100-image subset following [Rubinstein et al., 2013; Cho et al., 2015] in order to make a fair comparison with other methods.

In addition, Object Discovery contains noisy images in the Airplane, Car and Horse categories. These noisy images contain no object belonging to their category. In Sec. 4.5, we quantitatively measure the ability of DDT to identify these noisy images.

To further investigate the generalization ability of DDT, the ImageNet Subsets [Li et al., 2016] are used, which contain six subsets/categories. These subsets are disjoint from the 1000 labels of the ILSVRC classification task [Russakovsky et al., 2015]. That is to say, these categories are unseen by pre-trained CNN models. Experimental results in Sec. 4.4 show that DDT is insensitive to the object categories used for pre-training.
Table 2: Comparisons of CorLoc on VOC 2007 (aero, bike and bird columns).
[Joulin et al., 2014]   32.8   17.3   20.9
SCDA                    54.4   27.2   43.4
[Cho et al., 2015]      50.3   42.8   30.0
[Li et al., 2016]       73.1   45.0   43.4
Our DDT                 67.3   63.3   61.3

Table 3: Comparisons of CorLoc on VOC 2012 (aero, bike and bird columns).
SCDA                    60.8   41.7   38.6
[Cho et al., 2015]      57.0   41.2   36.0
[Li et al., 2016]       65.7   57.8   47.9
Our DDT                 76.7   67.1   57.9
4.2 Comparisons with State-of-the-Arts

Comparisons to image co-localization methods. We first compare the results of DDT with the state-of-the-art methods (including SCDA) on Object Discovery in Table 1. For SCDA, we also use VGG-19 to extract the convolutional descriptors and perform the experiments. As shown in Table 1, our DDT outperforms the other methods by about 4%. Especially for the airplane class, DDT is clearly higher than [Cho et al., 2015]. In addition, since the images of each category in this dataset contain only one object, SCDA can also perform well.

The VOC 2007 and 2012 datasets contain diverse objects per image, which is more challenging than Object Discovery. The comparisons of the CorLoc metric on these two datasets are reported in Table 2 and Table 3, respectively. It is clear that on average our DDT outperforms the previous state-of-the-art methods (based on deep learning) on both datasets. Moreover, DDT performs well even on small common objects, e.g., "bottle", whose images usually contain diverse objects and hence do not obey SCDA's assumption.
Table 4: Comparisons of the CorLoc metric with weakly supervised object localization (WSOL) methods. "X" in the "Neg." column indicates that these WSOL methods require access to a negative image set. Compared methods: [Shi et al., 2013], [Cinbis et al., 2015], [Wang et al., 2015], [Bilen et al., 2015], [Ren et al., 2016], [Wang et al., 2014], and our DDT.

Figure 3: Random samples of predicted object co-localization bounding boxes on ImageNet Subsets (e.g., Chipmunk, Rhino, Racoon, Rake): successful predictions and one failure case. In these images, the red rectangle is the prediction and the yellow rectangle is the ground-truth bounding box; in the successful predictions, the yellow rectangle is almost the same as the red one. (Best viewed in color and zoomed in.)
Table 5: Comparisons of CorLoc on image sets disjoint with ImageNet (ImageNet Subsets: Chipmunk, Rhino, Stoat, Racoon, Rake, Wheelchair, and the mean). Chipmunk column:
[Cho et al., 2015]   26.6
SCDA                 32.3
[Li et al., 2016]    44.9
Our DDT              70.3
Figure 4: ROC curves (false positive rate vs. true positive rate) illustrating the ability of DDT to identify noisy images, e.g., on the Airplane category. The curves in red are produced by our DDT, and the curves in blue dashed lines present the method of [Tang et al., 2014], the only compared method that also handles noisy images.
CNN pre-trained model reuse in the unsupervised setting, e.g., SCDA [Wei et al., 2017]. SCDA is proposed for handling the fine-grained image retrieval task, where it uses pre-trained models (from ImageNet, which is not fine-grained) to locate the main objects in fine-grained images. It is the most related work to ours, even though SCDA is not designed for image co-localization. Different from our DDT, SCDA assumes there is only one object of interest in each image and that objects from other categories do not exist. Thus, SCDA locates the object using cues from this single-image assumption. Apparently, it cannot work well for images containing diverse objects (cf. Table 2 and Table 3), and it also cannot handle data noise (cf. Sec. 4.5).

2.2 Image Co-Localization

Image co-localization is a fundamental problem in computer vision: it needs to discover the common object emerging in only a positive set of example images (without any negative examples or further supervision). Image co-localization shares some similarities with image co-segmentation [Zhao and Fu, 2015; Kim et al., 2011; Joulin et al., 2012]. Instead of generating a precise segmentation of the related objects in each image, co-localization methods aim to return a bounding box around the object. Moreover, co-segmentation has a strong assumption that every image contains the object of interest, and hence is unable to handle noisy images.

Additionally, co-localization is also related to weakly supervised object localization (WSOL) [Zhang et al., 2016; Bilen et al., 2015; Wang et al., 2014; Siva and Xiang, 2011]. The key difference between them is that WSOL requires manually-labeled negative images, whereas co-localization only needs a set of positive images.
The term "feature map" indicates the convolution results of one channel; the term "activations" indicates feature maps of all channels in a convolution layer; and the term "descriptor" indicates the d-dimensional component vector of activations.

Given an input image I of size H x W, the activations of a convolution layer are formulated as an order-3 tensor T with h x w x d elements. T can be considered as having h x w cells, and each cell contains one d-dimensional deep descriptor. For the n-th image, we denote its corresponding deep descriptors as $X^n = \{ x^n_{(i,j)} \in \mathbb{R}^d \}$, where $(i, j)$ is a particular cell ($i \in \{1, \ldots, h\}$, $j \in \{1, \ldots, w\}$) and $n \in \{1, \ldots, N\}$.

3.2 SCDA Recap

Since SCDA [Wei et al., 2017] is the most related work to ours, we hereby present a recap of this method. SCDA is proposed for dealing with the fine-grained image retrieval problem. It employs pre-trained models to select the meaningful deep descriptors by localizing the main object in fine-grained images unsupervisedly. In SCDA, it is assumed that each image contains only one main object of interest and no objects from other categories. Thus, the object localization strategy is based on the activation tensor of a single image.

Concretely, for an image, the activation tensor is added up through the depth direction. Thus, the h x w x d 3-D tensor becomes an h x w 2-D matrix, which is called the "aggregation map" in SCDA. Then, the mean value ā of the aggregation map is regarded as the threshold for localizing the object: if the activation response in the position (i, j) of the aggregation map is larger than ā, it indicates that the object might appear at that position.
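For comparison with the DDT sketch above, the single-image localization step described in this SCDA recap amounts to only a few lines; the sketch below assumes the same (h, w, d) descriptor arrays and shows just the aggregation-map thresholding, not SCDA's later descriptor selection and aggregation steps.

```python
import numpy as np


def scda_object_mask(descriptor_map):
    """Sum the activation tensor through the depth direction to get the h x w
    aggregation map, then keep positions whose response exceeds its mean."""
    aggregation = descriptor_map.sum(axis=-1)   # (h, w, d) -> (h, w)
    return aggregation > aggregation.mean()     # boolean localization mask
```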
What we need is a mapping function …
Fine-grained image retrieval (con’t)
[Wei et al., IJCAI 2017] http://www.weixiushen.com/

  • 10. Deep learning and convolutional neural networks. Before: a shallow network (input layer, hidden layer, output layer); now: deep networks. Figures are courtesy of L. Fei-Fei (Fei-Fei Li & Andrej Karpathy, Lecture 7, 21 Jan 2015). “Rome was not built in one day!” Background (con’t) http://www.weixiushen.com/
  • 11. Deep learning and convolutional neural networks. All neural net activations are arranged in 3 dimensions. Similarly, in the 3-D case, if the input tensor of convolution layer l is $x^l \in \mathbb{R}^{H^l \times W^l \times D^l}$, then a kernel of that layer is $f^l \in \mathbb{R}^{H \times W \times D^l}$: the 3-D convolution simply extends the 2-D convolution to all $D^l$ channels at the corresponding positions, and the $H \cdot W \cdot D^l$ elements covered by one convolution are summed as the result at that position (illustration: a 3 × 4 × 3 kernel and the 1 × 1 × 1 output it produces at one position). Further, if there are D such kernels $f^l$, a 1 × 1 × 1 × D output is obtained at that position, and D is the number of channels $D^{l+1}$ of the next layer's feature $x^{l+1}$. Formally, the convolution operation can be written as $y_{i^{l+1}, j^{l+1}, d} = \sum_{i=0}^{H} \sum_{j=0}^{W} \sum_{d^l=0}^{D^l} f_{i,j,d^l,d} \times x^l_{i^{l+1}+i,\, j^{l+1}+j,\, d^l}$ (2.1), where $(i^{l+1}, j^{l+1})$ are the coordinates of the convolution result. Background (con’t) http://www.weixiushen.com/
  • 12. Unit processing of CNNs. Convolution: sliding the kernel over the input yields the convolutional feature (the illustration shows the 2nd and 9th convolution steps of a worked example). Pooling, e.g., max-pooling: $y_{i^{l+1}, j^{l+1}, d} = \max_{0 \le i < H,\, 0 \le j < W} x^l_{i^{l+1} \times H + i,\, j^{l+1} \times W + j,\, d^l}$, where $0 \le i^{l+1} < H^{l+1}$, $0 \le j^{l+1} < W^{l+1}$, $0 \le d < D^{l+1} = D^l$ (the illustration shows 2 × 2 max-pooling examples). Besides max- and average-pooling, stochastic pooling [94] randomly selects an element with probability proportional to its response: globally it behaves like average-pooling, locally like max-pooling. Non-linear activation function: ReLU(x) = max{0, x}, i.e., x for x ≥ 0 and 0 for x < 0. Its gradient is 1 for x ≥ 0 and 0 otherwise, which removes the gradient-saturation effect of sigmoid-type functions for x ≥ 0; it is also cheaper to compute and helps stochastic gradient descent converge (about 6× faster [52]). Its drawback is that for x < 0 the gradient is 0, so such units can no longer affect training (the “dead zone” problem). Background (con’t) http://www.weixiushen.com/
  • 13. CNN architecture: [CONV, ReLU, CONV, ReLU, POOL] × 3, followed by FC (fully-connected). Figures are courtesy of L. Fei-Fei. Background (con’t) http://www.weixiushen.com/
  • 14. Meta class Bird Dog Orange…… Traditional image recognition (Coarse-grained) Fine-grained image recognition Dog Husky Satsuma Alaska…… Introduction Fine-grained images vs. generic images http://www.weixiushen.com/
  • 15. Various real-world applications Introduction (con’t) http://www.weixiushen.com/
  • 16. Various real-world applications Introduction (con’t) http://www.weixiushen.com/
  • 17. Various real-world applications Introduction (con’t) http://www.weixiushen.com/ Results 1. Megvii Research Nanjing a. Error = 0.10267 2. Alibaba Machine Intelligence Technology Lab a. Error = 0.11315 3. General Dynamics Mission Systems a. Error = 0.12678 FGVC6 Sponsored by This certificate is awarded to Bo-Yan Zhou, Bo-Rui Zhao, Quan Cui, Yan-Ping Xie, Zhao-Min Chen, Ren-Jie Song, and Xiu-Shen Wei Megvii Research Nanjing iNat2019 winners of the iNaturalist 2019 image classification challenge held in conjunction with the FGVC workshop at CVPR 2019. Genera with at least 10 species
  • 18. Herbarium Challenge 2019 (top team ● #1 Megvii Research Nanjing (89.8%) ○ Boyan Zhou, Quan Cui, Borui Zhao, Yanping Xie, Renjie Song ● #2 PEAK (89.1%) ○ Chunqiao Xu, Shao Zeng, Qiule Sun, Shuyu Ge, Peihua Li (Dalian University of Technology) ● #3 Miroslav Valan (89.0%) ○ Swedish Museum of Natural History ● #4 Hugo Touvron (88.9%) ○ Hugo Touvron and Andrea Vedaldi (Facebook AI Research) Various real-world applications Introduction (con’t) http://www.weixiushen.com/ Herbarium Challenge 2019 Kiat Chuan Tan1 , Yulong Liu1 , Barbara Ambrose2 , Melissa Tulig2 , Serge Belongie1,3 1 Google Research, 2 New York Botanical Garden, 3 Cornell Tech Google Research , Xiu-Shen Wei
  • 19. Various real-world applications Introduction (con’t) http://www.weixiushen.com/
  • 20. Various real-world applications Introduction (con’t) http://www.weixiushen.com/
  • 21. Various real-world applications Introduction (con’t) http://www.weixiushen.com/
  • 22. Challenge of fine-grained image analysis: small inter-class variance, large intra-class variance (e.g., Heermann Gull, Herring Gull, Slaty-backed Gull, Western Gull). Figure 1. Examples from CUB-200-2011; note that it is a technically challenging task even for humans to categorize them. Introduction (con’t) Figures are courtesy of [X. He and Y. Peng, CVPR 2017]. http://www.weixiushen.com/
  • 23. The key of fine-grained image analysis Introduction (con’t) http://www.weixiushen.com/
  • 24. Fine-grained benchmark datasets. CUB200-2011 [C. Wah et al., CNS-TR-2011-001, 2011]: 200 fine-grained bird classes, 11,788 images (5,994 training and 5,794 test images); each image provides an image-level label, an object bounding box, part key points and bird attributes. In addition, Birdsnap [6], proposed by researchers at Columbia University in 2014, is a large-scale fine-grained bird dataset with 500 bird sub-categories and 49,829 images (47,386 training and 2,443 test images); similar to CUB200-2011, each image provides an image-level label, an object bounding box, part key points and attributes. (Figure: example images from CUB200-2011.) Introduction (con’t) http://www.weixiushen.com/
  • 25. Fine-grained benchmark datasets. Stanford Dogs [A. Khosla et al., CVPR Workshop 2011]: a fine-grained dataset released by Stanford University in 2011, containing 20,580 images of 120 dog sub-categories; each image provides an image-level label and an object bounding box, but no finer supervision such as part key points or attributes. (Figure: example images from Stanford Dogs.) Introduction (con’t) http://www.weixiushen.com/
  • 26. Fine-grained benchmark datasets Oxford Flowers 8,189 images, 102 fine-grained classes [M.-E. Nilsback and A. Zisserman, CVGIP 2008] Introduction (con’t) http://www.weixiushen.com/
  • 27. Fine-grained benchmark datasets. Aircrafts [S. Maji et al., arXiv:1306.5151, 2013]: 10,200 images, 100 fine-grained classes. (Figure: example images from the Aircrafts dataset.) Introduction (con’t) http://www.weixiushen.com/
  • 28. Fine-grained benchmark datasets. Stanford Cars [J. Krause et al., ICCV Workshop 2013]: 16,185 images, 196 fine-grained classes. (Figure: example images from the Stanford Cars dataset.) Introduction (con’t) http://www.weixiushen.com/
  • 29. Fine-grained image analysis is hot … Introduction (con’t) Many papers published in top-tier conferences/journals: CVPR, ICCV, ECCV, IJCAI, etc.; TPAMI, IJCV, TIP, etc. Many frequently held workshops: Workshop on Fine-Grained Visual Categorization, … Many academic challenges on fine-grained tasks: The Nature Conservancy Fisheries Monitoring, iFood Classification Challenge, iNature Classification Challenge, … http://www.weixiushen.com/
  • 30. Image Retrieval (IR). What is image retrieval? Description Based Image Retrieval (DBIR): textual query (text). Content Based Image Retrieval (CBIR): query by example (image or sketch). Fine-grained image retrieval http://www.weixiushen.com/
  • 31. Fine-grained image retrieval (con’t) Common components of CBIR systems Feature extraction Input Database Query Feature extraction Comparison Return http://www.weixiushen.com/
  • 32. Deep learning for image retrieval. Figure 9: Sample search results using CroW features compressed to just d = 32 dimensions. The query image is shown at the leftmost side with the query bounding box marked in a red rectangle. Fine-grained image retrieval (con’t) http://www.weixiushen.com/
  • 33. FGIR vs. general-purpose IR. Fine-grained image retrieval (con’t). [Selective Convolutional Descriptor Aggregation for Fine-Grained Image Retrieval; Xiu-Shen Wei, Jian-Hao Luo, Jianxin Wu, Zhi-Hua Zhou; IEEE TIP 2017.] Figure 1. Fine-grained image retrieval vs. general image retrieval. (a) Fine-grained image retrieval: two examples ("Mallard" and "Rolls-Royce Phantom Sedan 2012") from the CUB200-2011 [10] and Cars [11] datasets, respectively. (b) General image retrieval: two examples from the Oxford Building [12] dataset. Fine-grained image retrieval (FGIR) processes visually similar objects as the probe and gallery; for example, given an image of Mallard (or Rolls-Royce Phantom Sedan 2012) as the query, the FGIR system should return images of the same fine-grained class. http://www.weixiushen.com/
  • 34. FGIR based on hand-crafted features Fine-grained image retrieval (con’t) Fine-grained image retrieval is usually performed within a single meta-category, e.g., retrieving among different bird species; it rarely operates on a mixed collection of multiple meta-categories such as birds and dogs. By contrast, the SCDA method introduced in the following slides is the first deep-learning-based fine-grained image retrieval method. Figure: illustration of the fine-grained image "search" task proposed by Xie et al. (figure used with IEEE permission). [Xie et al., IEEE TMM 2015] http://www.weixiushen.com/
  • 35. Fine-grained image retrieval (con’t) Selective Convolutional Descriptor Aggregation (SCDA) [Wei et al., IEEE TIP 2017] Figure 1. Pipeline of the proposed SCDA method: (a) input image, (b) convolutional activation tensor, (c) descriptor selection, (d) descriptor aggregation, yielding an SCDA feature. Annotations are expensive and unrealistic in many real applications; in answer to this difficulty, there are attempts to categorize fine-grained images with only image-level labels. This work handles a more challenging but more realistic task, i.e., fine-grained image retrieval. http://www.weixiushen.com/
  • 36. Fine-grained image retrieval (con’t) Notations [Wei et al., IEEE TIP 2017] The term "feature map" indicates the convolution results of one channel; the term "activations" indicates the feature maps of all channels in a convolution layer; and the term "descriptor" indicates the d-dimensional component vector of activations. "pool5" refers to the activations of the max-pooled last convolution layer, and "fc8" refers to the activations of the last fully connected layer. Given an input image I of size H × W, the activations of a convolution layer are formulated as an order-3 tensor T with h × w × d elements, which include a set of 2-D feature maps S = {S_n} (n = 1, ..., d); S_n of size h × w is the n-th feature map of the corresponding channel. From another point of view, T can also be considered as having h × w cells, each containing one d-dimensional deep descriptor x_(i,j) (i ∈ {1, ..., h}, j ∈ {1, ..., w}, x_(i,j) ∈ R^d). For instance, by employing the popular pre-trained VGG-16 model [25] to extract deep descriptors, a 224 × 224 input yields a 7 × 7 × 512 activation tensor in pool5: 512 feature maps (i.e., S_n) of size 7 × 7, or equivalently 49 deep descriptors of 512-d. What distinguishes SCDA from existing deep learning-based image retrieval methods is that, using only the pre-trained model, SCDA is able to find useful deep descriptors. http://www.weixiushen.com/
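A minimal sketch of these notations using a pre-trained VGG-16 from torchvision (the library choice is an assumption; the paper only specifies VGG-16, and the `weights` argument name depends on the torchvision version):

```python
# A 224x224 input yields a pool5 activation tensor of 512 feature maps of
# size 7x7, i.e., 7x7 = 49 deep descriptors of 512-d, matching the notation above.
import torch
import torchvision.models as models

# torchvision >= 0.13; older versions use models.vgg16(pretrained=True)
vgg16 = models.vgg16(weights="IMAGENET1K_V1").eval()
image = torch.randn(1, 3, 224, 224)          # stand-in for a preprocessed image

with torch.no_grad():
    pool5 = vgg16.features(image)            # (1, 512, 7, 7): d feature maps of h x w
_, d, h, w = pool5.shape
feature_maps = pool5.squeeze(0)              # (512, 7, 7): S_n, n = 1..d
descriptors = feature_maps.permute(1, 2, 0).reshape(h * w, d)  # (49, 512): x_(i,j)
```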
  • 37. Fine-grained image retrieval (con’t) [Wei et al., IEEE TIP 2017] Figure 2. Sampled feature maps of four fine-grained images (e.g., the 468-th, 375-th, 284-th, 108-th, 481-th, 245-th, 6-th, and 163-th channels). Although the images are resized for better visualization, the method can deal with images of any resolution. http://www.weixiushen.com/
  • 38. Fine-grained image retrieval (con’t) Obtaining the activation map by summarizing feature maps [Wei et al., IEEE TIP 2017] If a region has a high activation response, we could expect this region to be an object rather than the background. Therefore, the obtained activation tensor is added up through the depth direction: the h × w × d 3-D tensor becomes an h × w 2-D tensor, called the "activation map", i.e., A = Σ_n S_n (where S_n is the n-th feature map in pool5). The activation map has h × w summed activation responses, corresponding to h × w positions; the higher the activation response at a particular position (i, j), the more possible it is that the corresponding region is part of the object. The mean value ā of all positions in A is then used as the threshold to decide which positions belong to objects: a position (i, j) whose activation response is higher than ā indicates that the main object, e.g., a bird or a dog, might appear there. The mask map M of the same size as A is obtained as M_(i,j) = 1 if A_(i,j) > ā, and 0 otherwise (Eq. 1). Figure 3 shows examples of the mask map for birds and dogs. http://www.weixiushen.com/
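A minimal sketch of this step, assuming the pool5 activations are already available as a (d, h, w) PyTorch tensor; the helper name `activation_mask` is illustrative:

```python
# Sum the h x w x d activation tensor over the depth (channel) direction to get
# the "activation map" A, then threshold by its mean value to get the mask map M.
import torch

def activation_mask(pool5):                  # pool5: (d, h, w) tensor
    A = pool5.sum(dim=0)                     # activation map, (h, w)
    a_bar = A.mean()                         # threshold: mean activation value
    M = (A > a_bar).to(torch.uint8)          # mask map, 1 = likely object position
    return A, M
```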
  • 39. Fine-grained image retrieval (con’t) Visualization of the mask map M [Wei et al., IEEE TIP 2017] Figure 3. Visualization of the mask map M and the corresponding largest connected component M̃. The selected regions are highlighted in red. http://www.weixiushen.com/
  • 40. Fine-grained image retrieval (con’t) Selecting useful deep convolutional descriptors [Wei et al., IEEE TIP 2017] Figure 4. Selecting useful deep convolutional descriptors: (a) input image, (b) feature maps, (c) activation map, (d) mask map, (e) the largest connected component of the mask map. Therefore, M̃ is used to select useful and meaningful deep convolutional descriptors: the descriptor x_(i,j) should be kept when M̃_(i,j) = 1, while M̃_(i,j) = 0 means the position (i, j) might contain background or noisy parts, giving F = { x_(i,j) | M̃_(i,j) = 1 } (Eq. 2). http://www.weixiushen.com/
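A possible implementation of this selection step, assuming NumPy arrays and using `scipy.ndimage.label` for the largest-connected-component part (the library choice and fallback behaviour are assumptions, not the authors' code):

```python
# Keep only the largest connected component of the mask map, then select the
# descriptors at positions where that component equals 1, i.e. the set F above.
import numpy as np
from scipy.ndimage import label

def select_descriptors(mask, descriptors):   # mask: (h, w) 0/1; descriptors: (h*w, d)
    labeled, num = label(mask)                # connected components of the mask
    if num == 0:
        return descriptors                    # fallback: keep everything
    sizes = np.bincount(labeled.ravel())[1:]  # component sizes (skip background 0)
    largest = (labeled == (np.argmax(sizes) + 1))
    keep = largest.ravel()                    # positions with M~_(i,j) = 1
    return descriptors[keep]                  # F = { x_(i,j) | M~_(i,j) = 1 }
```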
  • 41. Fine-grained image retrieval (con’t) Qualitative evaluation [Wei et al., IEEE TIP 2017] Samples of predicted object localization bounding boxes on CUB200-2011 and Stanford Dogs; the ground-truth bounding box is marked as the red dashed rectangle. http://www.weixiushen.com/
  • 42. Fine-grained image retrieval (con’t) Aggregating convolutional descriptors [Wei et al., IEEE TIP 2017] After selecting F = { x_(i,j) | M̃_(i,j) = 1 }, several encoding or pooling approaches are compared for aggregating these convolutional features. (1) VLAD [14] uses k-means to find a codebook of K centroids {c_1, ..., c_K} and maps x_(i,j) into a single vector v_(i,j) = [0 ... 0 (x_(i,j) − c_k) ... 0] ∈ R^(K×d), where c_k is the closest centroid to x_(i,j); the final representation is Σ_(i,j) v_(i,j). (2) Fisher Vector [15] is similar to VLAD, but uses a soft assignment (i.e., a Gaussian Mixture Model) instead of k-means and also includes second-order statistics. For the parameter choice of VLAD/FV, the number of clusters in VLAD and the number of Gaussian components in FV are both set to 2; larger values lead to lower accuracy. (3) Pooling approaches: two traditional pooling approaches, max-pooling and average-pooling, are also tried to aggregate the deep descriptors. After encoding or pooling into a single vector, square-root normalization and ℓ2-normalization are applied for VLAD and FV; for max- and average-pooling, only ℓ2-normalization is applied (the square-root normalization did not work well). Finally, cosine similarity is used for nearest-neighbor search. http://www.weixiushen.com/
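A small sketch of the "avg&maxPool" aggregation described above (the helper name is illustrative; the selected descriptors are assumed to be a NumPy array of shape (num_kept, d)):

```python
# Average-pool and max-pool the selected descriptors, concatenate them
# (1,024-d for 512-d descriptors), and L2-normalize before cosine-similarity search.
import numpy as np

def scda_aggregate(selected):                 # selected: (num_kept, d)
    avg = selected.mean(axis=0)               # avgPool,  (d,)
    mx = selected.max(axis=0)                 # maxPool,  (d,)
    feat = np.concatenate([avg, mx])          # avg&maxPool, (2d,)
    return feat / (np.linalg.norm(feat) + 1e-12)
```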
  • 43. Fine-grained image retrieval (con’t) Comparing different encoding or pooling methods [Wei et al., IEEE TIP 2017]
Table 2. Comparison of different encoding or pooling approaches for FGIR (top-1 / top-5):
Approach        Dimension   CUB200-2011          Stanford Dogs
VLAD            1,024       55.92% / 62.51%      69.28% / 74.43%
Fisher Vector   2,048       52.04% / 59.19%      68.37% / 73.74%
avgPool         512         56.42% / 63.14%      73.76% / 78.47%
maxPool         512         58.35% / 64.18%      70.37% / 75.59%
avg&maxPool     1,024       59.72% / 65.79%      74.86% / 79.24%   (SCDA)
The "avg&maxPool" aggregation performs significantly and consistently better than the others, and is used as the SCDA feature to represent the whole fine-grained image. http://www.weixiushen.com/
  • 44. Fine-grained image retrieval (con’t) Multiple layer ensemble [Wei et al., IEEE TIP 2017] As studied in [31,32], the ensemble of multiple layers boosts the final performance. Thus, another SCDA feature produced from the relu5_2 layer (three layers in front of pool5 in the VGG-16 model [25]) is also incorporated. Following pool5, the mask map M_relu5_2 is obtained from relu5_2. Its activations are less related to the semantic meaning than those of pool5, so there are many noisy parts; however, the object (e.g., the bird) is more accurately detected than with pool5. Therefore, M̃_pool5 and M_relu5_2 are combined to get the final mask map of relu5_2: M̃_pool5 is first upsampled to the size of M_relu5_2, and descriptors are kept when their positions in both M̃_pool5 and M_relu5_2 are 1. The aggregation process remains the same. Finally, the SCDA features of relu5_2 and pool5 are concatenated into a single representation, denoted by "SCDA+": SCDA+ = [SCDA_pool5, α × SCDA_relu5_2] (Eq. 3), where α is the coefficient for SCDA_relu5_2 and is set to 0.5 for FGIR; ℓ2-normalization follows. In addition, another SCDA+ of the horizontal flip of the original image is incorporated, denoted as "SCDA_flip+" (4,096-d). Figure: the mask map and its corresponding largest connected component of different CNN layers ((a) M of pool5, (b) M̃ of pool5, (c) M of relu5_2, (d) M̃ of relu5_2). http://www.weixiushen.com/
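A sketch of the SCDA+ concatenation in Eq. (3), assuming the two SCDA features (pool5 and relu5_2) have already been computed as NumPy vectors; the helper name and the placement of the final ℓ2-normalization follow the slide's description:

```python
# Concatenate the pool5 SCDA feature with the relu5_2 SCDA feature weighted by
# alpha (0.5 for FGIR), then L2-normalize; "SCDA_flip+" would further concatenate
# the SCDA+ feature of the horizontally flipped image (4,096-d in total).
import numpy as np

def scda_plus(scda_pool5, scda_relu52, alpha=0.5):
    feat = np.concatenate([scda_pool5, alpha * scda_relu52])  # 2,048-d SCDA+
    return feat / (np.linalg.norm(feat) + 1e-12)
```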
  • 45. Fine-grained image retrieval (con’t) [Wei et al., IEEE TIP 2017] Figure 7. Some retrieval results on four fine-grained datasets; the query images are shown on the left. http://www.weixiushen.com/
  • 46. Fine-grained image retrieval (con’t) Quality demonstration of the SCDA feature [Wei et al., IEEE TIP 2017] Figure 8. Quality demonstrations of the SCDA feature: from top to bottom of each column, six returned original images in the order of one sorted dimension of the "256-d SVD+whitening" feature. http://www.weixiushen.com/
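The "256-d SVD+whitening" compression mentioned above could look roughly like the sketch below. This is an illustrative NumPy re-implementation under stated assumptions (mean-centering, top singular directions, whitening by the singular values), not the authors' exact procedure:

```python
# Project SCDA features onto the top singular directions and whiten them,
# producing compact (e.g., 256-d) retrieval features.
import numpy as np

def svd_whiten(features, dim=256, eps=1e-8):  # features: (num_images, D)
    mean = features.mean(axis=0)
    X = features - mean
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    proj = Vt[:dim].T / (S[:dim] + eps)       # whitening: divide by singular values
    compressed = X @ proj                     # (num_images, dim)
    norms = np.linalg.norm(compressed, axis=1, keepdims=True) + 1e-12
    return compressed / norms, mean, proj     # reuse (mean, proj) for new queries
```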
  • 47. Deep Descriptor Transforming (DDT): reusing CNN pre-trained models for image co-localization. Image co-localization (a.k.a. unsupervised object discovery) is a fundamental computer vision problem, which simultaneously localizes objects of the same category across a set of distinct images. PCA is widely used in machine learning and computer vision for dimension reduction and noise reduction; in DDT, it is utilized to obtain projection directions for transforming the deep descriptors {x_(i,j)} and evaluating their correlations. On each projection direction, the corresponding principal component's values are treated as the cues for image co-localization, especially the first principal component; thanks to this property, DDT is also able to handle data noise. For a set of N images containing objects from the same category, the corresponding convolutional descriptors (X^1, ..., X^N) are first collected by feeding the images into a pre-trained CNN model, and the mean vector of all the descriptors is calculated as x̄ = (1/K) Σ_n Σ_(i,j) x^n_(i,j) (Eq. 1), where K = h × w × N (assuming, for presentation clarity, that each image has the same number h × w of deep descriptors; the method handles input images of arbitrary resolutions). Fine-grained image retrieval (con’t) [Wei et al., IJCAI 2017] http://www.weixiushen.com/
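A minimal sketch of Eqs. (1)–(2): the descriptors of all N images are stacked and the mean vector and covariance matrix are computed over all K = h·w·N descriptors. The stacking and the helper name are illustrative; each image's descriptors are assumed to be stored as an (h·w, d) NumPy array.

```python
# Compute the mean descriptor (Eq. 1) and the covariance matrix (Eq. 2)
# over the deep descriptors of the whole image set.
import numpy as np

def ddt_statistics(descriptor_sets):              # list of (h*w, d) arrays, one per image
    X = np.concatenate(descriptor_sets, axis=0)   # (K, d), K = h*w*N
    x_bar = X.mean(axis=0)                        # Eq. (1): mean descriptor
    centered = X - x_bar
    cov = centered.T @ centered / X.shape[0]      # Eq. (2): covariance matrix, (d, d)
    return x_bar, cov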
  • 48. Deep Descriptor Transforming (con’t): CNN pre-trained models + Deep Descriptor Transforming. Figure 1: Pipeline of the proposed DDT method for image co-localization; in this instance, the goal is to localize the airplane within each image (note that there might be a few noisy images in the image set). After obtaining the covariance matrix Cov(x) = (1/K) Σ_n Σ_(i,j) (x^n_(i,j) − x̄)(x^n_(i,j) − x̄)^T (Eq. 2), the eigenvectors ξ_1, ..., ξ_d of Cov(x) corresponding to the sorted eigenvalues λ_1 ≥ ... ≥ λ_d ≥ 0 are obtained. Since the first principal component has the largest variance, the eigenvector ξ_1 corresponding to the largest eigenvalue is taken as the main projection direction. For the deep descriptor at a particular position (i, j) of an image, its first principal component is calculated as p^1_(i,j) = ξ_1^T (x_(i,j) − x̄) (Eq. 3). According to their spatial locations, all p^1_(i,j) from an image are combined into a 2-D h × w matrix, called the indicator matrix P^1. Positive values of P^1 indicate positions where the common object is likely to appear, while the negative part corresponds to background; the zero threshold divides the two, and the bounding box containing the largest connected component of the positive regions is returned as the co-localization prediction. If P^1 contains no positive values, no common object exists, so DDT can also be used for detecting noisy images. Compared with SCDA, which assumes only one main object of interest and localizes it using cues from a single image, DDT leverages the correlations among descriptors across the whole image set, so it works for images containing diverse objects and can handle data noise. Experiments are conducted on four benchmark co-localization datasets (Object Discovery, PASCAL VOC 2007 / VOC 2012, and ImageNet Subsets) using the CorLoc metric, together with a study on detecting noisy images. Fine-grained image retrieval (con’t) [Wei et al., IJCAI 2017] http://www.weixiushen.com/
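A sketch of the projection and localization steps (Eq. 3 and the zero-threshold rule), assuming NumPy and `scipy.ndimage.label`; note that the sign of the leading eigenvector is ambiguous in practice, a detail this sketch does not handle.

```python
# Project each descriptor of one image onto the leading eigenvector of the
# covariance matrix, reshape the values into the h x w indicator matrix P1,
# keep positive entries, and take the largest connected component as the
# predicted object region (None signals a noisy image with no common object).
import numpy as np
from scipy.ndimage import label

def ddt_localize(descriptors, x_bar, cov, h, w):  # descriptors: (h*w, d) of one image
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    xi1 = eigvecs[:, -1]                          # leading eigenvector xi_1
    p1 = (descriptors - x_bar) @ xi1              # Eq. (3), one value per cell
    P1 = p1.reshape(h, w)                         # indicator matrix
    positive = P1 > 0                             # zero threshold
    if not positive.any():
        return None                               # no common object detected
    labeled, _ = label(positive)
    sizes = np.bincount(labeled.ravel())[1:]
    return labeled == (np.argmax(sizes) + 1)      # largest connected component mask
```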