http://imatge-upc.github.io/retrieval-2017-cam/
Image retrieval in realistic scenarios targets large dynamic datasets of unlabeled images. In these cases, training or fine-tuning a model every time new images are added to the database is neither efficient nor scalable.
Convolutional neural networks trained for image classification over large datasets have proven to be effective feature extractors when transferred to the task of image retrieval. The most successful approaches are based on encoding the activations of convolutional layers, as they convey the image spatial information. Our proposal goes further, aiming at a locally-aware encoding of these features depending on the predicted image semantics, with the advantage of using only the knowledge contained inside the network.
In particular, we employ Class Activation Maps (CAMs) to obtain the most discriminative regions from a semantic perspective. Additionally, CAMs are also used to generate object proposals during an unsupervised re-ranking stage after a first fast search.
Our experiments on two publicly available datasets for instance retrieval, Oxford5k and Paris6k, demonstrate that our system is competitive and even outperforms the current state-of-the-art when using off-the-shelf models trained on the object classes of ImageNet.
4. Image Retrieval
Task Definition
Given an image query, generate a ranked list of similar images.
What properties do we want for these systems?
5. Image Retrieval
Desirable System Properties
● Invariant to scale, illumination and translation
● Able to provide a fast search
● Memory efficient
6. Image Retrieval
Desirable System Properties
How to achieve them?
8. Image Retrieval
Image Representations
● Hand-Crafted Features (Before 2012)
○ SIFT
○ SIFT + BoW
○ VLAD
○ Fisher Vectors
Josef Sivic, Andrew Zisserman, et al. Video google: A text retrieval approach to object matching in videos. In ICCV, 2003.
9. Image Retrieval
Image Representations
● Learned Features (After 2012)
○ Convolutional Neural Networks (CNNs)
[Figure: VGG-16 — input image → convolutional + max-pooling layers → flattening → fully-connected layers → softmax activation → class predictions]
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
10. Image Retrieval
Image Representations
● Learned Features (After 2012)
○ Convolutional Neural Networks (CNNs)
Training is expensive! It requires large datasets, labels, and computational power (GPUs).
11. Image Representations
Challenges for Learning Features in Retrieval
Retrieval systems target dynamic datasets of unlabeled images:
➔ Collecting, annotating and cleaning a large dataset to train models requires great effort.
➔ Training or fine-tuning a model every time new data is added is neither efficient nor scalable.
12. Image Representations
Challenges for Learning Features
One solution that does not involve training is Transfer Learning:
➔ Transferring the knowledge of a model trained on a large-scale dataset (e.g. ImageNet).
Classification → Retrieval. How?
15. Image Representations
Fully-Connected Features
Fully-connected features as a global representation ("Neural Codes").
[Figure: VGG-16 pipeline with the fully-connected activations used as the image descriptor]
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. Neural codes for image retrieval. ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. CNN features off-the-shelf: an astounding baseline for recognition. DeepVision CVPRW 2014.
16. Image Representations
Convolutional Features
Convolutional features as a global representation:
➔ They convey the spatial information of the image.
[Figure: VGG-16 pipeline with the convolutional feature maps used as the image descriptor]
Babenko, A., & Lempitsky, V. Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. Cross-dimensional weighting for aggregated deep convolutional features. arXiv preprint arXiv:1512.04065.
17. Image Representations
From Feature Maps to Compact Representations
An input image of size Ih x Iw x 3, convolved with K filters of size (Fh x Fw x 3) with stride S and padding P, produces (H x W) x K feature maps, where:
H = (Ih − Fh + 2P) / S + 1
W = (Iw − Fw + 2P) / S + 1
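The output-size formula above can be checked with a small helper (a sketch; function and variable names are ours):

```python
def conv_output_size(i, f, p, s):
    """Spatial output size of a convolution: (I - F + 2P) / S + 1."""
    return (i - f + 2 * p) // s + 1

# A 224x224 input with 3x3 filters, padding 1 and stride 1 keeps its
# spatial size (this is the configuration of VGG-16 conv layers).
h = conv_output_size(224, 3, 1, 1)
w = conv_output_size(224, 3, 1, 1)
print(h, w)  # 224 224
```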
18. Image Representations
From Feature Maps to Compact Representations
Pooling operation: 3 feature maps (3x3) → compact vector (1x3)
Max pooling over each feature map:
[1 0 1; 0 1 0; 9 0 4] → 9
[1 2 1; 0 0 0; 0 0 0] → 2
[0 0 1; 0 0 0; 5 1 1] → 5
19. Image Representations
From Feature Maps to Compact Representations
Pooling operation: 3 feature maps (3x3) → compact vector (1x3)
Sum pooling over each feature map:
[1 0 1; 1 1 0; 9 0 4] → 17
[1 2 1; 0 0 0; 0 0 0] → 4
[0 0 1; 0 0 0; 5 1 1] → 8
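Both pooling operations reduce each of the K feature maps to a single number, giving a compact 1xK vector. A minimal numpy sketch using the three feature maps from the sum-pooling example:

```python
import numpy as np

# The three 3x3 feature maps from the sum-pooling slide.
maps = np.array([
    [[1, 0, 1], [1, 1, 0], [9, 0, 4]],
    [[1, 2, 1], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 1], [0, 0, 0], [5, 1, 1]],
])

max_pooled = maps.max(axis=(1, 2))  # one value per feature map
sum_pooled = maps.sum(axis=(1, 2))
print(max_pooled)  # [9 2 5]
print(sum_pooled)  # [17  4  8]
```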
20. Image Representations
Related Works
Combine and aggregate convolutional features:
● SPoC → Gaussian prior + Sum Pooling
● R-MAC → Rigid grid of regions + Max Pooling
● BoW → Bag of Visual Words
● CroW → Spatial weighting based on the normalized sum of feature maps (boosts salient content) + channel weighting based on sparsity + Sum Pooling
25. Class Activation Maps
Obtaining Discriminative Regions
[Figure: VGG-16 with a CAM layer — convolutional feature maps go through Global Average Pooling (GAP), and class weights w1 ... wN feed the class predictions (Class 1: Tennis Ball ... Class N)]
CAMi = w1,i ∗ conv1 + w2,i ∗ conv2 + … + wN,i ∗ convN
B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. CVPR 2016.
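The CAM of a class is just a weighted sum of the convolutional feature maps, using that class's weights from the GAP classifier. A sketch with random data (the shapes are illustrative, not the actual VGG-16 CAM model's):

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, W = 512, 14, 14          # feature maps of the last conv layer
conv = rng.random((K, H, W))   # conv_1 ... conv_K
w_i = rng.random(K)            # weights of class i in the GAP->softmax layer

# CAM_i = w_1,i * conv_1 + w_2,i * conv_2 + ... + w_K,i * conv_K
cam = np.tensordot(w_i, conv, axes=1)    # shape (H, W)

# Normalize to [0, 1], as done before weighting the features.
cam = (cam - cam.min()) / (cam.max() - cam.min())
print(cam.shape, cam.min(), cam.max())   # (14, 14) 0.0 1.0
```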
28. Proposal
Class-Weighted Convolutional Features
Encode images dynamically, combining their semantics with convolutional features, using only the knowledge contained inside a trained network.
[Figure: CAM layer with GAP and class weights w1 ... wN producing a compact descriptor]
30. Image Encoding Pipeline
1. Features & CAMs Extraction
In a single forward pass we extract the convolutional features (depth = K) and the image CAMs, normalized to [0, 1].
[Figure: VGG-16 CAM model with GAP and class weights; feature extraction and CAMs computation branch from the same forward pass]
32. Image Encoding Pipeline
3. Descriptor Aggregation
F1 = [ f1,1 , f1,2 , … , f1,K ]
F2 = [ f2,1 , f2,2 , … , f2,K ]
...
FN = [ fN,1 , fN,2 , … , fN,K ]
Each class vector is PCA-whitened and l2-normalized; the results are aggregated and l2-normalized again into the image descriptor DI = [ d1 , … , dK ].
➔ Which classes do we aggregate?
➔ What is the optimal number of classes (NC)?
➔ What is the optimal number of classes (NPCA) to compute the PCA?
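A simplified numpy sketch of the aggregation step. To keep it short, the PCA whitening is omitted and only the l2 / sum / l2 flow is shown; K, NC and the class vectors are made up:

```python
import numpy as np

def l2(v):
    """l2-normalize a vector."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
K, n_c = 512, 4                       # descriptor size, classes aggregated
class_vectors = rng.random((n_c, K))  # F_1 ... F_NC, one per selected class

# l2-normalize each class vector, sum them, l2-normalize the result.
# (The pipeline additionally applies PCA whitening before the first l2.)
d_i = l2(sum(l2(f) for f in class_vectors))
print(d_i.shape, round(float(np.linalg.norm(d_i)), 6))  # (512,) 1.0
```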
33. Descriptor Aggregation Strategies
Online Aggregation (OnA)
[Figure: dataset images go through the image encoding pipeline and all their class vectors (Fc) are stored; at query time the query goes through the same pipeline, the CNN predicts its most probable classes (e.g. Retriever, Seashore, ...), the stored class vectors for those classes are aggregated into dataset descriptors, and scores are computed against the query descriptor]
36. Experimental Setup
● Framework: Keras with Theano as backend / PyTorch
● Images resized to 1024x720 (keeping the aspect ratio)
● We explored using different CNNs as feature extractors
39. Experimental Setup
● VGG-16 CAM model as feature extractor (pre-trained on ImageNet)
● Features from the Conv5_1 layer
[Figure: VGG-16 CAM model; features are taken from Conv5_1, before the CAM layer and GAP]
40. Experimental Setup
● Datasets
○ Oxford 5k
○ Paris 6k
○ 100k distractors (Flickr) → Oxford105k, Paris106k
● Using the features that fall inside the cropped query
● PCA computed with Oxford5k when testing on Paris6k, and vice versa
Philbin, J., Chum, O., Isard, M., Sivic, J. and Zisserman, A. Object retrieval with large vocabularies and fast spatial matching. CVPR 2007.
Philbin, J., Chum, O., Isard, M., Sivic, J. and Zisserman, A. Lost in Quantization: Improving Particular Object Retrieval in Large Scale Image Databases. CVPR 2008.
41. Experimental Setup
● Scores computed with cosine similarity
● Evaluation metric: mean Average Precision (mAP)
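With l2-normalized descriptors, cosine similarity reduces to a dot product, so ranking the whole dataset against a query is one matrix-vector multiply. A sketch with random descriptors (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.random((1000, 512))
db /= np.linalg.norm(db, axis=1, keepdims=True)  # l2-normalized dataset
q = rng.random(512)
q /= np.linalg.norm(q)                           # l2-normalized query

scores = db @ q                  # cosine similarities, one per dataset image
ranking = np.argsort(-scores)    # best matches first
print(ranking[:5])
```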
42. Ablation Studies
Finding the Optimal Parameters
● We carried out ablation studies to validate our method
○ Computational burden
○ Number of top classes to aggregate: NC
○ Number of top classes to compute the PCA matrix: NPCA
45. Ablation Studies
Sensitivity with Respect to NC and NPCA
NC is the number of classes aggregated. NPCA is the number of classes used to compute the PCA (computed with data from Oxford when testing on Paris, and vice versa).
46. Ablation Studies
Using Predefined Classes
The first 16 classes correspond to: vault, bell cote, bannister, analog clock, movie theater, coil, pier, dome, pedestal, flagpole, church, chime, suspension bridge, birdhouse, sundial, triumphal arch. All of them are related to buildings.
48. Search Post-Processing
Region Proposals for Re-Ranking
● For the top-R target images
● We establish a set of thresholds based on the max value of the normalized CAM: 1%, 10%, 20%, 30%, 40%
● Generate a bounding box covering the largest connected element
● Compare the descriptor of each new region of the target image with the query descriptor and generate new scores
● Results are slightly better if we average over the 2 most probable CAMs
Query Expansion
● Generate a new query by l2-normalizing the sum of the top-QE target descriptors
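The query expansion step is a few lines: sum the descriptors of the top-QE ranked images and l2-normalize the result. A sketch with random descriptors (QE = 5 is an arbitrary choice here):

```python
import numpy as np

def expand_query(db, ranking, qe=5):
    """New query: l2-normalized sum of the top-QE target descriptors."""
    new_q = db[ranking[:qe]].sum(axis=0)
    return new_q / np.linalg.norm(new_q)

rng = np.random.default_rng(0)
db = rng.random((100, 512))
db /= np.linalg.norm(db, axis=1, keepdims=True)
ranking = np.argsort(-(db @ db[0]))  # rank against image 0 as a stand-in query

q2 = expand_query(db, ranking, qe=5)
print(q2.shape, round(float(np.linalg.norm(q2)), 6))  # (512,) 1.0
```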
49. Region Proposals using CAMs
[Figure: green — ground truth; orange → red — bounding boxes at the different thresholds]
50. Region Proposals using CAMs
[Figure: additional qualitative examples with the same color coding]
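The thresholding-plus-bounding-box step can be sketched as below. For brevity it takes the bounding box of all pixels above the threshold, whereas the method keeps only the largest connected element; the threshold values follow the slides, and the toy CAM is made up:

```python
import numpy as np

def cam_bbox(cam, thr_pct):
    """Bounding box (r0, r1, c0, c1) of CAM pixels above thr_pct * max."""
    mask = cam >= thr_pct * cam.max()
    rows, cols = np.where(mask)
    return rows.min(), rows.max(), cols.min(), cols.max()

# Toy CAM with a hot region in the lower-right corner.
cam = np.zeros((14, 14))
cam[8:12, 9:13] = 1.0

for thr in (0.01, 0.10, 0.20, 0.30, 0.40):
    print(thr, cam_bbox(cam, thr))  # (8, 11, 9, 12) for this toy CAM
```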
53. Conclusions
▷ In this work we proposed a technique to build compact image representations focused on their semantic content.
▷ We employed an image encoding pipeline that makes use of a pre-trained CNN and Class Activation Maps to extract discriminative regions from the image and weight its convolutional features accordingly.
▷ Our experiments demonstrated that selecting the relevant content of an image to build the image descriptor is beneficial and helps increase the retrieval performance. The proposed approach establishes a new state-of-the-art compared to methods that build image representations by combining off-the-shelf features using random or fixed grid regions.