5. R-CNN: Pipeline Overview
Step 1. Input an image.
Step 2. Use selective search to obtain ~2k region proposals.
Step 3. Warp each proposal to a fixed size and apply a CNN to extract its features.
Step 4. Apply class-specific SVMs to score each proposal.
Step 5. Rank the proposals and apply NMS to obtain the bounding boxes.
Step 6. Use class-specific regressors to refine the box positions.
Ross Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014.
Page 5 of 25
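Step 5's greedy NMS can be sketched in a few lines of numpy; this is a minimal illustration, not the paper's exact implementation, and the 0.3 IoU threshold is just an example value.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the top-scoring box with the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Drop boxes that overlap the kept box too much.
        order = rest[iou <= iou_thresh]
    return keep
```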
7. R-CNN: Limitation
• Too slow: 13 s/image on a GPU, 53 s/image on a CPU (and ~7x slower with VGG-Net).
• Proposals must be warped to a fixed size.
9. SPP-Net: Motivation
• Cropping may lose some information about the object.
• Warping may change the object's appearance.
He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, TPAMI 2015.
10. SPP-Net: Spatial Pyramid Pooling (SPP) Layer
• FC layers need a fixed-length input, while conv layers can handle arbitrary input sizes.
• Thus we need a bridge between the conv and FC layers.
• Here comes the SPP layer.
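The SPP layer itself can be sketched in numpy. The pyramid levels (1, 2, 4) below are illustrative, and the bin edges use a simple rounding scheme rather than the paper's exact floor/ceil rule; the point is that the output length depends only on the channel count and the pyramid, never on the input H x W.

```python
import numpy as np

def spp(feat, levels=(1, 2, 4)):
    """Spatial pyramid max-pooling: C x H x W feature map -> fixed-length vector.

    For each pyramid level n, the map is divided into an n x n grid and
    max-pooled per cell, so the output length C * sum(n*n) is independent
    of H and W.
    """
    c, h, w = feat.shape
    out = []
    for n in levels:
        # Bin edges that cover the map even when H, W are not divisible by n.
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feat[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                out.append(cell.max(axis=(1, 2)))
    return np.concatenate(out)
```

Two feature maps of different spatial size both pool to a vector of length C * (1 + 4 + 16), which is exactly what the FC layers need.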
11. SPP-Net: Training for Detection (1)
[Figure: image pyramid → conv layers → pyramid of Conv5 feature maps]
Step 1. Generate an image pyramid and extract the conv feature map of the whole image at each scale.
12. SPP-Net: Training for Detection (2)
• Step 2. For each proposal, walk the image pyramid and find the scaled version whose size is closest to 224x224. (For scale invariance in training.)
• Step 3. Find the corresponding region in the Conv5 feature map and use the SPP layer to pool it to a fixed size.
• Step 4. After extracting all the proposals' features, fine-tune the FC layers only.
• Step 5. Train the class-specific SVMs.
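Step 2's scale selection can be sketched as below; the scale factors are hypothetical placeholders rather than the paper's actual pyramid sizes, and the proposal is assumed to scale linearly with the image.

```python
def best_scale(w, h, scale_factors=(0.5, 1.0, 1.5, 2.0, 3.0)):
    """Pick the pyramid scale factor at which the resized proposal's area
    is closest to 224*224 (the SPP-Net training heuristic).

    w, h: proposal size in the original image.
    scale_factors: a hypothetical pyramid, for illustration only.
    """
    target = 224 * 224
    return min(scale_factors, key=lambda s: abs((w * s) * (h * s) - target))
```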
13. SPP-Net: Testing for Detection
• Almost the same as R-CNN, except for Step 3.
14. SPP-Net: Performance
• Speed: 64x faster than R-CNN using one scale, and 24x faster using a five-scale pyramid.
• mAP: +1.2 over R-CNN.
15. SPP-Net: Limitation
1. Training is a multi-stage pipeline (conv layers → FC layers → SVM → regressor, with features stored to disk in between).
2. Training is expensive in space and time.
18. Fast R-CNN: Joint Training Framework
Joins the feature extractor, classifier, and regressor together in a unified framework.
19. Fast R-CNN: RoI pooling layer
≈ a one-scale SPP layer
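A minimal numpy sketch of RoI max pooling, i.e. a single-level spatial pyramid over a cropped region; the bin-edge rounding is a simplification of the real implementation.

```python
import numpy as np

def roi_pool(feat, roi, out_size=7):
    """RoI max pooling: crop a region of a C x H x W feature map and
    max-pool it to a fixed out_size x out_size grid.

    roi = (x1, y1, x2, y2) in feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    region = feat[:, y1:y2, x1:x2]
    c, h, w = region.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((c, out_size, out_size))
    for i in range(out_size):
        # Guarantee each bin is at least one pixel tall/wide.
        y0, yb = ys[i], max(ys[i + 1], ys[i] + 1)
        for j in range(out_size):
            x0, xb = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = region[:, y0:yb, x0:xb].max(axis=(1, 2))
    return out
```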
20. Fast R-CNN: Regression Loss
A smooth L1 loss, which is less sensitive to outliers than the L2 loss.
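The smooth L1 loss can be written out directly; this matches the Fast R-CNN definition, quadratic for |x| < 1 and linear beyond, so large regression errors contribute a bounded gradient.

```python
import numpy as np

def smooth_l1(x):
    """Fast R-CNN smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)
```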
21. Fast R-CNN: Scale Invariance
• Two strategies: brute force (single scale) vs. image pyramids (multi-scale).
[Figure: single-scale input → conv → Conv5 feature map]
• In practice, a single scale is good enough. (This is the main reason it is ~10x faster than SPP-Net.)
22. Fast R-CNN: Other tricks
• SVD on FC layers: 30% speed-up at testing time with a small performance drop.
• Which layers to fine-tune? Freezing the shallow conv layers reduces training time with only a small performance drop.
• Data augmentation: using VOC12 as an additional training set boosts mAP by ~3%.
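The SVD trick factorizes an FC weight matrix into two thin layers. A minimal numpy sketch, assuming a weight matrix of shape (out_features, in_features):

```python
import numpy as np

def compress_fc(W, k):
    """Truncated-SVD compression of an FC weight matrix W (out x in):
    W ~= U_k @ V_k, replacing one big layer with two thin ones."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k = U[:, :k] * s[:k]  # out x k, singular values folded in
    V_k = Vt[:k, :]         # k x in
    return U_k, V_k
```

At test time, y ≈ U_k @ (V_k @ x): the parameter count drops from out*in to k*(out + in), which is where the ~30% speed-up comes from when k is small.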
23. Fast R-CNN: Performance
• Without data augmentation, mAP improves by only +0.9 on VOC07.
• But training and testing time are greatly reduced (training 9x, testing 213x vs. R-CNN).
• Without data augmentation, mAP improves by +2.3 on VOC12.
27. Unified Approach: Framework
• Moves away from a classification-style pipeline.
• Uses a deep network similar to GoogLeNet.
• Divides the image into a 7x7 grid.
• Each grid cell is responsible for predicting the objects whose centers fall in that cell.
• Predicts the class probabilities and coordinates for the object.
• Testing time is reduced significantly, as no region proposals are required.
• The loss function is a combination of the class-probability error and the bounding-box regression error, as in Fast R-CNN.
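The grid-cell assignment can be sketched as follows, with S = 7 as on the slide; the pixel coordinates in the test are illustrative.

```python
def grid_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the S x S grid cell responsible for an
    object whose center is at pixel (cx, cy)."""
    col = min(int(cx / img_w * S), S - 1)  # clamp so cx == img_w stays in range
    row = min(int(cy / img_h * S), S - 1)
    return row, col
```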
28. Unified Approach: Training
• Most grid parameters will tend towards zero, as one object contributes to only one grid cell.
• Introduce an extra probability for background vs. foreground.
• The class-probability error loss is activated only for foreground cells.
• Optimize for Pr(Class | object) rather than Pr(Class).
• Final probabilities are calculated as Pr(object) * Pr(Class | object).
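The final score is a simple per-cell product of the objectness and the conditional class probabilities:

```python
import numpy as np

def class_scores(p_obj, p_class_given_obj):
    """Final per-class scores for one grid cell:
    Pr(class_i) = Pr(object) * Pr(class_i | object)."""
    return p_obj * np.asarray(p_class_given_obj, dtype=float)
```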
29. Unified Approach: Training
• Run the initial iterations minimizing the Pr(object) and Pr(Class | object) losses separately.
• Joint minimization can be run in later stages.
• The network predicts the bounding box using convolutions over the whole image.
• This introduces localization error.
• Penalize predictions with lower IoU by rescaling the target probability to the IoU instead of 1.
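The IoU used to rescale the targets is the standard box overlap:

```python
def iou(a, b):
    """Intersection over union of two boxes a, b = (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```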
33. Unified Approach: Network
• A variant of GoogLeNet, with pooling layers replaced by convolutional layers, which helps in localizing objects.
• A leaky ReLU activation, f(x) = x for x > 0 and 0.1x otherwise, increases mAP.
• A logistic layer at the end enforces predictions within 0 to 1.
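A minimal numpy sketch of these two layers, assuming the standard leaky ReLU with negative slope 0.1 and the standard logistic sigmoid:

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """f(x) = x for x > 0, slope * x otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, slope * x)

def logistic(x):
    """Sigmoid squashing predictions into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))
```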
34. Unified Approach: Saliency
• Predicts at 45 fps.
• Competitive performance with Fast R-CNN using CaffeNet (mAP = 58.8) on VOC 2007.
• Almost 95 times faster than Fast R-CNN.
More details to be published in an upcoming paper.