5. R-CNN: Pipeline Overview
Step 1. Input an image.
Step 2. Use selective search to obtain ~2k region proposals.
Step 3. Warp each proposal to a fixed size and apply a CNN to extract its features.
Step 4. Apply class-specific SVMs to score each proposal.
Step 5. Rank the proposals and apply NMS to obtain the bounding boxes.
Step 6. Use class-specific regressors to refine the box positions.
Ross Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014.
Page 5 of 25
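Step 5's greedy NMS can be sketched in a few lines of numpy; this is a minimal illustration, not the paper's exact implementation, and the 0.3 IoU threshold is just an example value.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the top-scoring box with the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Drop boxes that overlap the kept box too much.
        order = rest[iou <= iou_thresh]
    return keep
```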
7. R-CNN: Limitation
• Too slow: 13 s/image on a GPU, 53 s/image on a CPU (and ~7x slower with VGG-Net).
• Proposals must be warped to a fixed size.
9. SPP-Net: Motivation
• Cropping may lose some information about the object.
• Warping may change the object's appearance.
He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, TPAMI 2015.
10. SPP-Net: Spatial Pyramid Pooling (SPP) Layer
• FC layers need a fixed-length input, while conv layers can handle arbitrary input sizes.
• Thus we need a bridge between the conv and FC layers.
• Here comes the SPP layer.
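The SPP layer itself can be sketched in numpy. The pyramid levels (1, 2, 4) below are illustrative, and the bin edges use a simple rounding scheme rather than the paper's exact floor/ceil rule; the point is that the output length depends only on the channel count and the pyramid, never on the input H x W.

```python
import numpy as np

def spp(feat, levels=(1, 2, 4)):
    """Spatial pyramid max-pooling: C x H x W feature map -> fixed-length vector.

    For each pyramid level n, the map is divided into an n x n grid and
    max-pooled per cell, so the output length C * sum(n*n) is independent
    of H and W.
    """
    c, h, w = feat.shape
    out = []
    for n in levels:
        # Bin edges that cover the map even when H, W are not divisible by n.
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feat[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                out.append(cell.max(axis=(1, 2)))
    return np.concatenate(out)
```

Two feature maps of different spatial size both pool to a vector of length C * (1 + 4 + 16), which is exactly what the FC layers need.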
11. SPP-Net: Training for Detection (1)
[Figure: image pyramid → conv layers → pyramid of Conv5 feature maps]
Step 1. Generate an image pyramid and extract the conv feature map of the whole image at each scale.
12. SPP-Net: Training for Detection (2)
• Step 2. For each proposal, walk the image pyramid and find the scaled version whose size is closest to 224x224. (For scale invariance in training.)
• Step 3. Find the corresponding region in the Conv5 feature map and use the SPP layer to pool it to a fixed size.
• Step 4. After extracting all the proposals' features, fine-tune the FC layers only.
• Step 5. Train the class-specific SVMs.
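Step 2's scale selection can be sketched as below; the scale factors are hypothetical placeholders rather than the paper's actual pyramid sizes, and the proposal is assumed to scale linearly with the image.

```python
def best_scale(w, h, scale_factors=(0.5, 1.0, 1.5, 2.0, 3.0)):
    """Pick the pyramid scale factor at which the resized proposal's area
    is closest to 224*224 (the SPP-Net training heuristic).

    w, h: proposal size in the original image.
    scale_factors: a hypothetical pyramid, for illustration only.
    """
    target = 224 * 224
    return min(scale_factors, key=lambda s: abs((w * s) * (h * s) - target))
```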
13. SPP-Net: Testing for Detection
• Almost the same as R-CNN, except for Step 3.
14. SPP-Net: Performance
• Speed: 64x faster than R-CNN using one scale, and 24x faster using a five-scale pyramid.
• mAP: +1.2 over R-CNN.
15. SPP-Net: Limitation
1. Training is a multi-stage pipeline (conv layers → FC layers → SVM → regressor, with features stored to disk in between).
2. Training is expensive in space and time.
18. Fast R-CNN: Joint Training Framework
Joins the feature extractor, classifier, and regressor together in a unified framework.
19. Fast R-CNN: RoI pooling layer
≈ a one-scale SPP layer
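A minimal numpy sketch of RoI max pooling, i.e. a single-level spatial pyramid over a cropped region; the bin-edge rounding is a simplification of the real implementation.

```python
import numpy as np

def roi_pool(feat, roi, out_size=7):
    """RoI max pooling: crop a region of a C x H x W feature map and
    max-pool it to a fixed out_size x out_size grid.

    roi = (x1, y1, x2, y2) in feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    region = feat[:, y1:y2, x1:x2]
    c, h, w = region.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((c, out_size, out_size))
    for i in range(out_size):
        # Guarantee each bin is at least one pixel tall/wide.
        y0, yb = ys[i], max(ys[i + 1], ys[i] + 1)
        for j in range(out_size):
            x0, xb = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = region[:, y0:yb, x0:xb].max(axis=(1, 2))
    return out
```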
20. Fast R-CNN: Regression Loss
A smooth L1 loss, which is less sensitive to outliers than the L2 loss.
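The smooth L1 loss can be written out directly; this matches the Fast R-CNN definition, quadratic for |x| < 1 and linear beyond, so large regression errors contribute a bounded gradient.

```python
import numpy as np

def smooth_l1(x):
    """Fast R-CNN smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)
```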
21. Fast R-CNN: Scale Invariance
• Two strategies: brute force (single scale) vs. image pyramids (multi-scale).
[Figure: single-scale input → conv → Conv5 feature map]
• In practice, a single scale is good enough. (This is the main reason it is ~10x faster than SPP-Net.)
22. Fast R-CNN: Other tricks
• SVD on FC layers: 30% speed-up at testing time with a small performance drop.
• Which layers to fine-tune? Freezing the shallow conv layers reduces training time with only a small performance drop.
• Data augmentation: using VOC12 as an additional training set boosts mAP by ~3%.
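The SVD trick factorizes an FC weight matrix into two thin layers. A minimal numpy sketch, assuming a weight matrix of shape (out_features, in_features):

```python
import numpy as np

def compress_fc(W, k):
    """Truncated-SVD compression of an FC weight matrix W (out x in):
    W ~= U_k @ V_k, replacing one big layer with two thin ones."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k = U[:, :k] * s[:k]  # out x k, singular values folded in
    V_k = Vt[:k, :]         # k x in
    return U_k, V_k
```

At test time, y ≈ U_k @ (V_k @ x): the parameter count drops from out*in to k*(out + in), which is where the ~30% speed-up comes from when k is small.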
23. Fast R-CNN: Performance
• Without data augmentation, mAP improves by only +0.9 on VOC07.
• But training and testing time are greatly reduced (training 9x, testing 213x vs. R-CNN).
• Without data augmentation, mAP improves by +2.3 on VOC12.
27. Unified Approach: Framework
• Moves away from a classification-style pipeline.
• Uses a deep network similar to GoogLeNet.
• Divides the image into a 7x7 grid.
• Each grid cell is responsible for predicting the objects whose centers fall in that cell.
• Predicts the class probabilities and coordinates for the object.
• Testing time is reduced significantly, as no region proposals are required.
• The loss function is a combination of the class-probability error and the bounding-box regression error, as in Fast R-CNN.
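The grid-cell assignment can be sketched as follows, with S = 7 as on the slide; the pixel coordinates in the test are illustrative.

```python
def grid_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the S x S grid cell responsible for an
    object whose center is at pixel (cx, cy)."""
    col = min(int(cx / img_w * S), S - 1)  # clamp so cx == img_w stays in range
    row = min(int(cy / img_h * S), S - 1)
    return row, col
```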
28. Unified Approach: Training
• Most grid parameters will tend towards zero, as one object contributes to only one grid cell.
• Introduce an extra probability for background vs. foreground.
• The class-probability error loss is activated only for foreground cells.
• Optimize for Pr(Class | object) rather than Pr(Class).
• Final probabilities are calculated as Pr(object) * Pr(Class | object).
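The final score is a simple per-cell product of the objectness and the conditional class probabilities:

```python
import numpy as np

def class_scores(p_obj, p_class_given_obj):
    """Final per-class scores for one grid cell:
    Pr(class_i) = Pr(object) * Pr(class_i | object)."""
    return p_obj * np.asarray(p_class_given_obj, dtype=float)
```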
29. Unified Approach: Training
• Run the initial iterations minimizing the Pr(object) and Pr(Class | object) losses separately.
• Joint minimization can be run in later stages.
• The network predicts the bounding box using convolutions over the whole image.
• This introduces localization error.
• Penalize predictions with lower IoU by rescaling the target probability to the IoU instead of 1.
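The IoU used to rescale the targets is the standard box overlap:

```python
def iou(a, b):
    """Intersection over union of two boxes a, b = (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```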
33. Unified Approach: Network
• A variant of GoogLeNet, with pooling layers replaced by convolutional layers, which helps in localizing objects.
• A leaky ReLU activation, f(x) = x for x > 0 and 0.1x otherwise, increases mAP.
• A logistic layer at the end enforces predictions within 0 to 1.
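A minimal numpy sketch of these two layers, assuming the standard leaky ReLU with negative slope 0.1 and the standard logistic sigmoid:

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """f(x) = x for x > 0, slope * x otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, slope * x)

def logistic(x):
    """Sigmoid squashing predictions into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))
```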
34. Unified Approach: Saliency
• Predicts at 45 fps.
• Competitive performance with Fast R-CNN using CaffeNet (mAP = 58.8) on VOC 2007.
• Almost 95 times faster than Fast R-CNN.
More details to be published in an upcoming paper.