The document surveys deep-learning approaches to object detection in images. It begins by framing detection as classification, where a network scores image regions for the object categories they contain. It then splits the network into a classification head and a localization head. Finally, it covers R-CNN, which first generates region proposals and then runs classification and bounding box regression on CNN features of those regions. This addresses the main weakness of the earlier approaches, which were too slow because they ran the CNN over the entire image at many locations and scales.
8. From Classification To Detection
Classification Head:
● C+1 scores: C classes + 1 background class
Localization Head:
● Class agnostic: (x,y,w,h)
● Class specific: (x,y,w,h) × C
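A minimal sketch of how many outputs each head produces, following the counts on this slide. The function name and the `class_specific` flag are hypothetical, introduced only for illustration.

```python
def head_output_sizes(num_classes, class_specific=False):
    """Return (classification outputs, localization outputs)."""
    cls_outputs = num_classes + 1            # C classes + 1 background class
    boxes = num_classes if class_specific else 1
    loc_outputs = 4 * boxes                  # one (x, y, w, h) per predicted box
    return cls_outputs, loc_outputs


print(head_output_sizes(20))                       # (21, 4)
print(head_output_sizes(20, class_specific=True))  # (21, 80)
```

With 20 classes, the class-specific head predicts 80 numbers, one box per class, while the class-agnostic head predicts a single box regardless of class.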
9. From Classification To Detection
● Training
○ Crop random regions from images.
○ Scale to uniform size.
○ A region is labeled according to overlap with ground truth labeling.
○ Optimize using Stochastic Gradient Descent.
○ Handle class imbalance by resampling.
● Detection
○ Use sliding window to go over image.
○ Crop regions.
○ Scale to uniform size.
○ Apply network to all cropped images.
○ Repeat process for different image scales.
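The labeling rule used in training and the multi-scale sliding window used in detection can be sketched as follows. This is an illustration under assumptions that the slide does not state: boxes are (x, y, w, h) with (x, y) the top-left corner, a region counts as an object when its IoU with a ground truth box is at least 0.5, and the window size, stride and scales are arbitrary defaults.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping rectangle (zero if disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def label_region(region, ground_truth, threshold=0.5, background=0):
    """Class of the best-overlapping ground truth box, or background.

    ground_truth is a list of (box, class_id) pairs; threshold is an assumption.
    """
    best_iou, best_class = 0.0, background
    for box, class_id in ground_truth:
        overlap = iou(region, box)
        if overlap > best_iou:
            best_iou, best_class = overlap, class_id
    return best_class if best_iou >= threshold else background


def sliding_windows(image_w, image_h, window=64, stride=32, scales=(1.0, 0.5)):
    """Yield (x, y, w, h) crops in original-image coordinates.

    A fixed-size window on a shrunken image corresponds to a larger
    box in the original image.
    """
    for scale in scales:
        scaled_w, scaled_h = int(image_w * scale), int(image_h * scale)
        for y in range(0, scaled_h - window + 1, stride):
            for x in range(0, scaled_w - window + 1, stride):
                # Map the window back to original-image coordinates.
                yield (x / scale, y / scale, window / scale, window / scale)
```

Each box yielded by `sliding_windows` would then be cropped, scaled to the network input size, and classified. Even this small grid grows quickly with image size and the number of scales, which is why so many detections come out.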
10. How To Handle So Many Detections?
● Problem:
○ Running this algorithm at many locations and many scales results in many detections.
● Solution:
○ Suppress weaker detections that overlap a stronger one.
11. Non-Maximum Suppression (NMS)
● Start with most confident detection D.
● Measure IoU with all other detections.
● Remove detections whose IoU with D exceeds 50%.
● Repeat with next most confident detection.
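The four steps above can be sketched in plain Python. The (x, y, w, h) box format with a top-left corner and the (score, box) tuple layout are assumptions made for this example. The 0.5 threshold comes from the slide.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression over a list of (score, box) pairs."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # most confident remaining detection D
        kept.append(best)
        # Drop every detection that overlaps D by more than the threshold.
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_threshold]
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 0, 10, 10)),     # overlaps the first box heavily
    (0.7, (50, 50, 10, 10)),   # far away from both
]
print(nms(detections))  # [(0.9, (0, 0, 10, 10)), (0.7, (50, 50, 10, 10))]
```

In a multi-class detector this procedure is typically run separately per class, so that a confident detection of one class does not suppress an overlapping detection of another.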
12. From Classification To Detection
● Problem:
○ Previous method was too slow.
○ Network is applied over and over.
● Solution:
○ Sliding window is inherently efficient in the case of CNNs: convolutional layers share computation across overlapping windows.
● OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks (2013)
○ Rob Fergus, Yann LeCun
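A toy 1-D illustration of why sliding windows are cheap for convolutional layers. This is not OverFeat itself: the signal, kernel and window size are made up for the example. It only shows that filtering every crop separately gives the same numbers as filtering the whole input once and slicing the result.

```python
def conv1d(signal, kernel):
    """Valid 1-D correlation: one output per position where the kernel fits."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


signal = [1, 2, 0, 3, 1, 4, 2, 5]
kernel = [1, -1, 2]
window = 5  # sliding-window size (hypothetical)

# Naive: crop every window and run the filter on each crop separately.
per_window = [conv1d(signal[s:s + window], kernel)
              for s in range(len(signal) - window + 1)]

# Shared: run the filter once over the whole signal, then slice the result.
full = conv1d(signal, kernel)
outputs_per_window = window - len(kernel) + 1
shared = [full[s:s + outputs_per_window] for s in range(len(per_window))]

assert per_window == shared

# Multiplications needed by each strategy.
naive_mults = len(per_window) * outputs_per_window * len(kernel)
shared_mults = len(full) * len(kernel)
print(naive_mults, shared_mults)
```

The gap between the two counts widens as the windows overlap more, which is the situation a dense sliding-window detector is in.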
17. CNNs Are Still Too Slow
● Problem:
○ Need to test many positions and scales, and use a computationally demanding classifier (CNN)
● Solution:
○ Only look at a tiny subset of possible positions.
● Rich feature hierarchies for accurate object detection and semantic segmentation (2014)
○ AKA R-CNN
○ Ross Girshick
18. Region Proposals
● Find “blobby” image regions that are likely to contain objects
● “Class-agnostic” object detector
● Look for “blob-like” regions
28. R-CNN: Training
1. Train a classification model on a large dataset (ImageNet)
2. Fine-tune model for detection on a smaller dataset (Pascal)
○ Instead of 1000 ImageNet classes, now use 20 classes + background class.
○ Extract region proposals for all images.
○ Use positive / negative regions from detection images.
■ If proposal has >50% IoU with any ground truth → Positive example.
■ Otherwise → Negative example.
■ Batch = 32 positives + 96 negatives.
3. Train final classifiers
○ Extract region proposals for all images.
○ For each region: crop and warp to CNN size, run forward pass, save features to disk (requires ~200GB for the Pascal dataset).
○ Train one binary SVM per class to classify region features.
○ Train one linear regression model per class to predict bounding box offsets.
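Two pieces of the recipe above lend themselves to a short sketch: assembling a fine-tuning batch of 32 positives and 96 negatives (step 2), and computing the targets for the bounding box regressors (step 3). Several things here are assumptions, not taken from the slides. `sample_batch` takes the maximum IoU of each proposal with any ground truth box as precomputed input, and it samples with replacement when a pool is too small. The offset parameterization is the one described in the R-CNN paper, with boxes given as (center x, center y, width, height). Both function names are hypothetical.

```python
import math
import random


def sample_batch(max_ious, num_pos=32, num_neg=96, threshold=0.5, seed=None):
    """Pick proposal indices for one fine-tuning batch.

    max_ious[i] is the highest IoU of proposal i with any ground truth box.
    """
    rng = random.Random(seed)
    positives = [i for i, v in enumerate(max_ious) if v > threshold]
    negatives = [i for i, v in enumerate(max_ious) if v <= threshold]

    def pick(pool, n):
        if len(pool) >= n:
            return rng.sample(pool, n)
        # Too few candidates: sample with replacement (an assumption).
        return rng.choices(pool, k=n) if pool else []

    return pick(positives, num_pos), pick(negatives, num_neg)


def regression_targets(proposal, ground_truth):
    """Offsets (tx, ty, tw, th) that map a proposal onto a ground truth box.

    Boxes are (cx, cy, w, h) with (cx, cy) the box center.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return (
        (gx - px) / pw,        # horizontal shift, in units of proposal width
        (gy - py) / ph,        # vertical shift, in units of proposal height
        math.log(gw / pw),     # log of the width ratio
        math.log(gh / ph),     # log of the height ratio
    )
```

Normalizing the shifts by the proposal size and taking the log of the size ratios keeps the targets comparable across proposals of very different sizes, which suits a linear regression model.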