Recent Progress on Object Detection_20170331

This slide deck gives a brief summary of recent progress on object detection using deep learning.
It introduces the concepts behind selected previous works (R-CNN series, YOLO, SSD) and 6 recent papers uploaded to arXiv between December 2016 and March 2017.
Most of these papers focus on improving the performance of small object detection.


  1. Recent Progress on Object Detection: How to Deal with Small Objects. Jihong Kang, Computer Vision Lab, Seoul National University (http://cv.snu.ac.kr), 2017.03.31
  2. Contents
     • Quick Review on Previous Works
     • Selected Recent Papers
     • Proposed Solutions
     • Methods Comparison
     • What's Next?
     • Summary
  3. Previous Works: R-CNN Series / YOLO / SSD
  4. Region-CNN Series
     R-CNN: use of deep CNN features for detection.
       ◦ Region proposals: Edge Boxes, Selective Search.
       ◦ CNN features are used for region classification (SVM) and bounding box regression (linear model).
       ◦ Weakness: the CNN is not updated by the classification loss.
       ◦ R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," CVPR, 2014.
     Fast R-CNN: RoI pooling layer + multi-task loss.
       ◦ RoI pooling layer: differentiable.
       ◦ Multi-task loss: classification loss + bbox regression loss.
       ◦ Weakness: region proposal is performed separately.
       ◦ R. Girshick, "Fast R-CNN," ICCV, 2015.
  5. Region-CNN Series: Faster R-CNN
     ◦ Region Proposal Network (RPN): the base CNN features are shared by the RPN and Fast R-CNN.
     ◦ RPN: object/non-object classification and bounding box regression using anchors, applied as a sliding window.
     ◦ Fast R-CNN: classification + bbox regression.
     ◦ Weakness: the RPN and classification are still separate processes.
     ◦ S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," NIPS, 2015.
  6. YOLO / SSD
     You Only Look Once (YOLO)
       ◦ Very fast detector (21~155 fps).
       ◦ Finds objects in each grid cell in parallel.
       ◦ Slightly worse performance than Faster R-CNN.
       ◦ J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," CVPR, 2016.
     Single Shot MultiBox Detector (SSD)
       ◦ Combines ideas from Faster R-CNN (default boxes) and YOLO (parallel processing).
       ◦ Multi-scale feature map detection: small objects are detected on lower levels, large objects on higher levels.
       ◦ End-to-end training/testing.
       ◦ W. Liu et al., "SSD: Single shot multibox detector," ECCV, 2016.
  7. Comparison
     R-CNN series: 2-step process of region proposal and object recognition.
       ◦ 1st step: generate ~2k object bounding box candidates.
       ◦ 2nd step: classify each RoI and refine its bounding box, one by one.
     YOLO/SSD: 1-step process.
       ◦ Region proposal and classification are done at the same time, in parallel for each sub-region of the image.
     • Both approaches require Non-Maximum Suppression (NMS) as post-processing to handle overlapping detection results.
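The NMS post-processing step mentioned on this slide can be sketched as follows. This is a minimal greedy variant in plain Python, not the implementation used by any of the detectors above; the box format `(x1, y1, x2, y2)`, the function names, and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping detections and one separate detection (toy data).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.82, so it is suppressed and only the first and third boxes survive.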
  8. Problem: poor performance on detecting small objects. (Figure: results of SSD321)
  9. Why?
     1. A small object provides less information.
       ◦ But if a human can recognize it, a machine should be able to as well.
       ◦ Humans can infer from the context in the scene around the object.
     2. There is a trade-off between localization accuracy and high-level semantic features.
       ◦ High-resolution feature maps at low levels of the network have semantically weak features.
       ◦ Low-resolution feature maps at high levels of the network lose the accurate position of objects (due to the quantization effect of pooling).
  10. Selected Recent Papers
  11. A List of Selected Recent Papers
      [1] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature Pyramid Networks for Object Detection," arXiv preprint arXiv:1612.03144, 2016.
      [2] P. Hu and D. Ramanan, "Finding Tiny Faces," arXiv preprint arXiv:1612.04402, 2016.
      [3] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta, "Beyond Skip Connections: Top-Down Modulation for Object Detection," arXiv preprint arXiv:1612.06851, 2016.
      [4] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: Deconvolutional Single Shot Detector," arXiv preprint arXiv:1701.06659, 2017.
      [5] H. Li, Y. Liu, W. Ouyang, and X. Wang, "Zoom Out-and-In Network with Recursive Training for Object Proposal," arXiv preprint arXiv:1702.05711, 2017.
      [6] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," arXiv preprint arXiv:1703.06870, 2017.
      arXiv upload date: 2016.12: [1-3]; 2017.1: [4]; 2017.2: [5]; 2017.3: [6]
  12. Proposed Solutions: 6 Tips for Enhancement
  13. 1st Solution: Hourglass Structure (Figure: DSSD network)
  14. The Mainstream
      ◦ "Beyond Skip Connections: Top-Down Modulation for Object Detection"
      ◦ "DSSD: Deconvolutional Single Shot Detector"
      ◦ "Feature Pyramid Networks for Object Detection"
      ◦ "Zoom Out-and-In Network with Recursive Training for Object Proposal"
  15. Understanding: In Bottom-up Structure (Figure: SSD network)
      Deeper layers:
        • Lower-resolution feature map = coarser localization
        • Semantically stronger features
      Shallower layers:
        • Higher-resolution feature map = finer localization
        • Semantically weak features
  16. Understanding: Adding Top-down Modules (Figure: DSSD network, bottom-up and top-down paths)
        • Restore finer localization information using a combination of upsampling and lateral connections
        • Maintain semantically strong features
  17. Top-down Modules: comparison on top-down modules
      [1] "Beyond Skip Connections: Top-Down Modulation for Object Detection"
      [2] "Feature Pyramid Networks for Object Detection"
      [3] "DSSD: Deconvolutional Single Shot Detector"
      [4] "Zoom Out-and-In Network with Recursive Training for Object Proposal"
      (Figure label: Nearest Neighbor Upsampling)

      Paper | Upsampling type | Lateral path          | Merge operation
      [1]   | Interpolation   | 3x3 conv              | Channel concat.
      [2]   | Interpolation   | 1x1 conv              | Eltw sum
      [3]   | Deconv          | Skip connection       | Eltw product
      [4]   | Deconv          | Conv from both paths  | Eltw sum
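One row of the comparison above (interpolation upsampling, 1x1 conv lateral path, element-wise sum) can be sketched in plain Python as follows. This is only an illustration on nested lists, not code from any of the papers; the `[C][H][W]` layout, the function names, the fixed 2x factor, and the toy weights are all assumptions.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbor 2x upsampling of a [C][H][W] feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row
        out.append(rows)
    return out


def conv1x1(fmap, weights):
    """1x1 convolution: weights[o][i] mixes input channel i into output o."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(weights[o][i] * fmap[i][y][x] for i in range(len(fmap)))
          for x in range(w)] for y in range(h)]
        for o in range(len(weights))
    ]


def top_down_merge(top, lateral, lateral_weights):
    """Upsample the coarse map and add the 1x1-projected finer map."""
    up = upsample_nearest_2x(top)
    lat = conv1x1(lateral, lateral_weights)
    return [
        [[u + l for u, l in zip(urow, lrow)] for urow, lrow in zip(uch, lch)]
        for uch, lch in zip(up, lat)
    ]


# Coarse, semantically strong map: 1 channel, 2x2.
top = [[[1.0, 2.0], [3.0, 4.0]]]
# Finer bottom-up map: 2 channels, 4x4 (constant values for readability).
lateral = [[[1.0] * 4 for _ in range(4)], [[2.0] * 4 for _ in range(4)]]
weights = [[0.5, 0.25]]  # project 2 lateral channels down to 1
merged = top_down_merge(top, lateral, weights)
```

The merged map has the resolution of the finer layer while carrying the values propagated down from the coarser one. Swapping the sum for a product or a channel concatenation would give the other merge operations listed in the table.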
  18. Performance Gain (results from "Feature Pyramid Networks for Object Detection" and "DSSD: Deconvolutional Single Shot Detector")
  19. 2nd Solution: Image Pyramid / Multi-scale Feature. P. Hu and D. Ramanan, "Finding Tiny Faces"
  20. Detection on Resized Image
      ◦ For small faces, detecting on the upsampled image is better.
      ◦ For large faces, detecting on the subsampled image is better.
      ◦ Why?
  21. Explanation
      Pre-trained model's impact.
      In my opinion:
      ◦ For small objects, running the CNN on the upsampled image generates more features from higher-resolution feature maps.
      ◦ For large objects, reducing the input image lets the CNN include more context around the object, since the receptive field is determined by the layers of the CNN.
      ◦ An image pyramid is a good way to obtain different amounts of context inside the RoI with the same CNN model.
  22. Context Around the Object
      ◦ Including additional context around the objects is always helpful, especially for small ones.
      ◦ Figure notes: the dotted rectangle is the receptive field; the number is the detection rate. Annotation: "This is due to overfitting." Caption: the improvement from adding context to a tight-fitting template.
  23. Impact of Multi-scale Feature
      ◦ Multi-scale features are better than single-scale features, especially for small objects.
      ◦ Results from "Finding Tiny Faces" and "Feature Pyramid Networks for Object Detection".
  24. 3rd Solution: Decomposition into Easier Problems
  25. Multi-class Classification into Multi-binary Classification
      ◦ Mask R-CNN decouples mask and class prediction.
      ◦ A mask is predicted for each class. (The COCO dataset has 80 classes.)
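The decoupling described on this slide can be illustrated with a small sketch: one binary mask is predicted per class, and the classification output only decides which of those masks to read out. This is a toy illustration in plain Python, not Mask R-CNN code; the function names, the 0.5 threshold, and the 2-class, 2x2 example values are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def select_mask(mask_logits, class_scores, threshold=0.5):
    """Pick the binary mask of the class chosen by the classification branch.

    mask_logits: [K][H][W] per-class mask logits (one binary problem per class).
    class_scores: [K] classification scores for the same RoI.
    """
    k = max(range(len(class_scores)), key=lambda c: class_scores[c])
    # Each pixel is an independent binary decision for class k only.
    mask = [[sigmoid(v) > threshold for v in row] for row in mask_logits[k]]
    return k, mask


mask_logits = [
    [[-2.0, -1.0], [-3.0, -2.0]],  # class 0
    [[3.0, -1.0], [2.0, 0.5]],     # class 1
]
class_scores = [0.2, 0.8]
k, mask = select_mask(mask_logits, class_scores)
```

Because each class has its own sigmoid output, the mask channels do not compete with each other the way they would under a per-pixel softmax over classes.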
  26. Scale-specific Face Detector
      ◦ "Finding Tiny Faces" uses 25 fixed-size face detectors on shared CNN features.
      ◦ Each detector performs binary classification and generates a heatmap. (A bounding box regressor is also applied, giving a small performance gain.)
      ◦ Different templates are used for each scale path.
  27. 4th Solution: RoI Alignment for Fast/Faster R-CNN (Figure: Fast R-CNN)
  28. Misalignment of RoI Pooling
      ◦ Misalignment of the RoI arises in the RoI pooling process due to quantization.
      ◦ 1st misalignment: converting the continuous RoI into an integer RoI.
      ◦ 2nd misalignment: dividing the input features into bins of the pooled feature (e.g. pooling from an 11x11 feature to 7x7 bins).
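Both quantization steps can be made concrete with a little arithmetic. The sketch below is an illustration only: the feature stride of 16, the rounding rule, and the floor/ceil bin convention are assumptions modeled on common RoI pooling implementations, and the function names are hypothetical.

```python
def quantize_roi(x1, x2, stride=16):
    """1st quantization: round continuous RoI edges to feature-map cells."""
    return round(x1 / stride), round(x2 / stride)


def pooling_bins(n_cells, n_bins):
    """2nd quantization: integer [start, end) cell ranges for each bin."""
    return [
        ((i * n_cells) // n_bins,             # floor of the bin start
         -(-((i + 1) * n_cells) // n_bins))   # ceil of the bin end
        for i in range(n_bins)
    ]


# An RoI spanning x = 100..270 pixels in the input image (toy values).
x1, x2 = 100.0, 270.0
q1, q2 = quantize_roi(x1, x2)
# How far each edge moved, measured back in input pixels.
shift_px = (x1 - q1 * 16, x2 - q2 * 16)

# The slide's example: 11 feature cells divided into 7 pooled bins.
bins = pooling_bins(11, 7)
```

In this example the left edge moves by 4 pixels and the right edge by 2 pixels, and the 11 cells are split into uneven, partly overlapping bins of 2 or 3 cells. Shifts of a few pixels matter little for a large object but can be a large fraction of a small one.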
  29. Solution: RoIAlign Layer
      ◦ The RoIAlign layer uses bilinear interpolation to compute feature values at non-integer coordinates, instead of rounding them.
      ◦ (Table: ablation experiment for RoIAlign on object instance segmentation)
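The interpolation step at the heart of RoIAlign can be sketched as below. This covers only the sampling of one continuous location on a single-channel feature map; a full RoIAlign layer also chooses sampling points inside each bin and aggregates them, which is omitted here. The function name and the border clamping are assumptions.

```python
import math


def bilinear_sample(fmap, y, x):
    """Interpolate a 2-D feature map at a continuous (y, x) location."""
    h, w = len(fmap), len(fmap[0])
    # Clamp to the valid range so border samples stay defined.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Blend along x on the two neighboring rows, then along y.
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


fmap = [[0.0, 10.0], [20.0, 30.0]]
value = bilinear_sample(fmap, 0.5, 0.5)   # center of the four cells
corner = bilinear_sample(fmap, 0.0, 1.0)  # exactly on a grid point
```

Sampling at the center of the four cells returns their average, and sampling exactly on a grid point returns that cell's value, so no coordinate ever has to be rounded.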
  30. Impact on Small Object Detection: 1.6 point gain on small objects.
  31. 5th Solution: Better Base Network (Figure: ResNext module). S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," arXiv preprint arXiv:1611.05431, 2016.
  32. Comparison on Base Architecture (results from Mask R-CNN, Beyond Skip Connections: Top-Down Modulation …, Finding Tiny Faces, DSSD)
      ◦ IRNv2 ≟ ResNext-101 > ResNet101 > ResNet50 > VGG16 (IRNv2: InceptionResNet v2)
      ◦ This comparison doesn't consider the number of parameters or memory usage of each network.
  33. 6th Solution: Various Training Techniques. Imbalanced ratio of positive and negative samples in region proposals.
  34. Data Sampling
      Hard negative mining
        ◦ In region proposal, negative samples usually far outnumber positive samples in an image.
        ◦ Instead of using all negatives, use only the hard negative samples, sorted by confidence loss.
        ◦ SSD uses a positive-to-negative ratio of 1:3.
      Balanced sampling
        ◦ A large positive object overlaps with more templates than a small one.
        ◦ So, sample in a balanced way across objects of different sizes.
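Hard negative mining with the 1:3 ratio mentioned above can be sketched as follows. This is a simplified illustration, not SSD's training code; the function name and the toy loss values are assumptions.

```python
def hard_negative_mining(losses, is_positive, neg_pos_ratio=3):
    """Keep all positives plus the highest-loss negatives at the given ratio."""
    positives = [i for i, p in enumerate(is_positive) if p]
    negatives = [i for i, p in enumerate(is_positive) if not p]
    # Hardest negatives first: the ones the classifier gets most wrong.
    negatives.sort(key=lambda i: losses[i], reverse=True)
    n_keep = min(len(negatives), neg_pos_ratio * len(positives))
    return positives, negatives[:n_keep]


# One positive sample and six negatives with their confidence losses (toy data).
losses = [0.9, 0.1, 2.0, 0.05, 1.5, 0.3, 0.7]
is_positive = [True, False, False, False, False, False, False]
pos, hard_neg = hard_negative_mining(losses, is_positive)
```

With one positive, only the three negatives with the largest loss are kept; the easy negatives contribute nothing to the training step.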
  35. Training
      Progressive training
        ◦ When training additional layers on top of a pre-trained network, instead of training all layers at once, attach one layer (or a few layers) at a time and train repeatedly.
        ◦ This progressive network building performs better in general. (From the TDM paper)
      Multi-task training
        ◦ Mask R-CNN shows better object detection performance when the segmentation mask branch and the object detector are trained together than when the detector is trained alone.
  36. Methods Comparison
  37. Comparison Table
      [1] "Feature Pyramid Networks for Object Detection"
      [2] "Finding Tiny Faces"
      [3] "Beyond Skip Connections: Top-Down Modulation for Object Detection"
      [4] "DSSD: Deconvolutional Single Shot Detector"
      [5] "Zoom Out-and-In Network with Recursive Training for Object Proposal"
      [6] "Mask R-CNN"

      Methods          | COCO test-dev AP | Hourglass | Multi-scale feature based detection | Image pyramid | RoI Align | Base network
      FPN [1]          | 36.2             | O         | O                                   | X             | X         | ResNet101
      TDM [3]          | 36.8             | O         | X                                   | X             | X         | InceptionResNetV2
      DSSD [4]         | 33.2             | O         | O                                   | X             | X         | ResNet101
      Tiny Faces [2]   | -                | X         | O                                   | O             | X         | ResNet101
      Zoom network [5] | -                | O         | O                                   | X             | X         | Inception-BN
      Mask R-CNN [6]   | 39.2             | O         | O                                   | X             | O         | ResNext101 + FPN
      (O: used, X: not used)
  38. Summary
  39. 6 Tips for Better Detection
      1. Hourglass structure
      2. Image pyramid / multi-scale feature
      3. Decomposition into easier problems
      4. RoI alignment
      5. Better base network
      6. Various training techniques