DenseBox: Unifying Landmark
Localization with End to End
Object Detection
Submitted on 16 Sep 2015 (v1), last revised 19 Sep 2015 (v3)
Lichao Huang, Yi Yang, Yafeng Deng, Yinan Yu
arXiv preprint arXiv:1509.04874, 2015
CHEN KUAN-YU
stu9458@gmail.com
Agenda
• Introduction
• Algorithm
• Ground-Truth Generation
• Model Design
• Multi-Task Training
• Refine with Landmark Localization
• Experiments
• Conclusion
Introduction
• How can a single fully convolutional neural network (FCN) perform on object
detection?
• In this work, we focus on one question: to what extent can a one-stage FCN perform on object detection?
Introduction
• We introduce DenseBox, a unified end-to-end FCN framework that directly predicts bounding boxes and object class confidences through all locations and scales of an image.
• Although similar to many existing sliding-window-style FCN detection frameworks, DenseBox is more carefully designed to detect objects at small scales and under heavy occlusion.
Introduction
• Checking for nearby cars while driving, finding a person, and localizing a familiar face are all examples of object detection.
• Our results indicate that DenseBox is a state-of-the-art system for detecting challenging objects such as faces and cars.
Introduction
1. First, we demonstrate that a single fully convolutional neural network, if designed and optimized carefully, can detect objects under different scales and heavy occlusion extremely accurately and efficiently.
2. Second, we show that detection accuracy improves further when incorporating landmark localization through multi-task learning [1].
Introduction
• The DenseBox detection pipeline
1. An image pyramid is fed into the network.
2. After several layers of convolution and pooling, the feature maps are upsampled and further convolution layers are applied to obtain the final output.
3. The output feature map is converted into bounding boxes, and non-maximum suppression is applied to all boxes whose scores exceed the threshold (a minimal sketch of this step follows).
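As a concrete illustration of step 3, here is a minimal greedy NMS sketch (Python/NumPy); the 0.5 IoU threshold is an assumption, since the slide does not state the value used.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes."""
    order = scores.argsort()[::-1]  # highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of box i with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # drop boxes that overlap box i too much
    return keep
```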
Algorithm
• Ground-Truth Generation
• Model Design
• Multi-Task Training
• Refine with Landmark Localization
Algorithm
• Bounding Box
• Left top: $p_t = (x_t, y_t)$
• Right bottom: $p_b = (x_b, y_b)$
• Each pixel $i$ of the output feature map carries a 5-dimensional vector
• $\hat{t}_i = (\hat{s},\ dx_t = x_i - x_t,\ dy_t = y_i - y_t,\ dx_b = x_i - x_b,\ dy_b = y_i - y_b)$
• $\hat{s}$ is the confidence score of being an object
• $dx_t, dy_t, dx_b, dy_b$ denote the distances between the output pixel location $(x_i, y_i)$ and the boundaries of the target bounding box.
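A minimal sketch of decoding such a 5-channel output map into boxes (Python/NumPy). The 0.7 score threshold is a made-up value, the stride of 4 matches the downsampling factor described on the next slide, and the paper's normalization of the offsets is ignored here for clarity.

```python
import numpy as np

def decode_output_map(out_map, score_thresh=0.7, stride=4):
    """Turn a 5-channel map (s, dx_t, dy_t, dx_b, dy_b) into boxes.

    out_map has shape (5, H, W). A pixel at image location (xi, yi)
    stores its distances to the target box corners, so the box is
    recovered as (xi - dx_t, yi - dy_t, xi - dx_b, yi - dy_b).
    """
    s = out_map[0]
    ys, xs = np.where(s > score_thresh)  # sufficiently confident pixels
    boxes, scores = [], []
    for y, x in zip(ys, xs):
        xi, yi = x * stride, y * stride  # map coords -> image coords
        dxt, dyt, dxb, dyb = out_map[1:, y, x]
        boxes.append((xi - dxt, yi - dyt, xi - dxb, yi - dyb))
        scores.append(s[y, x])
    return np.array(boxes), np.array(scores)
```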
Algorithm
• In this paper, we train our network at a single scale and apply it to multiple scales for evaluation.
• In training, patches are cropped and resized to 240x240 so that the face in the center has a height of roughly 50 pixels. The output ground truth in training is a 5-channel map of size 60x60, i.e. a downsampling factor of four.
Ground-Truth Generation
• The positive labeled region in the first channel of the ground-truth map is a filled circle with radius $r_c$, set to 0.3 times the box size.
• The remaining 4 channels are filled with the distances between the output-map pixel location and the left-top and right-bottom corners of the nearest bounding box.
Algorithm: Ground-Truth Generation
• Note that if multiple faces occur in one patch, we keep those faces as positive if they fall within a scale range (e.g. 0.8 to 1.25 in our setting) relative to the face in the patch center.
• The pixels of the first channel, which denote the class confidence score, are initialized to 0 in the ground-truth map and set to 1 if they fall within the positive label region.
• Each pixel can be treated as one sample, since every 5-channel pixel describes a bounding box.
Algorithm: Ground-Truth Generation
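A sketch of this ground-truth construction for a single centered face (Python/NumPy). Using the box height as the "box size" for the radius, and skipping the nearest-box logic needed when several faces share a patch, are simplifying assumptions.

```python
import numpy as np

def make_ground_truth(patch=240, stride=4, box=(95.0, 95.0, 145.0, 145.0)):
    """Build the 5-channel GT map for one 240x240 training patch.

    box is (x_t, y_t, x_b, y_b) in image coordinates; the map is
    patch // stride per side (60x60 here, downsampling factor 4).
    """
    m = patch // stride
    gt = np.zeros((5, m, m), dtype=np.float32)
    xt, yt, xb, yb = box
    ys, xs = np.mgrid[0:m, 0:m]
    # Positive region: filled circle at the box center, radius 0.3 * box size.
    cx, cy = (xt + xb) / 2 / stride, (yt + yb) / 2 / stride
    rc = 0.3 * (yb - yt) / stride
    gt[0][(xs - cx) ** 2 + (ys - cy) ** 2 <= rc ** 2] = 1.0
    # Remaining channels: distances (in image space) from each output
    # pixel to the left-top and right-bottom corners of the box.
    gt[1], gt[2] = xs * stride - xt, ys * stride - yt
    gt[3], gt[4] = xs * stride - xb, ys * stride - yb
    return gt
```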
Algorithm
• Ground-Truth Generation
• Model Design
• Multi-Task Training
• Refine with Landmark Localization
Algorithm
• Network architecture of DenseBox: the rectangles with red names contain learnable parameters.
• Derived from the VGG (Visual Geometry Group) 19 model used for image classification [35].
Model Design
Algorithm
• Multi-Level Feature Fusion
• Recent works indicate that using features from different convolution layers can enhance performance in tasks such as edge detection and segmentation.
• Part-level features focus on local details of an object to find discriminative appearance parts, while object-level or high-level features usually have a larger receptive field in order to recognize the object.
• We concatenate the feature maps from conv3_4 and conv4_4. The receptive field (or sliding-window size) of conv3_4 is 48x48, almost the same as the face size in training, while conv4_4 has a much larger receptive field, around 118x118.
Model Design
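One plausible realization of this fusion, assuming PyTorch: conv4_4 sits after one more pooling stage, so it is bilinearly upsampled back to conv3_4's resolution before the channel-wise concatenation (the bilinear upsampling is mentioned in the editor's notes).

```python
import torch
import torch.nn.functional as F

def fuse_features(conv3_4: torch.Tensor, conv4_4: torch.Tensor) -> torch.Tensor:
    """Concatenate part-level (conv3_4) and object-level (conv4_4) features.

    conv3_4: (N, C3, H, W); conv4_4: (N, C4, H/2, W/2). Bilinear
    upsampling restores conv4_4 to (H, W) so the maps can be stacked.
    """
    up4_4 = F.interpolate(conv4_4, size=conv3_4.shape[2:],
                          mode="bilinear", align_corners=False)
    return torch.cat([conv3_4, up4_4], dim=1)
```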
Algorithm
• Ground-Truth Generation
• Model Design
• Multi-Task Training
• Refine with Landmark Localization
Algorithm
• Like Fast R-CNN (a region-based ConvNet), our network has two sibling output branches
1. The first branch outputs the confidence score $\hat{y}$ (per pixel in the output map) of being a target object. Given the ground-truth label $y^* \in \{0, 1\}$, the classification loss can be defined as
$L_{cls}(\hat{y}, y^*) = \|\hat{y} - y^*\|^2$ (1)
2. The second branch outputs the bounding-box regression loss, denoted $L_{loc}$. It minimizes the L2 distance between the predicted location offsets $\hat{d} = (\hat{d}_{x_t}, \hat{d}_{y_t}, \hat{d}_{x_b}, \hat{d}_{y_b})$ and the targets $d^* = (d^*_{x_t}, d^*_{y_t}, d^*_{x_b}, d^*_{y_b})$:
$L_{loc}(\hat{d}, d^*) = \sum_{j \in \{x_t, y_t, x_b, y_b\}} \|\hat{d}_j - d^*_j\|^2$ (2)
Multi-Task Training
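Written out in code, the two per-pixel losses of Eqs. (1) and (2) as reconstructed above are (a NumPy sketch):

```python
import numpy as np

def cls_loss(y_hat, y_star):
    """Eq. (1): per-pixel L2 classification loss."""
    return (y_hat - y_star) ** 2

def loc_loss(d_hat, d_star):
    """Eq. (2): L2 loss summed over the four box offsets."""
    return np.sum((np.asarray(d_hat) - np.asarray(d_star)) ** 2)
```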
Algorithm
• The process of selecting negative samples is one of the crucial parts of learning.
• In addition, the detector will degrade if we penalize the loss on samples lying on the margin between the positive and negative regions.
• Here we use a binary mask for each output pixel to indicate whether it is selected in training.
Multi-Task Training - Balance Sampling
Algorithm
• Ignoring Gray Zone
• Hard Negative Mining
• Loss with Mask
Multi-Task Training - Balance Sampling
Algorithm
• Ignoring Gray Zone
• The gray zone is defined on the margin between the positive and negative regions. Pixels there should be considered neither positive nor negative, and their loss weight should be set to 0.
• A pixel falls in the gray zone if $Dis_{pixel} < r_{near} = 2$ pixels, i.e. it lies within 2 pixels of the positive region.
• The flag $f_{ign}$ records whether a pixel is ignored.
Multi-Task Training - Balance Sampling
Algorithm
• Hard Negative Mining
• We make learning more efficient by searching for badly predicted samples rather than random samples. After negative mining, the badly predicted samples are very likely to be selected, so gradient descent on those samples leads to more robust prediction with less noise.
• Sort the losses of the output pixels in descending order and assign the top 1% as hard negatives. In all experiments we keep all positively labeled pixels (samples) and a positive-to-negative ratio of 1:1 (sketched after this slide).
• A flag $f_{sel}$ is set for those pixels (samples) selected in a mini-batch.
Multi-Task Training - Balance Sampling
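A sketch of the resulting selection mask $f_{sel}$ over flattened output pixels (Python/NumPy). Splitting the negative quota half-and-half between the hard pool and random negatives is an assumption the slide does not spell out.

```python
import numpy as np

def select_samples(losses, positive, hard_frac=0.01, seed=0):
    """Keep all positives plus an equal number of negatives.

    losses: per-pixel loss values; positive: boolean mask of positive
    pixels. The top 1% of losses among negatives form the hard pool;
    assumes there are enough negatives to fill the quota.
    """
    rng = np.random.default_rng(seed)
    neg = np.flatnonzero(~positive)
    order = neg[np.argsort(losses[neg])[::-1]]   # hardest negatives first
    n_hard_pool = max(1, int(hard_frac * losses.size))
    hard_pool, easy_pool = order[:n_hard_pool], order[n_hard_pool:]
    n_neg = int(positive.sum())                  # pos:neg kept at 1:1
    take_hard = min((n_neg + 1) // 2, len(hard_pool))
    chosen = np.concatenate([
        hard_pool[:take_hard],
        rng.choice(easy_pool, size=n_neg - take_hard, replace=False),
    ])
    f_sel = positive.copy()
    f_sel[chosen.astype(int)] = True
    return f_sel
```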
Algorithm
• Loss with Mask
• Now we can define the mask $M(\hat{t}_i)$ for each sample $\hat{t}_i = \{\hat{y}_i, \hat{d}_i\}$ as a function of the flags mentioned above:
$M(\hat{t}_i) = \begin{cases} 0 & f^i_{ign} = 1 \text{ or } f^i_{sel} = 0 \\ 1 & \text{otherwise} \end{cases}$ (3)
• Then, combining the classification (1) and bounding-box regression (2) losses with the masks, our full multi-task loss can be represented as
$L_{det}(\theta) = \sum_i \left( M(\hat{t}_i)\, L_{cls}(\hat{y}_i, y^*_i) + \lambda_{loc}\, [y^*_i > 0]\, M(\hat{t}_i)\, L_{loc}(\hat{d}_i, d^*_i) \right)$ (4)
Multi-Task Training - Balance Sampling
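Putting Eqs. (1)-(4) together as a NumPy sketch over flattened output pixels; the weight $\lambda_{loc} = 3$ is the value reported in the paper, not on this slide.

```python
import numpy as np

def detection_loss(y_hat, y_star, d_hat, d_star, f_ign, f_sel, lam_loc=3.0):
    """Eqs. (3)-(4): masked multi-task detection loss.

    y_hat, y_star, f_ign, f_sel: shape (N,); d_hat, d_star: shape (N, 4).
    """
    mask = ((~f_ign) & f_sel).astype(np.float32)   # Eq. (3)
    l_cls = (y_hat - y_star) ** 2                  # Eq. (1)
    l_loc = np.sum((d_hat - d_star) ** 2, axis=1)  # Eq. (2)
    pos = (y_star > 0).astype(np.float32)          # regression only on positives
    return np.sum(mask * l_cls + lam_loc * pos * mask * l_loc)  # Eq. (4)
```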
Algorithm
• Ground-Truth Generation
• Model Design
• Multi-Task Training
• Refine with Landmark Localization
Algorithm
• Landmark localization can be achieved in DenseBox simply by stacking a few extra layers, owing to the fully convolutional architecture.
Refine with Landmark Localization
Algorithm
Yann LeCun, “Learning Hierarchies of Invariant Features”, Center for Data Science & Courant Institute, NYU, http://www.slideshare.net/yandex/yann-le-cun
Refine with Landmark Localization
Algorithm
• Landmark localization loss, denoted $L_{lm}$
• $\lambda_{det}$ and $\lambda_{lm}$ control the balance among the three tasks
• The refined detection loss is denoted $L_{rf}$
Refine with Landmark Localization
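The slide leaves the combined objective implicit; reading the three bullets together, a plausible reconstruction of the full three-task loss (an assumption consistent with, but not spelled out on, this slide) is

$L_{full}(\theta) = \lambda_{det}\, L_{det}(\theta) + \lambda_{lm}\, L_{lm}(\theta) + L_{rf}(\theta)$

with $\lambda_{det}$ and $\lambda_{lm}$ trading the detection and landmark terms off against the refine loss $L_{rf}$.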
Experiments
1. Landmarks for face (MALF Face Detection Task)
2. 8 landmarks for car (KITTI Car Detection Task)
Experiments
• MALF Face Detection Task
• For each image, the longest side does not exceed 800 pixels.
• We test our model on each image at several scales. The test scales run from $2^{-3}$ to $2^{1.2}$ in steps of $2^{0.3}$ (see the sketch after this slide). This setting enables our models to detect faces from 20 pixels to 400 pixels in height.
• Results of three versions of DenseBox on the MALF dataset:
• DenseBoxNoLandmark denotes DenseBox without landmarks in training.
• DenseBoxLandmark is the model incorporating landmark localization.
• DenseBoxEnsemble is the result of ensembling 10 DenseBox models with landmarks from different batch iterations.
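The scale sweep is easy to reproduce; a one-line NumPy sketch:

```python
import numpy as np

# Test scales from 2^-3 to 2^1.2 in steps of 2^0.3, as stated above.
scales = 2.0 ** np.arange(-3.0, 1.2 + 1e-9, 0.3)
# [0.125, 0.154, ..., 2.297]; each test image is resized by every scale.
```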
Experiments
• KITTI Car Detection Task
• The key difficulty of the KITTI car detection task is that a great number of cars are small and occluded. We selectively annotate 8 landmarks for large cars.
• The evaluation metric of the KITTI car detection task differs from general object detection: KITTI requires an overlap of 70% for a true-positive bounding box, while other tasks such as face detection require only 50% overlap.
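For reference, the overlap criterion is the standard intersection-over-union; a minimal sketch:

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes: KITTI counts a true positive
    at IoU >= 0.7, while the face benchmark here uses 0.5."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    return inter / (area(a) + area(b) - inter)
```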
Experiments
(Figure: comparison of different DenseBox versions, and the recall curve.)
Experiments
| Method | Moderate (%) | Easy (%) | Hard (%) |
| --- | --- | --- | --- |
| Regionlets [23] | 76.45 | 84.75 | 59.70 |
| AOG [19] | 74.26 | 84.24 | 60.51 |
| 3DVP [40] | 75.77 | 87.46 | 65.38 |
| spCov_LBP | 77.40 | 87.19 | 60.60 |
| DeepInsight | 84.40 | 84.59 | 76.09 |
| NIPS ID 331 | 87.14 | 88.33 | 76.11 |
| DJML | 88.79 | 91.31 | 77.73 |
| DenseBox (without landmark) | 85.07 | 82.33 | 76.27 |
| DenseBox (with landmark) | 85.74 | 83.63 | 76.71 |
Experiments
(Figures: qualitative detection examples.)
Conclusion
• Performance can be boosted easily by incorporating landmark information.
• DenseBox achieves impressive performance on both the face detection and the car detection task, demonstrating that it is highly suitable for such challenging settings.
• The original DenseBox presented in this paper needs several seconds to process one image, but this has been addressed in our later version.
Editor's notes
  1. A unified, end-to-end approach to landmark localization, which they call DenseBox. arXiv (the X pronounced like the Greek letter χ, so it reads like "archive") is a website that collects preprints of papers in physics, mathematics, computer science, and biology. As of October 2008, arXiv.org had collected more than 500,000 preprints; by the end of 2014 it reached one million, growing in 2014 at a rate of roughly 8,000 papers per month.
  2. In the field of feature recognition, the roughly three representative papers I have read all propose only training methods, and use those methods to reduce the resources needed for training. The outline: first, how to generate the ground truth (GT) corresponding to the dataset for later use; then the model's extraction settings; training under multiple tasks (of different kinds); and finally how to annotate the parts of interest.
  3. The opening question: how well can a single FCN perform on object detection? The focus is on how far a one-stage FCN can go on object detection.
  4. DenseBox is a unified, end-to-end fully convolutional network architecture that directly predicts candidate regions and object classes, over all positions and scales of an image. Although similar to existing sliding-window-style FCN detection frameworks, DenseBox is a more carefully designed algorithm for detecting objects at smaller scales and under more occlusion.
  5. Finally, tasks such as checking for nearby cars and finding faces are the focus; on both challenges the method is state of the art, targeting faces and cars in particular.
  6. Two contributions: detecting occluded and small-scale objects is one; the second is combining landmark localization via multi-task learning.
  7. 1. Feed an image pyramid (multiple resolutions) into the network. 2. Use several convolution layers to sample features, then take the sampled features back and apply convolution layers to get the final result. 3. Use non-maximum suppression (converting those features into four directions and checking adjacent pixel pairs) so that the output bounding boxes exceed the threshold.
  8. Each output feature map contains a score and the distances from each pixel to the bounding-box corners. The final output is converted into bounding boxes, and non-maximum suppression is applied to those boxes, whose scores must exceed the threshold.
  9. The first step is training with the GT. In this paper the network is trained on images of a single scale; during training, the cropped patches are resized to 240x240, and the face the GT points to is roughly 50 pixels high, so the 5-channel map is 60x60, a downsampling of roughly 4x. Later experiments mention that for larger crops, 50 pixels is not the only height used.
  10. An image's GT is split into five channels: the first channel holds the GT location, the remaining four hold the distances to the four corner coordinates of the nearest bounding box. The GT location also acts as the score; pixels within the region are judged to have a certain score. These are the rules for building the GT.
  11. Even if multiple faces appear, any face within a certain range counts as detectable. Each pixel has 5 channels: one is its score, the other four are its distances to the bounding box, which makes later computation convenient. Outside the region the value is 0, inside it is 1, which is used to compute overlap.
  12. [35] mentions the use of multiple layers; convolution relates input to output, e.g. output = input * transfer function (the third function generated from f and g, representing the overlap area of f with a flipped and shifted g, also called the overlap integral of the two features). Here a deep convolutional network is used to search over a large range of sizes, but because the network is deeper,
  13. The convolutions above yield quite a few promising features. The pooling layers take high-dimensional features, cut them into several regions, and take the max or average of each, producing new, lower-dimensional features. This can be regarded as part of learning because repeated feature analysis and convolution yield deeper features, which are then resampled, trained, and checked against the loss. The model proposed by the VGG team serves as the representation used in these steps, embodying many computational principles; their model is used to build the network architecture.
  14. For multi-level feature fusion, recent years have seen convolution layers used in smaller, more varied tasks, such as edge detection. Part-level features can find finer and more distinctive appearance cues. The paper also lists the receptive-field sizes of the convolutions, using them to show that features are defined differently at different sizes: conv4_4's map is half the size of conv3_4's, which is why bilinear upsampling is used, going from large to small and back to the same resolution (input and output sizes agree).
  15. [35] mentions the use of multiple layers; convolution relates input to output, e.g. output = input * transfer function (the overlap integral of the two features, as above). Here a deep convolutional network is used to search over a large range of sizes.
  16. Classification is assisted in a manner similar to R-CNN (a region-based neural-network method used to help classification). The first branch means the output at every pixel is compared with the GT to get the difference, as in (1). The coordinate difference to the bounding box falls under (2), computing the possible result at each point. The L stands for loss, the possible error. L2 loss means taking the squared difference, which amplifies errors.
  17. Training across multiple tasks requires balanced sampling, with both positive and negative samples. Collecting negatives is actually a rather critical part. When sampling on the boundary between positive and negative, the detector is forced to degrade. Finally, a mask shows which pixels should be selected and which should not. Each mini-batch contains only one training sample. In training, full-batch gradient descent iterates over the whole set to find the gradient direction over the entire training set and follows it to the next iterate; mini-batch GD was proposed to work on only one patch rather than the full set of samples.
  18. Training across multiple tasks requires balanced sampling, with both positive and negative results.
  19. Pixels in the gray zone are discarded; the threshold is set to 2 pixels (meaning on the border). Since the negative case can be regarded as background, this amounts to taking a gray band on the border between outside and inside.
  20. Selecting the wrongly predicted data is relatively simple and obvious, and it helps strengthen prediction accuracy, much like deleting erroneous data. The label ratio is kept at 1:1 between positives and negatives, and f_sel decides whether to select (only wrongly predicted data are selected).
  21. For the mask selection: if a pixel is ignored or not selected, the mask counts as invalid, which affects (4). In (4), the red-boxed part first checks whether the GT is valid (since scores are compared); if the GT itself is invalid, no training is needed at all. λ_loc acts as a balancing factor between the score and location difference terms; finally d* is normalized, dividing the boxes by 50/4 (a detected object is 50 pixels in size).
  22. [35] mentions the use of multiple layers; convolution relates input to output, e.g. output = input * transfer function (the overlap integral of the two features, as above). Here a deep convolutional network is used to search over a large range of sizes, but because the network is deeper,
  23. This step fuses the earlier regional annotations; continual refinement surfaces the mining errors and feeds them back (though whether those errors are used afterwards is unclear). Detection and landmark are combined into the figure above, then expanded so the inspection size becomes the circled search results. In this step convolution simplifies the candidate results again and again and sums them up, producing the candidate parts. The 1x1x512 mentioned in red denotes the size, also called the channels, split into 512 candidate feature maps.
  24. The top left is the detection scheme; the middle stage, like the earlier red box, chains the separate results in. The four values passed along, mentioned earlier, represent the score and the bounding-box coordinates. The bottom-right picture is an example roughly explaining what NxNxN means: here NxN represents the size, while the last dimension may represent grayscale or RGB.
  25. L_lm is the landmark score, L_rf is the detection loss after re-examining the shrunken result, and L_det is the regional detection score; only after pairing them with weights do they form the exact total score.
  26. Two datasets are used in total: MALF and KITTI. The two challenges are as in the left and right figures. The left is about finding candidate facial features, with distinctive points on the eyebrows, nose, and mouth; like the cars, these are specially annotated features learned by the CNN.
  27. MALF contains more than five thousand images, but their poses tend toward frontal. Most sizes from small to large are present, and three different versions of DenseBox can be used at the test stage. Faces are roughly 50 pixels high, confirmed within the [0.8, 1.25] scale multiples.
  28. In the KITTI car detection dataset, the overlap must exceed 70%, while face detection needs 50%. In "The PASCAL Visual Object Classes (VOC) Challenge", 50% is used as the threshold, and various overlap ratios serve as criteria for judging how high the AP value is.
  29. The left figure shows a comparison of different DenseBox versions; the highest, Ensemble, combines 10 landmark DenseBox models from different batch iterations. The right figure shows the recall rate: in its search DenseBox does its best to find all the content of an image.
  30. At the time, DenseBox held places one through four for four months; today DJML beats the others (on KITTI car recognition). [19] integrates context and confirms occluded objects; [23] works by regional relocalization; [40] is a 3D spatial localization framework.
  31. Most feature extraction succeeds; the mis-detections shown in the bottom-right picture arise because those "features" look too much like a face.
  32. The paper is applied to cars and faces, but the initial version's speed is not fast enough. Later versions make some changes.