4. Abstract
• Most successful object detectors enumerate a nearly exhaustive
list of potential object locations and classify each
- wasteful, inefficient, and requires additional post-processing
• Model an object as a single point: the center point of its bounding box
- find center points
- regress other properties (size, 3D location, orientation, pose)
• CenterNet
- end-to-end differentiable
- simpler, faster, and more accurate
- runs in real-time
6. Introduction
Object Detection
Represent each object with an axis-aligned
bounding box that tightly encompasses the object
https://hoya012.github.io/blog/Tutorials-of-Object-Detection-Using-Deep-Learning-the-application-of-object-detection/
9. Introduction
Post-processing : NMS (Non Maximum Suppression)
https://heiwais25.github.io/machine%20learning/cnn/2018/05/10/non-maximum-suppression/
- Remove duplicated detections for the same instance by computing IoU
- Hard to differentiate through and train
- Most current detectors are not end-to-end because of NMS
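The duplicate-removal step above can be sketched as greedy IoU-based NMS. A minimal illustration only, not any particular detector's implementation:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones."""
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(i)
        order = np.array([j for j in order[1:]
                          if iou(boxes[i], boxes[j]) < iou_thresh])
    return keep
```

Because the keep/drop decision is a hard threshold on IoU, no gradient flows through this step, which is why the slide calls it hard to differentiate.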
10. Introduction
Objects as Points
- Center point
- Regression of other properties (size, dimension, 3D extent, orientation, and pose)
- Standard keypoint estimation problem
- Single network forward-pass without nms post-processing
11. Introduction
Extension to Other Tasks
- Easy to extend method to other tasks
- 3D object detection,
multi-person human pose estimation
- Additional outputs at each center point (right figure)
13. Preliminary
N Ukita et al. Semi- and Weakly-supervised Human Pose Estimation. arXiv:1906.0139
I ∈ R^{W×H×3} : input image
Ŷ ∈ [0,1]^{(W/R)×(H/R)×C} : keypoint heatmap
- R : output stride
- C : the number of keypoint types
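A minimal sketch of the tensor shapes above; W = H = 512, R = 4, C = 80 are assumed COCO-style values, not stated on this slide:

```python
import numpy as np

# Illustrative shapes only. W = H = 512, R = 4, C = 80 are assumed
# COCO-style defaults, not values taken from the slide itself.
W, H, R, C = 512, 512, 4, 80

image = np.zeros((H, W, 3))              # I ∈ R^{W×H×3}
heatmap = np.zeros((H // R, W // R, C))  # Ŷ ∈ [0,1]^{(W/R)×(H/R)×C}
```

With an output stride of R = 4, the network predicts one heatmap cell for every 4×4 patch of input pixels.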
16. Preliminary
Local Offset : L1 Loss
- To recover the discretization error caused by the output stride
- predict a local offset for each center point : Ô ∈ R^{(W/R)×(H/R)×2}
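The discretization error caused by the stride, and its recovery via the local offset, can be illustrated as follows (the keypoint coordinates are made up for the example):

```python
import numpy as np

# A hypothetical ground-truth keypoint p in input-image coordinates.
p = np.array([123.0, 57.0])
R = 4  # output stride

p_tilde = np.floor(p / R)   # coarse integer location on the low-res heatmap
offset = p / R - p_tilde    # the local offset the network learns to predict

# Coarse location plus offset recovers the exact original position:
recovered = (p_tilde + offset) * R
```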
17. Objects as Points
(x_1^(k), y_1^(k), x_2^(k), y_2^(k)) : the bounding box of object k with category c_k
- center point : p_k = ( (x_1^(k) + x_2^(k)) / 2 , (y_1^(k) + y_2^(k)) / 2 )
- object size : s_k = ( x_2^(k) − x_1^(k) , y_2^(k) − y_1^(k) )
- a single size prediction : Ŝ ∈ R^{(W/R)×(H/R)×2}
- L1 loss : L_size = (1/N) Σ_{k=1}^{N} | Ŝ_{p_k} − s_k |
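The center point, object size, and per-object L1 size loss follow directly from the definitions. Box coordinates and the predicted size below are illustration values:

```python
import numpy as np

# One hypothetical ground-truth box (x1, y1, x2, y2).
x1, y1, x2, y2 = 40.0, 60.0, 120.0, 180.0

p_k = np.array([(x1 + x2) / 2, (y1 + y2) / 2])  # center point
s_k = np.array([x2 - x1, y2 - y1])              # object size (w, h)

s_hat = np.array([76.0, 112.0])  # assumed size prediction Ŝ at p_k
l_size = np.abs(s_hat - s_k).sum()  # L1 loss contribution of this object
```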
18. Objects as Points
Training objective
L_det = L_k + λ_size L_size + λ_off L_off
- λ_size = 0.1, λ_off = 1
- C + 4 total outputs at each location (C heatmap channels + 2 size + 2 offset)
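A sketch of how the C + 4 output channels split at each location; C = 80 is an assumed COCO-style class count:

```python
import numpy as np

# The head outputs C + 4 channels per spatial location:
# C class heatmap channels, 2 size channels, 2 offset channels.
# C = 80 is an assumed COCO-style class count, not from the slide.
C = 80
out = np.zeros((128, 128, C + 4))  # one output map at stride-R resolution

heatmap = out[..., :C]       # Ŷ
size    = out[..., C:C + 2]  # Ŝ
offset  = out[..., C + 2:]   # Ô
```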
19. Objects as Points
From points to bounding boxes
P̂_c = {(x̂_i, ŷ_i)}_{i=1}^{n} : the set of n detected center points of class c
Bounding box location :
(x̂_i + δx̂_i − ŵ_i/2, ŷ_i + δŷ_i − ĥ_i/2,
x̂_i + δx̂_i + ŵ_i/2, ŷ_i + δŷ_i + ĥ_i/2)
- (δx̂_i, δŷ_i) = Ô_{x̂_i, ŷ_i} : the offset prediction
- (ŵ_i, ĥ_i) = Ŝ_{x̂_i, ŷ_i} : the size prediction
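Plugging made-up predictions into the decoding formula above (the center location, offset, and size are illustration values, in heatmap coordinates):

```python
import numpy as np

# Hypothetical per-point predictions at one detected center.
x_i, y_i = 30, 14      # detected center point (x̂_i, ŷ_i) on the heatmap
dx, dy = 0.75, 0.25    # offset prediction (δx̂_i, δŷ_i) = Ô at that point
w, h = 20.0, 28.0      # size prediction (ŵ_i, ĥ_i) = Ŝ at that point

# Decode directly into an (x1, y1, x2, y2) box, no NMS needed.
box = (x_i + dx - w / 2, y_i + dy - h / 2,
       x_i + dx + w / 2, y_i + dy + h / 2)
```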
20. Implementation details
A Newell et al. Stacked Hourglass Networks for Human Pose Estimation. arXiv:1603.06937
Experiment with 4 architectures
22. Implementation details
Experiment with 4 architectures: 1. Hourglass
A Newell et al. Stacked Hourglass Networks for Human Pose Estimation. arXiv:1603.06937
- Downsamples the input by 4x, followed by sequential hourglass modules (2 stacks used)
- Each hourglass module is a symmetric 5-layer down- and up-convolutional network with skip
connections
- Quite large but yields the best performance
23. Implementation details
Experiment with 4 architectures: 2. ResNet
B Xiao et al. Simple baselines for human pose estimation and tracking. arXiv:1804.0620
J Dai et al. Deformable Convolutional Networks. arXiv:1703.06211
- A standard residual network with three up-convolutional networks
- Add one 3x3 deformable convolutional layer before each up-conv
- Up-conv kernels initialized with bilinear interpolation
24. Implementation details
Experiment with 4 architectures: 3. DLA
F Yu et al. Deep Layer Aggregation. arXiv:1707.06484
- An image classification network with hierarchical skip connections
- Utilizes fully convolutional upsampling for dense prediction
- Replace the original conv with 3x3 deformable conv at every upsampling layer
25. Implementation details
Experiment with 4 architectures: COCO validation result
- N.A. : no test augmentation
- F : flip testing
- MS : multi-scale augmentation (0.5, 0.75, 1, 1.25, 1.5) with NMS to merge results
27. Conclusion
• A new representation for objects: as points
• Simple, fast, accurate, and end-to-end differentiable without NMS
post-processing
• Estimation of a range of additional object properties, such as
pose, 3D orientation, depth and extent, in one single forward pass
• A new direction for real-time object recognition