4. Abstract
• Most successful object detectors enumerate a nearly exhaustive
list of potential object locations and classify each
- wasteful, inefficient, and requires additional post-processing
• Model an object as a single point: the center point of its bounding box
- find center points
- regress other properties (size, 3D location, orientation, pose)
• CenterNet
- end-to-end differentiable
- simpler, faster, and more accurate
- runs in real-time
6. Introduction
Object Detection
Represent each object with an axis-aligned
bounding box that tightly encompasses the object
https://hoya012.github.io/blog/Tutorials-of-Object-Detection-Using-Deep-Learning-the-application-of-object-detection/
9. Introduction
Post-processing : NMS (Non Maximum Suppression)
https://heiwais25.github.io/machine%20learning/cnn/2018/05/10/non-maximum-suppression/
- Remove duplicated detections for the same instance by computing IoU
- Hard to differentiate through and train
- Most current detectors are not end-to-end because of NMS
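The duplicate-removal step above can be sketched as greedy IoU-based NMS. A minimal illustration only, not any particular detector's implementation:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones."""
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(i)
        order = np.array([j for j in order[1:]
                          if iou(boxes[i], boxes[j]) < iou_thresh])
    return keep
```

Because the keep/drop decision is a hard threshold on IoU, no gradient flows through this step, which is why the slide calls it hard to differentiate.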
10. Introduction
Objects as Points
- Center point
- Regression of other properties (size, dimension, 3D extent, orientation, and pose)
- Standard keypoint estimation problem
- Single network forward-pass without nms post-processing
11. Introduction
Extension to Other Tasks
- Easy to extend method to other tasks
- 3D object detection,
multi-person human pose estimation
- Additional outputs at each center point (right figure)
13. Preliminary
N Ukita et al. Semi- and Weakly-supervised Human Pose Estimation. arXiv:1906.0139
I ∈ R^{W×H×3} : input image
Ŷ ∈ [0,1]^{(W/R)×(H/R)×C} : keypoint heatmap
- R : output stride
- C : the number of keypoint types
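A minimal sketch of the tensor shapes above; W = H = 512, R = 4, C = 80 are assumed COCO-style values, not stated on this slide:

```python
import numpy as np

# Illustrative shapes only. W = H = 512, R = 4, C = 80 are assumed
# COCO-style defaults, not values taken from the slide itself.
W, H, R, C = 512, 512, 4, 80

image = np.zeros((H, W, 3))              # I ∈ R^{W×H×3}
heatmap = np.zeros((H // R, W // R, C))  # Ŷ ∈ [0,1]^{(W/R)×(H/R)×C}
```

With an output stride of R = 4, the network predicts one heatmap cell for every 4×4 patch of input pixels.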
16. Preliminary
Local Offset : L1 Loss
- To recover the discretization error caused by the output stride
- predict a local offset for each center point : Ô ∈ R^{(W/R)×(H/R)×2}
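The discretization error caused by the stride, and its recovery via the local offset, can be illustrated as follows (the keypoint coordinates are made up for the example):

```python
import numpy as np

# A hypothetical ground-truth keypoint p in input-image coordinates.
p = np.array([123.0, 57.0])
R = 4  # output stride

p_tilde = np.floor(p / R)   # coarse integer location on the low-res heatmap
offset = p / R - p_tilde    # the local offset the network learns to predict

# Coarse location plus offset recovers the exact original position:
recovered = (p_tilde + offset) * R
```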
17. Objects as Points
(x_1^(k), y_1^(k), x_2^(k), y_2^(k)) : the bounding box of object k with category c_k
- center point : p_k = ( (x_1^(k) + x_2^(k)) / 2 , (y_1^(k) + y_2^(k)) / 2 )
- object size : s_k = ( x_2^(k) − x_1^(k) , y_2^(k) − y_1^(k) )
- a single size prediction : Ŝ ∈ R^{(W/R)×(H/R)×2}
- L1 loss : L_size = (1/N) Σ_{k=1}^{N} | Ŝ_{p_k} − s_k |
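The center point, object size, and per-object L1 size loss follow directly from the definitions. Box coordinates and the predicted size below are illustration values:

```python
import numpy as np

# One hypothetical ground-truth box (x1, y1, x2, y2).
x1, y1, x2, y2 = 40.0, 60.0, 120.0, 180.0

p_k = np.array([(x1 + x2) / 2, (y1 + y2) / 2])  # center point
s_k = np.array([x2 - x1, y2 - y1])              # object size (w, h)

s_hat = np.array([76.0, 112.0])  # assumed size prediction Ŝ at p_k
l_size = np.abs(s_hat - s_k).sum()  # L1 loss contribution of this object
```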
18. Objects as Points
Training objective
L_det = L_k + λ_size L_size + λ_off L_off
- λ_size = 0.1, λ_off = 1
- C + 4 total outputs at each location (C heatmap channels + 2 size + 2 offset)
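A sketch of how the C + 4 output channels split at each location; C = 80 is an assumed COCO-style class count:

```python
import numpy as np

# The head outputs C + 4 channels per spatial location:
# C class heatmap channels, 2 size channels, 2 offset channels.
# C = 80 is an assumed COCO-style class count, not from the slide.
C = 80
out = np.zeros((128, 128, C + 4))  # one output map at stride-R resolution

heatmap = out[..., :C]       # Ŷ
size    = out[..., C:C + 2]  # Ŝ
offset  = out[..., C + 2:]   # Ô
```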
19. Objects as Points
From points to bounding boxes
P̂_c = {(x̂_i, ŷ_i)}_{i=1}^{n} : the set of n detected center points of class c
Bounding box location :
(x̂_i + δx̂_i − ŵ_i/2, ŷ_i + δŷ_i − ĥ_i/2,
x̂_i + δx̂_i + ŵ_i/2, ŷ_i + δŷ_i + ĥ_i/2)
- (δx̂_i, δŷ_i) = Ô_{x̂_i, ŷ_i} : the offset prediction
- (ŵ_i, ĥ_i) = Ŝ_{x̂_i, ŷ_i} : the size prediction
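Plugging made-up predictions into the decoding formula above (the center location, offset, and size are illustration values, in heatmap coordinates):

```python
import numpy as np

# Hypothetical per-point predictions at one detected center.
x_i, y_i = 30, 14      # detected center point (x̂_i, ŷ_i) on the heatmap
dx, dy = 0.75, 0.25    # offset prediction (δx̂_i, δŷ_i) = Ô at that point
w, h = 20.0, 28.0      # size prediction (ŵ_i, ĥ_i) = Ŝ at that point

# Decode directly into an (x1, y1, x2, y2) box, no NMS needed.
box = (x_i + dx - w / 2, y_i + dy - h / 2,
       x_i + dx + w / 2, y_i + dy + h / 2)
```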
20. Implementation details
A Newell et al. Stacked Hourglass Networks for Human Pose Estimation. arXiv:1603.06937
Experiment with 4 architectures
22. Implementation details
Experiment with 4 architectures: 1. Hourglass
A Newell et al. Stacked Hourglass Networks for Human Pose Estimation. arXiv:1603.06937
- Downsamples the input by 4x, followed by sequential hourglass modules (2 stacks used)
- Each hourglass module is a symmetric 5-layer down- and up-convolutional network with skip
connections
- Quite large but yields the best performance
23. Implementation details
Experiment with 4 architectures: 2. ResNet
B Xiao et al. Simple baselines for human pose estimation and tracking. arXiv:1804.0620
J Dai et al. Deformable Convolutional Networks. arXiv:1703.06211
- A standard residual network with three up-convolutional networks
- Add one 3x3 deformable convolutional layer before each up-conv
- Up-conv kernels initialized with bilinear interpolation
24. Implementation details
Experiment with 4 architectures: 3. DLA
F Yu et al. Deep Layer Aggregation. arXiv:1707.06484
- An image classification network with hierarchical skip connections
- Utilizes fully convolutional upsampling for dense prediction
- Replace the original conv with 3x3 deformable conv at every upsampling layer
25. Implementation details
Experiment with 4 architectures: COCO validation result
- N.A. : no test augmentation
- F : flip testing
- MS : multi-scale augmentation (0.5, 0.75, 1, 1.25, 1.5) with NMS to merge results
27. Conclusion
• A new representation for objects: as points
• Simple, fast, accurate, and end-to-end differentiable without NMS
post-processing
• Estimation of a range of additional object properties, such as
pose, 3D orientation, depth and extent, in one single forward pass
• A new direction for real-time object recognition