1. 3D Interpretation from Single 2D Image
for Autonomous Driving V
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
2. Outline
• MonoRUn: Monocular 3D Object Detection by Reconstruction and
Uncertainty Propagation
• M3DSSD: Monocular 3D Single Stage Object Detector
• Delving into Localization Errors for Monocular 3D Object Detection
• GrooMeD-NMS: Grouped Mathematically Differentiable NMS for
Monocular 3D Object Detection
• Objects are Different: Flexible Monocular 3D Object Detection
3. MonoRUn: Monocular 3D Object Detection by Reconstruction
and Uncertainty Propagation
• MonoRUn: self-supervised framework that learns dense
correspondences and geometry;
• Robust KL loss: minimizes the uncertainty-
weighted projection error;
• Uncertainty-aware region reconstruction
network for 3D object coordinate
regression;
• Uncertainty-driven PnP for object pose
and covariance matrix estimation;
• Codes: https://github.com/tjiiv-
cprg/MonoRUn.
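For a diagonal Gaussian, an uncertainty-weighted projection error reduces to a squared error scaled by the predicted variance plus a log penalty that stops the network from inflating the uncertainty for free. A minimal sketch of this idea (the function name and values are illustrative, not from the released code):

```python
import numpy as np

def uncertainty_weighted_reproj_loss(proj_err, log_sigma):
    """Gaussian-NLL-style loss: large predicted sigma down-weights
    unreliable correspondences, but the log-term penalizes inflated
    sigma, so confident-and-wrong predictions stay expensive."""
    sigma = np.exp(log_sigma)  # predict log-sigma for numerical stability
    return np.mean(0.5 * (proj_err / sigma) ** 2 + log_sigma)
```

A usage example: with unit uncertainty (`log_sigma = 0`) the loss is half the mean squared projection error.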
7. M3DSSD: Monocular 3D Single Stage
Object Detector
• Monocular 3D Single Stage object Detector (M3DSSD) with feature alignment and
asymmetric non-local attention.
• Current anchor-based monocular 3D object detection methods suffer from feature
mismatching.
• To overcome this, propose a two-step feature alignment approach.
• In the first step, the shape alignment is performed to enable the receptive field of
the feature map to focus on the pre-defined anchors with high confidence scores.
• In the second step, the center alignment is used to align the features at 2D/3D
centers.
• Further, it is often difficult to learn global information and capture long-range
relationships, which are important for the depth prediction of objects.
• An asymmetric non-local attention block with multi-scale sampling is proposed
to extract depth-wise features.
• The code is released at https://github.com/mumianyuxin/M3DSSD.
8. M3DSSD: Monocular 3D Single Stage
Object Detector
The architecture of M3DSSD. (a) The backbone of the framework, which is modified from DLA-102. (b)
The two-step feature alignment, classification head, 2D/3D center regression heads, and ANAB
especially designed for predicting the depth z3d. (c) Other regression heads.
9. M3DSSD: Monocular 3D Single Stage
Object Detector
The architecture of shape alignment
and the outcome of shape alignment
on objects. The yellow squares indicate
the sampling location of the AlignConv,
and the anchors are in red.
10. M3DSSD: Monocular 3D Single Stage
Object Detector
The architectures of center alignment and the
outcome of the center alignment. When
applying center alignment to objects, the
sampling locations on the foreground regions (in
white) all concentrate on the centers of objects
(in yellow) after center alignment, which are
close to the true centers of objects (in red).
11. M3DSSD: Monocular 3D Single Stage
Object Detector
Asymmetric Non-local Attention Block. The
key and query branches share the same
attention maps, which forces the key and
value to focus on the same place. Bottom:
Pyramid Average Pooling with Attention
(PA2) that generates different level
descriptors in various resolutions.
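The core of any non-local block is an all-pairs affinity between spatial positions, which is exactly the long-range context local convolutions miss. A minimal dense self-attention sketch (ANAB's asymmetry and the pyramid average pooling with attention are omitted; names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_attention(feat, wq, wk, wv):
    """feat: (N, C) flattened spatial positions. Every output position
    aggregates values from ALL positions via a learned affinity map."""
    q, k, v = feat @ wq, feat @ wk, feat @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]))  # (N, N) affinities
    return attn @ v
```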
15. Delving into Localization Errors for
Monocular 3D Object Detection
• Quantify the impact introduced by each sub-task and find that the
‘localization error’ is the vital factor restricting monocular 3D detection.
• Besides, investigate the underlying reasons behind localization errors,
analyze the issues they might bring, and propose three strategies.
• First, misalignment between the center of the 2D bounding box and the
projected center of the 3D object, which is a vital factor leading to low
localization accuracy.
• Second, we observe that accurately localizing distant objects with existing
technologies is almost impossible, and those samples mislead the learned
network; such samples are removed from the training set to improve the
overall performance of the detector.
• Lastly, a 3D IoU-oriented loss is proposed for the size estimation of the
object, which is not affected by ‘localization error’.
• Codes: https://github.com/xinzhuma/monodle.
17. Delving into Localization Errors for
Monocular 3D Object Detection
• Coupled with the errors accumulated by other tasks such as depth
estimation, it becomes an almost impossible task to accurately estimate the
3D bounding box of distant objects from a single monocular image, unless
the depth estimation is accurate enough (not achieved to date).
• For estimating the coarse center c, 1) use the projected 3D center cw as the
ground-truth for the branch estimating coarse center c and 2) force the
model to learn features from 2D detection simultaneously.
• Adopting the projected 3D center cw as the ground-truth for the coarse
center c makes that branch aware of 3D geometry and more relevant to the
task of estimating the 3D object center, which is the key to the localization
problem.
• 2D detection serves as an auxiliary task to learn better 3D-aware features.
18. Delving into Localization Errors for
Monocular 3D Object Detection
• Two schemes are proposed to generate the object-level training weight for each sample:
• Hard coding: discard all samples beyond a certain distance;
• Soft coding: generate the weight with a reverse sigmoid-like function.
• An IoU-oriented optimization for 3D size estimation: specifically, suppose all prediction
items except the 3D size s = [h, w, l]3D are completely correct.
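The soft-coding scheme can be sketched as a reversed sigmoid over object depth; `cutoff` and `k` below are illustrative constants, not the paper's values:

```python
import numpy as np

def soft_sample_weight(depth, cutoff=60.0, k=0.2):
    """Training weight per sample: ~1 for nearby objects, smoothly
    decaying toward 0 beyond the cutoff distance, so hopeless distant
    samples contribute little gradient without a hard discontinuity."""
    return 1.0 / (1.0 + np.exp(k * (depth - cutoff)))
```

Unlike hard coding, the weight transitions smoothly, so borderline samples are neither fully trusted nor fully discarded.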
22. GrooMeD-NMS: Grouped Mathematically Differentiable
NMS for Monocular 3D Object Detection
• While there have been attempts to include NMS in the training pipeline for
tasks such as 2D object detection, they have been less widely adopted due to
the lack of a mathematical expression of the NMS.
• Integrate GrooMeD-NMS, a Grouped Mathematically Differentiable NMS for
monocular 3D object detection, such that the network is trained end-to-end
with a loss on the boxes after NMS.
• First formulate NMS as a matrix operation, then group and mask the boxes
in an unsupervised manner to obtain a simple closed-form expression of the
NMS.
• GrooMeD-NMS addresses the mismatch between training and inference
pipelines and, therefore, forces the network to select the best 3D box in a
differentiable manner.
23. GrooMeD-NMS: Grouped Mathematically Differentiable
NMS for Monocular 3D Object Detection
(a) Conventional object detection has a mismatch between training and inference as it uses NMS
only in inference. (b) To address this, propose a novel GrooMeD-NMS layer, such that the network is
trained end-to-end with NMS applied. s and r denote the score of boxes B before and after the NMS
respectively. O denotes the matrix containing IoU2D overlaps of B. Lbefore denotes the losses before the
NMS, while Lafter denotes the loss after the NMS. (c) GrooMeD-NMS layer calculates r in a differentiable
manner giving gradients from Lafter when the best-localized box corresponding to an object is not
selected after NMS.
25. GrooMeD-NMS: Grouped Mathematically Differentiable
NMS for Monocular 3D Object Detection
Write the rescores r in matrix formulation, compactly as r = s − Pr, where P,
called the Prune Matrix, is obtained when the pruning function p operates
element-wise on O. To avoid recursion, use the closed-form solution
r = (I + P)^-1 s.
Cluster boxes in an image in an unsupervised manner based on IoU2D overlaps
to obtain the groups G. Grouping thus mimics the grouping of the classical
NMS, but does not rescore the boxes. Rewrite per group as
r_Gk = (I + P_Gk)^-1 s_Gk.
26. GrooMeD-NMS: Grouped Mathematically Differentiable
NMS for Monocular 3D Object Detection
Classical NMS considers the IoU2D of the top-scored box with other boxes.
This is equivalent to keeping only the column of O corresponding to the top
box while setting the rest of the columns to zero. Implement this through
masking of P_Gk: let M_Gk denote the binary mask whose entries in the
column corresponding to the top-scored box are 1 and the rest are 0.
Since I + M_Gk ⊙ P_Gk is a Frobenius matrix, its inverse is obtained by
negating the off-diagonal entries, which simplifies the rescoring to
r_Gk = (I − M_Gk ⊙ P_Gk) s_Gk.
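Putting the pieces together, the masked closed form can be sketched for a single group in NumPy. The threshold pruning function and values here are illustrative; the paper explores several differentiable pruning choices:

```python
import numpy as np

def groomed_rescore(scores, iou, nms_thresh=0.4):
    """Closed-form rescoring sketch for ONE group of boxes.
    With the mask keeping only the top box's column, the Frobenius
    inverse reduces to r_i = s_i - p(o_i1) * s_1: each box is
    suppressed in proportion to its overlap with the top box."""
    order = np.argsort(-scores)                   # sort by descending score
    s = scores[order]
    o_top = iou[order][:, order][:, 0]            # overlaps with top box
    p = np.where(o_top > nms_thresh, o_top, 0.0)  # simple pruning function
    p[0] = 0.0                                    # top box never prunes itself
    r = np.clip(s - p * s[0], 0.0, 1.0)
    out = np.empty_like(r)
    out[order] = r                                # restore input ordering
    return out
```

Because every step is a differentiable (or piecewise-differentiable) matrix operation, gradients from a loss on the rescores can flow back to the scores.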
28. GrooMeD-NMS: Grouped Mathematically Differentiable
NMS for Monocular 3D Object Detection
The method consists of M3D-RPN and uses binning and
self-balancing confidence. The boxes’ self-balancing
confidences are used as scores s, which pass through the
GrooMeD-NMS layer to obtain the rescores r. The
rescores signal the network when the best box has not
been selected for a particular object.
For calculating gIoU3D, first calculate the volume V
and hull volume Vhull of the 3D boxes. Vhull is the
product of the hull in Bird’s Eye View (BEV), computed
without rotations, and the hull of the Y dimension.
If the best boxes are correctly ranked in one image
but not in the second, then the gradients should only
affect the boxes of the second image; this modification
is termed the Image-wise AP-Loss.
Use the modified AP-Loss as the loss after NMS, since
AP-Loss does not suffer from class imbalance.
31. Objects are Different: Flexible Monocular 3D
Object Detection
• Most existing methods adopt the same approach for all objects regardless of
diverse distributions, leading to limited performance for truncated objects.
• A flexible framework for monocular 3D object detection which explicitly decouples
the truncated objects and adaptively combines multiple approaches for object
depth estimation.
• Specifically, decouple the edge of the feature map for predicting long-tail truncated
objects so that the optimization of normal objects is not influenced.
• Furthermore, formulate the object depth estimation as an uncertainty-guided
ensemble of directly regressed object depth and solved depths from different
groups of keypoints.
• Code: https://github.com/zhangyp15/MonoFlex.
33. Objects are Different: Flexible Monocular 3D
Object Detection
• The framework is extended from CenterNet, where objects are identified by their
representative points and predicted by peaks of the heatmap.
• First, the CNN backbone extracts feature maps from the monocular image as the
input for multiple prediction heads.
• Multiple prediction branches are deployed on the shared backbone to regress
objects’ properties, including the 2D bounding box, dimension, orientation,
keypoints, and depth.
• The image-level localization involves the heatmap and offsets, where the edge
fusion modules are used to decouple the feature learning and prediction of
truncated objects.
• The final depth estimation is an uncertainty guided combination of the regressed
depth and the computed depths from estimated keypoints and dimensions.
• The adaptive depth ensemble adopts four methods for depth estimation and
simultaneously predicts their uncertainties, which are utilized to form an uncertainty
weighted prediction.
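One simple instance of an uncertainty-guided combination is inverse-uncertainty weighting; the paper's exact weighting may differ, but the principle is the same: confident estimators dominate the ensemble.

```python
import numpy as np

def depth_ensemble(depths, sigmas):
    """Combine several depth estimates (direct regression plus
    keypoint-solved depths) with weights inversely proportional to
    their predicted uncertainties."""
    w = 1.0 / np.asarray(sigmas)
    w = w / w.sum()                    # normalize weights to sum to 1
    return float(np.dot(w, depths))
```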
34. Objects are Different: Flexible Monocular 3D
Object Detection
The dimension and orientation can be directly inferred
from appearance-based clues, while the 3D location is
converted to the projected 3D center xc = (uc, vc) and the
object depth z as
Existing methods utilize a unified representation xr, the
center of 2D bounding box xb, for every object. In such
cases, the offset 𝛿c = xc - xb is regressed to derive the
projected 3D center xc.
(a) The 3D location is converted to the projected
center and the object depth. (b) The distribution of
the offsets c from 2D centers to projected 3D
centers. Inside and outside objects exhibit entirely
different distributions.
Divide objects into two groups depending on whether
their projected 3D centers are inside or outside the
image.
The joint learning of 𝛿c can suffer from long-tail
offsets; therefore the representations and the
offset learning of inside and outside objects are
decoupled.
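The conversion between the 3D location and the pair (projected center, depth) is the standard pinhole relation; a minimal sketch with an assumed intrinsic matrix K:

```python
import numpy as np

def project_center(xyz, K):
    """Project a 3D object center (camera coordinates) to the image:
    [u, v, 1]^T ~ K [x, y, z]^T. The network then only has to predict
    the projected center (u_c, v_c) and the depth z."""
    x, y, z = xyz
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return u, v

def backproject(u, v, z, K):
    """Inverse mapping: projected center + depth -> 3D location."""
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return x, y, z
```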
35. Objects are Different: Flexible Monocular 3D
Object Detection
Discretization offset errors are defined separately for inside and outside objects.
(a) The intersection xI between the image edge and
the line from xb to xc is used to represent the
truncated object. (b) The edge heatmap is generated
with 1D Gaussian distribution. (c) The edge
intersection xI (cyan) is a better representation than
2D center xb (green) for heavily truncated objects.
Since 2D bounding boxes only capture the inside-
image part of objects, the visual locations of xb can
be confusing and even on other objects. By contrast,
the intersection xI disentangles the edge area of the
heatmap to focus on outside objects and offers a
strong boundary prior to simplify the localization.
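The intersection x_I can be computed by clipping the segment from xb toward xc against the image border; a sketch of one plausible parametrization (the paper's exact convention may differ):

```python
def edge_intersection(xb, xc, width, height):
    """Return the point where the segment from xb (inside the image)
    toward xc (possibly outside) crosses the image boundary. If xc is
    already inside, xc itself is returned (t stays at 1)."""
    (xb_x, xb_y), (xc_x, xc_y) = xb, xc
    t = 1.0
    if xc_x < 0:
        t = min(t, (0 - xb_x) / (xc_x - xb_x))
    elif xc_x > width - 1:
        t = min(t, (width - 1 - xb_x) / (xc_x - xb_x))
    if xc_y < 0:
        t = min(t, (0 - xb_y) / (xc_y - xb_y))
    elif xc_y > height - 1:
        t = min(t, (height - 1 - xb_y) / (xc_y - xb_y))
    return (xb_x + t * (xc_x - xb_x), xb_y + t * (xc_y - xb_y))
```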
36. Objects are Different: Flexible Monocular 3D
Object Detection
• Edge fusion module to further decouple the feature learning and prediction of
outside objects;
• The module first extracts four boundaries of the feature map and concatenates
them into an edge feature vector in clockwise order, which is then processed by
two 1D convolutional layers to learn unique features for truncated objects.
• Finally, the processed vector is remapped to the four boundaries and added to
the input feature map.
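The boundary-extraction step can be sketched in NumPy; the clockwise order and corner handling below are one plausible convention, not necessarily the paper's:

```python
import numpy as np

def edge_feature_vector(feat):
    """Concatenate the four boundaries of a (C, H, W) feature map into
    a (C, 2H + 2W - 4) edge vector in clockwise order: top row, right
    column, bottom row (reversed), left column (reversed). Each corner
    pixel appears exactly once."""
    top = feat[:, 0, :]            # (C, W)
    right = feat[:, 1:, -1]        # (C, H-1)
    bottom = feat[:, -1, -2::-1]   # (C, W-1), right-to-left
    left = feat[:, -2:0:-1, 0]     # (C, H-2), bottom-to-top
    return np.concatenate([top, right, bottom, left], axis=1)
```

The 1D convolutions then run along this vector, and the processed values are scattered back onto the same boundary positions.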
37. Objects are Different: Flexible Monocular 3D
Object Detection
Relation of global orientation, local
orientation, and the viewing angle.
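The relation follows the standard KITTI convention: the global (egocentric) yaw r_y equals the appearance-based local (allocentric) orientation α plus the viewing angle arctan(x/z). A one-line sketch:

```python
import math

def global_yaw(local_alpha, x, z):
    """KITTI convention: alpha = r_y - arctan(x / z), so the global yaw
    is recovered from the regressed local orientation and the object's
    lateral/depth position in camera coordinates."""
    return local_alpha + math.atan2(x, z)
```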
Keypoints include the projections of the eight vertices, top center, and
bottom center of the 3D bounding box. The depth of a supporting line of
the 3D bounding box can be computed from the object height and the
line’s pixel height. Split the ten keypoints into three groups, each of
which can produce the center depth independently.
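The supporting-line depth follows the pinhole similar-triangles relation: an object of metric height H spanning h pixels lies at depth z = f_y · H / h. A minimal sketch:

```python
def depth_from_height(obj_height_3d, pixel_height, focal_y):
    """Solve depth from the estimated 3D height of the object and the
    pixel height of a vertical supporting line of its 3D box."""
    return focal_y * obj_height_3d / pixel_height
```

For example, a 1.5 m tall car spanning 70 pixels under a 700-pixel focal length is 15 m away; averaging the estimates over the three keypoint groups makes this robust to individual keypoint errors.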