1. Joint Detection & Segmentation
in BEV Representation
Yu Huang
Sunnyvale, California
Yu.huang07@gmail.com
2. Outline
• M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird’s-Eye View Representation
• BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving
• Learning Ego 3D Representation as Ray Tracing
• BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation
• Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer
• BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework
• BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
3. M2BEV: Multi-Camera Joint 3D Detection and Segmentation
with Unified Bird’s-Eye View Representation
• M2BEV, a unified framework that jointly performs 3D object detection and map
segmentation in the Bird’s Eye View (BEV) space with multi-camera image inputs.
• Unlike the majority of previous works which separately process detection and
segmentation, M2BEV infers both tasks with a unified model and improves efficiency.
• M2BEV efficiently transforms multi-view 2D image features into a 3D BEV feature in ego-car coordinates (see the sketch after this list).
• Such BEV representation is important as it enables different tasks to share a single
encoder.
• This framework further contains four important designs that benefit both accuracy and efficiency: (1) an efficient BEV encoder design that reduces the spatial dimension of the voxel feature map; (2) a dynamic box assignment strategy that uses learning-to-match to assign ground-truth 3D boxes to anchors; (3) a BEV centerness re-weighting that assigns larger weights to more distant predictions; and (4) large-scale 2D detection pre-training and auxiliary supervision.
• M2BEV is memory efficient, allowing significantly higher resolution images as input, with
faster inference speed.
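The core of this pipeline is the unprojection step: M2BEV assumes a uniform depth distribution along each camera ray, so every voxel a ray passes through receives the same 2D feature, with no depth estimation. Below is a minimal NumPy sketch of that lifting for a single camera; the function and argument names are illustrative assumptions, and the real implementation runs batched over multi-camera tensors on GPU.

```python
import numpy as np

def lift_to_voxels(feat_2d, K, T_cam_from_ego, voxel_centers):
    """Copy 2D image features into a 3D voxel grid along camera rays.

    M2BEV-style lifting: every voxel on a ray receives the same 2D
    feature, i.e. a uniform-depth assumption (no depth estimation).
    Shapes here are illustrative, not the paper's exact layout.

    feat_2d:        (C, H, W) image feature map
    K:              (3, 3) camera intrinsics
    T_cam_from_ego: (4, 4) ego -> camera extrinsics
    voxel_centers:  (N, 3) voxel centers in ego coordinates
    """
    C, H, W = feat_2d.shape
    N = voxel_centers.shape[0]
    # Homogeneous ego coordinates -> camera frame.
    pts = np.concatenate([voxel_centers, np.ones((N, 1))], axis=1)
    cam = (T_cam_from_ego @ pts.T)[:3]                  # (3, N)
    # Perspective projection; keep only points in front of the camera.
    in_front = cam[2] > 1e-3
    uvw = K @ cam
    u = np.round(uvw[0] / np.clip(uvw[2], 1e-3, None)).astype(int)
    v = np.round(uvw[1] / np.clip(uvw[2], 1e-3, None)).astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    voxels = np.zeros((N, C), dtype=feat_2d.dtype)
    voxels[valid] = feat_2d[:, v[valid], u[valid]].T    # (num_valid, C)
    return voxels
```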
9. BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving
• BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
• Unlike existing studies focusing on the improvement of single-task approaches, BEVerse produces spatio-temporal Birds-Eye-View (BEV) representations from multi-camera videos and jointly reasons about multiple tasks for vision-centric autonomous driving.
• Specifically, BEVerse first performs shared feature extraction and lifting to generate
4D BEV representations from multi-timestamp and multi-view images.
• After the ego-motion alignment, the spatio-temporal encoder is utilized for further feature extraction in BEV.
• Finally, multiple task decoders are attached for joint reasoning and prediction.
• Within the decoders, a grid sampler is proposed to generate BEV features with different ranges and granularities for different tasks (see the sketch after this list).
• The method of iterative flow is also designed for memory-efficient future prediction.
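A minimal PyTorch sketch of the grid-sampler idea: resample one shared BEV feature map to a task-specific metric range and resolution via bilinear interpolation. The function name, coordinate conventions, and example ranges below are assumptions, not BEVerse's exact configuration.

```python
import torch
import torch.nn.functional as F

def grid_sampler(bev, src_range, dst_range, dst_size):
    """Resample a shared BEV feature map to a task-specific range/resolution.

    bev:       (B, C, H, W) shared BEV features covering src_range
    src_range: (xmin, ymin, xmax, ymax) metric extent of `bev`
    dst_range: metric extent wanted by the task decoder
    dst_size:  (H_out, W_out) output resolution
    """
    B = bev.shape[0]
    x0, y0, x1, y1 = dst_range
    sx0, sy0, sx1, sy1 = src_range
    H_out, W_out = dst_size
    # Metric coordinates of the output grid cells.
    ys = torch.linspace(y0, y1, H_out)
    xs = torch.linspace(x0, x1, W_out)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    # Normalize into [-1, 1] w.r.t. the source extent (grid_sample convention).
    nx = 2 * (gx - sx0) / (sx1 - sx0) - 1
    ny = 2 * (gy - sy0) / (sy1 - sy0) - 1
    grid = torch.stack([nx, ny], dim=-1).expand(B, H_out, W_out, 2)
    return F.grid_sample(bev, grid, mode="bilinear", align_corners=True)

# e.g. detection at 0.5 m over 100 m x 100 m; mapping at 0.15 m over 30 m x 60 m
bev = torch.randn(1, 64, 200, 200)
det = grid_sampler(bev, (-50, -50, 50, 50), (-50, -50, 50, 50), (200, 200))
seg = grid_sampler(bev, (-50, -50, 50, 50), (-15, -30, 15, 30), (400, 200))
```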
18. Learning Ego 3D Representation as Ray Tracing
• A self-driving perception model aims to extract 3D semantic representations from multiple cameras collectively into the bird’s-eye-view (BEV) coordinate frame of the ego car in order to ground the downstream planner.
• Existing perception methods often rely on error-prone depth estimation of the
whole scene or learning sparse virtual 3D representations without the target
geometry structure, both of which remain limited in performance and/or
capability.
• This paper presents an end-to-end architecture for ego 3D representation learning from an arbitrary number of unconstrained camera views.
• Inspired by the ray tracing principle, a polarized grid of “imaginary eyes” is designed as the learnable ego 3D representation, and the learning process is formulated with an adaptive attention mechanism in conjunction with the 3D-to-2D projection (see the sketch after this list).
• Critically, this formulation allows extracting a rich 3D representation from 2D images without any depth supervision, with a built-in geometry structure consistent w.r.t. the BEV.
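To make the “imaginary eyes” concrete, here is a small NumPy sketch: build a polarized BEV grid of eye positions in the ego frame, then project them into an image to obtain the 2D locations that the adaptive attention would attend to. The grid sizes, ground-plane height, and names are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def polarized_grid(num_angles=80, num_radii=40, r_max=50.0):
    """Polar BEV grid of 'imaginary eye' positions in the ego frame."""
    thetas = np.linspace(0, 2 * np.pi, num_angles, endpoint=False)
    radii = np.linspace(1.0, r_max, num_radii)
    t, r = np.meshgrid(thetas, radii, indexing="ij")
    z = np.zeros_like(r)                       # assume eyes on the ground plane
    return np.stack([r * np.cos(t), r * np.sin(t), z], axis=-1)  # (A, R, 3)

def project_eyes(eyes, K, T_cam_from_ego):
    """3D-to-2D projection of the eye grid: where each eye 'looks' in an image.

    Returns pixel coordinates (u, v) and a visibility mask; in the actual
    model these locations seed adaptive attention over 2D image features.
    """
    pts = eyes.reshape(-1, 3)
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    cam = (T_cam_from_ego @ pts_h.T)[:3]
    visible = cam[2] > 1e-3
    uv = (K @ cam)[:2] / np.clip(cam[2], 1e-3, None)
    return uv.T.reshape(*eyes.shape[:2], 2), visible.reshape(eyes.shape[:2])
```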
19. Learning Ego 3D Representation as Ray Tracing
(a) The first strategy, represented by LSS and CaDDN, is based on dense pixel-level depth estimation. (b) The second strategy, represented by PON, bypasses depth estimation by learning an implicit 2D-3D projection. (c) The proposed strategy backtracks 2D information from “imaginary eyes” specially designed in the BEV’s geometry.
23. BEVFusion: Multi-Task Multi-Sensor Fusion
with Unified Bird’s-Eye View Representation
• Recent approaches are based on point-level fusion: augmenting the LiDAR point
cloud with camera features.
• However, the camera-to-LiDAR projection throws away the semantic density of
camera features, hindering the effectiveness of such methods, especially for
semantic-oriented tasks (such as 3D scene segmentation).
• This paper breaks this deeply-rooted convention with BEVFusion, an efficient and
generic multi-task multi-sensor fusion framework.
• It unifies multi-modal features in the shared bird’s-eye view (BEV) representation space, which nicely preserves both geometric and semantic information.
• To achieve this, the key efficiency bottlenecks in the view transformation are diagnosed and lifted with optimized BEV pooling, reducing latency by more than 40× (see the sketch after this list).
• BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D
perception tasks with almost no architectural changes.
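A NumPy sketch of the BEV-pooling bookkeeping: since the camera frustum geometry is fixed by calibration, the frustum-point-to-BEV-cell association can be precomputed once and cached. The paper's 40× speedup additionally relies on a specialized GPU interval-reduction kernel, which this sketch does not reproduce; names and shapes are assumptions.

```python
import numpy as np

def precompute_bev_indices(frustum_xyz, bev_range, voxel_size, bev_shape):
    """Map every frustum feature point to a flat BEV cell index, once.

    frustum_xyz: (M, 3) frustum points in ego coordinates (height is
                 collapsed, so only x and y select the BEV cell)
    bev_range:   (xmin, ymin, xmax, ymax) metric extent of the BEV grid
    """
    x0, y0 = bev_range[0], bev_range[1]
    ix = ((frustum_xyz[:, 0] - x0) / voxel_size).astype(int)
    iy = ((frustum_xyz[:, 1] - y0) / voxel_size).astype(int)
    H, W = bev_shape
    valid = (ix >= 0) & (ix < W) & (iy >= 0) & (iy < H)
    return iy * W + ix, valid

def bev_pool(feats, flat_idx, valid, bev_shape, C):
    """Scatter-add frustum features into the BEV grid using cached indices."""
    H, W = bev_shape
    out = np.zeros((H * W, C), dtype=feats.dtype)
    np.add.at(out, flat_idx[valid], feats[valid])   # sum-pool per BEV cell
    return out.reshape(H, W, C)
```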
28. Efficient and Robust 2D-to-BEV Representation
Learning via Geometry-guided Kernel Transformer
• This work presents the Geometry-guided Kernel Transformer (GKT), an efficient and robust 2D-to-BEV representation learning mechanism.
• GKT leverages the geometric priors to guide the transformer to focus on
discriminative regions, and unfolds kernel features to generate BEV
representation.
• For fast inference, a look-up table (LUT) indexing method is further introduced to remove the dependence on the camera’s calibrated parameters at runtime (see the sketch after this list).
• GKT can run at 72.3 FPS on an RTX 3090 GPU / 45.6 FPS on an RTX 2080 Ti GPU, and is robust to camera deviation and the predefined BEV height.
• GKT achieves state-of-the-art real-time segmentation results, i.e., 38.0 mIoU (100 m × 100 m perception range at 0.5 m resolution) on the nuScenes val set.
• Code and models will be available at https://github.com/hustvl/GKT.
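A minimal NumPy sketch of the LUT idea: offline, each BEV grid center is projected into the camera with the calibrated parameters and the resulting feature-map index is stored; at runtime, the cached indices alone suffice to unfold a k×k kernel of image features around each correspondence. Names and shapes are assumptions; the real GKT then attends over the unfolded kernel features.

```python
import numpy as np

def build_lut(bev_centers, K, T_cam_from_ego, feat_hw, stride):
    """Offline: project BEV grid centers into a camera's feature map.

    Because calibration is fixed, the BEV->pixel correspondence is computed
    once and stored, so no camera parameters are needed at runtime.
    `stride` maps image pixels to feature-map cells.
    """
    H, W = feat_hw
    pts = np.concatenate([bev_centers, np.ones((len(bev_centers), 1))], axis=1)
    cam = (T_cam_from_ego @ pts.T)[:3]
    uv = (K @ cam)[:2] / np.clip(cam[2], 1e-3, None)
    fu, fv = (uv[0] / stride).astype(int), (uv[1] / stride).astype(int)
    valid = (cam[2] > 1e-3) & (fu >= 0) & (fu < W) & (fv >= 0) & (fv < H)
    return np.stack([fv, fu], axis=1), valid      # (N, 2) feature-map indices

def gather_kernel(feat, lut, valid, k=3):
    """Runtime: unfold a k x k kernel of features around each LUT position."""
    C, H, W = feat.shape
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((len(lut), C, k, k), dtype=feat.dtype)
    for i, (v, u) in enumerate(lut):
        if valid[i]:
            out[i] = padded[:, v:v + k, u:u + k]
    return out
```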
29. Efficient and Robust 2D-to-BEV Representation
Learning via Geometry-guided Kernel Transformer
(a) Geometry-based pointwise transformation leverages the camera’s calibrated parameters (intrinsics and extrinsics) to determine the correspondence (one-to-one or one-to-many) between 2D positions and BEV grids. (b) Geometry-free global transformation considers the full correlation between image and BEV: each BEV grid cell interacts with all image pixels.
32. BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework
• Fusing camera and LiDAR information has become a de facto standard for 3D object detection tasks.
• Current methods rely on point clouds from the LiDAR sensor as queries to leverage features from the image space.
• However, this underlying assumption makes current fusion frameworks unable to produce any prediction when the LiDAR malfunctions, whether the failure is minor or major.
• BEVFusion is a simple fusion framework whose camera stream does not depend on LiDAR input, thus addressing this downside of previous methods (see the sketch after this list).
• Under the robustness training settings that simulate various LiDAR malfunctions,
this framework surpasses the state-of-the-art methods by 15.7% to 28.9% mAP.
• The code is available at https://github.com/ADLab-AutoDrive/BEVFusion.
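The design principle can be sketched as two independent BEV streams joined by a late fusion step: because neither stream queries the other, zeroing out the LiDAR branch still leaves the camera branch able to predict. The channel sizes and the concat-plus-conv fusion below are simplifications (the paper describes a more elaborate learned fusion module); all names are illustrative.

```python
import torch
import torch.nn as nn

class SimpleBEVFusion(nn.Module):
    """Fuse camera-BEV and LiDAR-BEV features from independent streams."""

    def __init__(self, c_cam=80, c_lidar=256, c_out=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_cam + c_lidar, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )
        self.c_lidar = c_lidar

    def forward(self, cam_bev, lidar_bev=None):
        # Simulate a LiDAR malfunction: the camera stream alone still
        # produces a valid fused BEV feature map.
        if lidar_bev is None:
            B, _, H, W = cam_bev.shape
            lidar_bev = cam_bev.new_zeros(B, self.c_lidar, H, W)
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))
```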
36. BEVFormer: Learning Bird’s-Eye-View Representation from
Multi-Camera Images via Spatiotemporal Transformers
• This work presents a framework termed BEVFormer, which learns unified BEV
representations with spatiotemporal transformers to support multiple
autonomous driving perception tasks.
• In a nutshell, BEVFormer exploits both spatial and temporal information by
interacting with spatial and temporal space through pre-defined grid-shaped BEV
queries.
• To aggregate spatial information, a spatial cross-attention is designed so that each BEV query extracts spatial features from its regions of interest across camera views (see the sketch after this list).
• For temporal information, a temporal self-attention is proposed to recurrently fuse historical BEV information.
• This approach achieves 56.9% on the NDS metric on the nuScenes test set, which is 9.0 points higher than the previous best methods and on par with the performance of LiDAR-based baselines.
• The code will be released at https://github.com/zhiqi-li/BEVFormer.
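A simplified PyTorch sketch of the spatial cross-attention step: each grid-shaped BEV query samples image features at the pixels where its 3D reference points project, then aggregates over the camera views that are actually hit. The real BEVFormer uses deformable attention with learned sampling offsets and attention weights and stacks this with temporal self-attention; the shapes, names, and plain averaging here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def spatial_cross_attention(bev_queries, img_feats, ref_points_2d, valid):
    """Each BEV query gathers image features at its projected reference points.

    bev_queries:   (N, C) flattened grid-shaped BEV queries
    img_feats:     (V, C, H, W) features from V camera views
    ref_points_2d: (V, N, 2) projected reference points, normalized to [-1, 1]
    valid:         (V, N) whether the projection falls inside view v
    """
    V, C, H, W = img_feats.shape
    N = bev_queries.shape[0]
    # Bilinearly sample one feature per (view, query) pair.
    grid = ref_points_2d.view(V, N, 1, 2)
    sampled = F.grid_sample(img_feats, grid, align_corners=False)
    sampled = sampled.squeeze(-1).permute(0, 2, 1)     # (V, N, C)
    # Average over the views that actually see each query's reference point.
    w = valid.float().unsqueeze(-1)
    agg = (sampled * w).sum(0) / w.sum(0).clamp(min=1)
    return bev_queries + agg                           # residual update
```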