Joint Detection & Segmentation
in BEV Representation
Yu Huang
Sunnyvale, California
Yu.huang07@gmail.com
Outline
• M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified
Bird’s-Eye View Representation
• BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-
Centric Autonomous Driving
• Learning Ego 3D Representation as Ray Tracing
• BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View
Representation
• Efficient and Robust 2D-to-BEV Representation Learning via Geometry-
guided Kernel Transformer
• BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework
• BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera
Images via Spatiotemporal Transformers
M2BEV: Multi-Camera Joint 3D Detection and Segmentation
with Unified Bird’s-Eye View Representation
• M2BEV, a unified framework that jointly performs 3D object detection and map
segmentation in the Bird’s Eye View (BEV) space with multi-camera image inputs.
• Unlike the majority of previous works, which process detection and segmentation
separately, M2BEV infers both tasks with a unified model and improves efficiency.
• M2BEV efficiently transforms multi-view 2D image features into the 3D BEV feature in
ego-car coordinates.
• Such BEV representation is important as it enables different tasks to share a single
encoder.
• This framework further contains four important designs that benefit both accuracy and
efficiency: (1) An efficient BEV encoder design that reduces the spatial dimension of a
voxel feature map. (2) A dynamic box assignment strategy that uses learning-to-match to
assign ground-truth 3D boxes to anchors. (3) A BEV centerness re-weighting that
assigns larger weights to more distant predictions (sketched below), and (4) Large-scale
2D detection pre-training and auxiliary supervision.
• M2BEV is memory efficient, allowing significantly higher resolution images as input, with
faster inference speed.
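Of the four designs above, the BEV centerness re-weighting is the easiest to make concrete. Below is a minimal sketch, assuming a square BEV grid centered on the ego vehicle; the exact normalization used in M2BEV may differ.

import torch

def bev_centerness_weight(bev_h: int, bev_w: int, max_range: float = 51.2) -> torch.Tensor:
    """Per-cell loss weight that grows with distance from the ego car."""
    ys, xs = torch.meshgrid(
        torch.linspace(-max_range, max_range, bev_h),
        torch.linspace(-max_range, max_range, bev_w),
        indexing="ij",
    )
    dist = torch.sqrt(xs ** 2 + ys ** 2)      # metric distance of each cell to the ego car
    # Weights lie in [1, 2]: ~1 near the ego car, up to 2 at the far corners,
    # so distant (pixel-poor, harder) predictions contribute more to the loss.
    return 1.0 + dist / dist.max()

# Usage: element-wise multiply with a per-cell detection/segmentation loss map.
weights = bev_centerness_weight(200, 200)     # e.g. a 200 x 200 BEV grid
# loss = (per_cell_loss * weights).mean()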
BEVerse: Unified Perception and Prediction in Birds-Eye-
View for Vision-Centric Autonomous Driving
• BEVerse, a unified framework for 3D perception and prediction based on multi-
camera systems.
• Unlike existing studies that focus on improving single-task approaches, BEVerse
produces spatio-temporal Birds-Eye-View (BEV) representations from multi-camera
videos and jointly reasons about multiple tasks for vision-centric autonomous driving.
• Specifically, BEVerse first performs shared feature extraction and lifting to generate
4D BEV representations from multi-timestamp and multi-view images.
• After ego-motion alignment, a spatio-temporal encoder is utilized for further
feature extraction in BEV.
• Finally, multiple task decoders are attached for joint reasoning and prediction.
• Within the decoders, a grid sampler is proposed to generate BEV features with
different ranges and granularities for different tasks (see the sketch below).
• Also, an iterative-flow method is designed for memory-efficient future prediction.
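A minimal sketch of such a grid sampler, assuming a shared base BEV feature map covering ±51.2 m that is bilinearly resampled to a task-specific range and resolution; the ranges and names below are illustrative, not BEVerse's exact configuration.

import torch
import torch.nn.functional as F

def grid_sample_bev(bev_feat: torch.Tensor, base_range: float,
                    task_range: float, out_size: int) -> torch.Tensor:
    """Resample a base BEV map (B, C, H, W) covering [-base_range, base_range] meters
    to a task-specific square window of task_range meters and out_size cells."""
    b = bev_feat.shape[0]
    # Normalized coordinates of the task window inside the base map, in [-1, 1].
    lin = torch.linspace(-task_range / base_range, task_range / base_range, out_size,
                         device=bev_feat.device)
    gy, gx = torch.meshgrid(lin, lin, indexing="ij")
    grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    return F.grid_sample(bev_feat, grid, mode="bilinear", align_corners=True)

# Example: two decoders draw differently-scoped BEV features from the same map.
base = torch.randn(1, 64, 256, 256)                # shared BEV features, +/-51.2 m
seg_feat = grid_sample_bev(base, 51.2, 50.0, 200)  # 100m x 100m at 0.5 m cells
det_feat = grid_sample_bev(base, 51.2, 30.0, 240)  # finer window for detection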
Learning Ego 3D Representation as Ray Tracing
• A self-driving perception model aims to extract 3D semantic representations from
multiple cameras collectively into the bird’s-eye-view (BEV) coordinate frame of
the ego car in order to ground the downstream planner.
• Existing perception methods often rely on error-prone depth estimation of the
whole scene or on learning sparse virtual 3D representations without the target
geometry structure, both of which remain limited in performance and/or
capability.
• This paper presents an end-to-end architecture for ego 3D representation learning
from an arbitrary number of unconstrained camera views.
• Inspired by the ray tracing principle, a polarized grid of “imaginary eyes” is designed
as the learnable ego 3D representation, and the learning process is formulated with
an adaptive attention mechanism in conjunction with 3D-to-2D projection (see the sketch below).
• Critically, this formulation allows extracting rich 3D representations from 2D
images without any depth supervision, with a built-in geometry structure
consistent w.r.t. BEV.
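A minimal sketch of the 3D-to-2D projection behind the "imaginary eyes" idea, assuming a polar ground-plane grid and a single pinhole camera with known intrinsics K and extrinsics (R, t); Ego3RT additionally refines the sampled features with adaptive (deformable) attention, which is omitted here.

import math
import torch
import torch.nn.functional as F

def polar_grid(num_rays: int = 128, num_rings: int = 32, max_range: float = 50.0):
    """Ground-plane points (z = 0) of a polarized BEV grid around the ego car."""
    az = torch.linspace(0, 2 * math.pi, num_rays + 1)[:-1]
    r = torch.linspace(1.0, max_range, num_rings)
    rr, aa = torch.meshgrid(r, az, indexing="ij")
    pts = torch.stack((rr * torch.cos(aa), rr * torch.sin(aa), torch.zeros_like(rr)), -1)
    return pts.reshape(-1, 3)                       # (num_rings * num_rays, 3)

def backtrack_image_features(feat, pts, K, R, t, feat_hw):
    """Project 3D grid points into one camera and bilinearly sample its 2D features.

    feat: (1, C, Hf, Wf) image feature map; K is assumed rescaled to that resolution.
    """
    cam = R @ pts.T + t[:, None]                    # ego -> camera coordinates
    uvw = K @ cam
    uv = uvw[:2] / uvw[2].clamp(min=1e-5)           # pixel coordinates on the feature map
    h, w = feat_hw
    grid = torch.stack((uv[0] / (w - 1) * 2 - 1,    # normalize to [-1, 1] for grid_sample
                        uv[1] / (h - 1) * 2 - 1), -1).view(1, 1, -1, 2)
    sampled = F.grid_sample(feat, grid, align_corners=True)   # (1, C, 1, N)
    valid = uvw[2] > 0                              # keep only points in front of the camera
    return sampled.squeeze(2).squeeze(0).T, valid   # (N, C) features, (N,) mask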
Learning Ego 3D Representation as Ray Tracing
(a) The first strategy, represented by LSS and CaDDN, is based on dense pixel-level depth estimation. (b) The
second strategy, represented by PON, bypasses depth estimation by learning an implicit 2D-3D projection. (c)
This strategy backtracks 2D information from “imaginary eyes” specially designed in the BEV geometry.
BEVFusion: Multi-Task Multi-Sensor Fusion
with Unified Bird’s-Eye View Representation
• Recent approaches are based on point-level fusion: augmenting the LiDAR point
cloud with camera features.
• However, the camera-to-LiDAR projection throws away the semantic density of
camera features, hindering the effectiveness of such methods, especially for
semantic-oriented tasks (such as 3D scene segmentation).
• This paper breaks this deeply-rooted convention with BEVFusion, an efficient and
generic multi-task multi-sensor fusion framework.
• It unifies multi-modal features in the shared bird’s-eye view (BEV) representation
space, which nicely preserves both geometric and semantic information.
• To achieve this, key efficiency bottlenecks in the view transformation are diagnosed
and lifted with an optimized BEV pooling (see the sketch below), reducing latency by more than 40×.
• BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D
perception tasks with almost no architectural changes.
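A minimal sketch of what BEV pooling computes in the camera branch: frustum features with known 3D locations are scatter-added into a flat BEV grid. The actual BEVFusion implementation precomputes and caches the grid association and uses a specialized interval-reduction GPU kernel; this dense index_add_ version only illustrates the operation, not the optimization.

import torch

def bev_pool(feats: torch.Tensor, coords: torch.Tensor,
             bev_h: int, bev_w: int, resolution: float, x_min: float, y_min: float):
    """Sum-pool camera frustum features into a BEV grid.

    feats:  (N, C) features of N frustum points (all cameras, all depth bins).
    coords: (N, 3) ego-frame xyz locations of those points, in meters.
    """
    ix = ((coords[:, 0] - x_min) / resolution).long()
    iy = ((coords[:, 1] - y_min) / resolution).long()
    keep = (ix >= 0) & (ix < bev_w) & (iy >= 0) & (iy < bev_h)
    ix, iy, feats = ix[keep], iy[keep], feats[keep]
    flat = torch.zeros(bev_h * bev_w, feats.shape[1], device=feats.device)
    flat.index_add_(0, iy * bev_w + ix, feats)           # scatter-add into grid cells
    return flat.view(bev_h, bev_w, -1).permute(2, 0, 1)  # (C, H, W) BEV feature map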
Efficient and Robust 2D-to-BEV Representation
Learning via Geometry-guided Kernel Transformer
• This work presents the Geometry-guided Kernel Transformer (GKT), a 2D-to-BEV
representation learning mechanism.
• GKT leverages the geometric priors to guide the transformer to focus on
discriminative regions, and unfolds kernel features to generate BEV
representation.
• For fast inference, a look-up table (LUT) indexing method is further introduced to
avoid relying on the cameras’ calibrated parameters at runtime (see the sketch below).
• GKT runs at 72.3 FPS on an RTX 3090 GPU and 45.6 FPS on an RTX 2080 Ti GPU, and is
robust to camera deviation and the predefined BEV height.
• GKT achieves state-of-the-art real-time segmentation results, i.e., 38.0
mIoU (100m×100m perception range at a 0.5m resolution) on the nuScenes val
set.
• Code and models will be available at https://github.com/hustvl/GKT.
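A minimal sketch of the LUT idea: since camera calibration is fixed for a given rig, the image location each BEV cell projects to (and hence the kernel window around it) can be precomputed once offline and stored as integer indices, so no projection math is needed per frame. The helper below is illustrative and assumes K has been rescaled to the feature-map resolution; it is not GKT's actual API.

import torch

@torch.no_grad()
def build_bev_to_image_lut(bev_centers, K, R, t, feat_hw, kernel=7):
    """Precompute, per BEV cell, the top-left corner of a kernel x kernel
    feature-map window around its projected location (-1 where off-image).

    bev_centers: (N, 3) BEV cell centers (at an assumed prior height) in the ego frame.
    """
    h, w = feat_hw
    cam = R @ bev_centers.T + t[:, None]                      # ego -> camera
    uvw = K @ cam
    uv = (uvw[:2] / uvw[2].clamp(min=1e-5)).round().long()    # (2, N) pixel indices
    u0 = uv[0] - kernel // 2                                  # window top-left corner
    v0 = uv[1] - kernel // 2
    valid = (uvw[2] > 0) & (uv[0] >= 0) & (uv[0] < w) & (uv[1] >= 0) & (uv[1] < h)
    lut = torch.stack((u0, v0), dim=1)                        # (N, 2)
    lut[~valid] = -1
    return lut

# At runtime the LUT replaces on-the-fly projection: kernel features are gathered
# at the stored offsets and attended to by the corresponding BEV queries.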
Efficient and Robust 2D-to-BEV Representation
Learning via Geometry-guided Kernel Transformer
(a) Geometry-based pointwise transformation leverages camera’s calibrated parameters (intrinsic and extrinsic) to
determine the correspondence (one to one or one to many) between 2D positions and BEV grids. (b) Geometry-free
global transformation considers the full correlation between image and BEV. Each BEV grid interacts with all image pixels.
BEVFusion: A Simple and Robust LiDAR-
Camera Fusion Framework
• Fusing the camera and LiDAR information has become a de-facto standard for 3D
object detection tasks.
• Current methods rely on point clouds from the LiDAR sensor as queries to
leverage features from the image space.
• However, this underlying assumption makes current fusion frameworks unable to
produce any prediction when the LiDAR malfunctions, whether the failure is minor
or major.
• BEVFusion is a simple fusion framework whose camera stream does not depend on
LiDAR input, thus addressing this downside of previous methods (see the sketch below).
• Under the robustness training settings that simulate various LiDAR malfunctions,
this framework surpasses the state-of-the-art methods by 15.7% to 28.9% mAP.
• The code is available at https://github.com/ADLab-AutoDrive/BEVFusion.
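A minimal sketch of the design principle, assuming both streams already produce BEV feature maps of the same spatial size: the camera branch never consumes LiDAR data, so when LiDAR fails its features can simply be dropped (zeroed here) and the head still receives a valid camera-only input. The module below is illustrative; the paper's fusion block is more elaborate (learned channel/spatial weighting).

import torch
import torch.nn as nn

class SimpleBEVFusion(nn.Module):
    """Fuse independently produced camera-BEV and LiDAR-BEV feature maps."""

    def __init__(self, cam_ch: int, lidar_ch: int, out_ch: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev=None):
        if lidar_bev is None:                       # LiDAR malfunction: camera-only input
            lidar_bev = cam_bev.new_zeros(
                cam_bev.shape[0], self.fuse[0].in_channels - cam_bev.shape[1],
                *cam_bev.shape[2:])
        return self.fuse(torch.cat((cam_bev, lidar_bev), dim=1))

# fusion = SimpleBEVFusion(cam_ch=80, lidar_ch=256, out_ch=256)
# out = fusion(cam_bev)        # still produces predictions with the LiDAR stream missing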
BEVFormer: Learning Bird’s-Eye-View Representation from
Multi-Camera Images via Spatiotemporal Transformers
• This work presents a framework termed BEVFormer, which learns unified BEV
representations with spatiotemporal transformers to support multiple
autonomous driving perception tasks.
• In a nutshell, BEVFormer exploits both spatial and temporal information by
interacting with spatial and temporal space through pre-defined grid-shaped BEV
queries.
• To aggregate spatial information, a spatial cross-attention is designed in which each
BEV query extracts spatial features from its regions of interest across camera views.
• For temporal information, a temporal self-attention is proposed to recurrently fuse
history BEV information (see the sketch below).
• This approach achieves 56.9% NDS on the nuScenes test set, which is 9.0 points
higher than the previous best results and on par with the performance of
LiDAR-based baselines.
• The code will be released at https://github.com/zhiqi-li/BEVFormer.
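A minimal sketch of the recurrent use of grid-shaped BEV queries, assuming the history BEV has already been aligned to the current ego pose; standard multi-head attention stands in for BEVFormer's deformable temporal self-attention, and the spatial cross-attention over camera features is omitted. A small grid is used here because dense attention scales quadratically; BEVFormer's deformable attention is what makes a 200x200 grid practical.

import torch
import torch.nn as nn

class TemporalBEVFusion(nn.Module):
    """Grid-shaped BEV queries that recurrently attend to the previous frame's BEV."""

    def __init__(self, bev_h=32, bev_w=32, dim=256, heads=8):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, prev_bev=None):
        q = self.bev_queries.unsqueeze(0)           # (1, H*W, C) grid-shaped queries
        if prev_bev is None:                        # first frame: no history to fuse
            prev_bev = q
        # Each BEV query attends to itself and to the ego-motion-aligned history BEV.
        kv = torch.cat((q, prev_bev), dim=1)
        bev, _ = self.attn(q, kv, kv)
        return bev                                  # then fed to spatial cross-attention

# model = TemporalBEVFusion()
# bev_t = model(prev_bev=bev_t_minus_1)             # history propagates frame by frame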
Thanks