BEV SEMANTIC
SEGMENTATION
Yu Huang
Sunnyvale, California
Yu.huang07@gmail.com
OUTLINE
• Learning to Look around Objects for Top-View
Representations of Outdoor Scenes
• Monocular Semantic Occupancy Grid Mapping
with Convolutional Variational Enc-Dec Networks
• Cross-view Semantic Segmentation for Sensing
Surroundings
• MonoLayout: Amodal scene layout from a single
image
• Predicting Semantic Map Representations from
Images using Pyramid Occupancy Networks
• A Sim2Real DL Approach for the Transformation
of Images from Multiple Vehicle-Mounted Cameras
to a Semantically Segmented Image in BEV
• FISHING Net: Future Inference of Semantic
Heatmaps In Grids
• BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry and Semantic Point Cloud
• Lift, Splat, Shoot: Encoding Images from Arbitrary
Camera Rigs by Implicitly Unprojecting to 3D
• Understanding Bird’s-Eye View Semantic HD-maps
Using an Onboard Monocular Camera
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
• Estimating an occlusion-reasoned semantic scene layout in the top-view.
• This challenging problem requires an accurate understanding not only of the 3D geometry and
semantics of the visible scene, but also of occluded areas.
• A convolutional neural network that learns to predict occluded portions of the scene layout by
looking around foreground objects like cars or pedestrians.
• But instead of hallucinating RGB values, directly predicting the semantics and depths in the
occluded areas enables a better transformation into the top-view.
• This initial top-view representation can be significantly enhanced by learning priors and rules about
typical road layouts from simulated or, if available, map data.
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
The inpainting CNN first encodes a masked image and the mask
itself. The extracted features are concatenated and two decoders
predict semantics and depth for visible and occluded pixels.
To train the inpainting CNN, ignore FG
objects as no GT is available (red) but
artificially add masks (green) over BG regions
where full annotation is already available.
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
The process of mapping the semantic segmentation with corresponding
depth first into a 3D point cloud and then into the bird's eye view. The red
and blue circles illustrate corresponding locations in all views.
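A minimal sketch of this mapping, assuming a pinhole camera (intrinsics fx, fy, cx, cy), metric depth, and one semantic label per pixel; the grid ranges and resolution are placeholders, not the paper's values:

```python
import numpy as np

def semantics_depth_to_bev(sem, depth, fx, fy, cx, cy,
                           x_range=(-20.0, 20.0), z_range=(0.0, 40.0), res=0.1):
    """Unproject per-pixel semantics + depth to 3D, then rasterize into a BEV grid.
    Camera convention assumed: x right, y down, z forward (meters)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    x = (u - cx) / fx * z                                  # lateral position
    # y = (v - cy) / fy * z                                # height, unused in the top view

    W_bev = int((x_range[1] - x_range[0]) / res)
    H_bev = int((z_range[1] - z_range[0]) / res)
    bev = np.zeros((H_bev, W_bev), dtype=sem.dtype)        # 0 = unknown

    col = ((x - x_range[0]) / res).astype(int)
    row = ((z_range[1] - z) / res).astype(int)             # far away = top row
    valid = (col >= 0) & (col < W_bev) & (row >= 0) & (row < H_bev)
    bev[row[valid], col[valid]] = sem[valid]               # last write wins per cell
    return bev
```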
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
(a) Simulated road shapes in the top-view. (b) The refinement-CNN is an encoder-decoder network
receiving three supervisory signals: self-reconstruction with the input, adversarial loss from simulated data,
and reconstruction loss with aligned OpenStreetMap (OSM) data. (c) The alignment CNN takes as input the
initial BEV map and a crop of OSM data (obtained via a noisy GPS position and yaw estimate). The CNN predicts a warp
for the OSM map and is trained to minimize the reconstruction loss with the initial BEV map.
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
(a) We use a composition of similarity transform (left, “box") and a non-parametric warp (right, “flow") to
align noisy OSM with image evidence. (b, top) Input image and the corresponding Binit. (b, bottom)
Resulting warping grid overlaid on the OSM map and the warping result for 4 different warping
functions, respectively: “box", ”flow", “box+flow", “box+flow (with regularization)". Note the importance of
composing the transformations and the induced regularization.
Learning To Look Around Objects For Top-view
Representations Of Outdoor Scenes
Examples of the BEV representation.
Monocular Semantic Occupancy Grid Mapping With
Convolutional Variational Encoder-decoder Networks
• This work is end-to-end learning of monocular semantic-metric occupancy grid mapping from
weak binocular ground truth.
• The network learns to predict four classes, as well as a camera to bird’s eye view mapping.
• At the core, it utilizes a variational encoder-decoder network that encodes the front-view visual
information of the driving scene and subsequently decodes it into a 2-D top-view Cartesian
coordinate system.
• The variational sampling with a relatively small embedding vector brings robustness against
vehicle dynamic perturbations and generalization to unseen KITTI data.
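A minimal PyTorch sketch of such a variational encoder-decoder, with a reparameterized bottleneck between the front-view encoder and the top-view decoder; layer widths, the embedding size, and the output grid size are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class VariationalEncDec(nn.Module):
    def __init__(self, n_classes=4, z_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(                      # front-view RGB -> feature vector
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc_mu, self.fc_logvar = nn.Linear(128, z_dim), nn.Linear(128, z_dim)
        self.decoder = nn.Sequential(                      # latent -> top-view semantic grid
            nn.Linear(z_dim, 128 * 8 * 8), nn.ReLU(), nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, n_classes, 4, stride=2, padding=1))

    def forward(self, img):
        h = self.encoder(img)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decoder(z), mu, logvar                        # logits, plus terms for the KL loss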
Monocular Semantic Occupancy Grid Mapping With
Convolutional Variational Encoder-decoder Networks
Illustration of the proposed variational encoder-decoder approach. From a single front-view
RGB image, our system can predict a 2-D top-view semantic-metric occupancy grid map.
Monocular Semantic Occupancy Grid Mapping With
Convolutional Variational Encoder-decoder Networks
Some visualized mapping examples on the test set with different methods.
Monocular Semantic Occupancy Grid Mapping With
Convolutional Variational Encoder-decoder Networks
Cross-view Semantic Segmentation For Sensing
Surroundings
• Cross-view Semantic Segmentation: a framework named View Parsing Network (VPN) to
address it.
• In the cross-view semantic segmentation task, the agent is trained to parse the first-view
observations into a top-down-view semantic map indicating the spatial location of all the
objects at pixel-level.
• The main issue with this task is the lack of real-world annotations for top-down-view data.
• To mitigate this, train the VPN in a 3D graphics environment and use domain adaptation to
transfer it to real-world data.
• Code and demo videos can be found at https://view-parsing-network.github.io.
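One simple way to realize the first-view to top-view transform is a learned relation over flattened spatial locations, applied per channel; the sketch below is a hedged simplification of such a view transformer module (the feature map sizes and the two-layer MLP are assumptions, not VPN's exact design):

```python
import torch
import torch.nn as nn

class ViewTransform(nn.Module):
    """Map first-view features (B, C, H, W) to top-view features (B, C, Ht, Wt)
    via a learned relation between flattened spatial locations."""
    def __init__(self, in_hw=(32, 32), out_hw=(32, 32)):
        super().__init__()
        self.out_hw = out_hw
        self.relation = nn.Sequential(
            nn.Linear(in_hw[0] * in_hw[1], out_hw[0] * out_hw[1]), nn.ReLU(),
            nn.Linear(out_hw[0] * out_hw[1], out_hw[0] * out_hw[1]))

    def forward(self, feat):
        B, C, H, W = feat.shape
        flat = feat.view(B, C, H * W)          # each channel is a vector over image locations
        top = self.relation(flat)              # learned first-view -> top-view relation
        return top.view(B, C, *self.out_hw)
```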
Cross-view Semantic Segmentation For Sensing
Surroundings
Framework of the View
Parsing Network for
cross-view semantic
segmentation. The
simulation part shows
the architecture and
training scheme of VPN,
while the real-world part
demonstrates the
domain adaptation
process for transferring
VPN to the real world.
Cross-view Semantic Segmentation For Sensing
Surroundings
Qualitative results of sim-to-real adaptation. The results of source prediction before and after domain
adaptation, drivable area prediction after adaptation, and the ground-truth drivable area map.
MonoLayout: Amodal Scene Layout From A Single
Image
• Given a single color image captured from a driving platform, to predict the bird’s eye view layout
of the road and other traffic participants.
• The estimated layout should reason beyond what is visible in the image, and compensate for the
loss of 3D information due to projection.
• Amodal scene layout estimation, involves hallucinating scene layout for even parts of the world
that are occluded in the image.
• MonoLayout, a deep neural network for real-time amodal scene layout estimation from a single image.
• Scene layout is represented as a multi-channel semantic occupancy grid, and adversarial
feature learning is leveraged to “hallucinate” plausible completions for occluded parts of the image.
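A hedged sketch of the resulting training signal: per-channel occupancy supervision on the multi-channel grid plus an adversarial term that pushes decoded layouts toward plausible ones. The loss weight and the `discriminator` module are illustrative assumptions, not MonoLayout's exact objective:

```python
import torch
import torch.nn.functional as F

def layout_losses(pred_layout, gt_layout, discriminator, lambda_adv=0.01):
    """pred_layout: (B, n_channels, H, W) occupancy logits; gt_layout: targets in [0, 1];
    discriminator: a module returning real/fake logits for a layout."""
    sup = F.binary_cross_entropy_with_logits(pred_layout, gt_layout)     # per-cell supervision
    # generator side of the adversarial game: make decoded layouts look "real"
    d_fake = discriminator(torch.sigmoid(pred_layout))
    adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    return sup + lambda_adv * adv
```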
MonoLayout: Amodal Scene Layout From A Single
Image
MonoLayout: Given only a single image of a road scene, a neural network architecture reasons about
the amodal scene layout in bird’s eye view in real time (30 fps). This approach, MonoLayout, can
hallucinate regions of the static scene (road, sidewalks) and traffic participants that do not even
project to the visible region of the image plane. Shown above are example images from the KITTI (left)
and Argoverse (right) datasets. MonoLayout outperforms prior art (by more than a 20% margin) on
hallucinating occluded regions.
MonoLayout: Amodal Scene Layout From A Single
Image
Architecture: MonoLayout takes in a color image of an urban driving scenario, and predicts an amodal
scene layout in bird’s eye view. The architecture comprises a context encoder, amodal layout decoders,
and two discriminators.
MonoLayout: Amodal Scene Layout From A Single
Image
MonoLayout: Amodal Scene Layout From A Single
Image
Static layout estimation: Observe how MonoLayout performs amodal completion of the static scene
(road shown in pink, sidewalk shown in gray). Mono Occupancy fails to reason beyond occluding
objects (top row), and does not hallucinate large missing patches (bottom row), while MonoLayout is
accurately able to do so. Furthermore, even in cases where there is no occlusion (row 2), MonoLayout
generates road layouts of much sharper quality. Row 3 shows extremely challenging scenarios where
most of the view is blocked by vehicles, and the scenes exhibit high dynamic range (HDR) and shadows.
MonoLayout: Amodal Scene Layout From A Single
Image
Dynamic layout estimation: vehicle occupancy estimation results on the KITTI 3D Object detection
benchmark. From left to right, the column corresponds to the input image, Mono Occupancy, Mono3D,
OFT, MonoLayout, and ground-truth respectively. While the other approaches miss out on detecting cars
(top row), or split a vehicle detection into two (second row), or stray detections off road (third row),
MonoLayout produces crisp object boundaries while respecting vehicle and road geometries.
Monolayout: Amodal Scene Layout From A Single
Image
Amodal scene layout estimation on the Argoverse
dataset. The dataset comprises multiple
challenging scenarios, with low illumination and a large
number of vehicles. MonoLayout is accurately able
to produce sharp estimates of vehicles and road
layouts. (Sidewalks are not predicted here, as they
aren’t annotated in Argoverse).
MonoLayout: Amodal Scene Layout From A Single
Image
Trajectory forecasting: MonoLayout-
forecast accurately estimates future
trajectories of moving vehicles. (Left): In
each figure, the magenta cuboid shows the
initial position of the vehicle. MonoLayout-
forecast is pre-conditioned for 1 second,
by observing the vehicle, at which point
(cyan cuboid) it starts forecasting future
trajectories (blue). The ground-truth
trajectory is shown in red for comparison.
(Right): Trajectories visualized in image
space. Notice how MonoLayout-forecast is
able to forecast trajectories accurately
despite the presence of moving obstacles
(top row), turns (middle row), and merging
traffic (bottom row).
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
• Vision-based elements: ground plane estimation, road segmentation and 3D object detection
• A simple, unified approach for estimating maps directly from monocular images using a single
end-to-end deep learning architecture
• For the maps themselves, adopt a semantic Bayesian occupancy grid framework, allowing
information to be accumulated trivially over multiple cameras and timesteps
• Code available at http://github.com/tom-roddick/mono-semantic-maps.
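A standard way to accumulate per-frame occupancy estimates in such a framework is a per-cell log-odds update; the sketch below assumes independent per-cell probabilities already expressed in a common BEV frame, and is a generic Bayesian fusion rather than necessarily the paper's exact update:

```python
import numpy as np

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def fuse_occupancy(prob_maps, prior=0.5):
    """Fuse per-camera / per-timestep occupancy probabilities (list of HxW arrays,
    all in the same BEV frame) with a Bayesian log-odds update."""
    L = logit(np.full_like(prob_maps[0], prior))
    for p in prob_maps:
        L += logit(p) - logit(np.full_like(p, prior))   # accumulate evidence per cell
    return 1.0 / (1.0 + np.exp(-L))                     # back to probability
```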
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Given a set of surround-view images,
predict a full 360° birds-eye-view
semantic map, which captures both
static elements like road and
sidewalk as well as dynamic actors
such as cars and pedestrians.
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Architecture diagram showing an overview. (1) A ResNet-50 backbone network extracts image features
at multiple resolutions. (2) A feature pyramid augments the high-resolution features with spatial context
from lower pyramid layers. (3) A stack of dense transformer layers maps the image-based features into
the birds-eye-view. (4) The top-down network processes the birds-eye-view features and predicts the
final semantic occupancy probabilities.
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
The dense transformer layer first condenses the image-based features along the vertical dimension
while retaining the horizontal dimension. It then predicts a set of features along the depth axis in a
polar coordinate system, which are then resampled to Cartesian coordinates.
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
• The dense transformer layer is inspired by the observation that, while the network needs a lot of
vertical context to map features to the BEV, in the horizontal direction the relationship between
BEV locations and image locations can be established using the camera geometry.
• In order to retain the maximum amount of spatial information, the vertical and channel dimensions
of the image feature map are collapsed to a bottleneck of size B, while the horizontal dimension W is preserved.
• A 1D convolution is then applied along the horizontal axis and the resulting feature map is reshaped
to give a tensor with a depth dimension.
• However, this feature map, which is still in image-space coordinates, actually corresponds to a
trapezoid in the orthographic BEV space due to perspective, so the final step is to resample it
into a Cartesian frame using the known camera focal length f and horizontal offset u0.
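A hedged PyTorch sketch of this layer: collapse the vertical and channel dimensions to a bottleneck, apply a 1D convolution along the horizontal axis to produce features over a set of depth bins (a polar feature map), then resample the polar map onto a Cartesian BEV grid. The bottleneck size, number of depth bins, and the precomputed sampling grid (built elsewhere from f and u0) are assumptions, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTransformer(nn.Module):
    def __init__(self, in_ch, img_h, bottleneck=128, depth_bins=64, out_ch=64):
        super().__init__()
        self.depth_bins, self.out_ch = depth_bins, out_ch
        # collapse the full vertical extent and the channels to a bottleneck of size `bottleneck`
        self.collapse = nn.Conv2d(in_ch, bottleneck, kernel_size=(img_h, 1))
        # 1D convolution along the horizontal axis, expanding to features for each depth bin
        self.expand = nn.Conv1d(bottleneck, out_ch * depth_bins, kernel_size=3, padding=1)

    def forward(self, feat, bev_grid):
        # feat: (B, C, img_h, W) image features
        # bev_grid: (Hbev, Wbev, 2) normalized (image-column, depth-bin) sampling locations,
        #           precomputed from the camera intrinsics (f, u0)
        B = feat.shape[0]
        x = self.collapse(feat).squeeze(2)                      # (B, bottleneck, W)
        x = self.expand(x)                                      # (B, out_ch * D, W)
        polar = x.view(B, self.out_ch, self.depth_bins, -1)     # (B, C', D, W): polar feature map
        grid = bev_grid.unsqueeze(0).expand(B, -1, -1, -1)      # (B, Hbev, Wbev, 2)
        return F.grid_sample(polar, grid, align_corners=False)  # Cartesian BEV features
```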
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
Predicting Semantic Map Representations From
Images Using Pyramid Occupancy Networks
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
• To obtain a corrected 360° BEV image given images from multiple vehicle-mounted cameras.
• The corrected BEV image is segmented into semantic classes and includes a prediction of
occluded areas.
• The neural network approach does not rely on manually labeled data, but is trained on a
synthetic dataset in such a way that it generalizes well to real-world data.
• By using semantically segmented images as input, reduce the reality gap between simulated and
real-world data and are able to show that the method can be successfully applied in the real
world.
• Source code and datasets are available at https://github.com/ika-rwth-aachen/Cam2BEV.
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
A homography can be applied to the four semantically segmented images from
vehicle-mounted cameras to transform them to BEV. This approach involves
learning to compute an accurate BEV image without visual distortions.
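For reference, the homography step amounts to a single perspective warp per camera; a minimal sketch using OpenCV, assuming the camera-to-BEV homography has been derived offline from the camera calibration and a flat-ground assumption:

```python
import cv2
import numpy as np

def ipm_warp(seg_image, H_cam_to_bev, bev_size=(512, 512)):
    """Warp one (H, W) or (H, W, C) segmentation image into BEV.
    Nearest-neighbor interpolation keeps class labels from being blended.
    bev_size is (width, height); the homography and size here are placeholders."""
    return cv2.warpPerspective(seg_image, H_cam_to_bev, bev_size,
                               flags=cv2.INTER_NEAREST)
```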
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
• For each vehicle camera, virtual rays are cast from its mount position to the edges of the
semantically segmented ground truth BEV image.
• The rays are only cast to edge pixels that lie within the specific camera’s field of view.
• All pixels along these rays are processed to determine their occlusion state according to the
following rules (a code sketch follows the list):
1. some semantic classes always block sight (e.g. building, truck);
2. some semantic classes never block sight (e.g. road);
3. cars block sight, except for taller objects behind them (e.g. truck, bus);
4. partially occluded objects remain completely visible;
5. objects are only labeled as occluded if they are occluded in all camera perspectives.
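A hedged sketch of this ray marching, simplified to the basic blocking behaviour (the taller-object exception of rule 3 and the multi-camera merge of rule 5 are omitted); class ids and the number of ray steps are placeholders:

```python
import numpy as np

BUILDING, TRUCK, CAR, OCCLUDED = 3, 4, 5, 255          # placeholder class ids
BLOCKS_SIGHT = {BUILDING, TRUCK, CAR}                  # classes that block sight (simplified)

def label_occlusion(bev_labels, cam_rc, edge_cells, n_steps=512):
    """bev_labels: (H, W) semantic ids; cam_rc: camera mount cell (row, col);
    edge_cells: iterable of (row, col) edge pixels inside this camera's field of view."""
    out = bev_labels.copy()
    for er, ec in edge_cells:
        rows = np.linspace(cam_rc[0], er, n_steps).astype(int)   # march along the ray
        cols = np.linspace(cam_rc[1], ec, n_steps).astype(int)
        blocked = False
        for r, c in zip(rows, cols):
            if blocked:
                out[r, c] = OCCLUDED                    # everything behind a blocker is hidden
            elif bev_labels[r, c] in BLOCKS_SIGHT:      # rules 1 and 3, simplified
                blocked = True
    return out
```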
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
The uNetXST architecture has
separate encoder paths for each
input image (green paths). As part of
the skip-connection on each scale
level (violet paths), feature maps are
projectively transformed (v-block),
concatenated with the other input
streams (||-block), convoluted, and
finally concatenated with upsampled
output of the decoder path. This
illustration shows a network with only
two pooling and two upsampling
layers; the actual trained network
contains four of each.
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
The v-block resembles a Spatial Transformer unit.
Input feature maps from preceding convolutional
layers (orange grid layers) are projectively
transformed by the homographies obtained through
IPM (Inverse Perspective Mapping). The transformation
differs between the input streams for the different
cameras. Spatial consistency is established, since the
transformed feature maps all capture the same field
of view as the ground truth BEV. The transformed
feature maps are then concatenated into a single
feature map (cf. ||-block).
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
A Sim2Real Deep Learning Approach For The Transformation Of Images From
Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
• End-to-end pipeline that performs semantic segmentation and short term prediction using a top
down representation.
• This approach consists of an ensemble of neural networks which take in sensor data from different
sensor modalities and transform them into a single common top-down semantic grid representation.
• This representation is favorable as it is agnostic to sensor-specific reference frames and captures both
the semantic and geometric information of the surrounding scene.
• Because the modalities share a single output representation, they can be easily aggregated to produce
a fused output.
• This work predicts short-term semantic grids but the framework can be extended to other tasks.
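Because every modality predicts class probabilities over the same top-down grid, the aggregation function can be as simple as an average followed by an argmax; a minimal sketch of mean fusion (any function that maps the stacked per-modality grids to a single grid fits the same interface):

```python
import numpy as np

def fuse_grids(grids):
    """grids: list of (T, n_classes, H, W) per-modality probability grids over the
    same top-down frame, e.g. one each for lidar, radar and camera."""
    fused = np.mean(np.stack(grids, axis=0), axis=0)   # (T, n_classes, H, W)
    return fused, fused.argmax(axis=1)                 # fused probabilities and class ids
```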
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
FISHING Net Architecture:
multiple neural networks, one for
each sensor modality (lidar, radar
and camera) take in a sequence
of input sensor data and output a
sequence of shared top-down
semantic grids representing 3
object classes (Vulnerable Road
Users (VRU), vehicles and
background). The sequences are
then fused using an aggregation
function to output a fused
sequence of semantic grids.
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
• The overall architecture consists of a neural network for each sensor modality.
• Across all modalities, the network architecture consists of an encoder-decoder network with
convolutional layers.
• It uses average pooling with a pooling size of (2,2) in the encoder and up-sampling in the
decoder.
• After the decoder, a single linear convolutional layer produces logits, and a softmax produces
the final output probabilities for each of the three classes at each of the output
timesteps.
• It uses a slightly different encoder and decoder scheme for the vision network compared to the
lidar and radar networks to account for the pixel space features.
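A minimal PyTorch sketch of the shared pattern described above: average pooling (2, 2) in the encoder, upsampling in the decoder, a final linear (1x1) convolution to logits, and a softmax over the three classes at each output timestep. Channel widths and the number of stages are illustrative:

```python
import torch
import torch.nn as nn

def enc_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                         nn.AvgPool2d(2))                       # (2, 2) average pooling

def dec_block(cin, cout):
    return nn.Sequential(nn.Upsample(scale_factor=2, mode='nearest'),
                         nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

class GridSegNet(nn.Module):
    def __init__(self, in_ch, n_classes=3, timesteps=5):
        super().__init__()
        self.encoder = nn.Sequential(enc_block(in_ch, 32), enc_block(32, 64))
        self.decoder = nn.Sequential(dec_block(64, 32), dec_block(32, 32))
        self.head = nn.Conv2d(32, n_classes * timesteps, 1)     # linear conv to logits
        self.n_classes, self.timesteps = n_classes, timesteps

    def forward(self, x):                                       # x: (B, in_ch, H, W)
        logits = self.head(self.decoder(self.encoder(x)))
        B, _, H, W = logits.shape
        logits = logits.view(B, self.timesteps, self.n_classes, H, W)
        return torch.softmax(logits, dim=2)                     # per-timestep class probabilities
```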
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
Vision architecture
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
Lidar and Radar Architecture
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
• The LiDAR features consist of: 1) Binary lidar occupancy (1 if any lidar point is present in a given
grid cell, 0 otherwise). 2) Lidar density (log-normalized density of all lidar points present in a
grid cell). 3) Max z (largest height value of the lidar points in a given grid cell). 4) Max z sliced
(largest z value for each grid cell over 5 linear slices, e.g. 0-0.5m, ..., 2.0-2.5m).
• The Radar features consist of: 1) Binary radar occupancy (1 if any radar point is present in a given
grid cell, 0 otherwise). 2) X, Y components of each radar return’s Doppler velocity, compensated for the
ego vehicle’s motion. 3) Radar cross section (RCS). 4) Signal-to-noise ratio (SNR). 5)
Ambiguous Doppler interval.
• The dimensions of the images match the output resolution of 192 by 320.
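A hedged sketch of rasterizing the first three lidar features (binary occupancy, log-normalized density, max z) onto a BEV grid; the sliced max-z and the radar channels follow the same pattern. Grid extents, resolution, and the density normalization constant are placeholders:

```python
import numpy as np

def lidar_bev_features(points, x_range=(-40, 40), y_range=(-40, 40), res=0.25):
    """points: (N, 3) array of x, y, z coordinates in the vehicle frame."""
    W = int((x_range[1] - x_range[0]) / res)
    H = int((y_range[1] - y_range[0]) / res)
    occ = np.zeros((H, W), np.float32)
    cnt = np.zeros((H, W), np.float32)
    maxz = np.full((H, W), -np.inf, np.float32)

    col = ((points[:, 0] - x_range[0]) / res).astype(int)
    row = ((points[:, 1] - y_range[0]) / res).astype(int)
    keep = (col >= 0) & (col < W) & (row >= 0) & (row < H)
    for r, c, z in zip(row[keep], col[keep], points[keep, 2]):
        occ[r, c] = 1.0                                  # binary occupancy
        cnt[r, c] += 1.0                                 # point count per cell
        maxz[r, c] = max(maxz[r, c], z)                  # largest height per cell

    density = np.log1p(cnt) / np.log(64.0)               # log-normalized density (constant is illustrative)
    maxz[np.isinf(maxz)] = 0.0                            # empty cells get 0 height
    return np.stack([occ, density, maxz], axis=0)         # (3, H, W) feature image
```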
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
FISHING Net: Future Inference Of Semantic
Heatmaps In Grids
Label
input for lidar radar and vision predictions for lidar radar and vision
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
• Bird’s eye semantic segmentation, a task that predicts pixel-wise semantic segmentation in BEV
from side RGB images.
• Two main challenges: the view transformation from side view to bird’s eye view, as well as
transfer learning to unseen domains.
• The two-stage perception pipeline explicitly predicts pixel depths and combines them with pixel
semantics in an efficient manner, allowing the model to leverage depth information to infer
objects’ spatial locations in the BEV.
• Transfer learning by abstracting high level geometric features and predicting an intermediate
representation that is common across different domains.
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
BEV-Seg
pipeline
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
• In the first stage, N RGB road scene images are captured by cameras at different angles and
individually pass through semantic segmentation network S and depth estimation network D.
• The resulting side semantic segmentations and depth maps are combined and projected into a
semantic point cloud.
• This point cloud is then projected downward into an incomplete bird’s-eye view, which is fed
into a parser network to predict the final bird’s-eye segmentation.
• The rest of this section provides details on the various components of the pipeline.
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
• For side-semantic segmentations, use HRNet, a state-of-the-art convolutional network for semantic
segmentation.
• For monocular depth estimation, implement SORD using the same HRNet as the backbone.
• For both tasks, train the same model on all four views.
• The resulting semantic point cloud is projected height-wise onto a 512x512 image.
• Train a separate HRNet model as the parser network for the final bird’s-eye segmentation.
• Transfer learning via modularity and abstraction: 1). Fine-tune the stage 1 models on the target
domain stage 1 data; 2). Apply the trained stage 2 model as-is to the projected point cloud in the
target domain.
BEV-Seg: Bird’s Eye View Semantic Segmentation
Using Geometry And Semantic Point Cloud
Table 1: Segmentation Result on
BEVSEG-Carla. Oracle models have
ground truth given for specified inputs.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
• End-to-end architecture that directly extracts a bird's-eye-view representation of a scene given
image data from an arbitrary number of cameras
• To “lift" each image individually into a frustum of features for each camera, then “splat" all
frustums into a rasterized bird's-eye view grid
• To learn how to represent images and how to fuse predictions from all cameras into a single
cohesive representation of the scene while being robust to calibration error
• Codes: https://nv-tlabs.github.io/lift-splat-shoot
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
Given multi-view camera data (left), it infers semantics directly in the bird's-eye-view (BEV) coordinate
frame (right). It shows vehicle segmentation (blue), drivable area (orange), and lane segmentation
(green). These BEV predictions are then projected back onto input images (dots on the left).
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
Traditionally, computer vision tasks such as semantic segmentation involve making predictions in
the same coordinate frame as the input image. In contrast, planning for self-driving generally
operates in the bird's-eye-view frame. The model directly makes predictions in a given bird's-eye-
view frame for end-to-end planning from multi-view images.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
It visualizes the “lift" step. For each pixel, it predicts a categorical distribution over depth (left) and a
context vector (top left). Features at each point along the ray are determined by their outer product (right).
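A minimal sketch of the lift operation, assuming the network already outputs per-pixel depth logits over D bins and a C-dimensional context vector:

```python
import torch

def lift(depth_logits, context):
    """depth_logits: (B, D, H, W); context: (B, C, H, W).
    Returns frustum features of shape (B, D, C, H, W)."""
    alpha = depth_logits.softmax(dim=1)                  # categorical distribution over depth
    # outer product over the depth and channel axes:
    return alpha.unsqueeze(2) * context.unsqueeze(1)     # (B, D, 1, H, W) * (B, 1, C, H, W)
```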
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
In the “lift" step, a frustum-shaped point cloud is generated for each individual image (center-left). The
extrinsics/intrinsics are then used to splat each frustum onto the BEV plane (center right). Finally, a BEV
CNN processes the BEV representation for BEV semantic segmentation or planning (right).
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
It visualizes the 1K trajectory templates that are “shot”
onto the cost map during training and testing. During
training, the cost of each template trajectory is
computed and interpreted as a 1K-dimensional
Boltzmann distribution over the templates. During
testing, choose the argmax of this distribution and act
according to the chosen template.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
Instead of the hard-margin loss proposed in NMP
(Neural Motion Planner), planning is framed as
classification over a set of K template trajectories.
To leverage the cost-volume nature of the planning
problem, the distribution over the K template trajectories
is enforced to take a Boltzmann form over each template’s
accumulated cost on the predicted cost map (sketched below).
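A minimal sketch of such a Boltzmann form, assuming each template is given as integer waypoints on the cost-map grid (the helper below is illustrative, not the released implementation):

```python
import torch

def template_distribution(cost_map, template_indices):
    """cost_map: (H, W) predicted BEV costs; template_indices: (K, T, 2) integer
    (row, col) waypoints of each of the K template trajectories."""
    rows, cols = template_indices[..., 0], template_indices[..., 1]
    costs = cost_map[rows, cols].sum(dim=1)          # (K,) accumulated cost per template
    return torch.softmax(-costs, dim=0)              # p(template k) proportional to exp(-cost_k)
```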
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
For a single timestamp, each of the cameras is removed in turn to visualize how the loss of that camera
affects the prediction of the network. The region covered by the missing camera becomes fuzzier in every
case. When the front camera is removed (top middle), the network extrapolates the lane and drivable area
in front of the ego vehicle, and extrapolates the body of a car for which only a corner can be seen in the
top-right camera.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
Qualitatively show how the model performs, given an entirely new camera rig at test time. Road
segmentation is in orange, lane segmentation is in green, and vehicle segmentation is in blue.
Lift, Splat, Shoot: Encoding Images From Arbitrary
Camera Rigs By Implicitly Unprojecting To 3D
The top 10 ranked trajectories out of the 1k templates. The model predicts bimodal distributions and
curves from observations from a single timestamp. The model does not have access to the speed of the
car so it is compelling that the model predicts low-speed trajectories near crosswalks and brake lights.
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
• Online estimation of semantic BEV HD-maps using video input from a single onboard camera
• Image-level understanding, BEV-level understanding, and aggregation of temporal information
Front-facing monocular camera
for Bird’s-eye View (BEV) HD-
map understanding
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
It relies on three pillars and can be split into modules that process backbone features: first, the
image-level branch, which is composed of two decoders, one processing the static HD-map and one the
dynamic obstacles; second, the BEV temporal aggregation module, which fuses the three pillars and
aggregates all the temporal and image-plane information in the BEV; and finally, the BEV decoder.
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
The temporal aggregation module combines information
from all frames and all branches into one BEV feature
map. Backbone features and image-level static
estimates are projected to BEV with the warping function
AB, and a max (M) is applied over the batch dimension. The
results are concatenated in the channel dimension. The
reference-frame backbone features (highlighted in
red) are used in the max function as well as in a skip
connection to the concatenation.
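A hedged sketch of this aggregation, with a 2D affine warp standing in for the warping function AB; the fusion is a max over the warped frames followed by concatenation with the reference-frame features:

```python
import torch
import torch.nn.functional as F

def aggregate_temporal(bev_feats, affines):
    """bev_feats: (T, C, H, W) per-frame BEV features, index -1 = reference frame;
    affines: (T, 2, 3) normalized affine warps into the reference frame
    (derived from ego-motion; a stand-in for the paper's warping function)."""
    T, C, H, W = bev_feats.shape
    grid = F.affine_grid(affines, size=(T, C, H, W), align_corners=False)
    warped = F.grid_sample(bev_feats, grid, align_corners=False)   # all frames in the reference frame
    fused = warped.max(dim=0, keepdim=True).values                 # max over time
    return torch.cat([fused, bev_feats[-1:]], dim=1)               # concat with reference features
```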
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
• The dataset also provides 3D bounding boxes of 23 object classes.
• In experiments, select six HD-map classes: drivable area, pedestrian crossings, walkways,
carpark area, road segment, and lane.
• For dynamic objects, select the classes: car, truck, bus, trailer, construction vehicle, pedestrian,
motorcycle, traffic cone and barrier.
• Even though a six-camera rig was used to capture data, only the front camera is used for training
and evaluation.
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
Understanding Bird’s-eye View Semantic HD-
Maps Using An Onboard Monocular Camera
BEV Semantic Segmentation

  • 3. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes • Estimating an occlusion-reasoned semantic scene layout in the top-view. • This challenging problem not only requires an accurate understanding of both the 3D geometry and the semantics of the visible scene, but also of occluded areas. • A convolutional neural network that learns to predict occluded portions of the scene layout by looking around foreground objects like cars or pedestrians. • But instead of hallucinating RGB values, directly predicting the semantics and depths in the occluded areas enables a better transformation into the top-view. • This initial top-view representation can be significantly enhanced by learning priors and rules about typical road layouts from simulated or, if available, map data.
  • 4. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes
  • 5. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes The inpainting CNN first encodes a masked image and the mask itself. The extracted features are concatenated and two decoders predict semantics and depth for visible and occluded pixels. To train the inpainting CNN, ignore FG objects as no GT is available (red) but artificially add masks (green) over BG regions where full annotation is already available.
  • 6. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes The process of mapping the semantic segmentation with corresponding depth first into a 3D point cloud and then into the bird's eye view. The red and blue circles illustrate corresponding locations in all views.
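To make the mapping above concrete, the following minimal NumPy sketch (not the authors' code) unprojects per-pixel semantics and depth into a 3D point cloud with a pinhole camera model and rasterizes the points into a top-view class grid. The intrinsics (fx, fy, cx, cy), grid extent, and cell size are illustrative assumptions.

import numpy as np

def semantics_depth_to_bev(sem, depth, fx, fy, cx, cy,
                           x_range=(-20, 20), z_range=(0, 40), cell=0.25, n_classes=10):
    """sem: (H, W) integer class labels in [0, n_classes); depth: (H, W) metric depth."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) / fx * depth          # right
    y = (v - cy) / fy * depth          # down (unused for the flat top view)
    z = depth                          # forward
    # keep points inside the BEV extent
    keep = (x >= x_range[0]) & (x < x_range[1]) & (z >= z_range[0]) & (z < z_range[1])
    xi = ((x[keep] - x_range[0]) / cell).astype(int)
    zi = ((z[keep] - z_range[0]) / cell).astype(int)
    labels = sem[keep].astype(int)
    nx = int((x_range[1] - x_range[0]) / cell)
    nz = int((z_range[1] - z_range[0]) / cell)
    bev = np.zeros((nz, nx, n_classes), dtype=np.int32)
    # accumulate class votes per cell; argmax gives the top-view label
    np.add.at(bev, (zi, xi, labels), 1)
    return bev.argmax(-1)              # empty cells default to class 0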
  • 7. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes (a) Simulated road shapes in the top-view. (b) The refinement-CNN is an encoder-decoder network receiving three supervisory signals: self-reconstruction with the input, adversarial loss from simulated data, and reconstruction loss with aligned OpenStreetMap (OSM) data. (c) The alignment CNN takes as input the initial BEV map and a crop of OSM data (via noisy GPS and yaw estimate given). The CNN predicts a warp for the OSM map and is trained to minimize the reconstruction loss with the initial BEV map.
  • 8. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes (a) We use a composition of similarity transform (left, “box") and a non-parametric warp (right, “flow") to align noisy OSM with image evidence. (b, top) Input image and the corresponding Binit. (b, bottom) Resulting warping grid overlaid on the OSM map and the warping result for 4 different warping functions, respectively: “box", ”flow", “box+flow", “box+flow (with regularization)". Note the importance of composing the transformations and the induced regularization.
  • 9. Learning To Look Around Objects For Top-view Representations Of Outdoor Scenes Examples of the BEV representation.
  • 10. Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder-decoder Networks • This work performs end-to-end learning of monocular semantic-metric occupancy grid mapping from weak binocular ground truth. • The network learns to predict four classes, as well as a camera-to-bird’s-eye-view mapping. • At the core, it utilizes a variational encoder-decoder network that encodes the front-view visual information of the driving scene and subsequently decodes it into a 2-D top-view Cartesian coordinate system. • The variational sampling with a relatively small embedding vector brings robustness against vehicle dynamic perturbations and generalizability to unseen KITTI data.
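As a rough illustration of the variational encoder-decoder idea above, a PyTorch version could look like the sketch below. The layer sizes, the 512-d latent, and the four output classes are assumptions for illustration, not the paper's exact architecture; training would combine a per-cell cross-entropy term with the usual KL divergence on (mu, logvar).

import torch
import torch.nn as nn

class VED(nn.Module):
    def __init__(self, latent_dim=512, n_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten())
        self.fc_mu = nn.Linear(128 * 16, latent_dim)
        self.fc_logvar = nn.Linear(128 * 16, latent_dim)
        self.fc_dec = nn.Linear(latent_dim, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, n_classes, 4, stride=2, padding=1))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        g = self.fc_dec(z).view(-1, 128, 8, 8)
        # logits of a 64x64 top-view grid, plus the terms needed for the KL loss
        return self.decoder(g), mu, logvar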
  • 11. Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder-decoder Networks Illustration of the proposed variational encoder-decoder approach. From a single front-view RGB image, our system can predict a 2-D top-view semantic-metric occupancy grid map.
  • 12. Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder-decoder Networks Some visualized mapping examples on the test set with different methods.
  • 13. Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder-decoder Networks
  • 14. Cross-view Semantic Segmentation For Sensing Surroundings • Cross-view semantic segmentation: a framework named View Parsing Network (VPN) is proposed to address it. • In the cross-view semantic segmentation task, the agent is trained to parse first-view observations into a top-down-view semantic map indicating the spatial location of all objects at pixel level. • The main issue of this task is the lack of real-world annotations for top-down-view data. • To mitigate this, train the VPN in a 3D graphics environment and utilize domain adaptation to transfer it to real-world data. • Code and demo videos can be found at https://view-parsing-network.github.io.
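A hedged sketch of the core cross-view idea, in the spirit of a view transformation module: a fully connected mapping over flattened spatial locations turns first-view feature maps into top-view feature maps. The spatial sizes, channel count, and two-layer form are illustrative assumptions, not the published module.

import torch
import torch.nn as nn

class ViewTransformer(nn.Module):
    def __init__(self, in_hw=(32, 32), out_hw=(32, 32), channels=128):
        super().__init__()
        self.out_hw = out_hw
        n_in, n_out = in_hw[0] * in_hw[1], out_hw[0] * out_hw[1]
        # the MLP mixes spatial positions; it is shared across channels
        self.relation = nn.Sequential(
            nn.Linear(n_in, n_out), nn.ReLU(), nn.Linear(n_out, n_out))

    def forward(self, feat):               # feat: (B, C, H, W) first-view features
        B, C, H, W = feat.shape
        flat = feat.view(B, C, H * W)      # flatten spatial dimensions
        top = self.relation(flat)          # (B, C, H'*W') top-view positions
        return top.view(B, C, *self.out_hw)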
  • 15. Cross-view Semantic Segmentation For Sensing Surroundings Framework of the View Parsing Network for cross-view semantic segmentation. The simulation part shows the architecture and training scheme of VPN, while the real-world part demonstrates the domain adaptation process for transferring VPN to the real world.
  • 16. Cross-view Semantic Segmentation For Sensing Surroundings Qualitative results of sim-to-real adaptation: source predictions before and after domain adaptation, drivable-area prediction after adaptation, and the ground-truth drivable-area map.
  • 17. MonoLayout: Amodal Scene Layout From A Single Image • Given a single color image captured from a driving platform, predict the bird’s eye view layout of the road and other traffic participants. • The estimated layout should reason beyond what is visible in the image and compensate for the loss of 3D information due to projection. • Amodal scene layout estimation involves hallucinating the scene layout even for parts of the world that are occluded in the image. • MonoLayout, a deep neural network for real-time amodal scene layout estimation from a single image. • The scene layout is represented as a multi-channel semantic occupancy grid, and adversarial feature learning is leveraged to “hallucinate” plausible completions for occluded image parts.
  • 18. MonoLayout: Amodal Scene Layout From A Single Image MonoLayout: Given only a single image of a road scene, a neural network architecture reasons about the amodal scene layout in bird’s eye view in real-time (30 fps). This approach, MonoLayout can hallucinate regions of the static scene (road, sidewalks)—and traffic participants—that do not even project to the visible regime of the image plane. Shown above are example images from the KITTI (left) and Argoverse (right) datasets. MonoLayout outperforms prior art (by more than a 20% margin) on hallucinating occluded regions.
  • 19. MonoLayout: Amodal Scene Layout From A Single Image Architecture: MonoLayout takes in a color image of an urban driving scenario, and predicts an amodal scene layout in bird’s eye view. The architecture comprises a context encoder, amodal layout decoders, and two discriminators.
  • 20. MonoLayout: Amodal Scene Layout From A Single Image
  • 21. MonoLayout: Amodal Scene Layout From A Single Image Static layout estimation: Observe how MonoLayout performs amodal completion of the static scene (road shown in pink, sidewalk shown in gray). Mono Occupancy fails to reason beyond occluding objects (top row), and does not hallucinate large missing patches (bottom row), while MonoLayout is accurately able to do so. Furthermore, even in cases where there is no occlusion (row 2), MonoLayout generates road layouts of much sharper quality. Row 3 shows extremely challenging scenarios where most of the view is blocked by vehicles, and the scenes exhibit high-dynamic range (HDR) and shadows.
  • 22. MonoLayout: Amodal Scene Layout From A Single Image Dynamic layout estimation: vehicle occupancy estimation results on the KITTI 3D Object detection benchmark. From left to right, the column corresponds to the input image, Mono Occupancy, Mono3D, OFT, MonoLayout, and ground-truth respectively. While the other approaches miss out on detecting cars (top row), or split a vehicle detection into two (second row), or stray detections off road (third row), MonoLayout produces crisp object boundaries while respecting vehicle and road geometries.
  • 23. MonoLayout: Amodal Scene Layout From A Single Image Amodal scene layout estimation on the Argoverse dataset. The dataset comprises multiple challenging scenarios, with low illumination and a large number of vehicles. MonoLayout is accurately able to produce sharp estimates of vehicles and road layouts. (Sidewalks are not predicted here, as they aren’t annotated in Argoverse.)
  • 24. MonoLayout: Amodal Scene Layout From A Single Image Trajectory forecasting: MonoLayout-forecast accurately estimates future trajectories of moving vehicles. (Left): In each figure, the magenta cuboid shows the initial position of the vehicle. MonoLayout-forecast is pre-conditioned for 1 second by observing the vehicle, at which point (cyan cuboid) it starts forecasting future trajectories (blue). The ground-truth trajectory is shown in red, for comparison. (Right): Trajectories visualized in image space. Notice how MonoLayout-forecast is able to forecast trajectories accurately despite the presence of moving obstacles (top row), turns (middle row), and merging traffic (bottom row).
  • 25. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks • Vision-based map elements: ground plane estimation, road segmentation and 3D object detection. • A simple, unified approach for estimating maps directly from monocular images using a single end-to-end deep learning architecture. • For the maps themselves, adopt a semantic Bayesian occupancy grid framework, which allows information to be trivially accumulated over multiple cameras and timesteps. • Code available at http://github.com/tom-roddick/mono-semantic-maps.
  • 26. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 27. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks Given a set of surround-view images, predict a full 360 birds-eye-view semantic map, which captures both static elements like road and sidewalk as well as dynamic actors such as cars and pedestrians.
  • 28. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks Architecture diagram showing an overview. (1) A ResNet-50 backbone network extracts image features at multiple resolutions. (2) A feature pyramid augments the high-resolution features with spatial context from lower pyramid layers. (3) A stack of dense transformer layers map the image-based features into the birds-eye-view. (4) The top down network processes the birds-eye-view features and predicts the final semantic occupancy probabilities.
  • 29. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks This dense transformer layer first condenses the image based features along the vertical dimension, whilst retaining the horizontal dimension. Then, predict a set of features along the depth axis in a polar coordinate system, which are then resampled to Cartesian coordinates.
  • 30. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks • The dense transformer layer is inspired by the observation that while the network needs a lot of vertical context to map features to the BEV, in the horizontal direction the relationship between BEV locations and image locations can be established using camera geometry. • In order to retain the maximum amount of spatial information, collapse the vertical and channel dimensions of the image feature map to a bottleneck of size B, but preserve the horizontal dimension W. • Then apply a 1D convolution along the horizontal axis and reshape the resulting feature map into a tensor with features along the depth axis. • However this feature map, which is still in image-space coordinates, actually corresponds to a trapezoid in the orthographic BEV space due to perspective, and so the final step is to resample it into a Cartesian frame using the known camera focal length f and horizontal offset u0.
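The description above can be sketched as follows (an illustration under assumed sizes, not the released implementation): the vertical and channel dimensions are collapsed to a bottleneck while width is preserved, a 1D convolution along the horizontal axis expands to features along the depth axis, and the resulting polar feature map is resampled into a Cartesian BEV grid with a precomputed sampling grid derived from the camera geometry.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTransformer(nn.Module):
    def __init__(self, in_ch, img_h, bottleneck=128, out_ch=64, depth_bins=50):
        super().__init__()
        self.depth_bins, self.out_ch = depth_bins, out_ch
        # collapse the vertical and channel dims (C*H) to a bottleneck B, keeping width W
        self.collapse = nn.Conv1d(in_ch * img_h, bottleneck, kernel_size=1)
        # 1D convolution along the horizontal axis, expanding to depth_bins * out_ch
        self.expand = nn.Conv1d(bottleneck, depth_bins * out_ch, kernel_size=3, padding=1)

    def forward(self, feat, polar_to_cart_grid):
        """feat: (B, C, H, W) image features; polar_to_cart_grid: (B, Hb, Wb, 2) in [-1, 1]."""
        B, C, H, W = feat.shape
        x = feat.view(B, C * H, W)                          # (B, C*H, W)
        x = F.relu(self.collapse(x))                        # (B, bottleneck, W)
        x = self.expand(x)                                  # (B, Z*C', W)
        polar = x.view(B, self.out_ch, self.depth_bins, W)  # polar BEV features (depth x width)
        # resample the polar trapezoid into a Cartesian grid using camera geometry
        return F.grid_sample(polar, polar_to_cart_grid, align_corners=False)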
  • 31. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 32. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 33. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 34. Predicting Semantic Map Representations From Images Using Pyramid Occupancy Networks
  • 35. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV • To obtain a corrected 360 BEV image given images from multiple vehicle-mounted cameras. • The corrected BEV image is segmented into semantic classes and includes a prediction of occluded areas. • The neural network approach does not rely on manually labeled data, but is trained on a synthetic dataset in such a way that it generalizes well to real-world data. • By using semantically segmented images as input, the reality gap between simulated and real-world data is reduced, and the method is shown to apply successfully in the real world. • Source code and datasets are available at https://github.com/ika-rwth-aachen/Cam2BEV.
  • 36. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV A homography can be applied to the four semantically segmented images from vehicle-mounted cameras to transform them to BEV. This approach involves learning to compute an accurate BEV image without visual distortions.
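For reference, the classical IPM step the slide refers to can be written in a few lines (a sketch, not the authors' code). The 3x3 homography H is a placeholder that in practice is derived from the camera's intrinsic and extrinsic calibration, and nearest-neighbour interpolation keeps the discrete class labels intact.

import cv2
import numpy as np

def ipm_to_bev(seg_image: np.ndarray, H: np.ndarray, bev_size=(512, 512)) -> np.ndarray:
    # warp a semantically segmented camera image to the BEV plane with a homography
    return cv2.warpPerspective(seg_image, H, bev_size, flags=cv2.INTER_NEAREST)

# Example with an identity placeholder homography (for illustration only):
# bev = ipm_to_bev(seg, np.eye(3))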
  • 37. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV • For each vehicle camera, virtual rays are cast from its mount position to the edges of the semantically segmented ground truth BEV image. • The rays are only cast to edge pixels that lie within the specific camera’s field of view. • All pixels along these rays are processed to determine their occlusion state according to the following rules: 1. some semantic classes always block sight (e.g. building, truck); 2. some semantic classes never block sight (e.g. road); 3. cars block sight, except on taller objects behind them (e.g. truck, bus); 4. partially occluded objects remain completely visible; 5. objects are only labeled as occluded if they are occluded in all camera perspectives.
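A hedged sketch of the ray-casting procedure above, simplified to a single camera and only the basic blocking rule; the class ids and grid conventions are assumptions, and the handling of taller objects behind cars and of multi-camera fusion is omitted for brevity.

import numpy as np

ALWAYS_BLOCK = {3, 4}   # e.g. building, truck (illustrative class ids)

def cast_ray_occlusion(sem_bev, start, end):
    """Mark cells behind the first blocking cell along the ray start->end as occluded.
    sem_bev: (H, W) ground-truth BEV labels; start, end: (row, col) grid coordinates."""
    occluded = np.zeros(sem_bev.shape, dtype=bool)
    n = int(max(abs(end[0] - start[0]), abs(end[1] - start[1]))) + 1
    rows = np.linspace(start[0], end[0], n).round().astype(int)
    cols = np.linspace(start[1], end[1], n).round().astype(int)
    blocked = False
    for r, c in zip(rows, cols):
        if blocked:
            occluded[r, c] = True
        elif sem_bev[r, c] in ALWAYS_BLOCK:
            blocked = True   # cells beyond this one are no longer visible from this camera
    return occluded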
  • 38. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV The uNetXST architecture has separate encoder paths for each input image (green paths). As part of the skip-connection on each scale level (violet paths), feature maps are projectively transformed (v-block), concatenated with the other input streams (||-block), convolved, and finally concatenated with the upsampled output of the decoder path. This illustration shows a network with only two pooling and two upsampling layers; the actual trained network contains four of each.
  • 39. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV The v-block resembles a Spatial Transformer unit. Input feature maps from preceding convolutional layers (orange grid layers) are projectively transformed by the homographies obtained through IPM (Inverse Perspective Mapping). The transformation differs between the input streams for the different cameras. Spatial consistency is established, since the transformed feature maps all capture the same field of view as the ground truth BEV. The transformed feature maps are then concatenated into a single feature map (cf. ||-block).
  • 40. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
  • 41. A Sim2Real Deep Learning Approach For The Transformation Of Images From Multiple Vehicle-mounted Cameras To A Semantically Segmented Image In BEV
  • 42. FISHING Net: Future Inference Of Semantic Heatmaps In Grids • End-to-end pipeline that performs semantic segmentation and short-term prediction using a top-down representation. • This approach consists of an ensemble of neural networks which take in sensor data from different sensor modalities and transform them into a single common top-down semantic grid representation. • This representation is favorable as it is agnostic to sensor-specific reference frames and captures both the semantic and geometric information for the surrounding scene. • Because the modalities share a single output representation, they can be easily aggregated to produce a fused output. • This work predicts short-term semantic grids but the framework can be extended to other tasks.
  • 43. FISHING Net: Future Inference Of Semantic Heatmaps In Grids FISHING Net Architecture: multiple neural networks, one for each sensor modality (lidar, radar and camera) take in a sequence of input sensor data and output a sequence of shared top-down semantic grids representing 3 object classes (Vulnerable Road Users (VRU), vehicles and background). The sequences are then fused using an aggregation function to output a fused sequence of semantic grids.
  • 44. FISHING Net: Future Inference Of Semantic Heatmaps In Grids • The overall architecture consists of a neural network for each sensor modality. • Across all modalities, the network architecture consists of an encoder-decoder network with convolutional layers. • It uses average pooling with a pooling size of (2,2) in the encoder and up-sampling in the decoder. • After the decoder, a single linear convolutional layer produces logits, and a softmax produces the final output probabilities for each of the three classes at each of the output timesteps. • It uses a slightly different encoder and decoder scheme for the vision network compared to the lidar and radar networks to account for the pixel-space features.
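Because all modality networks share the same output representation, fusion reduces to a simple aggregation over per-class probability grids. A minimal sketch with plain averaging is shown below; other aggregation functions can be substituted without changing the interface.

import numpy as np

def fuse_semantic_grids(grids):
    """grids: list of arrays (T, H, W, n_classes) of per-modality class probabilities."""
    fused = np.mean(np.stack(grids, axis=0), axis=0)   # average across modalities
    return fused.argmax(axis=-1)                       # (T, H, W) fused class labels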
  • 45. FISHING Net: Future Inference Of Semantic Heatmaps In Grids Vision architecture
  • 46. FISHING Net: Future Inference Of Semantic Heatmaps In Grids Lidar and Radar Architecture
  • 47. FISHING Net: Future Inference Of Semantic Heatmaps In Grids • The LiDAR features consist of: 1) Binary lidar occupancy (1 if any lidar point is present in a given grid cell, 0 otherwise). 2) Lidar density (log-normalized density of all lidar points present in a grid cell). 3) Max z (largest height value for lidar points in a given grid cell). 4) Max z sliced (largest z value for each grid cell over 5 linear slices, e.g. 0-0.5m, ..., 2.0-2.5m). • The Radar features consist of: 1) Binary radar occupancy (1 if any radar point is present in a given grid cell, 0 otherwise). 2) X, Y values for each radar return’s Doppler velocity compensated with the ego vehicle’s motion. 3) Radar cross section (RCS). 4) Signal-to-noise ratio (SNR). 5) Ambiguous Doppler interval. • The dimensions of the images match the output resolution of 192 by 320.
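A hedged NumPy sketch of rasterizing a lidar point cloud into a few of the per-cell features listed above (binary occupancy, log-normalized density, max height). The grid extent, cell size, and normalization constant are assumptions, chosen here so the grid matches the 192 by 320 resolution mentioned above; the sliced max-z and the radar features follow the same scatter pattern.

import numpy as np

def lidar_bev_features(points, x_range=(-38.4, 38.4), y_range=(-64.0, 64.0), cell=0.4):
    """points: (N, 3) array of x, y, z in the vehicle frame."""
    nx = int((x_range[1] - x_range[0]) / cell)   # 192
    ny = int((y_range[1] - y_range[0]) / cell)   # 320
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[keep]
    xi = ((pts[:, 0] - x_range[0]) / cell).astype(int)
    yi = ((pts[:, 1] - y_range[0]) / cell).astype(int)
    count = np.zeros((nx, ny))
    max_z = np.full((nx, ny), -np.inf)
    np.add.at(count, (xi, yi), 1)                # points per cell
    np.maximum.at(max_z, (xi, yi), pts[:, 2])    # tallest return per cell
    occupancy = (count > 0).astype(np.float32)
    density = np.log1p(count) / np.log(64.0)     # log-normalized density (normalizer is arbitrary here)
    max_z = np.where(count > 0, max_z, 0.0)
    return np.stack([occupancy, density, max_z], axis=0)   # (3, 192, 320) feature planes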
  • 48. FISHING Net: Future Inference Of Semantic Heatmaps In Grids
  • 49. FISHING Net: Future Inference Of Semantic Heatmaps In Grids Labels; inputs for lidar, radar and vision; predictions for lidar, radar and vision.
  • 50. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud • Bird’s eye semantic segmentation, a task that predicts pixel-wise semantic segmentation in BEV from side RGB images. • Two main challenges: the view transformation from side view to bird’s eye view, as well as transfer learning to unseen domains. • The 2-staged perception pipeline explicitly predicts pixel depths and combines them with pixel semantics in an efficient manner, allowing the model to leverage depth information to infer objects’ spatial locations in the BEV. • Transfer learning by abstracting high level geometric features and predicting an intermediate representation that is common across different domains.
  • 51. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud BEV-Seg pipeline
  • 52. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud • In the first stage, N RGB road scene images are captured by cameras at different angles and individually pass through semantic segmentation network S and depth estimation network D. • The resulting side semantic segmentations and depth maps are combined and projected into a semantic point cloud. • This point cloud is then projected downward into an incomplete bird’s-eye view, which is fed into a parser network to predict the final bird’s-eye segmentation. • The rest of this section provides details on the various components of the pipeline.
  • 53. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud • For side-semantic segmentations, use HRNet, a state-of-the-art convolutional network for semantic segmentation. • For monocular depth estimation, implement SORD using the same HRNet as the backbone. • For both tasks, train the same model on all four views. • The resulting semantic point cloud is projected height-wise onto a 512x512 image. • Train a separate HRNet model as the parser network for the final bird’s-eye segmentation. • Transfer learning via modularity and abstraction: 1). Fine-tune the stage 1 models on the target domain stage 1 data; 2). Apply the trained stage 2 model as-is to the projected point cloud in the target domain.
  • 54. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry And Semantic Point Cloud Table 1: Segmentation Result on BEVSEG-Carla. Oracle models have ground truth given for specified inputs.
  • 55. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D • End-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras • To “lift” each image individually into a frustum of features for each camera, then “splat” all frustums into a rasterized bird's-eye view grid • To learn how to represent images and how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error • Code: https://nv-tlabs.github.io/lift-splat-shoot
  • 56. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D Given multi-view camera data (left), it infers semantics directly in the bird's-eye-view (BEV) coordinate frame (right). It shows vehicle segmentation (blue), drivable area (orange), and lane segmentation (green). These BEV predictions are then projected back onto input images (dots on the left).
  • 57. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D Traditionally, computer vision tasks such as semantic segmentation involve making predictions in the same coordinate frame as the input image. In contrast, planning for self-driving generally operates in the bird's-eye-view frame. The model directly makes predictions in a given bird's-eye-view frame for end-to-end planning from multi-view images.
  • 58. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D It visualizes the “lift” step. For each pixel, it predicts a categorical distribution over depth (left) and a context vector (top left). Features at each point along the ray are determined by their outer product (right).
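The outer-product construction can be written directly; shapes are illustrative, and this is a sketch of the operation described on the slide, not the released code.

import torch

def lift_outer_product(depth_logits, context):
    """depth_logits: (B, D, H, W); context: (B, C, H, W) -> (B, D, C, H, W) frustum features."""
    depth_prob = depth_logits.softmax(dim=1)                 # categorical depth distribution
    return depth_prob.unsqueeze(2) * context.unsqueeze(1)    # outer product along D and C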
  • 59. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D In the “lift” step, a frustum-shaped point cloud is generated for each individual image (center-left). The extrinsics/intrinsics are then used to splat each frustum onto the BEV plane (center right). Finally, a BEV CNN processes the BEV representation for BEV semantic segmentation or planning (right).
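A correspondingly minimal sketch of the splat step: each frustum feature is assigned a flattened BEV cell index from the camera geometry (computed elsewhere) and sum-pooled into the grid. More efficient pooling schemes are possible but not shown here.

import torch

def splat_to_bev(features, bev_index, n_cells, channels):
    """features: (N, C) frustum point features; bev_index: (N,) long tensor of flattened cell indices."""
    bev = torch.zeros(n_cells, channels, dtype=features.dtype)
    bev.index_add_(0, bev_index, features)   # sum-pool all points falling into the same cell
    return bev                               # reshape afterwards to (C, H_bev, W_bev)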
  • 60. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D It visualizes the 1K trajectory templates that are “shot” onto the cost map during training and testing. During training, the cost of each template trajectory is computed and interpreted as a 1K-dimensional Boltzmann distribution over the templates. During testing, choose the argmax of this distribution and act according to the chosen template.
  • 61. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D Instead of the hard-margin loss proposed in NMP (Neural Motion Planner), planning is framed as classification over a set of K template trajectories. To leverage the cost-volume nature of the planning problem, enforce the distribution over K template trajectories to take the following form
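The formula itself did not survive extraction; a Boltzmann distribution over template costs consistent with the surrounding description would read (with c_k(o) the predicted cost of template \tau_k under observation o):

p(\tau_k \mid o) = \frac{\exp\big(-c_k(o)\big)}{\sum_{j=1}^{K} \exp\big(-c_j(o)\big)}

so that training reduces to a K-way classification (cross-entropy) over the template set.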
  • 62. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D For a single time stamp, each of the cameras is removed in turn to visualize how the loss of that camera affects the prediction of the network. The region covered by the missing camera becomes fuzzier in every case. When the front camera is removed (top middle), the network extrapolates the lane and drivable area in front of the ego vehicle and extrapolates the body of a car for which only a corner can be seen in the top-right camera.
  • 63. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D Qualitatively show how the model performs, given an entirely new camera rig at test time. Road segmentation is in orange, lane segmentation is in green, and vehicle segmentation is in blue.
  • 64. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D The top 10 ranked trajectories out of the 1k templates. The model predicts bimodal distributions and curves from observations from a single timestamp. The model does not have access to the speed of the car so it is compelling that the model predicts low-speed trajectories near crosswalks and brake lights.
  • 65. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera • Online estimation of semantic BEV HD-maps using video input from a single onboard camera. • Image-level understanding, BEV-level understanding, and aggregation of temporal information. Figure: front-facing monocular camera for bird’s-eye view (BEV) HD-map understanding.
  • 66. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera It relies on three pillars and can be split into modules that process backbone features: first, the image-level branch, composed of two decoders, one processing the static HD-map and one the dynamic obstacles; second, the BEV temporal aggregation module that fuses the three pillars and aggregates all the temporal and image-plane information in the BEV; and finally the BEV decoder.
  • 67. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera The temporal aggregation module combines information from all frames and all branches into one BEV feature map. Backbone features and image-level static estimates are projected to BEV with the warping function AB, and a max (M) is applied in the batch dimension. The results are concatenated in the channel dimension. The reference-frame backbone features (highlighted in red) are used in the max function as well as in a skip connection to the concatenation.
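A hedged PyTorch sketch of this aggregation pattern: past BEV feature maps are warped into the reference frame, reduced with an element-wise max, and concatenated with the reference-frame features. The 2D affine parameterization of the warp and the feature shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def aggregate_temporal_bev(bev_feats, affines):
    """bev_feats: (T, C, H, W) per-frame BEV features (index -1 = reference frame);
       affines:   (T, 2, 3) 2D affine transforms mapping each frame into the reference frame."""
    T, C, H, W = bev_feats.shape
    grid = F.affine_grid(affines, (T, C, H, W), align_corners=False)
    warped = F.grid_sample(bev_feats, grid, align_corners=False)   # all frames in reference coordinates
    fused = warped.max(dim=0).values                               # element-wise max over time
    return torch.cat([fused, bev_feats[-1]], dim=0)                # (2C, H, W): fused + reference features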
  • 68. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera • The dataset also provides 3D bounding boxes of 23 object classes. • In experiments, select six HD-map classes: drivable area, pedestrian crossings, walkways, carpark area, road segment, and lane. • For dynamic objects, select the classes: car, truck, bus, trailer, construction vehicle, pedestrian, motorcycle, traffic cone and barrier. • Even though a six-camera rig was used to capture data, only use the front camera for training and evaluation.
  • 69. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera
  • 70. Understanding Bird’s-eye View Semantic HD-Maps Using An Onboard Monocular Camera