3D Perception for Autonomous Driving - Datasets and Algorithms -
Kazuyuki Miyazawa
AI R&D Group 2, AI System Dept.
Mobility Technologies Co., Ltd.
Who am I?
@kzykmyzw
Kazuyuki Miyazawa
Group Leader
AI R&D Group 2
AI System Dept.
Mobility Technologies Co., Ltd.
Past Work Experience
April 2019 - March 2020
AI Research Engineer@DeNA Co., Ltd.
April 2010 - March 2019
Research Scientist@Mitsubishi Electric Corp.
Education
PhD in Information Science@Tohoku University
3D Object Detection: Motivation
■ 2D bounding boxes alone are not sufficient
■ They lack 3D pose, occlusion information, and 3D location
Preliminary (Today’s Main Topic)
[Figure: 2D object detection vs. 3D object detection (http://www.cs.toronto.edu/~byang/)]
KITTI [2012]
Sensor Setup
● GPS/IMU x 1
● LiDAR (64ch) x 1
● Grayscale Camera (1.4M) x 2
● Color Camera (1.4M) x 2
http://www.cvlibs.net/datasets/kitti/
3D Object Detection
● 7,481 training images / point clouds
● 7,518 test images / point clouds
● 80,256 labeled objects
Annotations
● type: Car, Van, Truck, Pedestrian, Person_sitting, Cyclist, Tram, Misc, or DontCare
● truncated: 0 to 1, where truncated refers to the object leaving the image boundaries
● occluded: 0 = fully visible, 1 = partly occluded, 2 = largely occluded, 3 = unknown
● alpha: observation angle of the object, ranging [-pi..pi]
● bbox: 2D bounding box of the object in the image
● dimensions: 3D object dimensions: height, width, length
● location: 3D object location x, y, z in camera coordinates
● rotation_y: rotation ry around the Y-axis in camera coordinates [-pi..pi]
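As a concrete illustration, here is a minimal sketch that parses one line of a KITTI label file into these fields; the value order follows the official devkit readme, and error handling is omitted.

```python
from dataclasses import dataclass

@dataclass
class KittiLabel:
    type: str            # Car, Van, Truck, Pedestrian, ...
    truncated: float     # 0 to 1
    occluded: int        # 0 = fully visible ... 3 = unknown
    alpha: float         # observation angle [-pi..pi]
    bbox: tuple          # (left, top, right, bottom) in pixels
    dimensions: tuple    # (height, width, length) in meters
    location: tuple      # (x, y, z) in camera coordinates
    rotation_y: float    # rotation around the camera Y-axis [-pi..pi]

def parse_kitti_label(line: str) -> KittiLabel:
    """Parse one line of a KITTI label_2 file (15 whitespace-separated values)."""
    v = line.split()
    return KittiLabel(
        type=v[0],
        truncated=float(v[1]),
        occluded=int(float(v[2])),
        alpha=float(v[3]),
        bbox=tuple(map(float, v[4:8])),
        dimensions=tuple(map(float, v[8:11])),
        location=tuple(map(float, v[11:14])),
        rotation_y=float(v[14]),
    )
```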
Variants of KITTI
SemanticKITTI Dataset provides
annotations that associate each LiDAR
point with one of 28 semantic classes in all
22 sequences of the KITTI Dataset
http://semantic-kitti.org/
Virtual KITTI contains 50 high-resolution
monocular videos (21,260 frames)
generated from five different virtual worlds
in urban settings under different imaging
and weather conditions
https://europe.naverlabs.com/research/computer-vision/proxy-virtual-worlds/
ApolloScape [2017]
Sensor Setup
● GPS/IMU x 1
● LiDAR x 2
● Color Camera (9.2M) x 2
http://apolloscape.auto/
3D Object Detection
● 53 min training sequences
● 50 min testing sequences
● 70K 3D fitted cars
Annotations
● type: Small vehicle, Big vehicle, Pedestrian, Motorcyclist and Bicyclist, Traffic cones, Others
● dimensions: 3D object dimensions: height, width, length
● location: 3D object location x, y, z in relative coordinates
● heading: heading angle in radians with respect to the direction of the object
License
■ To the extent that we authorize the Developer to use Datasets and subject to the terms of this Agreement, the Developer is entitled to use the Datasets only (i) for Developer’s internal purposes of non-commercial research or teaching and (ii) in accordance with the terms of this Agreement.
http://apolloscape.auto/license.html
nuScenes [2019]
Sensor Setup
● GPS/IMU x 1
● LiDAR (32ch) x 1
● RADAR x 5
● Color Camera (1.4M) x 6
https://www.nuscenes.org/
Semantic Map
● Provide highly accurate human-annotated
semantic maps of the relevant areas
● 11 semantic classes
● Encourage the use of localization and
semantic maps as strong priors for all tasks
3D Object Detection
Database schema:
● category
● attribute
● visibility
● instance
● sensor
● calibrated_sensor
● ego_pose
● log
● scene
● sample
● sample_data
● sample_annotation
● map
[Figures: number of annotations per category; attribute distribution for selected categories. 1.4M boxes in total]
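For orientation, here is a minimal sketch of walking this schema with the official nuScenes devkit (nuscenes-devkit); the dataroot path and the v1.0-mini split are placeholders for whatever you have downloaded.

```python
from nuscenes.nuscenes import NuScenes

# dataroot is a placeholder; v1.0-mini is the small teaser split
nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=True)

sample = nusc.sample[0]                                        # one annotated keyframe
lidar = nusc.get('sample_data', sample['data']['LIDAR_TOP'])   # the LiDAR sweep of that keyframe
for ann_token in sample['anns']:                               # its 3D box annotations
    ann = nusc.get('sample_annotation', ann_token)
    print(ann['category_name'], ann['translation'], ann['size'])
```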
Argoverse [2019]
Sensor Setup
● GPS x 1
● LiDAR (32ch) x 2
● Color Camera (4.8M) x 2
● Color Camera (2M) x 7
https://www.argoverse.org/
3D Object Detection (3D Tracking)
● Collection of 113 log segments with
3D object tracking annotations
● These log segments vary in length
from 15 to 30 seconds and contain
a total of 11,052 tracks
● Each sequence includes
annotations for all objects within 5
meters of “drivable area” — the
area in which it is possible for a
vehicle to drive
Audi Autonomous Driving Dataset (A2D2) [2020]
Sensor Setup
● GPS/IMU x 1
● LiDAR (16ch) x 5
● Color Camera (2.3M) x 6
https://www.a2d2.audi/a2d2/en.html
3D Object Detection
● All images have corresponding
LiDAR point clouds, of which
12,497 are annotated with 3D
bounding boxes within the field
of view of the front-center
camera
Comparison
[Comparison table of the datasets, with six cells hidden as "?"]
These figures are based on Table 1 in https://arxiv.org/abs/1912.04838
Comparison
[The same comparison table revealed: all six hidden cells read "Waymo"]
These figures are based on Table 1 in https://arxiv.org/abs/1912.04838
Waymo Open Dataset [2019]
Sensor Setup
● Mid-Range (~75m) LiDAR x 1
● Short-Range (~20m) LiDAR x 4
● Color Camera (2M) x 3
● Color Camera (1.6M) x 2
https://waymo.com/open/
Data Volume
● Train: 798 segments w/ labels (757 GB)
● Validation: 202 segments w/ labels (144 GB)
● Test: 150 segments w/o labels (192 GB)
● Contains 1,150 segments, each spanning 20 seconds
● Additionally, segments from a new location, of which only a subset are labeled, are provided for domain adaptation
Data Format
Segment = a sequence of Frames, where each Frame contains:
● context: shared information among all frames in the scene (e.g., calibration parameters, stats)
● timestamp_micros: frame timestamp
● pose: vehicle pose
● images: camera images and metadata (e.g., pose, velocity, timestamp)
● lasers: range images
● laser_labels: 3D box annotations
● projected_lidar_labels: LiDAR labels (laser_labels) projected to the camera images
● camera_labels: 2D box annotations
● no_label_zones: polygons that represent areas without labels (e.g., the opposite side of a highway)
● Each segment (20 sec) consists of ~200 frames (10 Hz)
● All the data for a segment is stored in a single TFRecord file and represented as protocol buffers
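As a minimal sketch (following the pattern of the official Colab tutorial linked a few slides below), a segment can be read frame by frame like this; the file path is a placeholder.

```python
import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset

FILENAME = '/data/waymo/segment-XXXX.tfrecord'   # placeholder path to one segment
dataset = tf.data.TFRecordDataset(FILENAME, compression_type='')

for data in dataset:                              # one record per frame (~200 per segment)
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))
    print(frame.context.name, frame.timestamp_micros, len(frame.laser_labels))
    break
```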
Range Image
The point cloud of each LiDAR is encoded as a range image
For each of the two returns (1st return and 2nd return), the range image stores three channels: range, intensity, and elongation.
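Continuing from the frame parsed above, the official utilities can decode the range images back into a point cloud; this is a sketch based on the tutorial, and the exact signatures may differ between versions of the waymo-open-dataset package.

```python
import numpy as np
from waymo_open_dataset.utils import frame_utils

# Decode the range images of all five LiDARs and convert them to 3D points
range_images, camera_projections, range_image_top_pose = (
    frame_utils.parse_range_image_and_camera_projection(frame))
points, cp_points = frame_utils.convert_range_image_to_point_cloud(
    frame, range_images, camera_projections, range_image_top_pose)

points_all = np.concatenate(points, axis=0)   # (N, 3) points in the vehicle frame, 1st return only
```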
API & Tutorial in Colab
https://github.com/waymo-research/waymo-open-dataset
https://colab.research.google.com/github/waymo-research/waymo-open-dataset/blob/master/tutorial/tutorial.ipynb
Data Visualization (LiDAR Point Cloud)
[Point cloud visualizations: the mid-range LiDAR shown together with each short-range LiDAR (front, right, rear, left) and with all short-range LiDARs combined]
Data Visualization (Camera Images)
● Front Left: 1920x1080
● Front: 1920x1080
● Front Right: 1920x1080
● Side Left: 1920x886
● Side Right: 1920x886
3D Object Detection
■ 3D LiDAR Labels
■ 3D 7-DOF bounding boxes in the vehicle frame with globally unique tracking IDs
■ Vehicles, pedestrians, cyclists, signs
■ 2D Camera Labels
■ Not projections of the 3D labels
■ Vehicles, pedestrians, cyclists
■ Tight-fitting, axis-aligned 2D bounding boxes with globally unique tracking IDs
Labeled object and tracking ID counts
              Vehicle   Pedestrian   Cyclist   Sign
3D Object     6.1M      2.8M         67K       3.2M
3D TrackID    60K       23K          620       23K
2D Object     7.7M      2.1M         63K       -
2D TrackID    164K      45K          1.3K      -
LiDAR to Camera Projection
■ Camera and LiDAR data are well synchronized
■ LiDAR points can be projected onto the camera images with rolling shutter effect compensation
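The geometry behind this is ordinary pinhole projection; the sketch below is a simplified illustration that ignores lens distortion and the rolling-shutter compensation the official tooling performs, and the matrix names are mine.

```python
import numpy as np

def project_to_image(points_vehicle, T_cam_from_vehicle, K):
    """Project Nx3 LiDAR points (vehicle frame) into a pinhole camera.

    Simplified sketch: T_cam_from_vehicle is a 4x4 extrinsic matrix,
    K is a 3x3 intrinsic matrix; distortion and rolling shutter are ignored.
    """
    n = points_vehicle.shape[0]
    pts_h = np.hstack([points_vehicle, np.ones((n, 1))])   # homogeneous coordinates
    pts_cam = (T_cam_from_vehicle @ pts_h.T)[:3]            # 3xN, camera frame
    in_front = pts_cam[2] > 0                               # keep points in front of the camera
    uv = K @ pts_cam[:, in_front]                           # 3xM
    uv = uv[:2] / uv[2]                                     # perspective divide -> pixel coords
    return uv.T, in_front
```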
Evaluation Metrics for 3D Object Detection
https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
P/R Curve, and Average Precision weighted by Heading (APH)
Each true positive is weighted by its heading accuracy, defined as 1 - min(|θ_p - θ_gt|, 2π - |θ_p - θ_gt|) / π, where θ_p is the predicted heading and θ_gt the ground-truth heading: a perfect heading gives weight 1 and an opposite heading gives weight 0.
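A small sketch of that weighting, assuming the normalization described above (angular error wrapped to [0, π] and mapped linearly to a weight between 1 and 0):

```python
import numpy as np

def heading_accuracy(theta_pred, theta_gt):
    """APH weight for a true positive: 1 for a perfect heading,
    0 when the predicted heading points the opposite way."""
    diff = np.abs(theta_pred - theta_gt) % (2.0 * np.pi)
    diff = np.minimum(diff, 2.0 * np.pi - diff)   # wrap the error to [0, pi]
    return 1.0 - diff / np.pi
```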
License
To ensure the Dataset is only used for Non-Commercial Purposes, You agree
■ Not to distribute or publish any models trained on or refined using the Dataset, or the weights or biases from such trained models
■ Not to use or deploy the Dataset, any models trained on or refined using the Dataset, or the weights or biases from such trained models (i) in operation of a vehicle or to assist in the operation of a vehicle, (ii) in any Production Systems, or (iii) for any other primarily commercial purposes
https://waymo.com/open/terms/
PointNet [C. Qi+, CVPR2017]
■ Design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input
■ Provide a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing
https://arxiv.org/abs/1612.00593
PointNet Architecture
A mini-network (T-Net) predicts an affine transformation matrix that aligns the input point set, achieving invariance against geometric transformations
PointNet Architecture
The same alignment approach
is also applied in feature space
PointNet Architecture
Using max pooling as a symmetric function, the unordered point features are aggregated into a global feature
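To make the pieces above concrete, here is a minimal PointNet-style classifier in PyTorch (my own sketch, not the authors' code): a shared per-point MLP, max pooling as the symmetric function, and a small classification head; the input and feature T-Nets are omitted for brevity.

```python
import torch
import torch.nn as nn

class MiniPointNet(nn.Module):
    """Minimal PointNet classifier: per-point shared MLP (1x1 convolutions),
    max pooling as the symmetric aggregation, then an MLP classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(            # shared MLP applied to every point
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.head = nn.Sequential(                 # classifier on the global feature
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, xyz):                          # xyz: (B, N, 3) unordered points
        feats = self.point_mlp(xyz.transpose(1, 2))  # (B, 1024, N)
        global_feat = feats.max(dim=2).values        # max pool over points -> permutation invariant
        return self.head(global_feat)

# usage: logits = MiniPointNet()(torch.randn(2, 1024, 3))
```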
VoxelNet [Y. Zhou+, CVPR2018]
LiDAR ONLY
■ Divide a point cloud into 3D voxels and transform them into a unified feature representation
■ The descriptive volumetric representation is then connected to an RPN to generate detections
A voxel represents a value on a regular grid in three-dimensional space (https://en.wikipedia.org/wiki/Voxel)
https://arxiv.org/abs/1711.06396
Voxel Feature Encoding (VFE) Layer
● VFE enables inter-point interaction within
a voxel, by combining point-wise features
with a locally aggregated feature.
● Stacking multiple VFE layers allows
learning complex features for
characterizing local 3D shape information
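A sketch of a single VFE layer along these lines, assuming PyTorch; masking of the zero-padded point slots that the real implementation needs is omitted.

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """One Voxel Feature Encoding layer (as described in VoxelNet): a shared
    linear layer per point, max pooling within each voxel, and concatenation
    of the point-wise feature with the locally aggregated feature."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.fcn = nn.Sequential(nn.Linear(in_ch, out_ch // 2),
                                 nn.BatchNorm1d(out_ch // 2), nn.ReLU())

    def forward(self, x):                                        # x: (K voxels, T points, in_ch)
        K, T, _ = x.shape
        pointwise = self.fcn(x.view(K * T, -1)).view(K, T, -1)   # per-point features
        aggregated = pointwise.max(dim=1, keepdim=True).values   # (K, 1, out_ch/2) local max
        aggregated = aggregated.expand(-1, T, -1)                # broadcast back to every point
        return torch.cat([pointwise, aggregated], dim=2)         # (K, T, out_ch)
```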
Convolutional Middle Layers
● Each convolutional middle layer applies a 3D convolution, batch normalization, and ReLU sequentially
● Convolutional middle layers aggregate
voxel-wise features within a progressively
expanding receptive field, adding more
context to the shape description
Region Proposal Network
● The first layer of each block downsamples the input feature map
● Then the output of every block is upsampled to a fixed size and
concatenated to construct the high resolution feature map
● Finally, this feature map is mapped to the desired learning targets
Evaluation on KITTI
Performance comparison on KITTI validation set
Performance comparison on KITTI test set
SECOND (Sparsely Embedded CONvolutional Detection) [Y. Yan+, Sensors 2018]
LiDAR ONLY
■ Apply sparse convolution to greatly increase the speed of training and inference
■ Introduce a novel angle loss regression approach to solve the problem of the large loss generated when the angle prediction error is equal to π
https://pdfs.semanticscholar.org/5125/a16039cabc6320c908a4764f32596e018ad3.pdf
Sparse Convolution Algorithm
■ Gather the necessary input to construct the matrix, perform GEMM, then scatter the data back
■ A GPU-based rule generation algorithm is proposed to construct the input–output index rule matrix
Sine-Error Loss for Angle Regression
■ Directly predicting the radian offset suffers from an adversarial example problem between the cases of 0 and π radians, because they correspond to the same box but generate a large loss when one is misidentified as the other
■ Solve this problem by introducing a new angle loss regression: a smooth-L1 loss applied to sin(θ_p - θ_t), the sine of the difference between the predicted and ground-truth angles
■ To address the issue that this loss treats boxes with opposite directions as being the same, a simple direction classifier is added to the output of the RPN
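A one-function sketch of this loss as I read it from the paper (names are mine):

```python
import torch
import torch.nn.functional as F

def sine_error_loss(theta_pred, theta_gt):
    """Apply smooth-L1 to sin(theta_pred - theta_gt): an error of pi costs the
    same as 0, and the 0-vs-pi ambiguity is resolved separately by a direction
    classifier on the RPN output."""
    return F.smooth_l1_loss(torch.sin(theta_pred - theta_gt),
                            torch.zeros_like(theta_pred))
```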
Evaluation on KITTI
Performance comparison on KITTI validation set
Performance comparison on KITTI test set
PointPillars [A. Lang+, CVPR2019]
■ Propose an encoder that learns a representation of point clouds organized in vertical columns (pillars) and generates a pseudo 2D image
■ The encoded features can be used with any standard 2D convolutional detection architecture, without computationally expensive 3D ConvNets
LiDAR ONLY
https://arxiv.org/abs/1812.05784
Pointcloud to Pseudo-Image
The point cloud is discretized into an evenly spaced grid in the x-y plane, creating a set of pillars
Pointcloud to Pseudo-Image
Create a dense tensor of size (D, P, N)
D: Dimension of augmented lidar point (=9)
P: Number of non-empty pillars per sample
N: Number of points per pillar
Pointcloud to Pseudo-Image
Apply a PointNet to generate a (C, P, N) sized feature tensor, followed by a max operation over the points in each pillar to create an output tensor of size (C, P)
Pointcloud to Pseudo-Image
Features are scattered back to the original
pillar locations to create a pseudo-image of
size (C, H, W) where H and W indicate the
height and width of the canvas
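A minimal sketch of this scatter step, with variable names of my own choosing:

```python
import torch

def scatter_to_canvas(pillar_features, pillar_xy_indices, H, W):
    """Scatter (C, P) pillar features back onto a (C, H, W) pseudo-image.
    pillar_xy_indices: (P, 2) integer grid coordinates (x, y) of each
    non-empty pillar. Cells with no pillar stay zero."""
    C, P = pillar_features.shape
    canvas = torch.zeros(C, H * W, dtype=pillar_features.dtype)
    flat_idx = pillar_xy_indices[:, 1] * W + pillar_xy_indices[:, 0]   # row-major index
    canvas[:, flat_idx] = pillar_features
    return canvas.view(C, H, W)
```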
Backbone
Top-down network produces
features at increasingly
small spatial resolution
Second network performs
upsampling and concatenation
of the top-down features
Detection Head
Single Shot Detector (SSD) is
used with additional regression
targets (height and elevation)
Let's Try PointPillars on the Waymo Open Dataset
■ Implementation
■ The official PointPillars implementation was forked from SECOND's implementation and is no longer maintained
■ Instead, SECOND's implementation now supports PointPillars
■ Format Conversion
■ SECOND's implementation only supports KITTI and nuScenes, so format conversion is the fastest way to use the Waymo Open Dataset
■ Several converters can be found on GitHub
■ Waymo_Kitti_Adapter
■ waymo_kitti_converter
Vehicle Detection Results
These results are just for reference, because only a part of the training set was used and the hyperparameters were not tuned to the Waymo Open Dataset at all
Results from Leaderboard on Waymo Open Dataset
https://waymo.com/open/challenges/3d-detection/#
Frustum PointNets [C. Qi+, CVPR2018]
LiDAR + Camera
■ First generate 2D object region proposals in the RGB image using a CNN, then extrude each 2D region into a 3D viewing frustum to get a point cloud
■ PointNet predicts a 3D bounding box for the object from the points in the frustum
https://arxiv.org/abs/1711.08488
Frustum Proposal
● Use an object detector on the RGB image to predict a 2D bounding box and lift it to a frustum using the known camera matrix
● Collect all points within the frustum to form a frustum point cloud
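A simplified sketch of this step (points assumed already transformed into the camera frame; distortion and the frustum normalization used in the paper are omitted):

```python
import numpy as np

def frustum_points(points_cam, K, box2d):
    """Collect the LiDAR points whose image projection falls inside a 2D box.
    points_cam: (N, 3) points in the camera frame, K: 3x3 intrinsic matrix,
    box2d: (xmin, ymin, xmax, ymax) in pixels."""
    xmin, ymin, xmax, ymax = box2d
    in_front = points_cam[:, 2] > 0                    # only points in front of the camera
    uvw = points_cam[in_front] @ K.T                   # project with the camera matrix
    uv = uvw[:, :2] / uvw[:, 2:3]
    inside = ((uv[:, 0] >= xmin) & (uv[:, 0] <= xmax) &
              (uv[:, 1] >= ymin) & (uv[:, 1] <= ymax))
    return points_cam[in_front][inside]                # the frustum point cloud
```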
3D Instance Segmentation
Object instance is segmented by
binary classification of each point
using PointNet
Amodal 3D Box Estimation
Estimate the object’s amodal
oriented 3D bounding box by
using a box regression PointNet
Estimate the true center of the complete object and then transform the coordinates so that the predicted center becomes the origin
Evaluation on KITTI
Performance comparison on KITTI validation set
Performance comparison on KITTI test set
PV-RCNN [S. Shi+, CVPR2020]
https://arxiv.org/abs/1912.13192
■ Voxel-based operation efficiently encodes multi-scale feature representations and can
generate high-quality 3D proposals, while the PointNet-based set abstraction operation
preserves accurate location information with flexible receptive fields
■ Integrate the two operations via the voxel-to-keypoint 3D scene encoding and the keypoint-to-grid RoI feature abstraction
LiDAR ONLY
3D Voxel CNN for Feature Encoding and Proposal Generation
Input points are first divided into
voxels and gradually converted into
feature volumes by 3D sparse CNN
By converting the 3D feature volumes into 2D bird's-eye-view feature maps, high-quality 3D proposals are generated following anchor-based approaches
Voxel-to-keypoint Scene Encoding via Voxel Set Abstraction
A small number of keypoints is sampled from the point cloud
PointNet-based set abstraction module encodes
the multi-scale semantic features from the 3D
CNN feature volumes to the keypoints.
Check whether each keypoint is inside or outside a ground-truth 3D box, and re-weight the keypoint features accordingly
Keypoint-to-grid RoI Feature Abstraction for Proposal Refinement
RoI-grid pooling module
aggregates the keypoint
features to the RoI-grid
points with multiple
receptive fields using
PointNet
Evaluation on KITTI / Waymo Open Dataset
Performance comparison on KITTI test set
Performance comparison on Waymo OD validation set
We Don't Need a Camera?
[Chart: 3D vehicle detection performance on the KITTI test set (moderate), comparing LiDAR-only and LiDAR + camera methods]
Summary
■ Autonomous Driving Datasets
■ KITTI is the most famous and frequently used dataset for vehicle-related research; however, it is limited in size and performance on it is saturating (> 80% AP)
■ More recent datasets provide much larger multi-modal sensor data and annotations, and some of them also provide semantic maps
■ The Waymo Open Dataset is one of the largest and most diverse datasets ever released, and provides high-quality (meta)data and annotations (but unfortunately, it is NOT commercial-friendly at all)
■ 3D Object Detection Algorithms
■ Recent 3D object detection algorithms re-purpose camera-based detection architectures, which have been greatly advanced by CNNs and many mature techniques such as region proposals
■ The two main streams are grid-based methods and point-based methods; the key component of the former is the 2D/3D CNN, and of the latter PointNet
■ Current SoTA results are dominated by LiDAR-only methods, and LiDAR-camera fusion methods lag behind