3D Perception for Autonomous Driving - Datasets and Algorithms -
Kazuyuki Miyazawa
AI R&D Group 2, AI System Dept.
Mobility Technologies Co., Ltd.
Who am I?
@kzykmyzw
Kazuyuki Miyazawa
Group Leader
AI R&D Group 2
AI System Dept.
Mobility Technologies Co., Ltd.
Past Work Experience
April 2019 - March 2020
AI Research Engineer@DeNA Co., Ltd.
April 2010 - March 2019
Research Scientist@Mitsubishi Electric Corp.
Education
PhD in Information Science@Tohoku University
3D Object Detection: Motivation
■ 2D bounding boxes alone are not sufficient
■ They lack 3D pose, occlusion information, and 3D location
Preliminary (Today’s Main Topic)
[Figure: 2D object detection vs. 3D object detection (http://www.cs.toronto.edu/~byang/)]
KITTI [2012]
Sensor Setup
● GPS/IMU x 1
● LiDAR (64ch) x 1
● Grayscale Camera (1.4M) x 2
● Color Camera (1.4M) x 2
http://www.cvlibs.net/datasets/kitti/
3D Object Detection
● 7,481 training images / point clouds
● 7,518 test images / point clouds
● 80,256 labeled objects
Annotations
● type: Car, Van, Truck, Pedestrian, Person_sitting, Cyclist, Tram, Misc, or DontCare
● truncated: 0 to 1, where truncated refers to the object leaving the image boundaries
● occluded: 0 = fully visible, 1 = partly occluded, 2 = largely occluded, 3 = unknown
● alpha: observation angle of the object, ranging [-pi..pi]
● bbox: 2D bounding box of the object in the image
● dimensions: 3D object dimensions: height, width, length
● location: 3D object location x, y, z in camera coordinates
● rotation_y: rotation ry around the Y-axis in camera coordinates [-pi..pi]
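As a concrete illustration, here is a minimal sketch that parses one line of a KITTI label file into these fields; the value order follows the official devkit readme, and error handling is omitted.

```python
from dataclasses import dataclass

@dataclass
class KittiLabel:
    type: str            # Car, Van, Truck, Pedestrian, ...
    truncated: float     # 0 to 1
    occluded: int        # 0 = fully visible ... 3 = unknown
    alpha: float         # observation angle [-pi..pi]
    bbox: tuple          # (left, top, right, bottom) in pixels
    dimensions: tuple    # (height, width, length) in meters
    location: tuple      # (x, y, z) in camera coordinates
    rotation_y: float    # rotation around the camera Y-axis [-pi..pi]

def parse_kitti_label(line: str) -> KittiLabel:
    """Parse one line of a KITTI label_2 file (15 whitespace-separated values)."""
    v = line.split()
    return KittiLabel(
        type=v[0],
        truncated=float(v[1]),
        occluded=int(float(v[2])),
        alpha=float(v[3]),
        bbox=tuple(map(float, v[4:8])),
        dimensions=tuple(map(float, v[8:11])),
        location=tuple(map(float, v[11:14])),
        rotation_y=float(v[14]),
    )
```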
Variants of KITTI
SemanticKITTI Dataset provides
annotations that associate each LiDAR
point with one of 28 semantic classes in all
22 sequences of the KITTI Dataset
http://semantic-kitti.org/
Virtual KITTI contains 50 high-resolution
monocular videos (21,260 frames)
generated from five different virtual worlds
in urban settings under different imaging
and weather conditions
https://europe.naverlabs.com/research/computer-vision/proxy-virtual-worlds/
ApolloScape [2017]
Sensor Setup
● GPS/IMU x 1
● LiDAR x 2
● Color Camera (9.2M) x 2
http://apolloscape.auto/
3D Object Detection
● 53 min training sequences
● 50 min testing sequences
● 70K 3D fitted cars
Annotations
● type: Small vehicle, Big vehicle, Pedestrian, Motorcyclist and Bicyclist, Traffic cones, Others
● dimensions: 3D object dimensions: height, width, length
● location: 3D object location x, y, z in relative coordinates
● heading: heading angle in radians with respect to the direction of the object
License
■ To the extent that we authorize the Developer to use Datasets and subject to the terms of this Agreement, the Developer is entitled to use the Datasets only (i) for Developer’s internal purposes of non-commercial research or teaching and (ii) in accordance with the terms of this Agreement.
http://apolloscape.auto/license.html
nuScenes [2019]
Sensor Setup
● GPS/IMU x 1
● LiDAR (32ch) x 1
● RADAR x 5
● Color Camera (1.4M) x 6
https://www.nuscenes.org/
Semantic Map
● Provide highly accurate human-annotated
semantic maps of the relevant areas
● 11 semantic classes
● Encourage the use of localization and
semantic maps as strong priors for all tasks
3D Object Detection
Database schema:
● category
● attribute
● visibility
● instance
● sensor
● calibrated_sensor
● ego_pose
● log
● scene
● sample
● sample_data
● sample_annotation
● map
[Figures: number of annotations per category; attribute distribution for selected categories. 1.4M boxes in total]
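For orientation, here is a minimal sketch of walking this schema with the official nuScenes devkit (nuscenes-devkit); the dataroot path and the v1.0-mini split are placeholders for whatever you have downloaded.

```python
from nuscenes.nuscenes import NuScenes

# dataroot is a placeholder; v1.0-mini is the small teaser split
nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=True)

sample = nusc.sample[0]                                        # one annotated keyframe
lidar = nusc.get('sample_data', sample['data']['LIDAR_TOP'])   # the LiDAR sweep of that keyframe
for ann_token in sample['anns']:                               # its 3D box annotations
    ann = nusc.get('sample_annotation', ann_token)
    print(ann['category_name'], ann['translation'], ann['size'])
```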
Argoverse [2019]
Sensor Setup
● GPS x 1
● LiDAR (32ch) x 2
● Color Camera (4.8M) x 2
● Color Camera (2M) x 7
https://www.argoverse.org/
3D Object Detection (3D Tracking)
● Collection of 113 log segments with
3D object tracking annotations
● These log segments vary in length
from 15 to 30 seconds and contain
a total of 11,052 tracks
● Each sequence includes
annotations for all objects within 5
meters of “drivable area” — the
area in which it is possible for a
vehicle to drive
Audi Autonomous Driving Dataset (A2D2) [2020]
Sensor Setup
● GPS/IMU x 1
● LiDAR (16ch) x 5
● Color Camera (2.3M) x 6
https://www.a2d2.audi/a2d2/en.html
3D Object Detection
● All images have corresponding
LiDAR point clouds, of which
12,497 are annotated with 3D
bounding boxes within the field
of view of the front-center
camera
Comparison
[Comparison table of the datasets, with six cells hidden as "?"]
These figures are based on Table 1 in https://arxiv.org/abs/1912.04838
Comparison
[The same comparison table revealed: all six hidden cells read "Waymo"]
These figures are based on Table 1 in https://arxiv.org/abs/1912.04838
Waymo Open Dataset [2019]
Sensor Setup
● Mid-Range (~75m) LiDAR x 1
● Short-Range (~20m) LiDAR x 4
● Color Camera (2M) x 3
● Color Camera (1.6M) x 2
https://waymo.com/open/
Data Volume
● Train: 798 segments w/ labels (757 GB)
● Validation: 202 segments w/ labels (144 GB)
● Test: 150 segments w/o labels (192 GB)
● Contains 1,150 segments, each spanning 20 seconds
● Additionally, segments from a new location, of which only a subset are labeled, are provided for domain adaptation
Data Format
Segment = a sequence of Frames, where each Frame contains:
● context: shared information among all frames in the scene (e.g., calibration parameters, stats)
● timestamp_micros: frame timestamp
● pose: vehicle pose
● images: camera images and metadata (e.g., pose, velocity, timestamp)
● lasers: range images
● laser_labels: 3D box annotations
● projected_lidar_labels: LiDAR labels (laser_labels) projected to the camera images
● camera_labels: 2D box annotations
● no_label_zones: polygons that represent areas without labels (e.g., the opposite side of a highway)
● Each segment (20 sec) consists of ~200 frames (10 Hz)
● All the data for a segment is stored in a single TFRecord file and represented as protocol buffers
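As a minimal sketch (following the pattern of the official Colab tutorial linked a few slides below), a segment can be read frame by frame like this; the file path is a placeholder.

```python
import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset

FILENAME = '/data/waymo/segment-XXXX.tfrecord'   # placeholder path to one segment
dataset = tf.data.TFRecordDataset(FILENAME, compression_type='')

for data in dataset:                              # one record per frame (~200 per segment)
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))
    print(frame.context.name, frame.timestamp_micros, len(frame.laser_labels))
    break
```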
Range Image
The point cloud of each LiDAR is encoded as a range image
For each of the two returns (1st return and 2nd return), the range image stores three channels: range, intensity, and elongation.
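Continuing from the frame parsed above, the official utilities can decode the range images back into a point cloud; this is a sketch based on the tutorial, and the exact signatures may differ between versions of the waymo-open-dataset package.

```python
import numpy as np
from waymo_open_dataset.utils import frame_utils

# Decode the range images of all five LiDARs and convert them to 3D points
range_images, camera_projections, range_image_top_pose = (
    frame_utils.parse_range_image_and_camera_projection(frame))
points, cp_points = frame_utils.convert_range_image_to_point_cloud(
    frame, range_images, camera_projections, range_image_top_pose)

points_all = np.concatenate(points, axis=0)   # (N, 3) points in the vehicle frame, 1st return only
```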
API & Tutorial in Colab
https://github.com/waymo-research/waymo-open-dataset
https://colab.research.google.com/github/waymo-research/waymo-open-dataset/blob/master/tutorial/tutorial.ipynb
Data Visualization (LiDAR Point Cloud)
[Point cloud visualizations: the mid-range LiDAR shown together with each short-range LiDAR (front, right, rear, left) and with all short-range LiDARs combined]
Data Visualization (Camera Images)
● Front Left: 1920x1080
● Front: 1920x1080
● Front Right: 1920x1080
● Side Left: 1920x886
● Side Right: 1920x886
3D Object Detection
■ 3D LiDAR Labels
■ 3D 7-DOF bounding boxes in the vehicle frame with globally unique tracking IDs
■ Vehicles, pedestrians, cyclists, signs
■ 2D Camera Labels
■ Not projections of the 3D labels
■ Vehicles, pedestrians, cyclists
■ Tight-fitting, axis-aligned 2D bounding boxes with globally unique tracking IDs
Labeled object and tracking ID counts
              Vehicle   Pedestrian   Cyclist   Sign
3D Object     6.1M      2.8M         67K       3.2M
3D TrackID    60K       23K          620       23K
2D Object     7.7M      2.1M         63K       -
2D TrackID    164K      45K          1.3K      -
LiDAR to Camera Projection
■ Camera and LiDAR data are well synchronized
■ LiDAR points can be projected onto the camera images with rolling shutter effect compensation
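The geometry behind this is ordinary pinhole projection; the sketch below is a simplified illustration that ignores lens distortion and the rolling-shutter compensation the official tooling performs, and the matrix names are mine.

```python
import numpy as np

def project_to_image(points_vehicle, T_cam_from_vehicle, K):
    """Project Nx3 LiDAR points (vehicle frame) into a pinhole camera.

    Simplified sketch: T_cam_from_vehicle is a 4x4 extrinsic matrix,
    K is a 3x3 intrinsic matrix; distortion and rolling shutter are ignored.
    """
    n = points_vehicle.shape[0]
    pts_h = np.hstack([points_vehicle, np.ones((n, 1))])   # homogeneous coordinates
    pts_cam = (T_cam_from_vehicle @ pts_h.T)[:3]            # 3xN, camera frame
    in_front = pts_cam[2] > 0                               # keep points in front of the camera
    uv = K @ pts_cam[:, in_front]                           # 3xM
    uv = uv[:2] / uv[2]                                     # perspective divide -> pixel coords
    return uv.T, in_front
```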
Evaluation Metrics for 3D Object Detection
https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
P/R Curve, and Average Precision weighted by Heading (APH)
Each true positive is weighted by its heading accuracy, defined as 1 - min(|θ_p - θ_gt|, 2π - |θ_p - θ_gt|) / π, where θ_p is the predicted heading and θ_gt the ground-truth heading: a perfect heading gives weight 1 and an opposite heading gives weight 0.
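A small sketch of that weighting, assuming the normalization described above (angular error wrapped to [0, π] and mapped linearly to a weight between 1 and 0):

```python
import numpy as np

def heading_accuracy(theta_pred, theta_gt):
    """APH weight for a true positive: 1 for a perfect heading,
    0 when the predicted heading points the opposite way."""
    diff = np.abs(theta_pred - theta_gt) % (2.0 * np.pi)
    diff = np.minimum(diff, 2.0 * np.pi - diff)   # wrap the error to [0, pi]
    return 1.0 - diff / np.pi
```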
License
To ensure the Dataset is only used for Non-Commercial Purposes, You agree
■ Not to distribute or publish any models trained on or refined using the Dataset, or the weights or biases from such trained models
■ Not to use or deploy the Dataset, any models trained on or refined using the Dataset, or the weights or biases from such trained models (i) in operation of a vehicle or to assist in the operation of a vehicle, (ii) in any Production Systems, or (iii) for any other primarily commercial purposes
https://waymo.com/open/terms/
PointNet [C. Qi+, CVPR2017]
■ Design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input
■ Provide a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing
https://arxiv.org/abs/1612.00593
PointNet Architecture
A mini-network (T-Net) predicts an affine transformation matrix that aligns the input point set, achieving invariance against geometric transformations
PointNet Architecture
The same alignment approach
is also applied in feature space
PointNet Architecture
Using max pooling as a symmetric function, the unordered point features are aggregated into a global feature
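To make the pieces above concrete, here is a minimal PointNet-style classifier in PyTorch (my own sketch, not the authors' code): a shared per-point MLP, max pooling as the symmetric function, and a small classification head; the input and feature T-Nets are omitted for brevity.

```python
import torch
import torch.nn as nn

class MiniPointNet(nn.Module):
    """Minimal PointNet classifier: per-point shared MLP (1x1 convolutions),
    max pooling as the symmetric aggregation, then an MLP classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(            # shared MLP applied to every point
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.head = nn.Sequential(                 # classifier on the global feature
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, xyz):                          # xyz: (B, N, 3) unordered points
        feats = self.point_mlp(xyz.transpose(1, 2))  # (B, 1024, N)
        global_feat = feats.max(dim=2).values        # max pool over points -> permutation invariant
        return self.head(global_feat)

# usage: logits = MiniPointNet()(torch.randn(2, 1024, 3))
```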
VoxelNet [Y. Zhou+, CVPR2018]
LiDAR ONLY
■ Divide a point cloud into 3D voxels and transform them into a unified feature representation
■ The descriptive volumetric representation is then connected to an RPN to generate detections
A voxel represents a value on a regular grid in three-dimensional space (https://en.wikipedia.org/wiki/Voxel)
https://arxiv.org/abs/1711.06396
Voxel Feature Encoding (VFE) Layer
● VFE enables inter-point interaction within
a voxel, by combining point-wise features
with a locally aggregated feature.
● Stacking multiple VFE layers allows
learning complex features for
characterizing local 3D shape information
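A sketch of a single VFE layer along these lines, assuming PyTorch; masking of the zero-padded point slots that the real implementation needs is omitted.

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """One Voxel Feature Encoding layer (as described in VoxelNet): a shared
    linear layer per point, max pooling within each voxel, and concatenation
    of the point-wise feature with the locally aggregated feature."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.fcn = nn.Sequential(nn.Linear(in_ch, out_ch // 2),
                                 nn.BatchNorm1d(out_ch // 2), nn.ReLU())

    def forward(self, x):                                        # x: (K voxels, T points, in_ch)
        K, T, _ = x.shape
        pointwise = self.fcn(x.view(K * T, -1)).view(K, T, -1)   # per-point features
        aggregated = pointwise.max(dim=1, keepdim=True).values   # (K, 1, out_ch/2) local max
        aggregated = aggregated.expand(-1, T, -1)                # broadcast back to every point
        return torch.cat([pointwise, aggregated], dim=2)         # (K, T, out_ch)
```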
Convolutional Middle Layers
● Each convolutional middle layer applies a 3D convolution, batch normalization, and ReLU sequentially
● Convolutional middle layers aggregate
voxel-wise features within a progressively
expanding receptive field, adding more
context to the shape description
Region Proposal Network
● The first layer of each block downsamples the input feature map
● Then the output of every block is upsampled to a fixed size and
concatenated to construct the high resolution feature map
● Finally, this feature map is mapped to the desired learning targets
Evaluation on KITTI
Performance comparison on KITTI validation set
Performance comparison on KITTI test set
SECOND (Sparsely Embedded CONvolutional Detection) [Y. Yan+, Sensors 2018]
LiDAR ONLY
■ Apply sparse convolution to greatly increase the speed of training and inference
■ Introduce a novel angle loss regression approach to solve the problem of the large loss generated when the angle prediction error is equal to π
https://pdfs.semanticscholar.org/5125/a16039cabc6320c908a4764f32596e018ad3.pdf
Sparse Convolution Algorithm
■ Gather the necessary input to construct the matrix, perform GEMM, then scatter the data back
■ A GPU-based rule generation algorithm is proposed to construct the input–output index rule matrix
Sine-Error Loss for Angle Regression
■ Directly predicting the radian offset suffers from an adversarial example problem between the cases of 0 and π radians, because they correspond to the same box but generate a large loss when one is misidentified as the other
■ Solve this problem by introducing a new angle loss regression: a smooth-L1 loss applied to sin(θ_p - θ_t), the sine of the difference between the predicted and ground-truth angles
■ To address the issue that this loss treats boxes with opposite directions as being the same, a simple direction classifier is added to the output of the RPN
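A one-function sketch of this loss as I read it from the paper (names are mine):

```python
import torch
import torch.nn.functional as F

def sine_error_loss(theta_pred, theta_gt):
    """Apply smooth-L1 to sin(theta_pred - theta_gt): an error of pi costs the
    same as 0, and the 0-vs-pi ambiguity is resolved separately by a direction
    classifier on the RPN output."""
    return F.smooth_l1_loss(torch.sin(theta_pred - theta_gt),
                            torch.zeros_like(theta_pred))
```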
Evaluation on KITTI
Performance comparison on KITTI validation set
Performance comparison on KITTI test set
PointPillars [A. Lang+, CVPR2019]
■ Propose an encoder that learns a representation of point clouds organized in vertical columns (pillars) and generates a pseudo 2D image
■ The encoded features can be used with any standard 2D convolutional detection architecture, without computationally expensive 3D ConvNets
LiDAR ONLY
https://arxiv.org/abs/1812.05784
Pointcloud to Pseudo-Image
The point cloud is discretized into an evenly spaced grid in the x-y plane, creating a set of pillars
Pointcloud to Pseudo-Image
Create a dense tensor of size (D, P, N)
D: Dimension of augmented lidar point (=9)
P: Number of non-empty pillars per sample
N: Number of points per pillar
Pointcloud to Pseudo-Image
Apply a PointNet to generate a (C, P, N) sized feature tensor, followed by a max operation over the points in each pillar to create an output tensor of size (C, P)
Pointcloud to Pseudo-Image
Features are scattered back to the original
pillar locations to create a pseudo-image of
size (C, H, W) where H and W indicate the
height and width of the canvas
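A minimal sketch of this scatter step, with variable names of my own choosing:

```python
import torch

def scatter_to_canvas(pillar_features, pillar_xy_indices, H, W):
    """Scatter (C, P) pillar features back onto a (C, H, W) pseudo-image.
    pillar_xy_indices: (P, 2) integer grid coordinates (x, y) of each
    non-empty pillar. Cells with no pillar stay zero."""
    C, P = pillar_features.shape
    canvas = torch.zeros(C, H * W, dtype=pillar_features.dtype)
    flat_idx = pillar_xy_indices[:, 1] * W + pillar_xy_indices[:, 0]   # row-major index
    canvas[:, flat_idx] = pillar_features
    return canvas.view(C, H, W)
```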
Backbone
Top-down network produces
features at increasingly
small spatial resolution
Second network performs
upsampling and concatenation
of the top-down features
Detection Head
Single Shot Detector (SSD) is
used with additional regression
targets (height and elevation)
Let's Try PointPillars on the Waymo Open Dataset
■ Implementation
■ The official PointPillars implementation was forked from SECOND's implementation and is no longer maintained
■ Instead, SECOND's implementation now supports PointPillars
■ Format Conversion
■ SECOND's implementation only supports KITTI and nuScenes, so format conversion is the fastest way to use the Waymo Open Dataset
■ Several converters can be found on GitHub
■ Waymo_Kitti_Adapter
■ waymo_kitti_converter
Vehicle Detection Results
These results are just for reference, because only a part of the training set was used and the hyperparameters were not tuned to the Waymo Open Dataset at all
Results from Leaderboard on Waymo Open Dataset
https://waymo.com/open/challenges/3d-detection/#
Frustum PointNets [C. Qi+, CVPR2018]
LiDAR + Camera
■ First generate 2D object region proposals in the RGB image using a CNN, then extrude each 2D region into a 3D viewing frustum to get a point cloud
■ PointNet predicts a 3D bounding box for the object from the points in the frustum
https://arxiv.org/abs/1711.08488
Frustum Proposal
● Use an object detector on the RGB image to predict a 2D bounding box and lift it to a frustum using the known camera matrix
● Collect all points within the frustum to form a frustum point cloud
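A simplified sketch of this step (points assumed already transformed into the camera frame; distortion and the frustum normalization used in the paper are omitted):

```python
import numpy as np

def frustum_points(points_cam, K, box2d):
    """Collect the LiDAR points whose image projection falls inside a 2D box.
    points_cam: (N, 3) points in the camera frame, K: 3x3 intrinsic matrix,
    box2d: (xmin, ymin, xmax, ymax) in pixels."""
    xmin, ymin, xmax, ymax = box2d
    in_front = points_cam[:, 2] > 0                    # only points in front of the camera
    uvw = points_cam[in_front] @ K.T                   # project with the camera matrix
    uv = uvw[:, :2] / uvw[:, 2:3]
    inside = ((uv[:, 0] >= xmin) & (uv[:, 0] <= xmax) &
              (uv[:, 1] >= ymin) & (uv[:, 1] <= ymax))
    return points_cam[in_front][inside]                # the frustum point cloud
```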
3D Instance Segmentation
Object instance is segmented by
binary classification of each point
using PointNet
Amodal 3D Box Estimation
Estimate the object’s amodal
oriented 3D bounding box by
using a box regression PointNet
Estimate the true center of the complete object and then transform the coordinates so that the predicted center becomes the origin
Evaluation on KITTI
Performance comparison on KITTI validation set
Performance comparison on KITTI test set
PV-RCNN [S. Shi+, CVPR2020]
https://arxiv.org/abs/1912.13192
■ Voxel-based operation efficiently encodes multi-scale feature representations and can
generate high-quality 3D proposals, while the PointNet-based set abstraction operation
preserves accurate location information with flexible receptive fields
■ Integrate the two operations via the voxel-to-keypoint 3D scene encoding and the keypoint-to-grid RoI feature abstraction
LiDAR ONLY
3D Voxel CNN for Feature Encoding and Proposal Generation
Input points are first divided into
voxels and gradually converted into
feature volumes by 3D sparse CNN
By converting the 3D feature volumes into 2D bird's-eye-view feature maps, high-quality 3D proposals are generated following anchor-based approaches
Voxel-to-keypoint Scene Encoding via Voxel Set Abstraction
A small number of keypoints is sampled from the point cloud
PointNet-based set abstraction module encodes
the multi-scale semantic features from the 3D
CNN feature volumes to the keypoints.
Check whether each keypoint is inside or outside a ground-truth 3D box, and re-weight the keypoint features accordingly
Keypoint-to-grid RoI Feature Abstraction for Proposal Refinement
RoI-grid pooling module
aggregates the keypoint
features to the RoI-grid
points with multiple
receptive fields using
PointNet
Evaluation on KITTI / Waymo Open Dataset
Performance comparison on KITTI test set
Performance comparison on Waymo OD validation set
We Don't Need a Camera?
[Chart: 3D vehicle detection performance on the KITTI test set (moderate), comparing LiDAR-only and LiDAR + camera methods]
Summary
■ Autonomous Driving Datasets
■ KITTI is the most famous and frequently used dataset for vehicle-related research; however, it is limited in size and performance on it is saturating (> 80% AP)
■ More recent datasets provide much larger multi-modal sensor data and annotations, and some of them also provide semantic maps
■ The Waymo Open Dataset is one of the largest and most diverse datasets ever released, and provides high-quality (meta)data and annotations (but unfortunately, it is NOT commercial-friendly at all)
■ 3D Object Detection Algorithms
■ Recent 3D object detection algorithms re-purpose camera-based detection architectures, which have been greatly advanced by CNNs and many mature techniques such as region proposals
■ The two main streams are grid-based methods and point-based methods; the key component of the former is the 2D/3D CNN, and of the latter PointNet
■ Current SoTA results are dominated by LiDAR-only methods, and LiDAR-camera fusion methods lag behind