[DL Reading Group] PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection

DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
PV-RCNN: Point-Voxel Feature Set Abstraction
for 3D Object Detection
Kohei Nishimura, DeepX

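Summary
• Proposes new modules for 3D object detection from point clouds
– A module that extracts keypoint feature vectors from Voxel CNN features and PointNet features
– A pooling module that extracts RoI feature vectors
• Confirms that the proposed modules achieve state-of-the-art (SOTA) results on 3D object detection tasks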

Bibliographic Information
• title: PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection
• authors: Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, Hongsheng Li
• institutes: CUHK-SenseTime Joint Laboratory/The Chinese University of Hong Kong, SenseTime Research, NLPR/CASIA, CSE/CUHK
• publication: CVPR 2020
• paper url: https://arxiv.org/pdf/1912.13192
• code: https://github.com/jhultman/PV-RCNN
Note: figures in these slides are taken from the paper unless otherwise noted.

Related Work: Voxel CNN-based methods
• Convert the point cloud into voxels and extract features with a CNN
• Processes point clouds efficiently, but voxelizing the points so that convolutions can be applied loses information
VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
https://arxiv.org/abs/1711.06396
SECOND: Sparsely Embedded Convolutional Detection
https://www.mdpi.com/1424-8220/18/10/3337/pdf

Related Work: PointNet-based methods
• Feed the raw point cloud directly into the network
• Can extract more information, but processing becomes more expensive
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
https://arxiv.org/pdf/1706.02413.pdf
Deep Hough Voting for 3D Object Detection in Point Clouds
http://openaccess.thecvf.com/content_ICCV_2019/papers/Qi_Deep_Hough_Voting_for_3D_Object_Detection_in_Point_Clouds_ICCV_2019_paper.pdf

Overview
Voxel-to-keypoint Scene Encoding
Keypoint-to-grid RoI Feature Abstraction

Overview
Keypoint-to-grid

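• Goal
– Obtain a good representation of the point cloud
• Summary
– Combine a Voxel CNN with keypoints to extract the point cloud representation
• Modules (items shown in red on the slide are proposed in this paper)
– 3D Voxel CNN
– Keypoints Sampling Module
• Furthest Point-Sampling
– Extended Voxel Set Abstraction Module
– Predicted Keypoint Weighting Module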
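
The Furthest Point-Sampling step listed above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name `farthest_point_sampling` and the `start_index` parameter are hypothetical.

```python
import numpy as np


def farthest_point_sampling(points, n_keypoints, start_index=0):
    """Return the indices of n_keypoints points chosen by farthest point sampling.

    points: (N, 3) array. start_index is a hypothetical parameter that fixes
    the first pick so the result is deterministic.
    """
    points = np.asarray(points, dtype=float)
    num_points = points.shape[0]
    if not 0 < n_keypoints <= num_points:
        raise ValueError("n_keypoints must be between 1 and the number of points")
    selected = np.empty(n_keypoints, dtype=np.int64)
    # Squared distance from every point to its nearest already-selected point.
    min_sq_dist = np.full(num_points, np.inf)
    current = start_index
    for i in range(n_keypoints):
        selected[i] = current
        diff = points - points[current]
        min_sq_dist = np.minimum(min_sq_dist, np.einsum("ij,ij->i", diff, diff))
        # The next keypoint is the point farthest from everything chosen so far.
        current = int(np.argmax(min_sq_dist))
    return selected
```

Each iteration costs O(N), so sampling n keypoints costs O(nN). Practical detectors usually run this step on the GPU.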

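notation
• $i$: index of a keypoint
• $n$: number of keypoints
• $F$: set of voxel-wise feature vectors $f$ produced by the voxel CNN
• $V$: set of 3D voxel coordinates $v$
• $k$: layer (level) index of the voxel CNN
• $pv_k$: superscript marking a feature vector taken from the $k$-th layer of the network
• $N_k$: number of non-empty voxels (voxels that contain points)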

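3D Voxel CNN
• 3D voxel CNN
– Divide the point cloud into $L \times W \times H$ voxels and feed them to 3D sparse convolution layers with 3 x 3 x 3 kernels
– Downsample over 4 levels and treat the features of each level as voxel-wise feature vectors
• downsampling factors: 1x, 2x, 4x, 8x
• 3D proposal generation
– Stack the feature vectors of the final downsampled level of the 3D voxel CNN along the Z axis to form 2D features (bird-view features, hereafter bev)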
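
Stacking along the Z axis can be illustrated with a reshape. The sketch below assumes the sparse 3D features have already been converted to a dense array laid out as (channels, Z, Y, X). That layout and the name `stack_to_bev` are assumptions made for illustration.

```python
import numpy as np


def stack_to_bev(voxel_features):
    """Fold the Z axis of a dense (C, D, H, W) volume into the channel axis.

    The result has shape (C * D, H, W): one 2D bird-view map whose channels
    hold every (channel, height-slice) pair of the 3D volume.
    """
    channels, depth, height, width = voxel_features.shape
    return voxel_features.reshape(channels * depth, height, width)
```

With this layout, channel `c * D + d` of the bird-view map equals slice `[c, d]` of the 3D volume.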

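Voxel Set Abstraction Module
• A module that embeds the features from each layer of the 3D CNN into the keypoints
– Set abstraction was proposed in PointNet++
– PointNet++ aggregates its own point features rather than Voxel CNN features
• The feature vector $f_i^{(pv_k)}$ of keypoint $i$ at the $k$-th layer of the 3D CNN is
$$f_i^{(pv_k)} = \max\left\{ G\left( M\left( S_i \right) \right) \right\}$$
– $M(\cdot)$ randomly samples at most $T_k$ elements from the neighbor set $S_i$
– $G(\cdot)$ is an MLP that embeds the voxel-wise feature vectors and relative coordinates
– $S_i$ is the set of 3D CNN feature vectors of the voxels within distance $r_k$ of keypoint $i$, each paired with its coordinates relative to the keypoint
• The feature vector of keypoint $i$ is obtained by concatenating the feature vectors from all layers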
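
A minimal NumPy sketch of one level of this aggregation for a single keypoint: gather the voxels within the radius, randomly keep at most a fixed number of them, embed each (feature, relative offset) pair, and max-pool. The single linear layer with ReLU stands in for the MLP, and returning zeros for an empty neighborhood is an assumption. All names here are hypothetical.

```python
import numpy as np


def voxel_set_abstraction(keypoint, voxel_coords, voxel_feats, radius,
                          max_samples, weight, bias, rng=None):
    """Aggregate voxel features around one keypoint at one level.

    keypoint: (3,), voxel_coords: (N, 3), voxel_feats: (N, C),
    weight: (C + 3, C_out), bias: (C_out,).
    """
    rng = np.random.default_rng() if rng is None else rng
    offsets = voxel_coords - keypoint
    sq_dist = np.einsum("ij,ij->i", offsets, offsets)
    neighbors = np.flatnonzero(sq_dist < radius ** 2)  # the neighbor set
    if neighbors.size == 0:
        # Assumption: an empty neighborhood yields an all-zero feature.
        return np.zeros(weight.shape[1])
    if neighbors.size > max_samples:
        # Random subsampling of the neighbor set.
        neighbors = rng.choice(neighbors, size=max_samples, replace=False)
    grouped = np.concatenate([voxel_feats[neighbors], offsets[neighbors]], axis=1)
    embedded = np.maximum(grouped @ weight + bias, 0.0)  # one-layer stand-in MLP
    return embedded.max(axis=0)  # max-pool over the sampled neighbors
```

Calling this once per level with that level's radius and concatenating the outputs gives the multi-scale keypoint feature described above.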

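Extended Voxel Set Abstraction Module
• Extends Voxel Set Abstraction (VSA) to capture more information
– The feature vector $f_i^{(p)}$ of keypoint $i$ is the concatenation of the following three feature vectors
• VSA feature vector: $f_i^{(pv)}$
• raw point cloud feature vector: $f_i^{(raw)}$
• 2D bird-view feature vector: $f_i^{(bev)}$
– Raw point cloud feature vector $f_i^{(raw)}$: compensates for the information lost during voxelization
• The paper states that it is computed in the same way as $f_i^{(pv)}$, but how exactly is the keypoint feature vector computed?
– 2D bird-view feature vector $f_i^{(bev)}$: has a larger receptive field than the voxels, so it can capture global features
• Computed from the feature map described under 3D Voxel CNN, using the keypoint coordinates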
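
One common way to read a feature out of a 2D map at a continuous coordinate is bilinear interpolation. The sketch below assumes that approach, since the slide only says the coordinates are used. It also assumes the keypoint has already been projected to fractional pixel coordinates of the bird-view map. The function name is hypothetical.

```python
import numpy as np


def bilinear_interpolate_bev(bev, x, y):
    """Sample a (C, H, W) bird-view map at fractional pixel position (x, y).

    x indexes the W axis and y indexes the H axis. Positions outside the map
    are clamped to the border.
    """
    _, height, width = bev.shape
    x = float(np.clip(x, 0, width - 1))
    y = float(np.clip(y, 0, height - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    # Blend along X on the two neighboring rows, then blend the rows along Y.
    top = (1 - wx) * bev[:, y0, x0] + wx * bev[:, y0, x1]
    bottom = (1 - wx) * bev[:, y1, x0] + wx * bev[:, y1, x1]
    return (1 - wy) * top + wy * bottom
```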

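Predicted Keypoint Weighting Module (PKW)
• Assigns an importance weight to each keypoint to improve the accuracy of the object region proposals
– Foreground keypoints are more important than background keypoints
• With PKW, the importance-weighted feature vector of keypoint $i$ is
$$\tilde{f}_i^{(p)} = A\left(f_i^{(p)}\right) \cdot f_i^{(p)}$$
– $A(\cdot)$ is a three-layer MLP that predicts from the feature vector, as a value in [0, 1], whether the keypoint is in the foreground
– Training details for $A(\cdot)$
• The foreground flag of each keypoint is derived from segmentation labels
– whether the keypoint lies inside a 3D ground-truth box
• The loss function is focal loss (default parameters)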
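
The weighting and its training loss can be sketched as below. The logits stand in for the output of the three-layer MLP, which is not reproduced here. The values alpha = 0.25 and gamma = 2 are assumed to be the "default parameters", following the original focal loss paper. Function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def reweight_keypoints(features, logits):
    """Scale each keypoint feature (n, C) by its predicted foreground score."""
    weights = sigmoid(np.asarray(logits, dtype=float))
    return np.asarray(features, dtype=float) * weights[:, None]


def focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    """Binary focal loss averaged over keypoints (labels: 1 = foreground)."""
    labels = np.asarray(labels)
    p = sigmoid(np.asarray(logits, dtype=float))
    p_t = np.where(labels == 1, p, 1.0 - p)          # probability of the true class
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(np.clip(p_t, 1e-12, 1.0))
    return float(loss.mean())
```

The `(1 - p_t) ** gamma` factor shrinks the loss of keypoints that are already classified well, which helps when background keypoints greatly outnumber foreground ones.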

Overview
Keypoint-to-grid

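Keypoint-to-grid RoI Feature Abstraction
• Goal
– Obtain a good RoI feature representation to improve detection accuracy
• Summary
– Compute RoI feature vectors from the keypoint feature vectors and refine the detections
• Modules (items shown in red on the slide are proposed in this paper)
– RoI-grid Pooling
– 3D Proposal Refinement and Confidence Prediction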

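RoI-grid Pooling
• Uses the keypoint feature vectors to obtain a good RoI feature vector and improve detection accuracy
– Lets the pooled keypoint features incorporate information from outside the RoI
• The RoI feature vector is the concatenation of the feature vectors of the grid points in that RoI
– Grid points are points sampled uniformly within the RoI
• The feature vector $f_i^{(g)}$ of grid point $g_i$ in the RoI is
$$f_i^{(g)} = \max\left\{ G\left( M\left( \Psi_i \right) \right) \right\}$$
– $M(\cdot)$ randomly samples at most $T_k$ elements from the neighbor set $\Psi_i$
– $G(\cdot)$ is a PointNet block that embeds the feature vectors and relative coordinates
– $\Psi_i$ is built by concatenating the feature vector of each keypoint $p_j$ within distance $r$ of $g_i$ with its position relative to $g_i$:
$$\Psi_i = \left\{ \left[ \tilde{f}_j^{(p)};\; p_j - g_i \right]^T \;\middle|\; \lVert p_j - g_i \rVert < r \right\}$$
– $f_i^{(g)}$ is computed for several radii $r$ and the results are concatenated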
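
Uniformly sampled grid points inside an RoI can be generated as in the sketch below. It assumes a box parameterized by center, size (length, width, height), and yaw around the Z axis. Placing the points at cell centers and the default `grid_size=6` are assumptions, and the function name is hypothetical.

```python
import numpy as np


def roi_grid_points(center, size, yaw, grid_size=6):
    """Return (grid_size**3, 3) points spread uniformly inside a rotated 3D box."""
    # Cell centers along one axis, in normalized box coordinates (-0.5, 0.5).
    ticks = (np.arange(grid_size) + 0.5) / grid_size - 0.5
    gx, gy, gz = np.meshgrid(ticks, ticks, ticks, indexing="ij")
    local = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3) * np.asarray(size, dtype=float)
    c, s = np.cos(yaw), np.sin(yaw)
    rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Rotate the box-local points around Z, then move them to the box center.
    return local @ rotation.T + np.asarray(center, dtype=float)
```

Each grid point then gathers the keypoints within the chosen radius, in the same way a keypoint gathers voxels in the Voxel Set Abstraction step.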

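3D Proposal Refinement and Confidence Prediction
• The refinement network performs RoI refinement and confidence prediction
– Input: the RoI feature vector after RoI-grid pooling
– Output: the size and location of the detected box, and its confidence
– Architecture: a two-layer MLP with two branches
• Confidence prediction
– The target value $y_k$ is
$$y_k = \min\left(1, \max\left(0, 2\,\mathrm{IoU}_k - 0.5\right)\right)$$
• $\mathrm{IoU}_k$ is the IoU of the $k$-th RoI with its ground-truth box
– The loss is the cross-entropy against this normalized IoU target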
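
The confidence target and its loss can be checked numerically with the sketch below. It assumes the target used in the PV-RCNN paper, y_k = min(1, max(0, 2 IoU_k - 0.5)), together with a standard binary cross-entropy. Function names are hypothetical.

```python
import math


def iou_confidence_target(iou):
    """Map a 3D IoU in [0, 1] to a soft confidence target in [0, 1]."""
    return min(1.0, max(0.0, 2.0 * iou - 0.5))


def confidence_loss(predicted, iou, eps=1e-12):
    """Binary cross-entropy between a predicted confidence and the soft target."""
    y = iou_confidence_target(iou)
    p = min(max(predicted, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
```

Under this mapping, an IoU of 0.25 or lower gives a target of 0, an IoU of 0.75 or higher gives a target of 1, and values in between are interpolated linearly.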

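Training Losses
• The total loss is the sum of the following three losses
– $L_{rpn}$: region proposal loss
– $L_{seg}$: keypoint segmentation loss (described under PKW)
– $L_{rcnn}$: refinement loss
• $L_{rpn}$: region proposal loss
– $L_{cls}$: focal loss for classification
– $L_{smooth-L1}$: smooth-L1 error between the boxes predicted from the 3D Voxel CNN and the ground truth (for learning anchor box regression)
• $L_{rcnn}$: refinement loss
– $L_{iou}$: confidence prediction loss
– $L_{smooth-L1}$: smooth-L1 error between the refinement predicted by the refinement network and its ground truth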
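
For reference, a scalar sketch of the smooth-L1 term and the equally weighted sum of the three losses. The transition point `beta=1.0` is an assumed value and the function names are hypothetical.

```python
def smooth_l1(residual, beta=1.0):
    """Smooth-L1 penalty: quadratic for |residual| < beta, linear beyond it."""
    a = abs(residual)
    if a < beta:
        return 0.5 * a * a / beta
    return a - 0.5 * beta


def total_loss(l_rpn, l_seg, l_rcnn):
    """Overall training loss as the plain sum of the three terms."""
    return l_rpn + l_seg + l_rcnn
```

The quadratic region keeps gradients small near zero error, while the linear region limits the influence of large regression outliers.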

Experiment
• Dataset
– KITTI
– Waymo Open
• Evaluation Metrics
– KITTI
• mean Average Precision with 40 recall positions
– Waymo Open
• mean Average Precision
• mean Average Precision weighted by heading
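
A rough sketch of average precision with 40 recall positions. It assumes the commonly used definition in which precision is interpolated at the recall levels 1/40, 2/40, ..., 1 and averaged. The precision-recall pairs are taken as given, and the function name is hypothetical.

```python
import numpy as np


def average_precision_r40(recalls, precisions):
    """Interpolated AP over the 40 recall positions 1/40, 2/40, ..., 1."""
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    total = 0.0
    for position in np.arange(1, 41) / 40.0:
        reached = recalls >= position - 1e-9
        # Interpolated precision: best precision at any recall >= this position.
        if reached.any():
            total += precisions[reached].max()
    return total / 40.0
```

Recall positions that the detector never reaches contribute zero precision, so a detector that misses many objects is penalized even if its confident detections are accurate.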

Results in KITTI
• PV-RCNN outperformed existing methods in most of the experiments

Results in Waymo
• PV-RCNN outperformed existing methods in all of the experiments

Ablation Studies: Voxel-to-keypoint scene encoding & RoI-grid pooling
• Confirmed that both voxel-to-keypoint scene encoding and RoI-grid pooling are effective

Ablation Studies: VSA module
• Confirmed that detection performance improves when the VSA module uses all of the feature vectors

Ablation Studies: PKW & RoI-grid pooling module
• Confirmed that PKW and the RoI-grid pooling module are effective
– RoI-aware Pooling is a method that extracts features by combining max pooling and average pooling over the feature vectors of the points inside the RoI

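Conclusion
• Proposes new modules for 3D object detection from point clouds
– A module that extracts keypoint feature vectors from Voxel CNN features and PointNet features
– A pooling module that extracts RoI feature vectors
• Confirms that the proposed modules achieve state-of-the-art (SOTA) results on 3D object detection tasks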

Implementation Details
• Keypoints Sampling
– n = 2048 in KITTI
– n = 4096 in Waymo
• VSA module
– two neighboring radii of each level
• (0.4m, 0.8m), (0.8m, 1.2m), (1.2m, 2.4m), (2.4m, 4.8m)
• RoI grid pooling operations
– the neighborhood radii of set abstraction for raw points are (0.4m, 0.8m)

Dataset Details
• KITTI
– detection range
• [0, 70.4]m for the X axis,
• [−40, 40]m for the Y axis
• [−3, 1]m for the Z axis
– the voxel size (0.05m,0.05m,0.1m) in each axis.
• Waymo Open dataset
– detection range
• [−75.2, 75.2]m for the X and Y axes
• [−2,4]m for the Z axis
– the voxel size (0.1m, 0.1m, 0.15m) in each axis.
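
The detection ranges and voxel sizes above determine the resolution of the voxel grid. The sketch below works out that arithmetic and maps a point to its voxel index. The helper names are hypothetical.

```python
import math


def voxel_grid_shape(range_min, range_max, voxel_size):
    """Number of voxels along each axis for a detection range and voxel size."""
    return tuple(round((hi - lo) / s)
                 for lo, hi, s in zip(range_min, range_max, voxel_size))


def voxel_index(point, range_min, voxel_size):
    """Integer voxel index (ix, iy, iz) of a point inside the detection range."""
    return tuple(math.floor((p - lo) / s)
                 for p, lo, s in zip(point, range_min, voxel_size))


# KITTI settings from the slide: 1408 x 1600 x 40 voxels.
kitti_shape = voxel_grid_shape((0.0, -40.0, -3.0), (70.4, 40.0, 1.0), (0.05, 0.05, 0.1))
# Waymo Open settings from the slide: 1504 x 1504 x 40 voxels.
waymo_shape = voxel_grid_shape((-75.2, -75.2, -2.0), (75.2, 75.2, 4.0), (0.1, 0.1, 0.15))
```

Both datasets end up with 40 voxels along the Z axis, while the Waymo grid covers a much larger area at a coarser horizontal resolution.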

Training Details
• optimizer: Adam
– cosine annealing learning rate
• KITTI
– batch size: 24
– learning rate: 0.01
– epoch: 80
– GPU: 8 GTX 1080Ti
– training time: 5 hours
• Waymo Open
– batch size: 64
– learning rate: 0.01
– epochs: 50
– GPU: 32 GTX 1080Ti
– training time: 25 hours
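
A plain cosine annealing schedule for the learning rate can be written as below. This is a generic sketch: the exact schedule used for training (for example any warm-up phase) is not given on the slide, and `min_lr` and the function name are assumptions.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=0.01, min_lr=0.0):
    """Learning rate that decays from base_lr to min_lr along half a cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With `base_lr=0.01`, the rate starts at 0.01, passes 0.005 at the halfway point, and approaches `min_lr` at the end of training.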

Training Details2
• For the proposal refinement stage, we randomly sample 128 proposals with 1:1 ratio for positive
and negative proposals, where a proposal is considered as a positive proposal for box refinement
branch if it has at least 0.55 3D IoU with the ground-truth boxes, otherwise it is treated as a
negative proposal.
• Data Augmentation
– random flipping along the X axis,
– global scaling with a random scaling factor sampled from [0.95, 1.05]
– global rotation around the Z axis with a random angle sampled from [-pi/4, pi/4]
– randomly “paste” some new ground-truth objects from other scenes to the current training
scenes, for simulating objects in various environments
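
The three global augmentations (flip, scaling, rotation) can be sketched for the points alone as follows. Ground-truth boxes would need the same transform, and the "paste" augmentation is not covered. The 50% flip probability and the reading of "flipping along the X axis" as mirroring the Y coordinate are assumptions. The function name is hypothetical.

```python
import numpy as np


def augment_points(points, rng):
    """Apply random flip, global rotation, and global scaling to (N, 3+) points."""
    pts = np.array(points, dtype=float)  # np.array copies, so the input is untouched
    if rng.random() < 0.5:
        pts[:, 1] *= -1.0  # mirror across the X axis (assumed interpretation)
    scale = rng.uniform(0.95, 1.05)
    angle = rng.uniform(-np.pi / 4, np.pi / 4)
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Rotate around the Z axis, then scale all three coordinates uniformly.
    pts[:, :3] = scale * (pts[:, :3] @ rotation.T)
    return pts
```

A typical call passes a seeded generator, for example `augment_points(points, np.random.default_rng(0))`, so that augmentations are reproducible.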

References
• PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection
– https://arxiv.org/abs/1912.13192v1
• VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
– https://arxiv.org/abs/1711.06396
• SECOND: Sparsely Embedded Convolutional Detection
– https://www.mdpi.com/1424-8220/18/10/3337/pdf
• PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
– https://arxiv.org/pdf/1706.02413.pdf
• Deep Hough Voting for 3D Object Detection in Point Clouds
– http://openaccess.thecvf.com/content_ICCV_2019/papers/Qi_Deep_Hough_Voting_for_3D_Object_Detection_in_Point_Clouds_ICCV_2019_paper.pdf
