This paper targets accurate 3D hand pose estimation from a single depth map. 3D hand pose estimation is a crucial technique for realizing HCI, AR, and related technologies. Many researchers have proposed methods to improve accuracy, but accuracy has remained limited by the similar appearance of fingers, occlusions, and the complexity caused by diverse finger motions. To overcome these limitations, this work changes the input and output representations used by existing methods. Unlike most previous methods, which take a 2D depth image and directly regress the 3D coordinates of hand joints, the proposed model takes a 3D voxelized depth map as input and outputs 3D heatmaps. An encoder-decoder 3D CNN is used for this, and thanks to the changed input and output representations, the proposed model achieves the highest performance on three widely used 3D hand pose estimation datasets and one 3D human pose estimation dataset. It also won the HANDS 2017 challenge held at ICCV 2017.
1. V2V-PoseNet:
Voxel-to-Voxel Prediction Network for
Accurate 3D Hand and Human Pose
Estimation from a Single Depth Map
Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee
Computer Vision Lab.
Dept. of ECE, ASRI,
Seoul National University
http://cv.snu.ac.kr
Aug 29, 2018
Invited Talk @ NAVER
Winner of the 2017 Hands in the Million Challenge on 3D Hand Pose Estimation
2. Intelligent and Invisible Computing 2
The 2017 Hands in the Million Challenge on
3D Hand Pose Estimation
3. HANDS 2017 3D Hand Pose Estimation Challenge
We won the challenge! (ranked 1st among 15 entries)
Frame-based 3D Hand Pose Estimation
V2V-PoseNet
4. 3D Hand Pose Estimation
Goal: Localize hand keypoints (joints) from a single depth map
Fig. 3D hand model: 21 keypoints (joints)
5. 3D Hand Pose Estimation
Still a hot topic: more than 16,000 publications over the last 5 years
6. Applications
Oculus Rift
Microsoft HoloLens
Crucial Technique for HCI and AR
7. What are the Challenges?
Diverse geometric (shape) variations
Weak appearance features
Heavy self occlusions
Self similarity
Noise
8. Previous works for 3D Hand Pose Estimation
Generative approaches
• Assume a pre-defined hand model and fit it to the input depth image
• Use PSO or ICP to minimize a hand-crafted cost function
[1] C. Qian, et al. "Realtime and robust hand tracking from depth." CVPR 2014.
[2] Tang, Danhang, et al. "Opening the black box: Hierarchical sampling optimization for estimating human hand pose." ICCV 2015.
Fig. Finger detection and hand pose initialization [1]
Fig. Hierarchical sampling optimization using silver and gold energy [2]
9. Previous works for 3D Hand Pose Estimation
Discriminative approaches
• Directly localize keypoints from the input depth image without hand model
• Most of the random forest- and recent deep learning-based methods (including V2V-PoseNet)
Fig. Pose-REN [1]
[1] Chen, Xinghao, et al. "Pose Guided Structured Region Ensemble Network for Cascaded Hand Pose Estimation." Neurocomputing 2018.
[2] Ge, Liuhao, et al. "3d convolutional neural networks for efficient and robust hand pose estimation from single depth images." CVPR 2017.
[3] Ge, Liuhao, et al. "Robust 3d hand pose estimation in single depth images: from single-view cnn to multi-view cnns." CVPR 2016.
Fig. 3D CNN for hand pose estimation [2]
Fig. Multi-view CNN for hand pose estimation [3]
10. Previous works for 3D Hand Pose Estimation
Hybrid approaches
• Try to combine generative and discriminative approaches
• Learn latent space of pose (generative) and localize keypoints from the space (discriminative)
• Recent methods learned latent space successfully using adversarial loss
Fig. Learned latent space of CrossingNets [2]
[1] Zhou, Xingyi, et al. "Model-based deep hand pose estimation." IJCV 2016.
[2] Wan, Chengde, et al. "Crossing nets: Combining gans and vaes with a shared latent space for hand pose estimation." CVPR 2017.
Fig. Model-based hand pose estimation (DeepModel) [1]
11. Our Contributions
Firstly cast 3D hand and human pose estimation from a single depth map as voxel-to-voxel prediction
Empirically validated the usefulness of the volumetric input and output representations
Significantly outperformed existing methods on almost all 3D hand and human pose estimation datasets
Won first place in the HANDS 2017 frame-based 3D hand pose estimation challenge
12. Analysis of the Previous Works
Most of the previous works take a 2D depth image and directly regress 3D coordinates
P2C: [Chen et al. arXiv 2017], [CrossingNets. CVPR 2017], [DeepPrior++. ICCVW 2017], [Oberweger et al. ICCV 2015]
P2V: [Pavlakos et al. CVPR 2017]
V2C: [Ge et al. CVPR 2017], [Deng et al. arXiv 2017]
V2V: Ours
We argue that voxel-to-voxel prediction achieves more accurate results
13. Why Voxel-to-Voxel (V2V) is better?
Perspective distortion matters: what is perspective distortion?
14. Why Voxel-to-Voxel (V2V) is better?
Perspective distortion matters: what is perspective distortion?
(x_pixel, y_pixel) = (x_world, y_world) · FL / z_world + R₀
R₀: constant, FL: focal length (camera parameter), z_world: distance from the camera
Different distances from the camera cause the distortion
(u, v) = (x_pixel, y_pixel), (X, Y, Z) = (x_world, y_world, z_world)
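The projection formula above can be made concrete with a small sketch; the focal length and offset R₀ below are hypothetical values, used only to show that the same metric offset maps to fewer pixels at larger depths:

```python
# Pinhole projection from the formula above; fl and r0 are hypothetical values.
def world_to_pixel(x_w, y_w, z_w, fl=588.0, r0=320.0):
    """(x_pixel, y_pixel) = (x_world, y_world) * FL / z_world + R0."""
    return x_w * fl / z_w + r0, y_w * fl / z_w + r0

# The same 30 mm offset in X spans different pixel widths at different depths:
near = world_to_pixel(30.0, 0.0, 300.0)[0] - world_to_pixel(0.0, 0.0, 300.0)[0]
far = world_to_pixel(30.0, 0.0, 600.0)[0] - world_to_pixel(0.0, 0.0, 600.0)[0]
print(near, far)  # ~58.8 vs ~29.4 pixels -> perspective distortion
```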
15. Why Voxel-to-Voxel (V2V) is better?
Perspective distortion matters:
Fig. 3D-to-2D projection: equal offsets ±ΔX, ±ΔY in the 3D point cloud (world coordinates) have a 1-to-1 relation with the scene, but project to an N-to-1 relation in the 2D depth map (pixel coordinates)
16. Why Voxel-to-Voxel (V2V) is better?
We discretize the 3D point cloud into voxels
The voxelized 3D point cloud is free from perspective distortion
Voxelized input can be adapted to advanced CNN architectures (ResNet, U-Net) more easily than point cloud input
Fig. Voxelizing the 3D point cloud: both the point cloud and the voxels are in world coordinates, with a 1-to-1 relation between them
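The voxelization step can be sketched as below; the cube size (300 mm per side) is an assumed value, while the 88³ grid matches the input resolution mentioned later in the slides:

```python
import numpy as np

def voxelize(points, center, cube_size=300.0, grid=88):
    """Discretize a point cloud (N, 3, world mm) into a binary occupancy grid.

    A fixed-size cube (cube_size mm per side, hypothetical value) is placed
    around the reference point `center`; each point inside maps 1-to-1 to a
    voxel, so no perspective distortion is introduced.
    """
    vox = np.zeros((grid, grid, grid), dtype=np.float32)
    # shift to the cube corner and scale to voxel indices
    idx = ((points - (np.asarray(center) - cube_size / 2)) / cube_size * grid).astype(int)
    # keep only points that fall inside the cube
    mask = np.all((idx >= 0) & (idx < grid), axis=1)
    vox[idx[mask, 0], idx[mask, 1], idx[mask, 2]] = 1.0
    return vox

cloud = np.array([[0.0, 0.0, 400.0], [10.0, -5.0, 410.0], [500.0, 0.0, 400.0]])
v = voxelize(cloud, center=(0.0, 0.0, 400.0))
print(v.sum())  # 2.0 -- the far-away point falls outside the cube
```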
17. Why Voxel-to-Voxel (V2V) is better?
Tompson et al. [1] argued that the mapping from an image to keypoint coordinates is highly non-linear
Supervising the network with a per-pixel likelihood (2D heatmap) gave more accurate results
Most 2D human pose estimation methods learn to estimate 2D heatmaps (called detection-based)
Our model estimates a per-voxel likelihood (3D heatmap) instead of 3D coordinates
Fig. Overall architecture of Tompson et al. [1]
[1] Tompson, Jonathan J., et al. "Joint training of a convolutional network and a graphical model for human pose estimation." Advances in neural information processing systems. 2014.
18. Generating Input of the V2V-PoseNet
Conventional strategy for input generation:
Depth map from a dataset → depth thresholding and computing the center-of-mass (CoM) → draw a fixed-size cubic box around the CoM → project the cubic box onto the 2D image and crop the hand region
Some problems remain:
Simple depth thresholding can exclude some parts of the hand or human body
In contrast to regression-based methods (coordinate estimation), detection-based methods (heatmap estimation) cannot recover the excluded parts
Fig. Cropping around the CoM: the thumb is contained in the bounding box
19. Generating Input of the V2V-PoseNet
We refine the estimated CoM using a simple network [1]
The network takes the depth image cropped by the conventional method and outputs offsets to the correct CoM
The depth image is converted to a 3D point cloud, and the hand is cropped in the voxelized 3D space by placing a fixed-size cube around the refined CoM
[1] Oberweger, Markus, and Vincent Lepetit. "Deepprior++: Improving fast and accurate 3d hand pose estimation." ICCV workshop. Vol. 840. 2017
Fig. Effect of the CoM refinement: the hand cropped following the conventional protocol (CoM = (x, y, z)) is forwarded through the CoM refinement network to obtain the refined CoM = (x-0.8, y+0.1, z+0.3); the hand is then cropped in 3D space and voxelized to form the input of the V2V-PoseNet
20. Network Design
Fully convolutional 3D CNN
Takes a voxelized depth map and estimates a per-voxel likelihood (3D heatmap) for each keypoint
The encoder-decoder structure enables the model to exploit multi-scale information
21. Network Design
Volumetric BasicBlock: 3D Conv + 3D BN + ReLU
Volumetric ResBlock: extended 2D Resblock [1] to 3D
Volumetric DownSamplingBlock: 3D Max-pooling
Volumetric UpSamplingBlock: 3D Deconv + 3D BN + ReLU
[1] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
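The four building blocks above can be sketched in PyTorch (the slides mention a planned PyTorch reimplementation); the channel counts and kernel sizes below are illustrative examples, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class VolumetricBasicBlock(nn.Sequential):       # 3D Conv + 3D BN + ReLU
    def __init__(self, cin, cout, k=7):
        super().__init__(nn.Conv3d(cin, cout, k, padding=k // 2),
                         nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

class VolumetricResBlock(nn.Module):             # 2D ResBlock [1] extended to 3D
    def __init__(self, cin, cout):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(cin, cout, 3, padding=1), nn.BatchNorm3d(cout),
            nn.ReLU(inplace=True),
            nn.Conv3d(cout, cout, 3, padding=1), nn.BatchNorm3d(cout))
        self.skip = (nn.Identity() if cin == cout else
                     nn.Sequential(nn.Conv3d(cin, cout, 1), nn.BatchNorm3d(cout)))

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class VolumetricUpSamplingBlock(nn.Sequential):  # 3D Deconv + 3D BN + ReLU
    def __init__(self, cin, cout):
        super().__init__(nn.ConvTranspose3d(cin, cout, 2, stride=2),
                         nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

# VolumetricDownSamplingBlock is just 3D max-pooling:
down = nn.MaxPool3d(2)

x = torch.zeros(1, 1, 16, 16, 16)   # toy size; the real input grid is 88^3
y = down(VolumetricResBlock(32, 32)(VolumetricBasicBlock(1, 32)(x)))
print(y.shape)                       # torch.Size([1, 32, 8, 8, 8])
up = VolumetricUpSamplingBlock(32, 16)(y)
print(up.shape)                      # torch.Size([1, 16, 16, 16, 16])
```

Downsampling halves each spatial dimension while the deconvolution restores it, which is what lets the encoder-decoder exchange resolution for channels.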
22. Network Design
The encoder decreases resolution while increasing the number of channels
The decoder increases resolution while decreasing the number of channels
Effectively extracts multi-scale information (downsampling-upsampling structure)
Efficiently enlarges the receptive field (Volumetric DownSamplingBlock)
23. Network Design
A 3D CNN consumes a lot of GPU memory → careful architecture design is required
Increasing the number of all feature maps consumes too much memory
We increased the number of feature maps only for the downsampled feature maps → a trade-off between the memory limitation and performance
This decreases the error by 1.53 mm on the NYU dataset
24. Network Design
The hourglass network [2] uses simple nearest-neighbor (NN) upsampling
We use the VolumetricUpSamplingBlock (3D Deconv + 3D BN + ReLU) instead of NN upsampling → the error decreases
Skip connections help upsample the feature maps more stably → the error decreases
[2] A. Newell, K. Yang, and J. Deng. "Stacked hourglass networks for human pose estimation." ECCV 2016.
25. Implementation Details
The ground-truth 3D heatmap is generated with the mean of the Gaussian peak positioned at the ground-truth joint location:
• H*_n(i, j, k) = exp(−((i − i_n)² + (j − j_n)² + (k − k_n)²) / (2σ²))
• H*_n is the ground-truth 3D heatmap of the nth keypoint, and (i_n, j_n, k_n) is the ground-truth voxel coordinate of the nth keypoint.
Mean squared error is adopted as the loss function:
• L = Σ_{n=1}^{N} Σ_{i,j,k} (H*_n(i, j, k) − H_n(i, j, k))²
• H*_n and H_n are the ground-truth and estimated heatmaps for the nth keypoint, respectively, and N denotes the number of keypoints.
88×88×88 voxel grid is fed to the network with data augmentation
• Rotation: [-40, 40] degrees in XY space
• Scaling: [0.8, 1.2] in XYZ space
• Translation: [-8, 8] voxels in 3D space
Implemented under the Torch7 framework (will be reimplemented in PyTorch)
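The heatmap target and loss defined above can be sketched as follows; the heatmap grid size and σ below are illustrative assumptions, not values stated in the slides:

```python
import numpy as np

def gt_heatmap(center, grid=44, sigma=1.7):
    """Ground-truth 3D heatmap: Gaussian peak at the keypoint's voxel coord.

    grid and sigma are assumed values for illustration.
    """
    i, j, k = np.meshgrid(*[np.arange(grid)] * 3, indexing='ij')
    ci, cj, ck = center
    return np.exp(-((i - ci)**2 + (j - cj)**2 + (k - ck)**2) / (2 * sigma**2))

def mse_loss(gt, pred):
    """L = sum over voxels of squared heatmap differences (one keypoint)."""
    return ((gt - pred) ** 2).sum()

H = gt_heatmap((10, 20, 30))
print(H[10, 20, 30])          # 1.0 -- the Gaussian peaks at the keypoint voxel
print(mse_loss(H, H))         # 0.0 -- perfect prediction gives zero loss
```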
26. Datasets
ICVL Hand Posture Dataset
• 330K training and 1.6K testing depth images
• 10 different subjects
NYU Hand Pose Dataset
• 72K training and 8.2K testing depth images
MSRA Hand Pose Dataset
• 76K depth images from 9 subjects with 17 gestures
• Leave-one-subject-out cross-validation
HANDS 2017 Frame-based 3D Hand Pose Estimation Challenge Dataset
• 957K training and 295K testing depth images
• Five subjects in the training set and ten subjects in the testing set
ITOP Human Pose Dataset
• 40K training and 10K testing depth images of 20 subjects
• Front-view, top-view
Fig. Sample depth images from the NYU, ICVL, MSRA, and ITOP datasets
27. Evaluation Metrics
3D distance error
• Euclidean distance between the estimated keypoint and ground-truth coordinates in 3D space
Percentage of success frames
• Success frame: the 3D distance errors of all keypoints are below a threshold
• Ratio of success frames to the whole set of test frames
mAP based on the 10 cm rule
• An estimated keypoint is considered correct if its 3D distance to the ground truth is less than 10 cm
• Used for 3D human pose estimation
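The "percentage of success frames" metric can be sketched directly from its definition (a frame succeeds only if every keypoint is within the threshold); array shapes and the 10 mm threshold below are illustrative:

```python
import numpy as np

def percent_success_frames(pred, gt, thresh):
    """Percentage of frames whose worst per-keypoint 3D error is below thresh.

    pred, gt: (F, K, 3) arrays of 3D keypoint coordinates in mm.
    """
    err = np.linalg.norm(pred - gt, axis=2)   # (F, K) per-keypoint 3D errors
    success = err.max(axis=1) < thresh        # frame succeeds only if ALL
    return 100.0 * success.mean()             # keypoints are within thresh

gt = np.zeros((2, 3, 3))      # 2 frames, 3 keypoints
pred = gt.copy()
pred[1, 0, 0] = 25.0          # one keypoint in frame 1 is off by 25 mm
print(percent_success_frames(pred, gt, thresh=10.0))  # 50.0
```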
28. Computational Complexity
Training time
• 2 days for the ICVL dataset (330K training images)
• 12 hours for the NYU and MSRA datasets (~70K training images each)
• 6 days for the HANDS 2017 challenge dataset (957K training images)
• 3 hours for the ITOP dataset (40K training images)
Testing time
• 35 fps on a single-GPU machine (NVIDIA TITAN X, without ensemble)
• Fast enough for real-time use in real-world applications
• Input generation (ref.pt refinement + voxelizing): 23 ms (most of the time is spent on voxelizing)
• Network forwarding: 5 ms
• Extracting 3D coordinates from the 3D heatmaps: 0.5 ms
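The final step, extracting 3D coordinates from the 3D heatmaps, amounts to taking the per-voxel likelihood peak and warping it back to world coordinates; the cube corner and voxel size below are hypothetical parameters of the voxelization:

```python
import numpy as np

def heatmap_to_coords(heatmaps, cube_corner, voxel_size):
    """Take the argmax voxel of each keypoint heatmap and map it to world mm.

    heatmaps: (K, D, D, D); cube_corner: world coordinate of voxel (0, 0, 0);
    voxel_size: mm per voxel (both hypothetical reconstruction parameters).
    """
    coords = []
    for h in heatmaps:
        idx = np.unravel_index(h.argmax(), h.shape)  # per-voxel likelihood peak
        coords.append(np.asarray(idx) * voxel_size + cube_corner)
    return np.stack(coords)

hm = np.zeros((1, 44, 44, 44))
hm[0, 22, 22, 22] = 1.0       # peak at the center voxel
out = heatmap_to_coords(hm, cube_corner=np.array([-150.0, -150.0, 250.0]),
                        voxel_size=300.0 / 44)
print(out)                    # [[0. 0. 400.]] -- the cube center in world mm
```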
29. Ablation Study
Effect of the input and output representation
Converting the 2D depth map to a 3D voxelized grid improves performance
Estimating the per-voxel likelihood (3D heatmap) gives more accurate estimation than directly regressing 3D coordinates
The table shows the benefit of the volumetric input and output representations
Table. Performance and number of parameters according to the input and output type
30. Ablation Study
Fig. Ref.pt refinement network (localization refinement)
The epoch ensemble averages estimations from several epochs
In contrast to other ensemble techniques, it ensembles models from a single training run
We used the models from all epochs (10 epochs) for the ensemble
In a multi-GPU environment, it does not increase running time
Gives more accurate and robust estimation
Fig. Effect of the localization refinement
46. Qualitative Results
MSRA dataset (grouped by gesture): Frame-based
Video
47. Conclusion
We proposed a novel and powerful network, V2V-PoseNet, for 3D hand and human pose estimation from a single depth map
Converted the 2D depth map into a 3D voxel representation and estimated a per-voxel likelihood (3D heatmap) for each keypoint instead of directly regressing 3D coordinates
Significantly outperformed almost all existing methods on almost all 3D hand and human pose estimation datasets
Achieved 1st place in the HANDS 2017 frame-based 3D hand pose estimation challenge
Future work: learning physical constraints via a generative approach and improving the encoder-decoder for multi-scale information
Code is available: https://github.com/mks0601/V2V-PoseNet_RELEASE