This paper targets accurate 3D hand pose estimation from a single depth map. 3D hand pose estimation is a crucial technique for realizing HCI, AR, and related technologies. Many researchers have proposed methods to improve accuracy, but accuracy has remained limited by the similar appearance of fingers, occlusions, and the complexity caused by diverse finger motions. To overcome these limitations, this work changes the input and output representations used by existing methods. Unlike most previous methods, which take a 2D depth image and directly regress the 3D coordinates of hand joints, the proposed model takes a 3D voxelized depth map as input and outputs 3D heatmaps. An encoder-decoder 3D CNN is used for this, and thanks to the changed input and output representations, the proposed model achieves the highest performance on three widely used 3D hand pose estimation datasets and one 3D human pose estimation dataset. It also won the HANDS 2017 challenge held at ICCV 2017.
1. V2V-PoseNet:
Voxel-to-Voxel Prediction Network for
Accurate 3D Hand and Human Pose
Estimation from a Single Depth Map
Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee
Computer Vision Lab.
Dept. of ECE, ASRI,
Seoul National University
http://cv.snu.ac.kr
Aug 29, 2018
Invited Talk @ NAVER
Winner of the 2017 Hands in the Million Challenge on 3D Hand Pose Estimation
2. Intelligent and Invisible Computing 2
The 2017 Hands in the Million Challenge on
3D Hand Pose Estimation
3. HANDS 2017 3D Hand Pose Estimation Challenge
We won the challenge! (ranked 1st among 15 entries)
Frame-based 3D Hand Pose Estimation
V2V-PoseNet
4. 3D Hand Pose Estimation
Goal: Localize hand keypoints (joints) from a single depth map
Fig. 3D hand model: 21 keypoints (joints)
5. 3D Hand Pose Estimation
Still a hot topic: more than 16,000 publications over the last 5 years
6. Applications
Oculus Rift
Microsoft HoloLens
Crucial Technique for HCI and AR
7. What are the Challenges?
Diverse geometric (shape) variations
Weak appearance features
Heavy self occlusions
Self similarity
Noise
8. Previous works for 3D Hand Pose Estimation
Generative approaches
• Assume a pre-defined hand model and fit it to the input depth image
• Use PSO or ICP to minimize a hand-crafted cost function
[1] C. Qian, et al. "Realtime and robust hand tracking from depth." CVPR 2014.
[2] Tang, Danhang, et al. "Opening the black box: Hierarchical sampling optimization for estimating human hand pose." ICCV 2015.
Fig. Finger detection and hand pose initialization [1]
Fig. Hierarchical sampling optimization using silver and gold energy [2]
9. Previous works for 3D Hand Pose Estimation
Discriminative approaches
• Directly localize keypoints from the input depth image without hand model
• Most of the random forest- and recent deep learning-based methods (including V2V-PoseNet)
Fig. Pose-REN [1]
[1] Chen, Xinghao, et al. "Pose Guided Structured Region Ensemble Network for Cascaded Hand Pose Estimation." Neurocomputing 2018.
[2] Ge, Liuhao, et al. "3d convolutional neural networks for efficient and robust hand pose estimation from single depth images." CVPR 2017.
[3] Ge, Liuhao, et al. "Robust 3d hand pose estimation in single depth images: from single-view cnn to multi-view cnns." CVPR 2016.
Fig. 3D CNN for hand pose estimation [2]
Fig. Multi-view CNN for hand pose estimation [3]
10. Previous works for 3D Hand Pose Estimation
Hybrid approaches
• Try to combine generative and discriminative approaches
• Learn latent space of pose (generative) and localize keypoints from the space (discriminative)
• Recent methods learned latent space successfully using adversarial loss
Fig. Learned latent space of CrossingNets [2]
[1] Zhou, Xingyi, et al. "Model-based deep hand pose estimation." IJCV 2016.
[2] Wan, Chengde, et al. "Crossing nets: Combining gans and vaes with a shared latent space for hand pose estimation." CVPR 2017.
Fig. Model-based hand pose estimation (DeepModel) [1]
11. Our Contributions
Firstly cast 3D hand and human pose estimation from a single depth map as voxel-to-voxel prediction
Empirically validated the usefulness of the volumetric input and output representations
Significantly outperformed existing methods on almost all 3D hand and human pose estimation datasets
Won first place in the HANDS 2017 frame-based 3D hand pose estimation challenge
12. Analysis of the Previous Works
Most of the previous works take a 2D depth image and directly regress 3D coordinates
P2C: [Chen et al. arXiv 2017], [CrossingNets. CVPR 2017], [DeepPrior++. ICCVW 2017], [Oberweger et al. ICCV 2015]
P2V: [Pavlakos et al. CVPR 2017]
V2C: [Ge et al. CVPR 2017], [Deng et al. arXiv 2017]
V2V: Ours
We argue that voxel-to-voxel prediction achieves more accurate results
13. Why Voxel-to-Voxel (V2V) is better?
Perspective distortion matters: what is perspective distortion?
14. Why Voxel-to-Voxel (V2V) is better?
Perspective distortion matters: what is perspective distortion?
(x_pixel, y_pixel) = (x_world, y_world) · FL / z_world + R₀
R₀: constant, FL: focal length (camera parameter), z_world: distance from the camera
Different distances from the camera cause the distortion
(u, v) = (x_pixel, y_pixel), (X, Y, Z) = (x_world, y_world, z_world)
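The projection formula above can be made concrete with a small sketch; the focal length and offset R₀ below are hypothetical values, used only to show that the same metric offset maps to fewer pixels at larger depths:

```python
# Pinhole projection from the formula above; fl and r0 are hypothetical values.
def world_to_pixel(x_w, y_w, z_w, fl=588.0, r0=320.0):
    """(x_pixel, y_pixel) = (x_world, y_world) * FL / z_world + R0."""
    return x_w * fl / z_w + r0, y_w * fl / z_w + r0

# The same 30 mm offset in X spans different pixel widths at different depths:
near = world_to_pixel(30.0, 0.0, 300.0)[0] - world_to_pixel(0.0, 0.0, 300.0)[0]
far = world_to_pixel(30.0, 0.0, 600.0)[0] - world_to_pixel(0.0, 0.0, 600.0)[0]
print(near, far)  # ~58.8 vs ~29.4 pixels -> perspective distortion
```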
15. Why Voxel-to-Voxel (V2V) is better?
Perspective distortion matters:
Fig. 3D-to-2D projection: equal offsets ±ΔX, ±ΔY in the 3D point cloud (world coordinates) have a 1-to-1 relation with the scene, but project to an N-to-1 relation in the 2D depth map (pixel coordinates)
16. Why Voxel-to-Voxel (V2V) is better?
We discretize the 3D point cloud into voxels
The voxelized 3D point cloud is free from perspective distortion
Voxelized input can be adapted to advanced CNN architectures (ResNet, U-Net) more easily than point cloud input
Fig. Voxelizing the 3D point cloud: both the point cloud and the voxels are in world coordinates, with a 1-to-1 relation between them
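The voxelization step can be sketched as below; the cube size (300 mm per side) is an assumed value, while the 88³ grid matches the input resolution mentioned later in the slides:

```python
import numpy as np

def voxelize(points, center, cube_size=300.0, grid=88):
    """Discretize a point cloud (N, 3, world mm) into a binary occupancy grid.

    A fixed-size cube (cube_size mm per side, hypothetical value) is placed
    around the reference point `center`; each point inside maps 1-to-1 to a
    voxel, so no perspective distortion is introduced.
    """
    vox = np.zeros((grid, grid, grid), dtype=np.float32)
    # shift to the cube corner and scale to voxel indices
    idx = ((points - (np.asarray(center) - cube_size / 2)) / cube_size * grid).astype(int)
    # keep only points that fall inside the cube
    mask = np.all((idx >= 0) & (idx < grid), axis=1)
    vox[idx[mask, 0], idx[mask, 1], idx[mask, 2]] = 1.0
    return vox

cloud = np.array([[0.0, 0.0, 400.0], [10.0, -5.0, 410.0], [500.0, 0.0, 400.0]])
v = voxelize(cloud, center=(0.0, 0.0, 400.0))
print(v.sum())  # 2.0 -- the far-away point falls outside the cube
```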
17. Why Voxel-to-Voxel (V2V) is better?
Tompson et al. [1] argued that the mapping from an image to keypoint coordinates is highly non-linear
Supervising the network with a per-pixel likelihood (2D heatmap) gave more accurate results
Most 2D human pose estimation methods learn to estimate 2D heatmaps (called detection-based)
Our model estimates a per-voxel likelihood (3D heatmap) instead of 3D coordinates
Fig. Overall architecture of Tompson et al. [1]
[1] Tompson, Jonathan J., et al. "Joint training of a convolutional network and a graphical model for human pose estimation." Advances in neural information processing systems. 2014.
18. Generating Input of the V2V-PoseNet
Conventional strategy for input generation:
Depth map from a dataset → depth thresholding and computing the center-of-mass (CoM) → draw a fixed-size cubic box around the CoM → project the cubic box onto the 2D image and crop the hand region
Some problems remain:
Simple depth thresholding can exclude some parts of the hand or human body
In contrast to regression-based methods (coordinate estimation), detection-based methods (heatmap estimation) cannot recover the excluded parts
Fig. Cropping around the CoM: the thumb is contained in the bounding box
19. Generating Input of the V2V-PoseNet
We refine the estimated CoM using a simple network [1]
The network takes the depth image cropped by the conventional method and outputs offsets to the correct CoM
The depth image is converted to a 3D point cloud, and the hand is cropped in the voxelized 3D space by placing a fixed-size cube around the refined CoM
[1] Oberweger, Markus, and Vincent Lepetit. "Deepprior++: Improving fast and accurate 3d hand pose estimation." ICCV workshop. Vol. 840. 2017
Fig. Effect of the CoM refinement: the hand cropped following the conventional protocol (CoM = (x, y, z)) is forwarded through the CoM refinement network to obtain the refined CoM = (x-0.8, y+0.1, z+0.3); the hand is then cropped in 3D space and voxelized to form the input of the V2V-PoseNet
20. Network Design
Fully convolutional 3D CNN
Takes a voxelized depth map and estimates a per-voxel likelihood (3D heatmap) for each keypoint
The encoder-decoder structure enables the model to exploit multi-scale information
21. Network Design
Volumetric BasicBlock: 3D Conv + 3D BN + ReLU
Volumetric ResBlock: extended 2D Resblock [1] to 3D
Volumetric DownSamplingBlock: 3D Max-pooling
Volumetric UpSamplingBlock: 3D Deconv + 3D BN + ReLU
[1] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
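The four building blocks above can be sketched in PyTorch (the slides mention a planned PyTorch reimplementation); the channel counts and kernel sizes below are illustrative examples, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class VolumetricBasicBlock(nn.Sequential):       # 3D Conv + 3D BN + ReLU
    def __init__(self, cin, cout, k=7):
        super().__init__(nn.Conv3d(cin, cout, k, padding=k // 2),
                         nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

class VolumetricResBlock(nn.Module):             # 2D ResBlock [1] extended to 3D
    def __init__(self, cin, cout):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(cin, cout, 3, padding=1), nn.BatchNorm3d(cout),
            nn.ReLU(inplace=True),
            nn.Conv3d(cout, cout, 3, padding=1), nn.BatchNorm3d(cout))
        self.skip = (nn.Identity() if cin == cout else
                     nn.Sequential(nn.Conv3d(cin, cout, 1), nn.BatchNorm3d(cout)))

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class VolumetricUpSamplingBlock(nn.Sequential):  # 3D Deconv + 3D BN + ReLU
    def __init__(self, cin, cout):
        super().__init__(nn.ConvTranspose3d(cin, cout, 2, stride=2),
                         nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

# VolumetricDownSamplingBlock is just 3D max-pooling:
down = nn.MaxPool3d(2)

x = torch.zeros(1, 1, 16, 16, 16)   # toy size; the real input grid is 88^3
y = down(VolumetricResBlock(32, 32)(VolumetricBasicBlock(1, 32)(x)))
print(y.shape)                       # torch.Size([1, 32, 8, 8, 8])
up = VolumetricUpSamplingBlock(32, 16)(y)
print(up.shape)                      # torch.Size([1, 16, 16, 16, 16])
```

Downsampling halves each spatial dimension while the deconvolution restores it, which is what lets the encoder-decoder exchange resolution for channels.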
22. Network Design
The encoder decreases resolution while increasing the number of channels
The decoder increases resolution while decreasing the number of channels
Effectively extracts multi-scale information (downsampling-upsampling structure)
Efficiently enlarges the receptive field (Volumetric DownSamplingBlock)
23. Network Design
A 3D CNN consumes a lot of GPU memory → careful architecture design is required
Increasing the number of all feature maps consumes too much memory
We increased the number of feature maps only for the downsampled feature maps → a trade-off between the memory limitation and performance
This decreases the error by 1.53 mm on the NYU dataset
24. Network Design
The hourglass network [2] uses simple nearest-neighbor (NN) upsampling
We use the VolumetricUpSamplingBlock (3D Deconv + 3D BN + ReLU) instead of NN upsampling → the error decreases
Skip connections help upsample the feature maps more stably → the error decreases
[2] A. Newell, K. Yang, and J. Deng. "Stacked hourglass networks for human pose estimation." ECCV 2016.
25. Implementation Details
The ground-truth 3D heatmap is generated with the mean of the Gaussian peak positioned at the ground-truth joint location:
• H*_n(i, j, k) = exp(−((i − i_n)² + (j − j_n)² + (k − k_n)²) / (2σ²))
• H*_n is the ground-truth 3D heatmap of the nth keypoint, and (i_n, j_n, k_n) is the ground-truth voxel coordinate of the nth keypoint.
Mean squared error is adopted as the loss function:
• L = Σ_{n=1}^{N} Σ_{i,j,k} (H*_n(i, j, k) − H_n(i, j, k))²
• H*_n and H_n are the ground-truth and estimated heatmaps for the nth keypoint, respectively, and N denotes the number of keypoints.
88×88×88 voxel grid is fed to the network with data augmentation
• Rotation: [-40, 40] degrees in XY space
• Scaling: [0.8, 1.2] in XYZ space
• Translation: [-8, 8] voxels in 3D space
Implemented under the Torch7 framework (will be reimplemented in PyTorch)
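The heatmap target and loss defined above can be sketched as follows; the heatmap grid size and σ below are illustrative assumptions, not values stated in the slides:

```python
import numpy as np

def gt_heatmap(center, grid=44, sigma=1.7):
    """Ground-truth 3D heatmap: Gaussian peak at the keypoint's voxel coord.

    grid and sigma are assumed values for illustration.
    """
    i, j, k = np.meshgrid(*[np.arange(grid)] * 3, indexing='ij')
    ci, cj, ck = center
    return np.exp(-((i - ci)**2 + (j - cj)**2 + (k - ck)**2) / (2 * sigma**2))

def mse_loss(gt, pred):
    """L = sum over voxels of squared heatmap differences (one keypoint)."""
    return ((gt - pred) ** 2).sum()

H = gt_heatmap((10, 20, 30))
print(H[10, 20, 30])          # 1.0 -- the Gaussian peaks at the keypoint voxel
print(mse_loss(H, H))         # 0.0 -- perfect prediction gives zero loss
```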
26. Datasets
ICVL Hand Posture Dataset
• 330K training and 1.6K testing depth images
• 10 different subjects
NYU Hand Pose Dataset
• 72K training and 8.2K testing depth images
MSRA Hand Pose Dataset
• 76K depth images from 9 subjects with 17 gestures
• Leave-one-subject-out cross-validation
HANDS 2017 Frame-based 3D Hand Pose Estimation Challenge Dataset
• 957K training and 295K testing depth images
• Five subjects in the training set and ten subjects in the testing set
ITOP Human Pose Dataset
• 40K training and 10K testing depth images of 20 subjects
• Front-view, top-view
Fig. Sample depth images from the NYU, ICVL, MSRA, and ITOP datasets
27. Evaluation Metrics
3D distance error
• Euclidean distance between the estimated keypoint and ground-truth coordinates in 3D space
Percentage of success frames
• Success frame: the 3D distance errors of all keypoints are below a threshold
• Ratio of success frames to the whole set of test frames
mAP based on the 10 cm rule
• An estimated keypoint is considered correct if its 3D distance to the ground truth is less than 10 cm
• Used for 3D human pose estimation
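The "percentage of success frames" metric can be sketched directly from its definition (a frame succeeds only if every keypoint is within the threshold); array shapes and the 10 mm threshold below are illustrative:

```python
import numpy as np

def percent_success_frames(pred, gt, thresh):
    """Percentage of frames whose worst per-keypoint 3D error is below thresh.

    pred, gt: (F, K, 3) arrays of 3D keypoint coordinates in mm.
    """
    err = np.linalg.norm(pred - gt, axis=2)   # (F, K) per-keypoint 3D errors
    success = err.max(axis=1) < thresh        # frame succeeds only if ALL
    return 100.0 * success.mean()             # keypoints are within thresh

gt = np.zeros((2, 3, 3))      # 2 frames, 3 keypoints
pred = gt.copy()
pred[1, 0, 0] = 25.0          # one keypoint in frame 1 is off by 25 mm
print(percent_success_frames(pred, gt, thresh=10.0))  # 50.0
```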
28. Computational Complexity
Training time
• 2 days for the ICVL dataset (330K training images)
• 12 hours for the NYU and MSRA datasets (~70K training images each)
• 6 days for the HANDS 2017 challenge dataset (957K training images)
• 3 hours for the ITOP dataset (40K training images)
Testing time
• 35 fps on a single-GPU machine (NVIDIA TITAN X, without ensemble)
• Fast enough for real-time use in real-world applications
• Input generation (ref.pt refinement + voxelizing): 23 ms (most of the time is spent on voxelizing)
• Network forwarding: 5 ms
• Extracting 3D coordinates from the 3D heatmaps: 0.5 ms
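The final step, extracting 3D coordinates from the 3D heatmaps, amounts to taking the per-voxel likelihood peak and warping it back to world coordinates; the cube corner and voxel size below are hypothetical parameters of the voxelization:

```python
import numpy as np

def heatmap_to_coords(heatmaps, cube_corner, voxel_size):
    """Take the argmax voxel of each keypoint heatmap and map it to world mm.

    heatmaps: (K, D, D, D); cube_corner: world coordinate of voxel (0, 0, 0);
    voxel_size: mm per voxel (both hypothetical reconstruction parameters).
    """
    coords = []
    for h in heatmaps:
        idx = np.unravel_index(h.argmax(), h.shape)  # per-voxel likelihood peak
        coords.append(np.asarray(idx) * voxel_size + cube_corner)
    return np.stack(coords)

hm = np.zeros((1, 44, 44, 44))
hm[0, 22, 22, 22] = 1.0       # peak at the center voxel
out = heatmap_to_coords(hm, cube_corner=np.array([-150.0, -150.0, 250.0]),
                        voxel_size=300.0 / 44)
print(out)                    # [[0. 0. 400.]] -- the cube center in world mm
```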
29. Ablation Study
Effect of the input and output representation
Converting the 2D depth map to a 3D voxelized grid improves performance
Estimating the per-voxel likelihood (3D heatmap) gives more accurate estimation than directly regressing 3D coordinates
The table shows the benefit of the volumetric input and output representations
Table. Performance and number of parameters according to the input and output type
30. Ablation Study
Fig. Ref.pt refinement network (localization refinement)
The epoch ensemble averages estimations from several epochs
In contrast to other ensemble techniques, it ensembles models from a single training run
We used the models from all epochs (10 epochs) for the ensemble
In a multi-GPU environment, it does not increase running time
Gives more accurate and robust estimation
Fig. Effect of the localization refinement
46. Qualitative Results
MSRA dataset (grouped by gesture): Frame-based
Video
47. Conclusion
We proposed a novel and powerful network, V2V-PoseNet, for 3D hand and human pose estimation from a single depth map
Converted the 2D depth map into a 3D voxel representation and estimated a per-voxel likelihood (3D heatmap) for each keypoint instead of directly regressing 3D coordinates
Significantly outperformed almost all existing methods on almost all 3D hand and human pose estimation datasets
Achieved 1st place in the HANDS 2017 frame-based 3D hand pose estimation challenge
Future work: learning physical constraints via a generative approach and improving the encoder-decoder for multi-scale information
Code is available: https://github.com/mks0601/V2V-PoseNet_RELEASE