3. 3
Agenda
[Wandt+ CVPR'19] RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation
[Habibie+ CVPR'19] In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations
[Chen.C+ CVPR'19] Unsupervised 3D Pose Estimation with Geometric Self-Supervision
[Chen.X+ CVPR'19] Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation
[Kocabas+ CVPR'19] Self-Supervised Learning of 3D Human Pose using Multi-view Geometry
[Moon+ ICCV'19] Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image
[Pavllo+ CVPR'19] 3D human pose estimation in video with temporal convolutions and semi-supervised training
[Figure: the papers above arranged along three axes: Image / Video, Single-Person / Multi-Person, and Single-View / Multi-View]
4. 4
Agenda
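• Approaches that estimate 3D pose from a single-view image
• Propose training techniques built on converting in both directions: 2D→3D estimation and 3D→2D projection
• 2D pose: comparatively easy to obtain with high accuracy
• 3D pose: easy to convert to 2D given the camera parameters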
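5. • Learns 3D pose by combining ① a learned 2D→3D conversion and ② projection to 2D from the 3D pose plus camera parameters
• Adversarial training improves the quality of the intermediate 3D representation
RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation (1/2)
① Generate a 3D pose and a camera pose from the 2D pose
② Reconstruct the 2D pose from the 3D pose and the camera pose, and minimize the error
The 3D pose is trained adversarially with WGAN-GP.
KCS (Kinematic Chain Space) features, which explicitly account for the structure of the human body, are also added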
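The reprojection step (② above) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a 2x3 weak-perspective camera matrix and replaces the lifting network with hand-made arrays; the names `reprojection_error`, `camera`, and `wrong_camera` are hypothetical.

```python
import numpy as np

def reprojection_error(pose_2d, pose_3d, camera):
    """Frobenius norm between an observed 2D pose and a reprojected 3D pose.

    pose_2d: (2, J) observed 2D joints
    pose_3d: (3, J) estimated 3D joints
    camera:  (2, 3) estimated camera matrix (assumption: weak perspective)
    """
    reprojected = camera @ pose_3d  # project the 3D joints back to 2D
    return float(np.linalg.norm(pose_2d - reprojected))

rng = np.random.default_rng(0)
pose_3d = rng.normal(size=(3, 16))          # stand-in for the lifted 3D pose
camera = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])        # looks along the z axis
pose_2d = camera @ pose_3d                  # the "observed" 2D pose

perfect = reprojection_error(pose_2d, pose_3d, camera)

# A camera that looks along the x axis instead no longer explains the 2D input.
wrong_camera = np.array([[0.0, 0.0, 1.0],
                         [0.0, 1.0, 0.0]])
mismatch = reprojection_error(pose_2d, pose_3d, wrong_camera)
```

Minimizing this error only constrains the 3D pose up to whatever the camera can explain, which is why the slide pairs it with an adversarial loss on the 3D pose itself.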
6. • Cannot beat fully supervised methods, but is SoTA among weakly supervised (WS) methods
• Confirms that learning the 3D pose with KCS + a discriminator is effective
RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation (2/2)
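As an illustration of why KCS features carry body-structure information, the sketch below builds a KCS matrix for a toy skeleton. It assumes the usual construction (bone vectors B = X C, KCS matrix = BᵀB); the 5-joint skeleton and the names `BONES`, `bone_mapping_matrix`, and `kcs_matrix` are hypothetical.

```python
import numpy as np

# Hypothetical 5-joint skeleton: 0 pelvis, 1 neck, 2 head, 3 left hand, 4 right hand.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]  # (parent, child) pairs

def bone_mapping_matrix(num_joints, bones):
    """Matrix C such that pose_3d @ C stacks one bone vector per column."""
    mapping = np.zeros((num_joints, len(bones)))
    for b, (parent, child) in enumerate(bones):
        mapping[parent, b] = -1.0
        mapping[child, b] = 1.0
    return mapping

def kcs_matrix(pose_3d, mapping):
    """pose_3d: (3, J). Returns the (num_bones, num_bones) matrix B^T B."""
    bone_vectors = pose_3d @ mapping       # (3, num_bones)
    return bone_vectors.T @ bone_vectors   # inner products between bones

pose = np.array([[0.0, 0.0, 0.0, 2.0, -1.0],   # x
                 [0.0, 1.0, 3.0, 1.0, 1.0],    # y
                 [0.0, 0.0, 0.0, 0.0, 0.0]])   # z
C = bone_mapping_matrix(5, BONES)
psi = kcs_matrix(pose, C)
squared_bone_lengths = np.diag(psi)  # diagonal holds squared bone lengths

# Rotating the whole pose leaves the KCS matrix unchanged.
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])
psi_rotated = kcs_matrix(rot_z_90 @ pose, C)
```

The diagonal encodes bone lengths and the off-diagonal entries encode angles between bones, so a discriminator that sees this matrix can judge plausibility independently of the global orientation.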
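7. • Like RepNet [Wandt+ CVPR'19], estimates a 3D pose and a camera pose from the 2D pose, reprojects them to 2D, and trains to minimize the reprojection error
In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations (1/2)
Difference ①
Explicitly separates the 2D pose information from the depth information (d) related to the 3D pose
→ More robust to variations such as changes in the appearance of the input image
Difference ②
When ground-truth labels exist, the 3D pose is trained with supervision (bone lengths are also taken into account)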
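Difference ② can be sketched as a loss that is active only for samples with 3D labels and that adds a bone-length term. The exact form of the loss is not given on the slide; the squared-error terms, the `bone_weight` parameter, and the 3-joint skeleton below are assumptions for illustration only.

```python
import numpy as np

BONES = [(0, 1), (1, 2)]  # hypothetical (parent, child) pairs for a 3-joint chain

def bone_lengths(pose_3d, bones):
    """pose_3d: (J, 3). Returns one Euclidean length per bone."""
    return np.array([np.linalg.norm(pose_3d[c] - pose_3d[p]) for p, c in bones])

def supervised_3d_loss(pred, target, has_3d_label, bones, bone_weight=1.0):
    """Joint-position error plus bone-length error; zero when no 3D label exists."""
    if not has_3d_label:
        return 0.0  # in-the-wild samples contribute no 3D supervision
    joint_term = np.mean(np.sum((pred - target) ** 2, axis=1))
    bone_term = np.mean((bone_lengths(pred, bones) - bone_lengths(target, bones)) ** 2)
    return float(joint_term + bone_weight * bone_term)

target = np.array([[0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 2.0, 0.0]])
shifted = target + np.array([1.0, 0.0, 0.0])  # same bones, wrong position

loss_exact = supervised_3d_loss(target, target, True, BONES)
loss_shifted = supervised_3d_loss(shifted, target, True, BONES)
loss_unlabeled = supervised_3d_loss(shifted, target, False, BONES)
```

In this toy case the shifted pose is penalized only by the joint term, because a rigid shift leaves every bone length unchanged.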
8. • Experimental results
In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations (2/2)
SoTA on MPI-INF-3DHP
Not SoTA on Human3.6M, but competitive
(The authors argue that the method shows its strength precisely on in-the-wild datasets)
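12. • Sets a new SoTA among unsupervised approaches
• Ablation study (right)
– Adversarial loss (Adv), self-supervised learning in 2D/3D (SS), domain adaptation (DA), and feeding temporal information to the discriminator (TD)
– Using all of them gives the best performance
Unsupervised 3D Pose Estimation with Geometric Self-Supervision (4/4)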
13. 13
Agenda
• Learns representations for 3D pose estimation from multi-view images, using conversion between viewpoints and 3D labels created via camera geometry
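14. • Learns a latent representation useful for 3D pose estimation from multi-view images
– Learning the latent representation in pose space is more robust than learning it directly in image space [Rhodin+ ECCV'18]
Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation (1/2)
① Generate a 2D pose from the image of each view
② Apply to the latent representation a rotation matrix based on the positional relationship between the views
③ Generate the 2D pose of the opposite view → minimize the error
④ Minimize the error between latent representations so that the representation stays consistent
The camera extrinsic parameters (the positional relationship between the views) are assumed to be known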
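Steps ② and ④ above can be sketched with plain arrays. This is only an illustration under assumptions: the latent code is treated as a set of N 3D points, the encoder and decoder are omitted, and the names `rotation_about_y`, `rotate_latent`, and `latent_consistency_loss` are hypothetical.

```python
import numpy as np

def rotation_about_y(angle_rad):
    """3x3 rotation matrix about the vertical (y) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def rotate_latent(latent, rotation):
    """latent: (N, 3), read as N 3D points. Applies the rotation to every point."""
    return latent @ rotation.T

def latent_consistency_loss(latent_i, latent_j, rotation_i_to_j):
    """Mean squared gap between view i's latent rotated into view j and view j's latent."""
    return float(np.mean((rotate_latent(latent_i, rotation_i_to_j) - latent_j) ** 2))

rng = np.random.default_rng(0)
latent_i = rng.normal(size=(64, 3))          # stand-in for the encoder output of view i
R_ij = rotation_about_y(np.pi / 2)           # known relative rotation (camera extrinsics)
latent_j = rotate_latent(latent_i, R_ij)     # an ideally consistent latent for view j

consistent = latent_consistency_loss(latent_i, latent_j, R_ij)
inconsistent = latent_consistency_loss(latent_i, latent_j, np.eye(3))
recovered_i = rotate_latent(latent_j, R_ij.T)  # rotating back recovers view i
```

Because the rotation comes from known extrinsics, the only free quantity is the latent code, which is what pushes it to behave like a 3D structure.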
18. 18
Agenda
• Proposes a technique for estimating people's absolute positions (depth), which is the key difficulty when estimating the 3D poses of multiple people (multi-person)
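19. • Proposes a method for multi-person 3D pose estimation
• Typical 3D pose estimation models fix a reference joint (root), such as the pelvis, that anchors the person's position in space, and express each joint's coordinates relative to that root
• Estimating the poses of multiple people also requires estimating where each person is in space, i.e. the absolute coordinates of the root
Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image (1/6)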
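The root-relative representation described above, and why it loses a person's location, can be shown in a few lines. The joint layout, the choice of index 0 as the root, and the function names are assumptions for illustration.

```python
import numpy as np

ROOT_INDEX = 0  # assumption: joint 0 is the pelvis (root)

def to_root_relative(absolute_joints, root_index=ROOT_INDEX):
    """absolute_joints: (J, 3). Returns (relative joints, root position)."""
    root = absolute_joints[root_index].copy()
    return absolute_joints - root, root

def to_absolute(relative_joints, root):
    """Inverse of to_root_relative: shift every joint by the root position."""
    return relative_joints + root

# Two people in the same pose, standing at different places (units: mm).
pose = np.array([[0.0, 0.0, 0.0],
                 [0.0, -500.0, 0.0],
                 [0.0, -700.0, 50.0]])
person_a = pose + np.array([-800.0, 0.0, 3000.0])
person_b = pose + np.array([600.0, 0.0, 5000.0])

rel_a, root_a = to_root_relative(person_a)
rel_b, root_b = to_root_relative(person_b)
restored_b = to_absolute(rel_b, root_b)
```

The two relative poses are identical even though the people stand two meters apart in depth, so a multi-person method has to estimate each root separately.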
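20. • Proposes a pipeline of three networks
1. DetectNet: detects the people in the image and crops them
2. RootNet: estimates the absolute coordinates of the root from the person image
3. PoseNet: estimates each joint's position relative to the root from the person image
Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image (2/6)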
21. • Proposes a pipeline of three networks
1. DetectNet: detects the people in the image and crops them → Mask R-CNN [He+ ICCV'17]
2. RootNet: estimates the absolute coordinates of the root from the person image
3. PoseNet: estimates each joint's position relative to the root from the person image → [Sun+ ECCV'18]
Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image (3/6)
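The data flow of the three-stage pipeline can be sketched as below. The three networks are replaced by hypothetical stand-in functions (`fake_detect_net`, `fake_root_net`, `fake_pose_net`) that return fixed values, and all outputs are treated as camera-space coordinates for simplicity; none of this reflects the real models' interfaces.

```python
import numpy as np

def estimate_multi_person_poses(image, detect_net, root_net, pose_net):
    """Top-down pipeline: detect each person, then add the root to the relative pose."""
    poses = []
    for (x, y, w, h) in detect_net(image):
        crop = image[y:y + h, x:x + w]            # 1. crop the detected person
        root_xyz = root_net(crop, (x, y, w, h))   # 2. absolute root coordinates
        relative = pose_net(crop)                 # 3. joints relative to the root
        poses.append(relative + root_xyz)
    return poses

# Hypothetical stand-ins for the three networks.
def fake_detect_net(image):
    return [(10, 20, 50, 100), (200, 30, 40, 90)]  # (x, y, w, h) boxes

def fake_root_net(crop, bbox):
    x, y, w, h = bbox
    return np.array([x + w / 2.0, y + h / 2.0, 3000.0])

def fake_pose_net(crop):
    relative = np.zeros((17, 3))
    relative[1] = [0.0, -100.0, 0.0]  # one joint above the root
    return relative

image = np.zeros((480, 640, 3))
poses = estimate_multi_person_poses(image, fake_detect_net, fake_root_net, fake_pose_net)
```

Keeping the stages as separate callables mirrors the slide's point that the detector and the pose estimator can be existing models, with only the root estimation being new.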
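22. • RootNet: estimates the person's root R = (x_R, y_R, Z_R) in the camera coordinate system
• The 2D coordinates x_R, y_R are easy to estimate, but the 3D depth (Z_R) is not
• Approximates the depth d from the ratio between the area in the image (pixel²) and the area in real space (mm²), together with the camera parameters
• Assumes that the person's bbox measures 2,000 mm x 2,000 mm (x aspect ratio) in real space
• The distance measure k computed under this assumption correlates with the actual distance (bottom right)
Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image (4/6)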
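[Figure: a person bbox of w [pix] x h [pix]; its real-space height is assumed to be 2,000 mm and its real-space width 2,000 [mm] x w/h. Symbols in the accompanying equation: α = focal length, A_real = area in real space, A_img = area in the image]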
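The distance measure can be worked through numerically. The sketch below assumes the form k = sqrt(α_x · α_y · A_real / A_img), which is our reading of the paper's definition (the slide text only names the symbols), with focal lengths given in pixels; the function name and the example numbers are hypothetical.

```python
import math

def distance_measure_k(bbox_w_px, bbox_h_px, focal_x_px, focal_y_px,
                       real_height_mm=2000.0):
    """Approximate camera distance (mm) from the bbox area under the slide's assumption.

    The bbox is assumed to be 2,000 mm tall in real space, with a real width of
    2,000 mm x (w / h) so that it keeps the aspect ratio of the image bbox.
    """
    aspect = bbox_w_px / bbox_h_px
    area_real = real_height_mm * (real_height_mm * aspect)  # mm^2
    area_img = bbox_w_px * bbox_h_px                        # pixel^2
    return math.sqrt(focal_x_px * focal_y_px * area_real / area_img)

# A 100 x 200 pixel bbox seen by a camera with a 1,000 pixel focal length.
k_near = distance_measure_k(100, 200, 1000.0, 1000.0)

# The same person at half the bbox size in pixels should be twice as far away.
k_far = distance_measure_k(50, 100, 1000.0, 1000.0)
```

With this aspect-ratio convention the expression reduces to the familiar pinhole relation, distance = focal length x real height / image height, which is why k tracks the true distance when the person really fills a 2,000 mm box.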
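23. • Problem: in real images, people may
(a) have different bbox sizes but be at the same distance
(b) have the same bbox size but be at different distances
so this assumption alone does not work well
• Image features are also used to compute a correction factor γ that corrects k, and the final absolute depth is output
Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image (5/6)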
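One way to read the correction is that γ rescales the image area, so the corrected depth becomes k / sqrt(γ). That form is an assumption based on our understanding of the paper, not something the slide states; the pinhole back-projection that follows is standard, and the focal length, principal point, and function names are hypothetical.

```python
import math

def corrected_depth(k_mm, gamma):
    """Assumed correction: scaling the image area by gamma scales depth by 1/sqrt(gamma)."""
    return k_mm / math.sqrt(gamma)

def back_project(x_px, y_px, depth_mm, focal_px, principal_px):
    """Pinhole back-projection of an image point with known depth to camera space (mm)."""
    fx, fy = focal_px
    cx, cy = principal_px
    return ((x_px - cx) * depth_mm / fx,
            (y_px - cy) * depth_mm / fy,
            depth_mm)

# Case (a): a small person whose bbox is small looks farther away than they are.
k = 10000.0                      # distance measure from the bbox area
z_root = corrected_depth(k, 4.0) # gamma > 1 pulls the estimate closer

root_camera = back_project(1060.0, 540.0, z_root,
                           focal_px=(1000.0, 1000.0),
                           principal_px=(960.0, 540.0))
```

A γ of exactly 1 leaves k unchanged, which corresponds to a person who happens to match the 2,000 mm assumption.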
25. 25
Agenda
• Uses the temporal information in video efficiently to resolve ambiguities that a single image cannot, and estimates 3D poses that are temporally consistent
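26. • A method that uses the temporal information in video for 3D pose estimation
• There is a fundamental ambiguity: a 2D pose does not necessarily correspond to a unique 3D pose
→ Resolved by exploiting the continuous human motion observed in video
• A fully convolutional model with dilated convolutions (not an RNN) improves computational and training efficiency
• Also proposes semi-supervised learning that makes effective use of unlabeled data through back-projection
3D human pose estimation in video with temporal convolutions and semi-supervised training (1/3)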
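The effect of stacking dilated temporal convolutions can be sketched in NumPy. This is a toy illustration, not the authors' architecture: the dilation schedule, channel width, random weights, and the names `dilated_conv1d` and `receptive_field` are assumptions, and residual connections and normalization are left out.

```python
import numpy as np

def dilated_conv1d(x, weights, dilation):
    """Unpadded 1D convolution over time.

    x: (T, C_in) sequence of per-frame features
    weights: (K, C_in, C_out) kernel with K taps spaced `dilation` frames apart
    """
    kernel_size = weights.shape[0]
    t_out = x.shape[0] - (kernel_size - 1) * dilation
    out = np.zeros((t_out, weights.shape[2]))
    for k in range(kernel_size):
        start = k * dilation
        out += x[start:start + t_out] @ weights[k]
    return out

def receptive_field(kernel_size, dilations):
    """Number of input frames that influence one output frame."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

dilations = [1, 3, 9, 27, 81]            # hypothetical schedule, growing by 3x per layer
frames = receptive_field(3, dilations)   # input frames needed for one output pose

rng = np.random.default_rng(0)
features = rng.normal(size=(frames, 34)) # e.g. 17 joints x 2D coordinates per frame
channels_in = 34
for d in dilations:
    w = rng.normal(size=(3, channels_in, 8)) * 0.1
    features = np.maximum(dilated_conv1d(features, w, d), 0.0)  # conv + ReLU
    channels_in = 8
```

Five layers with three taps each cover hundreds of frames, and every layer processes all time steps in parallel, which is the efficiency argument the slide makes against RNNs.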
31. • [Wandt+ CVPR'19] Wandt, Bastian, and Bodo Rosenhahn. "RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
• [Habibie+ CVPR'19] Habibie, Ikhsanul, et al. "In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
• [Chen.C+ CVPR'19] Chen, Ching-Hang, et al. "Unsupervised 3D Pose Estimation with Geometric Self-Supervision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
• [Chen.X+ CVPR'19] Chen, Xipeng, et al. "Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
• [Kocabas+ CVPR'19] Kocabas, Muhammed, Salih Karagoz, and Emre Akbas. "Self-Supervised Learning of 3D Human Pose using Multi-view Geometry." arXiv preprint arXiv:1903.02330 (2019).
• [Pavllo+ CVPR'19] Pavllo, Dario, et al. "3D human pose estimation in video with temporal convolutions and semi-supervised training." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
• [Moon+ ICCV'19] Moon, Gyeongsik, Ju Yong Chang, and Kyoung Mu Lee. "Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image." arXiv preprint arXiv:1907.11346 (2019).
• [Rhodin+ ECCV'18] Rhodin, Helge, Mathieu Salzmann, and Pascal Fua. "Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
• [He+ ICCV'17] He, Kaiming, et al. "Mask R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2017.
• [Sun+ ECCV'18] Sun, Xiao, et al. "Integral Human Pose Regression." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
References