Video Recognition Survey v1 (Meta-Survey)

Video Recognition Survey v1
Video Recognition Group, cvpaper.challenge
原健翔，片岡裕雄，石川裕地，笠井誠斗，
若宮天雅，Hao Guoqing，中野真理子

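Related materials
● cvpaper.challenge has previously published
other materials on video recognition
● Trends in human action recognition with 3D CNNs (3D CNNによる人物行動認識の動向)
● Spatiotemporal feature representations of video with 3D CNNs (3D CNNによる動画像の時空間特徴表現)
● Trends in video recognition and captioning (動画認識・キャプショニングの潮流)
● Towards Performant Video Recognition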

What is video recognition?
● Pattern recognition problems whose input is video
● Many tasks exist, starting with Action Recognition,
which recognizes human actions in a video
● Action Recognition, Action Proposal Generation,
Temporal Action Localization, Spatiotemporal Action Detection,
Action Segmentation, Video Captioning, Video Summarization,
Video Generation, Video Object Segmentation,
Video Interpolation, Optical Flow Estimation...
● This document mainly gives an overview of each task

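Trend in the number of video recognition papers
● Change over time in the share of CVPR, ICCV, and ECCV
papers that contain related words
● video, action, activity, behavior,
event, movie, motion
● A decline from 2014, perhaps because of the boom
in deep-learning-based image recognition?
● A later rise, perhaps because image recognition
neared maturity and many researchers
moved on to video recognition?
● Leveling off somewhat recently?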

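Action Recognition
Input: a video temporally trimmed to contain a single action
Output: an action label (example in the figure: "pitching")
● The most basic problem setting
● The video counterpart of image classification on datasets such as ImageNet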

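Trends in Action Recognition | Efficient
● One recent direction is to make Action Recognition
more efficient
● Many video recognition models, such as 3D CNNs, have a heavy computational cost
● The aim is to make them practical by keeping accuracy
as high as possible while making computation efficient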

Trends in Action Recognition | Efficient 1
S. Bhardwaj+, “Efficient Video Classification Using Fewer Frames”, CVPR 2019.
Gains efficiency by distilling a Teacher that uses all frames into a Student that uses only a few frames

B. Korbar+, “SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition”, ICCV 2019.
Improves both efficiency and accuracy by extracting and recognizing only the salient clips in a video

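J. Lin+, “TSM: Temporal Shift Module for Efficient Video Understanding”, ICCV 2019.
3D CNNs perform well but are heavy, so this work proposes a way for a 2D CNN
to mix information from multiple frames at no additional computational cost.
Shifting part of the channels along the time axis mixes information across frames,
which lets a 2D CNN match or exceed the performance of a 3D CNN.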
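The channel shift described for TSM can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: spatial dimensions are dropped, and the fraction of shifted channels (`fold_div`) and the zero padding at the sequence ends are assumptions made for this example.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of channels along the time axis.

    frames: list of T per-frame feature vectors, each a list of C numbers
            (spatial dimensions are omitted in this sketch).
    fold_div: assumed hyperparameter; C // fold_div channels take their value
              from the next frame, the same number from the previous frame,
              and the remaining channels stay in place.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # value comes from the following frame
            elif c < 2 * fold:
                src = t - 1      # value comes from the preceding frame
            else:
                src = t          # channel is left untouched
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
            # otherwise the zero padding is kept (assumption)
    return shifted
```

Because the shift only moves values between frames, it adds no multiplications; a following 2D convolution then sees features from neighboring frames in the shifted channels.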

C. Luo+, “Grouped Spatial-Temporal Aggregation for Efficient Action Recognition”, ICCV 2019.
3D CNNs perform well but are heavy, so part of the 3D CNN is replaced with 2D convolutions for efficiency

D. Tran+, “Video Classification with Channel-Separated Convolutional Networks”, ICCV 2019.
[Figure: standard bottleneck block vs. channel-separated bottleneck block; dw = depth-wise conv]
Examines in detail how effective group convolution is for video recognition with 3D CNNs
and achieves SOTA performance with an efficient model

C. Feichtenhofer, “X3D: Expanding Architectures for Efficient Video Recognition”, CVPR 2020 (accepted, Oral).
Starting from a base 2D CNN, expands temporal duration, frame rate, spatial resolution,
network width, bottleneck width, and depth one at a time
to search for an efficient and accurate network.
Keeping channels narrow and raising the spatiotemporal resolution is effective.

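Strong research organizations in this field
● Facebook AI Research (FAIR)
● Half of the six papers above come from here
● Home to a lineup of top video recognition researchers, including
H. Wang, who proposed Dense Trajectories, the standard pre-deep-learning method (while at INRIA),
D. Tran, who proposed C3D, for a long time the standard 3D CNN model,
and C. Feichtenhofer, who has video recognition papers accepted at every top conference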

Overview of Action Proposal Generation
• Predicts the temporal intervals in a video where an action is likely to occur (action proposals)
• Datasets
- ActivityNet 1.3 [2]
- Videos: 20k videos, 648 hours in total
- THUMOS14 [3]
- Videos: about 400 videos
• Evaluation metric
- The area under the Average Recall vs. Average Number of Proposals
per Video (AR-AN) curve, computed over tIoU thresholds
[1] T. Lin et al., “BSN: Boundary Sensitive Network for Temporal Action Proposal Generation”, In ECCV 2018
[2] F. Caba Heilbron et al., “ActivityNet: A large-scale video benchmark for human activity understanding”, In CVPR 2015
[3] Y. G. Jiang et al., “Thumos challenge: Action recognition with a large number of classes”, In ECCVWS 2014
Figure quoted from [1]
Section author: Ishikawa
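The building blocks of the AR-AN metric, temporal IoU (tIoU) and recall averaged over tIoU thresholds, can be sketched as follows. This is a simplified illustration and not the official evaluation code: intervals are assumed to be `(start, end)` pairs in seconds, the threshold list is a free parameter, and the area under the curve over the number of proposals is left out.

```python
def temporal_iou(a, b):
    """tIoU between two temporal intervals given as (start, end)."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at(proposals, ground_truth, tiou_threshold):
    """Fraction of ground-truth intervals matched by at least one proposal."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1 for gt in ground_truth
        if any(temporal_iou(p, gt) >= tiou_threshold for p in proposals)
    )
    return hits / len(ground_truth)


def average_recall(proposals, ground_truth, tiou_thresholds):
    """Recall averaged over several tIoU thresholds (the AR in AR-AN)."""
    recalls = [recall_at(proposals, ground_truth, t) for t in tiou_thresholds]
    return sum(recalls) / len(recalls)
```

Evaluating `average_recall` on the top-N proposals for increasing N gives the AR-AN curve whose area is reported.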

Anchor-based Approaches
• Generate proposals using multi-scale anchors
• Main methods
- SSAD [1], CBR [2], TURN TAP [3]
• Strengths
- Can generate multi-scale proposals effectively
- Confidence scores are often reliable because information from all anchors is captured at once
• Weaknesses
- Anchors are hard to design
- Proposals are often not temporally precise
- Hard to capture temporal intervals of widely varying length
[1] T. Lin et al., “Single Shot Temporal Action Detection”, in ACM Multimedia 2017
[2] J. Gao et al., “Cascaded Boundary Regression for Temporal Action Detection”, in BMVC 2017
[3] J. Gao et al., “TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals”, in ICCV 2017
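To make the idea of multi-scale anchors concrete, the sketch below places anchors of several lengths at regular temporal positions and refines one anchor with predicted offsets. The names `generate_anchors` and `decode`, and the center/log-length offset parameterization, are assumptions for this example; each of the methods above defines its own anchor layout and regression targets.

```python
import math


def generate_anchors(num_positions, stride, scales):
    """Place one anchor per scale at each temporal position.

    Returns (start, end) pairs in frames; the center of position i is
    (i + 0.5) * stride.
    """
    anchors = []
    for i in range(num_positions):
        center = (i + 0.5) * stride
        for scale in scales:
            anchors.append((center - scale / 2.0, center + scale / 2.0))
    return anchors


def decode(anchor, d_center, d_length):
    """Refine an anchor with predicted offsets (hypothetical parameterization).

    d_center shifts the center by a fraction of the anchor length;
    d_length rescales the length by exp(d_length).
    """
    start, end = anchor
    length = end - start
    center = (start + end) / 2.0 + d_center * length
    new_length = length * math.exp(d_length)
    return (center - new_length / 2.0, center + new_length / 2.0)
```

The design difficulty noted on the slide shows up directly here: `stride` and `scales` must be chosen by hand to cover the action durations in the dataset.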

Anchor-free Approaches
• Generate proposals after evaluating cues such as action boundaries and actionness
• Main methods
- TAG [1], BSN [2], BMN [3]
• Strengths
- Can generate proposals with flexible durations and precise boundaries
- Using BSP (Boundary Sensitive Proposal) features makes the confidence scores more reliable
• Weaknesses
- Feature design and confidence scoring are carried out separately, which is inefficient
- Features tend to be simple and may be insufficient to capture temporal context
- Multi-stage rather than an end-to-end framework
[1] Yue Zhao et al., “Temporal Action Detection with Structured Segment Networks”, in ICCV 2017
[2] T. Lin et al., “BSN: Boundary Sensitive Network for Temporal Action Proposal Generation”, in ECCV 2018
[3] T. Lin et al., “BMN: Boundary-Matching Network for Temporal Action Proposal Generation”, in ICCV 2019

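Anchor-based approach: DAPs
• Passes per-clip video features through an LSTM to extract long-term features
• Outputs offsets relative to the anchors from these features
Victor Escorcia et al., “DAPs: Deep Action Proposals for Action Understanding”, In ECCV 2016
Visual Encoder: video feature extractor (C3D)
Sequence Encoder: feeds the C3D features into an LSTM and encodes them
into features that account for longer-term temporal information
Localization Module: combines fully connected layers on top of the LSTM output
to predict the position and length of each action proposal
Prediction Module: outputs a confidence score for each action proposal;
consists of a fully connected layer and a sigmoid function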

Anchor-based Approach: Segment-CNN (SCNN)
• Proposes a two-stage model for action localization
• The first stage predicts class-agnostic actionness for multi-scale sliding windows
and keeps the windows with high actionness as proposals
• The second stage classifies the action in each resulting proposal
Z. Shou et al., “Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs”, In CVPR 2016

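SSAD: extending SSD from object detection to video recognition
• Tianwei Lin et al., “Single Shot Temporal Action Detection”, In ACM Multimedia 2017
• An anchor-based method (it actually goes as far as recognizing the action in each proposal)
• Extends SSD, used in object detection, to action detection
• Predicts temporal offsets relative to default anchors
(a) Extracts features with multiple networks
(b) Estimates the class and the offsets for each anchor
(c) Applies NMS as post-processing to obtain the final output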
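The NMS post-processing in step (c) can be sketched for temporal intervals as below. This is generic greedy non-maximum suppression, not SSAD's exact code; the `(start, end, score)` tuple layout and the default threshold of 0.5 are assumptions for this example.

```python
def tiou(a, b):
    """Temporal IoU of two detections; only start and end are used."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def temporal_nms(detections, tiou_threshold=0.5):
    """Greedy NMS over (start, end, score) tuples.

    Repeatedly keeps the highest-scoring detection and discards the
    remaining ones that overlap it by more than tiou_threshold.
    """
    remaining = sorted(detections, key=lambda d: d[2], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if tiou(best, d) <= tiou_threshold]
    return kept
```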

Anchor-based approach: TURN TAP
• J. Gao et al., “TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals”, in ICCV 2017
• Splits the video into units of 16 frames each
• In addition to the anchor units, uses the features of the preceding and following units to form the feature of a clip (a set of units) (clip pyramid)
• Judges whether an action instance exists within the anchor units and estimates the start and end offsets

Cascaded Boundary Regression (CBR)
• J. Gao et al., “Cascaded Boundary Regression for Temporal Action Detection”, In BMVC 2017
• Proposes a two-stage network for action localization
• Proposes Cascaded Boundary Regression (CBR): proposals obtained by estimating offsets
relative to sliding windows are fed through the same network repeatedly,
which refines the temporal interval of each proposal
• CBR is used both in the stage that generates class-agnostic proposals
and in the stage that classifies actions
[Figure: overview of the proposed method; Cascaded Boundary Regression]

Anchor-free approach: Temporal Actionness Grouping
• Y. Zhao et al., “Temporal Action Detection with Structured Segment Networks”, In ICCV 2017
• After predicting actionness, finds the basins for a given γ
• Sets a threshold τ on those basins and generates action proposals
• Sampling γ and τ uniformly in (0, 1) yields proposals at various scales
actionness: action probability
complemented actionness: 1 - actionness
γ: a level of the complemented actionness
basin: for a given γ, a region where the complemented actionness
is at or below γ
τ: when several basins are merged, the ratio of the gaps between basins
to the total duration
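The grouping procedure can be sketched directly from the definitions on this slide. This is an illustrative reading, not the authors' code: basins are half-open frame ranges, each basin seeds one proposal, and following basins are absorbed while the gap ratio stays at or below τ. The function names and the exact merge criterion are assumptions based on the slide's wording.

```python
def find_basins(actionness, gamma):
    """Half-open frame ranges where complemented actionness <= gamma."""
    basins, start = [], None
    for t, a in enumerate(actionness):
        inside = (1.0 - a) <= gamma
        if inside and start is None:
            start = t
        elif not inside and start is not None:
            basins.append((start, t))
            start = None
    if start is not None:
        basins.append((start, len(actionness)))
    return basins


def group_basins(basins, tau):
    """Seed a proposal at each basin and absorb the basins that follow
    while (gap length / total span) stays at or below tau."""
    proposals = []
    for i, (start, end) in enumerate(basins):
        covered = end - start
        for next_start, next_end in basins[i + 1:]:
            new_covered = covered + (next_end - next_start)
            span = next_end - start
            if (span - new_covered) / span > tau:
                break
            end, covered = next_end, new_covered
        proposals.append((start, end))
    return proposals


def tag_proposals(actionness, gammas, taus):
    """Collect proposals over several (gamma, tau) settings."""
    found = set()
    for gamma in gammas:
        basins = find_basins(actionness, gamma)
        for tau in taus:
            found.update(group_basins(basins, tau))
    return sorted(found)
```

Sweeping `gammas` and `taus` over values in (0, 1) produces short proposals from strict settings and long merged proposals from loose ones, which is how the method covers multiple scales without anchors.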

CTAP: Complementary Temporal Action Proposal
• Combines the anchor-based and anchor-free approaches
• The video features are fed into two networks: the Proposal-level Actionness
Trustworthiness Estimator (PATE), which evaluates actionness on predefined sliding windows,
and TAG, which estimates proposals without sliding windows.
• The proposals from these two networks are then fed into a network that adjusts boundaries
and ranks proposals, which produces the final proposals.
J. Gao et al., “CTAP: Complementary Temporal Action Proposal Generation”, In ECCV 2018

BSN: ActivityNet Challenge 2018 winner
• T. Lin et al., “BSN: Boundary Sensitive Network for Temporal Action Proposal Generation”, In ECCV 2018
• Proposes the Boundary Sensitive Network (BSN), an anchor-free approach
• Estimates starting points, ending points, and actionness from video features
• Treats each feasible pair of starting point and ending point as a candidate
action proposal, scores it by the actionness within that interval, and selects the proposals
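The pairing of starting and ending points can be sketched as follows. This is a simplification for illustration: candidate boundaries are taken as local peaks above an assumed `min_prob`, and each pair is scored by the product of the boundary probabilities and the mean actionness inside the interval. BSN itself scores candidates with a learned module, so the scoring rule here is a stand-in, not the paper's method.

```python
def local_peaks(probs, min_prob):
    """Indices that are local maxima with probability >= min_prob."""
    peaks = []
    for t, p in enumerate(probs):
        left = probs[t - 1] if t > 0 else float("-inf")
        right = probs[t + 1] if t + 1 < len(probs) else float("-inf")
        if p >= min_prob and p >= left and p >= right:
            peaks.append(t)
    return peaks


def pair_boundaries(start_probs, end_probs, actionness, max_duration,
                    min_prob=0.5):
    """Combine candidate starts and ends into scored proposals.

    Returns (start, end, score) tuples sorted by descending score.
    The score is a hypothetical stand-in for a learned confidence.
    """
    proposals = []
    for s in local_peaks(start_probs, min_prob):
        for e in local_peaks(end_probs, min_prob):
            if 0 < e - s <= max_duration:
                inside = actionness[s:e + 1]
                mean_actionness = sum(inside) / len(inside)
                score = start_probs[s] * end_probs[e] * mean_actionness
                proposals.append((s, e, score))
    return sorted(proposals, key=lambda p: p[2], reverse=True)
```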

BMN: ActivityNet Challenge 2019 winner
• An anchor-free approach
• Predicts action boundaries from video features, then builds proposals from their combinations
• Builds a Boundary-Matching Confidence Map to evaluate the confidence of all proposals
and selects the final proposals
T. Lin et al., “BMN: Boundary-Matching Network for Temporal Action Proposal Generation”, In ICCV 2019

Overview of Action Segmentation
• Performs frame-level action recognition on a video
• Main approaches
- Sliding window
- Semi-Markov models
- Frame features + RNN
- Applications of temporal convolution
• Datasets
- 50 Salads
- GTEA
- Breakfast
• Evaluation metrics
- Frame-wise Accuracy
- Segmental Edit Distance
- Segmental F1 Score with tIoU thresholds
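Two of the metrics above, frame-wise accuracy and the segmental edit score, can be sketched as below. The edit score here is the Levenshtein distance between the segment label sequences, normalized by the longer sequence and mapped to 0-100; that normalization follows a common convention and is an assumption of this example. The segmental F1 score is omitted.

```python
def frame_accuracy(pred, gt):
    """Fraction of frames whose predicted label matches the ground truth."""
    correct = sum(1 for p, g in zip(pred, gt) if p == g)
    return correct / len(gt)


def to_segments(labels):
    """Collapse consecutive identical frame labels into one segment label."""
    segments = []
    for label in labels:
        if not segments or segments[-1] != label:
            segments.append(label)
    return segments


def levenshtein(a, b):
    """Edit distance between two label sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def edit_score(pred, gt):
    """Segmental edit score in [0, 100]; higher is better."""
    p, g = to_segments(pred), to_segments(gt)
    longest = max(len(p), len(g))
    if longest == 0:
        return 100.0
    return (1.0 - levenshtein(p, g) / longest) * 100.0
```

A prediction that flickers between labels can keep a high frame-wise accuracy while its edit score drops sharply, which is why the segmental metrics are reported alongside accuracy.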

Segmental Spatiotemporal CNNs (ST-CNN)
• C. Lea et al., “Segmental Spatiotemporal CNNs for Fine-grained Action Segmentation”, in ECCV 2016
• Spatial Component
- Uses CNN features + Motion History Image as the per-frame features
- Applies an auxiliary loss on frame-level classification
• Temporal Component
- 1D acausal convolution that looks at d frames before and after the frame of interest
• Segmental Component
- Uses a semi-Markov model to capture transitions between actions

Temporal Convolutional Networks (TCN)
• C. Lea et al., “Temporal Convolutional Networks for Action Segmentation and Detection”, in CVPR 2017
• Proposes the Encoder-Decoder TCN and the Dilated TCN
• Also compares acausal convolution, which sees future frames relative to the frame of interest,
with causal convolution, which sees only past frames (acausal performs better)
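The difference between causal and acausal temporal convolution comes down to which frames fall inside the filter window. The sketch below applies a single 1D filter to a scalar sequence with zero padding; it illustrates the windowing only and is not a TCN layer (no channels, dilation, or learned weights).

```python
def conv1d_time(x, kernel, causal):
    """Apply a 1D temporal filter with zero padding.

    causal=True : output at t uses frames t-k+1 .. t (past and present only).
    causal=False: output at t uses a window centered on t, so it also
                  sees future frames (acausal).
    """
    k = len(kernel)
    num_frames = len(x)
    out = []
    for t in range(num_frames):
        acc = 0.0
        for i, w in enumerate(kernel):
            src = t - (k - 1) + i if causal else t - k // 2 + i
            if 0 <= src < num_frames:
                acc += w * x[src]
        out.append(acc)
    return out
```

Feeding an impulse at frame 2 through a length-3 box filter shows the effect: the causal output is zero before frame 2, while the acausal output already responds at frame 1.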

Temporal Deformable Residual Networks
• P. Lei et al., “Temporal Deformable Residual Networks for Action Segmentation in Videos”, In CVPR 2018
[Figure: outline; Deformable Temporal Residual Module; Temporal Deformable Convolution]
• Applies deformable convolution to action segmentation
• Uses a residual stream that preserves the original temporal resolution

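Multi-Stage Temporal Convolutional Network (MS-TCN)
• Y. A. Farha et al., “MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation”, in CVPR 2019
• Proposes MS-TCN, which stacks TCNs in multiple stages, and reduces over-segmentation errors
• Proposes a smoothing loss that penalizes frame-to-frame changes in the predicted action probabilities
[Figure labels: classification; action transitions]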
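A smoothing loss of this kind can be sketched as a truncated mean squared error over frame-to-frame differences of log-probabilities. The truncation value `tau`, the `eps` clamp, and the normalization by frames times classes are written here as I recall the general form and should be treated as assumptions; the gradient handling used during training is not modeled.

```python
import math


def smoothing_loss(probs, tau=4.0, eps=1e-8):
    """Truncated MSE over log-probability changes between adjacent frames.

    probs: list of T frames, each a list of C class probabilities.
    tau:   differences larger than tau are clipped (assumed value).
    eps:   clamp that avoids log(0) (assumption of this sketch).
    """
    num_frames = len(probs)
    num_classes = len(probs[0])
    total = 0.0
    for t in range(1, num_frames):
        for c in range(num_classes):
            now = math.log(max(probs[t][c], eps))
            before = math.log(max(probs[t - 1][c], eps))
            delta = min(abs(now - before), tau)
            total += delta ** 2
    return total / (num_frames * num_classes)
```

Predictions that stay constant across frames incur zero loss, while an abrupt switch of the dominant class is penalized; the clipping keeps genuine action boundaries from dominating the loss.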

Generating synthetic data for video recognition

Video Recognition
• Supervised learning has made significant progress in context-aware video recognition
• However, supervised learning suffers from several problems:
• Acquiring labeled data is time-consuming and labor-intensive
• Copyright issues
• Mislabeling
To address these issues, we use synthetic data to learn context-aware video recognition.

Advantages of synthetic data
• Unlimited amount: huge datasets are what power deep learning algorithms
• Less labor-intensive
• Perfect annotation
• ImageNet, by comparison, contains a lot of mislabeling
• No copyright issues

Disadvantages of synthetic data
• Poor realism
• Inharmonious appearance, location, and scale
• Overfitting
• Temporal consistency (video only)

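Inserting Videos into Videos (CVPR 2019)
1. Introduces an important and challenging problem that extends
object insertion from images to videos.
2. Proposes a method that learns to insert objects without real
paired data by generating synthetic fake pairs.
3. Shows that realistic videos can be synthesized from
challenging real-world input videos.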

Video Harmonization: Temporally Coherent Video Harmonization Using Adversarial
Networks
Supervised dataset creation:
Given an image (a), we take it as the first ground-truth frame.
Then we cut out the foreground and apply inpainting to obtain
the pure background (c). By performing color adjustment on
the foreground of (a), we obtain the first composite frame (d).
By applying a random affine transform to the foregrounds of
(a) and (d), we obtain the second ground-truth frame (e) and
the second composite frame (f).

Temporal GAN: Temporal Generative Adversarial Nets with Singular Value
Clipping
TGAN can learn a semantic representation of unlabeled videos and is capable of generating videos.

Temporal GAN (2017)
• Applications: Video Frame Interpolation, Conditional TGAN
• Conditional TGAN:
• In some cases, videos in a dataset carry labels that correspond to a category of the video, such as
“IceDancing” or “Baseball”. In order to exploit them and improve the quality of the generated videos, we
also develop a Conditional TGAN (CTGAN), in which the generator takes both a label l and a latent variable
z0.

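Title: Context-aware Synthesis for Video Frame Interpolation -- https://arxiv.org/pdf/1803.10967.pdfp.pdf
Summary / novelty:
Releases the first large-scale dataset of 3D humans in motion
in which clothing geometry is explicitly modeled.
Proposes a new algorithm that performs a spherical parameterization
of elongated body parts in order to model rigged body meshes
as geometry images.
Introduces an end-to-end network that estimates the shape of the
human body and clothing from a single image without relying on
a parametric model.
Results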

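Title: ADVERSARIAL VIDEO GENERATION ON COMPLEX DATASETS -- https://arxiv.org/pdf/1907.06571.pdf
Summary:
The proposed method tackles the difficult problem of modeling
natural video by introducing a GAN.
It achieves SOTA on UCF-101 and Kinetics-600, and it can also
generate videos of high complexity and diversity.
Novelty:
1. The proposed model can generate high-quality samples of natural
video at resolutions up to 256x256 and lengths up to 48
frames.
2. Establishes class-conditional video synthesis on Kinetics-600
as a new benchmark for generative video modeling and reports
the DVD-GAN results as a strong baseline.
Results
Method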

Video generation based on motion and content
• Video generation methods fall roughly into two types:
• future frame prediction
  • Generates new frames from past frames
  • Decomposing Motion And Content For Natural Video Sequence Prediction (ICLR 2017)
  • Animating Landscape:
  Self-Supervised Learning of Decoupled Motion and Appearance for Single-Image Video Synthesis
  (SIGGRAPH Asia 2019)
• generation
  • Temporal Generative Adversarial Nets with Singular Value Clipping (ICCV 2017)
  • MoCoGAN: Decomposing Motion and Content for Video Generation (CVPR 2018)

future frame prediction
• Decomposing Motion And Content For Natural Video Sequence Prediction

Animating Landscape:
Self-Supervised Learning of Decoupled Motion and Appearance for Single-Image Video
Synthesis
training motion predictor
training appearance predictor

Generating Videos with Scene Dynamics (NIPS 2016)
• Separates the video into foreground and background.
• Generates both the background and the foreground
from the same noise
• We capitalize on large amounts of
unlabeled video in order to learn a
model of scene dynamics for both
video recognition tasks (e.g. action
classification) and video generation
tasks (e.g. future prediction)
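The foreground/background split can be illustrated by the final compositing step. The sketch below blends a moving foreground over a static background with a per-pixel mask; the mask is not mentioned on the slide and is included here as my understanding of the paper's two-stream design, so treat it as an assumption. Frames are flattened to lists of pixel values for brevity.

```python
def compose_video(foreground, mask, background):
    """Blend a moving foreground over a static background.

    foreground: T frames, each a list of P pixel values
    mask:       T frames of per-pixel weights in [0, 1] (assumed component)
    background: a single frame of P pixel values, reused at every time step
    Returns T frames where each pixel is m * f + (1 - m) * b.
    """
    video = []
    for fg_frame, mask_frame in zip(foreground, mask):
        frame = [
            m * f + (1.0 - m) * b
            for f, m, b in zip(fg_frame, mask_frame, background)
        ]
        video.append(frame)
    return video
```

In a generator of this type, all three inputs would be produced by networks fed with the same noise vector; here they are plain arrays so the composition itself is easy to inspect.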


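MoCoGAN: Decomposing Motion and Content for Video Generation (CVPR 2018)
Argues that mapping a whole video to a point in latent space,
as existing methods do, is not meaningful:
the same motion performed at different speeds is mapped to
different points in latent space,
and the generated videos have a fixed length.
To solve these problems, each point in latent space generates
a single image, and the images are concatenated to form
a video.
The latent space consists of a motion subspace and a content
subspace.
The content variable is fixed within a video
The motion variable changes over the course of a video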
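The latent structure described above, one fixed content variable and a motion variable that changes per frame, can be sketched as follows. The function name `sample_video_latents` and the toy `tanh` recurrence are hypothetical stand-ins for this example; the actual model drives the motion variable with a learned recurrent network and decodes each latent with an image generator.

```python
import math
import random


def sample_video_latents(num_frames, content_dim, motion_dim, seed=None):
    """Sample one latent vector per frame.

    The content part is drawn once and shared by every frame; the motion
    part evolves through a toy recurrence (stand-in for a learned RNN).
    Returns num_frames vectors of length content_dim + motion_dim.
    """
    rng = random.Random(seed)
    z_content = [rng.gauss(0.0, 1.0) for _ in range(content_dim)]
    hidden = [0.0] * motion_dim
    latents = []
    for _ in range(num_frames):
        noise = [rng.gauss(0.0, 1.0) for _ in range(motion_dim)]
        hidden = [math.tanh(0.5 * h + n) for h, n in zip(hidden, noise)]
        latents.append(z_content + hidden)
    return latents
```

Because frames are produced one latent at a time, `num_frames` can be any length, which addresses the fixed-length limitation mentioned on the slide.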


TwoStreamVAN: Improving Motion Modeling in Video Generation (WACV 2020)
A major problem with pixel-level video prediction
and generation methods is that they
attempt to model both static content and dynamic
motion in a single entangled generator, regardless
of whether they disentangle the motion and content
in the latent space or not.
1. Proposes a video generation model, TwoStreamVAN,
together with a more effective learning scheme, which
disentangles motion and content in the generation phase.
2. Designs a multi-scale motion fusion mechanism and
further improves motion modeling by conditioning on
the spatial context.


NVIDIA https://github.com/NVlabs
Ample computing resources
Brings together strong researchers

TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting (CVPR 2020)
Research in the same vein as Everybody Dance Now

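Title: Video Frame Interpolation via Adaptive Convolution -- http://web.cecs.pdx.edu/~fliu/papers/cvpr2017-interp.pdf
Keywords: video interpolation
Summary:
The proposed method merges the conventional two steps (motion estimation
and pixel synthesis) into a single step. In addition, the model
can be trained without hard-to-obtain data such as optical
flow.
Novelty:
1. Because video interpolation is formulated as a single process,
the method can make appropriate trade-offs between competing
constraints, which makes it robust.
2. The model can be trained end-to-end on widely available
video data, without hard-to-obtain data such as
optical flow.
3. The method produces high-quality results on difficult videos
with occlusion, blur artifacts,
and abrupt brightness changes.
Results
Method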
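The single-step formulation can be sketched as a per-pixel convolution: every output pixel has its own pair of kernels, one applied to a patch of each input frame. In the method described above a network would predict those kernels; in this sketch they are simply passed in, and the grayscale frames, edge clamping, and the `kernels[y][x] = (k1, k2)` layout are assumptions for illustration.

```python
def adaptive_conv_interpolate(frame1, frame2, kernels):
    """Synthesize an intermediate frame with per-pixel kernels.

    frame1, frame2: 2D lists (H x W) of grayscale values
    kernels[y][x]:  pair (k1, k2) of square 2D kernels with odd side length,
                    applied to patches of frame1 and frame2 centered on (y, x)
    Patches are clamped at the image border.
    """
    height, width = len(frame1), len(frame1[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            k1, k2 = kernels[y][x]
            radius = len(k1) // 2
            acc = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    acc += k1[dy + radius][dx + radius] * frame1[yy][xx]
                    acc += k2[dy + radius][dx + radius] * frame2[yy][xx]
            out[y][x] = acc
    return out
```

A kernel whose weight sits off-center copies a displaced pixel, which is how one operation can account for motion and pixel blending together.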

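Title: Context-aware Synthesis for Video Frame Interpolation -- https://arxiv.org/pdf/1803.10967.pdfp.pdf
Keywords: video interpolation
Summary:
The proposed method warps not only the input frames but also
their per-pixel context information, and uses both
to interpolate a high-quality intermediate
frame.
Novelty:
1. Combining bidirectional flow with a flexible frame synthesis
model handles difficult cases such as occlusion
and accommodates inaccuracies in motion
estimation.
2. The frame interpolation model can perform useful
interpolation. In addition, it helps to initialize the
interpolation properly using optical flow.
Results
Method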

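Recent video datasets
● Many large-scale video datasets have appeared one after another in the last few years
● This section introduces dataset papers from 2019 and 2020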

Recent video datasets 1
Y. Tang+, “COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis”, CVPR 2019.
Dataset for fine-grained action recognition in instructional videos

A. Miech+, “HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips”, ICCV 2019.
Very large-scale video dataset with text annotations

H. Zhao+, “HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization”, ICCV 2019.
Large-scale dataset for Action Recognition & Temporal Localization

X. Wang+, “VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research”, ICCV 2019.
Dataset for multilingual Video Captioning and for translation with video & text input

Q. Jiang+, “SVD: A Large-Scale Short Video Dataset for Near-Duplicate Video Retrieval”, ICCV 2019.
Dataset for detecting duplicated and re-uploaded videos

Q. Kong+, “MMAct: A Large-Scale Dataset for Cross Modal Human Action Understanding”, ICCV 2019.
Multi-view, multi-modal dataset for Action Recognition

M. Martin+, “Drive&Act: A Multi-modal Dataset for Fine-grained Driver Behavior Recognition in Autonomous Vehicles”, ICCV 2019.
Multi-modal dataset for fine-grained action recognition inside vehicles

Q. You+, “Action4D: Online Action Recognition in the Crowd and Clutter”, CVPR 2019.
Multi-view video dataset for action recognition

D. Shao+, “FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding”, CVPR 2020 (accepted, Oral).
Video dataset with hierarchical annotations of fine-grained actions

J. Liu+, “VIOLIN: A Large-Scale Dataset for Video-and-Language Inference”, CVPR 2020 (accepted).
Video dataset in which each video comes with subtitles and positive/negative statements describing the scene

S. Ghorbani+, “MoVi: A Large Multipurpose Motion and Video Dataset”, arXiv, 2020.
Dataset with synchronized mocap, video, and accelerometer data

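Trends in recent video datasets
● Video & Text
● Captions, dialogue (subtitles), narration, and so on:
"text" takes many forms in video, so there seems to be plenty left to do
● Multi-modal, Multi-view
● Many large-scale YouTube video datasets already exist,
so these works propose distinctive data that differs from them
● Fine-grained
● Action Recognition so far has aimed at discriminating as wide a variety of classes as possible,
whereas recent work attempts finer-grained discrimination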

Where video datasets come from
● Many involve companies
● Meitu (COIN), ByteDance (VaTeX, SVD), Hitachi (MMAct),
Alibaba (MMAct), Microsoft (Action4D, VIOLIN)

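Conclusion
● Introduced an overview of each video recognition task and recent research
● A v2 of this document with further additions and corrections is planned for release at a later date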
