Shizhe Chen, Dong Huang, "Elaborative Rehearsal for Zero-Shot Action Recognition", Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 13638-13647.
https://openaccess.thecvf.com/content/ICCV2021/html/Chen_Elaborative_Rehearsal_for_Zero-Shot_Action_Recognition_ICCV_2021_paper.html
2. Zero-Shot Action Recognition (ZSAR)
◼ Zero-Shot Learning (ZSL)
• Recognizes classes with no training data for them
• Learns from side information instead
• ZSAR is ZSL applied to the action recognition task
◼ Training
• Embed videos and their class labels (plus side information)
• Learn to associate these features with each other
◼ Inference
• Search for the class label closest to the video feature
• Classes never used during training can be recognized,
• because information about the inference-time classes is also embedded
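The training/inference scheme above can be sketched as nearest-neighbor search in a shared embedding space. This is a minimal toy with synthetic NumPy vectors, not the paper's actual encoders or features:

```python
# Toy ZSL inference sketch (synthetic vectors, hypothetical setup).
# A video feature, already mapped into the semantic space, is compared
# against semantic embeddings of classes unseen during training; the
# nearest class is predicted.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 8

# Semantic embeddings of unseen classes (toy random vectors standing in
# for, e.g., word embeddings of the class names).
class_embeddings = {
    "ice dancing": rng.normal(size=dim),
    "apply eye makeup": rng.normal(size=dim),
}

# A video feature assumed to already live in the same semantic space,
# close to its true class.
video_feature = class_embeddings["ice dancing"] + 0.1 * rng.normal(size=dim)

# Inference: pick the class whose embedding is most similar.
pred = max(class_embeddings, key=lambda c: cosine(video_feature, class_embeddings[c]))
print(pred)  # → ice dancing
```

The key point is that nothing in the inference step depends on having trained on these classes; only their semantic embeddings are needed.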
[Figure: pipeline over seen/unseen classes and side information, with Step 1: Visual Embedding, Step 2: Semantic Label Embedding, and Step 3: Training learning a mapping f(·); side information sources include Google News, Wikipedia, ImageNet, WordNet, and WikiHow; example classes include Apply Eye Makeup, Ice Dancing, Horse Riding, and Playing Guitar.]
Figure 1: Schematic representation of a ZSL human action recognition framework.
The human ability to recognize an action without ever having seen it before, that is, associating semantic information from several sources to the visual appearance of actions, is the inspiration of ZSL approaches [43]. In Figure 1, we provide an overview of ZSL approaches considering the application in videos. This general scheme can also be found in ZSL applied to object and event recognition in both images and videos [22]. We introduce the main aspects of the approaches throughout this text.
In addition to extracting visual features, it is necessary to associate them with a suitable prototype and assign a label. This is done by learning a mapping function f(·) between these spaces. As discussed in Section 3, this mapping can be performed directly into the semantic space, indirectly by creating an intermediate space, or directly into the visual space.
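As a toy illustration of one of these choices, a direct mapping into the semantic space can be sketched as a linear f(·) fit by ridge regression on seen-class data, then used to match new videos against unseen-class prototypes. All vectors and dimensions below are synthetic assumptions, not the survey's concrete method:

```python
# Toy sketch: learn a linear map f(v) = W v from visual space to semantic
# space on seen videos, then classify an unseen-class video by its nearest
# semantic prototype. Everything here is synthetic.
import numpy as np

rng = np.random.default_rng(1)
d_vis, d_sem, n = 16, 8, 200

W_true = rng.normal(size=(d_sem, d_vis))                # hidden ground-truth map (toy)
V = rng.normal(size=(n, d_vis))                         # visual features of seen videos
S = V @ W_true.T + 0.01 * rng.normal(size=(n, d_sem))   # their semantic targets

# Ridge regression: W = S^T V (V^T V + lambda I)^{-1}
lam = 1e-3
W = S.T @ V @ np.linalg.inv(V.T @ V + lam * np.eye(d_vis))

# Semantic prototypes of three unseen classes (toy vectors).
prototypes = rng.normal(size=(3, d_sem))

# Construct a test video whose true semantics match prototype 2, map it
# with the learned f, and pick the nearest prototype.
v_new = np.linalg.pinv(W_true) @ prototypes[2]
pred = int(np.argmin(np.linalg.norm(prototypes - W @ v_new, axis=1)))
print(pred)  # → 2
```

Indirect (intermediate-space) and visual-space variants follow the same pattern with the roles of the two spaces rearranged.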
Thus, we concentrate our investigation on approaches that address the problem of recognizing human actions, without having seen them before, in small video clips.
[Estevam+, Neurocomputing](arXiv)
4. Related Work
◼ Early methods (a)
• Define actions by enumeration
[Liu+, CVPR2011]
◼ Detect objects and embed them (b)
• Zero-shot learning [Xu+, IJCV2017]
• Adopts a state-of-the-art action recognition network
[Brattoli+, CVPR2020]
• Prone to overfitting
◼ Embedding of action classes (c)
• The proposed method
• Based on Elaborative Rehearsal (ER), a human memory technique
[Benjamin & Bjork, Journal of Experimental Psychology: Learning, Memory, and Cognition, 2000]
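A rough sketch of the ER idea in approach (c): score each class by direct video-description alignment plus an Elaborative-Concept term built from objects detected in the video. The vectors and the additive scoring rule below are illustrative assumptions, not the paper's actual architecture:

```python
# Toy ER-style scoring sketch (synthetic vectors, hypothetical scoring rule).
# Each class has an Elaborative Description (ED) embedding; the video is
# scored by direct alignment with the ED plus alignment of its detected
# objects (Elaborative Concepts, EC) with the ED.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)
d = 8

ed = {  # ED sentence embeddings per class (toy random vectors)
    "ice dancing": rng.normal(size=d),
    "horse riding": rng.normal(size=d),
}

# Video embedding, assumed close to its true class's ED.
video = ed["ice dancing"] + 0.2 * rng.normal(size=d)
# EC: embeddings of objects detected in the video (e.g. skates, rink),
# assumed to also lie near the true class's ED.
objects = [ed["ice dancing"] + 0.3 * rng.normal(size=d)]
ec = np.mean(objects, axis=0)

def score(c):
    # Direct video-class alignment plus the EC term.
    return cos(video, ed[c]) + cos(ec, ed[c])

pred = max(ed, key=score)
print(pred)  # → ice dancing
```

The EC term is what distinguishes this from plain embedding matching: even if the video feature is ambiguous, known concepts detected in the video can pull the score toward the right class.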
Shizhe Chen (Inria, France, shizhe.chen@inria.fr), Dong Huang (Carnegie Mellon University, USA, donghuang@cmu.edu)
Abstract
The growing number of action classes has posed a new challenge for video understanding, making Zero-Shot Action Recognition (ZSAR) a thriving direction. The ZSAR task aims to recognize target (unseen) actions without training examples by leveraging semantic representations to bridge seen and unseen actions. However, due to the complexity and diversity of actions, it remains challenging to semantically represent action classes and transfer knowledge from seen data. In this work, we propose an ER-enhanced ZSAR model inspired by an effective human memory technique, Elaborative Rehearsal (ER), which involves elaborating a new concept and relating it to known concepts. Specifically, we expand each action class as an Elaborative Description (ED) sentence, which is more discriminative than a class name and less costly than manually defined attributes. Besides directly aligning class semantics with videos, we incorporate objects from the video as Elaborative Concepts (EC) to improve video semantics and
Figure 1: Attributes and word embeddings are insufficient to semantically represent action classes. Our Elaborative Rehearsal approach defines actions by Elaborative Descriptions (EDs) and associates videos with Elaborative Concepts (ECs, known concepts detected from the video), which improve video semantics and generalization of video-action association for ZSAR. (Distinct markers in the figure denote videos, seen actions, unseen actions (◦), and ECs.)