[DL Reading Group] Bridge-Prompt: Toward Ordinal Action Understanding in Instructional Videos

Deep Learning JP

2023/3/3 Deep Learning JP http://deeplearning.jp/seminar-2/

DEEP LEARNING JP
[DL Papers]
Bridge-Prompt: Toward Ordinal Action Understanding in Instructional Videos (CVPR 2022)
Yoshifumi Seki
http://deeplearning.jp/

Bibliographic information
● Venue
○ CVPR 2022
● Authors
○ Tsinghua University
● Reason for selection
○ I have recently been working on action analysis from video
https://github.com/ttlmh/Bridge-Prompt

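Background and goal
● We want to do action analysis from video well
● Actions have continuity: they unfold as ordered steps
○ e.g. drinking water
■ hold a cup -> pour water -> drink the water
○ e.g. eating bread
■ spread butter -> spread jam -> eat the bread
● We want to build this continuity into the model
○ Several graph-based models have appeared recently, but they cannot handle unseen labels
● Use prompt engineering to exploit the strengths of large language models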

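What is prompt engineering?
● The idea of inserting a given input (e.g. label information) into a template so that it is fed to the model as a well-formed sentence, which lets the task benefit from a large language model
● Adopted as the mechanism behind few-shot learning in GPT-3
● Used by OpenAI's CLIP for text-image matching in image classification
● Applied to video by ActionCLIP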

ActionCLIP
● Generates sentences from labels via prompt engineering, then estimates the label by measuring the similarity between the Text Encoder and Video Encoder outputs
https://arxiv.org/abs/2109.08472
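
The label-estimation flow on this slide can be sketched roughly as follows. This is only a minimal illustration, not ActionCLIP's implementation: the template wording, the toy vectors standing in for encoder outputs, and the function names `label_to_prompt` and `predict_label` are all assumptions made for the example.

```python
import math

# Hypothetical template; the slides do not give ActionCLIP's actual wording.
TEMPLATE = "a video of a person {label}"


def label_to_prompt(label):
    """Turn a class label into a sentence by filling the template."""
    return TEMPLATE.format(label=label)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for text encoder outputs (made-up numbers, not real embeddings).
TOY_TEXT_VECTORS = {
    label_to_prompt("drinking water"): [0.9, 0.1, 0.0],
    label_to_prompt("eating bread"): [0.1, 0.9, 0.1],
}


def predict_label(video_vector, labels):
    """Pick the label whose prompt vector is most similar to the video vector."""
    scores = {
        label: cosine_similarity(video_vector, TOY_TEXT_VECTORS[label_to_prompt(label)])
        for label in labels
    }
    return max(scores, key=scores.get), scores


# A made-up video vector standing in for the video encoder output.
best_label, scores = predict_label([0.8, 0.2, 0.1], ["drinking water", "eating bread"])
print(best_label)
```

Because classification is a similarity lookup against sentences rather than a fixed output layer, a new label only needs a new prompt, which is the property the later unseen-label experiment relies on.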

Prompt details
● 1. Statistical Prompt
○ How many actions the video contains
○ The video has {num} actions.
● 2. Ordinal Prompt
○ Which action in the sequence this is
○ This is the {ord_i} action in the video.
● 3. Semantic Prompt
○ “{ord_i}, the person is performing the action step of {vp_i}”
● 3+1. Integrated Prompt
○ Everything combined
○ All Semantic Prompts lined up as sentences
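
As a rough illustration of how these templates could be filled in, here is a small sketch. The three template strings come from the slide; the ordinal word list, the capitalization in the Semantic Prompt, the punctuation used to join the Integrated Prompt, and all function names are assumptions for the example and may differ from the authors' code.

```python
# Assumed ordinal words for {ord_i}; the slide does not spell them out.
ORDINALS = ["first", "second", "third", "fourth", "fifth"]


def statistical_prompt(actions):
    """Statistical Prompt: how many actions the video contains."""
    return f"The video has {len(actions)} actions."


def ordinal_prompt(i):
    """Ordinal Prompt: the position of the i-th action (0-based index)."""
    return f"This is the {ORDINALS[i]} action in the video."


def semantic_prompt(i, verb_phrase):
    """Semantic Prompt: position plus the verb phrase {vp_i} of the action."""
    return f"{ORDINALS[i].capitalize()}, the person is performing the action step of {verb_phrase}"


def integrated_prompt(actions):
    """Integrated Prompt: all Semantic Prompts lined up as sentences (assumed joiner)."""
    return ". ".join(semantic_prompt(i, vp) for i, vp in enumerate(actions)) + "."


# The drinking-water example from the background slide.
actions = ["holding a cup", "pouring water", "drinking water"]

print(statistical_prompt(actions))
for i, vp in enumerate(actions):
    print(ordinal_prompt(i))
    print(semantic_prompt(i, vp))
print(integrated_prompt(actions))
```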

Evaluation datasets
● 50Salads: 50 top-view 30-fps instructional videos of salad preparation
○ 19 kinds of actions
● Georgia Tech Egocentric Activities (GTEA): 28 egocentric 15-fps instructional videos of daily kitchen activities
○ 74 classes of actions
● Breakfast: 1,712 third-person 15-fps videos of breakfast preparation activities
○ 48 different types of actions

Implementation
● Videos are split into 16-frame segments
● Pre-training is done on Kinetics-400 using ActionCLIP
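
The 16-frame splitting can be pictured with the sketch below. The slide only states the window length, so the non-overlapping layout, the handling of the final partial window, and the function name `split_into_windows` are assumptions; the actual code may sample frames differently (for example with a stride).

```python
def split_into_windows(num_frames, window_size=16):
    """Split frame indices 0..num_frames-1 into consecutive windows.

    Assumption: windows do not overlap and a shorter final window is kept.
    """
    return [
        list(range(start, min(start + window_size, num_frames)))
        for start in range(0, num_frames, window_size)
    ]


windows = split_into_windows(40)
print([len(w) for w in windows])
```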

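Handling unseen IDs
● When only specific actions are learned during fine-tuning, can the model estimate similar actions?
○ coffee2tea: fine-tune only on making coffee, then check whether making tea is predicted correctly
○ AKL is the overall accuracy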

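Summary and impressions
● I learned for the first time that prompt engineering is being used outside NLP, which was instructive
● Unfortunately, these experiments did not make it very clear what introducing the ordering actually contributes
● Handling unseen IDs is impressive, but I question whether this experimental setup is an appropriate way to measure it
● I would have liked to read more from the results about how the method differs from existing models
○ Accuracy alone does not show what exactly has improved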

5. CLIP (ICML 2021), from the 2021/1/15 presentation

6. CLIP (ICML 2021), from the 2021/1/15 presentation

8. Proposed method

9. Overview of the proposed method

15. Comparison on long-term videos

17. Comparison and examination of fusion modules
