Long term feature banks for detailed video understanding (Action Recognition)

Susang Kim(healess1@gmail.com)
Video Understanding(3)
Long-Term Feature Banks for Detailed Video Understanding

Long-term feature bank (CVPR 2019)
A paper presented by FAIR at CVPR 2019.
A 3D CNN normally sees only short clips of 2-5 seconds;
by adding a non-local operator and a feature bank
for long-term features, the model achieves SOTA on
AVA, EPIC-Kitchens, and Charades.
(A heavy model that prioritizes accuracy:
about 2x the parameters of an existing 3D CNN.)
Long-term feature bank: supportive information
extracted over the entire span of a video.

long-term feature banks for detailed video understanding

AVA Dataset
https://research.google.com/ava/index.html

The difficulties of AVA Dataset
Dense Atomic action labels
Identify 80 basic human actions, localize in time
and space, wherever they appear in video
Multiple people performing multiple actions
Context can’t “solve” the problem
- birthday cake ⇏ blowing out candles
※ AVA-Kinetics Challenge : https://research.google.com/ava/challenge.html

EPIC-Kitchens
https://epic-kitchens.github.io/2020-100.html
Original Sequences (+RGB and Flow Frames): Available at Data.Bris servers (1.1TB zipped)
45 kitchens - 4 cities
Head-mounted camera
100 hours of recording - Full HD
20M frames
Multi-language narrations
90K action segments
20K unique narrations
90 verb classes, 300 noun classes
6 challenges

http://actionrecognition.net/files/dsetdetail.php?did=12;
Rank on the AVA dataset (LFB vs. SlowFast)
The approach differs depending on the test data

Skeleton-Based Action Recognition
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition http://dahualin.org/publications/dhl18_stgcn.pdf
The preprocessing needs to be defined
according to the nature of the data

Long Term Feature Bank
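Reasons about complex situational changes over a long time span,
including spatial information and temporal change, starting from the information at the current time step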

What is a Memory Network? (Recap)
Memory + Inference
Embedding (BoW) (I) -> Story (G) -> Answer (H)
Memory (stories):
- k=1: I am hungry right now and I am in Pangyo
- k=2: My company is POSCO ICT
- k=3: I want to order pizza
Question: Where are you now? / What are you going to order?
Reason: argmax G(q, s), select a story s based on the question q
Answer set: [pizza, Pangyo, Seoul, Yatap]
Answer: argmax H(s, a), select an answer a based on s -> Pangyo / pizza
End-To-End Memory Networks https://arxiv.org/pdf/1503.08895.pdf
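The recap above can be sketched as the soft-attention hop used in End-To-End Memory Networks: the question embedding attends over the memory, reads a weighted sum, and is updated before the next hop. This is a minimal NumPy sketch, not the authors' code. The function name `memory_hops`, the random embeddings, and sharing the same memory embeddings across all hops are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def memory_hops(u, m, c, hops=3):
    """Run `hops` memory reads.

    u: (d,) question embedding
    m: (n, d) input memory embeddings, one row per stored sentence
    c: (n, d) output memory embeddings for the same sentences
    Assumption: m and c are shared across hops.
    """
    attention = []
    for _ in range(hops):
        p = softmax(m @ u)   # match the question against every memory slot
        o = p @ c            # weighted sum of the output embeddings
        u = u + o            # updated controller state for the next hop
        attention.append(p)
    return u, attention

# Toy sizes: 3 stored sentences, 4 candidate answers (random, untrained values).
rng = np.random.default_rng(0)
n_sentences, d, n_answers = 3, 8, 4
m = rng.normal(size=(n_sentences, d))
c = rng.normal(size=(n_sentences, d))
u0 = rng.normal(size=d)
W = rng.normal(size=(n_answers, d))   # hypothetical answer projection

u, attention = memory_hops(u0, m, c, hops=3)
answer_probs = softmax(W @ u)         # distribution over the answer set
```

With trained embeddings, the attention weights would concentrate on the sentence that answers the question (for example the k=1 sentence for "Where are you now?").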

Visualization of FBO Module
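Visualizes how specific features relate to each other as the frames change.
The FBO (Feature Bank Operator) is written FBO(S_t, L~_t), where L~_t is a window taken from the long-term features (L).
L~_t = [L_{t-w}, ..., L_{t+w}] has size 2w+1.
batch (entire video: full length), causal (online: 2w+1)
Short-term (S): computes RoI features with a 3D CNN
(ResNet-50, pre-trained on ImageNet),
using average pooling along the time axis and
RoIAlign (Mask R-CNN) over space.
3D CNN backbone => input: H x W x 3 x 32 (frames) / output: 16 x H/16 x W/16 x 2048
⇒ Both L and S are extracted with the architecture above.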
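The window of size 2w+1 and the pooling variants of the FBO can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the bank is a Python list with one (N_t, d) array of RoI features per time step, the names `window_indices` and `fbo_pool` are hypothetical, and the causal window is assumed to cover [t-2w, t] so that it keeps the same size of 2w+1 using only past steps.

```python
import numpy as np

def window_indices(t, w, num_steps, causal=False):
    """Time steps in the window around t, clipped to the video length.

    Centered window: [t-w, t+w]. Causal window (assumption): [t-2w, t].
    """
    lo = t - (2 * w if causal else w)
    hi = t if causal else t + w
    return [i for i in range(lo, hi + 1) if 0 <= i < num_steps]

def fbo_pool(s_t, bank, t, w, mode="avg"):
    """Pool the long-term window and concatenate it to each short-term RoI feature.

    s_t:  (n_rois, d) short-term features at time t
    bank: list of (N_i, d) arrays, one per time step
    """
    idx = window_indices(t, w, len(bank))
    window = np.concatenate([bank[i] for i in idx], axis=0)   # (N, d)
    pooled = window.mean(axis=0) if mode == "avg" else window.max(axis=0)
    tiled = np.tile(pooled, (s_t.shape[0], 1))                # repeat for every RoI
    return np.concatenate([s_t, tiled], axis=1)               # (n_rois, 2d)

# Toy bank: 10 time steps, 2 RoIs each, every feature equal to the step index.
bank = [np.full((2, 4), float(i)) for i in range(10)]
s_t = np.zeros((3, 4))
out_avg = fbo_pool(s_t, bank, t=5, w=2, mode="avg")   # window covers steps 3..7
out_max = fbo_pool(s_t, bank, t=5, w=2, mode="max")
```

The concatenated output would then go to a classifier; that step is left out here.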

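RoIAlign (Mask R-CNN) (Recap)
https://arxiv.org/pdf/1703.06870.pdf
With RoI Pooling, the quantization error is tolerable for the object detection task (IoU)
But for the segmentation task, which classifies at the pixel level, the error becomes large
RoIAlign therefore computes the values with bilinear interpolation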
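The difference between quantized sampling and bilinear interpolation can be shown on a single sampling point. This is a minimal sketch of the interpolation step only, not of a full RoIAlign layer (no bins, no pooling over sampling points). The function name `bilinear_sample` and the toy feature map are assumptions.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Read a 2-D feature map at a fractional (y, x) location."""
    h, w = feat.shape
    y = min(max(y, 0.0), h - 1.0)          # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feat[y0, x0] + dx * feat[y0, x1]
    bottom = (1 - dx) * feat[y1, x0] + dx * feat[y1, x1]
    return (1 - dy) * top + dy * bottom

feat = np.arange(16, dtype=float).reshape(4, 4)   # feat[y, x] = 4*y + x
interpolated = bilinear_sample(feat, 1.5, 2.25)   # uses the 4 neighbouring cells
quantized = feat[int(1.5), int(2.25)]             # coordinates truncated to a cell
```

On this linear toy map the interpolated value is 8.25, while truncating the coordinates reads 6.0; that gap is the kind of misalignment that matters for pixel-level tasks.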

Modified Non-Local block
A non-local block that applies the concept of self-attention
The FBO can also be implemented with avg/max pooling
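A non-local style FBO can be read as attention in which the short-term RoI features are the queries and the long-term window supplies keys and values. The sketch below is a simplified NumPy illustration, not the block from the paper or the repository: the name `fbo_nl`, the random projection matrices, and the omission of normalization and dropout layers are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fbo_nl(s, l, wq, wk, wv, wo):
    """One simplified non-local block.

    s: (n_s, d) short-term RoI features (queries)
    l: (n_l, d) long-term window features (keys and values)
    wq, wk, wv: (d, k) projections; wo: (k, d) output projection
    """
    q, k, v = s @ wq, l @ wk, l @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=1)   # (n_s, n_l)
    return s + (attn @ v) @ wo, attn                        # residual connection

rng = np.random.default_rng(0)
d, k = 16, 8
s = rng.normal(size=(3, d))      # 3 RoIs in the current clip
l = rng.normal(size=(20, d))     # 20 RoI features inside the long-term window
wq, wk, wv = (rng.normal(size=(d, k)) for _ in range(3))
wo = rng.normal(size=(k, d))

out, attn = fbo_nl(s, l, wq, wk, wv, wo)
```

Each row of `attn` shows how strongly one current RoI attends to every long-term feature, which is the quantity a visualization of the FBO module would display.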

Non-local Neural Networks (CVPR 2018)
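Computes the similarity between x_i and x_j
Addresses the inefficiency of building a wide receptive field out of local operations
Brings a large performance gain when extracting temporal features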
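For reference, the similarity computation between x_i and x_j can be written out as the generic non-local operation from the Non-local Neural Networks paper, shown here with the embedded Gaussian form of f. The symbols \theta, \phi, g, and W_z denote learned linear embeddings.

```latex
y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j),
\qquad
f(x_i, x_j) = \exp\!\left(\theta(x_i)^{\top} \phi(x_j)\right),
\qquad
\mathcal{C}(x) = \sum_{\forall j} f(x_i, x_j)

z_i = W_z\, y_i + x_i
```

With this choice of f and C(x), the weights over j form a softmax, which is why the block matches self-attention. The sum runs over all positions j, so the output at position i is not limited to a local neighbourhood.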

Self Attention related to Non-Local
A non-local algorithm for image denoising
(Non-local Means Filter, NL-means filter)

Implementation Details
Person detector: Faster R-CNN (ResNeXt-101-FPN), pre-trained on ImageNet and fine-tuned on AVA bounding boxes
Temporal sampling: one clip per second (3D CNN input of 32 frames, sampled with stride 2 over a 63-frame span)
Hyperparameters: SGD, minibatch size = 16 clips on 8 GPUs, 140k iterations,
learning rate = 0.04, reduced to 10% at 100k and 120k iterations
Data augmentation: random flipping, scaling, and cropping (224x224)
Inference: detection score >= 0.85 / 256x256 crop (256 pixels) / RoIAlign
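The temporal sampling and learning-rate schedule listed on this slide reduce to a little arithmetic, sketched below. The function names are hypothetical. Centering the clip on a keyframe is an assumption, as is reading the decay as a multiplication by 0.1 at 100k and again at 120k iterations. Frame indices near the start of a video can come out negative and would need clamping in practice.

```python
def clip_frame_indices(center, num_frames=32, stride=2):
    """Frame indices for one clip: num_frames frames, every `stride`-th frame.

    32 frames with stride 2 span (32 - 1) * 2 + 1 = 63 raw frames.
    Assumption: the clip is centered on the frame index `center`.
    """
    span = (num_frames - 1) * stride + 1
    start = center - span // 2
    return [start + i * stride for i in range(num_frames)]

def learning_rate(iteration, base_lr=0.04, steps=(100_000, 120_000), gamma=0.1):
    """Step schedule: multiply the rate by `gamma` at each step boundary passed."""
    drops = sum(iteration >= s for s in steps)
    return base_lr * gamma ** drops

indices = clip_frame_indices(center=100)
covered = indices[-1] - indices[0] + 1    # raw frames spanned by the clip
lr_start = learning_rate(0)
lr_mid = learning_rate(110_000)
lr_end = learning_rate(139_999)
```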

Comparison to prior work
A 3D CNN using RGB alone outperforms the other models (optical flow, ensembles)

Codes (FBO - NL / AVG / MAX)
https://github.com/facebookresearch/video-long-term-feature-banks/blob/master/lib/models/lfb_helper.py

Charades dataset Experiments
https://prior.allenai.org/projects/charades
Charades is a dataset composed of 9848 videos of daily
indoor activities collected through Amazon Mechanical
Turk.
267 different users were presented with a sentence that
includes objects and actions from a fixed vocabulary,
and they recorded a video acting out the sentence (like
in a game of Charades).
The dataset contains 66,500 temporal annotations for 157
action classes, 41,104 labels for 46 object classes, and
27,847 textual descriptions of the videos. This work was
presented at ECCV 2016.
On the Charades dataset, LFB NL achieves the best performance

Temporal Support
Performance comparison by window size (L = 2w+1)
Time per dataset
- AVA 2m
- EPIC-Kitchens 15~60s
- Charades ~30s
In most cases performance is good at
10 seconds or more (long-term)

Example Predictions
Change in accuracy with the window size (4 to 10 seconds): the longer the window, the higher the accuracy

AVA-Kinetics Challenge 2020 (CVPR 2020)
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization https://arxiv.org/pdf/2006.07976.pdf
Improves accuracy by reasoning about actor-actor
relations and actor-context relations
(Actor-Context Feature Bank)
by SenseTime

Thanks
Any Questions?
You can send mail to
Susang Kim(healess1@gmail.com)
