Self-supervised learning
Vision: SimCLR (Chen et al., 2020); MoCo (He et al., 2020)
Vision+Language: MIL-NCE (Miech, Alayrac et al., 2020); VideoBERT (Sun et al., 2019)
Vision+Audio: XDC (Alwassel et al., 2020); L3 (Arandjelovic and Zisserman, 2017); GDT (Patrick et al., 2020); DaveNet (Harwath et al., 2018); Sound of Pixels (Zhao et al., 2018)
Outline of the talk
01 Multimodal Versatile Networks
Motivation, MMV Model, Versatility checklist, Video Network Deflation, Potential applications
02 BraVe: Broaden your views for self-supervised learning
Narrow and broad views, Main idea, Motivation, Research questions, Evaluation
Motivation
Research questions:
Are three modalities better than two for downstream tasks?
Are there natural requirements for such a multimodal network?
Self-supervised learning on modalities naturally present in videos: Vision, Audio and Language.
Main Idea
This is an “old” idea: DeViSE (Frome et al., NeurIPS 2013) and WSABIE (Weston et al., IJCAI 2011).
[Figure: positive pairs are formed within the same video (e.g., a clip and its narration “Play the guitar” or “Cut the onion”); negative pairs are formed across different videos (Video 1 vs. Video 2).]
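As a minimal illustration of this positive/negative pairing (a sketch, not the paper's code; function and variable names are assumptions), a cross-modal contrastive (NCE) loss in PyTorch could look like this:

```python
import torch
import torch.nn.functional as F

def cross_modal_nce(video_emb, audio_emb, temperature=0.07):
    """Contrastive (NCE) loss between two modalities.

    Positive pairs: embeddings coming from the same video (same row index).
    Negative pairs: all cross-video combinations within the batch.
    Both inputs are [batch, dim] and assumed already projected to a shared space.
    """
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                      # [batch, batch] similarities
    targets = torch.arange(v.shape[0], device=v.device)   # diagonal entries are positives
    # Symmetric loss: video -> audio and audio -> video.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```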
Which pretraining datasets?
HowTo100M: 1M videos, 100M clips, 20K tasks; text obtained from ASR.
(HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, Miech, Zhukov, Alayrac et al., ICCV 2019)
AudioSet: 2M videos (with audio tracks); we do not extract text for this dataset.
(Audio Set: An ontology and human-labeled dataset for audio events, Gemmeke et al., ICASSP 2017)
Versatility checklist
1. Ingest any modality: takes as input any of the three modalities.
2. Specificity: respects the specificity of modalities.
3. Compare modalities: enables the different modalities to be easily compared.
4. Transfer to images: efficiently applicable to visual data in the form of videos or images.
Embedding graph design: Fine and Coarse
Intuition: audio is more fine grained (e.g., the multiple sounds a guitar can make) whereas text is more coarse (a single word for guitar) ⇒ the Fine and Coarse (FAC) design:
✓ enables the different modalities to be easily compared
✓ has the best results in several downstream tasks
✓ respects the specificity of modalities
[Figure: the FAC embedding graph, with a fine visual-audio space and a coarse visual-audio-text space.]
Self-supervised Multi-Modal Versatile Networks, NeurIPS 2020
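To make the embedding graph concrete, here is a minimal PyTorch sketch of the FAC projection heads (an illustration under assumed module names and dimensions, not the released model): vision and audio meet in the fine space, and the extra va→vat head carries them into the coarse space shared with text.

```python
import torch
import torch.nn as nn

class FACHeads(nn.Module):
    """Sketch of Fine-And-Coarse (FAC) projection heads.

    Vision and audio are compared in a fine-grained space (S_va); text is
    compared with them in a coarser space (S_vat), reached by projecting the
    fine embeddings through an extra va->vat head. The video/audio/text
    backbones producing v_feat, a_feat, t_feat are assumed to exist.
    """
    def __init__(self, dim_v, dim_a, dim_t, dim_fine=512, dim_coarse=256):
        super().__init__()
        self.v_to_va = nn.Linear(dim_v, dim_fine)         # video -> fine space
        self.a_to_va = nn.Linear(dim_a, dim_fine)         # audio -> fine space
        self.va_to_vat = nn.Linear(dim_fine, dim_coarse)  # fine  -> coarse space
        self.t_to_vat = nn.Linear(dim_t, dim_coarse)      # text  -> coarse space

    def forward(self, v_feat, a_feat, t_feat):
        v_fine, a_fine = self.v_to_va(v_feat), self.a_to_va(a_feat)
        v_coarse, a_coarse = self.va_to_vat(v_fine), self.va_to_vat(a_fine)
        t_coarse = self.t_to_vat(t_feat)
        return (v_fine, a_fine), (v_coarse, a_coarse, t_coarse)
```

The vision-audio contrastive loss is then applied on the fine embeddings and the vision-text loss on the coarse ones.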
Versatility checklist
1. Ingest any modality: takes as input any of the three modalities.
2. Specificity: respects the specificity of modalities.
3. Compare modalities: enables the different modalities to be easily compared.
4. Transfer to images: efficiently applicable to visual data in the form of videos or images.
Network Deflation
Motivation: most prior work learns from images first and then applies the models to video.
Goal: we train our models on video and apply them efficiently to image inputs.
A standard solution: inflate the input (repeat the image into a static clip fed to the video network). Proposed solution: deflate the video network so it can ingest single images directly.
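A minimal sketch of one way such a deflation can be implemented, assuming it amounts to collapsing the temporal dimension of each 3D convolution kernel (the learned corrections to normalisation layers are omitted, and all names are illustrative):

```python
import torch
import torch.nn as nn

def deflate_conv3d(conv3d: nn.Conv3d) -> nn.Conv2d:
    """Turn a video (3D) convolution into an image (2D) one by summing its
    kernel over time, so a single image can be processed directly instead of
    being inflated into a static clip. Assumes groups=1 and numeric padding.
    """
    out_c, in_c, kt, kh, kw = conv3d.weight.shape
    conv2d = nn.Conv2d(in_c, out_c, kernel_size=(kh, kw),
                       stride=conv3d.stride[1:], padding=conv3d.padding[1:],
                       bias=conv3d.bias is not None)
    with torch.no_grad():
        conv2d.weight.copy_(conv3d.weight.sum(dim=2))  # collapse the time axis
        if conv3d.bias is not None:
            conv2d.bias.copy_(conv3d.bias)
    return conv2d
```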
Text to video retrieval
[Figure: top three retrieved videos (Rank 1, Rank 2, Rank 3) for the input text “pour some oil into a hot pan”.]
Text to audio retrieval in the coarse space
Even though the link between audio and text was never explicit during training, we can use the FAC architecture to perform text to audio retrieval.
To do so, the audio samples are first embedded in the joint visual-audio (fine) space.
Then the va→vat projection head is used to project the audio embeddings into the joint visual-audio-text (coarse) space.
Given a text input query, we simply embed it into the joint space and retrieve the closest audio embedding.
[Figure: audio samples pass through a ResNet50 backbone and the FAC projection heads, and are compared with the embedded input text query.]
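A hedged sketch of this retrieval path, reusing the hypothetical FACHeads module from the earlier sketch (an illustration of the steps above, not the released code):

```python
import torch
import torch.nn.functional as F

def text_to_audio_retrieval(text_feat, audio_fine_embs, fac_heads, top_k=5):
    """Zero-shot text -> audio retrieval through the coarse (vat) space.

    audio_fine_embs: [N, dim_fine] audio samples already embedded in the fine
                     visual-audio space (audio backbone + a->va head).
    text_feat:       [dim_t] raw feature of the text query.
    fac_heads:       FACHeads module from the earlier sketch (assumed).
    """
    # Project audio from the fine (va) space into the coarse (vat) space.
    audio_coarse = F.normalize(fac_heads.va_to_vat(audio_fine_embs), dim=-1)
    # Embed the text query directly into the coarse space.
    text_coarse = F.normalize(fac_heads.t_to_vat(text_feat), dim=-1)
    # Cosine similarity and top-k closest audio samples.
    scores = audio_coarse @ text_coarse
    return torch.topk(scores, k=top_k).indices
```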
However...
Most available videos do not contain narrations.
Using negatives for self-supervision is expensive, as it requires training with large batch sizes.
Our training misses larger context, as the views of the data cover at most 3 seconds.
Outline of the talk
01 Multimodal Versatile Networks
Motivation, MMV Model, Versatility checklist, Video Network Deflation, Potential applications
02 BraVe: Broaden your views for self-supervised learning
Narrow and broad views, Main idea, Motivation, Research questions, Evaluation
Motivation
Goal: learn good representations by regressing a broad representation of the video.
BraVe learns strong video representations because the narrow view needs to predict the representation of the whole video clip (the broad view).
We use separate backbones to process the two views, as they perform different tasks. This enables using different augmentations/modalities in each view.
Flow or alternative representations of the video can provide a strong signal for learning.
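As a rough sketch of the kind of predict-the-broad-view objective this describes, assuming a BYOL-style normalised regression without negatives (names are illustrative; the projector/predictor networks and the symmetric broad→narrow term are left out):

```python
import torch
import torch.nn.functional as F

def brave_regression_loss(narrow_pred, broad_target):
    """Narrow view predicts the broad view's representation.

    narrow_pred:  [batch, dim] prediction computed from the narrow view.
    broad_target: [batch, dim] representation of the broad view, treated as a
                  fixed target (hence the stop-gradient via detach()).
    """
    p = F.normalize(narrow_pred, dim=-1)
    z = F.normalize(broad_target.detach(), dim=-1)
    # Equivalent to the squared L2 distance between unit-norm vectors.
    return (2 - 2 * (p * z).sum(dim=-1)).mean()
```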
Research Questions
1. Importance of the broad view
2. Modality in the broad view
3. Weight sharing across views
4. Syncing the narrow and broad views
Broaden Your Views for Self-Supervised Video Learning. Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, Andrew Zisserman. arXiv 2021.
Conclusions
Videos are a rich source of self-supervision for
video, audio and image models.
Both MMV and BraVe achieve SoTA results for self-supervised learning on several downstream tasks.
Using audio, text or larger video context provides useful self-supervisory signals.