15/05/2021
Multi-modal self-supervised
learning from videos
Adrià Recasens Continente
DeepMind
We learn from the world through multimodal experience
[...] towards the root and try to get as close to the root as possible, nice long strokes [...]
Success of supervised learning
Pose estimation
[Towards Accurate Multi-person Pose Estimation in
the Wild, Papandreou, Zhu, Kanazawa, Toshev,
Tompson, Bregler and Murphy, CVPR17]
Image Segmentation
[Mask R-CNN, He, Gkioxari, Dollár, and Girshick, ICCV17]
Supervised learning
Labels are expensive
Agreement: definition? Granularity?
Even more problematic for videos
Self-supervised learning
Vision Vision+Language Vision+Audio
SimCLR: Chen et al., 2020
MOCO: He et al., 2020
XDC: Alwassel et al., 2020
L3: Arandjelovic and Zisserman, 2017
GDT: Patrick et al., 2020
MIL-NCE: Miech, Alayrac et al., 2020
VideoBERT: Sun et al., 2019
DaveNet: Harwath et al., 2018
Sound of Pixels: Zhao et al., 2018
Outline of the talk
01 Multimodal Versatile Networks
Motivation
MMV Model
Versatility checklist
Video Network Deflation
Potential applications
02 BraVe: Broaden your views for self-supervised learning
Narrow and broad views
Main idea
Motivation
Research questions
Evaluation
1 Multi-modal
versatile
networks
Motivation
Research questions:
Are three modalities better than two for downstream tasks?
Are there natural requirements for such a multimodal network?
Self-supervised learning on modalities naturally present in videos:
Vision, Audio and Language
Positive pairs
This is an “old” idea: DeViSE, Frome et al., NeurIPS 2013 and WSABIE, Weston et al., IJCAI 2011.
“Play the guitar” “Cut the onion”
Negative pairs
Main Idea
Video 1 Video 2
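The positive/negative pair idea can be sketched as a softmax-style contrastive (NCE) loss. This is a minimal illustration, not the training code behind the talk: the function names, the toy 2-D embeddings and the temperature value are all assumptions made for the example.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def contrastive_loss(video_embs, text_embs, temperature=0.1):
    """Average NCE-style loss over a batch.

    video_embs[i] and text_embs[i] come from the same clip (positive pair);
    every other pairing in the batch is treated as a negative pair.
    """
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    total = 0.0
    for i, v in enumerate(videos):
        logits = [dot(v, t) / temperature for t in texts]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(videos)

# Toy batch (hypothetical values): clip 0 ~ "play the guitar", clip 1 ~ "cut the onion".
video_embs = [[1.0, 0.1], [0.1, 1.0]]
aligned_text = [[0.9, 0.2], [0.2, 0.9]]
swapped_text = [aligned_text[1], aligned_text[0]]

loss_aligned = contrastive_loss(video_embs, aligned_text)
loss_swapped = contrastive_loss(video_embs, swapped_text)
print(loss_aligned, loss_swapped)
```

The loss is small when each video embedding sits closest to its own text and large when the captions are swapped, which is the behaviour the positive and negative pairs are meant to enforce.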
Which pretraining datasets?
MOCO: He et al, 2020
GDT: Patrick et al., 2020
MIL-NCE: Miech, Alayrac et al, 2020
HowTo100M: 1M videos, 100M clips, 20K tasks, text obtained from ASR.
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, Miech, Zhukov, Alayrac et al., ICCV19
MOCO: He et al, 2020
MIL-NCE: Miech, Alayrac et al, 2020
AudioSet: 2M videos (with audio tracks), we do not extract text for this dataset.
Audio Set: An ontology and human-labeled dataset for audio events, Gemmeke et al. ICASSP 2017
Versatility checklist
1. Ingest any modality: takes as input any of the three modalities.
2. Specificity: respects the specificity of modalities.
3. Compare modalities: enables the different modalities to be easily compared.
4. Transfer to images: efficiently applicable to visual data in the form of videos or images.
Embedding graph design
Fine and Coarse
Intuition: audio is more fine grained (e.g., multiple sounds of guitar) whereas text is more coarse (a single
word for guitar) ⇒ The Fine and Coarse design:
✓ enables the different modalities to be easily compared
✓ has the best results in several downstream tasks
✓ respects the specificity of modalities
Fine Space
Coarse Space
Self-supervised Multi-Modal Versatile Networks, NeurIPS 2020
Do more modalities help?
State-of-the-art comparison
Network Deflation
Motivation:
Most prior work learns from images first and then applies the models to video.
Goal:
We train our models on video and apply them efficiently to image inputs.
A standard solution: inflated input. Proposed solution: deflated network.
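To make the inflated-input versus deflated-network contrast concrete, here is a minimal sketch for a single 3D convolution filter. It assumes deflation is done by summing the filter over its temporal axis, and it ignores temporal padding and normalization layers. The helper names and toy sizes are illustrative and not taken from the talk.

```python
def conv2d_valid(image, kernel):
    """2D cross-correlation without padding; inputs are nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[y + i][x + j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]

def conv3d_valid(video, kernel):
    """3D cross-correlation without padding; video is [time][height][width]."""
    kt = len(kernel)
    out_t = len(video) - kt + 1
    frames = []
    for t in range(out_t):
        # Sum the 2D responses of each temporal slice of the filter.
        partial = [conv2d_valid(video[t + dt], kernel[dt]) for dt in range(kt)]
        frames.append([
            [sum(p[y][x] for p in partial) for x in range(len(partial[0][0]))]
            for y in range(len(partial[0]))
        ])
    return frames

def deflate(kernel3d):
    """Collapse a [time][kh][kw] filter into [kh][kw] by summing over time."""
    kh, kw = len(kernel3d[0]), len(kernel3d[0][0])
    return [[sum(k[i][j] for k in kernel3d) for j in range(kw)] for i in range(kh)]

image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel3d = [
    [[1.0, 0.0], [0.0, -1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.0, 2.0], [1.0, 0.0]],
]

# Inflated input: repeat the image to form a static 3-frame video.
inflated_output = conv3d_valid([image] * 3, kernel3d)[0]
# Deflated network: apply the time-summed 2D filter directly to the image.
deflated_output = conv2d_valid(image, deflate(kernel3d))

print(inflated_output)
print(deflated_output)
```

Both paths give the same response on a static image, but the deflated filter processes one frame instead of three, which is where the efficiency gain comes from in this toy setting.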
MULTIMODAL VERSATILE NETWORKS
Potential
Applications
Audio to video
Rank 1 Rank 2 Rank 3
Text to video
Input text: “add fresh chopped tomatoes and stir”
Rank 1 Rank 2 Rank 3
Text to video
Input text: “pour some oil into a hot pan”
Rank 1 Rank 2 Rank 3
Text to audio retrieval in the coarse space
ResNet50
Even though the link between audio and text was never explicit during training, we can use the FAC architecture to perform text to audio retrieval.
To do so, the audio samples are first embedded in the joint visual-audio (fine) space.
Then the va→vat projection head is used to project the audio embeddings into the joint visual-audio-text space (coarse).
Given a text input query, we simply embed it into the joint space and retrieve the closest audio embedding.
Input text: “airplane”, Rank 1
Input text: “chirping bird”, Rank 1
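The retrieval steps (embed audio in the fine space, project it with the va→vat head, compare with the text embedding in the coarse space) can be sketched as below. This is a toy illustration under assumptions: the projection is a plain matrix, the embeddings are hand-picked numbers, and the function and file names are hypothetical rather than part of any released model.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def retrieve_audio(text_coarse, audio_fine_embs, va_to_vat):
    """Rank audio clips for a text query already embedded in the coarse space."""
    # Step 1 (assumed done upstream): audio is embedded in the fine space.
    # Step 2: project the fine audio embeddings into the coarse space.
    audio_coarse = {
        name: matvec(va_to_vat, emb) for name, emb in audio_fine_embs.items()
    }
    # Step 3: rank clips by cosine similarity to the text embedding.
    return sorted(
        audio_coarse,
        key=lambda name: cosine(text_coarse, audio_coarse[name]),
        reverse=True,
    )

# Hypothetical stand-ins for learned quantities.
va_to_vat = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # 3-D fine -> 2-D coarse
audio_fine_embs = {
    "jet_engine.wav": [0.9, 0.1, 0.2],
    "birdsong.wav": [0.1, 0.8, 0.1],
}
text_queries = {"airplane": [1.0, 0.1], "chirping bird": [0.0, 1.0]}

ranking = {
    query: retrieve_audio(emb, audio_fine_embs, va_to_vat)
    for query, emb in text_queries.items()
}
print(ranking)
```

No audio-text pair is needed at any point: the comparison only works because both modalities end up in the same coarse space.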
Resources
Pretrained models available
TF-Hub: [S3D] [TSM-RN] [TSM-RNx2]
Models in JAX with action recognition downstream task!
However...
Most available videos do not contain narrations.
Using negatives for self-supervision is expensive, as it requires training with large batch sizes.
Our training misses larger context, as the views of the data cover at most 3 seconds.
2 BraVe: Broaden your views for self-supervised learning
Main Idea
Motivation
Goal: learn a good representation by regressing a broad representation of the video.
BraVe learns strong video representations because the narrow view must predict the representation of the whole video clip (the broad view).
We use separate backbones to process the two views, as they perform different tasks. This enables using different augmentations or modalities in each view.
Optical flow or other alternative representations of the video can provide a strong learning signal.
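The idea of a narrow view regressing the broad view's representation can be sketched as follows. This is a minimal illustration under assumptions: the normalized squared-error form of the loss, the function names and the toy vectors are chosen for the example and are not stated on the slides. In a real setup the two vectors would come from separate backbones, and the target would typically be held constant during the gradient step.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def regression_loss(prediction, target):
    """Squared distance between unit-normalized vectors (equals 2 - 2 * cosine)."""
    p = l2_normalize(prediction)
    t = l2_normalize(target)
    return sum((a - b) ** 2 for a, b in zip(p, t))

# Hypothetical embeddings: a prediction made from a short (narrow) clip,
# and broad-view targets computed from a longer span of video.
narrow_prediction = [0.6, 0.8]
broad_target_same_video = [3.0, 4.0]    # same direction as the prediction
broad_target_other_video = [-4.0, 3.0]  # orthogonal to the prediction

loss_same = regression_loss(narrow_prediction, broad_target_same_video)
loss_other = regression_loss(narrow_prediction, broad_target_other_video)
print(loss_same, loss_other)
```

Because the loss only compares a prediction with its own target, it needs no negative pairs, which avoids the large-batch cost that contrastive training incurs.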
Research Questions
1. Importance of the broad view
2. Modality in the broad view
3. Weight sharing across views
4. Syncing the narrow and broad views
Broaden Your Views for Self-Supervised Video Learning
Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, Andrew Zisserman. arXiv 2021.
Comparison to SoTA: video-only models
Comparison to SoTA: audio-visual models
Conclusions
Videos are a rich source of self-supervision for video, audio and image models.
Both MMV and BraVe achieve SoTA results for self-supervised learning on several downstream tasks.
Audio, text and larger video context are useful self-supervisory signals.
Thank
you!

Real Time Object Detection Using Open CV
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 

Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos

  • 1. 15/05/2021 Multi-modal self-supervised learning from videos Adrià Recasens Continente DeepMind
  • 2. We learn from the world through multimodal experience [...] towards the root and try to get as close to the root as possible, nice long strokes [...]
  • 3. Success of supervised learning. Pose estimation [Towards Accurate Multi-person Pose Estimation in the Wild, Papandreou, Zhu, Kanazawa, Toshev, Tompson, Bregler and Murphy, CVPR17]. Image segmentation [Mask R-CNN, He, Gkioxari, Dollár, and Girshick, ICCV17].
  • 4. Supervised learning: labels are expensive. Agreement: definition? Granularity?
  • 5. Supervised learning: labels are expensive. Even more problematic for videos.
  • 6. Self-supervised learning. Vision, Vision+Language, Vision+Audio. SimCLR: Chen et al., 2020; MoCo: He et al., 2020; XDC: Alwassel et al., 2020; L3: Arandjelovic and Zisserman, 2017; GDT: Patrick et al., 2020; MIL-NCE: Miech, Alayrac et al., 2020; VideoBERT: Sun et al., 2019; DaveNet: Harwath et al., 2018; Sound of Pixels: Zhao et al., 2018.
  • 7. Outline of the talk. 01 Multimodal Versatile Networks: Motivation; MMV Model; Versatility checklist; Video Network Deflation; Potential applications. 02 BraVe: Broaden your views for self-supervised learning: Narrow and broad views; Main idea; Motivation; Research questions; Evaluation.
  • 9. Motivation. Research questions: Are three modalities better than two for downstream tasks? Are there natural requirements for such a multimodal network? Self-supervised learning on modalities naturally present in videos: Vision, Audio and Language.
  • 10. Main idea: positive pairs and negative pairs (figure: Video 1, Video 2, “Play the guitar”, “Cut the onion”). This is an “old” idea: DeViSE, Frome et al., NeurIPS13, and WSABIE, Weston et al., IJCAI 2011.
  • 11. Which pretraining datasets? MoCo: He et al., 2020; GDT: Patrick et al., 2020; MIL-NCE: Miech, Alayrac et al., 2020. HowTo100M: 1M videos, 100M clips, 20K tasks, text obtained from ASR. [HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, Miech, Zhukov, Alayrac et al., ICCV19.] MoCo: He et al., 2020; MIL-NCE: Miech, Alayrac et al., 2020. AudioSet: 2M videos (with audio tracks); we do not extract text for this dataset. [Audio Set: An ontology and human-labeled dataset for audio events, Gemmeke et al., ICASSP 2017.]
  • 12. Versatility checklist: (1) Ingest any modality: takes as input any of the three modalities. (2) Specificity: respects the specificity of modalities. (3) Compare modalities: enables the different modalities to be easily compared. (4) Transfer to images: efficiently applicable to visual data in the form of videos or images.
  • 13. Embedding graph design: Fine and Coarse. Intuition: audio is more fine-grained (e.g., multiple sounds of a guitar) whereas text is coarser (a single word for guitar). The Fine and Coarse design, with a fine space and a coarse space: ✓ enables the different modalities to be easily compared; ✓ has the best results in several downstream tasks; ✓ respects the specificity of modalities. [Self-supervised Multi-Modal Versatile Networks, NeurIPS 2020.]
  • 18. Network Deflation. Motivation: most works learn from images first and then apply the models to video. Goal: we train our models on video and apply them efficiently to image inputs. A standard solution: inflated input. Proposed solution: deflated video network.
  • 21-22. Audio to video. Results shown: Rank 1, Rank 2, Rank 3.
  • 23. Text to video. Input text: “add fresh chopped tomatoes and stir”.
  • 24. Text to video. Input text: “add fresh chopped tomatoes and stir”. Results shown: Rank 1, Rank 2, Rank 3.
  • 25. Text to video. Input text: “pour some oil into a hot pan”.
  • 26. Text to video. Input text: “pour some oil into a hot pan”. Results shown: Rank 1, Rank 2, Rank 3.
  • 27. Text to audio retrieval in the coarse space. Even though the link between audio and text was never explicit during training, we can use the FAC (Fine and Coarse) architecture to perform text-to-audio retrieval.
  • 28-30. Text to audio retrieval in the coarse space. To do so, the audio samples are first embedded in the joint visual-audio (fine) space. (Figure label: ResNet50.)
  • 31-32. Text to audio retrieval in the coarse space. Then the va→vat projection head is used to project the audio embeddings into the joint visual-audio-text (coarse) space. (Figure label: ResNet50.)
  • 33. Text to audio retrieval in the coarse space. Given a text input query, we simply embed it into the joint space and retrieve the closest audio embedding. (Figure labels: ResNet50, input query.)
  • 34. Text to audio retrieval in the coarse space. Input text: “airplane”. Result shown: Rank 1.
  • 35. Text to audio retrieval in the coarse space. Input text: “chirping bird”. Result shown: Rank 1.
  • 36. Resources. Pretrained models available on TF-Hub: [S3D] [TSM-RN] [TSM-RNx2]. Models in JAX with an action recognition downstream task!
  • 37. However... Most available videos do not contain narrations. Using negatives for self-supervision is expensive, as it requires training with large batch sizes. Our training misses larger context, as the views of the data cover at most 3 seconds.
  • 38. Outline of the talk. 01 Multimodal Versatile Networks: Motivation; MMV Model; Versatility checklist; Video Network Deflation; Potential applications. 02 BraVe: Broaden your views for self-supervised learning: Narrow and broad views; Main idea; Motivation; Research questions; Evaluation.
  • 45. Motivation. Goal: learn a good representation by regressing a broad representation of the video. BraVe learns a strong video representation because the narrow view must predict the representation of the whole video clip (the broad view). We use separate backbones to process the two views, as they perform different tasks. This enables using different augmentations or modalities in each view. Flow or alternative representations of the video can provide a strong signal for learning.
  • 46. Research questions: (1) Importance of the broad view. (2) Modality in the broad view. (3) Weight sharing across views. (4) Syncing the narrow and broad views. [Broaden Your Views for Self-Supervised Video Learning, Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, Andrew Zisserman. arXiv 2021.]
  • 47. Comparison to SoTA: video-only models
  • 48. Comparison to SoTA: audio-visual models
  • 49. Conclusions. Videos are a rich source of self-supervision for video, audio and image models. Both MMV and BraVe achieve SoTA results for self-supervised learning in several downstream tasks. Audio, text and larger video context are all useful self-supervisory signals.
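
The text-to-audio retrieval steps described in the slides (embed audio in the fine space, project it with the va→vat head into the coarse space, then pick the audio embedding closest to the text query) can be sketched roughly as follows. This is a toy illustration, not the MMV implementation: the embeddings are hand-written numbers, the projection head is assumed to be a plain linear map, cosine similarity is assumed as the closeness measure, and the names `retrieve_audio`, `W_VA_TO_VAT` and `audio_va_bank` are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply a matrix W (given as a list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_audio(text_vat, audio_va_bank, W_va_to_vat):
    """Return audio clip indices ranked from best to worst match."""
    # Project each fine-space (visual-audio) embedding into the
    # coarse (visual-audio-text) space. Assumption: a linear head.
    audio_vat = [matvec(W_va_to_vat, a) for a in audio_va_bank]
    # Rank audio clips by similarity to the text embedding,
    # which is assumed to already live in the coarse space.
    scores = [cosine(text_vat, a) for a in audio_vat]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical stand-ins for learned quantities (fine dim 3, coarse dim 2).
W_VA_TO_VAT = [[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]]
audio_va_bank = [
    [1.0, 0.0, 0.5],   # clip 0
    [0.0, 1.0, -0.3],  # clip 1
    [1.0, 1.0, 0.2],   # clip 2
    [-1.0, 0.0, 0.9],  # clip 3
]
text_query_vat = [0.1, 1.0]  # pretend embedding of a text query

ranking = retrieve_audio(text_query_vat, audio_va_bank, W_VA_TO_VAT)
print(ranking)  # [1, 2, 0, 3]
```

In this sketch the text encoder never sees audio directly; the comparison only works because both modalities end up in the same coarse space, which is the point the slides make about the link between audio and text never being explicit during training.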