Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos

  1. 15/05/2021 Multi-modal self-supervised learning from videos Adrià Recasens Continente DeepMind
  2. We learn from the world through multimodal experience [...] towards the root and try to get as close to the root as possible, nice long strokes [...]
  3. Success of supervised learning Pose estimation [Towards Accurate Multi-person Pose Estimation in the Wild, Papandreou, Zhu, Kanazawa, Toshev, Tompson, Bregler and Murphy, CVPR17] Image Segmentation [Mask R-CNN, He, Gkioxari, Dollár, and Girshick, ICCV17]
  4. Supervised learning Labels are expensive Agreement: definition? Granularity?
  5. Supervised learning Labels are expensive Even more problematic for videos
  6. Self-supervised learning Vision Vision+Language Vision+Audio SimCLR: Chen et al, 2020 MOCO: He et al, 2020 XDC: Alwassel et al, 2020 L3: Arandjelovic and Zisserman, 2017 GDT: Patrick et al, 2020 MIL-NCE: Miech, Alayrac et al, 2020 VideoBERT: Sun et al, 2019 DaveNet: Harwath et al, 2018 Sound of Pixels: Zhao et al, 2018
  7. Outline of the talk 01 Multimodal Versatile Networks Motivation MMV Model Versatility checklist Video Network Deflation Potential applications 02 BraVe: Broaden your views for self-supervised learning Narrow and broad views Main idea Motivation Research questions Evaluation
  8. 1 Multi-modal versatile networks
  9. Motivation Self-supervised learning on modalities naturally present in videos: Vision, Audio and Language. Research questions: Are three modalities better than two for downstream tasks? Are there natural requirements for such a multimodal network?
  10. Positive pairs This is an “old” idea: DeViSE, Frome et al. NeurIPS13 and WSABIE, Weston et al. IJCAI 2011. “Play the guitar” “Cut the onion” Negative pairs Main Idea Video 1 Video 2
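Pairings like the ones on this slide are typically trained with a contrastive (NCE-style) objective: matching clips are pulled together, everything else in the batch is pushed apart. The sketch below is a simplified single-positive version in Python/NumPy (MIL-NCE, used for noisy narrations, additionally pools over several candidate positives); the function names and temperature value are illustrative, not the exact MMV implementation.

```python
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style contrastive loss over a batch of paired, L2-normalised
    embeddings: row i of video_emb and text_emb come from the same clip
    (positive pair), and every other row in the batch acts as a negative."""
    logits = video_emb @ text_emb.T / temperature          # (B, B) similarity matrix

    def log_softmax(x):
        x = x - x.max(axis=1, keepdims=True)               # numerical stability
        return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

    diag = np.arange(len(logits))
    loss_v2t = -log_softmax(logits)[diag, diag].mean()     # video -> text direction
    loss_t2v = -log_softmax(logits.T)[diag, diag].mean()   # text -> video direction
    return (loss_v2t + loss_t2v) / 2
```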
  11. Which pretraining datasets? MOCO: He et al, 2020 GDT: Patrick et al, 2020 MIL-NCE: Miech, Alayrac et al, 2020 HowTo100M: 1M videos, 100M clips, 20K tasks, text obtained from ASR. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, Miech, Zhukov, Alayrac et al., ICCV19 MOCO: He et al, 2020 MIL-NCE: Miech, Alayrac et al, 2020 AudioSet: 2M videos (with audio tracks), we do not extract text for this dataset. Audio Set: An ontology and human-labeled dataset for audio events, Gemmeke et al. ICASSP 2017
  12. Versatility checklist Ingest any modality Takes as input any of the three modalities. Specificity Respects the specificity of modalities. Compare modalities Enables the different modalities to be easily compared. 1 2 3 Transfer to images Efficiently applicable to visual data in the form of videos or images. 4
  13. Embedding graph design Fine and Coarse Intuition: audio is more fine-grained (e.g., multiple sounds of a guitar) whereas text is more coarse (a single word for guitar) ⇒ The Fine and Coarse design: ✓ enables the different modalities to be easily compared ✓ has the best results in several downstream tasks ✓ respects the specificity of modalities Fine Space Coarse Space Self-supervised Multi-Modal Versatile Networks, NeurIPS 2020
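To make the Fine-and-Coarse (FAC) graph concrete, here is a minimal wiring sketch in which random linear maps stand in for the real backbones and projection heads; all dimensions, layer choices and names are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Placeholder projection head: a single random linear map."""
    W = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))
    return lambda x: x @ W

def l2_normalise(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for backbone feature dimensions (vision / audio / text).
D_V, D_A, D_T = 1024, 512, 300
to_fine_v      = linear(D_V, 512)   # vision -> fine (visual-audio) space
to_fine_a      = linear(D_A, 512)   # audio  -> fine (visual-audio) space
fine_to_coarse = linear(512, 256)   # va -> vat projection (fine -> coarse)
to_coarse_t    = linear(D_T, 256)   # text   -> coarse (visual-audio-text) space

# Vision and audio are compared in the fine space; all three modalities become
# comparable in the coarse space after the va -> vat projection.
video_fine   = l2_normalise(to_fine_v(rng.normal(size=(4, D_V))))
audio_fine   = l2_normalise(to_fine_a(rng.normal(size=(4, D_A))))
video_coarse = l2_normalise(fine_to_coarse(video_fine))
audio_coarse = l2_normalise(fine_to_coarse(audio_fine))
text_coarse  = l2_normalise(to_coarse_t(rng.normal(size=(4, D_T))))
```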
  14. Do more modalities help?
  15. State-of-the-art comparison
  16. Versatility checklist Ingest any modality Takes as input any of the three modalities. Specificity Respects the specificity of modalities. Compare modalities Enables the different modalities to be easily compared. 1 2 3 Transfer to images Efficiently applicable to visual data in the form of videos or images. 4
  17. Ingest any modality Takes as input any of the three modalities. Specificity Respects the specificity of modalities. Compare modalities Enables the different modalities to be easily compared. 1 2 3 Transfer to images Efficiently applicable to visual data in the form of videos or images. 4 Versatility checklist
  18. Network Deflation Motivation: Most prior work learns from images first and then applies the models to video. Goal: We train our models on video and apply them efficiently to image inputs. A standard solution: Inflated input Proposed solution: Deflated network Video Network
  19. Network Deflation
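The two options on these slides can be contrasted with a rough NumPy sketch. It assumes 3D convolution kernels stored as (T, kH, kW, C_in, C_out); the sum-over-time rule is one natural way to derive 2D filters from trained spatio-temporal ones (it matches the 3D convolution on a static clip away from temporal boundaries), while the paper's full procedure, including the TSM case and fine-tuning the deflated parameters to match the inflated outputs, is simplified away.

```python
import numpy as np

def inflate_image(image, num_frames=32):
    """Baseline 'inflated input': tile one image along time so the video
    network can ingest it as a static clip (costly: T identical frames)."""
    return np.repeat(image[np.newaxis], num_frames, axis=0)   # (T, H, W, C)

def deflate_conv3d_kernel(kernel_3d):
    """'Deflated network': collapse the temporal axis of a spatio-temporal
    convolution kernel by summing, giving a 2D kernel that processes a single
    image directly. Additional alignment of the deflated network to the
    video network (as described in the paper) is omitted here."""
    # Assumed kernel layout: (T, kH, kW, C_in, C_out).
    return kernel_3d.sum(axis=0)                               # (kH, kW, C_in, C_out)
```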
  20. MULTIMODAL VERSATILE NETWORKS Potential Applications
  21. Audio to video Rank 1 Rank 2 Rank 3
  22. Audio to video Rank 1 Rank 2 Rank 3
  23. Text to video “add fresh chopped tomatoes and stir” Input text
  24. Text to video Rank 1 Rank 2 Rank 3 “add fresh chopped tomatoes and stir” Input text
  25. Text to video “pour some oil into a hot pan” Input text
  26. Text to video Rank 1 Rank 2 Rank 3 “pour some oil into a hot pan” Input text
  27. Text to audio retrieval in the coarse space Even though the link between audio and text was never explicit during training, we can use the FAC architecture to perform text to audio retrieval.
  28. ResNet50 To do so, the audio samples are first embedded in the joint visual-audio (fine) space. Text to audio retrieval in the coarse space
  29. ResNet50 To do so, the audio samples are first embedded in the joint visual-audio (fine) space. Text to audio retrieval in the coarse space
  30. ResNet50 To do so, the audio samples are first embedded in the joint visual-audio (fine) space. Text to audio retrieval in the coarse space
  31. ResNet50 Then the va→vat projection head is used to project the audio embeddings into the joint visual-audio-text space (coarse). Text to audio retrieval in the coarse space
  32. ResNet50 Then the va→vat projection head is used to project the audio embeddings into the joint visual-audio-text space (coarse). Text to audio retrieval in the coarse space
  33. ResNet50 Given a text input query, we simply embed it into the joint space and retrieve the closest audio embedding. Input query Text to audio retrieval in the coarse space
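The retrieval procedure walked through on the last few slides can be sketched as follows, reusing the placeholder fine-space audio embeddings and va→vat projection from the FAC sketch above; the function and variable names are illustrative, not part of the released models.

```python
import numpy as np

def l2_normalise(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def text_to_audio_retrieval(text_query_coarse, audio_fine, va_to_vat, top_k=3):
    """Retrieve the audio clips closest to a text query in the coarse space.

    text_query_coarse: (D_coarse,) embedding of the query text.
    audio_fine:        (N, D_fine) audio embeddings in the visual-audio space.
    va_to_vat:         callable projecting fine embeddings into the coarse
                       visual-audio-text space.
    """
    audio_coarse = l2_normalise(va_to_vat(audio_fine))   # project audio into coarse space
    query = l2_normalise(text_query_coarse)
    scores = audio_coarse @ query                        # cosine similarity per clip
    return np.argsort(-scores)[:top_k]                   # indices of the top-ranked clips
```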
  34. “airplane” Rank 1 Input text Text to audio retrieval in the coarse space
  35. Rank 1 Input text “chirping bird” Text to audio retrieval in the coarse space
  36. Resources Pretrained models available on TF-Hub: [S3D] [TSM-RN] [TSM-RNx2] Models in JAX with an action recognition downstream task!
  37. However... Most available videos do not contain narration. Using negatives for self-supervision is expensive, as it requires training with large batch sizes. Our training misses larger context, as the views of the data cover at most 3 seconds.
  38. Outline of the talk 01 Multimodal Versatile Networks Motivation MMV Model Versatility checklist Video Network Deflation Potential applications 02 BraVe: Broaden your views for self-supervised learning Narrow and broad views Main idea Motivation Research questions Evaluation
  39. Main Idea
  40. Main Idea
  41. Main Idea
  42. Motivation Goal: learn good representations by regressing a broad representation of the video. BraVe learns strong video representations because the narrow view needs to predict the representation of the whole video clip (the broad view). We use separate backbones to process the two views, as they perform different tasks. This also enables using different augmentations/modalities in the two views. Flow or alternative representations of the video can provide a strong signal for learning.
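A compact sketch of the prediction objective described above, assuming a BYOL-style normalised regression between the narrow-view prediction and the broad-view projection; the projector/predictor architectures and the symmetric second prediction direction are omitted, and the names are illustrative rather than the exact BraVe implementation.

```python
import numpy as np

def regression_loss(narrow_pred, broad_proj):
    """BYOL-style regression loss sketching BraVe's objective: the prediction
    computed from the narrow view is pushed towards the projection of the
    broad view. Both inputs are (B, D); no negatives are needed."""
    narrow_pred = narrow_pred / np.linalg.norm(narrow_pred, axis=-1, keepdims=True)
    broad_proj = broad_proj / np.linalg.norm(broad_proj, axis=-1, keepdims=True)
    # Squared L2 distance between unit vectors == 2 - 2 * cosine similarity.
    return np.mean(np.sum((narrow_pred - broad_proj) ** 2, axis=-1))
```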
  43. Research Questions Importance of the broad view Modality in the broad view Weight sharing across views 1 2 3 Syncing the narrow and broad views 4 Broaden Your Views for Self-Supervised Video Learning, Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, Andrew Zisserman. arXiv 2021.
  44. Comparison to SoTA: video-only models
  45. Comparison to SoTA: audio-visual models
  46. Conclusions Videos are a rich source of self-supervision for video, audio and image models. Both MMV and BraVe achieve SoTA results for self-supervised learning on several downstream tasks. Audio, text and larger video context are all useful self-supervisory signals.
  47. Thank you!