AGREEMENT
• If you plan to share these slides or to use the content in these slides for your own work, please include the following reference:
Tejero-de-Pablos A. (2022) “VAEs for multimodal disentanglement”. All Japan Computer Vision Study Group.
VAEs for multimodal
disentanglement
2022/05/15
Antonio TEJERO DE PABLOS
antonio_tejero@cyberagent.co.jp
1. Self-introduction
2. Background
3. Paper introduction
4. Final remarks
Self-introduction
Antonio TEJERO DE PABLOS
Background
• Present: Research scientist @ CyberAgent (AI Lab)
• ~2021: Researcher @ U-Tokyo (Harada Lab) & RIKEN (AIP)
• ~2017: PhD @ NAIST (Yokoya Lab)
Research interests
• Learning of multimodal data (RGB, depth, audio, text) and its applications (action recognition, advertisement classification, etc.)
Field: Computer Vision
Background
What is a VAE?
• Auto-encoder • Variational auto-encoder
With the proper regularization (a KL term pulling the approximate posterior toward the prior), the latent space becomes continuous and structured:
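To make this concrete, here is a minimal sketch of a VAE in PyTorch, assuming an MNIST-like input and illustrative layer sizes (the architecture is my own example, not from the slides): the encoder outputs a Gaussian posterior, the reparameterization trick keeps sampling differentiable, and the KL term is the regularizer in question.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE: the encoder outputs a Gaussian q(z|x), the decoder reconstructs x."""
    def __init__(self, x_dim=784, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, 400)
        self.mu = nn.Linear(400, z_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(400, z_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 400), nn.ReLU(),
                                 nn.Linear(400, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def elbo_loss(x_logits, x, mu, logvar):
    # Reconstruction term (Bernoulli likelihood on pixels)
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    # KL( N(mu, sigma^2) || N(0, I) ): the regularization that shapes the latent space
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```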
There is more!
• Vector Quantized-VAE
Quantize the bottleneck using a discrete codebook
There are a number of algorithms (like transformers) that are designed to work on discrete data, so we
would like to have a discrete representation of the data for these algorithms to use.
Advantages of VQ-VAE (see the sketch after this list):
- Simplified latent space (easier to train)
- Likelihood-based model: does not suffer from mode collapse or lack of diversity
- Real-world data favors a discrete representation (the number of images that make sense is, in a way, finite)
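As a rough illustration of the quantized bottleneck (codebook size and dimensions are assumptions of mine), a minimal vector-quantization layer snaps each continuous encoder output to its nearest codebook entry and uses a straight-through estimator so gradients still reach the encoder:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e):                          # z_e: (batch, code_dim)
        d = torch.cdist(z_e, self.codebook.weight)   # distances to all codes
        idx = d.argmin(dim=1)                        # discrete code indices
        z_q = self.codebook(idx)                     # quantized vectors
        # Codebook loss + commitment loss (the two VQ-VAE objective terms)
        vq_loss = ((z_q - z_e.detach()) ** 2).mean() + \
                  self.beta * ((z_e - z_q.detach()) ** 2).mean()
        # Straight-through: forward uses z_q, backward copies gradients to z_e
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, vq_loss
```

The discrete indices `idx` are exactly the kind of representation that token-based models such as transformers can consume.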
Why are VAEs cool?
• Usage of VAEs (state-of-the-art)
Multimodal generation (DALL-E)
Representation learning, latent space disentanglement
Paper introduction
An interesting usage of VAEs for disentangling
multimodal data that grabbed my attention
Today I’m introducing:
1) Shi, Y., Paige, B., & Torr, P. (2019). Variational mixture-of-experts autoencoders for multi-modal
deep generative models. Advances in Neural Information Processing Systems, 32.
2) Lee, M., & Pavlovic, V. (2021). Private-shared disentangled multimodal VAE for learning of
latent representations. Conference on Computer Vision and Pattern Recognition (pp. 1692-
1700).
3) Joy, T., Shi, Y., Torr, P. H., Rainforth, T., Schmon, S. M., & Siddharth, N. (2022). Learning
Multimodal VAEs through Mutual Supervision. International Conference on Learning
Representations.
Motivation and goal
• Importance of multimodal data
Learning in the real world involves multiple perspectives: visual, auditory, linguistic
Understanding them individually allows only a partial learning of concepts
• Understanding how different modalities work together is not trivial
A similar joint-embedding process happens in the brain for reasoning and understanding
• Multimodal VAEs facilitate representation learning on data with multiple views/modalities
Capture common underlying factors between the modalities
Motivation and goal
• Normally, only the shared aspects of modalities are modeled
The private information of each modality is totally LOST
E.g., image captioning
• Leverage VAE’s latent space for disentanglement
Private spaces are leveraged for modeling the disjoint properties of each
modality, and cross-modal generation
• Basically, such disentanglement can be used as:
An analytical tool to understand how modalities intertwine
A way of cross-generating modalities
Motivation and goal
• [1] and [2] propose a similar methodology
According to [1], a true multimodal generative model should meet four criteria: latent factorization into modality-shared and modality-private subspaces, coherent joint generation over all modalities, coherent cross-generation across individual modalities, and synergy (multimodal data improves the model learned for each individual modality)
Today I will introduce [2] (the more recent of the two), and briefly explain the differences from [3]
Dataset
• Digit images: MNIST & SVHN
- Shared features: Digit class
- Private features: Number style, background, etc.
Image domains as different modalities?
• Flower images and text description: Oxford-102 Flowers
- Shared features: Words and image features present in both
modalities
- Private features: Words and image features exclusive from
their modality
Related work
• Multimodal generation and joint multimodal VAEs (e.g., JMVAE, MVAE)
The learning of a common disentangled embedding (i.e., private-shared) is often ignored
Only some works in image-to-image translation separate "content" (~shared) and "style" (~private) in the latent space (e.g., via an adversarial loss)
These apply exclusively between image modalities: not suitable for heterogeneous modalities such as image and text
• Domain adaptation
Learning joint embeddings of multimodal observations
Proposed method: DMVAE
• Generative variational model: Introducing separate shared and private spaces
Usage: Cross-generation (analytical tool)
• Representations are induced using a per-modality pair of encoder and decoder
• Consistency of representations via Product of Experts (PoE). For a number of modalities N:
$$q(z_s \mid x_1, x_2, \cdots, x_N) \propto p(z_s) \prod_{i=1}^{N} q(z_s \mid x_i)$$

In VAE, inference networks and priors assume conditional Gaussian forms:

$$p(z) = \mathcal{N}(z \mid 0, I), \qquad q(z \mid x_i) = \mathcal{N}(z \mid \mu_i, C_i)$$

Each modality's posterior sample splits into a private and a shared part:

$$z_1 \sim q_{\phi_1}(z \mid x_1), \quad z_2 \sim q_{\phi_2}(z \mid x_2), \qquad z_1 = (z_{p_1}, z_{s_1}), \quad z_2 = (z_{p_2}, z_{s_2})$$

We want $z_s = z_{s_1} = z_{s_2}$ → PoE
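Because every expert is Gaussian, this product has a closed form: precisions add, and the fused mean is precision-weighted. A minimal sketch under the assumption of diagonal covariances (the function name is mine, not the authors'):

```python
import torch

def poe_gaussian(mus, logvars):
    """Fuse Gaussian experts: q(z_s | x_1..x_N) ∝ p(z_s) ∏_i q(z_s | x_i).

    mus, logvars: lists of (batch, z_dim) tensors, one pair per modality.
    The standard-normal prior expert from the equation above is included.
    """
    mus = [torch.zeros_like(mus[0])] + list(mus)              # prior: mu = 0
    logvars = [torch.zeros_like(logvars[0])] + list(logvars)  # prior: var = 1
    precisions = [torch.exp(-lv) for lv in logvars]           # 1 / sigma_i^2
    var = 1.0 / sum(precisions)                               # fused variance
    mu = var * sum(p * m for p, m in zip(precisions, mus))    # fused mean
    return mu, torch.log(var)
```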
Proposed method: DMVAE
• Reconstruction inference
PoE-induced shared inference allows for inference when one or more modalities are missing
Thus, we consider three reconstruction tasks:
- Reconstruct both modalities at the same time: $x_1, x_2 \rightarrow \hat{x}_1, \hat{x}_2 \mid z_{p_1}, z_{p_2}, z_s$
- Reconstruct a single modality from its own input: $x_1 \rightarrow \hat{x}_1 \mid z_{p_1}, z_s$ or $x_2 \rightarrow \hat{x}_2 \mid z_{p_2}, z_s$
- Reconstruct a single modality from the opposite modality's input: $x_2 \rightarrow \hat{x}_1 \mid z_{p_1}, z_s$ or $x_1 \rightarrow \hat{x}_2 \mid z_{p_2}, z_s$
• Loss function
Accuracy of reconstruction from the jointly learned shared latent + KL divergence of each normal distribution
Accuracy of cross-modal and self-reconstruction + KL divergence (a sketch of the combined objective follows)
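Putting the pieces together, here is a hedged sketch of how the three reconstruction tasks and the KL terms could combine into one objective (the encoder/decoder interfaces and names are hypothetical, not the authors' code; `poe_gaussian` is the sketch above):

```python
import torch

def sample(mu, logvar):
    # Reparameterized draw from N(mu, exp(logvar))
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def kl(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) )
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

def dmvae_loss(x1, x2, enc1, enc2, dec1, dec2, recon):
    # Each encoder is assumed to return private and shared Gaussian posteriors
    mu_p1, lv_p1, mu_s1, lv_s1 = enc1(x1)
    mu_p2, lv_p2, mu_s2, lv_s2 = enc2(x2)
    mu_s, lv_s = poe_gaussian([mu_s1, mu_s2], [lv_s1, lv_s2])  # fused shared

    z_p1, z_p2 = sample(mu_p1, lv_p1), sample(mu_p2, lv_p2)
    z_s = sample(mu_s, lv_s)
    z_s1, z_s2 = sample(mu_s1, lv_s1), sample(mu_s2, lv_s2)

    return (
        # 1) Joint reconstruction from the PoE shared latent
        recon(dec1(z_p1, z_s), x1) + recon(dec2(z_p2, z_s), x2)
        # 2) Self-reconstruction from each modality's own shared latent
        + recon(dec1(z_p1, z_s1), x1) + recon(dec2(z_p2, z_s2), x2)
        # 3) Cross-reconstruction: shared latent taken from the other modality
        + recon(dec1(z_p1, z_s2), x1) + recon(dec2(z_p2, z_s1), x2)
        # KL regularizers for every Gaussian posterior
        + kl(mu_p1, lv_p1) + kl(mu_p2, lv_p2)
        + kl(mu_s, lv_s) + kl(mu_s1, lv_s1) + kl(mu_s2, lv_s2)
    )
```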
Experiments: Digits (image-image)
• Evaluation
Qualitative: Cross-generation between modalities
Quantitative: Accuracy of the cross-generated images using a pre-trained classifier for each modality
- Joint: a sample from $z_s$ generates two image modalities that must be assigned the same class
(Figure: input digits and their cross-generated outputs for different samples of $z_{p_1}$ and $z_{p_2}$)
Experiments: Digits (image-image)
• Ablation study: DMVAE [2] vs MMVAE [1]’s shared latent space
Experiments: Flowers (image-text)
• This task is more complex
Instead of the raw image and text, intermediate feature representations are reconstructed
• Quantitative evaluation
Class recognition (image-to-text) and cosine-similarity retrieval (text-to-image) on the shared latent space
• Qualitative evaluation
Retrieval
Conclusions
• Multimodal VAE for disentangling private and shared spaces
Improves the representational performance of multimodal VAEs
Successful application to image-image and image-text modalities
• Shaping a latent space into subspaces that capture the private-shared aspects of the
modalities
“is important from the perspective of downstream tasks, where better decomposed representations are more
amenable for using on a wider variety of tasks”
[3] Multimodal VAEs via mutual supervision
• Main differences with [1] and [2]
A type of multimodal VAE, without private-shared disentanglement
Does not rely on factorizations such as MoE or PoE for modeling modality-shared information
Instead, it repurposes semi-supervised VAEs for combining inter-modality information
- Allows learning from partially observed modalities (regularizer = KL divergence)
• Proposed method: Mutually supErvised Multimodal vaE (MEME)
[3] Multimodal VAEs via mutual supervision
• Qualitative evaluation
Cross-modal generation
• Quantitative evaluation
Coherence: Percentage of matching predictions of the cross-generated modality using a pretrained classifier
Relatedness: Wasserstein distance between the representations of the two modalities (closer if same class); see the sketch below
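As a rough illustration of these two metrics (the classifier, labels, and tensor shapes are assumptions of mine): coherence counts class-matching predictions on cross-generated samples, and for diagonal-Gaussian representations the 2-Wasserstein distance has a simple closed form.

```python
import torch

@torch.no_grad()
def coherence(cross_generated, source_labels, classifier):
    # Fraction of cross-generated samples whose predicted class matches
    # the class of the source-modality input
    preds = classifier(cross_generated).argmax(dim=1)
    return (preds == source_labels).float().mean().item()

def w2_diagonal_gaussians(mu1, var1, mu2, var2):
    # Closed form for diagonal Gaussians:
    # W2^2 = ||mu1 - mu2||^2 + ||sqrt(var1) - sqrt(var2)||^2
    return torch.sqrt(((mu1 - mu2) ** 2).sum()
                      + ((var1.sqrt() - var2.sqrt()) ** 2).sum())
```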
Final remarks
Final remarks
• VAEs are useful not only for generation but also for reconstruction and disentanglement tasks
Recommended textbook: “An Introduction to Variational Autoencoders”, Kingma & Welling
• Private-shared latent spaces as an effective tool for analyzing multimodal data
• There is still a lot of potential for this research
So far it has only been applied to a limited number of multimodal problems
• PhD students interested in this topic → we are recruiting interns
https://www.cyberagent.co.jp/news/detail/id=27453
A different topic is fine too!
Joint research is also very welcome!
Thank you very much!
Antonio TEJERO DE PABLOS
antonio_tejero@cyberagent.co.jp