Presenter: Ryohei Suzuki
Jun. 11, 2021
Transformer-based approaches for
visual representation learning
autoregression: generating a sequence while referring to previously generated outputs
CNN: convolutional neural network
embedding: converting data into a vector representation
equivariance: the result is unchanged when the order of an operation A and a transform f is swapped
inductive bias: assumptions about the data implicitly imposed by the model design
invariance: the result is unchanged when a transform f is applied before an operation A
MLP: multi-layer perceptron
NLP: natural language processing
pretraining: pretraining (training a model in advance of the target task)
self-attention: self-attention mechanism
Glossary
2
Today’s papers
Vaswani et al. (Google Brain, UToronto),
NeurIPS 2017
Dosovitskiy et al. (Google Brain),
ICLR 2021
Caron et al. (FAIR, INRIA, Sorbonne),
arXiv preprint 2021 3
● CNNs (e.g., VGG, ResNet) have been the de facto standard for
visual tasks in deep learning
○ Convolution provides favorable properties for image processing
● Recently, alternative approaches have been emerging
○ Transformer-based methods e.g., Attention-CNN, ViT
○ MLP-based methods e.g., MLP-Mixer, gMLP
● In particular, NLP-inspired Transformer-based approaches have
shown promising performance and interesting properties
Context
4
Review: Convolutional Neural Network (CNN)
● Convolution = multi-channel filtering by learnable kernels
● Typical modern CNNs contain convolution, activation, pooling,
skip-connection, normalization (e.g., BN, IN, GN), etc.
kernel = linear map of a finite-size window
5
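To make "kernel = linear map of a finite-size window" concrete, here is a minimal NumPy sketch of multi-channel convolution (stride 1, no padding); the shapes and names are illustrative, not taken from any particular library.

```python
import numpy as np

def conv2d(x, w):
    """Naive multi-channel 2D convolution (cross-correlation, as usual in DL).

    x: input feature map, shape (C_in, H, W)
    w: learnable kernels, shape (C_out, C_in, k, k)
    returns: output feature map, shape (C_out, H-k+1, W-k+1)
    """
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - k + 1, wd - k + 1))
    for o in range(c_out):
        for i in range(h - k + 1):
            for j in range(wd - k + 1):
                window = x[:, i:i + k, j:j + k]       # finite-size window
                out[o, i, j] = np.sum(window * w[o])  # linear map of the window
    return out

x = np.random.randn(3, 8, 8)      # e.g., a small RGB patch
w = np.random.randn(16, 3, 3, 3)  # 16 learnable 3x3 kernels
print(conv2d(x, w).shape)         # (16, 6, 6)
```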
Visual inductive biases of CNNs
Inductive bias: regularization of the solution implicitly introduced by the model design, useful for exploiting the characteristics of the data
● Locality
○ Natural images have a spatial hierarchy
○ A sequence of convolutions mimics this hierarchy:
the receptive field gradually grows through multiple convolutions
https://towardsdatascience.com/journey-from-machine-learning-to-deep-learning-8a807e8f3c1c
6
Visual inductive biases of CNNs
● Translation invariance / equivariance
○ invariance: A(f(x)) = A(x); equivariance: A(f(x)) = f(A(x)) (A: processing, f: transform)
○ Convolution is naturally a translation-equivariant operation
○ Equivariance is in fact broken in CNNs [Zhang 2019, Kayhan 2020]
translation invariance: we want the model to output the same answer for shifted images (figure: both the original and the shifted image are classified as "cat 98%")
translation equivariance: CNN-after-shift is equivalent to shift-after-CNN
rotational equivariance is also sometimes imposed [Graham et al. 2020]
7
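A minimal numerical check of the translation equivariance of convolution, assuming circular (wrap-around) boundaries so that the equality holds exactly; with zero padding it breaks near the image borders, which is what [Zhang 2019, Kayhan 2020] point out.

```python
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))  # toy single-channel image
k = rng.standard_normal((3, 3))    # toy kernel

def shift(img, dy, dx):
    # circular (wrap-around) translation
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

conv_then_shift = shift(convolve(x, k, mode='wrap'), 2, 3)
shift_then_conv = convolve(shift(x, 2, 3), k, mode='wrap')
print(np.allclose(conv_then_shift, shift_then_conv))  # True: conv-after-shift == shift-after-conv
```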
Intrinsic problems of CNNs
● Difficulty in handling long-range / irregular-shape dependencies
○ CNNs can recognize the interaction between two distant points only
after a large number of convolutions yielding a large receptive field
● Low-resolution, blurred representation
○ Partially solved by skip-connections (e.g., U-Net, HRNet)
(figure: how can we recognize the interaction between the bird and the flower?)
HRNet [Sun et al. 2019]
8
Ideas from NLP
Self-attention
● Convolution: gather information from the nearby positions
● Self-attention: gather information from the related (attended) positions
○ originally developed in language models
Large-scale pretraining
● Most specific problems provide only a limited amount of data
● ImageNet-pretraining has already been broadly used in CV
● Pretraining with massively large datasets has shown amazing results
in NLP, e.g., GPT-3. (ImageNet 1.2M images vs. GPT-3 500B tokens)
image from: http://bliulab.net/selfAT_fold/
9
Paper 1: Attention is All You Need
● Proposed Transformer (=attention-based translation model)
● One of the most important ML papers (cited more than 20,000 times!)
● Attention is All You Need is All You Need (many subsequent papers
have proposed “improved” models, but the progress was reported to be quite small [1])
[1] Narang et al. arXiv preprint 2021
10
Vaswani et al. (Google Brain, UToronto)
NeurIPS 2017
Encoder-decoder model
Many “translation” models can be formulated as an encoder-decoder (enc-dec) pair
● Encoder extracts meaningful features from the input
● Decoder composes output from the extracted features
(figure: “I am a student” → Enc → latent variable → Dec → “Je suis un étudiant”)
Latent variable: the input data transformed into a more useful form that contains the information necessary for the translation task
→ can be utilized for multiple downstream tasks
Training of a meaningful encoder (feature predictor) = representation learning
11
Transformer
(figure: the Transformer encoder-decoder — input: “I am a student”; past output: “Je suis un”; next output: “étudiant”)
12
Flow of processing
(figure: the input tokens “I am a student” pass through stacked Encoder Blocks, each containing attention blocks; the encoder output is fed to the attention modules of the decoder; the decoder receives “<start> Je suis un” through stacked Decoder Blocks and predicts “étudiant”)
13
Parallelized training
(figure: the same encoder-decoder stack, but the whole shifted ground truth “<start> Je suis un” is fed to the decoder at once, and the next-token predictions for all positions are compared with the ground truth “Je suis un étudiant” in parallel)
14
Causal mask: each decoder position may attend only to earlier positions, so it cannot peek at the ground-truth tokens it is supposed to predict (see the sketch below)
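A tiny sketch of such a causal mask (framework-agnostic, illustrative only): position i is allowed to attend only to positions j ≤ i, so parallel training cannot leak future ground-truth tokens.

```python
import numpy as np

def causal_mask(n):
    # mask[i, j] == True means "position i may attend to position j"
    return np.tril(np.ones((n, n), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
# In practice, disallowed attention scores are set to -inf before the softmax.
```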
Self-attention
Instead of spatial convolution, we want to update the vector at
position i by aggregating information from related positions.
(figure: convolution vs. self-attention aggregation patterns)
Questions:
● How to know the “related” positions?
● How to aggregate the information (vectors)
from those related positions? 15
Query-Key-Value attention
At each position i, we convert the vector x_i into three vectors:
query q_i = W_Q x_i, key k_i = W_K x_i, and value v_i = W_V x_i,
then define the relevance between positions i and j by a_ij = softmax_j(q_i · k_j / √d_k) (· is the dot product).
The output o_i = Σ_j a_ij v_j is the weighted average of the values, weighted by the relevance scores (see the sketch below).
16
In matrix form: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
In the decoder, cross-attention fetches the encoded features: queries come from the decoder, while keys and values come from the encoder output.
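A minimal NumPy sketch of the scaled dot-product self-attention defined above (single head, no masking); the matrix names follow the slide, the rest is illustrative.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (N, d_model) token vectors; Wq, Wk, Wv: (d_model, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance between positions i and j
    A = softmax(scores, axis=-1)             # each row sums to 1
    return A @ V                             # weighted average of the values

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))                               # N=5 tokens, d_model=16
Wq, Wk, Wv = (rng.standard_normal((16, 8)) for _ in range(3))  # d_k=8
print(self_attention(X, Wq, Wk, Wv).shape)                     # (5, 8)
```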
Multi-head attention
Attention is basically a weighted-average → limited capability for
representing multiple types of relationships
e.g., “I like this lecture”
Multi-head attention first generates h sets of (Q, K, V),
then applies the standard attention to each set in parallel.
→ each head can attend to different targets
The results are concatenated and aggregated by a linear layer.
17
“I” and “this lecture” have different relationships to the word “like”
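A corresponding multi-head sketch (assuming d_model is divisible by h; illustrative, not an exact reproduction of the paper's implementation): the projections are split into h heads, attention runs per head in parallel, and the concatenated results are mixed by a final linear layer.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (N, d_model); Wq, Wk, Wv, Wo: (d_model, d_model); h: number of heads."""
    N, d_model = X.shape
    d_head = d_model // h
    Q = (X @ Wq).reshape(N, h, d_head)
    K = (X @ Wk).reshape(N, h, d_head)
    V = (X @ Wv).reshape(N, h, d_head)
    heads = []
    for i in range(h):                                   # standard attention per head
        scores = Q[:, i] @ K[:, i].T / np.sqrt(d_head)
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)               # softmax over positions
        heads.append(A @ V[:, i])
    return np.concatenate(heads, axis=-1) @ Wo           # join, then linear layer

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 16))
Wq, Wk, Wv, Wo = (rng.standard_normal((16, 16)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h=4).shape)  # (5, 16)
```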
Input processing
Input embedding
● Converts the raw input tokens into vector format by projection
Positional encoding
● Injects the information of (absolute) position of each token into
the embedded vector by sinusoidal functions
18
(figure: sinusoidal positional encoding heatmap — absolute position vs. encoding dimension)
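A sketch of the sinusoidal positional encoding described above (assuming an even encoding dimension d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(n_pos, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(n_pos)[:, None]               # absolute positions
    i = np.arange(d_model // 2)[None, :]          # encoding dimension index
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is simply added to the embedded tokens before the first layer.
print(sinusoidal_positional_encoding(50, 16).shape)  # (50, 16)
```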
Experimental results
Achieved the highest scores on English-to-German/French translation tasks at roughly 1/100 of the training cost of the best methods as of 2017...
19
cf. Image GPT
Transformer’s autoregressive
generation can naturally be applied
to the next-pixel prediction task
A Transformer variant (GPT-2) trained on ImageNet shows impressive image completion results!
[Chen et al., ICML2020]
20
https://openai.com/blog/image-gpt/
Paper 2: Vision Transformer (ViT)
Question: can we completely discard convolutions to tackle real
image recognition problems like classification?
→ A pure Transformer architecture pre-trained on a very large dataset can perform better than modern CNNs.
21
Dosovitskiy et al. (Google Brain),
ICLR 2021
The ViT model
ViT uses small patches of the input image as the “tokens” (words)
Supervised training uses the features computed by the encoder
22
(figure: ViT architecture — the encoder output is used for the classification task)
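A minimal sketch of how an image becomes a token sequence in ViT (patch size 16 and embedding dimension 768 as in ViT-Base; here the linear projection is random, whereas in ViT it is learned):

```python
import numpy as np

def patchify(img, patch_size=16):
    """Split an (H, W, C) image into non-overlapping patches, flattened as tokens."""
    h, w, c = img.shape
    p = patch_size
    patches = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)         # (num_patches, p*p*C)

img = np.random.randn(224, 224, 3)
proj = np.random.randn(16 * 16 * 3, 768)          # learned linear embedding in ViT
tokens = patchify(img) @ proj
print(tokens.shape)                               # (196, 768): 14x14 patch tokens
# A learnable [CLS] token is prepended and positional embeddings are added
# before the sequence enters the Transformer encoder.
```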
Performance compared to SoTA CNNs
Experiment: fine-tuned top-1 accuracy after supervised pretraining
vs. BiT-L (supervised ResNet) and Noisy Student (semi-supervised EfficientNet)
23
JFT-300M: Google’s internal dataset consisting of 300 million images
ImageNet-21k: superset of ImageNet (1k) consisting of 21,000 classes, 14M images
Note: BiT and NS were trained with JFT
Scalability
24
Small dataset: CNN performs much better than ViT
Large dataset: CNN saturates / ViT steadily improves
Learned patch embedding
The first stage of ViT embeds 16×16 patches into vector tokens
What kind of information is extracted in this stage?
→ CNN filter-like patterns (cf. Gabor filters) are found
25
Learned locality
ViT uses learnable positional embeddings instead of sinusoidal encoding
→ embeddings at nearby positions become similar to each other
Attention heads at shallower layers attend to various distances
26
(figure: some attention heads attend to local relations, others to global relations)
Recent discoveries on ViT and related methods
27
A massive ViT pretrained on a massive dataset shows a great performance gain
→ 90.45% ImageNet top-1
A principled combination of convolution and attention is more important
→ 88.56% without a massive dataset
Scaling property [Zhai 2021]
With more computational budget and larger dataset size, performance seems to keep improving without saturation
28
Paper 3: Self-supervised ViT (DINO)
29
Self-supervised learning of ViTs
Self-supervised learning
training a model with supervision generated from an unlabeled dataset
● e.g., next-word prediction, contrastive learning, self-distillation
Why important? → richer training signals than predicting a single class label
The ViT paper studied masked patch prediction
→ worse pretraining performance
(79.9% self-supervised << 84% supervised)
30
task: predicting the mean color of masked patches
BYOL-like distillation framework applicable to both CNNs and ViTs
1. From the input image x, make two crops x1 and x2.
2. Compute the representations g(x1) and g(x2) with the student and teacher networks, respectively.
3. Update the student to match the representations, viewing them as probability distributions p1 and p2.
4. Update the teacher as the moving average of the student network’s parameters.
Tricks: feeding small crops to the student,
centering of teacher features, epoch-wise teacher update, etc.
DINO: knowledge distillation with no labels
31
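A highly simplified conceptual sketch of one DINO step, assuming hypothetical student/teacher objects with a forward(x) method and a params dict (these interfaces are made up for illustration); the real method also uses multi-crop, output centering, and a momentum schedule.

```python
import numpy as np

def dino_step(x1, x2, student, teacher, momentum=0.996, temp_s=0.1, temp_t=0.04):
    """One conceptual DINO update (sketch; `student`/`teacher` are hypothetical objects)."""
    def softmax(a, t):
        a = a / t
        a = a - a.max()
        e = np.exp(a)
        return e / e.sum()

    p1 = softmax(student.forward(x1), temp_s)   # student distribution on crop x1
    p2 = softmax(teacher.forward(x2), temp_t)   # teacher distribution on crop x2 (no gradient)
    loss = -np.sum(p2 * np.log(p1 + 1e-12))     # cross-entropy: student matches teacher
    # ... backpropagate `loss` into the student's parameters here ...

    # Teacher = exponential moving average of the student's parameters.
    for name in teacher.params:
        teacher.params[name] = (momentum * teacher.params[name]
                                + (1 - momentum) * student.params[name])
    return loss
```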
Results on transfer learning after pretraining
● Improvement over supervised pretraining was reported.
● Performance comparable to ViT pretrained on a massive supervised dataset can be obtained with ImageNet data only
32
Emerging property: attention maps as segmentation
Attention maps of the output token at the final layer are found to attend
to semantic objects without any segmentation supervision
33
(figure: final attention map of the output token; colors indicate different attention heads)
a supervised ViT does not show this property
Cost to calculate all-to-all attention
● Computational complexity of self-attention is O(N²), prohibiting
processing of large sequence sizes (= high-resolution ViT)
● Required computational budget for pre-training is also very high
Unknown potential for dense prediction
● Unsupervised segmentation by DINO is very interesting, but it is not at
the level of real applications yet.
● CNNs excel at dense prediction, e.g., image generation,
segmentation, depth estimation. Can ViT do these tasks?
Limitations of Transformer-based methods
34
Another interesting topic on ViT
Importance of the optimization algorithm: with SAM (sharpness-aware minimization), ViT can perform
better than CNNs without large-scale pretraining
35
● NLP-inspired Transformer (pure attention) models show impressive
results also on image recognition problems
● Their performance scales well as the model/dataset size increases
● Very interesting properties like unsupervised segmentation are found
My impressions
● Attention seems to have the potential to detect complex visual entities
such as infiltration in WSI (whole-slide images), which requires multi-scale observation
● How is translation equivariance realized in ViTs?
● The problem is increasingly the computational (monetary) budget
(most important Transformer papers come from Google, Facebook, OpenAI, MS, …)
Summary
36
