Training Data-efficient Image Transformers & Distillation Through Attention
Hugo Touvron et al., “Training Data-Efficient Image Transformers & Distillation Through Attention”
10th January, 2021
PR12 Paper Review
Jinwon Lee
Samsung Electronics
Reference
• Facebook AI Blog
▪ https://ai.facebook.com/blog/data-efficient-image-transformers-a-
promising-new-technique-for-image-classification
• Official Code
▪ https://github.com/facebookresearch/deit
• PR-281: ViT (Vision Transformer)
▪ https://youtu.be/D72_Cn-XV1g
• PR-243: Designing Network Design Spaces (RegNet)
▪ https://youtu.be/bnbKQRae_u4
Transformers for Computer Vision
DALL-E: Image Generation ViT: Image Classification
DETR: Object Detection
Related Work – Image Classification
• Image classification is so core to computer vision that it is often used
as a benchmark to measure progress in image understanding.
• Since 2012’s AlexNet, convnets have dominated this benchmark and
have become the de facto standard.
• Despite several attempts to use transformers for image classification,
until now their performance has been inferior to that of convnets.
Related Work – Image Classification
• Recently, Vision Transformers (ViT) closed the gap with the state of the art on ImageNet without using any convolution. This performance is remarkable since convnet methods for image classification have benefited from years of tuning and optimization.
• Nevertheless, according to that study, a pre-training phase on a large volume of curated data is required for the learned transformer to be effective.
Related Work – The Transformer Architecture
• Transformers are currently the reference model for all natural
language processing (NLP) tasks.
• Many improvements of convnets for image classification are inspired
by transformers.
<Squeeze-and-Excitation Network>
Related Work – Knowledge Distillation
• KD refers to the training paradigm in which a student model leverages
“soft” labels coming from a strong teacher network.
• The teacher’s supervision takes into account the effects of the data
augmentation, which sometimes causes a misalignment between the real
label and the image.
• KD can transfer inductive biases to a student model in a soft way, using a teacher model in which those biases are incorporated in a hard way.
(Figure: a random crop & resize can change what the image actually shows — Label: cat → Label: ???; image from Wonpyo Park’s slide @ Naver D2)
Self Attention
A slide from Deep Learning for CV@UMICH
Attention & Inductive Biases
• Fewer inductive biases than convolution and dense layers
→ Requires more data than the other layer types
Convolution layer
- Locally connected
- Same weights for all inputs
Dense layer
- Fully connected
- Same weights for all inputs
Attention layer
- Fully connected
- Different weights for all inputs
(Diagram: inputs x1 … x4 for each layer type; the attention layer’s mixing weights 𝜓(q2∙k1), 𝜓(q2∙k2), 𝜓(q2∙k3), 𝜓(q2∙k4) depend on the input itself — sketched in code below)
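A minimal PyTorch sketch of this difference (toy tensor sizes, not the paper’s configuration): the learned projections are shared across inputs, but the mixing weights 𝜓(q·k) are computed from the tokens themselves, so they differ for every input.

```python
import torch
import torch.nn.functional as F

# Toy sequence: batch of 1, four tokens x1..x4, embedding dimension 8.
x = torch.randn(1, 4, 8)

# The learned parameters (projections) are the same for every input...
w_q = torch.nn.Linear(8, 8, bias=False)
w_k = torch.nn.Linear(8, 8, bias=False)
w_v = torch.nn.Linear(8, 8, bias=False)
q, k, v = w_q(x), w_k(x), w_v(x)

# ...but the mixing weights psi(q_i . k_j) are computed from the tokens
# themselves, so they change with the input.
attn = F.softmax(q @ k.transpose(-2, -1) / 8 ** 0.5, dim=-1)  # (1, 4, 4)
out = attn @ v                  # each output is a content-dependent mixture
print(attn[0, 1])               # psi(q2.k1) ... psi(q2.k4) for token x2
```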
Visual Transformer – Same Architecture as ViT
• ViT processes input images as if they were a sequence of input tokens.
• The fixed-size input RGB image is decomposed
into a batch of N patches of a fixed size of 16 x 16
pixels (N = 14 x 14).
• The transformer block cannot consider their
relative position, so positional embeddings are
added.
• The class token is a trainable vector appended to
the patch tokens before the first layer, that goes
through the transformer layers, and is then
projected with a linear layer to predict the class.
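A minimal sketch of this tokenization, assuming a 224x224 RGB input so that N = 14 x 14 = 196 patches of 16 x 16 pixels (tensor names and the embedding dimension 768 are illustrative):

```python
import torch

img = torch.randn(1, 3, 224, 224)            # fixed-size input RGB image
patch = 16
# Decompose into N = 14 x 14 = 196 patches of 16 x 16 pixels.
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 3 * patch * patch)

embed = torch.nn.Linear(3 * patch * patch, 768)        # linear projection E
cls_token = torch.nn.Parameter(torch.zeros(1, 1, 768)) # trainable class token

tokens = torch.cat([cls_token, embed(patches)], dim=1) # (1, 197, 768)
pos_embed = torch.nn.Parameter(torch.zeros(1, 197, 768))
tokens = tokens + pos_embed   # positional embeddings restore position information
```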
Vision Transformer (ViT)
1) z₀ = [x_class; x_p¹E; x_p²E; … ; x_pᴺE] + E_pos,   E ∈ ℝ^((P²·C)×D),   E_pos ∈ ℝ^((N+1)×D)
2) z′_l = MSA(LN(z_{l−1})) + z_{l−1},   l = 1 … L
3) z_l = MLP(LN(z′_l)) + z′_l,   l = 1 … L
4) y = LN(z_L⁰)
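Read as code, the four equations correspond to the sketch below (a simplified pre-norm transformer stack built from standard PyTorch modules, not the official DeiT implementation; sizes are ViT-B-like for illustration):

```python
import torch
import torch.nn as nn

dim, depth, num_classes = 768, 12, 1000   # illustrative ViT-B/DeiT-B-sized values

class Block(nn.Module):
    def __init__(self, dim, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):
        h = self.norm1(z)
        z = self.msa(h, h, h)[0] + z     # 2) z'_l = MSA(LN(z_{l-1})) + z_{l-1}
        z = self.mlp(self.norm2(z)) + z  # 3) z_l  = MLP(LN(z'_l)) + z'_l
        return z

blocks = nn.Sequential(*[Block(dim) for _ in range(depth)])
norm, head = nn.LayerNorm(dim), nn.Linear(dim, num_classes)

tokens = torch.randn(1, 197, 768)        # z_0 from eq. 1); see the previous sketch
z = blocks(tokens)
logits = head(norm(z)[:, 0])             # 4) y = LN(z_L^0), then a linear classifier
```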
Visual Transformer – Same Architecture as ViT
• Fixing the positional encoding across resolutions
▪ It is desirable to use a lower training resolution and fine-tune the network at a larger resolution.
▪ This speeds up the full training and improves the accuracy under prevailing data augmentation schemes.
▪ When increasing the resolution of an input image, the patch size does not change, therefore the number of input patches (N) does change.
▪ Interpolation of the positional encoding is needed when changing the resolution.
Distillation Through Attention – Soft Distillation
• Soft distillation minimizes the Kullback-Leibler divergence between
the softmax of the teacher and the softmax of the student model.
• Let 𝑍𝑡 be the logits of the teacher model, 𝑍𝑠 the logits of the student
model. We denote by 𝜏 the temperature for the distillation, 𝜆 the
coefficient balancing the Kullback–Leibler divergence loss (KL) and
the cross-entropy (ℒ 𝐶𝐸) on ground truth labels 𝑦, and 𝜓 the softmax
function.
ℒ_global = (1 − 𝜆) ℒ_CE(𝜓(Z_s), y) + 𝜆𝜏² KL(𝜓(Z_s / 𝜏), 𝜓(Z_t / 𝜏))
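A sketch of this objective in PyTorch (the defaults for 𝜏 and 𝜆 below are placeholders for illustration, not prescribed values):

```python
import torch.nn.functional as F

def soft_distillation_loss(z_s, z_t, y, tau=3.0, lam=0.1):
    # (1 - lambda) * CE(softmax(Z_s), y)
    ce = F.cross_entropy(z_s, y)
    # lambda * tau^2 * KL between the temperature-scaled softmaxes
    kl = F.kl_div(
        F.log_softmax(z_s / tau, dim=-1),
        F.log_softmax(z_t / tau, dim=-1),
        reduction="batchmean",
        log_target=True,
    )
    return (1 - lam) * ce + lam * tau ** 2 * kl
```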
Distillation Through Attention – Hard Distillation
• Hard-label distillation is a variant of distillation that takes the hard decision of the teacher as the true label.
• Let 𝑦𝑡 = argmax 𝑐 𝑍𝑡(𝑐) be the hard decision of the teacher,
ℒ_global^hardDistill = ½ ℒ_CE(𝜓(Z_s), y) + ½ ℒ_CE(𝜓(Z_s), y_t)
• Note also that the hard labels can be converted into soft labels with label smoothing, where the true label is considered to have a probability of 1 − 𝜖, and the remaining 𝜖 is shared across the remaining classes. The authors fix 𝜖 = 0.1 (see the sketch below).
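A corresponding sketch of the hard-label objective; here the optional label smoothing with 𝜖 = 0.1 is realized through the cross-entropy’s label_smoothing argument, which is one possible way to implement the conversion described above:

```python
import torch.nn.functional as F

def hard_distillation_loss(z_s, z_t, y, label_smoothing=0.1):
    # Teacher's hard decision y_t = argmax_c Z_t(c)
    y_t = z_t.argmax(dim=-1)
    # 1/2 * CE(softmax(Z_s), y) + 1/2 * CE(softmax(Z_s), y_t);
    # label_smoothing=0.1 turns the hard targets into soft ones with epsilon = 0.1
    return 0.5 * F.cross_entropy(z_s, y, label_smoothing=label_smoothing) \
         + 0.5 * F.cross_entropy(z_s, y_t, label_smoothing=label_smoothing)
```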
Distillation Through Attention – Distillation Token
• New token – distillation token, similar to class token.
Distillation Through Attention – Distillation Token
• Interestingly, the learned class and distillation tokens converge towards different vectors: the average cosine similarity between these tokens equals 0.06.
• As the class and distillation embeddings are computed at each layer,
they gradually become more similar through the network, all the way
through the last layer at which their similarity is high (cos=0.93), but
still lower than 1.
• The authors verified that the distillation token adds something to the model, compared to simply adding an additional class token associated with the same target label instead of a teacher pseudo-label.
• In that control experiment, even if the two class tokens are initialized randomly and independently, during training they converge towards the same vector (cos=0.999), whereas the class and distillation tokens do not.
Distillation Through Attention – Joint Classifiers
• At test time, both the class and the distillation embeddings produced by the transformer are associated with linear classifiers and are able to infer the image label.
• It is also possible to add the softmax outputs of the two classifiers to estimate the label in a late-fusion fashion (sketched below).
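A sketch of this late-fusion inference, assuming two hypothetical linear heads head_cls and head_dist attached to the class and distillation tokens:

```python
def predict(z, head_cls, head_dist):
    # z: final token sequence; index 0 = class token, index 1 = distillation token
    logits_cls = head_cls(z[:, 0])
    logits_dist = head_dist(z[:, 1])
    # Late fusion: add the softmax outputs of the two linear classifiers
    probs = logits_cls.softmax(dim=-1) + logits_dist.softmax(dim=-1)
    return probs.argmax(dim=-1)
```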
Experiments – Transformer Models
• DeiT-B: reference model (same as ViT-B)
• DeiT-B↑384: fine-tune DeiT at a larger resolution
• DeiT⚗: DeiT with distillation (using the distillation token)
• DeiT-S (Small), DeiT-Ti (Tiny): smaller models of DeiT
Experiment - Distillation
• A convnet teacher gives better performance than a transformer teacher.
• The fact that the convnet is a better teacher is probably due to the inductive bias inherited by the transformer through distillation.
Experiment - Distillation
• Hard distillation significantly outperforms soft distillation for
transformers, even when using only a class token.
• The classifier on the two tokens is significantly better than the
independent class and distillation classifiers, which by themselves
already outperform the distillation baseline.
Experiment - Distillation
• Does the distilled model inherit an existing inductive bias that would facilitate the training?
• The table below reports the fraction of samples classified differently for all classifier pairs, i.e., the rate of different decisions.
• The distilled model is more correlated to the convnet than a transformer
learned from scratch.
Experiment - Efficiency vs accuracy
• DeiT is slightly below EfficientNet, which shows that the authors have almost closed the gap between visual transformers and convnets when training with ImageNet only.
• These results are a major improvement (+6.3% top-1 in a comparable setting) over previous ViT models trained on ImageNet-1k only.
• Furthermore, when DeiT benefits from the
distillation from a relatively weaker RegNetY
to produce DeiT⚗, it outperforms
EfficientNet.
Experiment - Efficiency vs accuracy
• Compared to EfficientNet, one can see that, for the same number of parameters, the convnet variants are much slower. This is because large matrix multiplications offer more opportunity for hardware optimization than small convolutions.
Experiment –Transfer Learning
Training Details & Ablation
• This study is intended to be a transformer analogue of the bag of tricks for convnets.
• Initialization and Hyper-parameters
▪ Transformers are relatively sensitive to initialization. The authors follow the recommendation of Hanin and Rolnick, “The effect of initialization and architecture”, NeurIPS 2018.
Training Details & Ablation
• Data Augmentation
▪ Compared to models that integrate more priors (such as convolutions),
transformers require a larger amount of data.
▪ Thus, in order to train with datasets of the same size, we rely on extensive
data augmentation.
▪ Almost all the data-augmentation methods that the authors evaluate prove to be useful.
• Regularization & Optimizers
▪ Transformers are sensitive to the setting of optimization hyper-parameters.
▪ lr_scaled = lr / 512 × batch size; stochastic depth, Mixup, CutMix and Repeated Augmentation (RA) are applied.
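For example, the learning-rate scaling rule can be written as follows (the base values are illustrative, not quoted from the ablation table):

```python
base_lr, batch_size = 5e-4, 1024         # illustrative values
lr_scaled = base_lr / 512 * batch_size   # lr_scaled = lr / 512 x batch size -> 1e-3
```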
Data Augmentation
• Random Erasing
• Mixup & Cutmix
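As an illustration of one of these augmentations, a minimal Mixup sketch (the Beta parameter alpha and the returned tuple are illustrative choices, not the exact DeiT implementation):

```python
import torch

def mixup(images, targets, alpha=0.8):
    # Mix each image with a randomly permuted partner and mix the labels accordingly.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    # Training loss becomes: lam * CE(pred, targets) + (1 - lam) * CE(pred, targets[perm])
    return mixed, targets, targets[perm], lam
```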
Batch Augmentation & Repeated Augmentation
• Batch Augmentation
▪ In an SGD optimization setting, including multiple data-augmented instances of the same image in one optimization batch, rather than having only distinct images in the batch, significantly enhances the effect of data augmentation and improves the generalization of the network.
• Repeated Augmentation
▪ In RA we form a batch ℬ of images by sampling |ℬ|/𝑚 different images from the dataset, and transform each of them up to 𝑚 times with a set of data augmentations to fill the batch (see the sketch below).
▪ The key difference with the standard sampling scheme in SGD is that samples are not independent, as augmented versions of the same image are highly correlated. While this strategy reduces the performance if the batch size is small, for larger batch sizes RA outperforms the standard i.i.d. scheme.
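A simplified sketch of building one RA-style batch (an illustrative helper, independent of the actual sampler used in the official code):

```python
import random

def repeated_augmentation_batch(dataset, batch_size, m, augment):
    # Sample batch_size / m distinct images, then fill the batch with m
    # differently-augmented copies of each, so samples are not i.i.d.
    indices = random.sample(range(len(dataset)), batch_size // m)
    batch = []
    for idx in indices:
        image, label = dataset[idx]
        batch.extend((augment(image), label) for _ in range(m))
    return batch
```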
Ablation Study on Training Methods on ImageNet
Fine-tuning at Different Resolution
• By default, and similar to ViT, the authors train DeiT models at resolution 224x224 and fine-tune at resolution 384x384.
• When fine-tuning, DeiT interpolates the positional embeddings. A bilinear interpolation reduces the l2-norm of a vector, which causes a significant drop in accuracy, so DeiT uses bicubic interpolation (see the sketch below).
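A sketch of this interpolation step, assuming a ViT-style positional embedding with a single class-token slot and 16-pixel patches, so a 14x14 grid at 224x224 becomes 24x24 at 384x384 (names and defaults are illustrative):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=14, new_grid=24, dim=768):
    # pos_embed: (1, 1 + old_grid**2, dim); slot 0 is the class token.
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    # Bicubic (rather than bilinear) interpolation of the 2D grid of embeddings.
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)
```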
Training Time
• A typical training of 300 epochs takes 37 hours with 2 nodes or 53 hours on a single node for DeiT-B.
• DeiT-S and DeiT-Ti are trained in less than 3 days on 4 GPUs. Then, optionally, the model is fine-tuned at a larger resolution. This takes 20 hours on a single node (8 GPUs) to produce a FixDeiT-B model at resolution 384x384, which corresponds to 25 epochs.
• Since DeiT uses repeated augmentation with 3 repetitions, it only sees one third of the images during a single epoch.
Conclusion
• DeiT models are image transformers that do not require a very large amount of data to be trained, thanks to an improved training and distillation procedure.
• CNNs have been optimized, in terms of both architecture and optimization, over almost a decade, including through extensive architecture search that is prone to overfitting. In contrast, DeiT only adapts the existing data augmentation and regularization strategies that pre-existed for convnets.
• Transformers will rapidly become a method of choice considering their lower memory footprint for a given accuracy.
Thank you

Tendances

Exploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningExploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningSungchul Kim
 
Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)Muhammad Haroon
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkKnoldus Inc.
 
Survey of Attention mechanism & Use in Computer Vision
Survey of Attention mechanism & Use in Computer VisionSurvey of Attention mechanism & Use in Computer Vision
Survey of Attention mechanism & Use in Computer VisionSwatiNarkhede1
 
[DL輪読会]“Meta-Learning for Online Update of Recommender Systems. (AAAI 2022)”
[DL輪読会]“Meta-Learning for Online Update of Recommender Systems.  (AAAI 2022)” [DL輪読会]“Meta-Learning for Online Update of Recommender Systems.  (AAAI 2022)”
[DL輪読会]“Meta-Learning for Online Update of Recommender Systems. (AAAI 2022)” Deep Learning JP
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in VisionSangmin Woo
 
Outrageously Large Neural Networks:The Sparsely-Gated Mixture-of-Experts Laye...
Outrageously Large Neural Networks:The Sparsely-Gated Mixture-of-Experts Laye...Outrageously Large Neural Networks:The Sparsely-Gated Mixture-of-Experts Laye...
Outrageously Large Neural Networks:The Sparsely-Gated Mixture-of-Experts Laye...KCS Keio Computer Society
 
[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)Donghyeon Kim
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Gaurav Mittal
 
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料Yusuke Uchida
 
Semi supervised, weakly-supervised, unsupervised, and active learning
Semi supervised, weakly-supervised, unsupervised, and active learningSemi supervised, weakly-supervised, unsupervised, and active learning
Semi supervised, weakly-supervised, unsupervised, and active learningYusuke Uchida
 
Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNNShuai Zhang
 
Introduction to batch normalization
Introduction to batch normalizationIntroduction to batch normalization
Introduction to batch normalizationJamie (Taka) Wang
 
Introduction to Tree-LSTMs
Introduction to Tree-LSTMsIntroduction to Tree-LSTMs
Introduction to Tree-LSTMsDaniel Perez
 
08. spectal clustering
08. spectal clustering08. spectal clustering
08. spectal clusteringJeonghun Yoon
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksChristian Perone
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningMohamed Loey
 

Tendances (20)

Exploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningExploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation Learning
 
Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural Network
 
Survey of Attention mechanism & Use in Computer Vision
Survey of Attention mechanism & Use in Computer VisionSurvey of Attention mechanism & Use in Computer Vision
Survey of Attention mechanism & Use in Computer Vision
 
[DL輪読会]“Meta-Learning for Online Update of Recommender Systems. (AAAI 2022)”
[DL輪読会]“Meta-Learning for Online Update of Recommender Systems.  (AAAI 2022)” [DL輪読会]“Meta-Learning for Online Update of Recommender Systems.  (AAAI 2022)”
[DL輪読会]“Meta-Learning for Online Update of Recommender Systems. (AAAI 2022)”
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in Vision
 
Outrageously Large Neural Networks:The Sparsely-Gated Mixture-of-Experts Laye...
Outrageously Large Neural Networks:The Sparsely-Gated Mixture-of-Experts Laye...Outrageously Large Neural Networks:The Sparsely-Gated Mixture-of-Experts Laye...
Outrageously Large Neural Networks:The Sparsely-Gated Mixture-of-Experts Laye...
 
[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)[기초개념] Graph Convolutional Network (GCN)
[기초개념] Graph Convolutional Network (GCN)
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)
 
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
 
Semi supervised, weakly-supervised, unsupervised, and active learning
Semi supervised, weakly-supervised, unsupervised, and active learningSemi supervised, weakly-supervised, unsupervised, and active learning
Semi supervised, weakly-supervised, unsupervised, and active learning
 
Swin transformer
Swin transformerSwin transformer
Swin transformer
 
CNN Tutorial
CNN TutorialCNN Tutorial
CNN Tutorial
 
Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Video: Action Recognition (UPC 2018)Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Video: Action Recognition (UPC 2018)
 
Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNN
 
Introduction to batch normalization
Introduction to batch normalizationIntroduction to batch normalization
Introduction to batch normalization
 
Introduction to Tree-LSTMs
Introduction to Tree-LSTMsIntroduction to Tree-LSTMs
Introduction to Tree-LSTMs
 
08. spectal clustering
08. spectal clustering08. spectal clustering
08. spectal clustering
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural Networks
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep Learning
 

Similaire à PR-297: Training data-efficient image transformers & distillation through attention

Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Jeong-Gwan Lee
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用CHENHuiMei
 
Generalized Pipeline Parallelism for DNN Training
Generalized Pipeline Parallelism for DNN TrainingGeneralized Pipeline Parallelism for DNN Training
Generalized Pipeline Parallelism for DNN TrainingDatabricks
 
SVD and the Netflix Dataset
SVD and the Netflix DatasetSVD and the Netflix Dataset
SVD and the Netflix DatasetBen Mabey
 
Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...
Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...
Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...Yan Xu
 
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptx
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptxEfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptx
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptxssuser2624f71
 
SpecAugment review
SpecAugment reviewSpecAugment review
SpecAugment reviewJune-Woo Kim
 
Decomposing image generation into layout priction and conditional synthesis
Decomposing image generation into layout priction and conditional synthesisDecomposing image generation into layout priction and conditional synthesis
Decomposing image generation into layout priction and conditional synthesisNaeem Shehzad
 
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...Joonhyung Lee
 
A Comparison of Loss Function on Deep Embedding
A Comparison of Loss Function on Deep EmbeddingA Comparison of Loss Function on Deep Embedding
A Comparison of Loss Function on Deep EmbeddingCenk Bircanoğlu
 
Network Deconvolution review [cdm]
Network Deconvolution review [cdm]Network Deconvolution review [cdm]
Network Deconvolution review [cdm]Dongmin Choi
 
PR243: Designing Network Design Spaces
PR243: Designing Network Design SpacesPR243: Designing Network Design Spaces
PR243: Designing Network Design SpacesJinwon Lee
 
DALL-E.pdf
DALL-E.pdfDALL-E.pdf
DALL-E.pdfdsfajkh
 
Accelerating stochastic gradient descent using adaptive mini batch size3
Accelerating stochastic gradient descent using adaptive mini batch size3Accelerating stochastic gradient descent using adaptive mini batch size3
Accelerating stochastic gradient descent using adaptive mini batch size3muayyad alsadi
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networksAkash Goel
 
Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentationOwin Will
 

Similaire à PR-297: Training data-efficient image transformers & distillation through attention (20)

Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用
 
Generalized Pipeline Parallelism for DNN Training
Generalized Pipeline Parallelism for DNN TrainingGeneralized Pipeline Parallelism for DNN Training
Generalized Pipeline Parallelism for DNN Training
 
SVD and the Netflix Dataset
SVD and the Netflix DatasetSVD and the Netflix Dataset
SVD and the Netflix Dataset
 
Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...
Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...
Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...
 
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptx
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptxEfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptx
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.pptx
 
SpecAugment review
SpecAugment reviewSpecAugment review
SpecAugment review
 
OBDPC 2022
OBDPC 2022OBDPC 2022
OBDPC 2022
 
Decomposing image generation into layout priction and conditional synthesis
Decomposing image generation into layout priction and conditional synthesisDecomposing image generation into layout priction and conditional synthesis
Decomposing image generation into layout priction and conditional synthesis
 
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...
 
A Comparison of Loss Function on Deep Embedding
A Comparison of Loss Function on Deep EmbeddingA Comparison of Loss Function on Deep Embedding
A Comparison of Loss Function on Deep Embedding
 
lec6a.ppt
lec6a.pptlec6a.ppt
lec6a.ppt
 
Ndp Slides
Ndp SlidesNdp Slides
Ndp Slides
 
Network Deconvolution review [cdm]
Network Deconvolution review [cdm]Network Deconvolution review [cdm]
Network Deconvolution review [cdm]
 
PR243: Designing Network Design Spaces
PR243: Designing Network Design SpacesPR243: Designing Network Design Spaces
PR243: Designing Network Design Spaces
 
DALL-E.pdf
DALL-E.pdfDALL-E.pdf
DALL-E.pdf
 
Blow review
Blow reviewBlow review
Blow review
 
Accelerating stochastic gradient descent using adaptive mini batch size3
Accelerating stochastic gradient descent using adaptive mini batch size3Accelerating stochastic gradient descent using adaptive mini batch size3
Accelerating stochastic gradient descent using adaptive mini batch size3
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networks
 
Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentation
 

Plus de Jinwon Lee

PR-366: A ConvNet for 2020s
PR-366: A ConvNet for 2020sPR-366: A ConvNet for 2020s
PR-366: A ConvNet for 2020sJinwon Lee
 
PR-355: Masked Autoencoders Are Scalable Vision Learners
PR-355: Masked Autoencoders Are Scalable Vision LearnersPR-355: Masked Autoencoders Are Scalable Vision Learners
PR-355: Masked Autoencoders Are Scalable Vision LearnersJinwon Lee
 
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...Jinwon Lee
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionJinwon Lee
 
PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)Jinwon Lee
 
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object DetectorPR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object DetectorJinwon Lee
 
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...Jinwon Lee
 
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsPR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsJinwon Lee
 
PR-217: EfficientDet: Scalable and Efficient Object Detection
PR-217: EfficientDet: Scalable and Efficient Object DetectionPR-217: EfficientDet: Scalable and Efficient Object Detection
PR-217: EfficientDet: Scalable and Efficient Object DetectionJinwon Lee
 
PR-207: YOLOv3: An Incremental Improvement
PR-207: YOLOv3: An Incremental ImprovementPR-207: YOLOv3: An Incremental Improvement
PR-207: YOLOv3: An Incremental ImprovementJinwon Lee
 
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...Jinwon Lee
 
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional KernelsPR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional KernelsJinwon Lee
 
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image RecognitionPR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image RecognitionJinwon Lee
 
PR-144: SqueezeNext: Hardware-Aware Neural Network Design
PR-144: SqueezeNext: Hardware-Aware Neural Network DesignPR-144: SqueezeNext: Hardware-Aware Neural Network Design
PR-144: SqueezeNext: Hardware-Aware Neural Network DesignJinwon Lee
 
PR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorPR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorJinwon Lee
 
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...Jinwon Lee
 
PR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear BottlenecksPR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear BottlenecksJinwon Lee
 
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning TasksPR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning TasksJinwon Lee
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitJinwon Lee
 
Efficient Neural Architecture Search via Parameter Sharing
Efficient Neural Architecture Search via Parameter SharingEfficient Neural Architecture Search via Parameter Sharing
Efficient Neural Architecture Search via Parameter SharingJinwon Lee
 

Plus de Jinwon Lee (20)

PR-366: A ConvNet for 2020s
PR-366: A ConvNet for 2020sPR-366: A ConvNet for 2020s
PR-366: A ConvNet for 2020s
 
PR-355: Masked Autoencoders Are Scalable Vision Learners
PR-355: Masked Autoencoders Are Scalable Vision LearnersPR-355: Masked Autoencoders Are Scalable Vision Learners
PR-355: Masked Autoencoders Are Scalable Vision Learners
 
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
 
PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
 
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object DetectorPR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
 
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
 
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsPR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
 
PR-217: EfficientDet: Scalable and Efficient Object Detection
PR-217: EfficientDet: Scalable and Efficient Object DetectionPR-217: EfficientDet: Scalable and Efficient Object Detection
PR-217: EfficientDet: Scalable and Efficient Object Detection
 
PR-207: YOLOv3: An Incremental Improvement
PR-207: YOLOv3: An Incremental ImprovementPR-207: YOLOv3: An Incremental Improvement
PR-207: YOLOv3: An Incremental Improvement
 
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
 
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional KernelsPR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
 
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image RecognitionPR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition
 
PR-144: SqueezeNext: Hardware-Aware Neural Network Design
PR-144: SqueezeNext: Hardware-Aware Neural Network DesignPR-144: SqueezeNext: Hardware-Aware Neural Network Design
PR-144: SqueezeNext: Hardware-Aware Neural Network Design
 
PR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorPR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox Detector
 
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
 
PR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear BottlenecksPR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
 
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning TasksPR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 
Efficient Neural Architecture Search via Parameter Sharing
Efficient Neural Architecture Search via Parameter SharingEfficient Neural Architecture Search via Parameter Sharing
Efficient Neural Architecture Search via Parameter Sharing
 

Dernier

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Dernier (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

PR-297: Training data-efficient image transformers & distillation through attention

  • 1. Training Data-efficient ImageTransformers & DistillationThrough Attention HugoTouvron et al., “Training Data-Efficient ImageTransformers & DistillationThrough Attention” 10th January, 2021 PR12 Paper Review JinWon Lee Samsung Electronics
  • 2. Reference • Facebook AI Blog ▪ https://ai.facebook.com/blog/data-efficient-image-transformers-a- promising-new-technique-for-image-classification • Official Code ▪ https://github.com/facebookresearch/deit • PR-281:ViT(VisionTransformer) ▪ https://youtu.be/D72_Cn-XV1g • PR-243: Designing Network Design Spaces(RegNet) ▪ https://youtu.be/bnbKQRae_u4
  • 3. Transformers for ComputerVision DALL-E: Image Generation ViT: Image Classification DETR: Object Detection
  • 4. RelatedWork – Image Classification • Image classification is so core to computer vision that it is often used as a benchmark to measure progress in image understanding. • Since 2012’s AlexNet, convnets have dominated this benchmark and have become the de facto standard. • Despite several attempts to use transformers for image classification, until now their performance has been inferior to that of convnets.
  • 5. RelatedWork – Image Classification • RecentlyVisionTransformers (ViT) closed the gap with the state of the art on ImageNet, without using any convolution.This performance is remarkable since convnet methods for image classification have benefited from years of tuning and optimization • Nevertheless, according to this study, a pre-training phase on a large volume of curated data is required for the learned transformer to be effective.
  • 6. RelatedWork –TheTransformerArchitecture • Transformers are currently the reference model for all natural language processing (NLP) tasks. • Many improvements of convnets for image classification are inspired by transformers. <Squeeze-and-Excitation Network>
  • 7. RelatedWork – Knowledge Distillation • KD refers to the training paradigm in which a student model leverages “soft” labels coming from a strong teacher network. • The teacher’s supervision takes into account the effects of the data augmentation, which sometimes causes a misalignment between the real label and the image. • KD can transfer inductive biases in a soft way in a student model using a teacher model where they would be incorporated in a hard way. Image from Wonpyo Park’s Slide @ Naver D2 random crop & resize Label: cat Label: ???
  • 8. Self Attention A slide from Deep Learning for CV@UMICH
  • 9. Attention & Inductive Biases • Fewer inductive biases than convolution and dense layer → Requires more data than others Convolution layer - Locally connected - Same weights for all inputs Dense layer - Fully connected - Same weights for all inputs Attention layer - Fully connected - Different weights for all inputs x1 x2 x3 x4 x1 x2 x3 x4 x1 x2 x3 x4 𝜓(q2〮k1) 𝜓(q2〮k2) 𝜓(q2〮k3) 𝜓(q2〮k4)
  • 10. VisualTransformer – Same Architecture asViT • ViT processes input images as if there were a sequence of input tokens. • The fixed-size input RGB image is decomposed into a batch of N patches of a fixed size of 16 x 16 pixels (N = 14 x 14). • The transformer block cannot consider their relative position, so positional embeddings are added. • The class token is a trainable vector appended to the patch tokens before the first layer, that goes through the transformer layers, and is then projected with a linear layer to predict the class.
  • 11. VisionTransformer(ViT) 1) 𝑧0 = 𝑥 𝑐𝑙𝑎𝑠𝑠; 𝑥 𝑝 1 𝐸; 𝑥 𝑝 2 𝐸; … ; 𝑥 𝑝 𝑁 𝐸 + 𝐸 𝑝𝑜𝑠, 𝐸 ∈ ℝ 𝑃2∙𝐶 ×𝐷, 𝐸 𝑝𝑜𝑠 ∈ ℝ(𝑁+1)×𝐷 2) 𝑧′ 𝑙 = 𝑀𝑆𝐴 𝐿𝑁 𝑧𝑙−1 + 𝑧𝑙−1, 𝑙 = 1 … 𝐿 3) 𝑧𝑙 = 𝑀𝐿𝑃 𝐿𝑁 𝑧′ 𝑙 + 𝑧′ 𝑙, 𝑙 = 1 … 𝐿 4) 𝑦 = 𝐿𝑁(𝑧 𝐿 0 )
  • 12. VisualTransformer – Same Architecture asViT • Fixing the positional encoding across resolutions ▪ It is desirable to use a lower training resolution and fine-tune the net work at the larger resolution. ▪ This speeds up the full training and improves the accuracy under prevailing data augmentation schemes. ▪ When increasing the resolution of an input image, patch size does not change, therefore the number of input patches(N) does change. ▪ Interpolation for positional encoding is needed when changing the resolution.
  • 13. DistillationThrough Attention – Soft Distillation • Soft distillation minimizes the Kullback-Leibler divergence between the softmax of the teacher and the softmax of the student model. • Let 𝑍𝑡 be the logits of the teacher model, 𝑍𝑠 the logits of the student model.We denote by 𝜏 the temperature for the distillation, 𝜆 the coefficient balancing the Kullback–Leibler divergence loss (KL) and the cross-entropy (ℒ 𝐶𝐸) on ground truth labels 𝑦, and 𝜓 the softmax function. ℒglobal = 1 − 𝜆 ℒ 𝐶𝐸 𝜓 𝑍𝑠 , 𝑦 + 𝜆𝜏2KL(𝜓 𝑍𝑠 𝜏 , 𝜓 𝑍𝑡 𝜏 )
  • 14. DistillationThrough Attention – Hard Distillation • Hard-label distillation is a variant distillation which takes the hard decision of the teacher as a true label. • Let 𝑦𝑡 = argmax 𝑐 𝑍𝑡(𝑐) be the hard decision of the teacher, ℒglobal hardDistill = 1 2 ℒ 𝐶𝐸 𝜓 𝑍𝑠 , 𝑦 + 1 2 ℒ 𝐶𝐸 𝜓 𝑍𝑠 , 𝑦𝑡 • Note also that the hard labels can also be converted into soft labels with label smoothing, where the true label is considered to have a probability of 1 − 𝜖, and the remaining 𝜖 is shared across the remaining classes.The authors fix 𝜖 = 0.1
  • 15. DistillationThrough Attention – DistillationToken • New token – distillation token, similar to class token.
  • 16. DistillationThrough Attention – DistillationToken • Interestingly, the learned class and distillation tokens converge towards different vectors: the average cosine similarity between these tokens equal to 0.06. • As the class and distillation embeddings are computed at each layer, they gradually become more similar through the network, all the way through the last layer at which their similarity is high (cos=0.93), but still lower than 1. • The authors verified that distillation token adds something to the model, compared to simply adding an additional class token associated with the same target label: instead of a teacher pseudo- label. • Even if initialing them randomly and independently, during training they converge towards the same vector (cos=0.999),
  • 17. DistillationThrough Attention – Joint Classifiers • At test time, both the class or the distillation embeddings produced by the transformer are associated with linear classifiers and able to infer the image label. • It is also possible to add the softmax output by the two classifiers to estimate it in a late fusion fashion.
  • 18. Experiments –Transformer Models • DeiT-B: reference model (same asViT-B) • DeiT-B↑384: fine-tune DeiT at a larger resolution • DeiT⚗: DeiT with distillation(using distillation token) • DeiT-S(Small), DeiT-Ti(Tiny): smaller models of DeiT
  • 19. Experiment - Distillation • A convnet teacher gives best performance than using a transformer. • The fact that the convnet is a better teacher is probably due to the inductive bias inherited by the transformers through distillation
  • 20. Experiment - Distillation • Hard distillation significantly outperforms soft distillation for transformers, even when using only a class token. • The classifier on the two tokens is significantly better than the independent class and distillation classifiers, which by themselves already outperform the distillation baseline.
  • 21. Experiment - Distillation • Does it inherit existing inductive bias that would facilitate the training? • Below table reports the fraction of sample classified differently for all classifier pairs, i.e., the rate of different decisions. • The distilled model is more correlated to the convnet than a transformer learned from scratch.
  • 22. Experiment - Efficiency vs accuracy • DeiT is slightly below EfficientNet, which shows that almost closed the gap between visual transformers and convnets when training with Imagenet only. • These results are a major improvement (+6.3% top-1 in a comparable setting) over previous ViT models trained on Imagenet1k only • Furthermore, when DeiT benefits from the distillation from a relatively weaker RegNetY to produce DeiT⚗, it outperforms EfficientNet.
  • 23. Experiment - Efficiency vs accuracy • Compared to EfficientNet, one can see that, for the same number of parameters, the convnet variants are much slower.This is because large matrix multiplications offer more opportunity for hardware optimization than small convolutions.
  • 25. Training Details & Ablation • This study is intended to be transformer analogous of the bag of tricks for convnets. • Initialization and Hyper-perameters ▪ Transformers are relatively sensitive to initialization. The authors follow the recommendation of Hanin and Rolnick, et al., “The effect of initialization and architecture”, NIPS 2018
  • 26. Training Details & Ablation • Data Augmentation ▪ Compared to models that integrate more priors (such as convolutions), transformers require a larger amount of data. ▪ Thus, in order to train with datasets of the same size, we rely on extensive data augmentation. ▪ Almost all the data-augmentation methods that authors evaluate prove to be useful. • Regularization & Optimizers ▪ Transformers are sensitive to the setting of optimization hyper-parameters. ▪ lrscaled = lr 512 × batchsize, stochastic depth, Mixup, Cutmix and Repeated Augmentation(RA) are applied.
  • 28. Batch Augmentation & Repeated Augmentation • Batch Augmentation ▪ In a SGD optimization setting, including multiple data-augmented instances of the same image in one optimization batch, rather than having only distinct images in the batch, significantly enhances the effect of data- augmentations and improve the generalization of the network. • Repeated Augmentation ▪ In RA we form an ℬ image batch by sampling ℬ /𝑚 different images from the dataset, and transform them up to 𝑚 times by a set of data augmentations to fill the batch. ▪ The key difference with the standard sampling scheme in SGD is that samples are not independent, as augmented versions of the same image are highly correlated.While this strategy reduces the performance if the batch size is small, for larger batch sizes RA outperforms the standard i.i.d. scheme.
  • 29. Ablation Study onTraining Methods on ImageNet
  • 30. Fine-tuning at Different Resolution • By default and similar toViT, the authors train DeiT models at resolution 224x224 and fine-tune at resolution 384x384. • In fine-tuning case, DeiT interpolates positional embedding. A bilinear interpolation reduces l2-norm of a vector and it causes a significant drop in accuracy. So, DeiT uses a bicubic interpolation.
  • 31. TrainingTime • A typical training of 300 epochs takes 37 hours with 2 nodes or 53 hours on a single node for the DeiT-B. • DeiT-S and DeiT-Ti are trained in less than 3 days on 4 GPU.Then, optionally we finetune the model at a larger resolution.This takes 20 hours on a single node (8 GPU) to produce a FixDeiT-B model at resolution 384x384, which corresponds to 25 epochs. • Since DeiT use repeated augmentation with 3 repetitions, It only sees one third of the images during a single epoch.
  • 32. Conclusion • DeiT, which are image transformers that do not require very large amount of data to be trained, thanks to improved training and distillation procedure. • CNN have optimized, both in terms of architecture and optimization during almost a decade, including through extensive architecture search that is prone to overfitting. In contrast, for DeiT is only optimized the existing data augmentation and regularization strategies pre-existing for convnets. • Transformers will rapidly become a method of choice considering their lower memory footprint for a given accuracy.