Hello, this is the 297th paper review from PR-12, the TensorFlow Korea paper-reading group.
Before we know it, only three papers remain until the end of PR-12 season 3.
Once season 3 ends, recruitment of new members for season 4 will begin right away. Your interest and applications are very welcome!
(The recruitment notice will be posted in the Facebook TensorFlow Korea group.)
The paper I reviewed today is Facebook's "Training data-efficient image transformers & distillation through attention".
Since Google's ViT paper, interest in computer vision algorithms that use only attention, without any convolution at all, has been higher than ever.
The DeiT model proposed in this paper uses the same architecture as ViT, but whereas ViT underperformed when trained on ImageNet data alone,
DeiT achieves better performance than EfficientNet using ImageNet data only, through improved training methods and a new knowledge distillation method.
Are CNNs really going to fade away now? Will attention conquer computer vision as well?
Personally, I am convinced that attention-based CV papers will pour out for a while, and I believe surprising things can happen here.
CNNs have advanced through a decade of intensive research, whereas transformers have only recently been applied to CV, so my expectations are all the higher.
Since attention is the model form with the least inductive bias, I think it can produce even more surprising results.
OpenAI's recently released DALL-E is a representative example. If you are curious about yet another transformation by the Transformer, please see the video below.
Video link: https://youtu.be/DjEvzeiWBTo
Paper link: https://arxiv.org/abs/2012.12877
4. Related Work – Image Classification
• Image classification is so core to computer vision that it is often used
as a benchmark to measure progress in image understanding.
• Since 2012’s AlexNet, convnets have dominated this benchmark and
have become the de facto standard.
• Despite several attempts to use transformers for image classification,
until now their performance has been inferior to that of convnets.
5. Related Work – Image Classification
• Recently, Vision Transformers (ViT) closed the gap with the state of
the art on ImageNet without using any convolution. This
performance is remarkable since convnet methods for image
classification have benefited from years of tuning and optimization
• Nevertheless, according to this study, a pre-training phase on a large
volume of curated data is required for the learned transformer to be
effective.
6. Related Work – The Transformer Architecture
• Transformers are currently the reference model for all natural
language processing (NLP) tasks.
• Many improvements of convnets for image classification are inspired
by transformers.
<Squeeze-and-Excitation Network>
7. Related Work – Knowledge Distillation
• KD refers to the training paradigm in which a student model leverages
“soft” labels coming from a strong teacher network.
• The teacher’s supervision takes into account the effects of the data
augmentation, which sometimes causes a misalignment between the real
label and the image.
• KD can transfer inductive biases to a student model in a soft way, by
using a teacher model in which they are incorporated in a hard way.
[Figure (from Wonpyo Park's slide @ Naver D2): random crop & resize of an image labeled "cat" can produce a crop whose true label is unclear ("???")]
9. Attention & Inductive Biases
• Fewer inductive biases than convolution and dense layers
→ Requires more data than the other layer types
Convolution layer
- Locally connected
- Same weights for all inputs
Dense layer
- Fully connected
- Same weights for all inputs
Attention layer
- Fully connected
- Different weights for all inputs
[Diagram: inputs x1–x4 under each of the three layer types; for the attention layer the mixing weights 𝜓(q2·k1), 𝜓(q2·k2), 𝜓(q2·k3), 𝜓(q2·k4) are computed from the inputs]
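To make the contrast concrete, here is a minimal NumPy sketch (all names are mine, not from the paper) of why an attention layer has "different weights for all inputs": the mixing weights 𝜓(q·k) are recomputed from each input, whereas conv and dense layers reuse the same trained weights for every input.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """x: (N, d) token embeddings; Wq/Wk/Wv: (d, d) learned projections."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # The mixing weights psi(q_i . k_j) depend on the input x itself,
    # so two different inputs are combined with two different weightings.
    weights = softmax(q @ k.T / np.sqrt(x.shape[-1]))  # (N, N)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))                          # four tokens x1..x4
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
out = self_attention(x, Wq, Wk, Wv)                  # (4, d)
```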
10. Visual Transformer – Same Architecture as ViT
• ViT processes input images as if they were a
sequence of input tokens.
• The fixed-size input RGB image is decomposed
into a batch of N patches of a fixed size of 16 x 16
pixels (N = 14 x 14).
• The transformer block cannot consider their
relative position, so positional embeddings are
added.
• The class token is a trainable vector appended to
the patch tokens before the first layer, that goes
through the transformer layers, and is then
projected with a linear layer to predict the class.
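A rough PyTorch sketch of this input pipeline (a hypothetical minimal module, not the authors' code; expressing patch extraction plus linear projection as one strided convolution is a common implementation trick that I assume here):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """224x224 input, 16x16 patches -> N = 14x14 = 196 patch tokens."""
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2              # 196
        # Patch extraction + linear projection as one strided conv.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))    # trainable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                  # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)        # (B, 196, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # (B, 1, dim)
        x = torch.cat([cls, x], dim=1)                     # class token + patch tokens
        return x + self.pos_embed                          # add positional embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))         # (2, 197, 768)
```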
12. Visual Transformer – Same Architecture as ViT
• Fixing the positional encoding across resolutions
▪ It is desirable to use a lower training resolution and fine-tune the network at
the larger resolution.
▪ This speeds up the full training and improves the accuracy under prevailing
data augmentation schemes.
▪ When increasing the resolution of an input image, the patch size does not
change; therefore, the number of input patches (N) changes.
▪ Interpolation of the positional encodings is needed when changing the resolution.
13. Distillation Through Attention – Soft Distillation
• Soft distillation minimizes the Kullback-Leibler divergence between
the softmax of the teacher and the softmax of the student model.
• Let 𝑍𝑡 be the logits of the teacher model, 𝑍𝑠 the logits of the student
model. We denote by 𝜏 the temperature for the distillation, 𝜆 the
coefficient balancing the Kullback–Leibler divergence loss (KL) and
the cross-entropy (ℒ 𝐶𝐸) on ground truth labels 𝑦, and 𝜓 the softmax
function.
$\mathcal{L}_{\text{global}} = (1-\lambda)\,\mathcal{L}_{\text{CE}}(\psi(Z_s), y) + \lambda\tau^{2}\,\mathrm{KL}\big(\psi(Z_s/\tau),\, \psi(Z_t/\tau)\big)$
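A minimal PyTorch sketch of this loss (the function name and the defaults for τ and λ are mine, chosen for illustration; the KL term is written in the usual distillation form, with the student's log-probabilities against the teacher's probabilities):

```python
import torch.nn.functional as F

def soft_distillation_loss(Z_s, Z_t, y, tau=3.0, lam=0.1):
    """Z_s, Z_t: (B, C) student/teacher logits; y: (B,) ground-truth labels."""
    ce = F.cross_entropy(Z_s, y)                        # L_CE(psi(Z_s), y)
    kl = F.kl_div(F.log_softmax(Z_s / tau, dim=1),      # psi(Z_s / tau)
                  F.softmax(Z_t / tau, dim=1),          # psi(Z_t / tau)
                  reduction="batchmean")
    return (1 - lam) * ce + lam * tau ** 2 * kl
```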
14. Distillation Through Attention – Hard Distillation
• Hard-label distillation is a variant distillation which takes the hard
decision of the teacher as a true label.
• Let $y_t = \operatorname{argmax}_c Z_t(c)$ be the hard decision of the teacher; then
$\mathcal{L}_{\text{global}}^{\text{hardDistill}} = \tfrac{1}{2}\,\mathcal{L}_{\text{CE}}(\psi(Z_s), y) + \tfrac{1}{2}\,\mathcal{L}_{\text{CE}}(\psi(Z_s), y_t)$
• Note that the hard labels can also be converted into soft labels with
label smoothing, where the true label is considered to have a
probability of 1 − 𝜖 and the remaining 𝜖 is shared across the
remaining classes. The authors fix 𝜖 = 0.1.
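A matching sketch of the hard-label objective (again with my own names; PyTorch's built-in label_smoothing argument, available in recent versions, stands in for the 𝜖 = 0.1 smoothing described above):

```python
import torch.nn.functional as F

def hard_distillation_loss(Z_s, Z_t, y, eps=0.1):
    """Z_s, Z_t: (B, C) student/teacher logits; y: (B,) ground-truth labels."""
    y_t = Z_t.argmax(dim=1)                                       # teacher's hard decision
    ce_true = F.cross_entropy(Z_s, y, label_smoothing=eps)        # L_CE(psi(Z_s), y)
    ce_teacher = F.cross_entropy(Z_s, y_t, label_smoothing=eps)   # L_CE(psi(Z_s), y_t)
    return 0.5 * ce_true + 0.5 * ce_teacher
```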
16. Distillation Through Attention – Distillation Token
• Interestingly, the learned class and distillation tokens converge
towards different vectors: the average cosine similarity between
these tokens equals 0.06.
• As the class and distillation embeddings are computed at each layer,
they gradually become more similar through the network, all the way
through the last layer at which their similarity is high (cos=0.93), but
still lower than 1.
• The authors verified that the distillation token adds something to the
model, compared with simply adding an additional class token
associated with the same target label instead of a teacher pseudo-label.
• Even when these two class tokens are initialized randomly and
independently, during training they converge towards the same vector
(cos = 0.999).
17. Distillation Through Attention – Joint Classifiers
• At test time, both the class and the distillation embeddings produced
by the transformer are associated with linear classifiers and are able to
infer the image label.
• It is also possible to add the softmax outputs of the two classifiers to
estimate the label in a late-fusion fashion, as sketched below.
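In code, the late fusion is just a sum of the two softmax outputs (a sketch; tensor names are mine):

```python
def late_fusion_predict(cls_logits, dist_logits):
    """Add the softmax outputs of the class and distillation classifiers."""
    probs = cls_logits.softmax(dim=1) + dist_logits.softmax(dim=1)
    return probs.argmax(dim=1)
```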
18. Experiments – Transformer Models
• DeiT-B: reference model (same as ViT-B)
• DeiT-B↑384: DeiT-B fine-tuned at a larger resolution
• DeiT⚗: DeiT with distillation (using the distillation token)
• DeiT-S (Small), DeiT-Ti (Tiny): smaller models of DeiT
19. Experiment - Distillation
• A convnet teacher gives better performance than a transformer teacher.
• The fact that the convnet is a better teacher is probably due to the inductive
bias inherited by the transformer through distillation.
20. Experiment - Distillation
• Hard distillation significantly outperforms soft distillation for
transformers, even when using only a class token.
• The classifier on the two tokens is significantly better than the
independent class and distillation classifiers, which by themselves
already outperform the distillation baseline.
21. Experiment - Distillation
• Does the distilled model inherit an existing inductive bias that would facilitate the training?
• The table below reports the fraction of samples classified differently for all
classifier pairs, i.e., the rate of different decisions.
• The distilled model is more correlated to the convnet than a transformer
learned from scratch.
22. Experiment - Efficiency vs accuracy
• DeiT is slightly below EfficientNet, which
shows that DeiT has almost closed the gap between
visual transformers and convnets when
training with ImageNet only.
• These results are a major improvement (+6.3%
top-1 in a comparable setting) over previous
ViT models trained on ImageNet-1k only.
• Furthermore, when DeiT benefits from the
distillation from a relatively weaker RegNetY
to produce DeiT⚗, it outperforms
EfficientNet.
23. Experiment - Efficiency vs accuracy
• Compared to EfficientNet, one can see
that, for the same number of
parameters, the convnet variants are
much slower. This is because large
matrix multiplications offer more
opportunity for hardware optimization
than small convolutions.
25. Training Details & Ablation
• This study is intended to be the transformer analogue of the bag of
tricks for convnets.
• Initialization and Hyper-parameters
▪ Transformers are relatively sensitive to initialization. The authors follow the
recommendation of Hanin and Rolnick, "The effect of initialization and
architecture", NeurIPS 2018.
26. Training Details & Ablation
• Data Augmentation
▪ Compared to models that integrate more priors (such as convolutions),
transformers require a larger amount of data.
▪ Thus, in order to train with datasets of the same size, we rely on extensive
data augmentation.
▪ Almost all the data-augmentation methods that the authors evaluate prove to
be useful.
• Regularization & Optimizers
▪ Transformers are sensitive to the setting of optimization hyper-parameters.
▪ The learning rate is scaled with the batch size as $\text{lr}_{\text{scaled}} = \frac{\text{lr}}{512} \times \text{batchsize}$; stochastic depth, Mixup, CutMix, and Repeated Augmentation (RA) are applied.
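For example, assuming a base learning rate of 5·10⁻⁴, a batch size of 1024 gives $\text{lr}_{\text{scaled}} = \frac{5\cdot 10^{-4}}{512} \times 1024 = 10^{-3}$.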
28. Batch Augmentation & Repeated Augmentation
• Batch Augmentation
▪ In an SGD optimization setting, including multiple data-augmented instances
of the same image in one optimization batch, rather than having only
distinct images in the batch, significantly enhances the effect of data
augmentation and improves the generalization of the network.
• Repeated Augmentation
▪ In RA we form a batch of ℬ images by sampling ℬ/𝑚 different images from
the dataset and transforming them up to 𝑚 times with a set of data
augmentations to fill the batch.
▪ The key difference with the standard sampling scheme in SGD is that
samples are not independent, as augmented versions of the same image are
highly correlated. While this strategy reduces the performance if the batch
size is small, for larger batch sizes RA outperforms the standard i.i.d. scheme.
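A minimal sketch of RA batch construction (my own simplification, not the authors' sampler):

```python
import math
import random

def repeated_augmentation_batch(dataset, augment, B=512, m=3):
    """Fill a batch of B images with m augmented copies of ~B/m distinct images."""
    distinct = random.sample(range(len(dataset)), math.ceil(B / m))
    # Augmented versions of the same image are highly correlated,
    # unlike the standard i.i.d. sampling scheme in SGD.
    batch = [augment(dataset[i]) for i in distinct for _ in range(m)]
    return batch[:B]
```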
30. Fine-tuning at Different Resolution
• By default, and similarly to ViT, the authors train DeiT models at
resolution 224×224 and fine-tune at resolution 384×384.
• When fine-tuning, DeiT interpolates the positional embeddings. Bilinear
interpolation reduces the ℓ2-norm of a vector and causes a significant drop
in accuracy, so DeiT uses bicubic interpolation instead.
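A sketch of that interpolation step (the helper and its names are mine; for 224→384 with 16×16 patches, the 14×14 positional grid becomes 24×24, while the class-token embedding is kept as-is):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """pos_embed: (1, 1 + old_grid**2, D) -> (1, 1 + new_grid**2, D)."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)  # not bilinear
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)
```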
31. TrainingTime
• A typical training of 300 epochs takes 37 hours with 2 nodes, or 53
hours on a single node, for DeiT-B.
• DeiT-S and DeiT-Ti are trained in less than 3 days on 4 GPUs. Then,
optionally, the model is fine-tuned at a larger resolution. This takes 20
hours on a single node (8 GPUs) to produce a FixDeiT-B model at
resolution 384×384, which corresponds to 25 epochs.
• Since DeiT uses repeated augmentation with 3 repetitions, it only sees
one third of the images during a single epoch.
32. Conclusion
• DeiT models are image transformers that do not require a very large
amount of data to be trained, thanks to an improved training and
distillation procedure.
• Convnets have been optimized, both in terms of architecture and
optimization, for almost a decade, including through extensive
architecture search that is prone to overfitting. In contrast, DeiT only
tunes the data augmentation and regularization strategies pre-existing
for convnets.
• Transformers will rapidly become a method of choice considering
their lower memory footprint for a given accuracy.