Hello, this is the 297th paper review from PR-12, the TensorFlow Korea paper-reading group.
Before we know it, only three papers remain until the end of PR-12 season 3.
Once season 3 ends, recruitment of new members for season 4 will begin right away. Your interest and applications are very welcome!
(The recruitment notice will be posted in the Facebook TensorFlow Korea group.)
The paper I reviewed today is Facebook's "Training data-efficient image transformers & distillation through attention".
Since Google's ViT paper, interest in computer vision algorithms that use only attention, without any convolution at all, has been higher than ever.
The DeiT model proposed in this paper uses the same architecture as ViT, but whereas ViT underperformed when trained on ImageNet data alone,
DeiT achieves better performance than EfficientNet using ImageNet data only, through improved training methods and a new knowledge distillation method.
Are CNNs really going to fade away now? Will attention conquer computer vision as well?
Personally, I am convinced that attention-based CV papers will pour out for a while, and I believe surprising things can happen here.
CNNs have advanced through a decade of intensive research, whereas transformers have only recently been applied to CV, so my expectations are all the higher.
Since attention is the model form with the least inductive bias, I think it can produce even more surprising results.
OpenAI's recently released DALL-E is a representative example. If you are curious about yet another transformation by the Transformer, please see the video below.
Video link: https://youtu.be/DjEvzeiWBTo
Paper link: https://arxiv.org/abs/2012.12877
4. Related Work – Image Classification
• Image classification is so core to computer vision that it is often used
as a benchmark to measure progress in image understanding.
• Since 2012’s AlexNet, convnets have dominated this benchmark and
have become the de facto standard.
• Despite several attempts to use transformers for image classification,
until now their performance has been inferior to that of convnets.
5. Related Work – Image Classification
• Recently, Vision Transformers (ViT) closed the gap with the state of
the art on ImageNet without using any convolution. This
performance is remarkable since convnet methods for image
classification have benefited from years of tuning and optimization
• Nevertheless, according to this study, a pre-training phase on a large
volume of curated data is required for the learned transformer to be
effective.
6. Related Work – The Transformer Architecture
• Transformers are currently the reference model for all natural
language processing (NLP) tasks.
• Many improvements of convnets for image classification are inspired
by transformers.
<Squeeze-and-Excitation Network>
7. Related Work – Knowledge Distillation
• KD refers to the training paradigm in which a student model leverages
“soft” labels coming from a strong teacher network.
• The teacher’s supervision takes into account the effects of the data
augmentation, which sometimes causes a misalignment between the real
label and the image.
• KD can transfer inductive biases to a student model in a soft way, by
using a teacher model in which they are incorporated in a hard way.
[Figure (from Wonpyo Park's slide @ Naver D2): random crop & resize of an image labeled "cat" can produce a crop whose true label is unclear ("???")]
9. Attention & Inductive Biases
• Fewer inductive biases than convolution and dense layers
→ Requires more data than the other layer types
Convolution layer
- Locally connected
- Same weights for all inputs
Dense layer
- Fully connected
- Same weights for all inputs
Attention layer
- Fully connected
- Different weights for all inputs
[Diagram: inputs x1–x4 under each of the three layer types; for the attention layer the mixing weights 𝜓(q2·k1), 𝜓(q2·k2), 𝜓(q2·k3), 𝜓(q2·k4) are computed from the inputs]
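To make the contrast concrete, here is a minimal NumPy sketch (all names are mine, not from the paper) of why an attention layer has "different weights for all inputs": the mixing weights 𝜓(q·k) are recomputed from each input, whereas conv and dense layers reuse the same trained weights for every input.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """x: (N, d) token embeddings; Wq/Wk/Wv: (d, d) learned projections."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # The mixing weights psi(q_i . k_j) depend on the input x itself,
    # so two different inputs are combined with two different weightings.
    weights = softmax(q @ k.T / np.sqrt(x.shape[-1]))  # (N, N)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))                          # four tokens x1..x4
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
out = self_attention(x, Wq, Wk, Wv)                  # (4, d)
```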
10. Visual Transformer – Same Architecture as ViT
• ViT processes input images as if they were a
sequence of input tokens.
• The fixed-size input RGB image is decomposed
into a batch of N patches of a fixed size of 16 x 16
pixels (N = 14 x 14).
• The transformer block cannot consider their
relative position, so positional embeddings are
added.
• The class token is a trainable vector appended to
the patch tokens before the first layer, that goes
through the transformer layers, and is then
projected with a linear layer to predict the class.
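A rough PyTorch sketch of this input pipeline (a hypothetical minimal module, not the authors' code; expressing patch extraction plus linear projection as one strided convolution is a common implementation trick that I assume here):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """224x224 input, 16x16 patches -> N = 14x14 = 196 patch tokens."""
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2              # 196
        # Patch extraction + linear projection as one strided conv.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))    # trainable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                  # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)        # (B, 196, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # (B, 1, dim)
        x = torch.cat([cls, x], dim=1)                     # class token + patch tokens
        return x + self.pos_embed                          # add positional embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))         # (2, 197, 768)
```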
12. Visual Transformer – Same Architecture as ViT
• Fixing the positional encoding across resolutions
▪ It is desirable to use a lower training resolution and fine-tune the network at
the larger resolution.
▪ This speeds up the full training and improves the accuracy under prevailing
data augmentation schemes.
▪ When increasing the resolution of an input image, the patch size does not
change; therefore, the number of input patches (N) changes.
▪ Interpolation of the positional encodings is needed when changing the resolution.
13. Distillation Through Attention – Soft Distillation
• Soft distillation minimizes the Kullback-Leibler divergence between
the softmax of the teacher and the softmax of the student model.
• Let 𝑍𝑡 be the logits of the teacher model, 𝑍𝑠 the logits of the student
model. We denote by 𝜏 the temperature for the distillation, 𝜆 the
coefficient balancing the Kullback–Leibler divergence loss (KL) and
the cross-entropy (ℒ 𝐶𝐸) on ground truth labels 𝑦, and 𝜓 the softmax
function.
$\mathcal{L}_{\text{global}} = (1-\lambda)\,\mathcal{L}_{\text{CE}}(\psi(Z_s), y) + \lambda\tau^{2}\,\mathrm{KL}\big(\psi(Z_s/\tau),\, \psi(Z_t/\tau)\big)$
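A minimal PyTorch sketch of this loss (the function name and the defaults for τ and λ are mine, chosen for illustration; the KL term is written in the usual distillation form, with the student's log-probabilities against the teacher's probabilities):

```python
import torch.nn.functional as F

def soft_distillation_loss(Z_s, Z_t, y, tau=3.0, lam=0.1):
    """Z_s, Z_t: (B, C) student/teacher logits; y: (B,) ground-truth labels."""
    ce = F.cross_entropy(Z_s, y)                        # L_CE(psi(Z_s), y)
    kl = F.kl_div(F.log_softmax(Z_s / tau, dim=1),      # psi(Z_s / tau)
                  F.softmax(Z_t / tau, dim=1),          # psi(Z_t / tau)
                  reduction="batchmean")
    return (1 - lam) * ce + lam * tau ** 2 * kl
```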
14. Distillation Through Attention – Hard Distillation
• Hard-label distillation is a variant distillation which takes the hard
decision of the teacher as a true label.
• Let $y_t = \operatorname{argmax}_c Z_t(c)$ be the hard decision of the teacher; then
$\mathcal{L}_{\text{global}}^{\text{hardDistill}} = \tfrac{1}{2}\,\mathcal{L}_{\text{CE}}(\psi(Z_s), y) + \tfrac{1}{2}\,\mathcal{L}_{\text{CE}}(\psi(Z_s), y_t)$
• Note that the hard labels can also be converted into soft labels with
label smoothing, where the true label is considered to have a
probability of 1 − 𝜖 and the remaining 𝜖 is shared across the
remaining classes. The authors fix 𝜖 = 0.1.
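A matching sketch of the hard-label objective (again with my own names; PyTorch's built-in label_smoothing argument, available in recent versions, stands in for the 𝜖 = 0.1 smoothing described above):

```python
import torch.nn.functional as F

def hard_distillation_loss(Z_s, Z_t, y, eps=0.1):
    """Z_s, Z_t: (B, C) student/teacher logits; y: (B,) ground-truth labels."""
    y_t = Z_t.argmax(dim=1)                                       # teacher's hard decision
    ce_true = F.cross_entropy(Z_s, y, label_smoothing=eps)        # L_CE(psi(Z_s), y)
    ce_teacher = F.cross_entropy(Z_s, y_t, label_smoothing=eps)   # L_CE(psi(Z_s), y_t)
    return 0.5 * ce_true + 0.5 * ce_teacher
```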
16. Distillation Through Attention – Distillation Token
• Interestingly, the learned class and distillation tokens converge
towards different vectors: the average cosine similarity between
these tokens equals 0.06.
• As the class and distillation embeddings are computed at each layer,
they gradually become more similar through the network, all the way
through the last layer at which their similarity is high (cos=0.93), but
still lower than 1.
• The authors verified that the distillation token adds something to the
model, compared with simply adding an additional class token
associated with the same target label instead of a teacher pseudo-label.
• Even when these two class tokens are initialized randomly and
independently, during training they converge towards the same vector
(cos = 0.999).
17. Distillation Through Attention – Joint Classifiers
• At test time, both the class and the distillation embeddings produced
by the transformer are associated with linear classifiers and are able to
infer the image label.
• It is also possible to add the softmax outputs of the two classifiers to
estimate the label in a late-fusion fashion, as sketched below.
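In code, the late fusion is just a sum of the two softmax outputs (a sketch; tensor names are mine):

```python
def late_fusion_predict(cls_logits, dist_logits):
    """Add the softmax outputs of the class and distillation classifiers."""
    probs = cls_logits.softmax(dim=1) + dist_logits.softmax(dim=1)
    return probs.argmax(dim=1)
```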
18. Experiments – Transformer Models
• DeiT-B: reference model (same as ViT-B)
• DeiT-B↑384: DeiT-B fine-tuned at a larger resolution
• DeiT⚗: DeiT with distillation (using the distillation token)
• DeiT-S (Small), DeiT-Ti (Tiny): smaller models of DeiT
19. Experiment - Distillation
• A convnet teacher gives better performance than a transformer teacher.
• The fact that the convnet is a better teacher is probably due to the inductive
bias inherited by the transformer through distillation.
20. Experiment - Distillation
• Hard distillation significantly outperforms soft distillation for
transformers, even when using only a class token.
• The classifier on the two tokens is significantly better than the
independent class and distillation classifiers, which by themselves
already outperform the distillation baseline.
21. Experiment - Distillation
• Does the distilled model inherit an existing inductive bias that would facilitate the training?
• The table below reports the fraction of samples classified differently for all
classifier pairs, i.e., the rate of different decisions.
• The distilled model is more correlated to the convnet than a transformer
learned from scratch.
22. Experiment - Efficiency vs accuracy
• DeiT is slightly below EfficientNet, which
shows that DeiT has almost closed the gap between
visual transformers and convnets when
training with ImageNet only.
• These results are a major improvement (+6.3%
top-1 in a comparable setting) over previous
ViT models trained on ImageNet-1k only.
• Furthermore, when DeiT benefits from the
distillation from a relatively weaker RegNetY
to produce DeiT⚗, it outperforms
EfficientNet.
23. Experiment - Efficiency vs accuracy
• Compared to EfficientNet, one can see
that, for the same number of
parameters, the convnet variants are
much slower. This is because large
matrix multiplications offer more
opportunity for hardware optimization
than small convolutions.
25. Training Details & Ablation
• This study is intended to be the transformer analogue of the bag of
tricks for convnets.
• Initialization and Hyper-parameters
▪ Transformers are relatively sensitive to initialization. The authors follow the
recommendation of Hanin and Rolnick, "The effect of initialization and
architecture", NeurIPS 2018.
26. Training Details & Ablation
• Data Augmentation
▪ Compared to models that integrate more priors (such as convolutions),
transformers require a larger amount of data.
▪ Thus, in order to train with datasets of the same size, we rely on extensive
data augmentation.
▪ Almost all the data-augmentation methods that the authors evaluate prove to
be useful.
• Regularization & Optimizers
▪ Transformers are sensitive to the setting of optimization hyper-parameters.
▪ The learning rate is scaled with the batch size as $\text{lr}_{\text{scaled}} = \frac{\text{lr}}{512} \times \text{batchsize}$; stochastic depth, Mixup, CutMix, and Repeated Augmentation (RA) are applied.
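For example, assuming a base learning rate of 5·10⁻⁴, a batch size of 1024 gives $\text{lr}_{\text{scaled}} = \frac{5\cdot 10^{-4}}{512} \times 1024 = 10^{-3}$.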
28. Batch Augmentation & Repeated Augmentation
• Batch Augmentation
▪ In an SGD optimization setting, including multiple data-augmented instances
of the same image in one optimization batch, rather than having only
distinct images in the batch, significantly enhances the effect of data
augmentation and improves the generalization of the network.
• Repeated Augmentation
▪ In RA we form a batch of ℬ images by sampling ℬ/𝑚 different images from
the dataset and transforming them up to 𝑚 times with a set of data
augmentations to fill the batch.
▪ The key difference with the standard sampling scheme in SGD is that
samples are not independent, as augmented versions of the same image are
highly correlated. While this strategy reduces the performance if the batch
size is small, for larger batch sizes RA outperforms the standard i.i.d. scheme.
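A minimal sketch of RA batch construction (my own simplification, not the authors' sampler):

```python
import math
import random

def repeated_augmentation_batch(dataset, augment, B=512, m=3):
    """Fill a batch of B images with m augmented copies of ~B/m distinct images."""
    distinct = random.sample(range(len(dataset)), math.ceil(B / m))
    # Augmented versions of the same image are highly correlated,
    # unlike the standard i.i.d. sampling scheme in SGD.
    batch = [augment(dataset[i]) for i in distinct for _ in range(m)]
    return batch[:B]
```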
30. Fine-tuning at Different Resolution
• By default, and similarly to ViT, the authors train DeiT models at
resolution 224×224 and fine-tune at resolution 384×384.
• When fine-tuning, DeiT interpolates the positional embeddings. Bilinear
interpolation reduces the ℓ2-norm of a vector and causes a significant drop
in accuracy, so DeiT uses bicubic interpolation instead.
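A sketch of that interpolation step (the helper and its names are mine; for 224→384 with 16×16 patches, the 14×14 positional grid becomes 24×24, while the class-token embedding is kept as-is):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """pos_embed: (1, 1 + old_grid**2, D) -> (1, 1 + new_grid**2, D)."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)  # not bilinear
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)
```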
31. TrainingTime
• A typical training of 300 epochs takes 37 hours with 2 nodes, or 53
hours on a single node, for DeiT-B.
• DeiT-S and DeiT-Ti are trained in less than 3 days on 4 GPUs. Then,
optionally, the model is fine-tuned at a larger resolution. This takes 20
hours on a single node (8 GPUs) to produce a FixDeiT-B model at
resolution 384×384, which corresponds to 25 epochs.
• Since DeiT uses repeated augmentation with 3 repetitions, it only sees
one third of the images during a single epoch.
32. Conclusion
• DeiT models are image transformers that do not require a very large
amount of data to be trained, thanks to an improved training and
distillation procedure.
• Convnets have been optimized, both in terms of architecture and
optimization, for almost a decade, including through extensive
architecture search that is prone to overfitting. In contrast, DeiT only
tunes the data augmentation and regularization strategies pre-existing
for convnets.
• Transformers will rapidly become a method of choice considering
their lower memory footprint for a given accuracy.