An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale
Anonymous (ICLR 2021 under review)
Yonsei University Severance Hospital CCIDS
Choi Dongmin
Abstract
• Transformer

- standard architecture for NLP
• Convolutional Networks

- in vision, attention is applied together with CNNs, or replaces some of their components, while keeping their overall structure

• Transformer in Computer Vision

- a pure transformer can perform very well on image classification tasks
when applied directly to sequences of image patches

- achieves state-of-the-art results at substantially lower training cost when pre-trained on large datasets
Introduction
Vaswani et al. Attention Is All You Need. NIPS 2017
Transformer
BERT
Self-attention-based architecture
The dominant approach: pre-training on a large text corpus and then fine-tuning on a smaller task-specific dataset
Introduction
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020

Wang et al. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. ECCV 2020
Self-Attention in CV inspired by NLP
DETR
Axial-DeepLab
However, in large-scale image recognition, classic ResNet-like architectures are still state of the art
• Applying a Transformer Directly to Images

- with the fewest possible modifications

- provide the sequence of linear embeddings of the patches as an input

- image patches = tokens (words) in NLP
• Small Scale Training

- achieved accuracies below ResNets of comparable size

- Transformers lack some inductive biases inherent to CNNs (such as translation equivariance and locality)
• Large Scale Training

- large-scale training trumps (outweighs) inductive bias

- excellent results when pre-trained at sufficient scale and transferred
Related Works
Transformer
Vaswani et al. Attention Is All You Need. NIPS 2017

Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL 2019

Radford et al. Improving language understanding with unsupervised learning. Technical Report 2018
- Standard model in NLP tasks

- Consists only of attention modules, without RNNs

- Encoder-decoder

- Requires a large-scale dataset and high computational cost

- Pre-training and fine-tuning approaches: BERT & GPT
Method
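Image $x \in \mathbb{R}^{H \times W \times C}$ → a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$
Trainable linear projection maps $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$ → $x_p E \in \mathbb{R}^{N \times D}$
* Because the Transformer uses a constant width (model dimension $D$) through all of its layers
Learnable position embedding $E_{pos} \in \mathbb{R}^{(N+1) \times D}$
* to retain positional information
Figure labels: $z_0$ (embedded input sequence), $L$ (number of encoder layers)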
https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py#L99-L111
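A minimal NumPy sketch of the embedding step described above; it is not the linked vit-pytorch code. It assumes N = HW / P² patches, a prepended class token (which would account for the N + 1 rows of E_pos), and illustrative sizes (224 × 224 × 3 image, P = 16, D = 768). The function names `patchify` and `embed` are hypothetical, and random values stand in for learned parameters.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

def embed(image, patch_size, E, cls_token, E_pos):
    """Build z_0 = [class token; x_p E] + E_pos with shape (N + 1, D)."""
    x_p = patchify(image, patch_size)          # (N, P^2 * C)
    tokens = x_p @ E                           # (N, D) linear projection
    z0 = np.concatenate([cls_token, tokens])   # (N + 1, D), class token first
    return z0 + E_pos                          # add position embedding

# Illustrative sizes only (assumptions, not taken from the slides).
rng = np.random.default_rng(0)
H, W, C = 224, 224, 3
P, D = 16, 768
N = (H * W) // (P * P)                         # 196 patches

image = rng.standard_normal((H, W, C))
E = rng.standard_normal((P * P * C, D)) * 0.02     # stand-in for the learned projection
cls_token = np.zeros((1, D))                       # stand-in for the learned class token
E_pos = rng.standard_normal((N + 1, D)) * 0.02     # stand-in for the learned position embedding

z0 = embed(image, P, E, cls_token, E_pos)
print(z0.shape)  # (197, 768)
```

With P = 16 a 224 × 224 image becomes a sequence of only 196 patch tokens, which is what makes full self-attention over the image affordable.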
$z \in \mathbb{R}^{N \times D}$: input sequence
Attention weight $A_{ij}$: similarity between $q_i$ and $k_j$
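A single-head sketch of the attention weights just described, assuming the standard scaled dot-product form: A = softmax(q kᵀ / √d), with the output being the A-weighted sum of values. The projection matrices `W_q`, `W_k`, `W_v` and all sizes are hypothetical stand-ins; multi-head attention would run several such heads in parallel and concatenate their outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z, W_q, W_k, W_v):
    """Single-head self-attention over an input sequence z of shape (N, D)."""
    q, k, v = z @ W_q, z @ W_k, z @ W_v      # each (N, d_head)
    d_head = q.shape[-1]
    # A[i, j]: similarity between query i and key j, normalised over j
    A = softmax(q @ k.T / np.sqrt(d_head))   # (N, N)
    return A @ v, A                          # weighted sum of values, weights

# Tiny illustrative sizes (assumptions).
rng = np.random.default_rng(0)
N, D, d_head = 5, 8, 4
z = rng.standard_normal((N, D))
W_q, W_k, W_v = (rng.standard_normal((D, d_head)) for _ in range(3))

out, A = self_attention(z, W_q, W_k, W_v)
print(out.shape, A.shape)  # (5, 4) (5, 5)
```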
Method
Hybrid Architecture
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Flattened intermediate feature maps of a ResNet as the input sequence, like DETR
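A rough sketch of the hybrid input, assuming the CNN feature map is given as an (h, w, c) array: each spatial position becomes one token, which is then projected to the model dimension. The 14 × 14 × 1024 feature-map shape, D = 768, and the name `feature_map_to_sequence` are illustrative assumptions, and a random array stands in for real ResNet features.

```python
import numpy as np

def feature_map_to_sequence(feature_map, E):
    """Flatten an (h, w, c) feature map into h*w tokens and project them to D dims."""
    h, w, c = feature_map.shape
    tokens = feature_map.reshape(h * w, c)   # one token per spatial position, row-major
    return tokens @ E                        # (h*w, D)

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((14, 14, 1024))   # stand-in for a ResNet stage output
E = rng.standard_normal((1024, 768)) * 0.02         # stand-in for the learned projection

seq = feature_map_to_sequence(feature_map, E)
print(seq.shape)  # (196, 768)
```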
Method
Fine-tuning and Higher Resolution
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Remove the pre-trained prediction head and attach a zero-initialized $D \times K$ feedforward layer ($K$ = the number of downstream classes)
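A small sketch of what the zero-initialized head implies at the start of fine-tuning: every class receives the same logit, so the initial prediction is uniform over the K classes. The bias term, the softmax, the sizes (D = 768, K = 10), and the helper names are assumptions added for illustration.

```python
import numpy as np

def new_head(D, K):
    """Zero-initialised D x K classification head (the bias is an added assumption)."""
    return np.zeros((D, K)), np.zeros(K)

def classify(cls_repr, W, b):
    """Softmax over the head's logits for one D-dimensional representation."""
    logits = cls_repr @ W + b
    logits = logits - logits.max()           # numerical stability
    p = np.exp(logits)
    return p / p.sum()

D, K = 768, 10                               # illustrative sizes
W, b = new_head(D, K)
cls_repr = np.random.default_rng(0).standard_normal(D)   # stand-in for the encoder output

probs = classify(cls_repr, W, b)
print(probs)  # every entry equals 1 / K
```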
Experiments
• Datasets

< Pre-training >

- ILSVRC-2012 ImageNet dataset : 1k classes / 1.3M images

- ImageNet-21k : 21k classes / 14M images

- JFT : 18k classes / 303M images

< Downstream (Fine-tuning) >

- ImageNet, ImageNet ReaL, CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102, VTAB
• Model Variants, e.g. ViT-L/16 = the “Large” variant with 16 × 16 input patch size
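To make the naming convention concrete, the sketch below parses a variant name and computes the resulting number of patch tokens. The helpers `parse_variant` and `sequence_length` are hypothetical, and the 224 × 224 input resolution is an assumption chosen for illustration.

```python
def parse_variant(name):
    """Parse a name such as 'ViT-L/16' into (size_letter, patch_size)."""
    model, patch = name.split("/")
    size_letter = model.split("-")[1]
    return size_letter, int(patch)

def sequence_length(image_size, patch_size):
    """Number of patches for a square image: (image_size / patch_size) ** 2."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    return (image_size // patch_size) ** 2

size, P = parse_variant("ViT-L/16")
n_patches = sequence_length(224, P)
print(size, P, n_patches)  # L 16 196
```

The sequence length scales inversely with the square of the patch size, so a smaller patch size means a longer sequence and a more expensive model.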
Experiments
• Training & Fine-tuning

< Pre-training >

- Adam with β1 = 0.9, β2 = 0.999

- Batch size 4,096

- Weight decay 0.1 (high weight decay is useful for transfer models)

- Linear learning rate warmup and decay

< Fine-tuning >

- SGD with momentum, batch size 512

• Metrics

- Few-shot (for fast on-the-fly evaluation)

- Fine-tuning accuracy
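The linear learning rate warmup and decay used in pre-training could look roughly like the sketch below. The slides give no base learning rate, warmup length, or step count, so the values here are placeholders, and `learning_rate` is a hypothetical helper.

```python
def learning_rate(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Decay linearly from base_lr (at step == warmup_steps) to 0 (at total_steps).
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)

# Placeholder values (assumptions, not from the slides).
TOTAL_STEPS, WARMUP_STEPS, BASE_LR = 100, 10, 1e-3

schedule = [learning_rate(s, TOTAL_STEPS, WARMUP_STEPS, BASE_LR)
            for s in range(TOTAL_STEPS + 1)]
peak_step = max(range(len(schedule)), key=schedule.__getitem__)
print(peak_step, schedule[peak_step])
```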
Experiments
• Comparison to State of the Art
Kolesnikov et al. Big Transfer (BiT): General Visual Representation Learning. ECCV 2020

Xie et al. Self-training with noisy student improves imagenet classification. CVPR 2020
* BiT-L : Big Transfer, which performs supervised transfer learning with large ResNets

* Noisy Student : a large EfficientNet trained using semi-supervised learning
Experiments
• Pre-training Data Requirements
[Figure annotations: "Larger Dataset" (appears twice)]
Experiments
• Scaling Study
Experiments
• Inspecting Vision Transformer
The principal components of the learned patch-embedding filters resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch

The mean attention distance of a head is analogous to receptive field size in CNNs
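The receptive-field analogy can be made concrete with an attention distance: for each query patch, the distance to every other patch weighted by the attention it pays to it, then averaged over queries. The sketch below is one plausible way to compute this for a single head on a square patch grid; it ignores any class token, and the function name and the tiny 2 × 2 grid are assumptions for illustration.

```python
import numpy as np

def mean_attention_distance(A, grid_size, patch_size):
    """Mean attention-weighted distance (in pixels) between query and key patches.

    A is an (N, N) attention matrix over N = grid_size**2 patches; rows sum to 1.
    """
    rows, cols = np.divmod(np.arange(grid_size ** 2), grid_size)
    coords = np.stack([rows, cols], axis=1) * patch_size   # (N, 2) patch positions
    diff = coords[:, None, :] - coords[None, :, :]         # (N, N, 2) pairwise offsets
    dist = np.sqrt((diff ** 2).sum(-1))                    # (N, N) Euclidean distances
    return float((A * dist).sum(axis=1).mean())

grid, P = 2, 1                                  # tiny illustrative grid
N = grid * grid
local_head = np.eye(N)                          # each patch attends only to itself
global_head = np.full((N, N), 1.0 / N)          # each patch attends to all patches equally

print(mean_attention_distance(local_head, grid, P))
print(mean_attention_distance(global_head, grid, P))
```

A head that attends only locally yields a small distance, like a small receptive field, while a head that attends uniformly over the image yields a large one.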
Conclusion
• Application of Transformers to Image Recognition

- no image-specific inductive biases in the architecture

- interpret an image as a sequence of patches and process it with a standard Transformer encoder

- simple, yet scalable, strategy works

- matches or exceeds the state of the art on many image classification datasets while being relatively cheap to pre-train

• Many Challenges Remain

- other computer vision tasks, such as detection and segmentation

- further scaling ViT
Q&A
• ViT for Segmentation
• Fine-tuning on Grayscale Dataset
Thank you

Architecting Cloud Native Applications
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 

ViT (Vision Transformer) Review [CDM]

  • 1. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Anonymous (ICLR 2021 under review) Yonsei University Severance Hospital CCIDS Choi Dongmin
  • 2. Abstract
    • Transformer
      - standard architecture for NLP
    • Convolutional Networks
      - attention is applied while keeping their overall structure
    • Transformer in Computer Vision
      - a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches
      - achieves S.O.T.A at small computational cost when pre-trained on a large dataset
  • 3. Introduction
    Transformer, BERT: self-attention based architectures
    The dominant approach: pre-training on a large text corpus and then fine-tuning on a smaller task-specific dataset
    Vaswani et al. Attention Is All You Need. NIPS 2017
  • 4. Introduction
    Self-attention in CV inspired by NLP: DETR, Axial-DeepLab
    However, classic ResNet-like architectures are still S.O.T.A
    Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
    Wang et al. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. ECCV 2020
  • 5. Introduction
    • Applying a Transformer Directly to Images
      - with the fewest possible modifications
      - provide the sequence of linear embeddings of the patches as an input
      - image patches = tokens (words) in NLP
    • Small Scale Training
      - achieved accuracies below ResNets of comparable size
      - Transformers lack some inductive biases inherent to CNNs (such as translation equivariance and locality)
    • Large Scale Training
      - trumps (surpasses) inductive bias
      - excellent results when pre-trained at sufficient scale and transferred
  • 6. Related Works
    Transformer
      - Standard model in NLP tasks
      - Consists only of attention modules, not using RNN
      - Encoder-decoder
      - Requires a large-scale dataset and high computational cost
      - Pre-training and fine-tuning approaches: BERT & GPT
    Vaswani et al. Attention Is All You Need. NIPS 2017
    Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL 2019
    Radford et al. Improving language understanding with unsupervised learning. Technical Report 2018
  • 8. Method
    Image → a sequence of flattened 2D patches: $x \in \mathbb{R}^{H \times W \times C} \rightarrow x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$
    Trainable linear projection: $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)} \rightarrow x_p E \in \mathbb{R}^{N \times D}$
      * because the Transformer uses a constant width, the model dimension $D$, through all of its layers
    Learnable position embedding: $E_{pos} \in \mathbb{R}^{(N+1) \times D}$
      * to retain positional information
    $z_L^0$: https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py#L99-L111
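The shape bookkeeping on this slide can be checked with a small sketch. It uses nested Python lists instead of tensors so that it runs without dependencies. The function names `patchify` and `project` are illustrative and do not come from the linked repository, and the row-major patch order is an assumption. The class token and the position embedding would be added after the projection and are left out here.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into N flattened patches.

    Returns N = (H // P) * (W // P) rows, each of length P * P * C.
    Patches are visited row by row, left to right (an assumed ordering).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


def project(patches, embedding):
    """Multiply the N x (P^2 C) patch matrix by a (P^2 C) x D matrix E."""
    columns = list(zip(*embedding))  # the D columns of E
    return [
        [sum(p * e for p, e in zip(patch, column)) for column in columns]
        for patch in patches
    ]


# Toy example: a 4 x 4 single-channel image, P = 2, so N = 4 and P^2 * C = 4.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
x_p = patchify(image, 2)
E = [[1, 0], [0, 1], [1, 0], [0, 1]]  # hypothetical projection with D = 2
tokens = project(x_p, E)
print(len(x_p), len(x_p[0]), len(tokens[0]))
```

Whatever the patch size, each token leaves the projection with width D, which matches the constant-width note on the slide.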
  • 14. Method
    Hybrid Architecture
      - flattened intermediate feature maps of a ResNet as the input sequence, like DETR
    Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
  • 15. Method
    Fine-tuning and Higher Resolution
      - remove the pre-trained prediction head and attach a zero-initialized $D \times K$ feedforward layer ($K$ = the number of downstream classes)
    Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
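To make the effect of the zero-initialized head concrete, here is a minimal sketch in plain Python. It assumes the head is a single bias-free D × K matrix applied to one feature vector. The helper names are hypothetical. With all-zero weights every logit is zero, so the initial prediction is uniform over the K downstream classes regardless of the pre-trained features.

```python
import math


def make_zero_head(model_dim, num_classes):
    """Return a D x K weight matrix filled with zeros."""
    return [[0.0] * num_classes for _ in range(model_dim)]


def head_logits(features, weights):
    """Compute features (length D) times weights (D x K) -> K logits."""
    num_classes = len(weights[0])
    return [
        sum(f * row[k] for f, row in zip(features, weights))
        for k in range(num_classes)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


# Hypothetical sizes: D = 4 features, K = 5 downstream classes.
head = make_zero_head(4, 5)
probs = softmax(head_logits([0.3, -1.2, 2.0, 0.7], head))
print(probs)
```

Because the new head starts from a neutral output, the first fine-tuning gradients are driven by the downstream labels rather than by arbitrary initial weights.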
  • 16. Experiments
    • Datasets
      < Pre-training >
      - ILSVRC-2012 ImageNet dataset: 1k classes / 1.3M images
      - ImageNet-21k: 21k classes / 14M images
      - JFT: 18k classes / 303M images
      < Downstream (Fine-tuning) >
      - ImageNet, ImageNet ReaL, CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102, VTAB
    • Model Variants
      - e.g. ViT-L/16 = the “Large” variant with a 16 × 16 input patch size
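The variant naming ties directly to the sequence length the Transformer has to process. The sketch below parses a name such as ViT-L/16 and counts tokens for a square input. The 224-pixel and 384-pixel resolutions are assumptions chosen for illustration and are not stated on the slide. The extra token follows from the (N+1) × D position embedding in the Method slide.

```python
def parse_variant(name):
    """Split a name like 'ViT-L/16' into the size letter and the patch size."""
    model, patch = name.split("/")
    size = model.split("-")[1]
    return size, int(patch)


def sequence_length(image_size, patch_size, class_token=True):
    """Number of tokens for a square image: N = (image_size / P)^2, plus one."""
    if image_size % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    num_patches = (image_size // patch_size) ** 2
    return num_patches + 1 if class_token else num_patches


size, patch = parse_variant("ViT-L/16")
# 224 and 384 are assumed input resolutions, used here only as examples.
print(size, patch, sequence_length(224, patch), sequence_length(384, patch))
```

Halving the patch size quadruples the number of patches, so smaller patches mean longer sequences and more compute.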
  • 17. Experiments
    • Training & Fine-tuning
      < Pre-training >
      - Adam with β1 = 0.9, β2 = 0.999
      - Batch size 4,096
      - Weight decay 0.1 (high weight decay is useful for transfer models)
      - Linear learning rate warmup and decay
      < Fine-tuning >
      - SGD with momentum, batch size 512
    • Metrics
      - Few-shot (for fast on-the-fly evaluation)
      - Fine-tuning accuracy
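A linear warmup followed by a linear decay can be written as a single function of the step index. The sketch below is one plausible reading of that bullet. The base learning rate, warmup length and total step count are hypothetical values, since the slide does not give them.

```python
def learning_rate(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if warmup_steps <= 0 or total_steps <= warmup_steps:
        raise ValueError("need 0 < warmup_steps < total_steps")
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Hypothetical schedule: peak rate 1e-3, 10 warmup steps, 100 steps in total.
for step in (0, 5, 10, 55, 100):
    print(step, learning_rate(step, 1e-3, 10, 100))
```

The rate peaks exactly at the end of warmup and reaches zero at the final step.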
  • 18. Experiments
    • Comparison to State of the Art
      * BiT-L: Big Transfer, which performs supervised transfer learning with large ResNets
      * Noisy Student: a large EfficientNet trained using semi-supervised learning
    Kolesnikov et al. Big Transfer (BiT): General Visual Representation Learning. ECCV 2020
    Xie et al. Self-training with noisy student improves imagenet classification. CVPR 2020
  • 19. Experiments • Comparison to State of the Art
  • 20. Experiments
    • Pre-training Data Requirements
      [Two figures, each annotated “Larger Dataset”]
  • 22. Experiments
    • Inspecting Vision Transformer
      - The components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch
      - analogous to receptive field size in CNNs
  • 23. Conclusion
    • Application of Transformers to Image Recognition
      - no image-specific inductive biases in the architecture
      - interpret an image as a sequence of patches and process it with a standard Transformer encoder
      - this simple, yet scalable, strategy works
      - matches or exceeds the S.O.T.A while being cheap to pre-train
    • Many Challenges Remain
      - other computer vision tasks, such as detection and segmentation
      - further scaling ViT
  • 24. Q&A
    • ViT for Segmentation
    • Fine-tuning on Grayscale Dataset