Transformer in Vision
Sangmin Woo
2020.10.29
[2018 ICML] Image Transformer
[2019 CVPR] Video Action Transformer Network
[2020 ECCV] End-to-End Object Detection with Transformers
[2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Contents
[2018 ICML] Image Transformer
Niki Parmar1 Ashish Vaswani1 Jakob Uszkoreit1 Łukasz Kaiser1 Noam Shazeer1 Alexander Ku2,3 Dustin Tran4
1Google Brain, Mountain View, USA
2Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
3Work done during an internship at Google Brain
4Google AI, Mountain View, USA.
[2019 CVPR] Video Action Transformer Network
Rohit Girdhar1 João Carreira2 Carl Doersch2 Andrew Zisserman2,3
1Carnegie Mellon University 2DeepMind 3University of Oxford
[2020 ECCV] End-to-End Object Detection with Transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko
Facebook AI
[2021 ICLR under review] An Image is Worth 16x16 Words: Transformers for Image
Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Google Research, Brain Team
Background
 Attention is all you need [2017 NIPS]
• The main idea of the original architecture is to compute self-attention by
comparing a feature to all other features in the sequence.
• Features are first mapped to query (Q) and memory (key and value, K &
V) embeddings using linear projections.
• The output for the query is computed as an attention weighted sum of
values (V), with the attention weights obtained from the product of the
query (Q) with keys (K).
• In practice, query (Q) is the word being translated, and keys (K) and
values (V) are linear projections of the input sequence and the output
sequence generated so far.
• A positional encoding is also added to these representations in order to
incorporate positional information which is lost in this setup.
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
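The bullets above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a reference implementation: the function names, the toy sizes (n = 5, d = 8) and the random projection matrices are assumptions made for the example. The scaling by the square root of the key dimension and the sinusoidal form of the positional encoding follow [1].

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Sinusoidal positional encoding for n positions and an even width d."""
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)   # even channels
    pe[:, 1::2] = np.cos(angles)   # odd channels
    return pe

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over x of shape (n, d)."""
    q = x @ w_q                                # queries, (n, d_k)
    k = x @ w_k                                # keys,    (n, d_k)
    v = x @ w_v                                # values,  (n, d_v)
    scores = q @ k.T / np.sqrt(k.shape[-1])    # compare every feature with all others
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights                # attention-weighted sum of values

rng = np.random.default_rng(0)
n, d = 5, 8                                    # toy sequence length and width (assumed)
pe = sinusoidal_positions(n, d)
x = rng.normal(size=(n, d)) + pe               # add positional information to the features
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` says how much one position draws from every other position, which is the "comparing a feature to all other features" step.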
Image Transformer [2018 ICML]
 Generative models (Image Generation, Super-Resolution, Image
Completion)
Image Transformer [2018 ICML]
 Pixel-RNN / Pixel-CNN (van den Oord et al., 2016)
• Straightforward
• Tractable likelihood
• Simple and stable
[2] van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. ICML, 2016.
[3] van den Oord, Aäron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. NIPS, 2016.
[Figure: Pixel-RNN and Pixel-CNN]
Image Transformer [2018 ICML]
 Motivation
• Pixel-RNN and Pixel-CNN turn image generation into a sequence modeling
problem by applying an RNN or a CNN to predict each next pixel given all
previously generated pixels.
• RNN is computationally heavy.
• CNN is parallelizable.
• CNN has a limited receptive field → long-range dependency problem →
stacking more layers enlarges the receptive field, but is expensive.
• RNN has a virtually unlimited receptive field.
• Self-attention can achieve a better balance in the trade-off between the
virtually unlimited receptive field of the necessarily sequential PixelRNN
and the limited receptive field of the much more parallelizable PixelCNN
and its various extensions.
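The receptive-field argument can be made concrete with a small calculation. The sketch below assumes plain stride-1, undilated convolutions, for which the receptive field grows linearly with depth; the width 32 and kernel size 3 are illustrative numbers, not values taken from the slides.

```python
def receptive_field(num_layers, kernel_size):
    """Receptive field width of stacked stride-1, undilated convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def layers_needed(width, kernel_size):
    """Smallest number of such layers whose receptive field covers `width` pixels."""
    layers = 0
    while receptive_field(layers, kernel_size) < width:
        layers += 1
    return layers

print(receptive_field(5, 3))   # 11
print(layers_needed(32, 3))    # 16
```

A single self-attention layer, by contrast, can reach every position in its memory, which is why it sits between the two extremes described above.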
Image Transformer [2018 ICML]
 Image Completion & Super-resolution
Image Transformer [2018 ICML]
 Image Transformer
• 𝑞: single channel of one pixel (query)
• 𝑚_1, 𝑚_2, 𝑚_3: memory of previously generated pixels (key)
• 𝑝_𝑞, 𝑝_1, 𝑝_2, 𝑝_3: position encodings
• 𝑐𝑚𝑝: first embed query and key, then apply dot product.
Image Transformer [2018 ICML]
 Local Self-Attention
• The scalability issue lies in the self-attention mechanism.
• Restrict the positions in the memory matrix M to a local neighborhood
around the query position.
• 1D vs. 2D attention
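One way to picture the restriction is as a mask over the attention scores. The sketch below is a simplified 1D variant in which each query sees only the last `window` positions generated up to and including its own; the blocked 1D and 2D schemes referred to above are more elaborate, and the names `local_causal_mask`, `masked_attention` and the toy sizes are assumptions made for this example.

```python
import numpy as np

def local_causal_mask(n, window):
    """mask[i, j] is True when query i may attend to memory position j.

    Position j is visible if it is not in the future (j <= i) and lies
    within the last `window` positions.
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

def masked_attention(q, k, v, mask):
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = np.where(mask, scores, -np.inf)           # hide positions outside the neighborhood
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                           # masked entries become exactly 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

n, d, window = 8, 4, 3                                 # toy sizes (assumed)
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
mask = local_causal_mask(n, window)
out, weights = masked_attention(q, k, v, mask)
```

With the mask in place, each query touches at most `window` memory positions, so the number of useful scores grows like n × window rather than n × n.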
Image Transformer [2018 ICML]
 Image Generation
Image Transformer [2018 ICML]
 Super-resolution
Video Action Transformer Network [2019 CVPR]
 Action Recognition
[4] Sigurdsson, Gunnar A., et al. Hollywood in homes: Crowdsourcing data collection for activity understanding. ECCV, 2016.
Video Action Transformer Network [2019 CVPR]
 Action Recognition & Localization
Video Action Transformer Network [2019 CVPR]
 Action Transformer
• Action Transformer unit takes as input the video feature representation
and the box proposal from RPN and maps it into query (𝑄) and memory
(𝐾&𝑉) features.
• Query (𝑄): The person being classified.
• Memory (𝐾&𝑉): Clip around the person.
• The unit processes the query (𝑄) and memory (𝐾&𝑉) to output an updated
query vector (𝑄∗).
• The intuition is that the self-attention will add context from other people
and objects in the clip to the query (𝑄) vector, to aid with the subsequent
classification.
• This unit can be stacked in multiple heads and layers, by concatenating
the output from the multiple heads at a given layer, and using the
concatenated feature as the next query.
• This updated query (𝑄∗) is then used to again attend to context features in
the following layer.
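The query/memory flow described above can be sketched as follows. This is a simplified single-head version under stated assumptions: random vectors stand in for the RoI-pooled person feature and the clip features, the updated query is formed by a plain residual add, and any normalization or feed-forward steps are left out. All names and sizes are hypothetical.

```python
import numpy as np

def action_transformer_unit(person_feat, clip_feats, w_q, w_k, w_v):
    """One simplified head: the person feature queries the surrounding clip."""
    q = person_feat @ w_q                    # (d,) query for the person being classified
    k = clip_feats @ w_k                     # (m, d) keys from the clip around the person
    v = clip_feats @ w_v                     # (m, d) values from the same clip
    scores = k @ q / np.sqrt(q.shape[-1])    # (m,) similarity of the query to each location
    scores = scores - scores.max()           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ v                    # (d,) context gathered from the clip
    return q + context, weights              # updated query (residual add is an assumption)

rng = np.random.default_rng(0)
d, m, num_layers = 16, 4 * 7 * 7, 2          # m: flattened clip locations (illustrative)
person_feat = rng.normal(size=d)             # stands in for the box feature from the RPN proposal
clip_feats = rng.normal(size=(m, d))         # stands in for the video feature representation

query = person_feat
for _ in range(num_layers):                  # the updated query attends to the clip again
    w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    query, weights = action_transformer_unit(query, clip_feats, w_q, w_k, w_v)
```

The final `query` would then be passed to a classifier; `weights` shows which clip locations contributed context for this person.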
Video Action Transformer Network [2019 CVPR]
 Action Recognition
Video Action Transformer Network [2019 CVPR]
 Action Recognition
Video Action Transformer Network [2019 CVPR]
 Action Recognition
End-to-End Object Detection with Transformers [2020 ECCV]
 Object Detection
[5] Ren, Shaoqing, et al. Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS. 2015.
End-to-End Object Detection with Transformers [2020 ECCV]
 Faster R-CNN
[5] Ren, Shaoqing, et al. Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS. 2015.
End-to-End Object Detection with Transformers [2020 ECCV]
 Non Maximum Suppression (NMS)
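For reference, greedy NMS fits in a few lines of plain Python. This is a generic sketch of the standard procedure, not code from any particular detector; the box format (x1, y1, x2, y2), the 0.5 threshold and the example boxes are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)   # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed as a duplicate; the threshold is a hand-tuned heuristic.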
End-to-End Object Detection with Transformers [2020 ECCV]
 Motivation
• Multi-staged pipeline
• Too many hand-crafted components or heuristics (e.g., non-maximum
suppression, anchor boxes) that explicitly encode our prior knowledge about
the task
• Let’s simplify these pipelines with an end-to-end philosophy!
• Let’s remove the need for heuristics with direct set prediction!
• Forces unique predictions via a bipartite matching loss between
predicted and ground-truth objects.
• Encoder-decoder architecture based on the Transformer
• Transformers explicitly model all pairwise interactions between
elements in a sequence, which makes them particularly suitable for the
specific constraints of set prediction, such as removing duplicate
predictions.
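The bipartite matching step can be illustrated with a tiny brute-force search. This is only a sketch: it tries every one-to-one assignment, which is feasible for a handful of boxes, whereas a practical implementation would use the Hungarian algorithm (for example SciPy's `linear_sum_assignment`). The L1 box cost, the (cx, cy, w, h) box format and the example values are assumptions; a real matching cost would also involve class predictions.

```python
from itertools import permutations

def l1(a, b):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match(preds, targets):
    """Find the one-to-one assignment of targets to predictions with minimal total cost.

    Returns a tuple whose t-th entry is the index of the prediction matched to target t.
    """
    best_cost, best_assign = float("inf"), None
    for assign in permutations(range(len(preds)), len(targets)):
        cost = sum(l1(preds[p], targets[t]) for t, p in enumerate(assign))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign, best_cost

preds = [(0.9, 0.9, 0.2, 0.2), (0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3)]
targets = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.4, 0.4)]
assignment, cost = match(preds, targets)
print(assignment)   # (1, 2)
```

Because each ground-truth object is matched to exactly one prediction, the unmatched prediction (index 0 here) would be pushed toward "no object", which is how duplicates are discouraged without NMS.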
End-to-End Object Detection with Transformers [2020 ECCV]
 DETR (DEtection TRansformer) in high-level
End-to-End Object Detection with Transformers [2020 ECCV]
 DETR in detail
End-to-End Object Detection with Transformers [2020 ECCV]
 Encoder self-attention
End-to-End Object Detection with Transformers [2020 ECCV]
 NMS & OOD
End-to-End Object Detection with Transformers [2020 ECCV]
 Decoder attention
End-to-End Object Detection with Transformers [2020 ECCV]
 Decoder output slot
End-to-End Object Detection with Transformers [2020 ECCV]
 DETR PyTorch inference code: very simple
An Image is Worth 16x16 Words [2021 ICLR under review]
 Motivation
• Large Transformer-based models are often pre-trained on large corpora
and then fine-tuned for the task at hand: BERT uses a denoising
self-supervised pre-training task, while the GPT line of work uses language
modeling as its pre-training task.
• Vision Transformer (ViT) yields modest results when trained on mid-sized
datasets such as ImageNet, achieving accuracies a few percentage
points below ResNets of comparable size. This seemingly discouraging
outcome may be expected: Transformers lack some inductive biases
inherent to CNNs, such as translation equivariance and locality, and
therefore do not generalize well when trained on insufficient amounts of
data.
• However, the picture changes if ViT is trained on large datasets
(14M-300M images), i.e., large-scale training trumps inductive bias.
Transformers attain excellent results when pre-trained at sufficient
scale and transferred to tasks with fewer datapoints.
An Image is Worth 16x16 Words [2021 ICLR under review]
 Vision Transformer (ViT)
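The "16x16 words" in the title refers to cutting the image into 16×16 patches and treating each flattened patch as one token. A minimal NumPy sketch of that step is given below; the 224×224×3 input size and the function name `patchify` are assumptions made for the example, and the later steps (linear projection of each patch, class token, position embeddings) are not shown.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patch x patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)              # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)     # (num_patches, patch * patch * C)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image)
print(tokens.shape)   # (196, 768)
```

A 224×224 image thus becomes a sequence of 14 × 14 = 196 tokens, short enough for full self-attention.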
An Image is Worth 16x16 Words [2021 ICLR under review]
 ViT vs. BiT (Alexander Kolesnikov et al. Big transfer (BiT): General visual representation learning. In ECCV, 2020.)
An Image is Worth 16x16 Words [2021 ICLR under review]
 Performance vs. pre-training samples
An Image is Worth 16x16 Words [2021 ICLR under review]
 Performance vs. cost
An Image is Worth 16x16 Words [2021 ICLR under review]
 Image Classification
Concluding Remarks
 Transformers are good at modeling inter-relationships between pixels
(or video clips, image patches, …).
 If a Transformer is pre-trained on a sufficient amount of data, it can
replace a CNN and still perform well.
 The Transformer is an even more generic architecture than the MLP (I
think…).
 Not only in NLP: Transformers also show astonishing results in
Vision!
 However, self-attention in the Transformer has quadratic complexity in
the sequence length.
 Further reading on reducing the quadratic complexity to linear
complexity:
 “Rethinking Attention with Performers” (ICLR 2021 under review)
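To see what quadratic complexity means in practice, the small calculation below counts the query-key scores in one full self-attention map. The 224×224 image size and 16×16 patch size are illustrative assumptions; only the n × n growth is the point.

```python
def attention_pairs(num_tokens):
    """Number of query-key scores in one full self-attention map."""
    return num_tokens * num_tokens

side, patch = 224, 16
patch_tokens = (side // patch) ** 2      # one token per 16x16 patch
pixel_tokens = side ** 2                 # one token per pixel

print(patch_tokens, attention_pairs(patch_tokens))   # 196 38416
print(pixel_tokens, attention_pairs(pixel_tokens))   # 50176 2517630976
```

Going from patches to pixels multiplies the sequence length by 256 and the attention map by 256² = 65536, which is why linear-complexity attention is an active topic.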
Thank You
shmwoo9395@{gist.ac.kr, gmail.com}

Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 

Dernier (20)

IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 

Transformer in Vision

  • 1. Transformer in Vision. Sangmin Woo, 2020.10.29. Papers covered: [2018 ICML] Image Transformer; [2019 CVPR] Video Action Transformer Network; [2020 ECCV] End-to-End Object Detection with Transformers; [2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  • 2. Contents: [2018 ICML] Image Transformer. Niki Parmar¹, Ashish Vaswani¹, Jakob Uszkoreit¹, Łukasz Kaiser¹, Noam Shazeer¹, Alexander Ku²,³, Dustin Tran⁴ (¹Google Brain, Mountain View, USA; ²Department of Electrical Engineering and Computer Sciences, University of California, Berkeley; ³Work done during an internship at Google Brain; ⁴Google AI, Mountain View, USA). [2019 CVPR] Video Action Transformer Network. Rohit Girdhar¹, João Carreira², Carl Doersch², Andrew Zisserman²,³ (¹Carnegie Mellon University; ²DeepMind; ³University of Oxford). [2020 ECCV] End-to-End Object Detection with Transformers. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko (Facebook AI). [2021 ICLR under review] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby (Google Research, Brain Team)
  • 3. Background: Attention is all you need [2017 NIPS] • The main idea of the original architecture is to compute self-attention by comparing a feature to all other features in the sequence. • Features are first mapped to query (Q) and memory (key and value, K & V) embeddings using linear projections. • The output for the query is computed as an attention-weighted sum of values (V), with the attention weights obtained from the product of the query (Q) with keys (K). • In practice, query (Q) is the word being translated, and keys (K) and values (V) are linear projections of the input sequence and the output sequence generated so far. • A positional encoding is also added to these representations in order to incorporate positional information, which is otherwise lost in this setup. [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
  • 4. Image Transformer [2018 ICML]: Generative models (Image Generation, Super-Resolution, Image Completion)
  • 5. Image Transformer [2018 ICML]: Pixel-RNN / Pixel-CNN (van den Oord et al., 2016) • Straightforward • Tractable likelihood • Simple and stable [2] van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. ICML, 2016. [3] van den Oord, Aäron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. NIPS, 2016. (Figure panels: Pixel-RNN, Pixel-CNN)
  • 6. Image Transformer [2018 ICML]: Motivation • Pixel-RNN and Pixel-CNN turn image generation into a sequence modeling problem by applying an RNN or a CNN to predict each next pixel given all previously generated pixels. • RNN is computationally heavy • CNN is parallelizable • CNN has a limited receptive field → long-range dependency problem → stacking more layers? → expensive • RNN has a virtually unlimited receptive field • Self-attention can achieve a better balance in the trade-off between the virtually unlimited receptive field of the necessarily sequential PixelRNN and the limited receptive field of the much more parallelizable PixelCNN and its various extensions
  • 7. Image Transformer [2018 ICML]: Image Completion & Super-resolution
  • 8. Image Transformer [2018 ICML]: Image Transformer • q: single channel of one pixel (query) • m_1, m_2, m_3: memory of previously generated pixels (key) • p_q, p_1, p_2, p_3: position encodings • cmp: first embed the query and key, then apply a dot product.
  • 9. Image Transformer [2018 ICML]: Local Self-Attention • The scalability bottleneck is the self-attention mechanism • Restrict the positions in the memory matrix M to a local neighborhood around the query position • 1D vs. 2D attention
  • 10. Image Transformer [2018 ICML]: Image Generation
  • 11. Image Transformer [2018 ICML]: Super-resolution
  • 12. Video Action Transformer Network [2019 CVPR]: Action Recognition [4] Sigurdsson, Gunnar A., et al. Hollywood in homes: Crowdsourcing data collection for activity understanding. ECCV, 2016.
  • 13. Video Action Transformer Network [2019 CVPR]: Action Recognition & Localization
  • 14. Video Action Transformer Network [2019 CVPR]: Action Transformer • The Action Transformer unit takes as input the video feature representation and the box proposal from the RPN and maps them into query (Q) and memory (K & V) features. • Query (Q): the person being classified. • Memory (K & V): the clip around the person. • The unit processes the query (Q) and memory (K & V) to output an updated query vector (Q*). • The intuition is that self-attention will add context from other people and objects in the clip to the query (Q) vector, to aid the subsequent classification. • This unit can be stacked in multiple heads and layers, by concatenating the output from the multiple heads at a given layer and using the concatenated feature as the next query. • This updated query (Q*) is then used to attend again to context features in the following layer.
  • 15. Video Action Transformer Network [2019 CVPR]: Action Recognition
  • 16. Video Action Transformer Network [2019 CVPR]: Action Recognition
  • 17. Video Action Transformer Network [2019 CVPR]: Action Recognition
  • 18. End-to-End Object Detection with Transformers [2020 ECCV]: Object Detection [5] Ren, Shaoqing, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, 2015.
  • 19. End-to-End Object Detection with Transformers [2020 ECCV]: Faster R-CNN [5] Ren, Shaoqing, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS, 2015.
  • 20. End-to-End Object Detection with Transformers [2020 ECCV]: Non-Maximum Suppression (NMS)
  • 21. End-to-End Object Detection with Transformers [2020 ECCV]: Motivation • Multi-stage pipeline • Too many hand-crafted components or heuristics (e.g., non-maximum suppression, anchor boxes) that explicitly encode our prior knowledge about the task • Let's simplify these pipelines with an end-to-end philosophy! • Let's remove the need for heuristics with direct set prediction! • Force unique predictions via a bipartite matching loss between predicted and ground-truth objects. • Encoder-decoder architecture based on the Transformer • Transformers explicitly model all pairwise interactions between elements in a sequence, which makes them particularly suitable for specific constraints of set prediction such as removing duplicate predictions.
  • 22. End-to-End Object Detection with Transformers [2020 ECCV]: DETR (DEtection TRansformer) at a high level
  • 23. End-to-End Object Detection with Transformers [2020 ECCV]: DETR in detail
  • 24. End-to-End Object Detection with Transformers [2020 ECCV]: Encoder self-attention
  • 25. End-to-End Object Detection with Transformers [2020 ECCV]: NMS & OOD
  • 26. End-to-End Object Detection with Transformers [2020 ECCV]: Decoder attention
  • 27. End-to-End Object Detection with Transformers [2020 ECCV]: Decoder output slot
  • 28. End-to-End Object Detection with Transformers [2020 ECCV]: DETR PyTorch inference code: very simple
  • 29. An Image is Worth 16x16 Words [2021 ICLR under review]: Motivation • Large Transformer-based models are often pre-trained on large corpora and then fine-tuned for the task at hand: BERT uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task. • Vision Transformer (ViT) yields modest results when trained on mid-sized datasets such as ImageNet, achieving accuracies a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data. • However, the picture changes if ViT is trained on large datasets (14M-300M images), i.e., large-scale training trumps inductive bias. Transformers attain excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints.
  • 30. An Image is Worth 16x16 Words [2021 ICLR under review]: Vision Transformer (ViT)
  • 31. An Image is Worth 16x16 Words [2021 ICLR under review]: ViT vs. BiT (Alexander Kolesnikov et al. Big Transfer (BiT): General visual representation learning. In ECCV, 2020.)
  • 32. An Image is Worth 16x16 Words [2021 ICLR under review]: Performance vs. pre-training samples
  • 33. An Image is Worth 16x16 Words [2021 ICLR under review]: Performance vs. cost
  • 34. An Image is Worth 16x16 Words [2021 ICLR under review]: Image Classification
  • 35. Concluding Remarks • Transformers are competent at modeling inter-relationships between pixels (video clips, image patches, …) • If a Transformer is pre-trained on a sufficient amount of data, it can replace the CNN and still perform well • The Transformer is a generic architecture, even more so than the MLP (I think…) • Not only in NLP: Transformers also show astonishing results in vision! • But the Transformer is known to have quadratic complexity • Further reading that reduces the quadratic complexity to linear complexity: “Rethinking Attention with Performers” (ICLR 2021 under review)
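
The attention computation summarized on slide 3 (attention weights from the product of the query with the keys, output as the weighted sum of the values) can be sketched roughly as follows. This is a minimal illustration in plain Python, not code from the slides or the papers: the function names `softmax` and `attend` and the toy vectors are made up for this example, the learned linear projections and positional encodings are omitted, and the division by the square root of the dimension follows the scaling used in "Attention is all you need".

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Attend from one query vector to a memory of key/value vectors.

    Returns (output, weights). Hypothetical helper for illustration only:
    the learned Q/K/V projections are assumed to have been applied already.
    """
    scale = math.sqrt(len(query))
    # Similarity of the query to every key (scaled dot product).
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Output is the attention-weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Toy memory of three positions (values chosen arbitrarily).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attend(query, keys, values)
```

In this toy case the first key is the most similar to the query, so the first value receives the largest weight. Computing `attend` for every position against every other position is what gives self-attention its quadratic cost in sequence length.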

Editor's notes

  1. We adjust the concentration of the distribution we sample from with a temperature tau > 0 by which we divide the logits for the channel intensities.
  2. Remove duplicates
  3. Naive application of self-attention to images would require each pixel to attend to every other pixel. With a cost that is quadratic in the number of pixels, this does not scale to realistic input sizes.
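
The temperature adjustment in note 1 can be illustrated with a small sketch. It assumes the logits for one channel intensity are given as a plain list of floats; the helper names `temperature_probs` and `sample_intensity` and the example logits are hypothetical and are not taken from the Image Transformer code.

```python
import math
import random


def temperature_probs(logits, tau):
    """Divide the logits by tau > 0, then apply a stable softmax."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [logit / tau for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_intensity(logits, tau, rng):
    """Sample an intensity index from the temperature-adjusted distribution."""
    probs = temperature_probs(logits, tau)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up logits over four intensity values.
logits = [2.0, 1.0, 0.0, -1.0]
sharp = temperature_probs(logits, 0.5)  # tau < 1 concentrates the distribution
flat = temperature_probs(logits, 2.0)   # tau > 1 spreads it out
pick = sample_intensity(logits, 0.5, random.Random(0))
```

A temperature below 1 puts more probability mass on the most likely intensity, while a temperature above 1 moves the distribution toward uniform.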