Transformers In Vision
From Zero to Hero!
Davide Coccomini & Nicola Messina
Davide Coccomini — PhD Candidate, Italian National Research Council
Nicola Messina — PhD Student, Italian National Research Council
What do you think when you hear the word «Transformer»?
Davide Coccomini & Nicola Messina | AICamp 2021
The Transformer «today»
[Diagram: the encoder–decoder Transformer. N× encoder layers and M× decoder layers, each built from Multi-Head Attention (V, K, Q), Add & Norm and Feed Forward blocks; Positional Encodings are added to the input and output embeddings; the decoder attends to the encoder memory, and a final Linear + Softmax produces the next-token distribution — e.g. for input "The cat jumps the wall" and partial output "Il gatto": Salta (90%) | Odia (9%) | Perché (1%).]
Outline
• Transformers: the beginnings — some history, from RNNs to Transformers
• Transformers' attention and self-attention mechanisms
• The power of the Transformer Encoder
• From text to images: Vision Transformers
• From images to videos
• The scale and data problem
• Convolutional Neural Networks and Vision Transformers
• Some interesting real-world applications
Transformers in Vision: a brief history
2017 — Transformers introduced in NLP (text)
2020 — Vision Transformers (images)
2021 — Transformers for video understanding (videos)
Now — a Computer Vision revolution!
A step back: Recurrent Networks (RNNs)
[Diagram: an RNN encoder–decoder translating "The cat jumps the wall" into "Il gatto salta il muro <end>". The encoder E consumes one source token per step, producing hidden states h0…h4; the final sentence embedding feeds the decoder D, which, starting from <s>, emits the target tokens through hidden states h5…h9.]
Problems
1. We forget tokens too far in the past
2. We need to wait for the previous token to compute the next hidden state
[Diagram: the same encoder–decoder unrolled — each hidden state depends on the previous one, so distant tokens fade from the final sentence embedding and nothing can be computed in parallel.]
Solving problem 1 — «We forget tokens too far in the past»
Solution: add an attention mechanism.
[Diagram: at every decoding step, the decoder attends over all encoder hidden states h0…h4; their attention-weighted sum forms a context vector that is combined with the decoder state, so no token is forgotten.]
Solving problem 2 — «We need to wait for the previous token to compute the next hidden state»
Solution (from the 2017 paper "Attention Is All You Need"): throw away the recurrent connections and keep only attention.
[Diagram: the same model without recurrent links — every position can now be processed in parallel.]
Full Transformer Architecture
[Diagram: the complete encoder–decoder — Positional Encoding on both sides, N× encoder and M× decoder stacks of Multi-Head Attention (V, K, Q), Add & Norm and Feed Forward blocks, the encoder memory feeding the decoder's cross-attention, and a final Linear + Softmax head: "The cat jumps the wall" + "Il gatto" → "Salta" (90%) | "Odia" (9%) | "Perché" (1%).]
Transformer's Attention Mechanism
[Diagram: the target tokens ("Il", "gatto", "salta") are projected by feed-forward layers (FFN) into Queries, while the source tokens ("The", "cat", "jumps", "the", …) are projected into Keys and Values. Each Query is dot-multiplied with every Key, the scores are normalized with a softmax, and the resulting weights aggregate the Values — each target token is rebuilt "from the point of view" of the source sequence.]
Transformer's Attention Mechanism — from a different perspective
[Diagram: attention as a soft lookup table. The source sequence ("The", "cat", "jumps", "the", …, "wall") builds a dictionary of Key–Value pairs (e.g. keys "A4", "N9", "O7", "A4", "N2"); a target token issues a Query (e.g. "N5") that soft-matches against all keys, and the output — here the "gatto" token — is built as the weighted average of the value vectors in the source dictionary.]
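The soft-lookup view above can be sketched in a few lines of NumPy — a toy illustration with made-up dimensions and random vectors, not the deck's actual implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention as a differentiable lookup table:
    each query row soft-matches against all key rows, and the output is
    the softmax-weighted average of the value rows."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # query-key dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
K = rng.normal(size=(5, 8))   # 5 source tokens ("The cat jumps the wall")
V = rng.normal(size=(5, 8))
Q = rng.normal(size=(3, 8))   # 3 target tokens ("Il gatto salta")
out, w = attention(Q, K, V)
print(out.shape)              # (3, 8): one aggregated vector per target token
print(w.sum(axis=-1))         # each row of soft-matching weights sums to 1
```

Each output row is exactly the "weighted average" of source value vectors from the lookup-table picture.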
Attention and Self-Attention

Self-Attention
• Source = Target
• Queries, Keys and Values obtained from the same sentence
• Captures intra-sequence dependencies
  "I gave my dog Charlie some food" — Who? What? To whom?

Attention (cross-attention)
• Source ≠ Target
• Queries from the Target, Keys and Values from the Source
• Captures inter-sequence dependencies
  "I gave my dog Charlie some food" ↔ "Ho dato da mangiare al mio cane Charlie"
Multi-Head Self-Attention
Multiple instantiations of the attention mechanism.
[Diagram: for an input such as "The cat is running", the Queries, Keys and Values each pass through separate linear layers; each of the h heads computes attention on its own slice (dot product, normalize, matmul with the Values), and the h outputs are concatenated and mixed by a final dense layer (CONCAT + DENSE).]
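A minimal NumPy sketch of the head-slicing described above — random stand-in weights (real models learn them), and dimensions chosen only for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, h):
    """Project X into Q, K, V; split each into h head slices; attend per
    head; concatenate the head outputs and mix with a final dense layer."""
    n, d = X.shape
    dh = d // h                                   # per-head slice width
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(h):
        s = slice(i * dh, (i + 1) * dh)
        att = softmax(Q[:, s] @ K[:, s].T / np.sqrt(dh))  # dot + normalize
        heads.append(att @ V[:, s])                        # matmul with values
    return np.concatenate(heads, axis=-1) @ Wo             # concat + dense

rng = np.random.default_rng(1)
d, h = 16, 4
X = rng.normal(size=(4, d))                      # 4 tokens: "The cat is running"
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
Y = multi_head_self_attention(X, Wq, Wk, Wv, Wo, h)
print(Y.shape)   # (4, 16): same shape as the input token matrix
```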
Full Transformer Architecture
[Diagram: the same encoder–decoder architecture, now annotated. The encoder and the first decoder block use Multi-Head Self-Attention (Q, K and V all come from the same sequence), while the second decoder block uses Multi-Head Attention over the encoder memory — a lookup table of Key–Value pairs built from the source sequence.]
Is there any problem with Transformers?
Self-attention calculation is O(n²): every token of "I gave my dog Charlie some food" attends to every other token, so the number of attention calculations grows quadratically with the sequence length.
The Power of the Transformer Encoder
• Many achievements using only the Encoder
• BERT (Devlin et al., 2018)
[Diagram: the input "<CLS> I gave my dog Charlie some food <SEP> He ate it" passes through an Embedding Layer plus Positional Encoding, then N Transformer Encoder layers. Pre-training objectives: Masked Language Modelling (predict the masked token, e.g. «ate») and Next Sentence Prediction {0, 1}.]
Transformers in Computer Vision
Can we use the self-attention mechanism on images?
• The transformer works with a set of tokens — what are the tokens in images?
• Pixels won't do: a 256px × 256px image has 65,536 pixels, which would require over 4 billion attention calculations. Impossible!
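The quadratic blow-up is easy to verify by counting scores, treating each pixel of a 256×256 image as a token:

```python
def attention_scores(n_tokens: int) -> int:
    """Self-attention compares every token with every other: n^2 scores."""
    return n_tokens * n_tokens

print(attention_scores(7))          # a 7-word sentence: 49 scores
print(attention_scores(14))         # doubling the length quadruples the work
print(attention_scores(256 * 256))  # 65,536 pixel tokens: > 4 billion scores
```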
Transformers in Computer Vision
• Tokens as the features from an object detector
[Diagram: a detector proposes regions on the image; ROI Pooling turns each region into a feature vector — those vectors are the tokens!]
Vision Transformers (ViTs)
«An image is worth 16x16 words»
[Diagram: the 256px × 256px image is cut into 16px × 16px patches; each patch is flattened and passed through a linear projection — the patches are the tokens!]
Vision Transformers (ViTs)
"An image is worth 16x16 words" | Dosovitskiy et al., 2020
[Diagram: the flattened patches 1…9 go through a Linear Projection; a learnable class token * is prepended and position embeddings 0…9 are added; the sequence passes through the Transformer Encoder, and an MLP Head on the class token outputs the CLASS.]
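The ViT tokenization step can be sketched with plain array reshapes — a toy version with a random image and random stand-in weights for the learned projection, class token and position embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))                # a fake 256x256 RGB image
P, D = 16, 64                                  # patch size, embedding dim

# Cut the image into 16x16 patches and flatten each one.
patches = img.reshape(256 // P, P, 256 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)       # (256 patches, 768 values each)

W = rng.normal(size=(P * P * 3, D))            # linear projection of flattened patches
tokens = patches @ W                           # (256, 64)

cls = rng.normal(size=(1, D))                  # learnable class token (stand-in)
tokens = np.vstack([cls, tokens])
tokens += rng.normal(size=tokens.shape)        # + position embeddings (stand-in)
print(tokens.shape)                            # (257, 64): ready for the encoder
```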
Image Classification on ImageNet
[Figure: ImageNet classification benchmark results.]
What about video?
TimeSformers
Combine space and time attention with Divided Space-Time Attention!
[Diagram: for each patch, attention runs along the Time axis (the same patch in frames t−δ, t, t+δ) and, separately, along the Space axis (all patches of frame t).]
Is Space-Time Attention All You Need for Video Understanding? | Gedas Bertasius et al.
TimeSformers
Up to several minutes of analysis!
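A rough way to see why divided attention makes minutes-long clips feasible is to count pairwise comparisons per layer; the token counts below (8 frames, 14×14 patches) are illustrative, not the paper's exact configuration:

```python
# Joint space-time attention compares all T*S tokens against each other.
# Divided Space-Time Attention lets each of the T*S tokens attend first
# along time (T tokens) and then along space (S tokens).
T, S = 8, 196                          # frames, patches per frame (14x14)

joint = (T * S) ** 2                   # full space-time attention
divided = (T * S) * T + (T * S) * S    # time attention + space attention
print(joint, divided)                  # divided attention is far cheaper
```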
Wait… can I use different types of attention???
Transformers use a lot of memory!
[Diagram: during training, every Attention and Feed Forward layer stores its activations for backpropagation — e.g. 2 GB each — so the used memory grows layer after layer: 2 GB, 4 GB, … a lot!]
Efficient Transformers
[Diagram: with reversible (Rev) attention, each layer's activations can be recomputed from the following layer during the backward pass, so they need not be stored — the used memory stays flat (e.g. 4 GB) instead of growing with depth.]
A new efficient Transformer variant | Lukasz Kaiser
Swin Transformers
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | Ze Liu et al.
[Diagram: the H × W × 3 image goes through Patch Partition into 4×4 patches (H/4 × W/4 × 48), then four stages:
Stage 1 — Linear Embedding + Swin Transformer Block ×2 → H/4 × W/4 × C
Stage 2 — Patch Merging + Swin Transformer Block ×2 → H/8 × W/8 × 2C
Stage 3 — Patch Merging + Swin Transformer Block ×6 → H/16 × W/16 × 4C
Stage 4 — Patch Merging + Swin Transformer Block ×2 → H/32 × W/32 × 8C]
[Comparison: the Swin Transformer builds a hierarchical feature pyramid (4×, 8×, 16× downsampling), while the Vision Transformer keeps a single 16× resolution throughout.]
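The Patch Merging step that halves resolution while doubling channels can be sketched as a reshape plus a linear reduction — random stand-in weights, and the 224px/C=96 sizes below are just one common configuration:

```python
import numpy as np

def patch_merging(x):
    """Swin-style patch merging sketch: group each 2x2 neighbourhood of
    tokens, concatenate their channels (C -> 4C), then reduce to 2C with
    a linear layer (random stand-in weights here)."""
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    x = x.reshape(H // 2, W // 2, 4 * C)                  # 2x2 neighbourhoods
    W_red = np.random.default_rng(0).normal(size=(4 * C, 2 * C))
    return x @ W_red                                      # 4C -> 2C

x = np.zeros((56, 56, 96))    # Stage 1 output: H/4 x W/4 x C for a 224px image
y = patch_merging(x)
print(y.shape)                # (28, 28, 192): H/8 x W/8 x 2C
```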
Shifted Window based Self-Attention
[Diagram: in Self-Attention Layer l, attention is computed inside regular non-overlapping windows; in Layer l+1 the windows are shifted, letting information flow across window boundaries.]
Swin Transformers
Source: Swin Transformer Object Detection Demo – By DeepReader
https://www.youtube.com/watch?v=FQVS_0Bja6o
Can we do without Self-Attention?
What essentially is the attention mechanism? It is «just» a transformation!
Attention Mechanism vs Fourier Network
[Diagram: both blocks share the same pipeline — Input → Embeddings → Add & Normalize → Feed Forward → Add & Normalize → Dense → Output Prediction — but the Fourier Network replaces the Attention Calculation with a Fourier Transformation.]
Why Fourier?
It's just a transformation!
[Image: the Fourier Transform, from mriquestion.com]
Fourier Network — what does it transform?
[Diagram: one Fourier Transform is applied over the hidden domain (within each input vector) and one over the sequence domain (across the input vectors).]
FNet: Mixing Tokens with Fourier Transforms | James Lee-Thorp et al.
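The FNet mixing step — a Fourier Transform over the hidden domain, one over the sequence domain, keeping the real part — needs no learned weights at all; here is a shape-level sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))    # 6 tokens, hidden size 16

# FFT over the hidden domain (axis=-1), then over the sequence domain
# (axis=0); keep only the real part, as in the FNet paper.
mixed = np.fft.fft(np.fft.fft(X, axis=-1), axis=0).real
print(mixed.shape)              # (6, 16): same shape, tokens now mixed
```

Because the transform is fixed, this "attention replacement" costs O(n log n) instead of O(n²) and has zero parameters.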
MLP Mixer
[Diagram: image patches 1…9 are each mapped by a Per-patch Fully Connected layer, followed by N × (Mixer Layer), Global Average Pooling and a Fully-Connected head producing the CLASS.]
[Mixer Layer: Layer Norm, then MLPs transforming over the sequence domain (token mixing), then Layer Norm and MLPs transforming over the hidden domain (channel mixing), each with a skip connection.]
MLP-Mixer: An all-MLP Architecture for Vision | Ilya Tolstikhin et al.
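One Mixer layer can be sketched as two small MLPs applied along the two domains — random stand-in weights, and GELU simplified to ReLU for brevity:

```python
import numpy as np

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2          # GELU simplified to ReLU here

def mixer_layer(X, Wt1, Wt2, Wc1, Wc2):
    """Token-mixing MLP over the sequence domain (via transpose), then
    channel-mixing MLP over the hidden domain, each with a skip connection."""
    Y = X + mlp(layer_norm(X).T, Wt1, Wt2).T   # transform over the sequence
    return Y + mlp(layer_norm(Y), Wc1, Wc2)    # transform over the channels

rng = np.random.default_rng(0)
S, C = 9, 32                                   # 9 patches, 32 channels
X = rng.normal(size=(S, C))
Wt1, Wt2 = rng.normal(size=(S, 2 * S)), rng.normal(size=(2 * S, S))
Wc1, Wc2 = rng.normal(size=(C, 2 * C)), rng.normal(size=(2 * C, C))
out = mixer_layer(X, Wt1, Wt2, Wc1, Wc2)
print(out.shape)                               # (9, 32)
```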
What happens during training?
[Chart: as training data grows, the Convolutional Neural Network's learned knowledge plateaus (not improving anymore), while the Vision Transformer is still improving.]
Why are they different?

Convolutional Neural Network
• Learns inductive biases
• Locality sensitive
• Translation invariant
• Lacks global understanding

Vision Transformers
• Able to find long-term dependencies
• Need very large datasets for training
A different point of view — ViTs are both local and global!
With a low amount of data, the ViT learns only global information: its heads focus on farther patches (e.g. attention weights of 0.7 on distant patches vs 0.01 on nearby ones).
With more data, the ViT also learns local information: higher-layer heads still focus on farther patches (e.g. 0.4 and 0.3 vs 0.01), while lower-layer heads focus on both farther and closer patches (e.g. 0.6 and 0.06).
A different point of view — they learn different representations!
[Figure: Vision Transformers keep similar representations through the layers, while CNN representations change substantially through the layers.]
Do Vision Transformers See Like Convolutional Neural Networks? | Maithra Raghu et al.
A different point of view — Vision Transformers are very robust!
They hold up well under occlusion, distribution shift, adversarial perturbation and patch permutation.
Intriguing Properties of Vision Transformers | Muzammal Naseer et al.
Can we obtain the best of the two architectures?
What happens in CNNs? The feature maps shrink layer after layer: 28×28 → 24×24 → 12×12 → 8×8 → 4×4 … Hey! They are patches!
[Diagram: Convolutional Neural Network → Transformer Encoder → MLP → CLASS]
Hybrids — a possible configuration!
Combining EfficientNet and Vision Transformers for Video Deepfake Detection | Coccomini et al.
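The hybrid idea — a CNN backbone's final feature map read as patch tokens — can be sketched at the shape level; average pooling stands in for the convolutional backbone, and the 224px/256-channel sizes are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))

# Fake "CNN backbone": 32x spatial downsampling by average pooling,
# then a channel expansion via a random stand-in projection.
feat = img.reshape(7, 32, 7, 32, 3).mean(axis=(1, 3))        # (7, 7, 3)
feat = feat @ rng.normal(size=(3, 256))                      # (7, 7, 256)

# Each spatial position of the feature map becomes one patch token.
tokens = feat.reshape(-1, 256)
print(tokens.shape)        # (49, 256): ready for a Transformer Encoder + MLP head
```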
Recap — how can we use Transformers in Vision?
• Use a pure Vision Transformer
• Improve the internal attention mechanism
• Use an alternative transformation
• Combine CNNs with Vision Transformers
ImageNet Ranking
1° CoAtNet-7 — 90.88% top-1 accuracy | Conv + Vision Transformer | pretrained on JFT | 2440M parameters
2° ViT-G/14 — 90.45% top-1 accuracy | Vision Transformer | pretrained on JFT | 1843M parameters
3° ViT-MoE-15B — 90.35% top-1 accuracy | Vision Transformer | pretrained on JFT | 14700M parameters
APPLICATIONS!
• Self-supervised learning on video: DINO (Source: "Advancing the state of the art in computer vision with self-supervised Transformers and 10x more efficient training" – Facebook AI)
• Neural painting (Source: Paint Transformer: Feed Forward Neural Painting with Stroke Prediction)
• Image generation (Source: Image GPT – By OpenAI)

Thank You for the Attention!
Any questions?
 
Causal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningCausal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine Learning
 
Weekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on MobileWeekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on Mobile
 
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine LearningWeekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
 
AISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with MicroeconomicsAISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with Microeconomics
 
AISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First WorldAISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First World
 
AISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the EdgeAISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the Edge
 
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 

Dernier

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

Dernier (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

Transformers in Vision: From Zero to Hero

  • 1. Transformers In Vision From Zero to Hero! Davide Coccomini & Nicola Messina Davide Coccomini Nicola Messina PhD Candidate Italian National Research Council PhD Student Italian National Research Council Reach me on … Reach me on …
  • 2. What do you think when you hear the word «Transformer»? Davide Coccomini & Nicola Messina | AICamp 2021
  • 3.
  • 4. The Transformer «today» Input Output Multi-Head Attention Feed Forward Feed Forward Add & Norm Add & Norm Add & Norm Add & Norm Multi-Head Attention Multi-Head Attention + + Salta (90%) | Odia (9%) | Perchè (1%) The cat jumps the wall Il gatto Positional Encoding Positional Encoding V K Q Nx Mx Davide Coccomini & Nicola Messina | AICamp 2021 V K Q V K Q Encoder Decoder Add & Norm Linear + Softmax Memory
  • 5. Outline Some history: from RNNs to Transformers Transformers’ attention and self-attention mechanisms The power of the Transformer Encoder From text to images: Vision Transformers Transformers: The beginnings From images to videos The scale and data problem Convolutional Neural Networks and Vision Transformers Some interesting real-world applications Transformers in Vision
  • 6. Videos Images Text History Introduced transformers in NLP 2017 Vision Transformers 2020 2021 Transformers for video understanding Now Computer Vision Revolution! Transformers Davide Coccomini & Nicola Messina | AICamp 2021
  • 7. A step back: Recurrent Networks (RNNs) E The cat jumps the wall h0 hstart E h1 E h2 E h3 E h4 D h5 D h6 D h7 D h8 D Il gatto salta il h9 D <end> Davide Coccomini & Nicola Messina | AICamp 2021 muro Encoder Final sentence embedding Decoder <s>
  • 8. Problems 1. We forget tokens too far in the past 2. We need to wait for the previous token to compute the next hidden-state E The cat jumps the wall h0 hstart E h1 E h2 E h3 E h4 D h5 D h6 D h7 D h8 D Il gatto salta il muro h9 D <end> <s> Davide Coccomini & Nicola Messina | AICamp 2021
  • 9. Solving problem 1 "We forget tokens too far in the past" E The cat jumps the wall h0 hstart E h1 E h2 E h3 E h4 D D D Il gatto salta + + + context = Attention Solution Add an attention mechanism + h5 h6 <s>
  • 10. Solving problem 2 "We need to wait for the previous token to compute the next hidden-state" 2017 paper "Attention Is All You Need" Solution Throw away recurrent connections E The cat jumps the wall h0 hstart E h1 E h2 E h3 E h4 D D D Il gatto salta + + + context = Attention + h5 h6 <s> Davide Coccomini & Nicola Messina | AICamp 2021
  • 11. Full Transformer Architecture Input Output Multi-Head Attention Feed Forward Feed Forward Add & Norm Add & Norm Add & Norm Add & Norm Multi-Head Attention Multi-Head Attention + + “Salta” (90%) | “Odia” (9%) | “Perchè” (1%) “The cat jumps the wall” “Il gatto” Positional Encoding Positional Encoding V K Q Nx Mx Davide Coccomini & Nicola Messina | AICamp 2021 V K Q V K Q Encoder Decoder Add & Norm Linear + Softmax Memory
  • 12. Transformer's Attention Mechanism Target tokens “from the point of view” of the source sequence Queries Target Sequence Source Sequence FFN FFN FFN FFN Keys & Values FFN FFN ∙ ∙ ∙ ∙ Norm & Softmax Dot product Il gatto salta The cat jumps the … … FFN
  • 13. Key Value “A4” “N9” “O7” “A4” “N2” Transformer's Attention Mechanism From a different perspective Query “N5” Weighted average “gatto” token, built by aggregating value vectors in the source dictionary Lookup Table Target Sequence Source Sequence Il gatto salta The cat jumps the … wall Soft-matching Davide Coccomini & Nicola Messina | AICamp 2021
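The soft lookup-table view above can be sketched in a few lines of NumPy: queries are dotted against keys, the scores are normalized with a softmax (the "soft-matching"), and the output token is a weighted average of the value vectors. This is a minimal illustration of scaled dot-product attention, not the slide authors' code; the token counts and dimensions are made up.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention as a soft dictionary lookup.
    Q: (n_target, d); K, V: (n_source, d)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # query/key similarity
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # weighted average of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # e.g. 3 target tokens ("Il gatto salta")
K = rng.normal(size=(5, 8))   # 5 source tokens ("The cat jumps the wall")
V = rng.normal(size=(5, 8))
out = attention(Q, K, V)
print(out.shape)  # (3, 8): one contextualized vector per target token
```

Each output row is the "gatto" situation from the slide: a target token rebuilt by aggregating value vectors from the source sequence.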
  • 14. Attention and Self-Attention Self-Attention • Source = Target • Key, Queries, Values obtained from the same sentence • Captures intra-sequence dependencies Attention • Source ≠ Target • Queries from Source • Key, Values from Target • Captures inter-sequence dependencies I gave my dog Charlie some food Ho dato da mangiare al mio cane Charlie I gave my dog Charlie some food To whom? What? Who? Multi-Head Attention V K Q Multi-Head Attention V K Q Target Source Davide Coccomini & Nicola Messina | AICamp 2021
  • 15. LINEAR LINEAR LINEAR LINEAR LINEAR LINEAR LINEAR LINEAR LINEAR DOT NORMALIZE MATMUL CONCAT + DENSE VALUES KEYS QUERIES Multi-Head Self-Attention Multiple instantiations of the attention mechanism h h h Davide Coccomini & Nicola Messina | AICamp 2021 The cat is running h slices
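The multi-head diagram boils down to slicing the hidden dimension into h pieces, running one attention instance per slice, then concatenating and projecting. The sketch below (illustrative shapes and random weights, not a real implementation) shows the self-attention case where queries, keys, and values all come from the same sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (n, d). Wq/Wk/Wv/Wo: (d, d). Splits d into n_heads slices."""
    n, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # reshape to (n_heads, n, dh): one attention instance per head
    split = lambda M: M.reshape(n, n_heads, dh).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh)   # (n_heads, n, n)
    heads = softmax(scores) @ Vh                        # (n_heads, n, dh)
    concat = heads.transpose(1, 0, 2).reshape(n, d)     # CONCAT step
    return concat @ Wo                                  # final dense projection

rng = np.random.default_rng(0)
n, d = 5, 16
X = rng.normal(size=(n, d))
W = lambda: rng.normal(size=(d, d)) * 0.1
out = multi_head_self_attention(X, W(), W(), W(), W(), n_heads=4)
print(out.shape)  # (5, 16)
```

Each head can specialize ("To whom? What? Who?" from the earlier slide) because each attends over a different learned projection of the same tokens.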
  • 16. Full Transformer Architecture Input Output Feed Forward Feed Forward Add & Norm Add & Norm Add & Norm Add & Norm Multi-Head Attention Multi-Head Attention + + Salta (90%) | Odia (9%) | Perchè (1%) The cat jumps the wall Il gatto Positional Encoding Positional Encoding V K Q Nx Mx Davide Coccomini & Nicola Messina | AICamp 2021 V K Q V K Q Encoder Decoder Add & Norm Linear + Softmax Memory Key Value “A4” “N9” “O7” “A4” “N2” Lookup Table (source sequence) Multi-Head Self Attention Multi-Head Attention
  • 17. Is there any problem with Transformers? Attention calculation is O(n²) Self-Attention I gave my dog Charlie some food I gave my dog Charlie some food I gave my dog Charlie some food . . . . . . . . . . Davide Coccomini & Nicola Messina | AICamp 2021
  • 18. I gave my dog Charlie some food I gave my dog Charlie some food Attention calculation is O(n²) Self-Attention Davide Coccomini & Nicola Messina | AICamp 2021
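The quadratic cost is easy to feel with concrete numbers: self-attention over n tokens materializes an n×n weight matrix, so doubling the sequence length quadruples memory and compute. A quick back-of-the-envelope check:

```python
# self-attention over n tokens builds an n x n score matrix
for n in (256, 1024, 4096):
    entries = n * n
    print(f"n={n:5d}  attention matrix: {entries:>12,} entries "
          f"({entries * 4 / 1e6:9.1f} MB as float32)")
# doubling n -> 4x the entries; 16x the tokens -> 256x the memory
```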
  • 19. The Power of the Transformer Encoder • Many achievements using only the Encoder • BERT (Devlin et al., 2018) Next Sentence Prediction {0, 1} Transformer Encoder (N layers) I gave my dog Charlie some food <CLS> <SEP> He ate it Positional Encoding Embedding Layer Masked Language Modelling «ate» Memory
  • 20. Transformers in Computer Vision Can we use the self-attention mechanism in images? Davide Coccomini & Nicola Messina | AICamp 2021
  • 21. Transformers in Computer Vision 256px 256px 3,906,250,000 calculations Impossible! 62,500 pixels • The transformer works with a set of tokens • What are tokens in images? Davide Coccomini & Nicola Messina | AICamp 2021
  • 22. • Tokens as the features from an object detector Transformers in Computer Vision Tokens! ROI Pooling ROI Pooling ROI Pooling Davide Coccomini & Nicola Messina | AICamp 2021
  • 23. “An image is worth 16x16 words” Image to Patches Tokens! Linear Projection 256px 256px 16px 16px Vision Transformers (ViTs) Davide Coccomini & Nicola Messina | AICamp 2021 “An image is worth 16x16 words” | Dosovitskiy et al., 2020
  • 24. Vision Transformers (ViTs) 0 * 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 MLP Head CLASS Linear Projection of Flattened Patches Transformer Encoder Davide Coccomini & Nicola Messina | AICamp 2021
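The ViT front end on this slide can be sketched with plain NumPy: cut the image into 16×16 patches, flatten each one, project it linearly, prepend a [CLS] token, and add positional embeddings. This is a shape-level illustration with random weights (the embedding dimension 384 is an arbitrary choice here), not the paper's implementation.

```python
import numpy as np

def image_to_patch_tokens(img, patch=16):
    """img: (H, W, C) -> (n_patches, patch*patch*C) flattened patches."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    p = img.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4)            # (H/p, W/p, p, p, C)
    return p.reshape(-1, patch * patch * C)   # one row per patch

rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))
tokens = image_to_patch_tokens(img)           # (256, 768): "16x16 words"
E = rng.normal(size=(768, 384)) * 0.1         # learned linear projection
cls = np.zeros((1, 384))                      # prepended [CLS] token
pos = rng.normal(size=(257, 384)) * 0.1       # learned positional embeddings
x = np.vstack([cls, tokens @ E]) + pos        # encoder input: (257, 384)
print(x.shape)
```

A 256×256 image with 16px patches yields exactly 256 tokens of 16·16·3 = 768 values each, which is why the sequence stays manageable for self-attention.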
  • 25. Image Classification on ImageNet Davide Coccomini & Nicola Messina | AICamp 2021
  • 26. What about video? Davide Coccomini & Nicola Messina | AICamp 2021
  • 27. TimeSformers Combine space and time attention with Divided Space-Time Attention! frame t - δ frame t frame t + δ Space Time Time Davide Coccomini & Nicola Messina | AICamp 2021 Is Space-Time Attention All You Need for Video Understanding? | Gedas Bertasius et al.
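Divided Space-Time Attention factorizes the joint attention over all T·N video tokens into a temporal pass (each patch position attends across frames) followed by a spatial pass (each frame attends across its patches), dropping the cost from O(T²N²) to O(T²N + TN²). A minimal sketch under those assumptions, with illustrative shapes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def divided_space_time(x):
    """x: (T, N, d) video tokens. Attention factorized along each axis."""
    T, N, d = x.shape
    # time attention: for each spatial patch, attend across the T frames
    t = np.stack([attend(x[:, i], x[:, i], x[:, i]) for i in range(N)], axis=1)
    # space attention: for each frame, attend across the N patches
    return np.stack([attend(t[f], t[f], t[f]) for f in range(T)], axis=0)

rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16, 32))   # 8 frames, 16 patches, 32 dims
out = divided_space_time(video)
print(out.shape)  # (8, 16, 32)
```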
  • 28. TimeSformers Up to several minutes of analysis! Davide Coccomini & Nicola Messina | AICamp 2021
  • 29. Wait… Can I use different types of attention???
  • 30.
  • 31. Transformers use a lot of memory! Attention Feed Forward Feed Forward . . . . 2 GB 2 GB 2 GB Davide Coccomini & Nicola Messina | AICamp 2021 USED MEMORY 2 GB 4 GB A LOT!
  • 32. Efficient Transformers Attention + FeedForward + - - Attention FeedForward Davide Coccomini & Nicola Messina | AICamp 2021 A new efficient Transformer variant | Lukasz Kaiser USED MEMORY 2 GB 4 GB Rev Attention!
  • 33. Patch Partition Davide Coccomini & Nicola Messina | AICamp 2021 Linear Embedding x2 Stage 1 Swin Transformer Block Patch Merging Swin Transformer Block x2 Stage 2 Patch Merging Swin Transformer Block x6 Stage 3 Patch Merging Swin Transformer Block x2 Stage 4 𝑯 × 𝑾 × 𝟑 𝑯 𝟒 × 𝑾 𝟒 × 𝟒𝟖 𝑯 𝟒 × 𝑾 𝟒 × 𝑪 𝑯 𝟖 × 𝑾 𝟖 × 𝟐𝑪 𝑯 𝟏𝟔 × 𝑾 𝟏𝟔 × 𝟒𝑪 𝑯 𝟑𝟐 × 𝑾 𝟑𝟐 × 𝟖𝑪 Swin Transformers Swin Transformer: Hierarchical Vision Transformer using Shifted Windows | Ze Liu et al.
  • 34. 16x 4x 8x 16x Swin Transformers Swin Transformer Vision Transformer 16x 16x Davide Coccomini & Nicola Messina | AICamp 2021 Shifted Window based Self-Attention
  • 35. Self-Attention Layer l Self-Attention Layer l+1 Davide Coccomini & Nicola Messina | AICamp 2021 Swin Transformers
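The window trick on these slides can be sketched as two array operations: partition the patch grid into fixed windows so attention runs independently inside each one (cost linear in the number of patches), then, in the next layer, cyclically shift the grid by half a window before partitioning so border tokens can exchange information across windows. A toy illustration (not the Swin code; window size and grid are arbitrary):

```python
import numpy as np

def window_partition(x, w):
    """x: (H, W, d) patch grid -> (n_windows, w*w, d) token groups."""
    H, W, d = x.shape
    v = x.reshape(H // w, w, W // w, w, d).transpose(0, 2, 1, 3, 4)
    return v.reshape(-1, w * w, d)

x = np.arange(8 * 8, dtype=float).reshape(8, 8, 1)
wins = window_partition(x, 4)                 # 4 windows of 16 tokens each
# layer l+1: shift by w//2 before partitioning (Shifted Window attention)
shifted = np.roll(x, shift=(-2, -2), axis=(0, 1))
wins_shifted = window_partition(shifted, 4)
print(wins.shape, wins_shifted.shape)         # (4, 16, 1) (4, 16, 1)
```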
  • 36. Source: Swin Transformer Object Detection Demo – By DeepReader https://www.youtube.com/watch?v=FQVS_0Bja6o
  • 37. Can we do without Self-Attention?
  • 38. It is «just» a transformation! What essentially is the attention mechanism? Davide Coccomini & Nicola Messina | AICamp 2021 Attention Mechanism
  • 39. Input Attention Calculation Embeddings Feed Forward Add & Normalize Dense Output Prediction Add & Normalize Embeddings Fourier Network Fourier Transformation Davide Coccomini & Nicola Messina | AICamp 2021
  • 40. Why Fourier? It’s just a transformation! Image from mriquestion.com Fourier Transform Davide Coccomini & Nicola Messina | AICamp 2021
  • 41. Fourier Transform Fourier Transform Transforming over the hidden domain Transforming over the sequence domain Fourier Network What does it transform? Input Vectors Davide Coccomini & Nicola Messina | AICamp 2021 FNet: Mixing Tokens with Fourier Transforms | James Lee-Thorp et al.
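FNet's token mixing is parameter-free: apply a Fourier transform along the hidden dimension and along the sequence dimension, then keep the real part. A one-function sketch (illustrative shapes, not the authors' JAX code):

```python
import numpy as np

def fourier_mixing(x):
    """FNet mixing sublayer: 2D FFT over hidden (axis -1) and sequence
    (axis -2) dimensions, keeping only the real part. No learned weights."""
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=-2).real

x = np.random.default_rng(0).normal(size=(7, 16))  # 7 tokens, 16 dims
y = fourier_mixing(x)
print(y.shape)  # (7, 16): same shape, but every output mixes every token
```

Because the FFT is a fixed linear map, this replaces the O(n²) attention matrix with an O(n log n) transform while still letting information flow between all tokens.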
  • 42. MLP Mixer 1 2 3 4 5 6 7 8 9 Per-patch Fully Connected N x (Mixer Layer) Davide Coccomini & Nicola Messina | AICamp 2021 Global Average Pooling Fully-Connected CLASS MLP-Mixer: An all-MLP Architecture for Vision | Ilya Tolstikhin et al.
  • 43. Layer Norm MLP MLP MLP MLP Layer Norm MLP MLP MLP MLP Mixer Layer Davide Coccomini & Nicola Messina | AICamp 2021 Transforming over the sequence domain Transforming over the hidden domain
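The Mixer layer alternates the same two transformations as on this slide: a token-mixing MLP applied across the sequence domain (operating on transposed columns) and a channel-mixing MLP applied across the hidden domain, each with a residual connection. A minimal sketch with random weights and ReLU in place of the paper's GELU:

```python
import numpy as np

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2   # ReLU here; the paper uses GELU

def mixer_layer(x, Wt1, Wt2, Wc1, Wc2):
    """x: (n_patches, d)."""
    x = x + mlp(layer_norm(x).T, Wt1, Wt2).T   # mix over the sequence domain
    x = x + mlp(layer_norm(x), Wc1, Wc2)       # mix over the hidden domain
    return x

rng = np.random.default_rng(0)
n, d, h = 9, 32, 64                     # 9 patches, 32 dims, hidden width 64
x = rng.normal(size=(n, d))
out = mixer_layer(x,
                  rng.normal(size=(n, h)) * 0.1, rng.normal(size=(h, n)) * 0.1,
                  rng.normal(size=(d, h)) * 0.1, rng.normal(size=(h, d)) * 0.1)
print(out.shape)  # (9, 32)
```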
  • 44.
  • 45. Convolutional Neural Network Vision Transformer Not Improving Anymore Still Improving Learned Knowledge What happens during training? Davide Coccomini & Nicola Messina | AICamp 2021
  • 46. Why are they different? Able to find long-term dependencies Learns inductive biases Needs very large datasets for training Lack of global understanding Locality Sensitive Translation Invariant Davide Coccomini & Nicola Messina | AICamp 2021 Convolutional Neural Network Vision Transformers
  • 47. A different point of view ViTs are both local and global! The ViT learns only global information with low amount of data 0.7 0.01 Davide Coccomini & Nicola Messina | AICamp 2021 Heads focus on farther patches
  • 48. A different point of view ViTs are both local and global! The ViT learns also local information with more data Davide Coccomini & Nicola Messina | AICamp 2021 0.7 0.01 Higher layers heads still focus on farther patches 0.01 0.4 0.3 Lower layers heads focus on both farther and closer patches 0.6 0.06
  • 49. A different point of view They learn different representations! Similar representations through the layers Different representations through the layers Davide Coccomini & Nicola Messina | AICamp 2021 Do Vision Transformers See Like Convolutional Neural Networks? | Maithra Raghu et al.
  • 50. A different point of view Vision Transformers are very robust! Davide Coccomini & Nicola Messina | AICamp 2021 Occlusion Distribution Shift Adversarial Perturbation Permutation Intriguing Properties of Vision Transformers | Muzammal Naseer et al.
  • 51. Can we obtain the best of the two architectures?
  • 52.
  • 53. 28 x 28 24 x 24 8 x 8 12 x 12 4 x 4 Davide Coccomini & Nicola Messina | AICamp 2021 What happens in CNNs? Hey! They are patches!
  • 54. Convolutional Neural Network Transformer Encoder CLASS MLP Davide Coccomini & Nicola Messina | AICamp 2021 Hybrids A possible configuration! Combining EfficientNet and Vision Transformers for Video Deepfake Detection | Coccomini et al.
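One way to read the hybrid configuration above: the convolutional backbone's final feature map already is a grid of "patches", so each spatial position becomes one token for the transformer encoder. A shape-level sketch under that assumption (the 7×7×1280 map is illustrative, e.g. roughly what an EfficientNet-style backbone might output; this is not the paper's code):

```python
import numpy as np

# hypothetical CNN output: a low-resolution feature map of the input image
feat = np.random.default_rng(0).normal(size=(7, 7, 1280))
tokens = feat.reshape(-1, 1280)   # 49 tokens, one per spatial location
cls = np.zeros((1, 1280))         # [CLS] token for classification
x = np.vstack([cls, tokens])      # (50, 1280) -> fed to a transformer encoder
print(x.shape)
```

The CNN supplies the locality and translation-invariance biases; the transformer on top supplies the global, long-range reasoning.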
  • 55.
  • 56. Recap Use a pure Vision Transformer Improve the internal attention mechanism Use an alternative transformation Combine CNNs with Vision Transformers How can we use Transformers in Vision?
  • 57. ViT-G/14 CoAtNet-7 ViT-MoE-15B 1° 2° 3° 90.45% top-1 accuracy Vision Transformer Pretrained on JFT 1843M parameters 90.88% top-1 accuracy Conv + Vision Transformer Pretrained on JFT 2440M parameters 90.35% top-1 accuracy Vision Transformer Pretrained on JFT 14700M parameters ImageNet Ranking
  • 59. Video Supervised Learning DINO Source: Advancing the state of the art in computer vision with self-supervised Transformers and 10x more efficient training – Facebook AI
  • 60. Source: Advancing the state of the art in computer vision with self-supervised Transformers and 10x more efficient training – Facebook AI
  • 61. Source: Paint Transformer: Feed Forward Neural Painting with Stroke Prediction
  • 62. Source: Image GPT – By OpenAI
  • 63.
  • 64. Thank You for the Attention! Any question?

Editor's notes

  1. They plot CKA similarities between all pairs of layers across different model architectures. ViTs show a relatively uniform layer-similarity structure, with a clear grid-like pattern and high similarity between lower and higher layers. By contrast, the ResNet models show clear stages in their similarity structure, with lower similarity between lower and higher layers.