End-to-End Object Detection with
Transformers
(DETR)
Nicolas Carion & Francisco Massa et al., “End-to-End Object Detection with Transformers”, ECCV2020
8th November, 2020
PR12 Paper Review
JinWon Lee
Samsung Electronics
End-to-End Object Detection with Transformers
Reference
• Facebook AI Blog
 https://ai.facebook.com/blog/end-to-end-object-detection-with-
transformers/
• Author’s YouTube
 https://youtu.be/utxbUlo9CyY
• Official Code
 https://github.com/facebookresearch/detr
Object Detection
• Predict a set of bounding boxes with labels, given an image
Classical Approach to Detection
• Popular approach: detection := classification of boxes
• Requires selecting a subset of candidate boxes
• Regression step to refine the predictions
• Typically non-differentiable
DETR in a nutshell
• Set prediction formulation
• No geometric priors (NMS, anchors, …)
• Fully differentiable
• Competitive performance
• Extensible to other tasks
Reframing the Task of Object Detection
• DETR casts the object detection task as an image-to-set problem.
• Given an image, the model must predict an unordered set of all the
objects present, each represented by its class, along with a tight
bounding box surrounding each one.
• This formulation is particularly suitable for Transformers.
Transformers
• Transformers are a deep learning architecture that has gained
popularity in recent years.
• They rely on a simple yet powerful mechanism called attention,
which enables AI models to selectively focus on certain parts of their
input and thus reason more effectively.
• Transformers have been widely applied on problems with sequential
data, and have also been extended to tasks as diverse as speech
recognition, symbolic mathematics, and reinforcement learning.
• But, perhaps surprisingly, computer vision has not yet been swept up by the Transformer revolution.
Streamlined Detection Pipeline
DETR
Object Detection Set Prediction Loss
• DETR infers a fixed-size set of N predictions, where N is set to be significantly larger than the typical number of objects in an image.
• Let us denote by $y$ the ground truth set of objects, and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of $N$ predictions.
• $y$ is also considered as a set of size $N$ padded with $\emptyset$ (no object).
Object Detection Set Prediction Loss
• To find a bipartite matching between these two sets, we search for a permutation of $N$ elements $\sigma \in \mathfrak{S}_N$ with the lowest cost:
$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i}^{N} \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$$
$y_i = (c_i, b_i)$, where $c_i$ is the target class label (which may be $\emptyset$) and $b_i \in [0, 1]^4$ is a vector of box center coordinates and its height and width.
$$\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)}) = -\mathbb{1}_{\{c_i \neq \emptyset\}}\,\hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \emptyset\}}\,\mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)})$$
$\hat{p}_{\sigma(i)}(c_i)$ is the probability of class $c_i$ and $\hat{b}_{\sigma(i)}$ is the predicted box.
• $\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$ is a pair-wise matching cost, and this optimal assignment is computed efficiently with the Hungarian algorithm.
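As a rough sketch of how this matching step can be implemented, the snippet below builds a cost matrix from the class probabilities and a box-cost term and solves the assignment with SciPy's linear_sum_assignment (a standard Hungarian-algorithm solver). Function and argument names are illustrative rather than the official repo's API.

```python
import torch
from scipy.optimize import linear_sum_assignment


def hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes, box_cost_fn):
    """Match N predictions to the ground-truth objects of one image (illustrative).

    pred_logits: (N, num_classes + 1), pred_boxes: (N, 4)
    tgt_labels:  (M,) long tensor,     tgt_boxes:  (M, 4), with M <= N
    box_cost_fn: callable returning an (N, M) box cost matrix (e.g. L1 + GIoU terms).
    """
    prob = pred_logits.softmax(-1)                 # class probabilities
    cost_class = -prob[:, tgt_labels]              # (N, M): the -p_hat_{sigma(i)}(c_i) term
    cost_box = box_cost_fn(pred_boxes, tgt_boxes)  # (N, M): the L_box term
    cost = (cost_class + cost_box).detach().cpu().numpy()
    pred_idx, tgt_idx = linear_sum_assignment(cost)  # Hungarian algorithm
    return torch.as_tensor(pred_idx), torch.as_tensor(tgt_idx)
```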
Loss Function – Hungarian Loss
• A linear combination of a negative log-likelihood for class prediction and a box loss:
$$\mathcal{L}_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \emptyset\}}\,\mathcal{L}_{box}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$
• In practice, the log-probability term is down-weighted by a factor of 10 when $c_i = \emptyset$ to account for class imbalance.
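A minimal sketch of the class-prediction term with that down-weighting, implemented here as a cross-entropy class weight of 0.1 for the ∅ slot; the slot arrangement and names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def classification_loss(pred_logits, matched_labels, num_classes, no_object_weight=0.1):
    """NLL term of the Hungarian loss (illustrative sketch).

    matched_labels: (N,) target class for every prediction slot, arranged according
    to the optimal matching, with unmatched slots set to the 'no object' index
    (= num_classes). Down-weighting that class by 10x is done via a weight of 0.1.
    """
    weights = torch.ones(num_classes + 1)
    weights[num_classes] = no_object_weight
    return F.cross_entropy(pred_logits, matched_labels, weight=weights)
```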
Bounding Box Loss
• Unlike many detectors that predict boxes as a Δ w.r.t. some initial guesses, DETR makes box predictions directly.
• The most commonly used $\ell_1$ loss will have different scales for small and large boxes even if their relative errors are similar.
• To mitigate this issue, DETR uses a linear combination of the $\ell_1$ loss and the generalized IoU loss, which is scale-invariant:
$$\mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{iou}\,\mathcal{L}_{iou}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{L1}\,\lVert b_i - \hat{b}_{\sigma(i)} \rVert_1$$
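A hedged sketch of this box loss for already-matched pairs, using torchvision's generalized_box_iou plus a small cxcywh-to-xyxy helper; λ_iou = 2 and λ_L1 = 5 follow the paper's reported defaults, and the function names are illustrative.

```python
import torch
from torchvision.ops import generalized_box_iou


def box_cxcywh_to_xyxy(b):
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], dim=-1)


def box_loss(pred_boxes, tgt_boxes, lambda_iou=2.0, lambda_l1=5.0):
    """L_box for matched pairs, both tensors (M, 4) in normalized (cx, cy, w, h)."""
    l1 = torch.abs(pred_boxes - tgt_boxes).sum(-1)            # per-pair L1 term
    giou = generalized_box_iou(box_cxcywh_to_xyxy(pred_boxes),
                               box_cxcywh_to_xyxy(tgt_boxes)).diagonal()
    return (lambda_iou * (1.0 - giou) + lambda_l1 * l1).mean()
```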
Overall Architecture of DETR
Backbone
• Starting from the initial image $x_{img} \in \mathbb{R}^{3 \times H_0 \times W_0}$ (with 3 color channels), a conventional CNN backbone generates a lower-resolution activation map $f \in \mathbb{R}^{C \times H \times W}$.
• Typical values DETR uses are $C = 2048$ and $H, W = \frac{H_0}{32}, \frac{W_0}{32}$.
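A minimal sketch of such a backbone using torchvision's ResNet-50 with the pooling and classification head stripped off (not the official repo code; the input size is chosen only for illustration).

```python
import torch
from torchvision.models import resnet50

# Standard ResNet-50 with the average pool and classification head removed,
# exposing the C5 activation map (no pretrained weights here, for brevity).
backbone = torch.nn.Sequential(*list(resnet50().children())[:-2])

x_img = torch.randn(1, 3, 800, 1056)   # x_img in R^{3 x H0 x W0}
f = backbone(x_img)                    # f in R^{C x H x W}
print(f.shape)                         # torch.Size([1, 2048, 25, 33]) -> C=2048, H0/32, W0/32
```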
Transformer Encoder
• A 1×1 convolution reduces the channel dimension from $C$ to a smaller dimension $d$, creating a new feature map $z_0 \in \mathbb{R}^{d \times H \times W}$.
• The encoder expects a sequence as input, hence the spatial dimensions of $z_0$ are collapsed into one dimension, resulting in a $d \times HW$ feature map.
• Since the transformer architecture is permutation-invariant, DETR supplements it with fixed positional encodings that are added to the input of each attention layer.
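A simplified sketch of this input preparation using PyTorch's built-in transformer modules; d = 256, 8 heads and 6 layers match the paper's defaults, the positional encoding is a random stand-in, and it is added once at the input rather than at every layer as DETR actually does.

```python
import torch
import torch.nn as nn

d, C, H, W = 256, 2048, 25, 33
input_proj = nn.Conv2d(C, d, kernel_size=1)   # 1x1 conv: C -> d

f = torch.randn(1, C, H, W)                   # backbone activation map
z0 = input_proj(f)                            # (1, d, H, W)
src = z0.flatten(2).permute(2, 0, 1)          # (HW, batch, d): the sequence fed to the encoder
pos = torch.randn_like(src)                   # stand-in for the fixed positional encodings

encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
memory = encoder(src + pos)                   # (HW, batch, d)
# Note: DETR adds the positional encoding at every attention layer;
# adding it once to the input is a simplification of that scheme.
```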
Transformer Decoder
• The decoder follows the standard architecture of the transformer,
transforming N embeddings of size d using multi-headed self- and
encoder-decoder attention mechanisms.
• The difference from the original transformer is that this model decodes the N objects in parallel at each decoder layer.
• Since the decoder is also permutation-invariant, the N input
embeddings must be different to produce different results.
• These input embeddings are learnt positional encodings referred to as object queries, and, similarly to the encoder, they are added to the input of each attention layer.
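A simplified sketch of the decoder side, with N = 100 learnt object queries decoded in parallel against the encoder output (again using PyTorch's built-in modules, so DETR's per-layer query injection is approximated by a single addition).

```python
import torch
import torch.nn as nn

d, N = 256, 100                              # hidden dim and number of object queries
query_embed = nn.Embedding(N, d)             # learnt positional encodings ("object queries")

decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

memory = torch.randn(825, 1, d)              # encoder output, (HW, batch, d)
tgt = torch.zeros(N, 1, d)                   # decoder input starts at zero...
queries = query_embed.weight.unsqueeze(1)    # ...and the object queries are added to it
hs = decoder(tgt + queries, memory)          # (N, batch, d): one embedding per predicted object
# DETR re-adds the queries at every decoder layer; adding them once is a simplification.
```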
Architecture of DETR’s Transformer
Prediction FFN and Auxiliary Decoding Losses
• The final prediction is computed by a 3-layer perceptron with ReLU
and hidden dimension 𝑑.
• The FFN predicts the normalized center coordinates, height and width of the box.
• Authors add prediction FFNs and the Hungarian loss after each decoder layer. All prediction FFNs share their parameters.
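A hedged sketch of these prediction heads; the class name, num_classes value, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PredictionHead(nn.Module):
    """Class and box heads on top of the decoder output (hidden dimension d)."""

    def __init__(self, d=256, num_classes=91):
        super().__init__()
        self.class_embed = nn.Linear(d, num_classes + 1)   # +1 for the "no object" class
        self.bbox_embed = nn.Sequential(                   # 3-layer perceptron with ReLU
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, 4))

    def forward(self, hs):                                 # hs: (N, batch, d) decoder output
        logits = self.class_embed(hs)                      # class scores per object query
        boxes = self.bbox_embed(hs).sigmoid()              # normalized (cx, cy, w, h)
        return logits, boxes
```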
DETR
https://youtu.be/utxbUlo9CyY
Experiments
• Experiments are on the COCO 2017 detection and panoptic segmentation datasets. There are 7 instances per image on average, and up to 63 instances in a single image in the training set.
• Two different backbones are used: a ResNet-50 (DETR) and a ResNet-101 (DETR-R101).
• Dilation in the last stage of the backbone is also used to increase the feature resolution: DETR-DC5 and DETR-DC5-R101 (dilated C5 stage).
• This modification increases the resolution by a factor of 2, thus improving performance for small objects, at the cost of a 16× higher cost in the self-attention of the encoder, leading to an overall 2× increase in computational cost.
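One way to obtain such a dilated-C5 backbone is torchvision's replace_stride_with_dilation option, sketched below (illustrative, not the official repo code).

```python
import torch
from torchvision.models import resnet50

# torchvision's ResNet accepts replace_stride_with_dilation, which removes the
# stride in the chosen stages and dilates their convolutions instead.
dc5 = resnet50(replace_stride_with_dilation=[False, False, True])   # dilate C5 only
backbone_dc5 = torch.nn.Sequential(*list(dc5.children())[:-2])

x = torch.randn(1, 3, 800, 1056)
f = backbone_dc5(x)
print(f.shape)   # torch.Size([1, 2048, 50, 66]) -- twice the resolution of the plain C5 map
```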
Results
Deformable DETR
(https://arxiv.org/abs/2010.04159)
Number of Encoder Layers
• Without encoder layers, overall AP drops by 3.9 points, with a more significant drop of 6.0 AP on large objects.
• Through global scene reasoning, the encoder is important for disentangling objects.
Encoder Self-Attention
Number of Decoder Layers
• A single decoding layer of the transformer is not able to compute any cross-correlation between the output elements, and thus it is prone to making multiple predictions for the same object  NMS improves performance in that case.
Decoder Attention
• Decoder attention is fairly local, meaning that it mostly attends to object extremities such as heads or legs.
Importance of Positional Encodings
Loss Ablations
Analysis
• Decoder output slot analysis
 DETR learns a different specialization for each query slot.
• Generalization to unseen numbers of
instances
 There is no image with more than 13 giraffes in
the training set.
 This experiment confirms that there is no strong
class-specialization in each object query.
Increasing the Number of Instances
• While the model detects all instances when up to 50 are visible, it
then starts saturating and misses more and more instances. Notably,
when the image contains all 100 instances, the model only detects 30
on average, which is less than if the image contains only 50 instances
that are all detected.
DETR for Panoptic Segmentation
• Similarly to the extension of Faster R-CNN to Mask R-CNN, DETR
can be naturally extended by adding a mask head on top of the
decoder outputs.
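A very rough sketch of that idea, assuming per-object attention maps over the encoder features refined by a small convolutional net; the real DETR mask head uses multi-head attention maps feeding an FPN-style decoder, so everything below is an illustrative simplification.

```python
import torch
import torch.nn as nn


class SimpleMaskHead(nn.Module):
    """Per-object attention maps over the encoder features, refined into mask logits."""

    def __init__(self, d=256, nheads=8):
        super().__init__()
        self.nheads = nheads
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.refine = nn.Sequential(nn.Conv2d(nheads, d, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(d, 1, 3, padding=1))

    def forward(self, obj_embed, memory_2d):
        # obj_embed: (N, d) decoder outputs, memory_2d: (d, H, W) encoder features
        d, H, W = memory_2d.shape
        q = self.q_proj(obj_embed).reshape(-1, self.nheads, d // self.nheads)     # (N, h, d/h)
        k = self.k_proj(memory_2d.flatten(1).t()).reshape(H * W, self.nheads, d // self.nheads)
        attn = torch.einsum('nhc,phc->nhp', q, k).reshape(-1, self.nheads, H, W)  # attention maps
        return self.refine(attn)                                                   # (N, 1, H, W)
```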
DETR for Panoptic Segmentation
DETR for Panoptic Segmentation
https://youtu.be/utxbUlo9CyY
Panoptic Segmentation Results
Panoptic Segmentation Results
Thank you
