This is the 284th paper review from PR12, the TensorFlow Korea paper-reading group.
This paper is DETR (DEtection TRansformer) from Facebook.
It is also the top-ranked paper on arxiv-sanity's top recent / last year list (http://www.arxiv-sanity.com/top?timefilter=year&vfilter=all).
With ViT recently submitted to ICLR 2021, there has been much talk about whether Transformers will now replace CNNs. This paper was presented at ECCV this year, and although it still uses a CNN for feature extraction, I consider it an important work that proposes an effective way to perform object detection with a Transformer. The paper points out that detection pipelines rely heavily on heuristic, non-differentiable components such as anchor boxes and NMS (Non-Maximum Suppression), which is why object detection, unlike most deep learning problems, has not been solved in a truly end-to-end fashion, the guiding philosophy of deep learning. As a solution, it casts bounding box prediction as a set prediction problem (no duplicates, order-invariant) and proposes an end-to-end algorithm built on a Transformer. If you would like the details of DETR, which needs neither anchor boxes nor NMS, please watch the video!
Video link: https://youtu.be/lXpBcW_I54U
Paper link: https://arxiv.org/abs/2005.12872
PR-284: End-to-End Object Detection with Transformers(DETR)
1. End-to-End Object Detection with Transformers (DETR)
Nicolas Carion & Francisco Massa et al., “End-to-End Object Detection with Transformers”, ECCV2020
8th November, 2020
PR12 Paper Review
JinWon Lee
Samsung Electronics
5. Classical Approach to Detection
• Popular approach: detection := classification of boxes
• Requires selecting a subset of candidate boxes
• Regression step to refine the predictions
• Typically non-differentiable
6. DETR in a nutshell
• Set prediction formulation
• No geometric priors (NMS, anchors, …)
• Fully differentiable
• Competitive performance
• Extensible to other tasks
7. Reframing the Task of Object Detection
• DETR casts the object detection task as an image-to-set problem.
• Given an image, the model must predict an unordered set of all the
objects present, each represented by its class, along with a tight
bounding box surrounding each one.
• This formulation is particularly suitable for Transformers.
8. Transformers
• Transformers are a deep learning architecture that has gained
popularity in recent years.
• They rely on a simple yet powerful mechanism called attention,
which enables AI models to selectively focus on certain parts of their
input and thus reason more effectively.
• Transformers have been widely applied on problems with sequential
data, and have also been extended to tasks as diverse as speech
recognition, symbolic mathematics, and reinforcement learning.
• But, perhaps surprisingly, computer vision has not yet been swept up
by the Transformer revolution.
11. Object Detection Set Prediction Loss
• DETR infers a fixed-size set of N predictions, where N is set to be
significantly larger than the typical number of objects in an image.
• Let us denote by $y$ the ground truth set of objects, and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of $N$ predictions.
• $y$ is also considered as a set of size $N$ padded with $\varnothing$ (no object).
12. Object Detection Set Prediction Loss
• To find a bipartite matching between these two sets we search for a
permutation of $N$ elements $\sigma \in \mathfrak{S}_N$ with the lowest cost:
$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i}^{N} \mathcal{L}_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)})$$
• $y_i = (c_i, b_i)$, where $c_i$ is the target class label (which may be $\varnothing$) and
$b_i \in [0,1]^4$ is a vector of box center coordinates and its height and width.
$$\mathcal{L}_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)}) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}(b_i, \hat{b}_{\sigma(i)})$$
where $\hat{p}_{\sigma(i)}(c_i)$ is the probability of class $c_i$ and $\hat{b}_{\sigma(i)}$ is the predicted box.
• $\mathcal{L}_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)})$ is a pair-wise matching cost, and this optimal assignment
is computed efficiently with the Hungarian algorithm.
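The bipartite matching step above can be sketched with SciPy's `linear_sum_assignment`, which implements the Hungarian algorithm. The cost values below are made-up toy numbers, not taken from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy example: 3 ground-truth slots vs. 3 predictions.
# cost[i, j] = L_match(y_i, y_hat_j); lower is better.
cost = np.array([
    [0.2, 0.9, 0.7],
    [0.8, 0.1, 0.6],
    [0.5, 0.9, 0.3],
])

# Hungarian algorithm: one-to-one assignment minimizing the total cost.
gt_idx, pred_idx = linear_sum_assignment(cost)
total_cost = cost[gt_idx, pred_idx].sum()
# Each ground-truth object is matched to exactly one prediction,
# which is what removes the need for NMS-style deduplication.
```

Because the assignment is a permutation, no two ground-truth objects can be matched to the same prediction, which is exactly the "no duplicates" property of the set formulation.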
13. Loss Function – Hungarian Loss
• A linear combination of a negative log-likelihood for class prediction
and a box loss
$$\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$
In practice, the log-probability term is down-weighted by a factor of 10
when $c_i = \varnothing$ to account for class imbalance.
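The loss above, after matching, can be sketched as follows. The function name, the class probabilities, and the box-loss values are hypothetical toy inputs; only the structure (negative log-likelihood plus box loss for real objects, down-weighted log term for the "no object" class) follows the slide:

```python
import numpy as np

def hungarian_loss(probs, classes, box_losses, no_object_idx, eos_coef=0.1):
    """Per-image Hungarian loss after matching (minimal sketch).

    probs:      (N, num_classes) predicted class probabilities, already
                reordered to follow the optimal permutation sigma_hat.
    classes:    (N,) matched target class indices (no_object_idx for padding).
    box_losses: (N,) precomputed L_box values for each matched pair.
    eos_coef:   down-weight for the "no object" log-probability term.
    """
    total = 0.0
    for p, c, lb in zip(probs, classes, box_losses):
        nll = -np.log(p[c])
        if c == no_object_idx:
            total += eos_coef * nll   # class-imbalance down-weighting
        else:
            total += nll + lb         # box loss only for real objects
    return total

# Toy numbers: 2 real objects + 1 "no object" slot (class index 2).
probs = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.1, 0.7]])
classes = [0, 1, 2]
box_losses = [0.3, 0.2, 0.0]
loss = hungarian_loss(probs, classes, box_losses, no_object_idx=2)
```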
14. Bounding Box Loss
• Unlike many detectors that make box predictions as a Δ w.r.t. some
initial guesses, DETR makes box predictions directly.
• The most commonly-used ℓ1 loss will have different scales for small
and large boxes even if their relative errors are similar.
• To mitigate this issue, DETR uses a linear combination of the ℓ1 loss
and the generalized IoU loss, which is scale-invariant.
$$\mathcal{L}_{\mathrm{box}}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{\mathrm{iou}}\, \mathcal{L}_{\mathrm{iou}}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{\mathrm{L1}}\, \| b_i - \hat{b}_{\sigma(i)} \|_1$$
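A minimal single-pair sketch of the box loss, assuming boxes are given as (cx, cy, w, h) normalized to [0, 1] with positive width and height; DETR computes this batched in PyTorch. The λ values below are the ones reported in the paper (λ_iou = 2, λ_L1 = 5):

```python
import numpy as np

def generalized_iou(b1, b2):
    """GIoU for two (cx, cy, w, h) boxes with positive w, h."""
    # Convert center format to corner format.
    x1a, y1a = b1[0] - b1[2] / 2, b1[1] - b1[3] / 2
    x2a, y2a = b1[0] + b1[2] / 2, b1[1] + b1[3] / 2
    x1b, y1b = b2[0] - b2[2] / 2, b2[1] - b2[3] / 2
    x2b, y2b = b2[0] + b2[2] / 2, b2[1] + b2[3] / 2

    inter_w = max(0.0, min(x2a, x2b) - max(x1a, x1b))
    inter_h = max(0.0, min(y2a, y2b) - max(y1a, y1b))
    inter = inter_w * inter_h
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both.
    cw = max(x2a, x2b) - min(x1a, x1b)
    ch = max(y2a, y2b) - min(y1a, y1b)
    c_area = cw * ch
    return iou - (c_area - union) / c_area  # GIoU in [-1, 1]

def box_loss(b_gt, b_pred, l_iou=2.0, l_l1=5.0):
    # L_box = lambda_iou * (1 - GIoU) + lambda_L1 * ||b - b_hat||_1
    giou_loss = 1.0 - generalized_iou(b_gt, b_pred)
    l1 = float(np.abs(np.array(b_gt) - np.array(b_pred)).sum())
    return l_iou * giou_loss + l_l1 * l1
```

Unlike plain IoU, GIoU stays informative (and negative) even when boxes do not overlap, which is what makes the combined loss usable for direct box regression.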
16. Backbone
• Starting from the initial image $x_{\mathrm{img}} \in \mathbb{R}^{3 \times H_0 \times W_0}$ (with 3 color channels),
a conventional CNN backbone generates a lower-resolution activation
map $f \in \mathbb{R}^{C \times H \times W}$.
• Typical values DETR uses are $C = 2048$ and $H, W = \frac{H_0}{32}, \frac{W_0}{32}$.
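The shape bookkeeping above, as a tiny sketch; the input resolution is an assumed example, not a value from the paper:

```python
# Backbone output shape for a stride-32 CNN such as ResNet-50.
H0, W0 = 800, 1216            # assumed input resolution (example only)
C = 2048                      # ResNet-50 final-stage channel count
H, W = H0 // 32, W0 // 32     # spatial downsampling by a factor of 32
# The activation map f fed to the transformer has shape (C, H, W).
```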
17. Transformer Encoder
• A new feature map $z_0 \in \mathbb{R}^{d \times H \times W}$ is created, where $d$ is smaller than $C$.
• The encoder expects a sequence as input, hence the spatial
dimensions of $z_0$ are collapsed into one dimension, resulting in a
$d \times HW$ feature map.
• Since the transformer architecture is permutation-invariant, DETR
supplements it with fixed positional encodings that are added to the
input of each attention layer.
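The flattening step and the fixed positional encoding can be sketched in numpy. DETR actually uses a fixed 2-D sine encoding; the 1-D sinusoidal variant over the HW positions below is a simplified assumption to show the mechanics:

```python
import numpy as np

d, H, W = 256, 16, 16            # feature dim and spatial size (example values)
z0 = np.random.randn(d, H, W)    # projected feature map z0 (d < C)

# Collapse spatial dimensions: (d, H, W) -> (d, H*W), one "token" per pixel.
seq = z0.reshape(d, H * W)

# Simplified 1-D sinusoidal positional encoding over the HW positions.
pos = np.arange(H * W)[None, :]          # (1, HW) position indices
i = np.arange(d // 2)[:, None]           # (d/2, 1) frequency indices
freq = 1.0 / (10000 ** (2 * i / d))
pe = np.empty((d, H * W))
pe[0::2] = np.sin(pos * freq)            # even channels: sine
pe[1::2] = np.cos(pos * freq)            # odd channels: cosine

encoder_input = seq + pe  # positional encodings added to the attention input
```

Without the positional term, the permutation-invariant attention layers would have no way to tell which token came from which pixel.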
18. Transformer Decoder
• The decoder follows the standard architecture of the transformer,
transforming N embeddings of size d using multi-headed self- and
encoder-decoder attention mechanisms.
• The difference with the original transformer is that our model
decodes the N objects in parallel at each decoder layer.
• Since the decoder is also permutation-invariant, the N input
embeddings must be different to produce different results.
• These input embeddings are learnt positional encodings that we
refer to as object queries, and similarly to the encoder, they are
added to the input of each attention layer.
20. Prediction FFN and Auxiliary Decoding Losses
• The final prediction is computed by a 3-layer perceptron with ReLU
and hidden dimension 𝑑.
• The FFN predicts the normalized center coordinates, height and
width of the box.
• Authors add prediction FFNs and Hungarian loss after each decoder
layer. All prediction FFNs share their parameters.
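The 3-layer box FFN can be sketched as follows. The random parameters, variable names, and dimensions are hypothetical; only the structure (two ReLU hidden layers, a linear output, and a sigmoid to normalize boxes to [0, 1]) follows the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_queries = 256, 100        # hidden dim and number of object queries N

def mlp_3layer(x, weights):
    """3-layer perceptron with ReLU (minimal sketch of the box FFN)."""
    for w, b in weights[:-1]:
        x = np.maximum(0.0, x @ w + b)   # hidden layers with ReLU
    w, b = weights[-1]
    return x @ w + b                     # linear output layer

# Hypothetical random parameters; in DETR the same FFN is shared
# across the auxiliary heads attached to every decoder layer.
box_ffn = [(rng.standard_normal((d, d)) * 0.01, np.zeros(d)) for _ in range(2)]
box_ffn.append((rng.standard_normal((d, 4)) * 0.01, np.zeros(4)))

decoder_out = rng.standard_normal((num_queries, d))  # one embedding per query
boxes = 1.0 / (1.0 + np.exp(-mlp_3layer(decoder_out, box_ffn)))  # -> [0, 1]
```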
22. Experiments
• Experiments on the COCO 2017 detection and panoptic segmentation
datasets. There are 7 instances per image on average, up to 63
instances in a single image in the training set.
• Two different backbones are used: a ResNet-50 (DETR) and a
ResNet-101 (DETR-R101).
• Dilation in the last stage of the backbone is also used to increase the
feature resolution: DETR-DC5 and DETR-DC5-R101 (dilated C5 stage).
• This modification increases the resolution by a factor of 2, thus
improving performance for small objects, at the cost of 16x more
computation in the self-attentions of the encoder, leading to an
overall 2x increase in computational cost.
25. Number of Encoder Layers
• Without encoder layers, overall AP drops by 3.9 points, with a more
significant drop of 6.0 AP on large objects.
• Through global scene reasoning, the encoder is important for
disentangling objects.
27. Number of Decoder Layers
• A single decoding layer of the transformer is not able to compute any
cross-correlations between the output elements, and thus it is prone
to making multiple predictions for the same object; in this setting,
NMS improves performance.
31. Analysis
• Decoder output slot analysis: DETR learns a different specialization
for each query slot.
• Generalization to unseen numbers of instances: there is no image
with more than 13 giraffes in the training set, yet the model still
detects larger synthetic groups. This experiment confirms that there
is no strong class-specialization in each object query.
32. Increasing the Number of Instances
• While the model detects all instances when up to 50 are visible, it
then starts saturating and misses more and more instances. Notably,
when the image contains all 100 instances, the model only detects 30
on average, which is less than if the image contains only 50 instances
that are all detected.
33. DETR for Panoptic Segmentation
• Similarly to the extension of Faster R-CNN to Mask R-CNN, DETR
can be naturally extended by adding a mask head on top of the
decoder outputs.