This is the 217th paper review from the TensorFlow Korea paper-reading group PR12.
This time the paper is EfficientDet from Google Brain. As a follow-up to EfficientNet, it proposes an object detection method that targets both accuracy and efficiency. To that end, it introduces a weighted bidirectional feature pyramid network (BiFPN) and a detection-oriented compound scaling method similar to that of EfficientNet; please see the video for details.
논문링크: https://arxiv.org/abs/1911.09070
영상링크: https://youtu.be/11jDC8uZL0E
4. Intro.
• State-of-the-art object detectors become increasingly more expensive.
The AmoebaNet-based NAS-FPN detector requires 167M parameters and 3045B FLOPs (30x more than RetinaNet).
• Given real-world resource constraints such as robotics and self-driving cars, model efficiency becomes increasingly important for object detection.
• Although previous works tend to achieve better efficiency, they usually sacrifice accuracy and only focus on a specific or a small range of resource requirements.
5. A Natural Question
Is it possible to build a scalable detection architecture with both higher accuracy and better efficiency across a wide spectrum of resource constraints?
6. Two Challenges
1. Efficient multi-scale feature fusion
FPN has been widely used for multi-scale feature fusion, and PANet, NAS-FPN, and other studies have developed further network structures for cross-scale feature fusion.
Most previous works simply sum the input features up without distinction; however, since these features are at different resolutions, they usually contribute to the fused output feature unequally.
2. Model Scaling
Inspired by EfficientNet, the authors propose a compound scaling method for object detectors, which jointly scales up the resolution/depth/width for all backbone, feature network, and box/class prediction networks.
• Combining EfficientNet backbones with BiFPN and compound scaling yields EfficientDet.
7. Contributions
• Propose BiFPN, a weighted bidirectional feature network for easy and fast multi-scale feature fusion.
• Propose a new compound scaling method, which jointly scales up backbone, feature network, box/class network, and resolution in a principled way.
• Develop EfficientDet, a new family of one-stage detectors with significantly better accuracy and efficiency across a wide spectrum of resource constraints.
8. BiFPN – Problem Formulation
• Formally, given a list of multi-scale features $P^{in} = (P^{in}_{l_1}, P^{in}_{l_2}, \ldots)$, where $P^{in}_{l_i}$ represents the feature at level $l_i$, the goal is to find a transformation $f$ that can effectively aggregate different features and output a list of new features: $P^{out} = f(P^{in})$
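To make this formulation concrete, here is a minimal type-level sketch in Python; the names `FeaturePyramid` and `feature_network` are mine, not the paper's:

```python
from typing import Dict
import numpy as np

# A multi-scale feature list: pyramid level l_i -> feature map of shape (H_i, W_i, C).
FeaturePyramid = Dict[int, np.ndarray]

def feature_network(p_in: FeaturePyramid) -> FeaturePyramid:
    """The transformation f: aggregate features across levels and
    output a new list of features, one per input level."""
    raise NotImplementedError  # FPN, PANet, and BiFPN are all instances of f
```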
9. BiFPN – Problem Formulation
• FPN takes level 3-7 input features $P^{in} = (P_3^{in}, \ldots, P_7^{in})$, where $P_i^{in}$ represents a feature level with resolution $1/2^i$ of the input image.
• For instance, if the input resolution is 640x640, then $P_3^{in}$ represents feature level 3 ($640/2^3 = 80$) with resolution 80x80, while $P_7^{in}$ represents feature level 7 ($640/2^7 = 5$) with resolution 5x5.
• The conventional FPN aggregates multi-scale features in a top-down manner:
$$P_7^{out} = \mathrm{Conv}(P_7^{in})$$
$$P_6^{out} = \mathrm{Conv}(P_6^{in} + \mathrm{Resize}(P_7^{out}))$$
$$\cdots$$
$$P_3^{out} = \mathrm{Conv}(P_3^{in} + \mathrm{Resize}(P_4^{out}))$$
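A minimal NumPy sketch of this top-down aggregation, assuming nearest-neighbor upsampling for `Resize` and a placeholder 1x1 convolution for `Conv` (both simplifications are mine):

```python
import numpy as np

def resize(x: np.ndarray, factor: int = 2) -> np.ndarray:
    """Nearest-neighbor upsampling of an (H, W, C) feature map."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def conv(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Placeholder 1x1 convolution: (H, W, C) @ (C, C) -> (H, W, C)."""
    return x @ w

def fpn_top_down(p_in: dict, weights: dict) -> dict:
    """Conventional FPN: fuse levels 3-7 from the top down.
    p_in maps level i to an (H_i, W_i, C) array with H_i = H_3 / 2^(i-3)."""
    p_out = {7: conv(p_in[7], weights[7])}
    for i in range(6, 2, -1):  # levels 6, 5, 4, 3
        p_out[i] = conv(p_in[i] + resize(p_out[i + 1]), weights[i])
    return p_out

# Example: a 640x640 input gives P3..P7 with spatial sizes 80, 40, 20, 10, 5.
C = 8
p_in = {i: np.random.rand(640 // 2**i, 640 // 2**i, C) for i in range(3, 8)}
weights = {i: np.eye(C) for i in range(3, 8)}
p_out = fpn_top_down(p_in, weights)
assert p_out[3].shape == (80, 80, C)
```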
10. BiFPN – Cross-Scale Connections
• PANet adds an extra bottom-up path aggregation network.
• NAS-FPN employs neural architecture search to find a better cross-scale feature network topology, but it requires thousands of GPU hours during the search, and the discovered network is irregular and difficult to interpret or modify.
11. PANet
• Augmenting a top-down path propagates semantically strong features and enhances all features with reasonable classification capability, as in FPN.
• Augmenting a bottom-up path propagates low-level patterns, based on the fact that a high response to edges or instance parts is a strong indicator for accurately localizing instances.
12. NAS-FPN
• Adopts neural architecture search to discover a new feature pyramid architecture in a novel scalable search space covering all cross-scale connections.
• The discovered architecture, named NAS-FPN, consists of a combination of top-down and bottom-up connections to fuse features across scales.
13. BiFPN – Cross-Scale Connections
• PANet achieves better accuracy than FPN and NAS-FPN, but at the cost of more parameters and computation.
• First, remove nodes that have only one input edge: if a node has a single input edge with no feature fusion, it contributes less to a feature network that aims at fusing different features.
14. BiFPN – Cross-Scale Connections
• Second, add an extra edge from the original input to the output node if they are at the same level, in order to fuse more features without adding much cost.
• Third, unlike PANet, which has only one top-down and one bottom-up path, treat each bidirectional (top-down & bottom-up) path as one feature network layer, and repeat the same layer multiple times to enable more high-level feature fusion.
15. BiFPN – Weighted Feature Fusion
• When fusing multiple input features with different resolutions, a common way is to first resize them to the same resolution and then sum them up.
• Pyramid attention network introduces global self-attention upsampling to recover pixel localization.
• Since different input features are at different resolutions, they usually contribute to the output feature unequally. So, an additional weight is added for each input during feature fusion, letting the network learn the importance of each input feature.
16. Unbounded Fusion
$$O = \sum_i w_i \cdot I_i$$
• where $w_i$ is a learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel).
• The authors find that a scalar can achieve comparable accuracy to the other approaches with minimal computational cost. However, since the scalar weight is unbounded, it could potentially cause training instability.
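A minimal sketch of unbounded fusion with per-feature scalar weights (variable names are mine, and the inputs are assumed to be already resized to a common resolution):

```python
import numpy as np

def unbounded_fusion(inputs: list, w: np.ndarray) -> np.ndarray:
    """O = sum_i w_i * I_i with one learnable scalar per input.
    Nothing constrains w, so it can grow or flip sign during training."""
    return sum(w_i * x for w_i, x in zip(w, inputs))

inputs = [np.random.rand(40, 40, 8) for _ in range(3)]
w = np.array([0.5, -2.0, 10.0])  # unbounded: any real values are allowed
out = unbounded_fusion(inputs, w)
```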
17. Softmax-Based Fusion
$$O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i$$
• An intuitive idea is to apply softmax to each weight, such that all weights are normalized to a probability in the range 0 to 1, representing the importance of each input.
• However, the extra softmax leads to a significant slowdown on GPU hardware.
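The softmax variant, sketched under the same assumptions; the exponentials and normalization it adds are what cause the GPU slowdown:

```python
import numpy as np

def softmax_fusion(inputs: list, w: np.ndarray) -> np.ndarray:
    """O = sum_i softmax(w)_i * I_i; the weights form a probability distribution."""
    e = np.exp(w - w.max())  # subtract max for numerical stability
    probs = e / e.sum()
    return sum(p * x for p, x in zip(probs, inputs))
```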
18. Fast Normalized Fusion
$$O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i$$
• where $w_i \geq 0$ is ensured by applying a ReLU after each $w_i$, and $\epsilon = 0.0001$ is a small value to avoid numerical instability.
• The ablation study shows that this fast fusion approach has very similar learning behavior and accuracy to the softmax-based fusion, but runs up to 30% faster on GPUs.
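And a sketch of fast normalized fusion: ReLU keeps each weight non-negative, so dividing by their sum plus $\epsilon$ yields normalized weights in [0, 1] without any exponentials:

```python
import numpy as np

def fast_normalized_fusion(inputs: list, w: np.ndarray,
                           eps: float = 1e-4) -> np.ndarray:
    """O = sum_i (w_i / (eps + sum_j w_j)) * I_i, with w_i >= 0 via ReLU."""
    w = np.maximum(w, 0.0)      # ReLU: ensures w_i >= 0
    norm = w / (eps + w.sum())  # normalized weights lie in [0, 1]
    return sum(n * x for n, x in zip(norm, inputs))
```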
19. BiFPN
• BiFPN integrates both the bidirectional cross-scale connections and the fast normalized fusion.
$$P_6^{td} = \mathrm{Conv}\left(\frac{w_1 \cdot P_6^{in} + w_2 \cdot \mathrm{Resize}(P_7^{in})}{w_1 + w_2 + \epsilon}\right)$$
$$P_6^{out} = \mathrm{Conv}\left(\frac{w_1' \cdot P_6^{in} + w_2' \cdot P_6^{td} + w_3' \cdot \mathrm{Resize}(P_5^{out})}{w_1' + w_2' + w_3' + \epsilon}\right)$$
where $P_6^{td}$ is the intermediate feature at level 6 on the top-down pathway, and $P_6^{out}$ is the output feature at level 6 on the bottom-up pathway.
• Depthwise separable convolution is used for feature fusion.
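A minimal sketch of these two level-6 fusions; the helper names and the identity `conv` placeholder are mine, and a real BiFPN layer would use depthwise separable convolutions and repeat this pattern at every level:

```python
import numpy as np

def _fuse(inputs, w, eps=1e-4):
    """Fast normalized fusion (slide 18): ReLU the weights, then
    normalize by their sum plus eps."""
    w = np.maximum(np.asarray(w, dtype=float), 0.0)
    norm = w / (eps + w.sum())
    return sum(n * x for n, x in zip(norm, inputs))

def _up(x):
    """Nearest-neighbor 2x upsampling (Resize on the top-down pathway)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def _down(x):
    """Strided 2x downsampling (Resize on the bottom-up pathway)."""
    return x[::2, ::2, :]

def bifpn_level6(p6_in, p7_in, p5_out, w_td, w_out, conv=lambda x: x):
    """Level-6 node of one BiFPN layer. `conv` stands in for the
    paper's depthwise separable convolution."""
    p6_td = conv(_fuse([p6_in, _up(p7_in)], w_td))              # top-down
    p6_out = conv(_fuse([p6_in, p6_td, _down(p5_out)], w_out))  # bottom-up
    return p6_td, p6_out

# Shapes for a 640x640 input: P5 is 20x20, P6 is 10x10, P7 is 5x5.
C = 8
p6_td, p6_out = bifpn_level6(
    np.random.rand(10, 10, C), np.random.rand(5, 5, C),
    np.random.rand(20, 20, C), w_td=[1.0, 1.0], w_out=[1.0, 1.0, 1.0])
```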
22. EfficientNet – Compound Scaling Method
• $\alpha$, $\beta$, $\gamma$ are constants that can be determined by a small grid search.
• Intuitively, $\phi$ is a user-specified coefficient that controls how many more resources are available for model scaling.
23. EfficientNet – Compound Scaling Method
• Notably, the FLOPs of a regular convolution op are proportional to $d$, $w^2$, $r^2$: doubling network depth will double FLOPs, but doubling network width or resolution will increase FLOPs by four times. Since convolution ops usually dominate the computation cost in ConvNets, scaling a ConvNet with the above equation will approximately increase total FLOPs by $(\alpha \cdot \beta^2 \cdot \gamma^2)^\phi$.
• In the EfficientNet paper, $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ is enforced, so total FLOPs approximately increase by $2^\phi$.
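As a quick numeric check of this arithmetic (the $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ values are the grid-search result reported in the EfficientNet paper):

```python
# Per-step FLOPs multiplier under compound scaling: alpha * beta^2 * gamma^2.
alpha, beta, gamma = 1.2, 1.1, 1.15  # EfficientNet-B0 grid-search values
base = alpha * beta**2 * gamma**2    # = 1.2 * 1.21 * 1.3225 ~ 1.92, i.e. ~2
for phi in range(4):
    print(phi, round(base**phi, 2))  # total FLOPs grow roughly as 2^phi
```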
24. Compound Scaling
• Inspired by EfficientNet, a new compound scaling method is proposed that uses a simple compound coefficient $\phi$ to jointly scale up all dimensions of the backbone network, BiFPN network, class/box network, and resolution.
• Unfortunately, object detectors have many more scaling dimensions than image classification models, so a heuristic-based scaling approach is used.
25. Compound Scaling
• Backbone network
Use the same width/depth scaling coefficients as EfficientNet-B0 to B6.
• BiFPN network
Exponentially grow BiFPN width $W_{bifpn}$ (#channels), but linearly increase depth $D_{bifpn}$ (#layers), since depth needs to be rounded to small integers:
$$W_{bifpn} = 64 \cdot 1.35^\phi, \quad D_{bifpn} = 2 + \phi$$
• Box/class prediction network
Fix their width to always be the same as the BiFPN (i.e., $W_{pred} = W_{bifpn}$), but linearly increase the depth (#layers) using the equation:
$$D_{box} = D_{class} = 3 + \lfloor \phi / 3 \rfloor$$
• Input image resolution
Since feature levels 3-7 are used in the BiFPN, the input resolution must be divisible by $2^7 = 128$:
$$R_{input} = 512 + \phi \cdot 128$$
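A small sketch evaluating these scaling equations per compound coefficient $\phi$ (mapping $\phi$ = 0..6 to the model names D0..D6 is my assumption here):

```python
def efficientdet_config(phi: int) -> dict:
    """Evaluate the scaling equations on slide 25 for a compound coefficient."""
    return {
        "W_bifpn": int(64 * 1.35**phi),  # BiFPN width (#channels), exponential
        "D_bifpn": 2 + phi,              # BiFPN depth (#layers), linear
        "D_box": 3 + phi // 3,           # box/class head depth, 3 + floor(phi/3)
        "R_input": 512 + phi * 128,      # input resolution, multiple of 128
    }

for phi in range(7):  # D0 .. D6
    print(f"D{phi}:", efficientdet_config(phi))
```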
29. Ablation Study
• Disentangling Backbone and BiFPN
• BiFPN Cross-Scale Connections
Since FPN and PANet have only one top-down or bottom-up flow, they were repeated 5 times for a fair comparison.
32. Conclusion
• A weighted bidirectional feature network (BiFPN) and a customized compound scaling method are proposed to improve both accuracy and efficiency.
• EfficientDet-D7 achieves state-of-the-art 51.0 mAP on the COCO dataset with 52M parameters and 326B FLOPs, being 4x smaller and using 9.3x fewer FLOPs yet still more accurate (+0.3% mAP) than the best previous detector.
• EfficientDet is also up to 3.2x faster on GPUs and 8.1x faster on CPUs.