Anchor free object detection by deep learning

Anchor Free Object Detection
by Deep Learning
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California

Outline
 UnitBox: An Advanced Object Detection Network
 DenseBox
 YOLO v1/v2/v3
 CornerNet
 ExtremeNet
 FSAF: Feature Selective Anchor-Free
 FCOS: Fully Convolutional One-Stage
 FoveaBox
 Center and Scale Prediction: A Box-free Approach for Object Detection
 Region Proposal by Guided Anchoring (GA-RPN)
 CenterNet: Objects as Points
 CenterNet: Keypoint Triplets for Object Detection
 CornerNet-Lite: Efficient Keypoint Based Object Detection

UnitBox: An Advanced Object Detection Network
 Deep CNN methods assume the object bounds to be four independent
variables, which could be regressed by the l2 loss separately.
 Such an oversimplified assumption is contrary to the well-received
observation that those variables are correlated, resulting in less accurate
localization.
 To address the issue, use an Intersection over Union (IoU) loss function for
bounding box prediction, which regresses the four bounds of a predicted box
as a whole unit.
 By taking advantage of the IoU loss and a deep FCN, UnitBox is introduced,
which performs accurate and efficient localization, is robust to objects of
varied shapes and scales, and converges fast.
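
The idea of regressing the four bounds as one unit can be sketched as follows. This is a minimal illustration rather than the UnitBox implementation: it assumes each box is encoded per pixel as four non-negative distances (top, bottom, left, right) from the pixel to the box sides and that the loss is -ln(IoU); the function name `iou_loss` and the `eps` guard are hypothetical.

```python
import math


def iou_loss(pred, target, eps=1e-9):
    """-ln(IoU) for one pixel.

    Each box is (top, bottom, left, right): distances from the pixel to the
    four sides of the box. This encoding is an assumption of the sketch.
    """
    pt, pb, pl, pr = pred
    gt, gb, gl, gr = target
    pred_area = (pt + pb) * (pl + pr)
    target_area = (gt + gb) * (gl + gr)
    # Both boxes contain the pixel, so the overlap extends by the smaller
    # distance in each of the four directions.
    inter_h = min(pt, gt) + min(pb, gb)
    inter_w = min(pl, gl) + min(pr, gr)
    inter = inter_h * inter_w
    union = pred_area + target_area - inter
    return -math.log((inter + eps) / (union + eps))


# The same relative error gives the same loss at two different scales.
small = iou_loss((2, 2, 2, 2), (1, 1, 1, 1))
large = iou_loss((20, 20, 20, 20), (10, 10, 10, 10))
```

Because the loss depends only on the overlap ratio, `small` and `large` agree, whereas an l2 loss on the four distances would grow with object size.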

Illustration of IoU loss and l2 loss for
pixel-wise bounding box prediction.
The Architecture of UnitBox Network.

Compared to l2 loss, the IoU loss is much more robust to scale variations for bounding box prediction.

DenseBox: Unifying Landmark Localization and Object Detection
 Fully convolutional neural network (FCN);
 Directly predicts bounding boxes and object class confidences through all
locations and scales of an image;
 Improve accuracy with landmark localization during multi-task learning.
Pipeline: 1) An image pyramid is fed to the network. 2) After several layers of convolution and
pooling, the feature map is upsampled and further convolution layers produce the final output. 3)
The output feature map is converted to bounding boxes, and non-maximum suppression is applied to all
bounding boxes over the threshold.
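
Step 3 of the pipeline can be illustrated with a small greedy non-maximum suppression routine. This is a generic sketch, not DenseBox's code; the `(x1, y1, x2, y2)` box format and both threshold values are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Return indices of kept boxes, highest score first."""
    # Drop low-confidence boxes, then visit the rest in descending score order.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = nms(boxes, scores)
```

Here the second box is suppressed by the first (their IoU is about 0.68) and the last box falls below the score threshold.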

DenseBox: Landmark Localization and Object Detection
DenseBox
DenseBox with landmark localization

You Only Look Once (YOLO) for Object Detection
The YOLO Detection System
The system models detection as a regression problem to
a 7 × 7 × 24 tensor. This tensor encodes bounding boxes and
class probabilities for all objects in the image.
The network uses strided conv. layers to
downsample the feature space instead
of maxpooling layers. Pre-train the conv.
layers on the ImageNet classification
task and then double the resolution for
detection.
Note: More localization errors, relatively low recall.

YOLO9000: Better, Faster, Stronger
 Detect over 9000 object categories: http://pjreddie.com/yolo9000/;
 YOLOv2, 67 FPS, 76.8 mAP on VOC 2007; 40 FPS, 78.6 mAP;
 Jointly train on object detection COCO and classification ImageNet;
 Batch Normalization: 2% improvement in mAP;
 High Resolution Classifier: full 448 × 448 resolution, almost 4% up in mAP;
 Convolutional With Anchor Boxes: use anchor boxes to predict bounding boxes;
 Dimension Clusters: k-means on the training set bounding boxes to
automatically find good priors to adjust the boxes appropriately;
 Direct location prediction: predict location relative to location of the grid cell;
 Fine-Grained Features: 13 × 13 map, passthrough layer from 26 × 26 resolution;
 Multi-Scale Training: every 10 batches, randomly choose a new image dimension size.

Bounding boxes with dimension priors and
location prediction. Predict the width and
height of the box as offsets from cluster
centroids. Predict the center coordinates of the
box relative to the location of filter application
using a sigmoid function.
Clustering box dimensions on VOC and COCO. K-means
clustering on the dimensions of bounding boxes to get good
priors for model. Left: the average IOU we get with various
choices for k. Find that k = 5 gives a good tradeoff for recall vs.
complexity of the model. Right: relative centroids for VOC and
COCO. Both sets of priors favor thinner, taller boxes while COCO
has greater variation in size than VOC.
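
The location prediction described above (center relative to the grid cell through a sigmoid, width and height relative to a cluster prior) can be sketched as below. The exponential scaling of the prior follows the YOLOv2 formulation as commonly stated; the function and argument names are hypothetical, and coordinates are in grid-cell units.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Turn raw network outputs into a box in grid-cell units.

    t     = (tx, ty, tw, th): raw predictions for one anchor
    cell  = (cx, cy): top-left corner of the grid cell
    prior = (pw, ph): width and height of the dimension cluster
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    bx = cx + sigmoid(tx)   # the sigmoid keeps the center inside the cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)  # width and height scale the prior
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# Zero outputs place the box at the cell center with the prior's size.
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(2.0, 5.0))
```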

 Based on the GoogLeNet architecture, faster than VGG-16;
 Darknet-19: 19 convolutional layers and 5 maxpooling layers;
 Training for classification: Darknet, data augmentation;
 Training for detection: remove the last conv. layer, add on three 3 × 3
conv. layers with 1024 filters each followed by a final 1 × 1 conv. layer;
 Hierarchical classification: build WordTree, a hierarchical model of visual
concepts, from WordNet;
 Dataset combination with WordTree: combine labels from ImageNet &
COCO;
 Joint classification and detection: use the COCO detection dataset
and the top 9000 classes from the full ImageNet release;
 YOLO9000: WordTree with 9418 classes.

Combining datasets using WordTree hierarchy

YOLO v3
 Predict bounding boxes using dimension clusters as anchor
boxes like YOLO9000;
 Predict an objectness score for each bounding box using
logistic regression;
 Use binary cross-entropy loss for the class predictions;
 Predict boxes at 3 different scales:
 Extract features from those scales using a similar concept to
feature pyramid networks;
 Add several convolutional layers and the last of these predicts a
3-d tensor encoding bounding box, objectness, and class
predictions;
 Take the feature map from 2 layers previous and upsample it by
2× and then merge with a feature map from earlier in the network;
 Use k-means clustering to determine bounding box priors (9
clusters).
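
As a rough illustration of the 3-d output tensor, the sketch below computes its shape at each of the three scales. It assumes 3 of the 9 priors per scale, 80 classes (COCO), a 416 × 416 input, and strides of 32, 16 and 8; none of these values is stated on the slide, and the function name is hypothetical.

```python
def yolo_v3_output_shapes(input_size=416, num_classes=80,
                          boxes_per_scale=3, strides=(32, 16, 8)):
    """Shape (rows, cols, depth) of the prediction tensor at each scale."""
    # Each box carries 4 coordinates, 1 objectness score, and class scores.
    depth = boxes_per_scale * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, depth) for s in strides]


shapes = yolo_v3_output_shapes()
```

Under these assumptions the three tensors are 13 × 13 × 255, 26 × 26 × 255 and 52 × 52 × 255.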

YOLO v3
 A hybrid approach between the network used in YOLOv2,
DarkNet-19, and that newfangled residual network stuff.
 It uses successive 3x3, 1x1 convolutional layers with some shortcut
connections;
 It has 53 convolutional layers called DarkNet-53;
 At 320 × 320 it runs in 22 ms at 28.2 mAP, as accurate as SSD but three
times faster.

CornerNet: Detecting Objects as Paired Keypoints
 CornerNet detects an object bounding box as a pair of keypoints, the
top-left corner and the bottom-right corner, using a single convolutional neural
network.
 By detecting objects as paired keypoints, it eliminates the need for designing a
set of anchor boxes commonly used in prior single-stage detectors.
 In addition, it introduces corner pooling, a type of pooling layer that helps the
network better localize corners.
Bounding box predictions overlaid on predicted heatmaps of corners.

Detect an object as a pair of bounding box corners grouped together. A convolutional network outputs a
heatmap for all top-left corners, a heatmap for all bottom-right corners, and an embedding vector for each
detected corner. The network is trained to predict similar embeddings for corners from the same object.

Corner pooling: for each channel, we take the maximum values
(red dots) in two directions (red lines), each from a separate
feature map, and add the two maximums together (blue dot).
“Ground-truth” heatmaps for training.
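
Corner pooling for the top-left corner, as described in the caption above, can be sketched on plain nested lists. This is an illustrative single-channel version, not the CornerNet code: it assumes the vertical branch looks downward and the horizontal branch looks to the right from each location, and the name `top_left_corner_pool` is hypothetical.

```python
def top_left_corner_pool(f_t, f_l):
    """Top-left corner pooling for one channel.

    f_t, f_l: two feature maps (lists of rows) of identical size.
    For every location, take the max of f_t from that row downward and the
    max of f_l from that column rightward, then add the two maxima.
    """
    h, w = len(f_t), len(f_t[0])
    t = [row[:] for row in f_t]
    l = [row[:] for row in f_l]
    # Running maximum from the bottom row upward.
    for i in range(h - 2, -1, -1):
        for j in range(w):
            t[i][j] = max(t[i][j], t[i + 1][j])
    # Running maximum from the rightmost column leftward.
    for i in range(h):
        for j in range(w - 2, -1, -1):
            l[i][j] = max(l[i][j], l[i][j + 1])
    return [[t[i][j] + l[i][j] for j in range(w)] for i in range(h)]


pooled = top_left_corner_pool([[0, 1], [2, 0]], [[3, 0], [0, 4]])
```

The running maxima make the cost linear in the number of pixels rather than scanning a full row and column for every location.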

The backbone network is followed by two prediction modules, one for the top-left corners and the
other for the bottom-right corners. Using the predictions from both modules, we locate and group
the corners.
A “pull” loss trains the network to group the corners of the same object, and a
“push” loss separates the corners of different objects.
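
A minimal sketch of the two grouping losses, assuming one-dimensional embeddings and a margin of 1 as in the CornerNet paper's formulation: the pull term draws each corner embedding toward the mean of its pair, and the push term penalises pair means that are closer than the margin. The function name and list-based interface are hypothetical.

```python
def pull_push_loss(tl_embeds, br_embeds, delta=1.0):
    """Return (pull, push) for N objects.

    tl_embeds[k], br_embeds[k]: scalar embeddings of the top-left and
    bottom-right corners of object k.
    """
    n = len(tl_embeds)
    means = [(a + b) / 2 for a, b in zip(tl_embeds, br_embeds)]
    # Pull: corners of the same object should share one embedding.
    pull = sum((a - m) ** 2 + (b - m) ** 2
               for a, b, m in zip(tl_embeds, br_embeds, means)) / n
    # Push: mean embeddings of different objects should be at least
    # delta apart.
    push = 0.0
    if n > 1:
        push = sum(max(0.0, delta - abs(means[k] - means[j]))
                   for k in range(n) for j in range(n) if j != k)
        push /= n * (n - 1)
    return pull, push


well_separated = pull_push_loss([0.0, 2.0], [0.0, 2.0])
confused = pull_push_loss([0.0, 1.0], [1.0, 1.0])
```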

ExtremeNet: Bottom-up Object Detection by
Grouping Extreme and Center Points
 Bottom-up approaches still perform competitively with respect to top-down approaches.
 ExtremeNet detects four extreme points and one center point of objects using a standard
keypoint estimation network.
 It groups the five keypoints into a bounding box if they are geometrically aligned.
 Object detection is then a purely appearance-based keypoint estimation problem,
without region classification or implicit feature learning.

The network predicts four extreme point heatmaps (top: the heatmaps overlaid on the
input image) and one center heatmap (bottom row, left) for each category. It enumerates
combinations of the peaks (middle left) of the four extreme point heatmaps and computes
the geometric center of the composed bounding box (middle right). A bounding box is
produced if and only if its geometric center has a high response in the center heatmap
(bottom right).
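
The grouping rule in this caption can be sketched as a brute-force enumeration over peak combinations. This is an illustration only: points are assumed to be `(x, y)` pixel coordinates, the ordering checks and the 0.1 center threshold are assumptions of the sketch, and `group_extreme_points` is a hypothetical name.

```python
from itertools import product


def group_extreme_points(tops, lefts, bottoms, rights,
                         center_heatmap, center_thresh=0.1):
    """Return boxes (x1, y1, x2, y2) whose geometric center is confirmed."""
    boxes = []
    for t, l, b, r in product(tops, lefts, bottoms, rights):
        # Skip combinations that cannot form a valid box.
        if not (l[0] <= t[0] <= r[0] and l[0] <= b[0] <= r[0]):
            continue
        if not (t[1] <= l[1] <= b[1] and t[1] <= r[1] <= b[1]):
            continue
        cx = (l[0] + r[0]) // 2
        cy = (t[1] + b[1]) // 2
        # Keep the box only if the center heatmap responds at its center.
        if center_heatmap[cy][cx] >= center_thresh:
            boxes.append((l[0], t[1], r[0], b[1]))
    return boxes


heat = [[0.0] * 5 for _ in range(5)]
heat[2][2] = 0.9
found = group_extreme_points(
    tops=[(2, 0), (2, 3)], lefts=[(0, 2)], bottoms=[(2, 4)], rights=[(4, 2)],
    center_heatmap=heat,
)
```

The second top candidate lies below the left and right points, so only one geometrically consistent box remains, and it is accepted because the heatmap peaks at its center.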

The network takes an image as input and produces four C-channel extreme point heatmaps, one
C-channel center heatmap, and four 2-channel category-agnostic offset maps. The heatmaps are trained by
weighted pixel-wise logistic regression, where the weight is used to reduce the false-positive penalty
near the ground truth location. The offset maps are trained with Smooth L1 loss applied at
ground truth peak locations.

In the case of multiple points being the extreme
point on one edge, the model predicts a segment
of low-confidence responses (a). Edge aggregation
enhances the confidence of the middle pixel (b).

FSAF: Feature Selective Anchor-Free Module
 Feature selective anchor-free (FSAF) module can be plugged into single shot
detectors with feature pyramid structure (FPN).
 The FSAF module addresses two limitations of anchor-based detection:
 1) heuristic-guided feature selection;
 2) overlap-based anchor sampling.
 The general concept of the FSAF module is online feature selection applied to
the training of multi-level anchor-free branches.
 Specifically, an anchor-free branch is attached to each level of the feature
pyramid, allowing box encoding and decoding in the anchor-free manner at
an arbitrary level.
 In training, dynamically assign each instance to the most suitable feature level.
 At the time of inference, the FSAF module can work jointly with anchor-based
branches by outputting predictions in parallel.

Selected feature level in anchor-based branches may not be optimal.
FSAF module plugged into conventional anchor-based detection methods.
During training, each instance is assigned to a pyramid level via feature selection for setting up supervision.

Supervision for an instance in one feature level of the
anchor-free branches. We use focal loss for
classification and IoU loss for box regression.
Online feature selection mechanism. Each instance is passed through all
levels of anchor-free branches to compute the averaged classification
(focal) loss and regression (IoU) loss over effective regions. Then the level
with the minimal sum of the two losses is selected to set up the supervision
signals for that instance.
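
The selection step described here reduces to an argmin over pyramid levels. The sketch below assumes the per-level averaged focal and IoU losses for one instance have already been computed; the function name and the example loss values are made up for illustration.

```python
def select_feature_level(cls_losses, reg_losses):
    """Index of the pyramid level with the smallest summed loss.

    cls_losses[l], reg_losses[l]: averaged classification and regression
    losses of one instance at pyramid level l.
    """
    totals = [c + r for c, r in zip(cls_losses, reg_losses)]
    return min(range(len(totals)), key=totals.__getitem__)


# Hypothetical losses for one instance over three levels.
best_level = select_feature_level([0.9, 0.4, 0.7], [0.5, 0.3, 0.1])
```

In this example the middle level has the lowest total, so the instance would supervise that level only.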

Network architecture of RetinaNet with FSAF module. The FSAF module only introduces two
additional conv layers (dashed feature maps) per pyramid level, keeping the architecture fully
convolutional.

FCOS: Fully Convolutional One-Stage Object Detection
 FCOS is a fully convolutional one-stage object detector that solves object
detection in a per-pixel prediction fashion, analogous to semantic segmentation.
 FCOS is anchor-box free as well as proposal free.
 By eliminating the predefined set of anchor boxes, FCOS avoids the
complicated computation related to anchor boxes, such as calculating
overlap during training, and significantly reduces the training memory footprint.
 It also avoids all hyper-parameters related to anchor boxes, to which the
final detection performance is often very sensitive.
 With NMS as the only post-processing step, FCOS outperforms previous anchor-based
one-stage detectors with the advantage of being much simpler.

The network architecture of FCOS, where C3, C4, and C5 denote the feature maps of the backbone
network and P3 to P7 are the feature levels used for the final prediction. H × W is the height and width
of feature maps. ‘/s’ (s = 8, 16, ..., 128) is the down-sampling ratio of the level of feature maps to the
input image. As an example, all the numbers are computed with an 800 × 1024 input.
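
The feature map sizes for the 800 × 1024 example follow directly from the down-sampling ratios. The sketch below assumes sizes are rounded up at each level, which is an assumption about padding behaviour rather than something stated on the slide; `fcos_level_shapes` is a hypothetical helper.

```python
import math


def fcos_level_shapes(height, width, strides=(8, 16, 32, 64, 128)):
    """(H, W) of the feature maps P3 to P7 for a given input size."""
    return [(math.ceil(height / s), math.ceil(width / s)) for s in strides]


shapes = fcos_level_shapes(800, 1024)
```

For this input the levels P3 to P7 come out as 100 × 128, 50 × 64, 25 × 32, 13 × 16 and 7 × 8.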

ResNet-50 is used as the backbone. As shown in the figure, FCOS works well with a wide range of
objects including crowded, occluded, highly overlapped, extremely small and very large objects.

FoveaBox: Beyond Anchor-based Object Detector
 FoveaBox, an accurate and anchor-free framework for object detection.
 Object detectors that rely on anchors are limited by the design of those anchors.
 FoveaBox directly learns the object existing possibility and the bounding box
coordinates without anchor reference.
 (a) predicting category-sensitive semantic maps for the object existing possibility,
 (b) producing category-agnostic Bbox for each position as object candidate.
 The scales of target boxes are naturally associated with feature pyramid
representations for each input image.
 For the objects with arbitrary aspect ratios, FoveaBox brings in significant
improvement compared to the anchor-based detectors.
 FoveaBox shows great robustness and generalization ability to the changed
distribution of bounding box shapes.

FoveaBox object detector. For each output spatial position that potentially presents an
object, FoveaBox directly predicts the confidences for all target categories and the
bounding box.
FoveaBox network architecture. FoveaBox uses an FPN backbone on top of a feed-forward
ResNet architecture. To this backbone, FoveaBox attaches two subnetworks, one for
classification and one for bounding box prediction.

These results are based on ResNet-101, achieving a single model box AP of 38.9.

Center and Scale Prediction: A Box-free Approach for
Object Detection
 The detector scans for feature points all over the image, a task for which convolution is naturally suited.
 It goes for a higher-level abstraction, looking for central points where there
are objects, which deep models can find through high-level semantic abstraction.
 It also predicts the scales of the central points, again with a straightforward convolution.
 Object detection is simplified as a straightforward center and scale
prediction task through convolutions.
 Though structurally simple, it presents competitive accuracy on several
challenging benchmarks, like pedestrian detection and face detection.
 A cross-dataset evaluation is performed to assess the method’s generalization ability.

The overall pipeline of the proposed CSP (Center and Scale Prediction) detector. The final
convolutions have two channels, one is a heatmap indicating the locations of the centers (red dots),
and the other serves to predict the scales (yellow dotted lines) for each detected center.

Overall architecture of CSP, which mainly comprises two components, i.e. the feature extraction module and
the detection head. The feature extraction module concatenates feature maps of different resolutions into a
single one. The detection head merely contains a 3x3 convolutional layer, followed by two prediction layers,
one for the center location and the other for the corresponding scale.

Region Proposal by Guided Anchoring (GA-RPN)
 Guided Anchoring leverages semantic features to guide the anchoring.
 The method jointly predicts the locations where the centers of objects of interest are
likely to exist as well as the scales and aspect ratios at different locations.
 On top of the predicted anchor shapes, a feature adaption module mitigates
the feature inconsistency.
 Use of high-quality proposals to improve detection performance.
 The anchoring scheme can be seamlessly integrated into proposal methods and
detectors.
 Code: //github.com/open-mmlab/mmdetection.

GA-RPN framework. For each output feature map in the feature pyramid, use an anchor
generation module with two branches to predict the anchor location and shape, respectively. Then a
feature adaption module is applied to the original feature map to make the new feature map aware
of anchor shapes.

Anchor location target for multi-level features. Ground truth objects are assigned to
different feature levels according to their scales, and CR, IR, and OR (the center,
ignore, and outside regions) are defined on each level.

Examples of RPN proposals (top row) and GA-RPN proposals (bottom row).

CenterNet: Objects as Points
 Detection identifies objects as axis-aligned boxes in an image.
 CenterNet models an object as a single point: the center point of its bounding box.
 The detector uses keypoint estimation to find center points and regresses to all other
object properties, such as size, 3D location, orientation, and even pose.
 The center point based approach, CenterNet, is end-to-end differentiable, simpler,
faster, and more accurate than corresponding bounding box based detectors.
An object is modeled as the center point of its bounding box. The bounding box size and
other object properties are inferred from the keypoint feature at the center.

Difference between anchor-based detectors (a) and the center point detector (b). Best viewed on screen.
(a) Standard anchor based detection. (b) Center point based detection.
Anchors count as positive with an
overlap IoU > 0.7 to any object,
negative with an overlap IoU < 0.3,
or are ignored otherwise.
The center pixel is assigned to
the object. Nearby points have a
reduced negative loss. Object
size is regressed.
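
Center point based detection at inference time can be sketched as peak picking on the heatmap followed by reading the regressed size at each peak. This is a simplified single-class illustration: it assumes a peak is a location whose score is at least that of its 8 neighbours, omits any sub-pixel offset, and uses a made-up score threshold; `decode_centers` is a hypothetical name.

```python
def decode_centers(heatmap, sizes, score_thresh=0.3):
    """Return detections (x1, y1, x2, y2, score) from one class heatmap.

    heatmap[y][x]: center score; sizes[y][x]: regressed (width, height).
    """
    h, w = len(heatmap), len(heatmap[0])
    detections = []
    for y in range(h):
        for x in range(w):
            score = heatmap[y][x]
            if score < score_thresh:
                continue
            # 3 x 3 neighbourhood, clipped at the image border.
            window = [heatmap[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            if score >= max(window):  # local maximum, so it counts as a peak
                bw, bh = sizes[y][x]
                detections.append(
                    (x - bw / 2, y - bh / 2, x + bw / 2, y + bh / 2, score))
    return detections


heat = [[0.5, 0.1, 0.1],
        [0.1, 0.9, 0.1],
        [0.1, 0.1, 0.1]]
sizes = [[(2, 2)] * 3 for _ in range(3)]
dets = decode_centers(heat, sizes)
```

The 0.5 response in the corner is not reported because a stronger neighbour exists, which is how peak picking stands in for box-level NMS in this sketch.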

Model diagrams. The numbers are the strides. (a): Hourglass network, as is in CornerNet. (b): ResNet with transposed
convolutions. One 3 × 3 deformable convolutional layer is added before each up-sampling layer. Specifically, a deformable
convolution first changes the channels and a transposed convolution then upsamples the feature map (these two steps are
shown separately for 32 → 16 and together as a dashed arrow for 16 → 8 and 8 → 4). (c): The original DLA-34
(Deep Layer Aggregation) for semantic segmentation. (d): Modified DLA-34, with more skip connections from the bottom
layers and the convolutional layers in the upsampling stages upgraded to deformable convolutional layers.

CenterNet: Object Detection with Keypoint Triplets
 An efficient solution which explores the visual patterns within each cropped
region with minimal costs.
 The framework is built upon a representative one-stage keypoint-based detector
named CornerNet.
 CenterNet detects each object as a triplet, rather than a pair, of keypoints,
which improves both precision and recall.
 Two customized modules, cascade corner pooling and center pooling, enrich
the information collected by the top-left and bottom-right corners and
provide more recognizable information in the central regions,
respectively.

Architecture of CenterNet. A convolutional backbone network applies cascade corner pooling
and center pooling to output two corner heatmaps and a center keypoint heatmap, respectively.
Similar to CornerNet, a pair of detected corners and the similar embeddings are used to detect a
potential bounding box. Then the detected center keypoints are used to determine the final
bounding boxes.

(a) Center pooling takes the max values in both
horizontal and vertical directions. (b) Corner pooling
only takes the max values in boundary directions. (c)
Cascade corner pooling takes the max values in
both boundary directions and internal directions of
objects.
The structures of the center pooling module (a)
and the cascade top corner pooling module (b).
Center pooling and cascade corner pooling are
achieved by combining corner pooling in different
directions.
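
Center pooling as described in panel (a) can be sketched for a single channel: each location receives the maximum of its row plus the maximum of its column. This is an illustration only; it uses one feature map for both directions, whereas an actual module may use separate branches, and `center_pool` is a hypothetical name.

```python
def center_pool(f):
    """Center pooling for one channel given as a list of rows."""
    row_max = [max(row) for row in f]        # max in the horizontal direction
    col_max = [max(col) for col in zip(*f)]  # max in the vertical direction
    return [[row_max[i] + col_max[j] for j in range(len(f[0]))]
            for i in range(len(f))]


pooled = center_pool([[1, 2], [3, 0]])
```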

CornerNet-Lite: Efficient Keypoint Based Object Detection
 CornerNet-Lite is a combination of two efficient variants of CornerNet:
CornerNet-Saccade, which uses an attention mechanism to eliminate the
need for exhaustively processing all pixels of the image, and CornerNet-
Squeeze, which introduces a new compact backbone architecture.
 Together these two variants address the two critical use cases in efficient
object detection: improving efficiency without sacrificing accuracy, and
improving accuracy at real-time efficiency.
 CornerNet-Saccade is suitable for offline processing, improving the
efficiency of CornerNet by 6.0x and the AP by 1.0% on COCO.
 CornerNet-Squeeze is suitable for real-time detection, improving both the
efficiency and accuracy of the popular real-time detector YOLOv3 (34.4%
AP at 34ms for CornerNet-Squeeze compared to 33.0% AP at 39ms for
YOLOv3 on COCO).

Overview of CornerNet-Saccade. Predict a set of possible object locations from the attention maps and
bounding boxes generated on a downsized full image. Zoom into each location and crop a small region
around that location. Then detect objects in each region. Control the efficiency by ranking the object
locations and choosing top k locations to process. Finally, merge the detections by NMS.

 In contrast to CornerNet-Saccade, which focuses on a subset of the pixels to
reduce the amount of processing, CornerNet-Squeeze explores an alternative
approach of reducing the amount of processing per pixel.
 In CornerNet, most of the computational resources are spent on Hourglass-104.
 Hourglass-104 is built from residual blocks, each of which consists of two 3 × 3 convolution
layers and a skip connection.
 Although Hourglass-104 achieves competitive performance, it is expensive in
terms of number of parameters and inference time.
 To reduce the complexity of Hourglass-104, incorporate ideas from SqueezeNet
and MobileNets to design a lightweight hourglass architecture.
 SqueezeNet’s 3 strategies to reduce network complexity: (1) replacing 3 × 3
kernels with 1 × 1 kernels; (2) decreasing input channels to 3 × 3 kernels; (3)
downsampling late.
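
The effect of strategies (1) and (2) on model size can be checked with the usual weight count of a convolution layer, kernel × kernel × input channels × output channels. The sketch ignores biases, and the channel count of 256 is an arbitrary example, not a value from CornerNet-Squeeze.

```python
def conv_params(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * in_channels * out_channels


full_3x3 = conv_params(3, 256, 256)
as_1x1 = conv_params(1, 256, 256)         # strategy (1): 9x fewer weights
fewer_inputs = conv_params(3, 128, 256)   # strategy (2): halved inputs
```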

Qualitative examples on COCO validation set.
