GROUP NORMALIZATION &
RETHINKING IMAGENET PRE-TRAINING
I. GROUP NORMALIZATION
• METHODOLOGY
• EXPERIMENTS
II. RETHINKING IMAGENET PRE-TRAINING
• METHODOLOGY
• EXPERIMENTS
I. GROUP NORMALIZATION
BN's error increases rapidly when the batch size becomes smaller, because the batch statistics are estimated inaccurately from so few samples.
Figure: ImageNet classification error vs. batch size.
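To make this concrete, below is a minimal NumPy sketch (purely illustrative, not from the paper) of how noisy the per-batch estimate of the mean becomes as the batch size shrinks:

import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one channel's activations over the whole dataset.
activations = rng.normal(loc=0.0, scale=1.0, size=100_000)

for batch_size in (32, 8, 2):
    # Draw many batches and measure how much the per-batch mean fluctuates.
    batches = rng.choice(activations, size=(10_000, batch_size))
    batch_means = batches.mean(axis=1)
    print(f"batch size {batch_size:>2}: std of the estimated mean = {batch_means.std():.3f}")

# The spread of the estimate grows roughly as 1/sqrt(batch size), so the
# statistics BN normalizes with are far noisier at a batch size of 2.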
IMPLEMENTATION
Only need to specify how the mean and variance ("moments") are computed, along the appropriate axes as defined by the normalization method.
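As a concrete illustration, here is a minimal NumPy sketch of Group Normalization for NCHW feature maps (a sketch in the spirit of the description above, not the authors' code; G = 32 is used here as the default group number):

import numpy as np

def group_norm(x, gamma, beta, G=32, eps=1e-5):
    """Group Normalization for an NCHW tensor.

    x:           input features, shape (N, C, H, W)
    gamma, beta: per-channel scale and shift, shape (1, C, 1, 1)
    G:           number of groups; C must be divisible by G
    """
    N, C, H, W = x.shape
    x = x.reshape(N, G, C // G, H, W)
    # The only method-specific choice is the set of axes over which the
    # moments are computed: here, within each group of channels.
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    x = x.reshape(N, C, H, W)
    return x * gamma + beta

BN, LN and IN differ only in the axes over which the moments are taken: BN averages over (N, H, W) per channel, LN over (C, H, W) per sample, and IN over (H, W) per sample and channel.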
EXPERIMENTS: IMAGE CLASSIFICATION IN IMAGENET
Figure: evolution of the feature distributions of conv5-3's output (before normalization and ReLU) in VGG-16, shown as the {1, 20, 80, 99} percentiles of responses; the accompanying table reports ImageNet validation error (%). Models are trained with 32 images/GPU.
VGG models: for VGG-16, GN is better than BN by 0.4%. This possibly implies that VGG-16 benefits less from BN's regularization effect.
Varying the group number G: GN performs reasonably well for all values of G we studied.
Fixing the number of channels per group: note that because layers can have different channel counts, the group number G changes across layers in this setting (see the sketch below).
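A hypothetical helper, assuming 16 channels per group purely for illustration:

def groups_for_layer(num_channels: int, channels_per_group: int = 16) -> int:
    # With a fixed number of channels per group, the group number G
    # follows from each layer's channel count.
    assert num_channels % channels_per_group == 0
    return num_channels // channels_per_group

# Layers with 64, 128, 256 and 512 channels would use G = 4, 8, 16 and 32.
print([groups_for_layer(c) for c in (64, 128, 256, 512)])  # [4, 8, 16, 32]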
Deeper models: ResNet-101
Batch size = 32: BN baseline 22.0% error, GN 22.4% error.
Batch size = 2: BN baseline 31.9% error, GN 23.0% error.
OBJECT DETECTION AND SEGMENTATION IN COCO
GN is not fully trained with the default schedule, so we also tried increasing the iterations from 180k to 270k (BN* does not benefit from longer training).
Figure: detection and segmentation results in COCO when trained from scratch, using Mask R-CNN with FPN; here BN is synced across GPUs and is not frozen.
VIDEO CLASSIFICATION IN KINETICS
Figure: error curves in Kinetics with an input length of 32 frames, showing ResNet-50 I3D's validation error for BN (left) and GN (right) with batch sizes of 8 and 4 clips/GPU.
Table: video classification results in Kinetics, reported as the ResNet-50 I3D baseline's top-1 / top-5 accuracy (%).
II. RETHINKING IMAGENET PRE-TRAINING
We obtain competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization, no worse than their ImageNet pre-training counterparts. The ONLY change needed is to increase the number of training iterations so that the randomly initialized models can converge.
This holds EVEN WHEN
(i) using only 10% of the training data,
(ii) using deeper and wider models,
and (iii) evaluating multiple tasks and metrics.
We train Mask R-CNN with a ResNet-50 FPN and GroupNorm backbone on the COCO train2017 set and evaluate bounding box AP on the val2017 set.
Observations:
(i) ImageNet pre-training speeds up convergence.
(ii) ImageNet pre-training does not automatically give better regularization.
(iii) ImageNet pre-training shows no benefit when the target tasks/metrics are more sensitive to spatially well-localized predictions.
METHODOLOGY
1. Normalization
(i) Group Normalization (GN)
(ii) Synchronized Batch Normalization (SyncBN)
Small batch sizes severely degrade the accuracy of BN. This issue can be circumvented if pre-training is used, because fine-tuning can adopt the pre-training batch statistics as fixed parameters; however, freezing BN is invalid when training from scratch.
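To illustrate what adopting the pre-training statistics as fixed parameters looks like in practice, here is a minimal PyTorch-style sketch of a typical fine-tuning setup (an assumed recipe, not the paper's code):

import torch.nn as nn

def freeze_batchnorm(model: nn.Module) -> None:
    # Keep using the running mean/var accumulated during pre-training and
    # stop updating the affine parameters (gamma/beta).
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()  # normalize with the stored running statistics
            for param in module.parameters():
                param.requires_grad = False

# When training from scratch there are no pre-computed statistics to reuse,
# so BN cannot be frozen this way; GN or SyncBN is used instead.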
2. Convergence: models trained from random initialization must be trained for longer than typical fine-tuning schedules.
This suggests that a sufficiently large number of total samples (arguably counted in terms of pixels) is required for models trained from random initialization to converge well.
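As a rough sanity check, a back-of-the-envelope comparison of total pixels seen; every number below is an illustrative assumption (typical ImageNet/COCO settings), not a figure taken from the paper:

# Assumed, illustrative settings: ~100 ImageNet epochs at 224x224 vs. a long
# from-scratch COCO schedule (270k iterations, 16 images/batch, ~1333x800 inputs).
imagenet_pixels = 100 * 1_281_167 * 224 * 224     # epochs * images * pixels per image
coco_scratch_pixels = 270_000 * 16 * 1333 * 800   # iterations * images per batch * pixels per image

print(f"ImageNet pre-training: ~{imagenet_pixels:.1e} pixels")
print(f"COCO from scratch:     ~{coco_scratch_pixels:.1e} pixels")
# Both totals land on the same order of magnitude (a few times 1e12 pixels),
# consistent with from-scratch training needing a comparably large number of samples.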
TRAINING FROM SCRATCH TO MATCH ACCURACY
Our first surprising discovery is that, when only using the COCO data, models trained from scratch can catch up in accuracy with ones that are fine-tuned.
(i) Typical fine-tuning schedules (2×) work well for the pre-trained models, letting them converge to near optimum, but these schedules are not enough for models trained from scratch.
(ii) Models trained from scratch can catch up with their fine-tuned counterparts: their detection AP is no worse, and they catch up not just by chance on a single metric.
BREAKDOWN REGIME
I. 1k COCO training images
Figure: training with 1k COCO images (shown as the loss on the training set). The randomly initialized model can catch up in training loss, but it reaches lower validation accuracy (3.4 AP) than its pre-trained counterpart (9.9 AP).
This is a sign of strong overfitting due to the severe lack of data. The breakdown point in the COCO dataset is somewhere between 3.5k and 10k training images.
II. PASCAL VOC
There are 15k VOC images used for training, but these images have on average 2.3 instances per image (vs. COCO's ∼7) and 20 categories (vs. COCO's 80).
We suspect that having fewer instances (and categories) has a similar negative impact as insufficient training data, which can explain why training from scratch on VOC is not able to catch up as observed on COCO.
Using ImageNet pre-training: 82.7 mAP at 18k iterations.
Trained from scratch: 77.6 mAP at 144k iterations.
MAIN OBSERVATIONS
• Training from scratch on target tasks is possible without architectural changes.
• Training from scratch requires more iterations to sufficiently converge.
• Training from scratch can be no worse than its ImageNet pre-training counterpart under many circumstances, down to as few as 10k COCO images.
• ImageNet pre-training speeds up convergence on the target task.
• ImageNet pre-training does not necessarily help reduce overfitting unless we enter a very small data regime.
• ImageNet pre-training helps less if the target task is more sensitive to localization than classification.
A FEW IMPORTANT QUESTIONS
• Is ImageNet pre-training necessary? No.
• Is ImageNet helpful? Yes.
• Do we need big data? Yes.
• Should we pursue universal representations? Yes.