GROUP NORMALIZATION &
RETHINKING IMAGENET PRE-TRAINING
I. GROUP NORMALIZATION
• METHODOLOGY
• EXPERIMENTS
II. RETHINKING IMAGENET PRE-TRAINING
• METHODOLOGY
• EXPERIMENTS
I. GROUP NORMALIZATION
BN's error increases rapidly when the batch size becomes smaller, because the batch statistics are estimated inaccurately from so few samples.
Figure: ImageNet classification error vs. batch size.
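To make this concrete, below is a minimal NumPy sketch (purely illustrative, not from the paper) of how noisy the per-batch estimate of the mean becomes as the batch size shrinks:

import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one channel's activations over the whole dataset.
activations = rng.normal(loc=0.0, scale=1.0, size=100_000)

for batch_size in (32, 8, 2):
    # Draw many batches and measure how much the per-batch mean fluctuates.
    batches = rng.choice(activations, size=(10_000, batch_size))
    batch_means = batches.mean(axis=1)
    print(f"batch size {batch_size:>2}: std of the estimated mean = {batch_means.std():.3f}")

# The spread of the estimate grows roughly as 1/sqrt(batch size), so the
# statistics BN normalizes with are far noisier at a batch size of 2.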
IMPLEMENTATION
Only need to specify how the mean and variance ("moments") are computed, along the appropriate axes as defined by the normalization method.
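As a concrete illustration, here is a minimal NumPy sketch of Group Normalization for NCHW feature maps (a sketch in the spirit of the description above, not the authors' code; G = 32 is used here as the default group number):

import numpy as np

def group_norm(x, gamma, beta, G=32, eps=1e-5):
    """Group Normalization for an NCHW tensor.

    x:           input features, shape (N, C, H, W)
    gamma, beta: per-channel scale and shift, shape (1, C, 1, 1)
    G:           number of groups; C must be divisible by G
    """
    N, C, H, W = x.shape
    x = x.reshape(N, G, C // G, H, W)
    # The only method-specific choice is the set of axes over which the
    # moments are computed: here, within each group of channels.
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    x = x.reshape(N, C, H, W)
    return x * gamma + beta

BN, LN and IN differ only in the axes over which the moments are taken: BN averages over (N, H, W) per channel, LN over (C, H, W) per sample, and IN over (H, W) per sample and channel.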
EXPERIMENTS: IMAGE CLASSIFICATION IN IMAGENET
Figure: evolution of the feature distributions of conv5-3's output (before normalization and ReLU) in VGG-16, shown as the {1, 20, 80, 99} percentiles of responses; the accompanying table reports ImageNet validation error (%). Models are trained with 32 images/GPU.
VGG models: for VGG-16, GN is better than BN by 0.4%. This possibly implies that VGG-16 benefits less from BN's regularization effect.
Varying the group number G: GN performs reasonably well for all values of G we studied.
Fixing the number of channels per group: note that because layers can have different channel counts, the group number G changes across layers in this setting (see the sketch below).
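A hypothetical helper, assuming 16 channels per group purely for illustration:

def groups_for_layer(num_channels: int, channels_per_group: int = 16) -> int:
    # With a fixed number of channels per group, the group number G
    # follows from each layer's channel count.
    assert num_channels % channels_per_group == 0
    return num_channels // channels_per_group

# Layers with 64, 128, 256 and 512 channels would use G = 4, 8, 16 and 32.
print([groups_for_layer(c) for c in (64, 128, 256, 512)])  # [4, 8, 16, 32]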
Deeper models: ResNet-101
Batch size = 32: BN baseline 22.0% error, GN 22.4% error.
Batch size = 2: BN baseline 31.9% error, GN 23.0% error.
OBJECT DETECTION AND SEGMENTATION IN COCO
GN is not fully trained with the default schedule, so we also tried increasing the iterations from 180k to 270k (BN* does not benefit from longer training).
Figure: detection and segmentation results in COCO when trained from scratch, using Mask R-CNN with FPN; here BN is synced across GPUs and is not frozen.
VIDEO CLASSIFICATION IN KINETICS
Figure: error curves in Kinetics with an input length of 32 frames, showing ResNet-50 I3D's validation error for BN (left) and GN (right) with batch sizes of 8 and 4 clips/GPU.
Table: video classification results in Kinetics, reported as the ResNet-50 I3D baseline's top-1 / top-5 accuracy (%).
II. RETHINKING IMAGENET PRE-TRAINING
We obtain competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization, no worse than their ImageNet pre-training counterparts. The ONLY change needed is to increase the number of training iterations so that the randomly initialized models can converge.
This holds EVEN WHEN
(i) using only 10% of the training data,
(ii) using deeper and wider models,
and (iii) evaluating multiple tasks and metrics.
We train Mask R-CNN with a ResNet-50 FPN and GroupNorm backbone on the COCO train2017 set and evaluate bounding box AP on the val2017 set.
Observations:
(i) ImageNet pre-training speeds up convergence.
(ii) ImageNet pre-training does not automatically give better regularization.
(iii) ImageNet pre-training shows no benefit when the target tasks/metrics are more sensitive to spatially well-localized predictions.
METHODOLOGY
1. Normalization
(i) Group Normalization (GN)
(ii) Synchronized Batch Normalization (SyncBN)
Small batch sizes severely degrade the accuracy of BN. This issue can be circumvented if pre-training is used, because fine-tuning can adopt the pre-training batch statistics as fixed parameters; however, freezing BN is invalid when training from scratch.
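To illustrate what adopting the pre-training statistics as fixed parameters looks like in practice, here is a minimal PyTorch-style sketch of a typical fine-tuning setup (an assumed recipe, not the paper's code):

import torch.nn as nn

def freeze_batchnorm(model: nn.Module) -> None:
    # Keep using the running mean/var accumulated during pre-training and
    # stop updating the affine parameters (gamma/beta).
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()  # normalize with the stored running statistics
            for param in module.parameters():
                param.requires_grad = False

# When training from scratch there are no pre-computed statistics to reuse,
# so BN cannot be frozen this way; GN or SyncBN is used instead.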
2. Convergence: models trained from random initialization must be trained for longer than typical fine-tuning schedules.
This suggests that a sufficiently large number of total samples (arguably counted in terms of pixels) is required for models trained from random initialization to converge well.
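As a rough sanity check, a back-of-the-envelope comparison of total pixels seen; every number below is an illustrative assumption (typical ImageNet/COCO settings), not a figure taken from the paper:

# Assumed, illustrative settings: ~100 ImageNet epochs at 224x224 vs. a long
# from-scratch COCO schedule (270k iterations, 16 images/batch, ~1333x800 inputs).
imagenet_pixels = 100 * 1_281_167 * 224 * 224     # epochs * images * pixels per image
coco_scratch_pixels = 270_000 * 16 * 1333 * 800   # iterations * images per batch * pixels per image

print(f"ImageNet pre-training: ~{imagenet_pixels:.1e} pixels")
print(f"COCO from scratch:     ~{coco_scratch_pixels:.1e} pixels")
# Both totals land on the same order of magnitude (a few times 1e12 pixels),
# consistent with from-scratch training needing a comparably large number of samples.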
TRAINING FROM SCRATCH TO MATCH ACCURACY
Our first surprising discovery is that, when only using the COCO data, models trained from scratch can catch up in accuracy with ones that are fine-tuned.
(i) Typical fine-tuning schedules (2×) work well for the pre-trained models, letting them converge to near optimum, but these schedules are not enough for models trained from scratch.
(ii) Models trained from scratch can catch up with their fine-tuned counterparts: their detection AP is no worse, and they catch up not just by chance on a single metric.
BREAKDOWN REGIME
I. 1k COCO training images
Figure: training with 1k COCO images (shown as the loss on the training set). The randomly initialized model can catch up in training loss, but it reaches lower validation accuracy (3.4 AP) than its pre-trained counterpart (9.9 AP).
This is a sign of strong overfitting due to the severe lack of data. The breakdown point in the COCO dataset is somewhere between 3.5k and 10k training images.
II. PASCAL VOC
There are 15k VOC images used for training, but these images have on average 2.3 instances per image (vs. COCO's ∼7) and 20 categories (vs. COCO's 80).
We suspect that having fewer instances (and categories) has a similar negative impact as insufficient training data, which can explain why training from scratch on VOC is not able to catch up as observed on COCO.
Using ImageNet pre-training: 82.7 mAP at 18k iterations.
Trained from scratch: 77.6 mAP at 144k iterations.
MAIN OBSERVATIONS
• Training from scratch on target tasks is possible without architectural changes.
• Training from scratch requires more iterations to sufficiently converge.
• Training from scratch can be no worse than its ImageNet pre-training counterpart under many circumstances, down to as few as 10k COCO images.
• ImageNet pre-training speeds up convergence on the target task.
• ImageNet pre-training does not necessarily help reduce overfitting unless we enter a very small data regime.
• ImageNet pre-training helps less if the target task is more sensitive to localization than classification.
A FEW IMPORTANT QUESTIONS
• Is ImageNet pre-training necessary? No.
• Is ImageNet helpful? Yes.
• Do we need big data? Yes.
• Should we pursue universal representations? Yes.