Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not
use or distribute these slides for commercial purposes. You may make copies of these
slides and use or distribute them for educational purposes as long as you
cite DeepLearning.AI as the source of the slides.
For the rest of the details of the license, see
https://creativecommons.org/licenses/by-sa/2.0/legalcode
High Performance Modeling
Welcome
Distributed Training
High Performance Modeling
Rise in computational requirements
● At first, training models is quick and easy
● Training models becomes more time-consuming
○ With more data
○ With larger models
● Longer training -> More epochs -> Less efficient
● Use distributed training approaches
Types of distributed training
● Data parallelism: In data parallelism, models are replicated onto
different accelerators (GPU/TPU) and data is split between them
● Model parallelism: When models are too large to fit on a single device
then they can be divided into partitions, assigning different partitions to
different accelerators
Data parallelism
Distributed training using data parallelism
Synchronous training
● All workers train and complete updates in sync
● Supported via all-reduce architecture
Asynchronous training
● Each worker trains and completes updates separately
● Supported via parameter server architecture
● More efficient, but can result in lower accuracy and
slower convergence
● If you want to distribute a model:
○ Supported in high-level APIs such as Keras/Estimators
○ For more control, you can use custom training loops
Making your models distribute-aware
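As a rough illustration of the custom-training-loop route, here is a minimal sketch of a distribute-aware training loop, assuming a toy model and random data (this is not the course's reference code); it uses MirroredStrategy, which is described below.
import tensorflow as tf

GLOBAL_BATCH_SIZE = 64
strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU

with strategy.scope():
    # variables created inside the scope become distributed (mirrored) variables
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

features = tf.random.normal([1024, 32])
labels = tf.random.uniform([1024], maxval=10, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        x, y = inputs
        with tf.GradientTape() as tape:
            logits = model(x, training=True)
            per_example_loss = loss_object(y, logits)
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    per_replica_losses = strategy.run(step_fn, args=(dist_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for batch in dist_dataset:
    train_step(batch)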
● Library in TensorFlow for running a computation in multiple devices
● Supports distribution strategies for high-level APIs like Keras and
custom training loops
● Convenient to use with little or no code changes
tf.distribute.Strategy
Distribution Strategies supported by tf.distribute.Strategy
● One Device Strategy
● Mirrored Strategy
● Parameter Server Strategy
● Multi-Worker Mirrored Strategy
● Central Storage Strategy
● TPU Strategy
One Device Strategy
● Single device - no distribution
● Typical usage of this strategy is testing your code before
switching to other strategies that actually distribute your code
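A minimal sketch of the One Device Strategy (the model here is a placeholder): everything runs on the single named device, which makes it easy to verify the code before switching strategies.
import tensorflow as tf

strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")  # or "/cpu:0"

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
    model.compile(optimizer="sgd", loss="mse")
# Later, swap in another strategy (e.g. MirroredStrategy) without touching the model code.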
Mirrored Strategy
● This strategy is typically used for training on one machine with multiple
GPUs
○ Creates a replica per GPU; variables are mirrored across replicas
○ Weight updating is done using efficient cross-device communication
algorithms (all-reduce algorithms)
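With Keras, mirrored training typically needs little more than building and compiling the model inside the strategy scope; a minimal sketch with toy data (not the course's reference code):
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():  # variables created here are mirrored on every GPU
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

x = tf.random.normal([1024, 32])
y = tf.random.uniform([1024], maxval=10, dtype=tf.int64)
model.fit(x, y, batch_size=64, epochs=2)  # gradient updates are all-reduced across replicas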
● Some machines are designated as workers and others as parameter
servers
○ Parameter servers store variables so that workers can perform
computations on them
● Implements asynchronous data parallelism by default
Parameter Server Strategy
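A hedged sketch of creating a Parameter Server Strategy: it assumes a TF_CONFIG environment variable describing the cluster (coordinator, workers, and parameter servers) has already been set up, which is not shown here.
import tensorflow as tf

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# Training is then driven from the coordinator, e.g. via model.fit with a
# tf.keras.utils.experimental.DatasetCreator or a ClusterCoordinator-based custom loop.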
● A catastrophic failure in one worker can cause the whole distributed training job to fail
● How to enable fault tolerance in case a worker dies?
○ By restoring training state upon restart from job failure
○ Keras implementation: BackupAndRestore callback
Fault tolerance
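A minimal sketch of the BackupAndRestore callback with a toy model: if a worker is restarted mid-training, fit resumes from the last backed-up epoch instead of starting over.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer="sgd", loss="mse")

# In older TF 2.x releases this callback lives under tf.keras.callbacks.experimental.
backup_callback = tf.keras.callbacks.BackupAndRestore(backup_dir="/tmp/train_backup")

x = tf.random.normal([256, 8])
y = tf.random.normal([256, 1])
model.fit(x, y, epochs=10, callbacks=[backup_callback])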
High Performance Modeling
High-performance Ingestion
Sometimes data can’t fit into memory, and CPUs can be under-utilized during compute-intensive tasks like training a complex model
You should avoid these inefficiencies so that you can make the most of the hardware available → use input pipelines
Why input pipelines?
● Extract: read data from local storage (HDD/SSD) or remote storage (GCS/HDFS)
● Transform: shuffling & batching, decompression, augmentation, vectorization, ...
● Load: load the transformed data onto an accelerator
tf.data: TensorFlow Input Pipeline
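A minimal sketch of the Extract → Transform → Load pattern with tf.data; the GCS file pattern and parse_example() schema below are hypothetical placeholders for your own data.
import tensorflow as tf

def parse_example(serialized):
    # Transform: parse, decode, and normalize one record (hypothetical schema)
    features = {"image": tf.io.FixedLenFeature([], tf.string),
                "label": tf.io.FixedLenFeature([], tf.int64)}
    parsed = tf.io.parse_single_example(serialized, features)
    image = tf.image.resize(tf.io.decode_jpeg(parsed["image"], channels=3), [224, 224]) / 255.0
    return image, parsed["label"]

dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord"))  # Extract
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)                      # Transform
    .shuffle(10_000)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # Load: overlap data preparation with training on the accelerator
)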
[Diagram: a sequential ETL timeline — each extract, transform, and load step finishes before training starts, so the accelerator sits idle during data preparation and the CPU sits idle during training]
Inefficient ETL process
[Diagram: a pipelined ETL timeline — extract, transform, and load steps for upcoming batches run while the current batch trains, keeping the accelerator 100% utilized and minimizing disk and CPU idle time]
An improved ETL process
[Diagram: without pipelining, the CPU prepares a batch while the GPU/TPU is idle and then idles while the GPU/TPU trains; with pipelining, prepare and train steps overlap, so both devices stay busy]
Pipelining
How to optimize pipeline performance?
● Prefetching
● Parallelize data extraction and transformation
● Caching
● Reduce memory
# benchmark() and ArtificialDataset are assumed helper utilities (e.g. as in the
# tf.data performance guide) that time iteration over a synthetic dataset.
benchmark(
    ArtificialDataset()
    .prefetch(tf.data.experimental.AUTOTUNE)
)
Optimize with prefetching
[Chart: per-epoch timeline of the prefetched dataset — open, read, and train steps overlap, so each epoch takes less time]
Parallelize data extraction
● Time-to-first-byte: Prefer local storage, since reading the first byte from remote storage takes significantly longer
● Read throughput: Maximize the aggregate bandwidth of remote
storage by reading more files
benchmark(
tf.data.Dataset.range(2)
.interleave(
ArtificialDataset,
num_parallel_calls=tf.data.experimental.AUTOTUNE
)
)
Parallel interleave
[Chart: per-epoch timeline with parallel interleave — opening and reading multiple files overlap, shortening each epoch]
Parallelize data transformation
● After data loading, the inputs may need preprocessing
● Element-wise preprocessing can be parallelized across CPU cores
● The optimal value for the level of parallelism depends on:
○ Size and shape of training data
○ Cost of the mapping transformation
○ Load the CPU is experiencing currently
● With tf.data you can use AUTOTUNE to set parallelism automatically
Parallel mapping
benchmark(
ArtificialDataset()
.map(
mapped_function,
num_parallel_calls=tf.data.AUTOTUNE
)
)
[Chart: per-epoch timeline with parallel map — the map transformation runs across multiple CPU cores, shortening each epoch]
Improve training time with caching
● In-memory: tf.data.Dataset.cache()
● Disk: tf.data.Dataset.cache(filename=...)
benchmark(
    ArtificialDataset().map(mapped_function).cache(),
    5  # assumed benchmark() parameter: number of epochs to iterate; after the
       # first epoch, data is served from the cache
)
Caching
[Chart: per-epoch timeline of the cached dataset — open, read, and map run only in the first epoch; later epochs read from the cache and finish much faster]
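Putting the optimizations above together, a sketch of one possible pipeline; the file pattern and preprocess() are placeholders for your own data, and the ordering (cache after the expensive map, prefetch last) follows the general guidance rather than a single required recipe.
import tensorflow as tf

def preprocess(record):
    # placeholder for parsing, decoding, and augmentation
    return record

files = tf.data.Dataset.list_files("/data/train-*.tfrecord")  # hypothetical path
dataset = (
    files
    .interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)  # parallel extraction
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)                      # parallel transformation
    .cache()                                                                    # reuse work after epoch 1
    .shuffle(1_000)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)                                                 # overlap with training
)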
Training Large Models -
The Rise of Giant Neural Nets
and Parallelism
High performance modeling
Rise of giant neural networks
● In 2014, the ImageNet winner was GoogLeNet, with 4 mil. parameters and a 74.8% top-1 accuracy
● In 2017, Squeeze-and-Excitation networks achieved 82.7% top-1 accuracy with 145.8 mil. parameters
A 36-fold increase in the number of parameters in just 3 years!
Issues training larger networks
● GPU memory has only increased by a factor of ~3
● Saturated the amount of memory available in Cloud TPUs
● Need for large-scale training of giant neural networks
Overcoming memory constraints
● Strategy #1 - Gradient Accumulation
○ Split each batch into mini-batches, accumulate the gradients, and apply the weight update only after the whole batch (see the sketch below)
● Strategy #2 - Memory swap
○ Swap activations back and forth between accelerator memory and CPU (host) memory
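A minimal sketch of gradient accumulation (Strategy #1) in a custom loop with a toy model and random data: gradients from several micro-batches are summed, and the weight update is applied once per effective batch.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

ACCUM_STEPS = 4  # micro-batches per effective batch
x = tf.random.normal([256, 8])
y = tf.random.normal([256, 1])
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(16)  # micro-batch size 16

accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
for step, (features, labels) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(features, training=True)) / ACCUM_STEPS
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if (step + 1) % ACCUM_STEPS == 0:
        # apply the accumulated update only after the whole effective batch
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]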
Parallelism revisited
● Data parallelism: In data parallelism, models are replicated onto
different accelerators (GPU/TPU) and data is split between them
● Model parallelism: When models are too large to fit on a single device
then they can be divided into partitions, assigning different partitions to
different accelerators
Challenges in data parallelism
[Chart: communication overhead (% of time) for VGG16, ResNet50, and AlexNet trained on 4, 8, and 16 K80, Titan X, and V100 accelerators — overhead grows as more devices are added]
● Accelerators have limited memory
● Model parallelism: large networks can be trained
○ But, accelerator compute capacity is underutilized
● Data parallelism: train same model with different input data
○ But, the maximum model size an accelerator can support is limited
Challenges keeping accelerators busy
[Diagram: with naive model parallelism, forward passes F0–F3 and backward passes B3–B0 run on Devices 0–3 one device at a time, leaving large idle "bubbles" before each update; with pipeline parallelism, each mini-batch is split into micro-batches (F0,0 … F3,3 and B3,3 … B0,0), so the devices work concurrently and only a small bubble remains before the weight updates]
Pipeline parallelism
Pipeline parallelism
● Integrates both data and model parallelism:
○ Divide mini-batch data into micro-batches
○ Different workers work on different micro-batches in parallel
○ Allow ML models to have significantly more parameters
GPipe - Key features
● Open-source TensorFlow library (using Lingvo)
● Inserts communication primitives at the partition boundaries
● Automatic parallelism to reduce memory consumption
● Gradient accumulation across micro-batches, so that model quality is
preserved
● Partitioning is heuristic-based
[Chart: GPipe speedup on AmoebaNet-D (4, 512) — compared with a naive 2-partition baseline, pipelining with 2, 4, and 8 partitions gives progressively higher speedup]
GPipe Results
Knowledge Distillation
Teacher and Student
Networks
Sophisticated models and their problems
● Larger, more sophisticated models become increasingly complex
● Complex models learn complex tasks
● Can we express this learning more efficiently?
Is it possible to ‘distill’ or concentrate this complexity into smaller
networks?
[Image: GoogLeNet — an example of a large, complex architecture]
● Duplicate the performance of a
complex model in a simpler model
Knowledge distillation
[Diagram: a teacher model and its loss, with Student 1 and Student 2 learning from it]
● Idea: Create a simple ‘student’
model that learns from a complex
‘teacher’ model
Knowledge Distillation
Knowledge Distillation
Techniques
Teacher and student
● Training objectives of the models vary
● Teacher (normal training)
○ maximizes the actual metric
● Student (knowledge transfer)
○ matches the probability distribution of the teacher’s predictions to form ‘soft targets’
○ ‘Soft targets’ tell us about the knowledge learned by the teacher
Dark knowledge
Transferring “dark knowledge” to the student
● Improve softness of the teacher’s
distribution with ‘softmax
temperature’ (T)
● As T grows, you get more insight
about which classes the teacher finds
similar to the predicted one
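A small illustration of softmax temperature (the logits are made up): raising T softens the distribution and exposes the relative similarity the teacher assigns to the non-top classes.
import tensorflow as tf

logits = tf.constant([4.0, 2.0, 1.0])
for T in [1.0, 2.0, 5.0]:
    probs = tf.nn.softmax(logits / T)
    print(f"T={T}: {probs.numpy().round(3)}")
# As T grows, the probabilities move closer together: the teacher's view of
# which classes are similar becomes visible in the soft targets.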
Techniques
● Approach #1: Weigh objectives (student and teacher) and combine
during backprop
● Approach #2: Compare distributions of the predictions (student and
teacher) using KL divergence
KL divergence
t - logits from the teacher
s - logits of the student
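A hedged sketch of a combined distillation loss, assuming teacher logits t and student logits s: the KL divergence between the softened distributions (Approach #2) is weighted against the ordinary hard-label loss (Approach #1). T, alpha, and the toy tensors are illustrative choices, not prescribed values.
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, labels, T=4.0, alpha=0.5):
    # soft targets and soft predictions at temperature T
    soft_targets = tf.nn.softmax(teacher_logits / T)
    soft_preds = tf.nn.softmax(student_logits / T)
    # the soft loss is commonly scaled by T**2 so its gradients stay comparable in magnitude
    kd_loss = tf.keras.losses.KLDivergence()(soft_targets, soft_preds) * (T ** 2)

    # hard-label student loss at T=1
    student_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(
        labels, student_logits)

    return alpha * kd_loss + (1.0 - alpha) * student_loss

# toy usage
teacher_logits = tf.random.normal([8, 10])
student_logits = tf.random.normal([8, 10])
labels = tf.random.uniform([8], maxval=10, dtype=tf.int64)
loss = distillation_loss(teacher_logits, student_logits, labels)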
[Diagram: the input x passes through the teacher model (layers 1 … m) and the student (distilled) model; the teacher's softmax at temperature T=t produces soft labels, the student's softmax at T=t produces soft predictions for the distillation loss, and the student's softmax at T=1 produces hard predictions compared against the ground-truth hard label y for the student loss]
How knowledge transfer takes place
Model Accuracy Word Error Rate (WER)
Baseline 58.9% 10.9%
10x Ensemble 61.1% 10.7%
Distilled Single Model 60.8% 10.7%
First quantitative results of distillation
[Example: DistilBERT — a smaller model distilled from BERT]
Knowledge Distillation
Case Study - How to Distill
Knowledge for a Q&A Task
Two-stage multi-teacher distillation for Q & A
[Diagram: Stage 1 — a student model is distilled from multiple teacher distillation tasks on a large unsupervised Q&A corpus (soft labels); Stage 2 — the student is further distilled on task-specific corpora (golden + soft labels), combining each golden task with the distillation tasks to yield enhanced student models for Q&A, RTE, and MNLI]
[Chart: accuracy (%) of TKD, MKD, and TMKD students on DeepQA, MNLI, SNLI, QNLI, and RTE — the two-stage multi-teacher TMKD model matches or outperforms the single-technique variants on every task]
Impact of two-stage knowledge distillation
Make EfficientNets robust to noise with distillation
1. Train a teacher model on labeled data
2. Use the teacher to assign pseudo-labels to unlabeled data
3. Train a noised student model using the labeled and pseudo-labeled data, where the noise comes from:
● Data augmentation
● Dropout
● Stochastic depth
4. Replace the teacher with the student and repeat
Results of noisy student training
[Chart: ImageNet top-1 accuracy vs. number of parameters (millions) — the NoisyStudent EfficientNet-B7 outperforms the baseline EfficientNet-B7, AmoebaNet-C, and SENet]