Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not
use or distribute these slides for commercial purposes. You may make copies of these
slides and use or distribute them for educational purposes as long as you
cite DeepLearning.AI as the source of the slides.
For the rest of the details of the license, see
https://creativecommons.org/licenses/by-sa/2.0/legalcode
High Performance Modeling
Welcome
Distributed Training
High Performance Modeling
Rise in computational requirements
● At first, training models is quick and easy
● Training models becomes more time-consuming
○ With more data
○ With larger models
● Longer training -> More epochs -> Less efficient
● Use distributed training approaches
Types of distributed training
● Data parallelism: In data parallelism, models are replicated onto
different accelerators (GPU/TPU) and data is split between them
● Model parallelism: When models are too large to fit on a single device
then they can be divided into partitions, assigning different partitions to
different accelerators
Data parallelism
Distributed training using data parallelism
Synchronous training
● All workers train and complete updates in sync
● Supported via all-reduce architecture
Asynchronous training
● Each worker trains and completes updates separately
● Supported via parameter server architecture
● More efficient, but can result in lower accuracy and
slower convergence
● If you want to distribute a model:
○ Supported in high-level APIs such as Keras/Estimators
○ For more control, you can use custom training loops
Making your models distribute-aware
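As a rough illustration of the custom-training-loop route, here is a minimal sketch of a distribute-aware training loop, assuming a toy model and random data (this is not the course's reference code); it uses MirroredStrategy, which is described below.
import tensorflow as tf

GLOBAL_BATCH_SIZE = 64
strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU

with strategy.scope():
    # variables created inside the scope become distributed (mirrored) variables
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

features = tf.random.normal([1024, 32])
labels = tf.random.uniform([1024], maxval=10, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        x, y = inputs
        with tf.GradientTape() as tape:
            logits = model(x, training=True)
            per_example_loss = loss_object(y, logits)
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    per_replica_losses = strategy.run(step_fn, args=(dist_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for batch in dist_dataset:
    train_step(batch)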
● Library in TensorFlow for running a computation in multiple devices
● Supports distribution strategies for high-level APIs like Keras and
custom training loops
● Convenient to use with little or no code changes
tf.distribute.Strategy
Distribution Strategies supported by tf.distribute.Strategy
● One Device Strategy
● Mirrored Strategy
● Parameter Server Strategy
● Multi-Worker Mirrored Strategy
● Central Storage Strategy
● TPU Strategy
One Device Strategy
● Single device - no distribution
● Typical usage of this strategy is testing your code before
switching to other strategies that actually distribute your code
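A minimal sketch of the One Device Strategy (the model here is a placeholder): everything runs on the single named device, which makes it easy to verify the code before switching strategies.
import tensorflow as tf

strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")  # or "/cpu:0"

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
    model.compile(optimizer="sgd", loss="mse")
# Later, swap in another strategy (e.g. MirroredStrategy) without touching the model code.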
Mirrored Strategy
● This strategy is typically used for training on one machine with multiple
GPUs
○ Creates a replica per GPU; variables are mirrored across replicas
○ Weight updating is done using efficient cross-device communication
algorithms (all-reduce algorithms)
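With Keras, mirrored training typically needs little more than building and compiling the model inside the strategy scope; a minimal sketch with toy data (not the course's reference code):
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():  # variables created here are mirrored on every GPU
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

x = tf.random.normal([1024, 32])
y = tf.random.uniform([1024], maxval=10, dtype=tf.int64)
model.fit(x, y, batch_size=64, epochs=2)  # gradient updates are all-reduced across replicas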
● Some machines are designated as workers and others as parameter
servers
○ Parameter servers store variables so that workers can perform
computations on them
● Implements asynchronous data parallelism by default
Parameter Server Strategy
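A hedged sketch of creating a Parameter Server Strategy: it assumes a TF_CONFIG environment variable describing the cluster (coordinator, workers, and parameter servers) has already been set up, which is not shown here.
import tensorflow as tf

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# Training is then driven from the coordinator, e.g. via model.fit with a
# tf.keras.utils.experimental.DatasetCreator or a ClusterCoordinator-based custom loop.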
● A catastrophic failure in one worker can cause the whole distributed training job to fail
● How to enable fault tolerance in case a worker dies?
○ By restoring training state upon restart from job failure
○ Keras implementation: BackupAndRestore callback
Fault tolerance
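A minimal sketch of the BackupAndRestore callback with a toy model: if a worker is restarted mid-training, fit resumes from the last backed-up epoch instead of starting over.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer="sgd", loss="mse")

# In older TF 2.x releases this callback lives under tf.keras.callbacks.experimental.
backup_callback = tf.keras.callbacks.BackupAndRestore(backup_dir="/tmp/train_backup")

x = tf.random.normal([256, 8])
y = tf.random.normal([256, 1])
model.fit(x, y, epochs=10, callbacks=[backup_callback])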
High Performance Modeling
High-performance Ingestion
Sometimes data can’t fit into memory, and CPUs can be under-utilized during compute-intensive tasks like training a complex model
You should avoid these inefficiencies so that you can make the most of the hardware available → use input pipelines
Why input pipelines?
● Extract: read data from local storage (HDD/SSD) or remote storage (GCS/HDFS)
● Transform: shuffling & batching, decompression, augmentation, vectorization, ...
● Load: load the transformed data onto an accelerator
tf.data: TensorFlow Input Pipeline
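A minimal sketch of the Extract → Transform → Load pattern with tf.data; the GCS file pattern and parse_example() schema below are hypothetical placeholders for your own data.
import tensorflow as tf

def parse_example(serialized):
    # Transform: parse, decode, and normalize one record (hypothetical schema)
    features = {"image": tf.io.FixedLenFeature([], tf.string),
                "label": tf.io.FixedLenFeature([], tf.int64)}
    parsed = tf.io.parse_single_example(serialized, features)
    image = tf.image.resize(tf.io.decode_jpeg(parsed["image"], channels=3), [224, 224]) / 255.0
    return image, parsed["label"]

dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord"))  # Extract
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)                      # Transform
    .shuffle(10_000)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # Load: overlap data preparation with training on the accelerator
)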
[Diagram: a sequential ETL timeline — each extract, transform, and load step finishes before training starts, so the accelerator sits idle during data preparation and the CPU sits idle during training]
Inefficient ETL process
[Diagram: a pipelined ETL timeline — extract, transform, and load steps for upcoming batches run while the current batch trains, keeping the accelerator 100% utilized and minimizing disk and CPU idle time]
An improved ETL process
[Diagram: without pipelining, the CPU prepares a batch while the GPU/TPU is idle and then idles while the GPU/TPU trains; with pipelining, prepare and train steps overlap, so both devices stay busy]
Pipelining
How to optimize pipeline performance?
● Prefetching
● Parallelize data extraction and transformation
● Caching
● Reduce memory
# benchmark() and ArtificialDataset are assumed helper utilities (e.g. as in the
# tf.data performance guide) that time iteration over a synthetic dataset.
benchmark(
    ArtificialDataset()
    .prefetch(tf.data.experimental.AUTOTUNE)
)
Optimize with prefetching
[Chart: per-epoch timeline of the prefetched dataset — open, read, and train steps overlap, so each epoch takes less time]
Parallelize data extraction
● Time-to-first-byte: Prefer local storage, since reading the first byte from remote storage takes significantly longer
● Read throughput: Maximize the aggregate bandwidth of remote
storage by reading more files
benchmark(
tf.data.Dataset.range(2)
.interleave(
ArtificialDataset,
num_parallel_calls=tf.data.experimental.AUTOTUNE
)
)
Parallel interleave
[Chart: per-epoch timeline with parallel interleave — opening and reading multiple files overlap, shortening each epoch]
Parallelize data transformation
● After data loading, the inputs may need preprocessing
● Element-wise preprocessing can be parallelized across CPU cores
● The optimal value for the level of parallelism depends on:
○ Size and shape of training data
○ Cost of the mapping transformation
○ Load the CPU is experiencing currently
● With tf.data you can use AUTOTUNE to set parallelism automatically
Parallel mapping
benchmark(
ArtificialDataset()
.map(
mapped_function,
num_parallel_calls=tf.data.AUTOTUNE
)
)
[Chart: per-epoch timeline with parallel map — the map transformation runs across multiple CPU cores, shortening each epoch]
Improve training time with caching
● In-memory: tf.data.Dataset.cache()
● Disk: tf.data.Dataset.cache(filename=...)
benchmark(
    ArtificialDataset().map(mapped_function).cache(),
    5  # assumed benchmark() parameter: number of epochs to iterate; after the
       # first epoch, data is served from the cache
)
Caching
[Chart: per-epoch timeline of the cached dataset — open, read, and map run only in the first epoch; later epochs read from the cache and finish much faster]
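Putting the optimizations above together, a sketch of one possible pipeline; the file pattern and preprocess() are placeholders for your own data, and the ordering (cache after the expensive map, prefetch last) follows the general guidance rather than a single required recipe.
import tensorflow as tf

def preprocess(record):
    # placeholder for parsing, decoding, and augmentation
    return record

files = tf.data.Dataset.list_files("/data/train-*.tfrecord")  # hypothetical path
dataset = (
    files
    .interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)  # parallel extraction
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)                      # parallel transformation
    .cache()                                                                    # reuse work after epoch 1
    .shuffle(1_000)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)                                                 # overlap with training
)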
Training Large Models -
The Rise of Giant Neural Nets
and Parallelism
High performance modeling
Rise of giant neural networks
● In 2014, the ImageNet winner was GoogLeNet, with 4 mil. parameters and a 74.8% top-1 accuracy
● In 2017, Squeeze-and-Excitation networks achieved 82.7% top-1 accuracy with 145.8 mil. parameters
A 36-fold increase in the number of parameters in just 3 years!
Issues training larger networks
● GPU memory has only increased by a factor of ~3
● Saturated the amount of memory available in Cloud TPUs
● Need for large-scale training of giant neural networks
Overcoming memory constraints
● Strategy #1 - Gradient Accumulation
○ Split each batch into mini-batches, accumulate the gradients, and apply the weight update only after the whole batch (see the sketch below)
● Strategy #2 - Memory swap
○ Swap activations back and forth between accelerator memory and CPU (host) memory
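A minimal sketch of gradient accumulation (Strategy #1) in a custom loop with a toy model and random data: gradients from several micro-batches are summed, and the weight update is applied once per effective batch.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

ACCUM_STEPS = 4  # micro-batches per effective batch
x = tf.random.normal([256, 8])
y = tf.random.normal([256, 1])
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(16)  # micro-batch size 16

accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
for step, (features, labels) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(features, training=True)) / ACCUM_STEPS
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if (step + 1) % ACCUM_STEPS == 0:
        # apply the accumulated update only after the whole effective batch
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]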
Parallelism revisited
● Data parallelism: In data parallelism, models are replicated onto
different accelerators (GPU/TPU) and data is split between them
● Model parallelism: When models are too large to fit on a single device
then they can be divided into partitions, assigning different partitions to
different accelerators
Challenges in data parallelism
[Chart: communication overhead (% of time) for VGG16, ResNet50, and AlexNet trained on 4, 8, and 16 K80, Titan X, and V100 accelerators — overhead grows as more devices are added]
● Accelerators have limited memory
● Model parallelism: large networks can be trained
○ But, accelerator compute capacity is underutilized
● Data parallelism: train same model with different input data
○ But, the maximum model size an accelerator can support is limited
Challenges keeping accelerators busy
[Diagram: with naive model parallelism, forward passes F0–F3 and backward passes B3–B0 run on Devices 0–3 one device at a time, leaving large idle "bubbles" before each update; with pipeline parallelism, each mini-batch is split into micro-batches (F0,0 … F3,3 and B3,3 … B0,0), so the devices work concurrently and only a small bubble remains before the weight updates]
Pipeline parallelism
Pipeline parallelism
● Integrates both data and model parallelism:
○ Divide mini-batch data into micro-batches
○ Different workers work on different micro-batches in parallel
○ Allow ML models to have significantly more parameters
GPipe - Key features
● Open-source TensorFlow library (using Lingvo)
● Inserts communication primitives at the partition boundaries
● Automatic parallelism to reduce memory consumption
● Gradient accumulation across micro-batches, so that model quality is
preserved
● Partitioning is heuristic-based
[Chart: GPipe speedup on AmoebaNet-D (4, 512) — compared with a naive 2-partition baseline, pipelining with 2, 4, and 8 partitions gives progressively higher speedup]
GPipe Results
Knowledge Distillation
Teacher and Student
Networks
Sophisticated models and their problems
● Larger, more sophisticated models become increasingly complex
● Complex models learn complex tasks
● Can we express this learning more efficiently?
Is it possible to ‘distill’ or concentrate this complexity into smaller
networks?
[Image: GoogLeNet — an example of a large, complex architecture]
● Duplicate the performance of a
complex model in a simpler model
Knowledge distillation
[Diagram: a teacher model and its loss, with Student 1 and Student 2 learning from it]
● Idea: Create a simple ‘student’
model that learns from a complex
‘teacher’ model
Knowledge Distillation
Knowledge Distillation
Techniques
Teacher and student
● Training objectives of the models vary
● Teacher (normal training)
○ maximizes the actual metric
● Student (knowledge transfer)
○ matches the probability distribution of the teacher’s predictions to form ‘soft targets’
○ ‘Soft targets’ tell us about the knowledge learned by the teacher
Dark knowledge
Transferring “dark knowledge” to the student
● Improve softness of the teacher’s
distribution with ‘softmax
temperature’ (T)
● As T grows, you get more insight
about which classes the teacher finds
similar to the predicted one
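A small illustration of softmax temperature (the logits are made up): raising T softens the distribution and exposes the relative similarity the teacher assigns to the non-top classes.
import tensorflow as tf

logits = tf.constant([4.0, 2.0, 1.0])
for T in [1.0, 2.0, 5.0]:
    probs = tf.nn.softmax(logits / T)
    print(f"T={T}: {probs.numpy().round(3)}")
# As T grows, the probabilities move closer together: the teacher's view of
# which classes are similar becomes visible in the soft targets.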
Techniques
● Approach #1: Weigh objectives (student and teacher) and combine
during backprop
● Approach #2: Compare distributions of the predictions (student and
teacher) using KL divergence
KL divergence
t - logits from the teacher
s - logits of the student
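A hedged sketch of a combined distillation loss, assuming teacher logits t and student logits s: the KL divergence between the softened distributions (Approach #2) is weighted against the ordinary hard-label loss (Approach #1). T, alpha, and the toy tensors are illustrative choices, not prescribed values.
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, labels, T=4.0, alpha=0.5):
    # soft targets and soft predictions at temperature T
    soft_targets = tf.nn.softmax(teacher_logits / T)
    soft_preds = tf.nn.softmax(student_logits / T)
    # the soft loss is commonly scaled by T**2 so its gradients stay comparable in magnitude
    kd_loss = tf.keras.losses.KLDivergence()(soft_targets, soft_preds) * (T ** 2)

    # hard-label student loss at T=1
    student_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(
        labels, student_logits)

    return alpha * kd_loss + (1.0 - alpha) * student_loss

# toy usage
teacher_logits = tf.random.normal([8, 10])
student_logits = tf.random.normal([8, 10])
labels = tf.random.uniform([8], maxval=10, dtype=tf.int64)
loss = distillation_loss(teacher_logits, student_logits, labels)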
[Diagram: the input x passes through the teacher model (layers 1 … m) and the student (distilled) model; the teacher's softmax at temperature T=t produces soft labels, the student's softmax at T=t produces soft predictions for the distillation loss, and the student's softmax at T=1 produces hard predictions compared against the ground-truth hard label y for the student loss]
How knowledge transfer takes place
Model Accuracy Word Error Rate (WER)
Baseline 58.9% 10.9%
10x Ensemble 61.1% 10.7%
Distilled Single Model 60.8% 10.7%
First quantitative results of distillation
[Example: DistilBERT — a smaller model distilled from BERT]
Knowledge Distillation
Case Study - How to Distill
Knowledge for a Q&A Task
Two-stage multi-teacher distillation for Q & A
[Diagram: Stage 1 — a student model is distilled from multiple teacher distillation tasks on a large unsupervised Q&A corpus (soft labels); Stage 2 — the student is further distilled on task-specific corpora (golden + soft labels), combining each golden task with the distillation tasks to yield enhanced student models for Q&A, RTE, and MNLI]
[Chart: accuracy (%) of TKD, MKD, and TMKD students on DeepQA, MNLI, SNLI, QNLI, and RTE — the two-stage multi-teacher TMKD model matches or outperforms the single-technique variants on every task]
Impact of two-stage knowledge distillation
Make EfficientNets robust to noise with distillation
1. Train a teacher model on labeled data
2. Use the teacher to assign pseudo-labels to unlabeled data
3. Train a noised student model using the labeled and pseudo-labeled data, where the noise comes from:
● Data augmentation
● Dropout
● Stochastic depth
4. Replace the teacher with the student and repeat
Results of noisy student training
[Chart: ImageNet top-1 accuracy vs. number of parameters (millions) — the NoisyStudent EfficientNet-B7 outperforms the baseline EfficientNet-B7, AmoebaNet-C, and SENet]