One of the largest challenges facing the machine learning community today is understanding how to build a platform to run common open-source machine learning libraries such as TensorFlow. Joy and Lachie are both passionate about making machine learning accessible to the masses using Kubernetes. In this session they'll share how to deploy a distributed TensorFlow training cluster, complete with GPU scheduling, on Kubernetes. They'll also cover how distributed TensorFlow training works, the various options for distributed training, and when to choose which option, along with best practices for running distributed TensorFlow on top of Kubernetes based on their latest performance tests on public cloud providers. All work presented in this session is available in a public GitHub repository.
2. Joy Qiao
Principal Solution Architect
Cloud and AI, Microsoft
@joyqiao2016
Lachlan Evenson
Principal Program Manager
Azure Containers, Microsoft
@LachlanEvenson
3. Who are we?
The Data Scientist
• Builds and trains models
• Experienced with machine learning frameworks/libraries
• Basic understanding of computer hardware
• Lucky to have Kubernetes experience
4. Who are we? (continued)
The Infra Engineer/SRE
• Builds and maintains bare-metal/cloud infrastructure
• Kubernetes experience
• Little to no machine learning library experience
6. Why this matters
We have the RIGHT tools and libraries to build and train models.
We have the RIGHT platform in Kubernetes to run and train these models.
7. What we've experienced
• Two discrete worlds are coming together
• The knowledge is not widely accessible to the right audience
• Nomenclature differs between the two communities
• Documentation and use cases are lacking
• APIs are evolving very fast, so sample code gets out of date quickly
10. Distributed Training Architecture: Data Parallelism
1. Train in parallel on different machines
2. Update the parameters synchronously/asynchronously
3. Refresh the local model with the new parameters, go to 1 and repeat
Credits: Taifeng Wang, DMTK team
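The three steps above can be sketched in a few lines of framework-free Python. This is a toy illustration, not TensorFlow code: two simulated "workers" each compute a gradient on their own data shard, the gradients are averaged (the synchronous update a parameter server or allreduce would perform), and every local model is refreshed with the same new parameters before repeating.

```python
def local_gradient(weight, shard):
    # Gradient of mean squared error for the toy model y = weight * x,
    # computed only on this worker's data shard.
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def sync_training_step(weight, shards, lr=0.01):
    # 1. Parallel training: every worker computes a gradient on its shard.
    grads = [local_gradient(weight, shard) for shard in shards]
    # 2. Synchronous update: average the gradients across workers.
    avg_grad = sum(grads) / len(grads)
    # 3. Refresh every local model with the new parameters, then repeat.
    return weight - lr * avg_grad

# Data for y = 3x, split across two simulated workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = sync_training_step(w, shards)
print(round(w, 3))  # converges toward 3.0
```

An asynchronous variant would skip the averaging barrier and let each worker push its gradient as soon as it finishes, at the cost of applying updates to slightly stale parameters.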
11. Distributed Training Architecture: Model Parallelism
The global model is partitioned into K sub-models. The sub-models are distributed over K local workers and serve as their local models. In each mini-batch, the local workers compute the gradients of their local weights by backpropagation.
Credits: Taifeng Wang, DMTK team
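To make the partitioning concrete, here is a minimal sketch, again not framework code: a "global model" of four layers is split into K = 2 sub-models, and a forward pass hands activations from one worker's partition to the next, producing the same output as the unpartitioned model. (In real training each worker would also backpropagate gradients for its own local weights only.)

```python
K = 2
global_model = [2.0, 3.0, 0.5, 4.0]  # one multiplicative weight per "layer"
chunk = len(global_model) // K
# Partition the global model into K contiguous sub-models, one per worker.
sub_models = [global_model[i * chunk:(i + 1) * chunk] for i in range(K)]

def worker_forward(sub_model, activation):
    # Each worker applies only its own partition of the layers.
    for w in sub_model:
        activation *= w
    return activation

x = 1.0
for sub in sub_models:  # activations flow worker -> worker
    x = worker_forward(sub, x)
print(x)  # same result as running the full, unpartitioned model
```

This contiguous layer split is only one way to partition; large layers can also be split within a layer (e.g. sharding a weight matrix across workers).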
12. Distributed TensorFlow Architecture
For variable distribution & gradient aggregation:
• parameter_server
Source: https://www.tensorflow.org/performance/performance_models
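The parameter_server mode can be sketched as a tiny pure-Python simulation (illustrative only, not the TensorFlow API): variables live on a parameter-server task, and workers pull the current values, compute gradients locally, and push updates back.

```python
class ParameterServer:
    """Toy parameter server: holds variables, applies pushed gradients."""
    def __init__(self, variables):
        self.variables = dict(variables)

    def pull(self):
        # Workers fetch the current variable values.
        return dict(self.variables)

    def push(self, grads, lr=0.1):
        # Apply a (possibly asynchronous) gradient update.
        for name, g in grads.items():
            self.variables[name] -= lr * g

ps = ParameterServer({"w": 0.0})
for step in range(50):
    for worker in range(2):  # two workers pushing async-style, in turn
        local = ps.pull()
        grad = {"w": 2 * (local["w"] - 1.0)}  # toy objective: (w - 1)^2
        ps.push(grad)
print(round(ps.variables["w"], 3))  # approaches the optimum w = 1.0
```

In real async training the workers' pulls and pushes interleave unpredictably, so gradients may be computed from stale variables; the sequential loop here sidesteps that purely for readability.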
24. Distributed Training
Settings:
• Topology: 1 ps and 2 workers
• Async variable updates
• CPU as the local_parameter_device
• Each ps/worker pod has its own dedicated host
• variable_update mode: parameter_server
• Network protocol: gRPC
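For this topology, each task needs a cluster spec naming every ps and worker endpoint. The sketch below shows the shape of such a spec as distributed TensorFlow consumes it (via TF_CONFIG or tf.train.ClusterSpec); the pod DNS names are placeholders for whatever your Kubernetes Services actually expose.

```python
import json

# 1 ps and 2 workers, matching the settings above; gRPC is the transport.
cluster_spec = {
    "ps": ["ps-0.default.svc.cluster.local:2222"],
    "worker": [
        "worker-0.default.svc.cluster.local:2222",
        "worker-1.default.svc.cluster.local:2222",
    ],
}

# Each pod additionally receives its own task identity, e.g. via the
# TF_CONFIG environment variable (this pod is worker 0):
tf_config = json.dumps({
    "cluster": cluster_spec,
    "task": {"type": "worker", "index": 0},
})
print(tf_config)
```

On Kubernetes this per-pod task identity is typically injected as an environment variable in the pod spec, so the same container image can serve as ps or worker.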
25. Distributed Training (continued)
Single-pod training with 4 GPUs vs distributed training with 2 workers and 8 GPUs
26. Distributed Training (continued)
Training speedup on 2 nodes vs a single node
Distributed training scalability depends on the compute-to-communication ratio of the model: a model with a higher ratio scales better.
27. Distributed Training (continued)
Observations during testing:
• Linear scalability largely depends on the model and on network bandwidth.
• GPUs were not fully saturated on the worker nodes, likely due to a network bottleneck.
• VGG16 does not scale across multiple nodes; its GPUs were "starved" most of the time.
• Running parameter servers in the same pods as the workers seems to give worse performance.
• It is tricky to decide the right ratio of workers to parameter servers.
• Sync vs async variable updates is another trade-off to weigh.
29. Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow, Keras, and PyTorch
• A stand-alone Python package
• Installs seamlessly on top of TensorFlow
• Uses NCCL for ring-allreduce across servers instead of a parameter server
• Uses MPI for worker discovery and reduction coordination
• Tensor Fusion
Benchmark on 32 servers with 4 Pascal GPUs each, connected by a RoCE-capable 25 Gbit/s network (source: https://github.com/uber/horovod)
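Ring-allreduce is what lets Horovod drop the parameter server: each worker exchanges gradient chunks only with its ring neighbours, yet every worker ends up with the full sum. The following is a toy single-process simulation of the algorithm (not Horovod or NCCL code), in the classic two-phase form: a scatter-reduce pass followed by an allgather pass, each taking K−1 steps for K workers.

```python
def ring_allreduce(grads):
    """Simulate ring-allreduce: every worker ends with the elementwise sum."""
    k = len(grads)                 # number of workers in the ring
    n = len(grads[0])              # gradient length, divisible by k here
    assert n % k == 0
    chunk = n // k
    bufs = [list(g) for g in grads]

    # Phase 1, scatter-reduce: each step, worker i sends one chunk to
    # worker i+1, which adds it in. After k-1 steps, worker i holds the
    # fully reduced chunk (i+1) % k.
    for step in range(k - 1):
        sends = [(i, (i - step) % k) for i in range(k)]
        snapshot = [list(b) for b in bufs]   # all sends happen "in parallel"
        for i, c in sends:
            dst = (i + 1) % k
            for j in range(c * chunk, (c + 1) * chunk):
                bufs[dst][j] += snapshot[i][j]

    # Phase 2, allgather: circulate the finished chunks around the ring
    # so every worker ends up with all reduced chunks.
    for step in range(k - 1):
        sends = [(i, (i + 1 - step) % k) for i in range(k)]
        snapshot = [list(b) for b in bufs]
        for i, c in sends:
            dst = (i + 1) % k
            bufs[dst][c * chunk:(c + 1) * chunk] = \
                snapshot[i][c * chunk:(c + 1) * chunk]
    return bufs

grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
result = ring_allreduce(grads)
print(result[0])  # [12.0, 15.0, 18.0] on every worker
```

The bandwidth appeal is that each worker sends roughly 2·(K−1)/K of the gradient size in total regardless of K, instead of funnelling all traffic through parameter-server hosts.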
30. Horovod: Uber's Open Source Distributed Deep Learning Framework
Benchmark on AKS: traditional distributed TensorFlow vs Horovod
Training speedup on 2 nodes vs a single node
31. Ecosystem: Open Source Tooling for Distributed Training on Kubernetes
"FreeFlow" CNI plugin from Microsoft Research (credits: Yibo Zhu, MS Research)
https://github.com/Microsoft/Freeflow
• Speeds up container overlay-network communication to host-machine speed
• Supports both RDMA and TCP
• Higher throughput, lower latency, and less CPU overhead
• Pure user-space implementation
• Does not rely on anything specific to Kubernetes
• Detailed steps on how to set up FreeFlow on Kubernetes: https://github.com/joyq-github/TensorFlowonK8s/blob/master/FreeFlow.md
32. Ecosystem: Open Source Tooling for Distributed Training on Kubernetes (continued)
Custom CRI & scheduler: GPU-related resource scheduling on Kubernetes (credits: Sanjeev Mehrotra, MS Research)
• Schedule pods by number of GPUs and amount of GPU memory
• Schedule pods onto GPUs interconnected via NVLink, etc.
• https://github.com/Microsoft/KubeGPU
33. Fast.ai: Training ImageNet in 3 hours for $25 and CIFAR10 in 3 minutes for $0.26
http://www.fast.ai/2018/04/30/dawnbench-fastai/
Speed to accuracy is very important, in addition to images/sec.
Algorithmic creativity is more important than bare-metal performance:
• Super convergence: slowly increase the learning rate while decreasing momentum during training (https://arxiv.org/abs/1708.07120)
• Progressive resizing: train on smaller images at the start of training, and gradually increase image size as training proceeds
  o Progressive Growing of GANs: https://arxiv.org/abs/1710.10196
  o Enhanced Deep Residual Networks: https://arxiv.org/abs/1707.02921
• Half-precision arithmetic
• Wide Residual Networks: https://arxiv.org/abs/1605.07146
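The super-convergence bullet can be made concrete with a small schedule function. This is an illustrative sketch of a one-cycle-style policy under assumed bounds (lr 0.001 to 0.1, momentum 0.85 to 0.95 are placeholder values, not from the paper): learning rate ramps up and back down over training while momentum moves in the opposite direction.

```python
def one_cycle(step, total_steps, lr_max=0.1, lr_min=0.001,
              mom_max=0.95, mom_min=0.85):
    """Return (learning_rate, momentum) for the given training step."""
    half = total_steps // 2
    if step < half:
        t = step / half                   # first half: lr up, momentum down
        lr = lr_min + t * (lr_max - lr_min)
        mom = mom_max - t * (mom_max - mom_min)
    else:
        t = (step - half) / (total_steps - half)  # second half: reverse
        lr = lr_max - t * (lr_max - lr_min)
        mom = mom_min + t * (mom_max - mom_min)
    return lr, mom

print(one_cycle(0, 100))    # start: low lr, high momentum
print(one_cycle(50, 100))   # midpoint: peak lr, low momentum
```

In practice the schedule is queried once per batch and fed to the optimizer; the large peak learning rate is what shortens training, while the low momentum at the peak keeps it stable.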
34. In Summary
• Distributed training architectures
• Open source tooling on Kubernetes
• Performance benchmarks
• Ecosystem libraries and tooling
35. Resources
• Kubeflow: https://github.com/kubeflow/kubeflow
• Kubeflow Labs: https://aka.ms/kubeflow-labs
• Deep Learning Workspace: https://github.com/microsoft/DLWorkspace/
• AKS GPU cluster create: https://docs.microsoft.com/azure/aks/gpu-cluster