One of the largest challenges facing the machine learning community today is understanding how to build a platform to run common open-source machine learning libraries such as TensorFlow. Joy and Lachie are both passionate about making machine learning accessible to the masses using Kubernetes. In this session they'll share how to deploy a distributed TensorFlow training cluster, complete with GPU scheduling, on Kubernetes. They'll also cover how distributed TensorFlow training works, the various options for distributed training, and when to choose which option, along with best practices for running distributed TensorFlow on top of Kubernetes based on their latest performance tests on public cloud providers. All work presented in this session is available in a public GitHub repository.
2. Joy Qiao
Principal Solution Architect
Cloud and AI, Microsoft
@joyqiao2016
Lachlan Evenson
Principal Program Manager
Azure Containers, Microsoft
@LachlanEvenson
3. Who are we?
The Data Scientist
• Builds and trains models
• Experienced with machine learning frameworks/libraries
• Basic understanding of computer hardware
• Lucky to have Kubernetes experience
4. Who are we? (continued)
The Infra Engineer/SRE
• Builds and maintains bare-metal/cloud infrastructure
• Kubernetes experience
• Little to no machine learning library experience
6. Why this matters
We have the RIGHT tools and libraries to build and train models.
We have the RIGHT platform in Kubernetes to run and train these models.
7. What we've experienced
• Two discrete worlds are coming together
• The knowledge is not widely accessible to the right audience
• Nomenclature differs between the two communities
• Documentation and use cases are lacking
• APIs are evolving very fast, so sample code gets out of date quickly
10. Distributed Training Architecture: Data Parallelism
1. Train in parallel on different machines
2. Update the parameters synchronously/asynchronously
3. Refresh the local model with the new parameters, go to 1 and repeat
Credits: Taifeng Wang, DMTK team
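The three steps above can be sketched in a few lines of framework-free Python. This is a toy illustration, not TensorFlow code: two simulated "workers" each compute a gradient on their own data shard, the gradients are averaged (the synchronous update a parameter server or allreduce would perform), and every local model is refreshed with the same new parameters before repeating.

```python
def local_gradient(weight, shard):
    # Gradient of mean squared error for the toy model y = weight * x,
    # computed only on this worker's data shard.
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def sync_training_step(weight, shards, lr=0.01):
    # 1. Parallel training: every worker computes a gradient on its shard.
    grads = [local_gradient(weight, shard) for shard in shards]
    # 2. Synchronous update: average the gradients across workers.
    avg_grad = sum(grads) / len(grads)
    # 3. Refresh every local model with the new parameters, then repeat.
    return weight - lr * avg_grad

# Data for y = 3x, split across two simulated workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = sync_training_step(w, shards)
print(round(w, 3))  # converges toward 3.0
```

An asynchronous variant would skip the averaging barrier and let each worker push its gradient as soon as it finishes, at the cost of applying updates to slightly stale parameters.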
11. Distributed Training Architecture: Model Parallelism
The global model is partitioned into K sub-models. The sub-models are distributed over K local workers and serve as their local models. In each mini-batch, the local workers compute the gradients of their local weights by backpropagation.
Credits: Taifeng Wang, DMTK team
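To make the partitioning concrete, here is a minimal sketch, again not framework code: a "global model" of four layers is split into K = 2 sub-models, and a forward pass hands activations from one worker's partition to the next, producing the same output as the unpartitioned model. (In real training each worker would also backpropagate gradients for its own local weights only.)

```python
K = 2
global_model = [2.0, 3.0, 0.5, 4.0]  # one multiplicative weight per "layer"
chunk = len(global_model) // K
# Partition the global model into K contiguous sub-models, one per worker.
sub_models = [global_model[i * chunk:(i + 1) * chunk] for i in range(K)]

def worker_forward(sub_model, activation):
    # Each worker applies only its own partition of the layers.
    for w in sub_model:
        activation *= w
    return activation

x = 1.0
for sub in sub_models:  # activations flow worker -> worker
    x = worker_forward(sub, x)
print(x)  # same result as running the full, unpartitioned model
```

This contiguous layer split is only one way to partition; large layers can also be split within a layer (e.g. sharding a weight matrix across workers).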
12. Distributed TensorFlow Architecture
For variable distribution & gradient aggregation:
• parameter_server
Source: https://www.tensorflow.org/performance/performance_models
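The parameter_server mode can be sketched as a tiny pure-Python simulation (illustrative only, not the TensorFlow API): variables live on a parameter-server task, and workers pull the current values, compute gradients locally, and push updates back.

```python
class ParameterServer:
    """Toy parameter server: holds variables, applies pushed gradients."""
    def __init__(self, variables):
        self.variables = dict(variables)

    def pull(self):
        # Workers fetch the current variable values.
        return dict(self.variables)

    def push(self, grads, lr=0.1):
        # Apply a (possibly asynchronous) gradient update.
        for name, g in grads.items():
            self.variables[name] -= lr * g

ps = ParameterServer({"w": 0.0})
for step in range(50):
    for worker in range(2):  # two workers pushing async-style, in turn
        local = ps.pull()
        grad = {"w": 2 * (local["w"] - 1.0)}  # toy objective: (w - 1)^2
        ps.push(grad)
print(round(ps.variables["w"], 3))  # approaches the optimum w = 1.0
```

In real async training the workers' pulls and pushes interleave unpredictably, so gradients may be computed from stale variables; the sequential loop here sidesteps that purely for readability.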
24. Distributed Training
Settings:
• Topology: 1 ps and 2 workers
• Async variable updates
• CPU as the local_parameter_device
• Each ps/worker pod has its own dedicated host
• variable_update mode: parameter_server
• Network protocol: gRPC
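For this topology, each task needs a cluster spec naming every ps and worker endpoint. The sketch below shows the shape of such a spec as distributed TensorFlow consumes it (via TF_CONFIG or tf.train.ClusterSpec); the pod DNS names are placeholders for whatever your Kubernetes Services actually expose.

```python
import json

# 1 ps and 2 workers, matching the settings above; gRPC is the transport.
cluster_spec = {
    "ps": ["ps-0.default.svc.cluster.local:2222"],
    "worker": [
        "worker-0.default.svc.cluster.local:2222",
        "worker-1.default.svc.cluster.local:2222",
    ],
}

# Each pod additionally receives its own task identity, e.g. via the
# TF_CONFIG environment variable (this pod is worker 0):
tf_config = json.dumps({
    "cluster": cluster_spec,
    "task": {"type": "worker", "index": 0},
})
print(tf_config)
```

On Kubernetes this per-pod task identity is typically injected as an environment variable in the pod spec, so the same container image can serve as ps or worker.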
25. Distributed Training (continued)
Single-pod training with 4 GPUs vs distributed training with 2 workers and 8 GPUs
26. Distributed Training (continued)
Training speedup on 2 nodes vs a single node
Distributed training scalability depends on the compute-to-communication ratio of the model: a model with a higher ratio scales better.
27. Distributed Training (continued)
Observations during testing:
• Linear scalability largely depends on the model and on network bandwidth.
• GPUs were not fully saturated on the worker nodes, likely due to a network bottleneck.
• VGG16 does not scale across multiple nodes; its GPUs were "starved" most of the time.
• Running parameter servers in the same pods as the workers seems to give worse performance.
• It is tricky to decide the right ratio of workers to parameter servers.
• Sync vs async variable updates is another trade-off to weigh.
29. Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow, Keras, and PyTorch
• A stand-alone Python package
• Installs seamlessly on top of TensorFlow
• Uses NCCL for ring-allreduce across servers instead of a parameter server
• Uses MPI for worker discovery and reduction coordination
• Tensor Fusion
Benchmark on 32 servers with 4 Pascal GPUs each, connected by a RoCE-capable 25 Gbit/s network (source: https://github.com/uber/horovod)
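Ring-allreduce is what lets Horovod drop the parameter server: each worker exchanges gradient chunks only with its ring neighbours, yet every worker ends up with the full sum. The following is a toy single-process simulation of the algorithm (not Horovod or NCCL code), in the classic two-phase form: a scatter-reduce pass followed by an allgather pass, each taking K−1 steps for K workers.

```python
def ring_allreduce(grads):
    """Simulate ring-allreduce: every worker ends with the elementwise sum."""
    k = len(grads)                 # number of workers in the ring
    n = len(grads[0])              # gradient length, divisible by k here
    assert n % k == 0
    chunk = n // k
    bufs = [list(g) for g in grads]

    # Phase 1, scatter-reduce: each step, worker i sends one chunk to
    # worker i+1, which adds it in. After k-1 steps, worker i holds the
    # fully reduced chunk (i+1) % k.
    for step in range(k - 1):
        sends = [(i, (i - step) % k) for i in range(k)]
        snapshot = [list(b) for b in bufs]   # all sends happen "in parallel"
        for i, c in sends:
            dst = (i + 1) % k
            for j in range(c * chunk, (c + 1) * chunk):
                bufs[dst][j] += snapshot[i][j]

    # Phase 2, allgather: circulate the finished chunks around the ring
    # so every worker ends up with all reduced chunks.
    for step in range(k - 1):
        sends = [(i, (i + 1 - step) % k) for i in range(k)]
        snapshot = [list(b) for b in bufs]
        for i, c in sends:
            dst = (i + 1) % k
            bufs[dst][c * chunk:(c + 1) * chunk] = \
                snapshot[i][c * chunk:(c + 1) * chunk]
    return bufs

grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
result = ring_allreduce(grads)
print(result[0])  # [12.0, 15.0, 18.0] on every worker
```

The bandwidth appeal is that each worker sends roughly 2·(K−1)/K of the gradient size in total regardless of K, instead of funnelling all traffic through parameter-server hosts.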
30. Horovod: Uber's Open Source Distributed Deep Learning Framework
Benchmark on AKS: traditional distributed TensorFlow vs Horovod
Training speedup on 2 nodes vs a single node
31. Ecosystem: Open Source Tooling for Distributed Training on Kubernetes
"FreeFlow" CNI plugin from Microsoft Research (credits: Yibo Zhu, MS Research)
https://github.com/Microsoft/Freeflow
• Speeds up container overlay-network communication to host-machine speed
• Supports both RDMA and TCP
• Higher throughput, lower latency, and less CPU overhead
• Pure user-space implementation
• Does not rely on anything specific to Kubernetes
• Detailed steps on how to set up FreeFlow on Kubernetes: https://github.com/joyq-github/TensorFlowonK8s/blob/master/FreeFlow.md
32. Ecosystem: Open Source Tooling for Distributed Training on Kubernetes (continued)
Custom CRI & scheduler: GPU-related resource scheduling on Kubernetes (credits: Sanjeev Mehrotra, MS Research)
• Schedule pods by number of GPUs and amount of GPU memory
• Schedule pods onto GPUs interconnected via NVLink, etc.
• https://github.com/Microsoft/KubeGPU
33. Fast.ai: Training ImageNet in 3 hours for $25 and CIFAR10 in 3 minutes for $0.26
http://www.fast.ai/2018/04/30/dawnbench-fastai/
Speed to accuracy is very important, in addition to images/sec.
Algorithmic creativity is more important than bare-metal performance:
• Super convergence: slowly increase the learning rate while decreasing momentum during training (https://arxiv.org/abs/1708.07120)
• Progressive resizing: train on smaller images at the start of training, and gradually increase image size as training proceeds
  o Progressive Growing of GANs: https://arxiv.org/abs/1710.10196
  o Enhanced Deep Residual Networks: https://arxiv.org/abs/1707.02921
• Half-precision arithmetic
• Wide Residual Networks: https://arxiv.org/abs/1605.07146
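The super-convergence bullet can be made concrete with a small schedule function. This is an illustrative sketch of a one-cycle-style policy under assumed bounds (lr 0.001 to 0.1, momentum 0.85 to 0.95 are placeholder values, not from the paper): learning rate ramps up and back down over training while momentum moves in the opposite direction.

```python
def one_cycle(step, total_steps, lr_max=0.1, lr_min=0.001,
              mom_max=0.95, mom_min=0.85):
    """Return (learning_rate, momentum) for the given training step."""
    half = total_steps // 2
    if step < half:
        t = step / half                   # first half: lr up, momentum down
        lr = lr_min + t * (lr_max - lr_min)
        mom = mom_max - t * (mom_max - mom_min)
    else:
        t = (step - half) / (total_steps - half)  # second half: reverse
        lr = lr_max - t * (lr_max - lr_min)
        mom = mom_min + t * (mom_max - mom_min)
    return lr, mom

print(one_cycle(0, 100))    # start: low lr, high momentum
print(one_cycle(50, 100))   # midpoint: peak lr, low momentum
```

In practice the schedule is queried once per batch and fed to the optimizer; the large peak learning rate is what shortens training, while the low momentum at the peak keeps it stable.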
34. In Summary
• Distributed training architectures
• Open source tooling on Kubernetes
• Performance benchmarks
• Ecosystem libraries and tooling
35. Resources
• Kubeflow: https://github.com/kubeflow/kubeflow
• Kubeflow Labs: https://aka.ms/kubeflow-labs
• Deep Learning Workspace: https://github.com/microsoft/DLWorkspace/
• AKS GPU cluster create: https://docs.microsoft.com/azure/aks/gpu-cluster