SlideShare une entreprise Scribd logo
1  sur  37
Télécharger pour lire hors ligne
v
Democratizing Machine
Learning on Kubernetes
Principal Solution Architect
Cloud and AI, Microsoft
@joyqiao2016
Joy Qiao
Principal Program Manager
Azure Containers, Microsoft
@LachlanEvenson
Lachlan Evenson
• The Data Scientist
• Building and training models
• Experience in Machine Learning frameworks/libraries
• Basic understanding of computer hardware
• Lucky to have Kubernetes experience
Who are we?
Data Scientist
• The Infra Engineer/SRE
• Build and maintain baremetal/cloud infra
• Kubernetes experience
• Little to no Machine Learning library experience
Who are we? (continued)
Infra Engineer
SRE
ML on Kubernetes
Infra Engineer
SRE
Data Scientist
We have the RIGHT tools and libraries to build
and train models
We have the RIGHT platform in Kubernetes to
run and train these models
Why this matters
• Two discrete worlds are coming together
• The knowledge is not widely accessible to the right
audience
• Nomenclature
• Documentation and use-cases are lacking
• APIs are evolving very fast, sample code gets out
of date quickly
What we’ve experienced
Let’s get started
Distributed Deep Learning Architectures
Data Parallelism
1. Parallel training on different machines
2. Update the parameters
synchronously/asynchronously
3. Refresh the local model with new parameters, go to
1 and repeat
Distributed Training Architecture
Credits: Taifeng Wang, DMTK team
Model Parallelism
The global model is partitioned into K sub-models.
The sub-models are distributed over K local
workers and serve as their local models.
In each mini-batch, the local workers compute the
gradients of the local weights by back propagation.
Distributed Training Architecture
Credits: Taifeng Wang, DMTK team
For Variable Distribution & Gradient
Aggregation
• Parameter_server
Distributed TensorFlow Architecture
Source: https://www.tensorflow.org/performance/performance_models
Distributed Deep Learning Tooling
• Kubeflow - https://github.com/kubeflow/kubeflow
• JupyterHub, Tensorflow training/serving
• Seldon, Pachyderm, Pytorch-job
• Hands on Labs – https://aka.ms/kubeflow-labs
• Deep Learning Workspace
• https://github.com/microsoft/DLWorkspace/
• https://microsoft.github.io/DLWorkspace/
ML on Kubernetes open source tooling
Distributed Tensorflow using Kubeflow
Distributed Tensorflow using Kubeflow (cont)
• MetaML - https://aka.ms/kubeflow-labs
ML on Kubernetes open source tooling
In just 3 simple steps
Running Distributed TensorFlow on Kubernetes with
Kubeflow
1. Create a Kubernetes cluster with GPU enabled nodes
(AKS, GKE, EKS) -
https://docs.microsoft.com/azure/aks/gpu-cluster
2. Install Kubeflow
3. Run Distributed TensorFlow training job
Detailed instructions at https://aka.ms/kubeflow-labs
Running Distributed TensorFlow on Kubernetes
(continued)
Demo
Distributed Training Performance
on Kubernetes
Training Environment on Azure
Node VMs
oNC24r for workers
§ 4x NVIDIA® Tesla® K80 GPU
§ 24 CPU cores, 224 GB RAM
oD14_v2 for parameter server
§ 16 CPU cores, 112 GB RAM
Kubernetes: AKS (Azure Kubernetes Service) v1.9.6
GPU: NVIDIA® Tesla® K80
Benchmarks scripts: https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
• OS: Ubuntu 16.04 LTS
• TensorFlow: 1.8
• CUDA / cuDNN: 9.0 / 7.0
• Disk: Local SSD
• DataSet: ImageNet (real data, not synthetic)
•Linear scalability
•GPUs are fully saturated
Training on Single Pod, Multi-GPU
Settings:
•Topology: 1 ps and 2 workers
•Async variables update
•Using cpu as the local_parameter_device
•Each ps/worker pod has its own dedicated host
•variable_update mode: parameter_server
•Network protocol: gRPC
Distributed Training
Single Pod Training with 4 GPUs
vs Distributed Training
with 2 workers with 8 GPUs
Distributed Training (continued)
Distributed Training (continued)
Training Speedup on 2 nodes vs single-node
The model with a
higher ratio
scales better.
Distributed training scalability depends on the compute/communication ratio of the model
Observations during test:
• Linear scalability largely depends on the model and network bandwidth.
• GPUs not fully saturated on the worker nodes, likely due to network bottleneck.
• VGG16 does not scale across multiple nodes. GPUs “starved” most of the time.
• Having parameter servers running on the same pods as the workers seem to have
worse performance
• Tricky to decide the right ratio of workers to parameter servers
• Sync vs Async variable updates
Distributed Training (continued)
Distributed Training (continued)
How can we do better?
Benchmark on 32 servers with 4 Pascal GPUs each connected by RoCE-capable 25 Gbit/s
network
(source: https://github.com/uber/horovod)
Horovod: Uber’s Open Source Distributed Deep Learning
Framework for TensorFlow, Keras, and PyTorch
• A stand-alone python
package
• Seamless install on top of
TensorFlow
• Uses NCCL for ring-
allreduce across servers
instead of parameter server
• Uses MPI for worker
discovery and reduction
coordination
• Tensor Fusion
Benchmark on AKS – Traditional Distributed TensorFlow vs Horovod
Horovod: Uber’s Open Source Distributed Deep Learning
Framework
Training Speedup on 2 nodes vs single-node
Ecosystem open source Tooling for Distributed
training on Kubernetes
“FreeFlow” CNI plugin from Microsoft Research (Credits: Yibo Zhu from MS Research)
https://github.com/Microsoft/Freeflow
• Speeds up container overlay network communication to the same as host machine
• Supports both RDMA and TCP
• Higher throughput, lower latency, and less CPU overhead
• Pure user space implementation
• Not relying on anything specific to Kubernetes.
• Detailed steps on how to setup FreeFlow in K8s at
https://github.com/joyq-github/TensorFlowonK8s/blob/master/FreeFlow.md
Ecosystem open source Tooling for Distributed
training on Kubernetes (continued)
Custom CRI & Scheduler: GPU-related resource scheduling on K8s
(Credits: Sanjeev Mehrotra from MS Research)
• Pods with no. of GPUs with how much memory
• Pods with no. of GPUs interconnected via NVLink, etc.
• https://github.com/Microsoft/KubeGPU
Fast.ai: Training Imagenet in 3 hours for $25
and CIFAR10 in 3 minutes for $0.26
http://www.fast.ai/2018/04/30/dawnbench-fastai/
Speed to accuracy is very important, in addition to images/sec
Algorithmic creativity is more important than bare-metal performance
• Super convergence - slowly increase the learning rate whilst decreasing momentum during training
https://arxiv.org/abs/1708.07120
• Progressive resizing - train on smaller images at the start of training, and gradually increase image
size as you train further
o Progressive Growing of GANs https://arxiv.org/abs/1710.10196
o Enhanced Deep Residual Networks https://arxiv.org/abs/1707.02921
• Half precision arithmetic
• Wide Residual Networks https://arxiv.org/abs/1605.07146
• Distributed Training Architectures
• Open Source tooling on Kubernetes
• Performance benchmarks
• Ecosystem libraries and tooling
In Summary
Kubeflow – https://github.com/kubeflow/kubeflow
Kubeflow Labs – https://aka.ms/kubeflow-labs
Deep Learning Workspace – https://github.com/microsoft/DLWorkspace/
AKS GPU Cluster create - https://docs.microsoft.com/azure/aks/gpu-cluster
Resources
Horovod - https://github.com/uber/horovod
FreeFlow - https://github.com/Microsoft/Freeflow
FreeFlow setup docs - https://github.com/joyq-github/TensorFlowonK8s/blob/master/FreeFlow.md
Kubernetes GPU Scheduler - https://github.com/Microsoft/KubeGPU
Fast.ai - http://www.fast.ai/2018/04/30/dawnbench-fastai/
Resources (continued)
You can reach us on Twitter via:
@joyqiao2016
@LachlanEvenson
Thank you

Contenu connexe

Tendances

Introducing Pico - A Deep Learning Platform using Docker & IoT - Sangam Biradar
Introducing Pico - A Deep Learning Platform using Docker & IoT - Sangam BiradarIntroducing Pico - A Deep Learning Platform using Docker & IoT - Sangam Biradar
Introducing Pico - A Deep Learning Platform using Docker & IoT - Sangam Biradarsangam biradar
 
DCSF 19 Microservices API: Routing Across Any Infrastructure
DCSF 19 Microservices API: Routing Across Any InfrastructureDCSF 19 Microservices API: Routing Across Any Infrastructure
DCSF 19 Microservices API: Routing Across Any InfrastructureDocker, Inc.
 
DockerCon 18 Cool Hacks: solo.io
DockerCon 18 Cool Hacks:  solo.ioDockerCon 18 Cool Hacks:  solo.io
DockerCon 18 Cool Hacks: solo.ioDocker, Inc.
 
DCSF19 Kubernetes Security with OPA
DCSF19 Kubernetes Security with OPA DCSF19 Kubernetes Security with OPA
DCSF19 Kubernetes Security with OPA Docker, Inc.
 
Tectonic Summit 2016: Multi-Cluster Kubernetes: Planning for Unknowns
Tectonic Summit 2016: Multi-Cluster Kubernetes: Planning for UnknownsTectonic Summit 2016: Multi-Cluster Kubernetes: Planning for Unknowns
Tectonic Summit 2016: Multi-Cluster Kubernetes: Planning for UnknownsCoreOS
 
Why I wish I'd Heard of Docker when I was 12 - Finnian Anderson
Why I wish I'd Heard of Docker when I was 12 - Finnian AndersonWhy I wish I'd Heard of Docker when I was 12 - Finnian Anderson
Why I wish I'd Heard of Docker when I was 12 - Finnian AndersonDocker, Inc.
 
Troubleshooting tips from docker support engineers
Troubleshooting tips from docker support engineersTroubleshooting tips from docker support engineers
Troubleshooting tips from docker support engineersDocker, Inc.
 
Dev opsec dockerimage_patch_n_lifecyclemanagement_2019
Dev opsec dockerimage_patch_n_lifecyclemanagement_2019Dev opsec dockerimage_patch_n_lifecyclemanagement_2019
Dev opsec dockerimage_patch_n_lifecyclemanagement_2019kanedafromparis
 
DCEU 18: App-in-a-Box with Docker Application Packages
DCEU 18: App-in-a-Box with Docker Application PackagesDCEU 18: App-in-a-Box with Docker Application Packages
DCEU 18: App-in-a-Box with Docker Application PackagesDocker, Inc.
 
Okteto For Kubernetes Developer :- Container Camp 2020
Okteto For Kubernetes Developer :- Container Camp 2020 Okteto For Kubernetes Developer :- Container Camp 2020
Okteto For Kubernetes Developer :- Container Camp 2020 sangam biradar
 
GKE Tip Series how do i choose between gke standard, autopilot and cloud run
GKE Tip Series   how do i choose between gke standard, autopilot and cloud run GKE Tip Series   how do i choose between gke standard, autopilot and cloud run
GKE Tip Series how do i choose between gke standard, autopilot and cloud run Sreenivas Makam
 
DCEU 18: From Monolith to Microservices
DCEU 18: From Monolith to MicroservicesDCEU 18: From Monolith to Microservices
DCEU 18: From Monolith to MicroservicesDocker, Inc.
 
DevOps with Kubernetes and Helm - OSCON 2018
DevOps with Kubernetes and Helm - OSCON 2018DevOps with Kubernetes and Helm - OSCON 2018
DevOps with Kubernetes and Helm - OSCON 2018Jessica Deen
 
Building a Secure Supply Chain with Docker
Building a Secure Supply Chain with DockerBuilding a Secure Supply Chain with Docker
Building a Secure Supply Chain with DockerDocker, Inc.
 
Modernizing Traditional Applications
Modernizing Traditional ApplicationsModernizing Traditional Applications
Modernizing Traditional ApplicationsDocker, Inc.
 
Serverless stream processing of Debezium data change events with Knative | De...
Serverless stream processing of Debezium data change events with Knative | De...Serverless stream processing of Debezium data change events with Knative | De...
Serverless stream processing of Debezium data change events with Knative | De...Red Hat Developers
 
Kubernetes Helm: Why It Matters
Kubernetes Helm: Why It MattersKubernetes Helm: Why It Matters
Kubernetes Helm: Why It MattersPlatform9
 

Tendances (20)

Introducing Pico - A Deep Learning Platform using Docker & IoT - Sangam Biradar
Introducing Pico - A Deep Learning Platform using Docker & IoT - Sangam BiradarIntroducing Pico - A Deep Learning Platform using Docker & IoT - Sangam Biradar
Introducing Pico - A Deep Learning Platform using Docker & IoT - Sangam Biradar
 
DCSF 19 Microservices API: Routing Across Any Infrastructure
DCSF 19 Microservices API: Routing Across Any InfrastructureDCSF 19 Microservices API: Routing Across Any Infrastructure
DCSF 19 Microservices API: Routing Across Any Infrastructure
 
DockerCon 18 Cool Hacks: solo.io
DockerCon 18 Cool Hacks:  solo.ioDockerCon 18 Cool Hacks:  solo.io
DockerCon 18 Cool Hacks: solo.io
 
DCSF19 Kubernetes Security with OPA
DCSF19 Kubernetes Security with OPA DCSF19 Kubernetes Security with OPA
DCSF19 Kubernetes Security with OPA
 
Tectonic Summit 2016: Multi-Cluster Kubernetes: Planning for Unknowns
Tectonic Summit 2016: Multi-Cluster Kubernetes: Planning for UnknownsTectonic Summit 2016: Multi-Cluster Kubernetes: Planning for Unknowns
Tectonic Summit 2016: Multi-Cluster Kubernetes: Planning for Unknowns
 
Why I wish I'd Heard of Docker when I was 12 - Finnian Anderson
Why I wish I'd Heard of Docker when I was 12 - Finnian AndersonWhy I wish I'd Heard of Docker when I was 12 - Finnian Anderson
Why I wish I'd Heard of Docker when I was 12 - Finnian Anderson
 
Troubleshooting tips from docker support engineers
Troubleshooting tips from docker support engineersTroubleshooting tips from docker support engineers
Troubleshooting tips from docker support engineers
 
Dev opsec dockerimage_patch_n_lifecyclemanagement_2019
Dev opsec dockerimage_patch_n_lifecyclemanagement_2019Dev opsec dockerimage_patch_n_lifecyclemanagement_2019
Dev opsec dockerimage_patch_n_lifecyclemanagement_2019
 
DCEU 18: App-in-a-Box with Docker Application Packages
DCEU 18: App-in-a-Box with Docker Application PackagesDCEU 18: App-in-a-Box with Docker Application Packages
DCEU 18: App-in-a-Box with Docker Application Packages
 
On Prem Container Cloud - Lessons Learned
On Prem Container Cloud - Lessons LearnedOn Prem Container Cloud - Lessons Learned
On Prem Container Cloud - Lessons Learned
 
Okteto For Kubernetes Developer :- Container Camp 2020
Okteto For Kubernetes Developer :- Container Camp 2020 Okteto For Kubernetes Developer :- Container Camp 2020
Okteto For Kubernetes Developer :- Container Camp 2020
 
GKE Tip Series how do i choose between gke standard, autopilot and cloud run
GKE Tip Series   how do i choose between gke standard, autopilot and cloud run GKE Tip Series   how do i choose between gke standard, autopilot and cloud run
GKE Tip Series how do i choose between gke standard, autopilot and cloud run
 
DCEU 18: From Monolith to Microservices
DCEU 18: From Monolith to MicroservicesDCEU 18: From Monolith to Microservices
DCEU 18: From Monolith to Microservices
 
DevOps with Kubernetes and Helm - OSCON 2018
DevOps with Kubernetes and Helm - OSCON 2018DevOps with Kubernetes and Helm - OSCON 2018
DevOps with Kubernetes and Helm - OSCON 2018
 
Building a Secure Supply Chain with Docker
Building a Secure Supply Chain with DockerBuilding a Secure Supply Chain with Docker
Building a Secure Supply Chain with Docker
 
Modernizing Traditional Applications
Modernizing Traditional ApplicationsModernizing Traditional Applications
Modernizing Traditional Applications
 
Serverless stream processing of Debezium data change events with Knative | De...
Serverless stream processing of Debezium data change events with Knative | De...Serverless stream processing of Debezium data change events with Knative | De...
Serverless stream processing of Debezium data change events with Knative | De...
 
Zero-downtime deployment with Kubernetes [Meetup #21 - 01]
Zero-downtime deployment with Kubernetes [Meetup #21 - 01]Zero-downtime deployment with Kubernetes [Meetup #21 - 01]
Zero-downtime deployment with Kubernetes [Meetup #21 - 01]
 
Kubernetes Helm: Why It Matters
Kubernetes Helm: Why It MattersKubernetes Helm: Why It Matters
Kubernetes Helm: Why It Matters
 
Kubernetes Logging
Kubernetes LoggingKubernetes Logging
Kubernetes Logging
 

Similaire à Democratizing machine learning on kubernetes

Containerized architectures for deep learning
Containerized architectures for deep learningContainerized architectures for deep learning
Containerized architectures for deep learningAntje Barth
 
Distributed Tensorflow with Kubernetes - data2day - Jakob Karalus
Distributed Tensorflow with Kubernetes - data2day - Jakob KaralusDistributed Tensorflow with Kubernetes - data2day - Jakob Karalus
Distributed Tensorflow with Kubernetes - data2day - Jakob KaralusJakob Karalus
 
Using Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersUsing Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersJoy Qiao
 
Criteo meetup - S.R.E Tech Talk
Criteo meetup - S.R.E Tech TalkCriteo meetup - S.R.E Tech Talk
Criteo meetup - S.R.E Tech TalkPierre Mavro
 
SCaLE 20X: Kubernetes Cloud Cost Monitoring with OpenCost & Optimization Stra...
SCaLE 20X: Kubernetes Cloud Cost Monitoring with OpenCost & Optimization Stra...SCaLE 20X: Kubernetes Cloud Cost Monitoring with OpenCost & Optimization Stra...
SCaLE 20X: Kubernetes Cloud Cost Monitoring with OpenCost & Optimization Stra...Matt Ray
 
Distributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedDistributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedWee Hyong Tok
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLNordic APIs
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
Machine learning on kubernetes
Machine learning on kubernetesMachine learning on kubernetes
Machine learning on kubernetesAnirudh Ramanathan
 
Tensorflow London 13: Barbara Fusinska 'Hassle Free, Scalable, Machine Learni...
Tensorflow London 13: Barbara Fusinska 'Hassle Free, Scalable, Machine Learni...Tensorflow London 13: Barbara Fusinska 'Hassle Free, Scalable, Machine Learni...
Tensorflow London 13: Barbara Fusinska 'Hassle Free, Scalable, Machine Learni...Seldon
 
Hassle free, scalable, machine learning learning with Kubeflow
Hassle free, scalable, machine learning learning with KubeflowHassle free, scalable, machine learning learning with Kubeflow
Hassle free, scalable, machine learning learning with KubeflowBarbara Fusinska
 
Spark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsSpark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsS N
 
The Power of Azure DevOps
The Power of Azure DevOpsThe Power of Azure DevOps
The Power of Azure DevOpsJeff Bramwell
 
AI & Machine Learning Pipelines with Knative
AI & Machine Learning Pipelines with KnativeAI & Machine Learning Pipelines with Knative
AI & Machine Learning Pipelines with KnativeAnimesh Singh
 
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...Akash Tandon
 

Similaire à Democratizing machine learning on kubernetes (20)

Containerized architectures for deep learning
Containerized architectures for deep learningContainerized architectures for deep learning
Containerized architectures for deep learning
 
Distributed Tensorflow with Kubernetes - data2day - Jakob Karalus
Distributed Tensorflow with Kubernetes - data2day - Jakob KaralusDistributed Tensorflow with Kubernetes - data2day - Jakob Karalus
Distributed Tensorflow with Kubernetes - data2day - Jakob Karalus
 
Using Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clustersUsing Deep Learning Toolkits with Kubernetes clusters
Using Deep Learning Toolkits with Kubernetes clusters
 
Criteo meetup - S.R.E Tech Talk
Criteo meetup - S.R.E Tech TalkCriteo meetup - S.R.E Tech Talk
Criteo meetup - S.R.E Tech Talk
 
SCaLE 20X: Kubernetes Cloud Cost Monitoring with OpenCost & Optimization Stra...
SCaLE 20X: Kubernetes Cloud Cost Monitoring with OpenCost & Optimization Stra...SCaLE 20X: Kubernetes Cloud Cost Monitoring with OpenCost & Optimization Stra...
SCaLE 20X: Kubernetes Cloud Cost Monitoring with OpenCost & Optimization Stra...
 
Distributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedDistributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learned
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Kubeflow.pptx
Kubeflow.pptxKubeflow.pptx
Kubeflow.pptx
 
MLOps with Kubeflow
MLOps with Kubeflow MLOps with Kubeflow
MLOps with Kubeflow
 
Machine learning on kubernetes
Machine learning on kubernetesMachine learning on kubernetes
Machine learning on kubernetes
 
Tensorflow London 13: Barbara Fusinska 'Hassle Free, Scalable, Machine Learni...
Tensorflow London 13: Barbara Fusinska 'Hassle Free, Scalable, Machine Learni...Tensorflow London 13: Barbara Fusinska 'Hassle Free, Scalable, Machine Learni...
Tensorflow London 13: Barbara Fusinska 'Hassle Free, Scalable, Machine Learni...
 
Hassle free, scalable, machine learning learning with Kubeflow
Hassle free, scalable, machine learning learning with KubeflowHassle free, scalable, machine learning learning with Kubeflow
Hassle free, scalable, machine learning learning with Kubeflow
 
Spark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsSpark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloads
 
01-06 OCRE Test Suite - Fernandes.pdf
01-06 OCRE Test Suite - Fernandes.pdf01-06 OCRE Test Suite - Fernandes.pdf
01-06 OCRE Test Suite - Fernandes.pdf
 
The Power of Azure DevOps
The Power of Azure DevOpsThe Power of Azure DevOps
The Power of Azure DevOps
 
Intro to kubernetes
Intro to kubernetesIntro to kubernetes
Intro to kubernetes
 
AI & Machine Learning Pipelines with Knative
AI & Machine Learning Pipelines with KnativeAI & Machine Learning Pipelines with Knative
AI & Machine Learning Pipelines with Knative
 
MLOps in action
MLOps in actionMLOps in action
MLOps in action
 
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
 

Plus de Docker, Inc.

Containerize Your Game Server for the Best Multiplayer Experience
Containerize Your Game Server for the Best Multiplayer Experience Containerize Your Game Server for the Best Multiplayer Experience
Containerize Your Game Server for the Best Multiplayer Experience Docker, Inc.
 
How to Improve Your Image Builds Using Advance Docker Build
How to Improve Your Image Builds Using Advance Docker BuildHow to Improve Your Image Builds Using Advance Docker Build
How to Improve Your Image Builds Using Advance Docker BuildDocker, Inc.
 
Build & Deploy Multi-Container Applications to AWS
Build & Deploy Multi-Container Applications to AWSBuild & Deploy Multi-Container Applications to AWS
Build & Deploy Multi-Container Applications to AWSDocker, Inc.
 
Securing Your Containerized Applications with NGINX
Securing Your Containerized Applications with NGINXSecuring Your Containerized Applications with NGINX
Securing Your Containerized Applications with NGINXDocker, Inc.
 
How To Build and Run Node Apps with Docker and Compose
How To Build and Run Node Apps with Docker and ComposeHow To Build and Run Node Apps with Docker and Compose
How To Build and Run Node Apps with Docker and ComposeDocker, Inc.
 
Distributed Deep Learning with Docker at Salesforce
Distributed Deep Learning with Docker at SalesforceDistributed Deep Learning with Docker at Salesforce
Distributed Deep Learning with Docker at SalesforceDocker, Inc.
 
The First 10M Pulls: Building The Official Curl Image for Docker Hub
The First 10M Pulls: Building The Official Curl Image for Docker HubThe First 10M Pulls: Building The Official Curl Image for Docker Hub
The First 10M Pulls: Building The Official Curl Image for Docker HubDocker, Inc.
 
Monitoring in a Microservices World
Monitoring in a Microservices WorldMonitoring in a Microservices World
Monitoring in a Microservices WorldDocker, Inc.
 
COVID-19 in Italy: How Docker is Helping the Biggest Italian IT Company Conti...
COVID-19 in Italy: How Docker is Helping the Biggest Italian IT Company Conti...COVID-19 in Italy: How Docker is Helping the Biggest Italian IT Company Conti...
COVID-19 in Italy: How Docker is Helping the Biggest Italian IT Company Conti...Docker, Inc.
 
Predicting Space Weather with Docker
Predicting Space Weather with DockerPredicting Space Weather with Docker
Predicting Space Weather with DockerDocker, Inc.
 
Become a Docker Power User With Microsoft Visual Studio Code
Become a Docker Power User With Microsoft Visual Studio CodeBecome a Docker Power User With Microsoft Visual Studio Code
Become a Docker Power User With Microsoft Visual Studio CodeDocker, Inc.
 
How to Use Mirroring and Caching to Optimize your Container Registry
How to Use Mirroring and Caching to Optimize your Container RegistryHow to Use Mirroring and Caching to Optimize your Container Registry
How to Use Mirroring and Caching to Optimize your Container RegistryDocker, Inc.
 
Monolithic to Microservices + Docker = SDLC on Steroids!
Monolithic to Microservices + Docker = SDLC on Steroids!Monolithic to Microservices + Docker = SDLC on Steroids!
Monolithic to Microservices + Docker = SDLC on Steroids!Docker, Inc.
 
Kubernetes at Datadog Scale
Kubernetes at Datadog ScaleKubernetes at Datadog Scale
Kubernetes at Datadog ScaleDocker, Inc.
 
Labels, Labels, Labels
Labels, Labels, Labels Labels, Labels, Labels
Labels, Labels, Labels Docker, Inc.
 
Using Docker Hub at Scale to Support Micro Focus' Delivery and Deployment Model
Using Docker Hub at Scale to Support Micro Focus' Delivery and Deployment ModelUsing Docker Hub at Scale to Support Micro Focus' Delivery and Deployment Model
Using Docker Hub at Scale to Support Micro Focus' Delivery and Deployment ModelDocker, Inc.
 
Build & Deploy Multi-Container Applications to AWS
Build & Deploy Multi-Container Applications to AWSBuild & Deploy Multi-Container Applications to AWS
Build & Deploy Multi-Container Applications to AWSDocker, Inc.
 
From Fortran on the Desktop to Kubernetes in the Cloud: A Windows Migration S...
From Fortran on the Desktop to Kubernetes in the Cloud: A Windows Migration S...From Fortran on the Desktop to Kubernetes in the Cloud: A Windows Migration S...
From Fortran on the Desktop to Kubernetes in the Cloud: A Windows Migration S...Docker, Inc.
 
Developing with Docker for the Arm Architecture
Developing with Docker for the Arm ArchitectureDeveloping with Docker for the Arm Architecture
Developing with Docker for the Arm ArchitectureDocker, Inc.
 

Plus de Docker, Inc. (20)

Containerize Your Game Server for the Best Multiplayer Experience
Containerize Your Game Server for the Best Multiplayer Experience Containerize Your Game Server for the Best Multiplayer Experience
Containerize Your Game Server for the Best Multiplayer Experience
 
How to Improve Your Image Builds Using Advance Docker Build
How to Improve Your Image Builds Using Advance Docker BuildHow to Improve Your Image Builds Using Advance Docker Build
How to Improve Your Image Builds Using Advance Docker Build
 
Build & Deploy Multi-Container Applications to AWS
Build & Deploy Multi-Container Applications to AWSBuild & Deploy Multi-Container Applications to AWS
Build & Deploy Multi-Container Applications to AWS
 
Securing Your Containerized Applications with NGINX
Securing Your Containerized Applications with NGINXSecuring Your Containerized Applications with NGINX
Securing Your Containerized Applications with NGINX
 
How To Build and Run Node Apps with Docker and Compose
How To Build and Run Node Apps with Docker and ComposeHow To Build and Run Node Apps with Docker and Compose
How To Build and Run Node Apps with Docker and Compose
 
Hands-on Helm
Hands-on Helm Hands-on Helm
Hands-on Helm
 
Distributed Deep Learning with Docker at Salesforce
Distributed Deep Learning with Docker at SalesforceDistributed Deep Learning with Docker at Salesforce
Distributed Deep Learning with Docker at Salesforce
 
The First 10M Pulls: Building The Official Curl Image for Docker Hub
The First 10M Pulls: Building The Official Curl Image for Docker HubThe First 10M Pulls: Building The Official Curl Image for Docker Hub
The First 10M Pulls: Building The Official Curl Image for Docker Hub
 
Monitoring in a Microservices World
Monitoring in a Microservices WorldMonitoring in a Microservices World
Monitoring in a Microservices World
 
COVID-19 in Italy: How Docker is Helping the Biggest Italian IT Company Conti...
COVID-19 in Italy: How Docker is Helping the Biggest Italian IT Company Conti...COVID-19 in Italy: How Docker is Helping the Biggest Italian IT Company Conti...
COVID-19 in Italy: How Docker is Helping the Biggest Italian IT Company Conti...
 
Predicting Space Weather with Docker
Predicting Space Weather with DockerPredicting Space Weather with Docker
Predicting Space Weather with Docker
 
Become a Docker Power User With Microsoft Visual Studio Code
Become a Docker Power User With Microsoft Visual Studio CodeBecome a Docker Power User With Microsoft Visual Studio Code
Become a Docker Power User With Microsoft Visual Studio Code
 
How to Use Mirroring and Caching to Optimize your Container Registry
How to Use Mirroring and Caching to Optimize your Container RegistryHow to Use Mirroring and Caching to Optimize your Container Registry
How to Use Mirroring and Caching to Optimize your Container Registry
 
Monolithic to Microservices + Docker = SDLC on Steroids!
Monolithic to Microservices + Docker = SDLC on Steroids!Monolithic to Microservices + Docker = SDLC on Steroids!
Monolithic to Microservices + Docker = SDLC on Steroids!
 
Kubernetes at Datadog Scale
Kubernetes at Datadog ScaleKubernetes at Datadog Scale
Kubernetes at Datadog Scale
 
Labels, Labels, Labels
Labels, Labels, Labels Labels, Labels, Labels
Labels, Labels, Labels
 
Using Docker Hub at Scale to Support Micro Focus' Delivery and Deployment Model
Using Docker Hub at Scale to Support Micro Focus' Delivery and Deployment ModelUsing Docker Hub at Scale to Support Micro Focus' Delivery and Deployment Model
Using Docker Hub at Scale to Support Micro Focus' Delivery and Deployment Model
 
Build & Deploy Multi-Container Applications to AWS
Build & Deploy Multi-Container Applications to AWSBuild & Deploy Multi-Container Applications to AWS
Build & Deploy Multi-Container Applications to AWS
 
From Fortran on the Desktop to Kubernetes in the Cloud: A Windows Migration S...
From Fortran on the Desktop to Kubernetes in the Cloud: A Windows Migration S...From Fortran on the Desktop to Kubernetes in the Cloud: A Windows Migration S...
From Fortran on the Desktop to Kubernetes in the Cloud: A Windows Migration S...
 
Developing with Docker for the Arm Architecture
Developing with Docker for the Arm ArchitectureDeveloping with Docker for the Arm Architecture
Developing with Docker for the Arm Architecture
 

Dernier

Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...David Celestin
 
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityHung Le
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lodhisaajjda
 
Introduction to Artificial intelligence.
Introduction to Artificial intelligence.Introduction to Artificial intelligence.
Introduction to Artificial intelligence.thamaeteboho94
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIINhPhngng3
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar TrainingKylaCullinane
 
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...ZurliaSoop
 
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfSOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfMahamudul Hasan
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Baileyhlharris
 
Dreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video TreatmentDreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video Treatmentnswingard
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoKayode Fayemi
 
Zone Chairperson Role and Responsibilities New updated.pptx
Zone Chairperson Role and Responsibilities New updated.pptxZone Chairperson Role and Responsibilities New updated.pptx
Zone Chairperson Role and Responsibilities New updated.pptxlionnarsimharajumjf
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfSkillCertProExams
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...amilabibi1
 
Digital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalDigital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalFabian de Rijk
 

Dernier (17)

Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
 
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.
 
Introduction to Artificial intelligence.
Introduction to Artificial intelligence.Introduction to Artificial intelligence.
Introduction to Artificial intelligence.
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio III
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar Training
 
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
 
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfSOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Bailey
 
Dreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video TreatmentDreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video Treatment
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
 
ICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdfICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdf
 
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait Cityin kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
 
Zone Chairperson Role and Responsibilities New updated.pptx
Zone Chairperson Role and Responsibilities New updated.pptxZone Chairperson Role and Responsibilities New updated.pptx
Zone Chairperson Role and Responsibilities New updated.pptx
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
 
Digital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalDigital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of Drupal
 

Democratizing machine learning on kubernetes

  • 2. Principal Solution Architect Cloud and AI, Microsoft @joyqiao2016 Joy Qiao Principal Program Manager Azure Containers, Microsoft @LachlanEvenson Lachlan Evenson
  • 3. • The Data Scientist • Building and training models • Experience in Machine Learning frameworks/libraries • Basic understanding of computer hardware • Lucky to have Kubernetes experience Who are we? Data Scientist
  • 4. • The Infra Engineer/SRE • Build and maintain baremetal/cloud infra • Kubernetes experience • Little to no Machine Learning library experience Who are we? (continued) Infra Engineer SRE
  • 5. ML on Kubernetes Infra Engineer SRE Data Scientist
  • 6. We have the RIGHT tools and libraries to build and train models We have the RIGHT platform in Kubernetes to run and train these models Why this matters
  • 7. • Two discrete worlds are coming together • The knowledge is not widely accessible to the right audience • Nomenclature • Documentation and use-cases are lacking • APIs are evolving very fast, sample code gets out of date quickly What we’ve experienced
  • 10. Data Parallelism 1. Parallel training on different machines 2. Update the parameters synchronously/asynchronously 3. Refresh the local model with new parameters, go to 1 and repeat Distributed Training Architecture Credits: Taifeng Wang, DMTK team
  • 11. Model Parallelism The global model is partitioned into K sub-models. The sub-models are distributed over K local workers and serve as their local models. In each mini-batch, the local workers compute the gradients of the local weights by back propagation. Distributed Training Architecture Credits: Taifeng Wang, DMTK team
  • 12. For Variable Distribution & Gradient Aggregation • Parameter_server Distributed TensorFlow Architecture Source: https://www.tensorflow.org/performance/performance_models
  • 14. • Kubeflow - https://github.com/kubeflow/kubeflow • JupyterHub, Tensorflow training/serving • Seldon, Pachyderm, Pytorch-job • Hands on Labs – https://aka.ms/kubeflow-labs • Deep Learning Workspace • https://github.com/microsoft/DLWorkspace/ • https://microsoft.github.io/DLWorkspace/ ML on Kubernetes open source tooling
  • 16. Distributed Tensorflow using Kubeflow (cont)
  • 17. • MetaML - https://aka.ms/kubeflow-labs ML on Kubernetes open source tooling
  • 18. In just 3 simple steps Running Distributed TensorFlow on Kubernetes with Kubeflow
  • 19. 1. Create a Kubernetes cluster with GPU enabled nodes (AKS, GKE, EKS) - https://docs.microsoft.com/azure/aks/gpu-cluster 2. Install Kubeflow 3. Run Distributed TensorFlow training job Detailed instructions at https://aka.ms/kubeflow-labs Running Distributed TensorFlow on Kubernetes (continued)
  • 20. Demo
  • 22. Training Environment on Azure Node VMs oNC24r for workers § 4x NVIDIA® Tesla® K80 GPU § 24 CPU cores, 224 GB RAM oD14_v2 for parameter server § 16 CPU cores, 112 GB RAM Kubernetes: AKS (Azure Kubernetes Service) v1.9.6 GPU: NVIDIA® Tesla® K80 Benchmarks scripts: https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks • OS: Ubuntu 16.04 LTS • TensorFlow: 1.8 • CUDA / cuDNN: 9.0 / 7.0 • Disk: Local SSD • DataSet: ImageNet (real data, not synthetic)
  • 23. •Linear scalability •GPUs are fully saturated Training on Single Pod, Multi-GPU
  • 24. Settings: •Topology: 1 ps and 2 workers •Async variables update •Using cpu as the local_parameter_device •Each ps/worker pod has its own dedicated host •variable_update mode: parameter_server •Network protocol: gRPC Distributed Training
  • 25. Single Pod Training with 4 GPUs vs Distributed Training with 2 workers with 8 GPUs Distributed Training (continued)
  • 26. Distributed Training (continued) Training Speedup on 2 nodes vs single-node The model with a higher ratio scales better. Distributed training scalability depends on the compute/communication ratio of the model
  • 27. Observations during test: • Linear scalability largely depends on the model and network bandwidth. • GPUs not fully saturated on the worker nodes, likely due to network bottleneck. • VGG16 does not scale across multiple nodes. GPUs “starved” most of the time. • Having parameter servers running on the same pods as the workers seem to have worse performance • Tricky to decide the right ratio of workers to parameter servers • Sync vs Async variable updates Distributed Training (continued)
  • 29. Benchmark on 32 servers with 4 Pascal GPUs each connected by RoCE-capable 25 Gbit/s network (source: https://github.com/uber/horovod) Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow, Keras, and PyTorch • A stand-alone python package • Seamless install on top of TensorFlow • Uses NCCL for ring- allreduce across servers instead of parameter server • Uses MPI for worker discovery and reduction coordination • Tensor Fusion
  • 30. Benchmark on AKS – Traditional Distributed TensorFlow vs Horovod Horovod: Uber’s Open Source Distributed Deep Learning Framework Training Speedup on 2 nodes vs single-node
  • 31. Ecosystem open source Tooling for Distributed training on Kubernetes “FreeFlow” CNI plugin from Microsoft Research (Credits: Yibo Zhu from MS Research) https://github.com/Microsoft/Freeflow • Speeds up container overlay network communication to the same as host machine • Supports both RDMA and TCP • Higher throughput, lower latency, and less CPU overhead • Pure user space implementation • Not relying on anything specific to Kubernetes. • Detailed steps on how to setup FreeFlow in K8s at https://github.com/joyq-github/TensorFlowonK8s/blob/master/FreeFlow.md
  • 32. Ecosystem open source Tooling for Distributed training on Kubernetes (continued) Custom CRI & Scheduler: GPU-related resource scheduling on K8s (Credits: Sanjeev Mehrotra from MS Research) • Pods with no. of GPUs with how much memory • Pods with no. of GPUs interconnected via NVLink, etc. • https://github.com/Microsoft/KubeGPU
  • 33. Fast.ai: Training Imagenet in 3 hours for $25 and CIFAR10 in 3 minutes for $0.26 http://www.fast.ai/2018/04/30/dawnbench-fastai/ Speed to accuracy is very important, in addition to images/sec Algorithmic creativity is more important than bare-metal performance • Super convergence - slowly increase the learning rate whilst decreasing momentum during training https://arxiv.org/abs/1708.07120 • Progressive resizing - train on smaller images at the start of training, and gradually increase image size as you train further o Progressive Growing of GANs https://arxiv.org/abs/1710.10196 o Enhanced Deep Residual Networks https://arxiv.org/abs/1707.02921 • Half precision arithmetic • Wide Residual Networks https://arxiv.org/abs/1605.07146
  • 34. • Distributed Training Architectures • Open Source tooling on Kubernetes • Performance benchmarks • Ecosystem libraries and tooling In Summary
  • 35. Kubeflow – https://github.com/kubeflow/kubeflow Kubeflow Labs – https://aka.ms/kubeflow-labs Deep Learning Workspace – https://github.com/microsoft/DLWorkspace/ AKS GPU Cluster create - https://docs.microsoft.com/azure/aks/gpu-cluster Resources
  • 36. Horovod - https://github.com/uber/horovod FreeFlow - https://github.com/Microsoft/Freeflow FreeFlow setup docs - https://github.com/joyq-github/TensorFlowonK8s/blob/master/FreeFlow.md Kubernetes GPU Scheduler - https://github.com/Microsoft/KubeGPU Fast.ai - http://www.fast.ai/2018/04/30/dawnbench-fastai/ Resources (continued)
  • 37. You can reach us on Twitter via: @joyqiao2016 @LachlanEvenson Thank you