SlideShare une entreprise Scribd logo
1  sur  26
Télécharger pour lire hors ligne
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Uber on Using Horovod for
Distributed Deep Learning
Alex Sergeev
Horovod Project Lead
Uber Technologies, Inc.
A I M 4 1 1
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Why deep learning?
• Continues to improve accuracy
long after older algorithms
reach data saturation
• State of the art for vision,
machine translation, and
many other domains
• Capitalize on massive amount
of research happening in the
global community
- Andrew Ng
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Massive amounts of data . . .
. . . make things slow (weeks!)
Solution
distributed training
p3x16large has
128GB GPU memory
Most models fit in a server
→ Use data-parallel training
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Goals
There are many ways to do data-parallel training
Some are more confusing than others. UX varies greatly
Our goals
1. Infrastructure people (like me ) deal with choosing servers, network
gear, container environment, default containers, and tuning
distributed training performance
2. ML engineers focus on making great models that improve business
using deep learning frameworks that they love
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Meet Horovod
• Library for distributed deep learning
• Works with stock TensorFlow, Keras, and
PyTorch
• Installs on top with `pip install horovod`
• Uses advanced algorithms & can leverage
features of high-performance networks
(RDMA, GPUDirect)
• Separates infrastructure from ML engineers
• Infra team provides container & MPI environment
• ML engineers use DL frameworks that they love
• Both Infra team and ML engineers have consistent
expectations for distributed training across frameworks
horovod.ai
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Initialize the library
import horovod.tensorflow as hvd
hvd.init()
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Adjust LR & add Horovod Distributed Optimizer
opt = tf.train.MomentumOptimizer(lr=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Synchronize initial state between workers
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, …) as mon_sess:
…
# Or
bcast_op = hvd.broadcast_global_variables(0)
sess.run(bcast_op)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Use checkpoints only on first worker
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir, …) as mon_sess:
…
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Full example
import tensorflow as tf
import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()
# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list =
str(hvd.local_rank())
# Build model...
loss = ...
opt = tf.train.MomentumOptimizer(
lr=0.01 * hvd.size())
# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
# Add hook to synchronize initial state
hooks =[hvd.BroadcastGlobalVariablesHook(0)]
# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" 
if hvd.rank() == 0 else None
# Make training operation
train_op = opt.minimize(loss)
# The MonitoredTrainingSession takes care of
# session initialization, restoring from a
# checkpoint, saving to a checkpoint, and
# closing when done or an error occurs.
with
tf.train.MonitoredTrainingSession(checkpoint
_dir=ckpt_dir, config=config, hooks=hooks)
as mon_sess:
while not mon_sess.should_stop():
# Perform synchronous training.
mon_sess.run(train_op)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Run!
# Use AWS Deep Learning AMI
laptop$ ssh ubuntu@<aws-ip-1>
aws-ip-1$ source activate tensorflow_p27
aws-ip-1$ ssh-keygen
aws-ip-1$ cat /home/ubuntu/.ssh/id_rsa.pub
[copy contents of the pubkey]
aws-ip-1$ exit
laptop$ ssh ubuntu@<aws-ip-2>
aws-ip-2$ source activate tensorflow_p27
aws-ip-2$ cat >>
/home/ubuntu/.ssh/authorized_keys
[paste contents of the pubkey]
aws-ip-2$ exit
laptop$ ssh ubuntu@<aws-ip-1>
aws-ip-2$ ssh aws-ip-2
[will ask for prompt, say yes]
aws-ip-2$ exit
aws-ip-1$ mpirun -np 2 
-H aws-ip-1,aws-ip-2 
wget
https://raw.githubusercontent.com/uber/horov
od/master/examples/tensorflow_mnist.py
aws-ip-1$ mpirun -bind-to none -map-by slot
-x HOROVOD_HIERARCHICAL_ALLREDUCE=1 
-x LD_LIBRARY_PATH -x PATH 
-mca btl_tcp_if_exclude lo,docker0 
–np 16 -H aws-ip-1:8,aws-ip-2:8 
python tensorflow_mnist.py
# Pro tip: hide mpirun args into mpirun.sh
aws-ip-1$ mpirun.sh 
–np 16 –H aws-ip-1:8,aws-ip-2:8 
python tensorflow_mnist.py
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Horovod for TensorFlow
import horovod.tensorflow as hvd
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Horovod for TensorFlow, Keras, and PyTorch
import horovod.tensorflow as hvd
import horovod.keras as hvd
import horovod.tensorflow.keras as hvd
import horovod.torch as hvd
# more frameworks coming
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Full example – Keras
import keras
from keras import backend as K
import tensorflow as tf
import horovod.keras as hvd
# Initialize Horovod.
hvd.init()
# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list =
str(hvd.local_rank())
K.set_session(tf.Session(config=config))
# Build model…
model = ...
opt = keras.optimizers.Adadelta(
lr=1.0 * hvd.size())
# Add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)
model.compile(
loss='categorical_crossentropy',
optimizer=opt,
metrics=['accuracy'])
# Broadcast initial variable state.
callbacks =
[hvd.callbacks.BroadcastGlobalVariablesCallb
ack(0)]
model.fit(
x_train,
y_train,
callbacks=callbacks,
epochs=10,
validation_data=(x_test, y_test))
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Full example – Estimator API
import tensorflow as tf
import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()
# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list =
str(hvd.local_rank())
# Build model...
def model_fn(features, labels, mode):
loss = …
opt = tf.train.MomentumOptimizer(
lr=0.01 * hvd.size())
# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
return tf.estimator.EstimatorSpec(…)
# Broadcast initial variable state.
hooks = 
[hvd.BroadcastGlobalVariablesHook(0)]
# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" 
if hvd.rank() == 0 else None
# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
model_fn=cnn_model_fn,
model_dir=ckpt_dir,
config=tf.estimator.RunConfig(
session_config=config))
mnist_classifier.train(
input_fn=train_input_fn,
steps=100,
hooks=hooks)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Full example – PyTorch
import torch
import horovod.torch as hvd
# Initialize Horovod
hvd.init()
# Horovod: pin GPU to local rank.
torch.cuda.set_device(hvd.local_rank())
# Build model.
model = Net()
model.cuda()
optimizer = optim.SGD(model.parameters())
# Wrap optimizer with DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(
optimizer,
named_parameters=model.named_parameters())
# Horovod: broadcast parameters.
hvd.broadcast_parameters(
model.state_dict(),
root_rank=0)
for epoch in range(100):
for batch_idx, (data, target) in …:
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
optimizer.step()
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Scaling Horovod
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Horovod knobs
• Single-ring NCCL versus Hierarchical All-reduce
• HOROVOD_HIERARCHICAL_ALLREDUCE=1
• Tensor Fusion
• HOROVOD_FUSION_THRESHOLD=67108864
• HOROVOD_CYCLE_TIME=5
• FP16 all-reduce
• hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)
Stay tuned for more . . .
• Auto-tuning
• Further gradient compression
• Local gradient aggregation
Thank you!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Alex Sergeev
@alsrgv
Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any
form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains
information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are
notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of,
disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the
extent necessary for consultations with authorized personnel of Uber.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Contenu connexe

Tendances

Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Amazon Web Services
 
Building Deep Learning Applications with TensorFlow and SageMaker on AWS - Te...
Building Deep Learning Applications with TensorFlow and SageMaker on AWS - Te...Building Deep Learning Applications with TensorFlow and SageMaker on AWS - Te...
Building Deep Learning Applications with TensorFlow and SageMaker on AWS - Te...Amazon Web Services
 
[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...
[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...
[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...Amazon Web Services
 
Integrating Amazon SageMaker into your Enterprise - AWS Online Tech Talks
Integrating Amazon SageMaker into your Enterprise - AWS Online Tech TalksIntegrating Amazon SageMaker into your Enterprise - AWS Online Tech Talks
Integrating Amazon SageMaker into your Enterprise - AWS Online Tech TalksAmazon Web Services
 
Hollywood's Cloud-Based Content Lakes: Modernized Media Archives (MAE203) - A...
Hollywood's Cloud-Based Content Lakes: Modernized Media Archives (MAE203) - A...Hollywood's Cloud-Based Content Lakes: Modernized Media Archives (MAE203) - A...
Hollywood's Cloud-Based Content Lakes: Modernized Media Archives (MAE203) - A...Amazon Web Services
 
[NEW LAUNCH!] [REPEAT 1] Amazon FSx for Lustre: How to build and deploy file ...
[NEW LAUNCH!] [REPEAT 1] Amazon FSx for Lustre: How to build and deploy file ...[NEW LAUNCH!] [REPEAT 1] Amazon FSx for Lustre: How to build and deploy file ...
[NEW LAUNCH!] [REPEAT 1] Amazon FSx for Lustre: How to build and deploy file ...Amazon Web Services
 
Working with Scalable Machine Learning Algorithms in Amazon SageMaker - AWS O...
Working with Scalable Machine Learning Algorithms in Amazon SageMaker - AWS O...Working with Scalable Machine Learning Algorithms in Amazon SageMaker - AWS O...
Working with Scalable Machine Learning Algorithms in Amazon SageMaker - AWS O...Amazon Web Services
 
Solve Common Voice UI Challenges with Advanced Dialog Management Techniques (...
Solve Common Voice UI Challenges with Advanced Dialog Management Techniques (...Solve Common Voice UI Challenges with Advanced Dialog Management Techniques (...
Solve Common Voice UI Challenges with Advanced Dialog Management Techniques (...Amazon Web Services
 
Democratize Data Preparation for Analytics & Machine Learning A Hands-On Lab ...
Democratize Data Preparation for Analytics & Machine Learning A Hands-On Lab ...Democratize Data Preparation for Analytics & Machine Learning A Hands-On Lab ...
Democratize Data Preparation for Analytics & Machine Learning A Hands-On Lab ...Amazon Web Services
 
Debugging Gluon and Apache MXNet (AIM423) - AWS re:Invent 2018
Debugging Gluon and Apache MXNet (AIM423) - AWS re:Invent 2018Debugging Gluon and Apache MXNet (AIM423) - AWS re:Invent 2018
Debugging Gluon and Apache MXNet (AIM423) - AWS re:Invent 2018Amazon Web Services
 
Deep Dive on Amazon Rekognition, ft. Tinder & News UK (AIM307-R) - AWS re:Inv...
Deep Dive on Amazon Rekognition, ft. Tinder & News UK (AIM307-R) - AWS re:Inv...Deep Dive on Amazon Rekognition, ft. Tinder & News UK (AIM307-R) - AWS re:Inv...
Deep Dive on Amazon Rekognition, ft. Tinder & News UK (AIM307-R) - AWS re:Inv...Amazon Web Services
 
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018Amazon Web Services
 
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018Amazon Web Services
 
Train Models on Amazon SageMaker Using Data Not from Amazon S3 (AIM419) - AWS...
Train Models on Amazon SageMaker Using Data Not from Amazon S3 (AIM419) - AWS...Train Models on Amazon SageMaker Using Data Not from Amazon S3 (AIM419) - AWS...
Train Models on Amazon SageMaker Using Data Not from Amazon S3 (AIM419) - AWS...Amazon Web Services
 
Go Global with Cloud-Native Architecture: Deploy AdTech Services Across Four ...
Go Global with Cloud-Native Architecture: Deploy AdTech Services Across Four ...Go Global with Cloud-Native Architecture: Deploy AdTech Services Across Four ...
Go Global with Cloud-Native Architecture: Deploy AdTech Services Across Four ...Amazon Web Services
 
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...Amazon Web Services
 
Searching Your Data with Amazon Elasticsearch Service (ANT384) - AWS re:Inven...
Searching Your Data with Amazon Elasticsearch Service (ANT384) - AWS re:Inven...Searching Your Data with Amazon Elasticsearch Service (ANT384) - AWS re:Inven...
Searching Your Data with Amazon Elasticsearch Service (ANT384) - AWS re:Inven...Amazon Web Services
 
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018Amazon Web Services
 
What's New in AR & VR: State of the World Report (ARV203) - AWS re:Invent 2018
What's New in AR & VR: State of the World Report (ARV203) - AWS re:Invent 2018What's New in AR & VR: State of the World Report (ARV203) - AWS re:Invent 2018
What's New in AR & VR: State of the World Report (ARV203) - AWS re:Invent 2018Amazon Web Services
 

Tendances (20)

Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
 
AWS reInvent 2018 recap edition
AWS reInvent 2018 recap editionAWS reInvent 2018 recap edition
AWS reInvent 2018 recap edition
 
Building Deep Learning Applications with TensorFlow and SageMaker on AWS - Te...
Building Deep Learning Applications with TensorFlow and SageMaker on AWS - Te...Building Deep Learning Applications with TensorFlow and SageMaker on AWS - Te...
Building Deep Learning Applications with TensorFlow and SageMaker on AWS - Te...
 
[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...
[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...
[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...
 
Integrating Amazon SageMaker into your Enterprise - AWS Online Tech Talks
Integrating Amazon SageMaker into your Enterprise - AWS Online Tech TalksIntegrating Amazon SageMaker into your Enterprise - AWS Online Tech Talks
Integrating Amazon SageMaker into your Enterprise - AWS Online Tech Talks
 
Hollywood's Cloud-Based Content Lakes: Modernized Media Archives (MAE203) - A...
Hollywood's Cloud-Based Content Lakes: Modernized Media Archives (MAE203) - A...Hollywood's Cloud-Based Content Lakes: Modernized Media Archives (MAE203) - A...
Hollywood's Cloud-Based Content Lakes: Modernized Media Archives (MAE203) - A...
 
[NEW LAUNCH!] [REPEAT 1] Amazon FSx for Lustre: How to build and deploy file ...
[NEW LAUNCH!] [REPEAT 1] Amazon FSx for Lustre: How to build and deploy file ...[NEW LAUNCH!] [REPEAT 1] Amazon FSx for Lustre: How to build and deploy file ...
[NEW LAUNCH!] [REPEAT 1] Amazon FSx for Lustre: How to build and deploy file ...
 
Working with Scalable Machine Learning Algorithms in Amazon SageMaker - AWS O...
Working with Scalable Machine Learning Algorithms in Amazon SageMaker - AWS O...Working with Scalable Machine Learning Algorithms in Amazon SageMaker - AWS O...
Working with Scalable Machine Learning Algorithms in Amazon SageMaker - AWS O...
 
Solve Common Voice UI Challenges with Advanced Dialog Management Techniques (...
Solve Common Voice UI Challenges with Advanced Dialog Management Techniques (...Solve Common Voice UI Challenges with Advanced Dialog Management Techniques (...
Solve Common Voice UI Challenges with Advanced Dialog Management Techniques (...
 
Democratize Data Preparation for Analytics & Machine Learning A Hands-On Lab ...
Democratize Data Preparation for Analytics & Machine Learning A Hands-On Lab ...Democratize Data Preparation for Analytics & Machine Learning A Hands-On Lab ...
Democratize Data Preparation for Analytics & Machine Learning A Hands-On Lab ...
 
Debugging Gluon and Apache MXNet (AIM423) - AWS re:Invent 2018
Debugging Gluon and Apache MXNet (AIM423) - AWS re:Invent 2018Debugging Gluon and Apache MXNet (AIM423) - AWS re:Invent 2018
Debugging Gluon and Apache MXNet (AIM423) - AWS re:Invent 2018
 
Deep Dive on Amazon Rekognition, ft. Tinder & News UK (AIM307-R) - AWS re:Inv...
Deep Dive on Amazon Rekognition, ft. Tinder & News UK (AIM307-R) - AWS re:Inv...Deep Dive on Amazon Rekognition, ft. Tinder & News UK (AIM307-R) - AWS re:Inv...
Deep Dive on Amazon Rekognition, ft. Tinder & News UK (AIM307-R) - AWS re:Inv...
 
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018
What Can Your Logs Tell You? (ANT215) - AWS re:Invent 2018
 
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
 
Train Models on Amazon SageMaker Using Data Not from Amazon S3 (AIM419) - AWS...
Train Models on Amazon SageMaker Using Data Not from Amazon S3 (AIM419) - AWS...Train Models on Amazon SageMaker Using Data Not from Amazon S3 (AIM419) - AWS...
Train Models on Amazon SageMaker Using Data Not from Amazon S3 (AIM419) - AWS...
 
Go Global with Cloud-Native Architecture: Deploy AdTech Services Across Four ...
Go Global with Cloud-Native Architecture: Deploy AdTech Services Across Four ...Go Global with Cloud-Native Architecture: Deploy AdTech Services Across Four ...
Go Global with Cloud-Native Architecture: Deploy AdTech Services Across Four ...
 
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
Alexa, Ask Jarvis to Create a Serverless App for Me (SRV315) - AWS re:Invent ...
 
Searching Your Data with Amazon Elasticsearch Service (ANT384) - AWS re:Inven...
Searching Your Data with Amazon Elasticsearch Service (ANT384) - AWS re:Inven...Searching Your Data with Amazon Elasticsearch Service (ANT384) - AWS re:Inven...
Searching Your Data with Amazon Elasticsearch Service (ANT384) - AWS re:Inven...
 
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
 
What's New in AR & VR: State of the World Report (ARV203) - AWS re:Invent 2018
What's New in AR & VR: State of the World Report (ARV203) - AWS re:Invent 2018What's New in AR & VR: State of the World Report (ARV203) - AWS re:Invent 2018
What's New in AR & VR: State of the World Report (ARV203) - AWS re:Invent 2018
 

Similaire à Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent 2018

Distributed Model Training using MXNet with Horovod
Distributed Model Training using MXNet with HorovodDistributed Model Training using MXNet with Horovod
Distributed Model Training using MXNet with HorovodLin Yuan
 
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...Amazon Web Services
 
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Databricks
 
From Notebook to production with Amazon SageMaker
From Notebook to production with Amazon SageMakerFrom Notebook to production with Amazon SageMaker
From Notebook to production with Amazon SageMakerAmazon Web Services
 
Amazon SageMaker (December 2018)
Amazon SageMaker (December 2018)Amazon SageMaker (December 2018)
Amazon SageMaker (December 2018)Julien SIMON
 
Julien Simon, Principal Technical Evangelist at Amazon - Machine Learning: Fr...
Julien Simon, Principal Technical Evangelist at Amazon - Machine Learning: Fr...Julien Simon, Principal Technical Evangelist at Amazon - Machine Learning: Fr...
Julien Simon, Principal Technical Evangelist at Amazon - Machine Learning: Fr...Codiax
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019Jim Dowling
 
An Introduction to Amazon SageMaker (October 2018)
An Introduction to Amazon SageMaker (October 2018)An Introduction to Amazon SageMaker (October 2018)
An Introduction to Amazon SageMaker (October 2018)Julien SIMON
 
Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...
Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...
Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...Amazon Web Services
 
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...Amazon Web Services
 
Deep Learning 모델의 효과적인 분산 트레이닝과 모델 최적화 방법 - 김무현 데이터 사이언티스트, AWS :: AWS Summit...
Deep Learning 모델의 효과적인 분산 트레이닝과 모델 최적화 방법 - 김무현 데이터 사이언티스트, AWS :: AWS Summit...Deep Learning 모델의 효과적인 분산 트레이닝과 모델 최적화 방법 - 김무현 데이터 사이언티스트, AWS :: AWS Summit...
Deep Learning 모델의 효과적인 분산 트레이닝과 모델 최적화 방법 - 김무현 데이터 사이언티스트, AWS :: AWS Summit...Amazon Web Services Korea
 
Build, train and deploy your ML models with Amazon Sage Maker
Build, train and deploy your ML models with Amazon Sage MakerBuild, train and deploy your ML models with Amazon Sage Maker
Build, train and deploy your ML models with Amazon Sage MakerAWS User Group Bengaluru
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...Chetan Khatri
 
How to lock a Python in a cage? Managing Python environment inside an R project
How to lock a Python in a cage?  Managing Python environment inside an R projectHow to lock a Python in a cage?  Managing Python environment inside an R project
How to lock a Python in a cage? Managing Python environment inside an R projectWLOG Solutions
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using MahoutIMC Institute
 
Construindo Aplicações Deep Learning com TensorFlow e Amazon SageMaker - MCL...
Construindo Aplicações Deep Learning com TensorFlow e Amazon SageMaker -  MCL...Construindo Aplicações Deep Learning com TensorFlow e Amazon SageMaker -  MCL...
Construindo Aplicações Deep Learning com TensorFlow e Amazon SageMaker - MCL...Amazon Web Services
 
Introduction to Scalable Deep Learning on AWS with Apache MXNet
Introduction to Scalable Deep Learning on AWS with Apache MXNetIntroduction to Scalable Deep Learning on AWS with Apache MXNet
Introduction to Scalable Deep Learning on AWS with Apache MXNetAmazon Web Services
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopJulien SIMON
 
Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...
Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...
Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...Amazon Web Services
 
AWS Lambda를 통한 Tensorflow 및 Keras 기반 추론 모델 서비스하기 :: 이준범 :: AWS Summit Seoul 2018
AWS Lambda를 통한 Tensorflow 및 Keras 기반 추론 모델 서비스하기 :: 이준범 :: AWS Summit Seoul 2018AWS Lambda를 통한 Tensorflow 및 Keras 기반 추론 모델 서비스하기 :: 이준범 :: AWS Summit Seoul 2018
AWS Lambda를 통한 Tensorflow 및 Keras 기반 추론 모델 서비스하기 :: 이준범 :: AWS Summit Seoul 2018Amazon Web Services Korea
 

Similaire à Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent 2018 (20)

Distributed Model Training using MXNet with Horovod
Distributed Model Training using MXNet with HorovodDistributed Model Training using MXNet with Horovod
Distributed Model Training using MXNet with Horovod
 
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
Build Deep Learning Applications Using Apache MXNet - Featuring Chick-fil-A (...
 
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
 
From Notebook to production with Amazon SageMaker
From Notebook to production with Amazon SageMakerFrom Notebook to production with Amazon SageMaker
From Notebook to production with Amazon SageMaker
 
Amazon SageMaker (December 2018)
Amazon SageMaker (December 2018)Amazon SageMaker (December 2018)
Amazon SageMaker (December 2018)
 
Julien Simon, Principal Technical Evangelist at Amazon - Machine Learning: Fr...
Julien Simon, Principal Technical Evangelist at Amazon - Machine Learning: Fr...Julien Simon, Principal Technical Evangelist at Amazon - Machine Learning: Fr...
Julien Simon, Principal Technical Evangelist at Amazon - Machine Learning: Fr...
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
 
An Introduction to Amazon SageMaker (October 2018)
An Introduction to Amazon SageMaker (October 2018)An Introduction to Amazon SageMaker (October 2018)
An Introduction to Amazon SageMaker (October 2018)
 
Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...
Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...
Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...
 
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
 
Deep Learning 모델의 효과적인 분산 트레이닝과 모델 최적화 방법 - 김무현 데이터 사이언티스트, AWS :: AWS Summit...
Deep Learning 모델의 효과적인 분산 트레이닝과 모델 최적화 방법 - 김무현 데이터 사이언티스트, AWS :: AWS Summit...Deep Learning 모델의 효과적인 분산 트레이닝과 모델 최적화 방법 - 김무현 데이터 사이언티스트, AWS :: AWS Summit...
Deep Learning 모델의 효과적인 분산 트레이닝과 모델 최적화 방법 - 김무현 데이터 사이언티스트, AWS :: AWS Summit...
 
Build, train and deploy your ML models with Amazon Sage Maker
Build, train and deploy your ML models with Amazon Sage MakerBuild, train and deploy your ML models with Amazon Sage Maker
Build, train and deploy your ML models with Amazon Sage Maker
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
How to lock a Python in a cage? Managing Python environment inside an R project
How to lock a Python in a cage?  Managing Python environment inside an R projectHow to lock a Python in a cage?  Managing Python environment inside an R project
How to lock a Python in a cage? Managing Python environment inside an R project
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
 
Construindo Aplicações Deep Learning com TensorFlow e Amazon SageMaker - MCL...
Construindo Aplicações Deep Learning com TensorFlow e Amazon SageMaker -  MCL...Construindo Aplicações Deep Learning com TensorFlow e Amazon SageMaker -  MCL...
Construindo Aplicações Deep Learning com TensorFlow e Amazon SageMaker - MCL...
 
Introduction to Scalable Deep Learning on AWS with Apache MXNet
Introduction to Scalable Deep Learning on AWS with Apache MXNetIntroduction to Scalable Deep Learning on AWS with Apache MXNet
Introduction to Scalable Deep Learning on AWS with Apache MXNet
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
 
Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...
Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...
Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...
 
AWS Lambda를 통한 Tensorflow 및 Keras 기반 추론 모델 서비스하기 :: 이준범 :: AWS Summit Seoul 2018
AWS Lambda를 통한 Tensorflow 및 Keras 기반 추론 모델 서비스하기 :: 이준범 :: AWS Summit Seoul 2018AWS Lambda를 통한 Tensorflow 및 Keras 기반 추론 모델 서비스하기 :: 이준범 :: AWS Summit Seoul 2018
AWS Lambda를 통한 Tensorflow 및 Keras 기반 추론 모델 서비스하기 :: 이준범 :: AWS Summit Seoul 2018
 

Plus de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Uber on Using Horovod for Distributed Deep Learning Alex Sergeev Horovod Project Lead Uber Technologies, Inc. A I M 4 1 1
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Why deep learning? • Continues to improve accuracy long after older algorithms reach data saturation • State of the art for vision, machine translation, and many other domains • Capitalize on massive amount of research happening in the global community - Andrew Ng
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Massive amounts of data . . . . . . make things slow (weeks!) Solution distributed training p3x16large has 128GB GPU memory Most models fit in a server → Use data-parallel training
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Goals There are many ways to do data-parallel training Some are more confusing than others. UX varies greatly Our goals 1. Infrastructure people (like me ) deal with choosing servers, network gear, container environment, default containers, and tuning distributed training performance 2. ML engineers focus on making great models that improve business using deep learning frameworks that they love
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Meet Horovod • Library for distributed deep learning • Works with stock TensorFlow, Keras, and PyTorch • Installs on top with `pip install horovod` • Uses advanced algorithms & can leverage features of high-performance networks (RDMA, GPUDirect) • Separates infrastructure from ML engineers • Infra team provides container & MPI environment • ML engineers use DL frameworks that they love • Both Infra team and ML engineers have consistent expectations for distributed training across frameworks horovod.ai
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Initialize the library import horovod.tensorflow as hvd hvd.init()
  • 9. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Pin GPU to be used config = tf.ConfigProto() config.gpu_options.visible_device_list = str(hvd.local_rank())
  • 10. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Adjust LR & add Horovod Distributed Optimizer opt = tf.train.MomentumOptimizer(lr=0.01 * hvd.size()) opt = hvd.DistributedOptimizer(opt)
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Synchronize initial state between workers hooks = [hvd.BroadcastGlobalVariablesHook(0)] with tf.train.MonitoredTrainingSession(hooks=hooks, …) as mon_sess: … # Or bcast_op = hvd.broadcast_global_variables(0) sess.run(bcast_op)
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Use checkpoints only on first worker ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir, …) as mon_sess: …
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Full example import tensorflow as tf import horovod.tensorflow as hvd # Initialize Horovod hvd.init() # Pin GPU to be used config = tf.ConfigProto() config.gpu_options.visible_device_list = str(hvd.local_rank()) # Build model... loss = ... opt = tf.train.MomentumOptimizer( lr=0.01 * hvd.size()) # Add Horovod Distributed Optimizer opt = hvd.DistributedOptimizer(opt) # Add hook to synchronize initial state hooks =[hvd.BroadcastGlobalVariablesHook(0)] # Only checkpoint on rank 0 ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None # Make training operation train_op = opt.minimize(loss) # The MonitoredTrainingSession takes care of # session initialization, restoring from a # checkpoint, saving to a checkpoint, and # closing when done or an error occurs. with tf.train.MonitoredTrainingSession(checkpoint _dir=ckpt_dir, config=config, hooks=hooks) as mon_sess: while not mon_sess.should_stop(): # Perform synchronous training. mon_sess.run(train_op)
  • 14. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Run! # Use AWS Deep Learning AMI laptop$ ssh ubuntu@<aws-ip-1> aws-ip-1$ source activate tensorflow_p27 aws-ip-1$ ssh-keygen aws-ip-1$ cat /home/ubuntu/.ssh/id_rsa.pub [copy contents of the pubkey] aws-ip-1$ exit laptop$ ssh ubuntu@<aws-ip-2> aws-ip-2$ source activate tensorflow_p27 aws-ip-2$ cat >> /home/ubuntu/.ssh/authorized_keys [paste contents of the pubkey] aws-ip-2$ exit laptop$ ssh ubuntu@<aws-ip-1> aws-ip-2$ ssh aws-ip-2 [will ask for prompt, say yes] aws-ip-2$ exit aws-ip-1$ mpirun -np 2 -H aws-ip-1,aws-ip-2 wget https://raw.githubusercontent.com/uber/horov od/master/examples/tensorflow_mnist.py aws-ip-1$ mpirun -bind-to none -map-by slot -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x LD_LIBRARY_PATH -x PATH -mca btl_tcp_if_exclude lo,docker0 –np 16 -H aws-ip-1:8,aws-ip-2:8 python tensorflow_mnist.py # Pro tip: hide mpirun args into mpirun.sh aws-ip-1$ mpirun.sh –np 16 –H aws-ip-1:8,aws-ip-2:8 python tensorflow_mnist.py
  • 15. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 16. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Horovod for TensorFlow import horovod.tensorflow as hvd
  • 17. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Horovod for TensorFlow, Keras, and PyTorch import horovod.tensorflow as hvd import horovod.keras as hvd import horovod.tensorflow.keras as hvd import horovod.torch as hvd # more frameworks coming
  • 18. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Full example – Keras import keras from keras import backend as K import tensorflow as tf import horovod.keras as hvd # Initialize Horovod. hvd.init() # Pin GPU to be used config = tf.ConfigProto() config.gpu_options.visible_device_list = str(hvd.local_rank()) K.set_session(tf.Session(config=config)) # Build model… model = ... opt = keras.optimizers.Adadelta( lr=1.0 * hvd.size()) # Add Horovod Distributed Optimizer. opt = hvd.DistributedOptimizer(opt) model.compile( loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy']) # Broadcast initial variable state. callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallb ack(0)] model.fit( x_train, y_train, callbacks=callbacks, epochs=10, validation_data=(x_test, y_test))
  • 19. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Full example – Estimator API import tensorflow as tf import horovod.tensorflow as hvd # Initialize Horovod hvd.init() # Pin GPU to be used config = tf.ConfigProto() config.gpu_options.visible_device_list = str(hvd.local_rank()) # Build model... def model_fn(features, labels, mode): loss = … opt = tf.train.MomentumOptimizer( lr=0.01 * hvd.size()) # Add Horovod Distributed Optimizer opt = hvd.DistributedOptimizer(opt) return tf.estimator.EstimatorSpec(…) # Broadcast initial variable state. hooks = [hvd.BroadcastGlobalVariablesHook(0)] # Only checkpoint on rank 0 ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None # Create the Estimator mnist_classifier = tf.estimator.Estimator( model_fn=cnn_model_fn, model_dir=ckpt_dir, config=tf.estimator.RunConfig( session_config=config)) mnist_classifier.train( input_fn=train_input_fn, steps=100, hooks=hooks)
  • 20. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Full example – PyTorch import torch import horovod.torch as hvd # Initialize Horovod hvd.init() # Horovod: pin GPU to local rank. torch.cuda.set_device(hvd.local_rank()) # Build model. model = Net() model.cuda() optimizer = optim.SGD(model.parameters()) # Wrap optimizer with DistributedOptimizer. optimizer = hvd.DistributedOptimizer( optimizer, named_parameters=model.named_parameters()) # Horovod: broadcast parameters. hvd.broadcast_parameters( model.state_dict(), root_rank=0) for epoch in range(100): for batch_idx, (data, target) in …: optimizer.zero_grad() output = model(data) loss = F.nll_loss(output, target) loss.backward() optimizer.step()
  • 21. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 22. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Scaling Horovod
  • 23. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Horovod knobs • Single-ring NCCL versus Hierarchical All-reduce • HOROVOD_HIERARCHICAL_ALLREDUCE=1 • Tensor Fusion • HOROVOD_FUSION_THRESHOLD=67108864 • HOROVOD_CYCLE_TIME=5 • FP16 all-reduce • hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16) Stay tuned for more . . . • Auto-tuning • Further gradient compression • Local gradient aggregation
  • 24. Thank you! © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Alex Sergeev @alsrgv
  • 25. Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.
  • 26. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.