Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Uber on Using Horovod for
Distributed Deep Learning
Alex Sergeev
Horovod Project Lead
Uber Technologies, Inc.
A I M 4 1 1

Why deep learning?
• Continues to improve accuracy
long after older algorithms
reach data saturation
• State of the art for vision,
machine translation, and
many other domains
• Capitalize on massive amount
of research happening in the
global community
- Andrew Ng

Massive amounts of data . . .
. . . make things slow (weeks!)
Solution
distributed training
p3x16large has
128GB GPU memory
Most models fit in a server
→ Use data-parallel training

Goals
There are many ways to do data-parallel training
Some are more confusing than others. UX varies greatly
Our goals
1. Infrastructure people (like me ) deal with choosing servers, network
gear, container environment, default containers, and tuning
distributed training performance
2. ML engineers focus on making great models that improve business
using deep learning frameworks that they love

Meet Horovod
• Library for distributed deep learning
• Works with stock TensorFlow, Keras, and
PyTorch
• Installs on top with `pip install horovod`
• Uses advanced algorithms & can leverage
features of high-performance networks
(RDMA, GPUDirect)
• Separates infrastructure from ML engineers
• Infra team provides container & MPI environment
• ML engineers use DL frameworks that they love
• Both Infra team and ML engineers have consistent
expectations for distributed training across frameworks
horovod.ai

Initialize the library
import horovod.tensorflow as hvd
hvd.init()

Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

Adjust LR & add Horovod Distributed Optimizer
opt = tf.train.MomentumOptimizer(lr=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

Synchronize initial state between workers
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, …) as mon_sess:
…
# Or
bcast_op = hvd.broadcast_global_variables(0)
sess.run(bcast_op)

Use checkpoints only on first worker
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir, …) as mon_sess:
…

Full example
import tensorflow as tf
# Initialize Horovod
hvd.init()
# Pin GPU to be used
config.gpu_options.visible_device_list =
str(hvd.local_rank())
# Build model...
loss = ...
opt = tf.train.MomentumOptimizer(
lr=0.01 * hvd.size())
# Add Horovod Distributed Optimizer
# Add hook to synchronize initial state
hooks =[hvd.BroadcastGlobalVariablesHook(0)]
# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs"
if hvd.rank() == 0 else None
# Make training operation
train_op = opt.minimize(loss)
# The MonitoredTrainingSession takes care of
# session initialization, restoring from a
# checkpoint, saving to a checkpoint, and
# closing when done or an error occurs.
with
tf.train.MonitoredTrainingSession(checkpoint
_dir=ckpt_dir, config=config, hooks=hooks)
as mon_sess:
while not mon_sess.should_stop():
# Perform synchronous training.
mon_sess.run(train_op)

Run!
# Use AWS Deep Learning AMI
laptop$ ssh ubuntu@<aws-ip-1>
aws-ip-1$ source activate tensorflow_p27
aws-ip-1$ ssh-keygen
aws-ip-1$ cat /home/ubuntu/.ssh/id_rsa.pub
[copy contents of the pubkey]
aws-ip-1$ exit
aws-ip-2$ source activate tensorflow_p27
aws-ip-2$ cat >>
/home/ubuntu/.ssh/authorized_keys
[paste contents of the pubkey]
aws-ip-2$ exit
aws-ip-2$ ssh aws-ip-2
[will ask for prompt, say yes]
aws-ip-2$ exit
aws-ip-1$ mpirun -np 2
-H aws-ip-1,aws-ip-2
wget
https://raw.githubusercontent.com/uber/horov
od/master/examples/tensorflow_mnist.py
aws-ip-1$ mpirun -bind-to none -map-by slot
-x HOROVOD_HIERARCHICAL_ALLREDUCE=1
-x LD_LIBRARY_PATH -x PATH
-mca btl_tcp_if_exclude lo,docker0
–np 16 -H aws-ip-1:8,aws-ip-2:8
python tensorflow_mnist.py
# Pro tip: hide mpirun args into mpirun.sh
aws-ip-1$ mpirun.sh
–np 16 –H aws-ip-1:8,aws-ip-2:8
python tensorflow_mnist.py

Horovod for TensorFlow

Horovod for TensorFlow, Keras, and PyTorch
import horovod.keras as hvd
import horovod.tensorflow.keras as hvd
import horovod.torch as hvd
# more frameworks coming

Full example – Keras
import keras
from keras import backend as K
import horovod.keras as hvd
# Initialize Horovod.
hvd.init()
K.set_session(tf.Session(config=config))
# Build model…
model = ...
opt = keras.optimizers.Adadelta(
# Add Horovod Distributed Optimizer.
model.compile(
loss='categorical_crossentropy',
optimizer=opt,
metrics=['accuracy'])
# Broadcast initial variable state.
callbacks =
[hvd.callbacks.BroadcastGlobalVariablesCallb
ack(0)]
model.fit(
x_train,
y_train,
callbacks=callbacks,
epochs=10,
validation_data=(x_test, y_test))

Full example – Estimator API
hvd.init()
# Build model...
def model_fn(features, labels, mode):
loss = …
opt = tf.train.MomentumOptimizer(
# Add Horovod Distributed Optimizer
return tf.estimator.EstimatorSpec(…)
# Broadcast initial variable state.
hooks =
[hvd.BroadcastGlobalVariablesHook(0)]
# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs"
if hvd.rank() == 0 else None
# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
model_fn=cnn_model_fn,
model_dir=ckpt_dir,
config=tf.estimator.RunConfig(
session_config=config))
mnist_classifier.train(
input_fn=train_input_fn,
steps=100,
hooks=hooks)

Full example – PyTorch
import torch
import horovod.torch as hvd
hvd.init()
# Horovod: pin GPU to local rank.
torch.cuda.set_device(hvd.local_rank())
# Build model.
model = Net()
model.cuda()
optimizer = optim.SGD(model.parameters())
# Wrap optimizer with DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(
optimizer,
named_parameters=model.named_parameters())
# Horovod: broadcast parameters.
hvd.broadcast_parameters(
model.state_dict(),
root_rank=0)
for epoch in range(100):
for batch_idx, (data, target) in …:
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
optimizer.step()

Scaling Horovod

Horovod knobs
• Single-ring NCCL versus Hierarchical All-reduce
• HOROVOD_HIERARCHICAL_ALLREDUCE=1
• Tensor Fusion
• HOROVOD_FUSION_THRESHOLD=67108864
• HOROVOD_CYCLE_TIME=5
• FP16 all-reduce
• hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)
Stay tuned for more . . .
• Auto-tuning
• Further gradient compression
• Local gradient aggregation

Thank you!
Alex Sergeev
@alsrgv

Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any
form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains
information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are
notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of,
disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the
extent necessary for consultations with authorized personnel of Uber.

Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent 2018

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent 2018

Similaire à Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent 2018 (20)

Plus de Amazon Web Services

Plus de Amazon Web Services (20)

Uber on Using Horovod for Distributed Deep Learning (AIM411) - AWS re:Invent 2018