The freedom to iterate quickly on distributed deep learning tasks is crucial for smaller companies to gain competitive advantage and market share from big tech giants. HorovodRunner brings this process to relatively accessible Spark clusters.
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner Enabled Apache Spark Clusters
1. Benchmark Tests and How-tos of Distributed Deep Learning on HorovodRunner
Jing Pan and Wendao Liu
2. 2020 Copyright eHealth Insurance
ABOUT US
Wendao Liu, Sr. Data Scientist at eHealth, Inc
§ Wears many hats: data science/machine learning, data pipeline, end-to-
end data product
§ Currently studying Doctor in Business Administration
Jing Pan, PhD, Sr. Staff User Experience Researcher at eHealth, Inc
§ Architect of customer facing machine learning models
§ Expert in application of deep learning models on Spark
§ Author of multiple patents and speaker at top AI conferences (KDD, AAAI)
3. 2020 Copyright eHealth Insurance
AGENDA
§ Horovod
§ HorovodRunner
§ HorovodRunner Benchmark
§ How to Use HorovodRunner
4. 2020 Copyright eHealth Insurance
WHY DISTRIBUTED DEEP LEARNING?
Rapidly Growing Data
§ ImageNet has 1.3M images (150 GB)
§ Amazon has 143 million product reviews (20 GB)
Increasing Model Complexity
§ AlexNet with batch size 128 requires 1.1 GB memory (5 conv layers and 3 fully connected layers)
§ VGG-16 with batch size 128 requires 14 GB memory; batch size 256 requires 28 GB
5. 2020 Copyright eHealth Insurance
MEET HOROVOD
§ Uber's open source distributed deep learning library
§ Easy to use
§ Slightly modify single-node DL code to make it distributed
using Horovod
§ Great scaling efficiency
§ Supports four popular frameworks
§ TensorFlow, Keras, PyTorch, MXNet
§ Supports both data and model parallelism
Horovod GitHub
Courtesy of Uber
8. 2020 Copyright eHealth Insurance
HOROVOD – RING-ALLREDUCE
[Figure: ring-allreduce across processing units arranged in a ring, numbered 0 to 15]
Horovod Size: Number of processing units, e.g. 16
Horovod Rank: Ordinal rank of processing units, e.g. 0-15
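The ring-allreduce idea can be sketched in plain Python. This is a minimal single-process simulation for illustration only, not Horovod's implementation: the function name `ring_allreduce` is hypothetical, and each worker's vector is assumed to hold exactly one element per worker so that one element plays the role of one chunk.

```python
def ring_allreduce(worker_data):
    """Simulate ring-allreduce (sum) over a list of per-worker vectors.

    Assumption: every vector has exactly len(worker_data) elements,
    so chunk i is simply element i.
    """
    size = len(worker_data)
    bufs = [list(v) for v in worker_data]

    # Phase 1, scatter-reduce: in each step, rank r sends one chunk to
    # rank r+1, which adds it to its own copy of that chunk.
    for step in range(size - 1):
        sends = [((r - step) % size, bufs[r][(r - step) % size])
                 for r in range(size)]  # snapshot = simultaneous sends
        for r, (idx, val) in enumerate(sends):
            bufs[(r + 1) % size][idx] += val

    # Phase 2, allgather: each fully reduced chunk travels around the
    # ring and overwrites the partial values held by the other ranks.
    for step in range(size - 1):
        sends = [((r + 1 - step) % size, bufs[r][(r + 1 - step) % size])
                 for r in range(size)]
        for r, (idx, val) in enumerate(sends):
            bufs[(r + 1) % size][idx] = val
    return bufs


# Four workers (Horovod size 4, ranks 0-3), each holding a local "gradient".
size = 4
local_grads = [[r + 1.0] * size for r in range(size)]
summed = ring_allreduce(local_grads)
averaged = [[x / size for x in buf] for buf in summed]
print(averaged[0])
```

Each rank only ever talks to its neighbour in the ring, and after 2 * (size - 1) steps every rank holds the same reduced vector.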
Courtesy of Uber
9. 2020 Copyright eHealth Insurance
HOROVOD BENCHMARK
§ Great scaling efficiency, but requires dedicated engineering resources to set it up
• Container, MPI, and NCCL
§ Fine-tuning infra is not trivial
§ Previous Horovod in-house implementation gains no overall scaling effect (Wu et al '18)
Courtesy of Uber
10. 2020 Copyright eHealth Insurance
HOROVODRUNNER – DATABRICKS
HorovodRunner is a general API to run distributed deep learning workloads
on Databricks using Uber's Horovod framework
§ Built on top of Horovod
§ No need to set up underlying infrastructure
• Supports AWS and Azure
§ Run in Databricks’ Spark
§ Data prep and data training in one place
§ Takes care of random shuffling, fault tolerance, etc.
§ Barrier execution mode
Non-Endorsement Disclaimer
11. 2020 Copyright eHealth Insurance
HOROVODRUNNER DIAGRAM
Courtesy of Databricks
§ A Spark driver and a number of executors that run Horovod
§ Barrier execution mode
§ Enables synchronized training
§ Start all tasks together
§ Restart all tasks in case of failure
12. 2020 Copyright eHealth Insurance
HOROVODRUNNER BENCHMARK – MNIST
Dataset: MNIST
Instance: C4.2xlarge
Instance Type: CPU
Model: Simple CNN
(2 convolutional layers)
Epochs: 50
Network: 10 Gbps
Demonstrated scaling efficiency on simple CNN runs on CPU clusters.
13. 2020 Copyright eHealth Insurance
HOROVODRUNNER BENCHMARK
Achieved good scaling efficiency using HorovodRunner for both models:
Inception V3 (79.7%~48.9%) and VGG-16 (49.0%~18.5%)
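The slides do not spell out how scaling efficiency is computed. A common definition, assumed here, is the speedup over a single processing unit divided by the number of units. The function name and the timings below are hypothetical and are not the benchmark measurements.

```python
def scaling_efficiency(time_single, time_distributed, num_units):
    """Speedup relative to one unit, divided by the number of units used."""
    speedup = time_single / time_distributed
    return speedup / num_units


# Hypothetical timings: 1000 s on one GPU, 125 s on 16 GPUs.
efficiency = scaling_efficiency(1000.0, 125.0, 16)
print(f"{efficiency:.1%}")
```

Under this definition, 100% means perfectly linear scaling, and a value such as 50% on 16 units corresponds to an 8x speedup.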
14. 2020 Copyright eHealth Insurance
HOROVODRUNNER BENCHMARK OTHERS
§ GCN
§ Currently no scaling efficiency
§ Adjacency matrix is input and cannot be divided
§ Stochastic GCN might be able to help
§ Multi-GPU instances
§ Horovod usually outperforms multithreading
22. 2020 Copyright eHealth Insurance
PIN GPUs
def train_hvd(learning_rate=0.1):
    from tensorflow.keras import backend as K
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd
    # Initialize Horovod before querying any rank.
    hvd.init()
    # Pin each process to one GPU, chosen by its local rank on the node (TF 1.x API).
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    K.set_session(tf.Session(config=config))
GPU 0
GPU 1
GPU 2...
GPU 15
§ For ring-allreduce to function properly
§ Find all GPU device ids on the slaves
§ Assign an invariant ordinal rank to each GPU device id
23. 2020 Copyright eHealth Insurance
DATA PARALLELISM
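def get_dataset(num_classes, rank=0, size=1):
    from tensorflow import keras
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data('MNIST-data-%d' % rank)
    # Each rank keeps every size-th row, starting at its own rank.
    x_train = x_train[rank::size]
    y_train = y_train[rank::size]
    return (x_train, y_train), (x_test, y_test)

def train_hvd(batch_size=512,
              epochs=12, learning_rate=0.1):
    # hvd is imported and initialized inside train_hvd, as on the PIN GPUs slide.
    (x_train, y_train), (x_test, y_test) = get_dataset(num_classes=10, rank=hvd.rank(), size=hvd.size())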
Conceptually,
data in the train_hvd function = data in one GPU
[Figure: the entire data set is split into chunks 0 to k; chunk i is assigned to the slave GPU with rank i]
Graphics for conceptual illustration purpose only, not for backend implementation
24. 2020 Copyright eHealth Insurance
GET DATA – INDEXED SOLUTION
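def get_dataset(num_classes, rank=0, size=1):
    from tensorflow import keras
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data('MNIST-data-%d' % rank)
    # Each rank keeps every size-th row, starting at its own rank.
    x_train = x_train[rank::size]
    y_train = y_train[rank::size]
    return (x_train, y_train), (x_test, y_test)

def train_hvd(batch_size=512,  # on 1 GPU
              epochs=12, learning_rate=0.1):
    # hvd is imported and initialized inside train_hvd, as on the PIN GPUs slide.
    (x_train, y_train), (x_test, y_test) = get_dataset(num_classes=10, rank=hvd.rank(), size=hvd.size())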
Graphics for conceptual illustration purpose only, not for backend implementation
GPU(rank) Slice Row_ind
0 Rank_0+size*0 0
1 Rank_1+size*0 1
.. ... ...
k Rank_k+size*0 ...
0 Rank_0+size*1 k+1
1 Rank_1+size*1 k+2
.. ... ...
k Rank_k+size*1 k+size
... ... ...
k Rank_k+size*n i
n is the number of rows each GPU holds:
n = I // hvd.size(), where I is the total number of rows
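The indexed slicing can be checked without TensorFlow. The sketch below applies the same `[rank::size]` slice to a plain list of row indices; `shard_indices` and the tiny sizes (10 rows, 4 GPUs) are illustrative assumptions.

```python
def shard_indices(num_rows, rank, size):
    """Row indices that the slice [rank::size] assigns to one rank."""
    return list(range(num_rows))[rank::size]


num_rows, size = 10, 4
shards = [shard_indices(num_rows, rank, size) for rank in range(size)]
for rank, rows in enumerate(shards):
    print(rank, rows)

# Number of rows every GPU is guaranteed to hold.
rows_per_gpu = num_rows // size
```

The shards are disjoint and together cover every row, but each rank sees the same rows in the same order at every epoch.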
25. 2020 Copyright eHealth Insurance
GET DATA – INDEX SOLUTION
PROBLEM?
§ At each step, in each GPU, are the
rows the same?
§ Yes
§ No Shuffle, no representativeness.
§ Solution for parquet files on S3
§ Petastorm, shuffles by default
https://github.com/uber/petastorm
§ Image files?
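One way to see what shuffling adds to the indexed approach is to permute the row indices with a seed shared by all ranks before slicing. This is an illustrative alternative, not the Petastorm mechanism and not something the slides implement; `shuffled_shard`, the seed convention, and the sizes are assumptions.

```python
import random


def shuffled_shard(num_rows, rank, size, epoch, seed=0):
    """Shuffle all row indices identically on every rank, then slice by rank.

    Using seed + epoch gives a new permutation each epoch while keeping
    the shards of different ranks disjoint within that epoch.
    """
    order = list(range(num_rows))
    random.Random(seed + epoch).shuffle(order)
    return order[rank::size]


num_rows, size = 100, 4
epoch0 = [shuffled_shard(num_rows, rank, size, epoch=0) for rank in range(size)]
epoch1 = [shuffled_shard(num_rows, rank, size, epoch=1) for rank in range(size)]
print(epoch0[0][:5], epoch1[0][:5])
```

Because every rank derives the permutation from the same seed, no coordination between GPUs is needed to keep the shards non-overlapping.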
26. 2020 Copyright eHealth Insurance
GET DATA – GENERATOR SOLUTION
Generator-based solution will shuffle by default at each epoch.
train_generator, validation_generator = get_dataset()  # shuffle set to True
step_size_train = train_generator.n // train_generator.batch_size
step_size_validation = validation_generator.n // validation_generator.batch_size
history = model.fit_generator(
    generator=train_generator,
    steps_per_epoch=step_size_train // hvd.size(),
    validation_data=validation_generator,
    validation_steps=step_size_validation // hvd.size(),
    epochs=epochs,
    callbacks=callbacks,
    verbose=2
)
27. 2020 Copyright eHealth Insurance
GET DATA – GENERATOR SOLUTION
§ Entire data set
step_size_train = train_generator.n//train_generator.batch_size
§ Inside each GPU
steps_per_epoch = step_size_train // hvd.size()
GPU Rank Steps in a GPU Entire Steps Batch img_ind (n total)
0 0 0 Batch_size 346,…, 29
0 1 1 Batch_size 420,…,1032
0 2 2 Batch_size 75,…,89
0 3 3 Batch_size ...
1 0 4 Batch_size ...
1 1 5 Batch_size ...
1 2 ... Batch_size ...
1 3 ... Batch_size ...
... ... ... Batch_size ...
k 0 ... Batch_size ...
k 1 ... Batch_size ...
k 2 ... Batch_size ...
k 3 m Batch_size ...
§ Ensures no repetition of images in an epoch
§ How?
§ Why?
Images
are shuffled
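The step arithmetic can be worked through with concrete numbers. The sketch assumes the 60,000-image MNIST training set together with the batch size of 512 and the 16 GPUs used elsewhere in these slides; `steps_per_gpu` is a hypothetical helper.

```python
def steps_per_gpu(num_images, batch_size, num_gpus):
    """Steps over the entire data set, and the share taken by each GPU."""
    steps_entire = num_images // batch_size
    return steps_entire, steps_entire // num_gpus


num_images, batch_size, num_gpus = 60000, 512, 16
steps_entire, steps_each = steps_per_gpu(num_images, batch_size, num_gpus)

# Images consumed per epoch across all GPUs together.
images_per_epoch = steps_each * num_gpus * batch_size
print(steps_entire, steps_each, images_per_epoch)
```

Because of the two integer divisions, the GPUs together consume slightly fewer images per epoch than the data set contains.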
28. 2020 Copyright eHealth Insurance
DISTRIBUTED MODEL RETRIEVAL
§ Why
Every GPU loads the model structure at the beginning of training; loading it from GitHub on every GPU triggers a "too many requests" error
§ How
Save model to S3 or DBFS
import shutil
example_model = get_model()
example_model.save("path_on_master/vgg_model.h5")
shutil.copy("path_on_master/vgg_model.h5", "dbfs_or_s3_path/vgg_model.h5")
§ Then in train_hvd,
Replace
model=get_model()
With
model = keras.models.load_model("dbfs_or_s3_path_to/vgg_model.h5")
29. 2020 Copyright eHealth Insurance
WRAP THE OPTIMIZER
# Single-machine optimizer, with the learning rate scaled by the number of GPUs
optimizer = keras.optimizers.Adadelta(lr=learning_rate * hvd.size())
# Wrap with Distributed Optimizer.
optimizer = hvd.DistributedOptimizer(optimizer)
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
Paper by Facebook: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
§ Keep the same number of epochs in HorovodRunner as on a single machine so that the model converges and accuracy is preserved
§ Do so by scaling the learning rate linearly with the batch size
§ Synchronous HorovodRunner batch size = batch_size * hvd.size()
§ LR_N = LR_1 * N
§ HorovodRunner's steps_per_epoch is inversely proportional to the number of GPUs
§ Same epochs * fewer steps_per_epoch = shorter training time
§ Same epochs ~ comparable accuracy
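The linear scaling rule can be written out as a small calculation. The helper name `scale_for_horovod` is hypothetical; the base learning rate of 0.1, the per-GPU batch size of 512 and the 16 GPUs are the example values used in these slides.

```python
def scale_for_horovod(base_lr, per_gpu_batch_size, num_gpus):
    """Apply LR_N = LR_1 * N and compute the effective synchronous batch size."""
    scaled_lr = base_lr * num_gpus
    effective_batch_size = per_gpu_batch_size * num_gpus
    return scaled_lr, effective_batch_size


scaled_lr, effective_batch_size = scale_for_horovod(0.1, 512, 16)
print(round(scaled_lr, 4), effective_batch_size)
```

With 16 GPUs, each synchronous step processes 16 times as many images, so the learning rate is raised by the same factor.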
30. 2020 Copyright eHealth Insurance
RECTIFIED ADAM OPTIMIZER
§ Why
§ Fast convergence
§ Accurate initial direction finding to avoid bad local optima
§ Setting
§ Cluster install keras-rectified-adam
§ Notebook set %env TF_KERAS=1
§ RA optimizer setting
optimizer = RAdam(total_steps=5000, warmup_proportion=0.1,
                  learning_rate=learning_rate * hvd.size(), min_lr=1e-5)
callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    hvd.callbacks.MetricAverageCallback(),
    hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1),
    keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1)]
On the Variance of the Adaptive Learning Rate and Beyond (Liu et al., 2020)
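The callback list includes a learning-rate warmup over 5 epochs. The idea of gradual warmup, as described in the large-minibatch SGD paper, is to ramp from the single-GPU learning rate up to the scaled one. The sketch below shows that idea per epoch only; it is not Horovod's exact per-batch implementation, and `warmup_lr` is a hypothetical name.

```python
def warmup_lr(base_lr, num_gpus, epoch, warmup_epochs=5):
    """Linearly ramp from base_lr to base_lr * num_gpus over warmup_epochs."""
    target_lr = base_lr * num_gpus
    if epoch >= warmup_epochs:
        return target_lr
    return base_lr + (target_lr - base_lr) * epoch / warmup_epochs


schedule = [warmup_lr(0.1, 16, epoch) for epoch in range(7)]
print([round(lr, 2) for lr in schedule])
```

Starting low avoids applying the large scaled learning rate while the weights are still far from a reasonable region.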
31. 2020 Copyright eHealth Insurance
SYNCHRONIZE & CHECKPOINT
Synchronize parameters from GPU 0
callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
Checkpoint model parameters from GPU 0
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint(
        checkpoint_dir + '/checkpoint-{epoch}.ckpt', save_weights_only=True))
GPU 0
GPU 1
GPU 2...
GPU 15
Graphics for conceptual illustration purpose only, not for backend implementation
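At the end of each synchronous step
§ Every GPU gets the same averaged gradient from ring-allreduce and applies the same update
§ BroadcastGlobalVariablesCallback(0) sends the parameters of GPU 0 to the rest of the GPUs (broadcast) at the start of training, so all replicas begin identical
§ The weights are checkpointed from GPU 0 only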
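Why one checkpoint from GPU 0 is enough can be illustrated with a toy synchronous step: if all replicas start from the same weights and apply the same averaged gradient, they stay identical. This is a plain-Python sketch with made-up gradients and a plain SGD update, not Horovod code; `synchronous_step` is a hypothetical name.

```python
def synchronous_step(weights_per_gpu, grads_per_gpu, lr):
    """Average the per-GPU gradients, then apply the same SGD update everywhere."""
    size = len(grads_per_gpu)
    dim = len(grads_per_gpu[0])
    avg_grad = [sum(g[i] for g in grads_per_gpu) / size for i in range(dim)]
    return [[w[i] - lr * avg_grad[i] for i in range(dim)]
            for w in weights_per_gpu]


# All replicas start from the weights of rank 0 (the "broadcast").
initial_weights = [1.0, -2.0]
replicas = [list(initial_weights) for _ in range(4)]

# Each GPU computes a different gradient on its own shard of the data.
local_grads = [[0.4, 0.0], [0.0, 0.8], [0.4, 0.8], [0.0, 0.0]]
replicas = synchronous_step(replicas, local_grads, lr=0.5)
print(replicas[0])
```

Since every replica ends each step with the same weights, saving from rank 0 alone loses nothing and avoids several processes writing the same file.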
32. 2020 Copyright eHealth Insurance
AVOID HVD.TIMELINE
§ Why: hvd.timeline = no scaling efficiency
§ How: add timestamp to standard output
Redirect HorovodRunner output to log
reset_stdout()
redirect_stdout(output_dir + filename)
hr = HorovodRunner(np=np_setup)
hr.run(train_hvd, learning_rate=learning_rate)
# Checkpointed model is on master.
# If you want to keep your model after the cluster goes down:
save_model_to_s3()
move_log_to_s3()

import logging
import sys

def redirect_stdout(log_filename):
    class StreamToLogger:
        …
    stdout_logger = logging.getLogger('STDOUT')
    sl = StreamToLogger(stdout_logger, logging.INFO)
    sys.stdout = sl
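The body of StreamToLogger is elided on the slide. One possible shape, offered purely as an assumption, is a file-like object that forwards every printed line to a logger whose formatter prepends a timestamp. The sketch logs to an in-memory buffer instead of a file so it can be run anywhere; `redirect_stdout_to_logger` and the logger name are hypothetical.

```python
import io
import logging
import sys


class StreamToLogger:
    """File-like object that forwards each written line to a logger."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level

    def write(self, message):
        # print() sends the text and the trailing newline separately;
        # stripping makes the bare newline produce no log record.
        for line in message.rstrip().splitlines():
            self.logger.log(self.level, line.rstrip())
        return len(message)

    def flush(self):
        pass


def redirect_stdout_to_logger(stream):
    """Build a stdout replacement whose lines carry a timestamp prefix."""
    logger = logging.getLogger("STDOUT_SKETCH")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.handlers = [handler]
    return StreamToLogger(logger, logging.INFO)


buffer = io.StringIO()
original_stdout = sys.stdout
sys.stdout = redirect_stdout_to_logger(buffer)
try:
    print("rank 0 epoch 1/50 step 3/7")
finally:
    sys.stdout = original_stdout

logged = buffer.getvalue()
print(logged)
```

Timestamps taken this way allow per-step timing to be reconstructed from the log without enabling Horovod Timeline.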
33. 2020 Copyright eHealth Insurance
EXAMPLE TIMESTAMP ADDED OUTPUT
Hvd.rank Current step Total steps per epoch
Current epoch
Total epoch
Added Timestamp
34. 2020 Copyright eHealth Insurance
SUMMARY
HorovodRunner is great for distributed deep learning
§ Unlike Horovod, does not require engineering resources to set up infrastructure
§ Simplicity of coding inherited from Horovod
§ Scaling efficiency is good; has room for improvement
§ Choose better network bandwidth instances
§ Change AWS S3 to EC2 instance store
§ Works best if the data can be divided
§ Horovod Timeline adversely impacts performance
§ Security
§ Since Open MPI does not use encrypted communication and can launch new processes,
it's recommended to use network-level security to isolate Horovod jobs from potential
attackers
35. 2020 Copyright eHealth Insurance
LINK TO CODE AND PAPER
§ Code:
https://github.com/psychologyphd/horovodRunnerBenchMark_IPython
§ Paper (AAAI2020 Workshop 8 Accepted Poster):
http://arxiv.org/abs/2005.05510
or
https://deep-learning-graphs.bitbucket.io/dlg-aaai20/accepted_papers/DLGMA_2020_paper_23.pdf
38. 2020 Copyright eHealth Insurance
APPENDIX
Some things we found that may be useful to share
§ When training NLP models that depend on global constraints such as the vocabulary, create the vocabulary first and read it on each worker. During training, each worker still processes only its own subset of the data independently.
§ Shuffle
§ Default is random shuffling, but it only works on Parquet data.
https://github.com/uber/petastorm
Save the dataframe to Parquet and use Petastorm for data ingestion.
§ Petastorm supports N-gram readouts, which might make it possible to shuffle the data while preserving order.
§ The Keras data generator also shuffles randomly by default
§ Rectified Adam
https://www.zhihu.com/question/340834465
§ Real time serving:
§ Same model as the single-machine trained model. Kubernetes + Docker or SageMaker. Check other sessions today.
§ ring-Allreduce bandwidth optimized
§ https://databricks.com/blog/2019/08/15/how-not-to-scale-deep-learning-in-6-easy-steps.html
39. Some tips/takeaways:
1. You can use HorovodRunner out of the box and it works great
2. Do not use Horovod Timeline
3. Init script and disable HIPAA to run HorovodRunner
4. Not all optimizers are well supported; some learning rates require special settings.
5. Make sure everything is wrapped in the function, including import statements, so it can be serialized.
6. Don't use many GPU instances blindly; there is a network cost. Instead, run a few smaller samples and check GPU memory usage first.
7. You will still gain performance from a single machine with multiple GPUs
40. Wendao, see here.
https://stackoverflow.com/questions/44788946/shuffling-training-data-with-lstm-rnn
Stateful LSTM is a special case. Brandon correct me if I am wrong. I don’t think horovodRunner can handle the shuffle of stateful LSTM.
--
Jing Pan, Ph.D
Quantitative User Researcher
eHealth
From: Wendao Liu <Wendao.Liu@ehealth.com>
Date: Tuesday, June 2, 2020 at 4:04 PM
To: Brandon Williams <brandon.williams@databricks.com>
Cc: Ryan O'Rourke <ryan.orourke@databricks.com>, Jing Pan <jing.pan@ehealth.com>
Subject: Re: [EXTERNAL] Re: Question regarding HorovodRunner architecture
Thanks Brandon for quick reply!
First question make totally sense, the entire process will fail.
for second questions, yes, I mean the time only, the data is organized by chronological order. Sorry my questions wasn’t really clear so I am adding more
context here:
Let’s say we have historical 5 years of amazon stock data and our goal is to predict the future amazon stock price and data is organized by chronological order
and each row is at day level. In this case, if we train a model such LSTM, we want to preserve the order of the time and direct random shuffle probably won’t
work as it break the sequence of the stock prices. Do you have any suggestions of how to train such model on Horovod? Especially on how to shuffle the data in
a meaningful way. Hope it help to clarify the problem.
Thanks a lot!
From: Brandon Williams <brandon.williams@databricks.com>
Date: Tuesday, June 2, 2020 at 2:47 PM
To: Wendao Liu <Wendao.Liu@ehealth.com>
Cc: Ryan O'Rourke <ryan.orourke@databricks.com>, Jing Pan <jing.pan@ehealth.com>
Subject: [EXTERNAL] Re: Question regarding HorovodRunner architecture
41. Hi Wendao,
+1 Jing Pan as well.
Regarding recommendations on shuffling in a meaningful way given your case, one approach is to pre-
transform these into (overlapping) arrays of contiguous time steps. Then each row is a chunk of time
and can be read pretty independently so shuffling would be fine. But that may cause a large bit of
storage but is worth a try.
Also, petastorm looks like it shuffles by row group. so that should be fine since the data is ordered by
time chronologically, as each rowgroup should be contiguous in time. Following that you should be able
to then generate the overlapping windows of data on the fly from that batch, as normal. Our ML team
believes this is also good approach to test out albeit not a trivial task.
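The first suggestion in the reply, pre-transforming the series into overlapping arrays of contiguous time steps, can be sketched as follows. The toy series, the window length of 4 and the name `make_windows` are illustrative assumptions.

```python
import random


def make_windows(series, window, stride=1):
    """Cut a chronologically ordered series into overlapping contiguous windows."""
    return [series[i:i + window]
            for i in range(0, len(series) - window + 1, stride)]


# Stand-in for daily prices in chronological order.
prices = list(range(10))
windows = make_windows(prices, window=4)

# Shuffling whole windows keeps the order inside each window intact.
random.Random(0).shuffle(windows)
print(windows[:3])
```

Each window is a self-contained training row, so the rows can be shuffled or sharded across workers like any other data set, at the cost of storing overlapping values several times.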
42. Logging: logs are retrieved from the master
If you want to retrieve logs from the slaves, use Databricks MLflow