Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner Enabled Apache Spark Clusters

Benchmark Tests and How-tos of Distributed Deep Learning on HorovodRunner
Jing Pan and Wendao Liu

2020 Copyright eHealth Insurance
ABOUT US
Wendao Liu, Sr. Data Scientist at eHealth, Inc
§ Wears many hats: data science/machine learning, data pipeline, end-to-end data product
§ Currently studying for a Doctor of Business Administration
Jing Pan, PhD, Sr. Staff User Experience Researcher at eHealth, Inc
§ Architect of customer facing machine learning models
§ Expert in application of deep learning models on Spark
§ Author of multiple patents and speaker at top AI conferences (KDD, AAAI)

AGENDA
§ Horovod
§ HorovodRunner
§ HorovodRunner Benchmark
§ How to Use HorovodRunner

WHY DISTRIBUTED DEEP LEARNING?
Rapidly Growing Data
§ ImageNet has 1.3M images (150 GB)
§ Amazon has 143 million product reviews (20 GB)
Increasing Model Complexity
§ AlexNet with batch size 128 requires 1.1 GB of memory (5 conv layers and 3 fully connected layers)
§ VGG-16 with batch size 128 requires 14 GB of memory; batch size 256 requires 28 GB

MEET HOROVOD
§ Uber's open source distributed deep learning library
§ Easy to use
§ Slightly modify single-node DL code to make it distributed using Horovod
§ Great scaling efficiency
§ Supports four popular frameworks
§ TensorFlow, Keras, PyTorch, MXNet
§ Supports both data and model parallelism
Horovod GitHub
Courtesy of Uber

HOROVOD – DATA PARALLELISM
Courtesy of Uber

HOROVOD – RING-ALLREDUCE
Courtesy of Uber

HOROVOD – RING-ALLREDUCE
[Figure: processing units arranged in a ring, numbered 0 to 15]
Horovod Size: Number of processing units, e.g. 16
Horovod Rank: Ordinal rank of processing units, e.g. 0-15
Courtesy of Uber
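
The ring-allreduce pattern named on this slide can be sketched in plain Python. This is a single-process simulation for illustration only, not Horovod's implementation (which runs over MPI/NCCL). The function name `ring_allreduce` and the chunking scheme are assumptions of this sketch. Each of the `size` workers splits its gradient into `size` chunks, passes partial sums around the ring (scatter-reduce), then passes the finished chunks around once more (allgather).

```python
def ring_allreduce(worker_grads):
    """Simulate a sum ring-allreduce over equal-length gradient vectors.

    worker_grads[r] is the gradient held by rank r. Its length must be
    divisible by the number of workers so it splits into equal chunks.
    Returns one reduced vector per rank (all identical).
    """
    size = len(worker_grads)
    length = len(worker_grads[0])
    assert length % size == 0, "vector length must be divisible by size"
    chunk = length // size
    # bufs[r][c] is chunk c of the buffer owned by rank r (copied, not aliased).
    bufs = [[list(g[c * chunk:(c + 1) * chunk]) for c in range(size)]
            for g in worker_grads]

    # Phase 1, scatter-reduce: at step s, rank r sends chunk (r - s) % size
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(size - 1):
        sends = [(r, (r - s) % size, bufs[r][(r - s) % size][:])
                 for r in range(size)]
        for r, c, data in sends:
            dst = (r + 1) % size
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], data)]
    # Now rank r holds the fully reduced chunk (r + 1) % size.

    # Phase 2, allgather: at step s, rank r forwards chunk (r + 1 - s) % size
    # to its right neighbour, which overwrites its copy.
    for s in range(size - 1):
        sends = [(r, (r + 1 - s) % size, bufs[r][(r + 1 - s) % size][:])
                 for r in range(size)]
        for r, c, data in sends:
            bufs[(r + 1) % size][c] = data

    # Flatten each rank's chunks back into one vector.
    return [[x for c in b for x in c] for b in bufs]


# Hypothetical example: 4 workers, 8 gradient values each.
grads = [[float(r * 10 + i) for i in range(8)] for r in range(4)]
summed = ring_allreduce(grads)
averaged = [[x / len(grads) for x in vec] for vec in summed]
```

Each worker sends and receives only one chunk per step, which is why the pattern is described as bandwidth optimized.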

HOROVOD BENCHMARK
§ Great scaling efficiency, but requires dedicated engineering resources to set it up
• Container, MPI, and NCCL
§ Fine-tuning the infrastructure is not trivial
§ A previous in-house Horovod implementation gained no overall scaling effect (Wu et al. '18)
Courtesy of Uber

HOROVODRUNNER – DATABRICKS
HorovodRunner is a general API to run distributed deep learning workloads
on Databricks using Uber's Horovod framework
§ Built on top of Horovod
§ No need to set up underlying infrastructure
• Supports AWS and Azure
§ Runs in Databricks’ Spark
§ Data preparation and model training in one place
§ Takes care of random shuffling, fault tolerance, etc.
§ Barrier execution mode
Non-Endorsement Disclaimer

HOROVODRUNNER DIAGRAM
Courtesy of Databricks
§ A Spark driver and a number of executors that run Horovod
§ Barrier execution mode
§ Enables synchronized training
§ Starts all tasks together
§ Restarts all tasks in case of failure

HOROVODRUNNER BENCHMARK – MNIST
Dataset: MNIST
Instance: C4.2xlarge
Instance Type: CPU
Model: Simple CNN (2 convolutional layers)
Epochs: 50
Network: 10 Gbps
Demonstrated scaling efficiency for a simple CNN running on CPU clusters.

HOROVODRUNNER BENCHMARK
Achieved good scaling efficiency using HorovodRunner for both models:
Inception V3 (79.7%~48.9%) and VGG-16 (49.0%~18.5%)
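
The slide reports scaling efficiency without spelling out the formula. A minimal sketch, assuming the common definition (measured throughput on n GPUs divided by n times the single-GPU throughput), is shown here. The function name and the sample throughput numbers are hypothetical and are not taken from the benchmark.

```python
def scaling_efficiency(throughput_1, throughput_n, n):
    """Measured n-GPU throughput relative to ideal linear scaling.

    throughput_1: images/second on a single GPU
    throughput_n: images/second on n GPUs
    Returns 1.0 for perfect linear scaling.
    """
    if n <= 0 or throughput_1 <= 0:
        raise ValueError("n and throughput_1 must be positive")
    return throughput_n / (n * throughput_1)


# Hypothetical numbers: 100 images/s on 1 GPU, 400 images/s on 8 GPUs.
eff = scaling_efficiency(100.0, 400.0, 8)
print(f"scaling efficiency: {eff:.1%}")
```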

HOROVODRUNNER BENCHMARK OTHERS
§ GCN
§ Currently no scaling efficiency
§ Adjacency matrix is input and cannot be divided
§ Stochastic GCN might be able to help
§ Multi-GPU instances
§ Horovod usually outperforms multithreading

HOW TO USE HOROVODRUNNER

CLUSTER SETUP
§ TensorFlow 1 (Databricks ML GPU runtime 6.x)
§ VGG and Inception work
§ ResNet requires TensorFlow 2
§ No Databricks ML GPU runtime 7.x (TensorFlow 2) yet
§ No SSL encryption; set:
DATABRICKS.HOROVOD.IGNORESSL true
CONF_DISABLE_HIPAA true
§ To fix the timeout error in optimizers, run this in an init script:
dbutils.fs.put("tmp/tf_horovod/tf_dbfs_timeout_fix.sh", """
#!/bin/bash
fusermount -u /dbfs
nohup /databricks/spark/scripts/fuse/goofys-dbr -f -o allow_other --file-mode=0777 --dir-mode=0777 --type-cache-ttl 0 --stat-cache-ttl 1s --http-timeout 5m /: /dbfs >& /databricks/data/logs/dbfs_fuse_stderr &""", True)

BASIC CODE STRUCTURE
1. INIT LIBRARY
2. PIN GPU
3. WRAP OPTIMIZER
4. SYNC PARAMETERS
5. CHECKPOINT MODEL
Courtesy of Uber

INITIALIZE THE LIBRARY
Single-node code:
def train(learning_rate=0.1):
    from tensorflow import keras
    get_dataset()
    model = get_model()
    optimizer = keras.optimizers.Adadelta(lr=learning_rate)
    model.compile()
    model.fit()

train(learning_rate=0.1)

HorovodRunner code:
from sparkdl import HorovodRunner

def train_hvd(learning_rate=0.1):
    import horovod.tensorflow.keras as hvd
    hvd.init()

hr = HorovodRunner(np=2)
# np: number of GPUs on slaves,
# aka, hvd_size
hr.run(train_hvd, learning_rate=0.1)
Courtesy of Databricks

HOROVODRUNNER CODE - BAREBONE
def train_hvd():
    hvd.init()
    get_data()
    model = get_model()
    opt = keras.optimizers.Adadelta()
    model.compile()
    model.fit()
Graphics by Van Oktop. Code Courtesy of Databricks.


PIN GPUs
def train_hvd(learning_rate=0.1):
    import tensorflow as tf
    from tensorflow.keras import backend as K
    import horovod.tensorflow.keras as hvd

    hvd.init()
    # Pin each process to one GPU, chosen by its local rank (TensorFlow 1 API)
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    K.set_session(tf.Session(config=config))

[Figure: GPU 0, GPU 1, GPU 2, ..., GPU 15]
§ For ring-allreduce to function properly
§ Find all GPU device ids on the slaves
§ Assign an invariant ordinal rank to each GPU device id
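
To illustrate how a global rank relates to the local rank used for pinning, here is a small sketch. It assumes that every node has the same number of GPUs and that ranks are assigned to nodes contiguously. Both are assumptions of this example, and the helper names are hypothetical. In real code the values come from hvd.rank() and hvd.local_rank().

```python
def local_rank(rank, gpus_per_node):
    """GPU index within its node, assuming contiguous rank assignment."""
    return rank % gpus_per_node


def node_index(rank, gpus_per_node):
    """Index of the node hosting this rank, under the same assumption."""
    return rank // gpus_per_node


# Hypothetical cluster: 16 processes (hvd size 16) on nodes with 4 GPUs each.
for rank in range(16):
    print(f"rank {rank:2d} -> node {node_index(rank, 4)}, "
          f"visible GPU {local_rank(rank, 4)}")
```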

DATA PARALLELISM
def get_dataset(num_classes, rank=0, size=1):
    from tensorflow import keras
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data('MNIST-data-%d' % rank)
    # Each rank keeps every size-th row, starting at its own rank
    x_train = x_train[rank::size]
    y_train = y_train[rank::size]
    return (x_train, y_train), (x_test, y_test)

def train_hvd(batch_size=512,  # per GPU
              epochs=12, learning_rate=0.1):
    num_classes = 10  # MNIST digits
    (x_train, y_train), (x_test, y_test) = get_dataset(num_classes, hvd.rank(), hvd.size())
Conceptually,
data in the train_hvd function = data in one GPU
[Figure: the entire data set is split into chunks 0, 1, ..., k; chunk r goes to the slave GPU with rank r (rank 0, rank 1, ..., rank k)]
Graphics are for conceptual illustration only and do not depict the backend implementation

GET DATA – INDEXED SOLUTION
GPU (rank)   Slice             Row_ind
0            Rank_0 + size*0   0
1            Rank_1 + size*0   1
...          ...               ...
k            Rank_k + size*0   k
0            Rank_0 + size*1   k+1
1            Rank_1 + size*1   k+2
...          ...               ...
k            Rank_k + size*1   k+size
...          ...               ...
k            Rank_k + size*n   i
n is the number of rows each GPU holds: n = (number of rows i) // hvd.size()
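
The strided slice in the table can be checked on a toy list. A minimal sketch, with hypothetical helper names `shard` and `shard_equal`. The second helper trims every shard to the same length, which is one way (an assumption here, not something the slides prescribe) to keep the number of steps identical on every GPU.

```python
def shard(rows, rank, size):
    """Rows seen by `rank`: every size-th row, starting at index `rank`."""
    return rows[rank::size]


def shard_equal(rows, rank, size):
    """Same as shard(), trimmed so that all ranks get len(rows) // size rows."""
    return rows[rank::size][:len(rows) // size]


rows = list(range(10))  # stand-in for 10 training rows
size = 4                # stand-in for hvd.size()
for rank in range(size):
    print(rank, shard(rows, rank, size), shard_equal(rows, rank, size))
# rank 0 gets rows 0, 4, 8; rank 1 gets rows 1, 5, 9; ranks 2 and 3 get two rows each
```

The shards are disjoint and together cover every row, but each rank sees the same rows in the same order in every epoch.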

GET DATA – INDEXED SOLUTION: PROBLEM?
§ At each step, are the rows in each GPU the same?
§ Yes
§ No shuffling, so the batches are not representative.
§ Solution for Parquet files on S3
§ Petastorm, which shuffles by default
https://github.com/uber/petastorm
§ Image files?
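
One way to add shuffling to an index-based split is to shuffle the row indices with a seed that all ranks share and that changes per epoch, then take the strided slice. This is only a sketch of that idea. It is not how Petastorm or the Keras generators shuffle, and the function name and `seed` parameter are hypothetical.

```python
import random


def epoch_shard(num_rows, rank, size, epoch, seed=0):
    """Row indices for `rank` in `epoch`.

    All ranks use the same seed (seed + epoch), so they agree on the
    permutation and their strided slices stay disjoint.
    """
    indices = list(range(num_rows))
    random.Random(seed + epoch).shuffle(indices)
    return indices[rank::size]


# Hypothetical run: 12 rows, 3 GPUs, 2 epochs.
for epoch in range(2):
    for rank in range(3):
        print(epoch, rank, epoch_shard(12, rank, 3, epoch))
```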

GET DATA – GENERATOR SOLUTION
A generator-based solution shuffles by default at each epoch.
train_generator, validation_generator = get_dataset()  # shuffle set to True
step_size_train = train_generator.n // train_generator.batch_size
step_size_validation = validation_generator.n // validation_generator.batch_size
history = model.fit_generator(
    generator=train_generator,
    steps_per_epoch=step_size_train // hvd.size(),
    validation_data=validation_generator,
    validation_steps=step_size_validation // hvd.size(),
    epochs=epochs,
    callbacks=callbacks,
    verbose=2
)

GET DATA – GENERATOR SOLUTION
§ Entire data set
step_size_train = train_generator.n // train_generator.batch_size
§ Inside each GPU
steps_per_epoch = step_size_train // hvd.size()

GPU rank   Step in GPU   Overall step   Batch        img_ind (n total)
0          0             0              Batch_size   346, ..., 29
0          1             1              Batch_size   420, ..., 1032
0          2             2              Batch_size   75, ..., 89
0          3             3              Batch_size   ...
1          2             ...            Batch_size   ...
1          3             ...            Batch_size   ...
...        ...           ...            Batch_size   ...
k          0             ...            Batch_size   ...
k          3             m              Batch_size   ...

§ Ensures no image is repeated within an epoch
§ How?
§ Why? Images are shuffled
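
The step arithmetic on this slide can be worked through with concrete numbers. The figures used here (50,000 images, batch size 128, 4 GPUs) are hypothetical, as is the function name.

```python
def steps_per_gpu(num_images, batch_size, hvd_size):
    """Steps each GPU runs per epoch, following the two formulas on the slide."""
    step_size_total = num_images // batch_size  # steps over the entire data set
    return step_size_total // hvd_size          # steps inside each GPU


num_images, batch_size, hvd_size = 50_000, 128, 4
steps = steps_per_gpu(num_images, batch_size, hvd_size)
images_seen = steps * hvd_size * batch_size
print(steps)        # 97 steps per GPU (390 total steps // 4)
print(images_seen)  # 49664 images processed per epoch across all GPUs
```

Because of the two floor divisions, slightly fewer than `num_images` images are processed in each epoch.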

DISTRIBUTED MODEL RETRIEVAL
§ Why
Every GPU loads the model structure at the beginning of training; loading it from GitHub
triggers a "too many requests" error
§ How
Save the model to S3 or DBFS
import shutil
example_model = get_model()
example_model.save("path_on_master/vgg_model.h5")
shutil.copy("path_on_master/vgg_model.h5",
            "dbfs_or_s3_path/vgg_model.h5")
§ Then, in train_hvd, replace
model = get_model()
with
model = keras.models.load_model("dbfs_or_s3_path_to/vgg_model.h5")

WRAP THE OPTIMIZER
# Single-machine optimizer, with the learning rate scaled by the number of GPUs
optimizer = keras.optimizers.Adadelta(lr=learning_rate * hvd.size())
# Wrap with Distributed Optimizer.
optimizer = hvd.DistributedOptimizer(optimizer)
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
Paper by Facebook: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
§ Keep the same number of epochs in HorovodRunner as on a single machine, so that the model converges and accuracy is preserved
§ Do so by scaling the learning rate linearly with the batch size
§ Synchronous HorovodRunner batch size = batch_size * hvd_size
§ LR_n = LR_1 * N
§ HorovodRunner's steps_per_epoch is inversely proportional to the number of GPUs
§ Same epochs * fewer steps_per_epoch = faster training time
§ Same epochs ~ comparable accuracy
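
The linear scaling rule listed on this slide amounts to two multiplications. A minimal sketch follows. The function name and the dictionary keys are hypothetical, and the sample values reuse the defaults shown elsewhere in the deck (learning rate 0.1, per-GPU batch size 512).

```python
def scaled_hyperparameters(base_lr, per_gpu_batch_size, hvd_size):
    """Apply LR_n = LR_1 * N and effective batch = batch_size * hvd_size."""
    return {
        "learning_rate": base_lr * hvd_size,
        "effective_batch_size": per_gpu_batch_size * hvd_size,
    }


for n in (1, 2, 4, 8):
    print(n, scaled_hyperparameters(0.1, 512, n))
```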

RECTIFIED ADAM OPTIMIZER
§ Why
§ Fast convergence
§ Accurate initial direction finding to avoid bad local optima
§ Setting
§ On the cluster, install keras-rectified-adam
§ In the notebook, set %env TF_KERAS=1
§ RAdam optimizer setting
from keras_radam import RAdam
optimizer = RAdam(total_steps=5000, warmup_proportion=0.1,
                  learning_rate=learning_rate * hvd.size(), min_lr=1e-5)
callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    hvd.callbacks.MetricAverageCallback(),
    hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1),
    keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1)]
Reference: On the Variance of the Adaptive Learning Rate and Beyond, Liu et al., 2020
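
The warmup callback in the list above ramps the learning rate up during the first epochs. The idea of a gradual warmup can be sketched as a linear ramp from the single-GPU rate to the scaled rate. This illustrates the concept only and is not the exact schedule implemented by Horovod's callback. The function name is hypothetical.

```python
def warmup_lr(base_lr, hvd_size, epoch, warmup_epochs=5):
    """Linear ramp from base_lr (epoch 0) to base_lr * hvd_size (warmup_epochs)."""
    target = base_lr * hvd_size
    if epoch >= warmup_epochs:
        return target
    return base_lr + (target - base_lr) * epoch / warmup_epochs


# Hypothetical run: base rate 0.1 on 8 GPUs, 5 warmup epochs.
for epoch in range(7):
    print(epoch, round(warmup_lr(0.1, 8, epoch), 3))
```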

SYNCHRONIZE & CHECKPOINT
Synchronize parameters from GPU 0
callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
Checkpoint model parameters from GPU 0
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint(
        checkpoint_dir + '/checkpoint-{epoch}.ckpt', save_weights_only=True))
[Figure: GPU 0, GPU 1, GPU 2, ..., GPU 15]
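§ At the end of each synchronous step, every GPU holds the averaged gradient from ring-allreduce
§ BroadcastGlobalVariablesCallback(0) sends the initial parameters from GPU 0 to the rest of the GPUs (broadcast), so that all GPUs start from the same state
§ The weights are saved from GPU 0 only, once per epoch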

AVOID HVD.TIMELINE
§ Why: hvd.timeline = no scaling efficiency
§ How: add timestamps to standard output
Redirect HorovodRunner output to a log:
import logging
import sys

def redirect_stdout(log_filename):
    class StreamToLogger
    …
    stdout_logger = logging.getLogger('STDOUT')
    sl = StreamToLogger(stdout_logger, logging.INFO)
    sys.stdout = sl

reset_stdout()
redirect_stdout(output_dir + filename)
hr = HorovodRunner(np=np_setup)
hr.run(train_hvd, learning_rate=learning_rate)
# The checkpointed model is on the master.
# If you want to keep your model after the cluster goes down:
save_model_to_s3()
move_log_to_s3()
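
The slide elides the body of StreamToLogger. One common way to write such a class with the standard logging module is sketched here, purely as an illustration and not as the authors' code. The formatter string, the returned handler, and the way reset_stdout restores the stream are assumptions of this sketch.

```python
import logging
import sys

_original_stdout = sys.stdout  # remembered so reset_stdout() can restore it


class StreamToLogger:
    """File-like object that forwards every written line to a logger."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level

    def write(self, message):
        # print() calls write() for the text and again for the newline;
        # the bare newline produces no lines here and is skipped.
        for line in message.rstrip().splitlines():
            self.logger.log(self.level, line.rstrip())

    def flush(self):
        pass  # the logging handlers flush on their own


def redirect_stdout(log_filename):
    """Send everything printed to stdout to a timestamped log file."""
    logger = logging.getLogger("STDOUT")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.FileHandler(log_filename)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    sys.stdout = StreamToLogger(logger, logging.INFO)
    return handler


def reset_stdout():
    """Restore the stream that was active when this module was loaded."""
    sys.stdout = _original_stdout
```

With this in place, each line printed between redirect_stdout(...) and reset_stdout() is written to the log file with a timestamp prefix, which gives per-step timing without enabling Horovod Timeline.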

EXAMPLE TIMESTAMP ADDED OUTPUT
[Screenshot: sample output annotated with the added timestamp, hvd.rank, current epoch, total epochs, current step, and total steps per epoch]

SUMMARY
HorovodRunner is great for distributed deep learning
§ Unlike Horovod, does not require engineering resources to set up infrastructure
§ Simplicity of coding inherited from Horovod
§ Scaling efficiency is good; has room for improvement
§ Choose better network bandwidth instances
§ Change AWS S3 to EC2 instance store
§ Works best if the data can be divided
§ Horovod Timeline adversely impacts performance
§ Security
§ Since Open MPI does not use encrypted communication and can launch new processes,
it's recommended to use network-level security to isolate Horovod jobs from potential
attackers

LINK TO CODE AND PAPER
§ Code:
https://github.com/psychologyphd/horovodRunnerBenchMark_IPython
§ Paper (AAAI2020 Workshop 8 Accepted Poster):
http://arxiv.org/abs/2005.05510
or
https://deep-learning-graphs.bitbucket.io/dlg-aaai20/accepted_papers/DLGMA_2020_paper_23.pdf


APPENDIX
Some things we found that may be useful to share
§ When training NLP models that need shared constraints such as a vocabulary, create the vocabulary first and read it on each worker. During training, each worker still processes only its own subset of the data independently.
§ Shuffle
§ Petastorm shuffles randomly by default, but it only works on Parquet data.
https://github.com/uber/petastorm
Save the DataFrame to Parquet and use Petastorm for data ingestion.
§ Petastorm supports N-gram readouts, which might make it possible to shuffle the data while preserving order.
§ The Keras data generator also shuffles randomly by default
§ Rectified Adam
https://www.zhihu.com/question/340834465
§ Real-time serving:
§ Same as for a model trained on a single machine: Kubernetes + Docker or SageMaker. Check other sessions today.
§ Ring-allreduce is bandwidth optimized
§ https://databricks.com/blog/2019/08/15/how-not-to-scale-deep-learning-in-6-easy-steps.html

Some tips/takeaways:
1. You can use HorovodRunner out of the box and it works great
2. Do not use Horovod Timeline
3. Use the init script and disable HIPAA to run HorovodRunner
4. Not all optimizers are well supported; some learning rates require special settings.
5. Make sure everything, including import statements, is wrapped in the function so it can be serialized.
6. Don't use many GPU instances blindly; there is a network cost. Instead, run a few smaller samples and check GPU memory usage first.
7. You will still gain performance from a single machine with multiple GPUs

Wendao, see here.
https://stackoverflow.com/questions/44788946/shuffling-training-data-with-lstm-rnn
Stateful LSTM is a special case. Brandon, correct me if I am wrong: I don't think HorovodRunner can handle the shuffling of a stateful LSTM.
-- Jing Pan

Reply from Wendao Liu, Tuesday, June 2, 2020 (subject: Question regarding HorovodRunner architecture):
Thanks, Brandon, for the quick reply!
The answer to the first question makes total sense: the entire process will fail.
For the second question, yes, I mean time only; the data is organized in chronological order. Sorry my question wasn't clear, so I am adding more context here:
Let's say we have five years of historical Amazon stock data, our goal is to predict the future Amazon stock price, and the data is organized in chronological order with one row per day. In this case, if we train a model such as an LSTM, we want to preserve the time order, and direct random shuffling probably won't work because it breaks the sequence of stock prices. Do you have any suggestions for how to train such a model on Horovod, especially how to shuffle the data in a meaningful way? Hope this helps clarify the problem.
Thanks a lot!

Reply from Brandon Williams (Databricks), Tuesday, June 2, 2020:
Hi Wendao,
+1 Jing Pan as well.
Regarding recommendations on shuffling in a meaningful way for your case, one approach is to pre-transform the rows into (overlapping) arrays of contiguous time steps. Then each row is a chunk of time that can be read fairly independently, so shuffling would be fine. This may require a large amount of storage, but it is worth a try.
Also, Petastorm appears to shuffle by row group, so that should be fine: because the data is ordered chronologically, each row group should be contiguous in time. You should then be able to generate the overlapping windows of data on the fly from each batch, as normal. Our ML team believes this is also a good approach to test, albeit not a trivial task.
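
The first suggestion in the reply (pre-transforming a chronological series into overlapping windows of contiguous time steps, then shuffling the windows) can be sketched as follows. The function name, the `window` and `horizon` parameters, and the toy prices are hypothetical.

```python
import random


def make_windows(series, window, horizon=1):
    """Turn a chronological series into (inputs, target) samples.

    Each sample holds `window` contiguous time steps and the value
    `horizon` steps after the window, so order is preserved inside a
    sample and the samples themselves can be shuffled freely.
    """
    samples = []
    for start in range(len(series) - window - horizon + 1):
        x = series[start:start + window]
        y = series[start + window + horizon - 1]
        samples.append((x, y))
    return samples


prices = [10, 11, 12, 13, 14, 15]      # toy daily prices, oldest first
samples = make_windows(prices, window=3)
random.Random(0).shuffle(samples)      # shuffle whole windows, not single days
for x, y in samples:
    print(x, "->", y)
```

Consecutive windows overlap in all but one time step, which is the storage cost mentioned in the reply.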

Logging: logs are retrieved from the master.
To retrieve logs from the slaves, use Databricks MLflow.
