Benchmark Tests and How-tos of Distributed Deep Learning on HorovodRunner
Jing Pan and Wendao Liu
2020 Copyright eHealth Insurance
ABOUT US
Wendao Liu, Sr. Data Scientist at eHealth, Inc.
§ Wears many hats: data science/machine learning, data pipelines, end-to-end data products
§ Currently pursuing a Doctorate in Business Administration
Jing Pan, PhD, Sr. Staff User Experience Researcher at eHealth, Inc.
§ Architect of customer-facing machine learning models
§ Expert in the application of deep learning models on Spark
§ Author of multiple patents and speaker at top AI conferences (KDD, AAAI)
AGENDA
§ Horovod
§ HorovodRunner
§ HorovodRunner Benchmark
§ How to Use HorovodRunner
WHY DISTRIBUTED DEEP LEARNING?
Rapidly Growing Data
§ ImageNet has 1.3M images (150 GB)
§ Amazon has 143 million product reviews (20 GB)
Increasing Model Complexity
§ AlexNet with batch size 128 requires 1.1 GB of memory (5 convolutional layers and 3 fully connected layers)
§ VGG-16 with batch size 128 requires 14 GB of memory; batch size 256 requires 28 GB
MEET HOROVOD
§ Uber's open-source distributed deep learning library
§ Easy to use
§ Slightly modify single-node DL code to make it distributed with Horovod
§ Great scaling efficiency
§ Supports four popular frameworks
§ TensorFlow, Keras, PyTorch, MXNet
§ Supports both data and model parallelism (Horovod GitHub)
Courtesy of Uber
HOROVOD – DATA PARALLELISM
Courtesy of Uber
HOROVOD – RING-ALLREDUCE
Courtesy of Uber
HOROVOD – RING-ALLREDUCE
[Figure: 16 processing units arranged in a ring, numbered 0-15]
Horovod Size: number of processing units, e.g. 16
Horovod Rank: ordinal rank of a processing unit, e.g. 0-15
Courtesy of Uber
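For orientation, a minimal sketch (not from the deck) of how these quantities are read inside each training process:

    import horovod.tensorflow.keras as hvd

    hvd.init()
    print(hvd.size())        # total number of processing units, e.g. 16
    print(hvd.rank())        # global ordinal of this process, e.g. 0-15
    print(hvd.local_rank())  # ordinal within one machine, used later to pin a GPU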
HOROVOD BENCHMARK
§ Great scaling efficiency, but requires dedicated engineering resources to set up
• Containers, MPI, and NCCL
§ Fine-tuning the infrastructure is not trivial
§ A previous in-house Horovod implementation gained no overall scaling effect (Wu et al. '18)
Courtesy of Uber
HOROVODRUNNER – DATABRICKS
HorovodRunner is a general API to run distributed deep learning workloads on Databricks using Uber's Horovod framework
§ Built on top of Horovod
§ No need to set up the underlying infrastructure
• Supports AWS and Azure
§ Runs in Databricks' Spark
§ Data prep and model training in one place
§ Takes care of random shuffling, fault tolerance, etc.
§ Barrier execution mode
Non-Endorsement Disclaimer
HOROVODRUNNER DIAGRAM
Courtesy of Databricks
§ A Spark driver and a number of executors that run Horovod
§ Barrier execution mode
§ Enables synchronous training
§ Starts all tasks together
§ Restarts all tasks in case of failure
HOROVODRUNNER BENCHMARK – MNIST
Dataset: MNIST
Instance: c4.2xlarge
Instance Type: CPU
Model: Simple CNN (2 convolutional layers)
Epochs: 50
Network: 10 Gbps
Demonstrated scaling efficiency for a simple CNN on CPU clusters.
HOROVODRUNNER BENCHMARK
Achieved good scaling efficiency using HorovodRunner for both models:
Inception V3 (79.7%–48.9%) and VGG-16 (49.0%–18.5%)
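These percentages are read here as the usual speedup-per-device ratio; a minimal sketch, assuming the standard definition (the paper linked at the end gives the exact protocol):

    def scaling_efficiency(t_one_gpu, t_n_gpus, n_gpus):
        # 1.0 means perfect linear scaling; communication overhead pulls it lower
        return t_one_gpu / (n_gpus * t_n_gpus)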
HOROVODRUNNER BENCHMARK – OTHERS
§ GCN
§ Currently no scaling efficiency
§ The adjacency matrix is the input and cannot be divided
§ Stochastic GCN might be able to help
§ Multi-GPU instances
§ Horovod usually outperforms multithreading
HOW TO USE HOROVODRUNNER
CLUSTER SETUP
TensorFlow 1 (DB ML GPU 6.x)
§ VGG and Inception are OK
§ ResNet requires TF2
§ No DB ML GPU 7.x (TF2) runtime yet
No SSL encryption
§ DATABRICKS.HOROVOD.IGNORESSL true
§ CONF_DISABLE_HIPAA true
To fix a timeout error in the optimizers, write this out and register it as a cluster init script:

    dbutils.fs.put("tmp/tf_horovod/tf_dbfs_timeout_fix.sh", """
    #!/bin/bash
    fusermount -u /dbfs
    nohup /databricks/spark/scripts/fuse/goofys-dbr -f -o allow_other --file-mode=0777 --dir-mode=0777 --type-cache-ttl 0 --stat-cache-ttl 1s --http-timeout 5m /: /dbfs >& /databricks/data/logs/dbfs_fuse_stderr &""", True)
BASIC CODE STRUCTURE
1. Init the library
2. Pin GPUs
3. Wrap the optimizer
4. Sync parameters
5. Checkpoint the model
Courtesy of Uber
INITIALIZE THE LIBRARY
Single-node code:

    def train(learning_rate=0.1):
        from tensorflow import keras
        get_dataset()
        model = get_model()
        optimizer = keras.optimizers.Adadelta(lr=learning_rate)
        model.compile()
        model.fit()

    train(learning_rate=0.1)

HorovodRunner code:

    def train_hvd(learning_rate=0.1):
        import horovod.tensorflow.keras as hvd
        hvd.init()

    hr = HorovodRunner(np=2)
    # np: number of GPUs on slaves, aka hvd_size
    hr.run(train_hvd, learning_rate=0.1)

Courtesy of Databricks
HOROVODRUNNER CODE - BAREBONE

    def train_hvd(learning_rate=0.1):
        import horovod.tensorflow.keras as hvd
        hvd.init()
        get_data()
        model = get_model()
        opt = keras.optimizers.Adadelta()
        model.compile()
        model.fit()

    hr = HorovodRunner(np=2)
    hr.run(train_hvd, learning_rate=0.1)

Graphics by Van Oktop. Code courtesy of Databricks.
PIN GPUs

    def train_hvd(learning_rate=0.1):
        from tensorflow.keras import backend as K
        import tensorflow as tf
        import horovod.tensorflow.keras as hvd
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True
        config.gpu_options.visible_device_list = str(hvd.local_rank())
        K.set_session(tf.Session(config=config))

[Figure: GPU 0, GPU 1, GPU 2, ..., GPU 15, each pinned to one process]
§ Needed for ring-allreduce to function properly
§ Finds all GPU device ids on the slaves
§ Assigns an invariant ordinal rank to each GPU device id
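The snippet above is TF1-style (tf.ConfigProto / tf.Session, matching the DB ML GPU 6.x runtime). For reference, a hedged sketch of the equivalent pinning under TensorFlow 2, for when a TF2 runtime becomes available:

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()
    gpus = tf.config.experimental.list_physical_devices('GPU')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)  # analogous to allow_growth
    if gpus:
        # pin this process to the GPU matching its local rank
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')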
DATA PARALLELISM

    def get_dataset(num_classes, rank=0, size=1):
        from tensorflow import keras
        (x_train, y_train), (x_test, y_test) = \
            keras.datasets.mnist.load_data('MNIST-data-%d' % rank)
        x_train = x_train[rank::size]
        y_train = y_train[rank::size]

    def train_hvd(batch_size=512, epochs=12, learning_rate=0.1):
        (x_train, y_train), (x_test, y_test) = get_data(hvd.rank(), hvd.size())

Conceptually, the data in the train_hvd function = the data in one GPU.
[Figure: the entire data set split into chunks 0, 1, ..., k, assigned to slave GPUs of rank 0, 1, ..., k]
Graphics for conceptual illustration purpose only, not for backend implementation
GET DATA – INDEXED SOLUTION

    def get_dataset(num_classes, rank=0, size=1):
        from tensorflow import keras
        (x_train, y_train), (x_test, y_test) = \
            keras.datasets.mnist.load_data('MNIST-data-%d' % rank)
        x_train = x_train[rank::size]
        y_train = y_train[rank::size]

    def train_hvd(batch_size=512,  # on 1 GPU
                  epochs=12, learning_rate=0.1):
        (x_train, y_train), (x_test, y_test) = get_data(hvd.rank(), hvd.size())

GPU (rank) | Slice           | Row_ind
0          | rank_0 + size*0 | 0
1          | rank_1 + size*0 | 1
...        | ...             | ...
k          | rank_k + size*0 | ...
0          | rank_0 + size*1 | k+1
1          | rank_1 + size*1 | k+2
...        | ...             | ...
k          | rank_k + size*1 | k+size
...        | ...             | ...
k          | rank_k + size*n | i

N is the number of rows each GPU holds: N = total number of rows i // hvd.size().
Graphics for conceptual illustration purpose only, not for backend implementation
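A quick toy check of the rank::size slicing above (illustrative numbers: 10 rows, 4 processes):

    import numpy as np

    x = np.arange(10)
    size = 4  # stands in for hvd.size()
    for rank in range(size):
        print(rank, x[rank::size])
    # 0 [0 4 8]
    # 1 [1 5 9]
    # 2 [2 6]
    # 3 [3 7]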
GET DATA – INDEXED SOLUTION: PROBLEM?
§ At each step, in each GPU, are the rows the same?
§ Yes
§ No shuffle, no representativeness
§ Solution for parquet files on S3
§ Petastorm, which shuffles by default: https://github.com/uber/petastorm
§ What about image files?
(Same slicing table as on the previous slide.)
Graphics for conceptual illustration purpose only, not for backend implementation
GET DATA – GENERATOR SOLUTION
A generator-based solution shuffles by default at each epoch.

    train_generator, validation_generator = get_dataset()  # shuffle set to true
    step_size_train = train_generator.n // train_generator.batch_size
    step_size_validation = validation_generator.n // validation_generator.batch_size
    history = model.fit_generator(
        generator=train_generator,
        steps_per_epoch=step_size_train // hvd.size(),
        validation_data=validation_generator,
        validation_steps=step_size_validation // hvd.size(),
        epochs=epochs,
        callbacks=callbacks,
        verbose=2
    )
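The deck does not show get_dataset() for the generator path; a minimal sketch assuming a Keras ImageDataGenerator over an image directory (the path, image size, and split are illustrative assumptions):

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    def get_dataset(data_dir='/dbfs/illustrative/images', batch_size=512):
        gen = ImageDataGenerator(rescale=1. / 255, validation_split=0.2)
        train = gen.flow_from_directory(data_dir, target_size=(224, 224),
                                        batch_size=batch_size,
                                        subset='training', shuffle=True)   # reshuffles each epoch
        val = gen.flow_from_directory(data_dir, target_size=(224, 224),
                                      batch_size=batch_size,
                                      subset='validation', shuffle=True)
        return train, val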
GET DATA – GENERATOR SOLUTION
§ Entire data set:
step_size_train = train_generator.n // train_generator.batch_size
§ Inside each GPU:
steps_per_epoch = step_size_train // hvd.size()

GPU Rank | Step in a GPU | Entire Step | Batch      | img_ind (n total)
0        | 0             | 0           | Batch_size | 346, ..., 29
0        | 1             | 1           | Batch_size | 420, ..., 1032
0        | 2             | 2           | Batch_size | 75, ..., 89
0        | 3             | 3           | Batch_size | ...
1        | 0             | 4           | Batch_size | ...
1        | 1             | 5           | Batch_size | ...
1        | 2             | ...         | Batch_size | ...
1        | 3             | ...         | Batch_size | ...
...      | ...           | ...         | Batch_size | ...
k        | 0             | ...         | Batch_size | ...
k        | 1             | ...         | Batch_size | ...
k        | 2             | ...         | Batch_size | ...
k        | 3             | m           | Batch_size | ...

§ Ensures no repetition of images within an epoch
§ How and why? The images are shuffled, and dividing steps_per_epoch by hvd.size() caps the total number of batches drawn per epoch at the single-machine total.
DISTRIBUTED MODEL RETRIEVAL
§ Why
Every GPU loads the model structure at the beginning of training; loading from GitHub triggers a "too many requests" error.
§ How
Save the model to S3 or DBFS:

    example_model = get_model()
    example_model.save("path_on_master/vgg_model.h5")
    shutil.copy("path_on_master/vgg_model.h5",
                "dbfs_or_s3_path/vgg_model.h5")

§ Then, in train_hvd, replace

    model = get_model()

with

    model = keras.models.load_model("dbfs_or_s3_path_to/vgg_model.h5")
WRAP THE OPTIMIZER

    # single-machine optimizer
    optimizer = keras.optimizers.Adadelta(lr=learning_rate * hvd.size())
    # wrap with the distributed optimizer
    optimizer = hvd.DistributedOptimizer(optimizer)
    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

Paper by Facebook: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
§ Preserve the same number of epochs in hvdRunner as on a single machine so the model converges and accuracy is preserved
§ Achieved by linearly scaling the learning rate with batch size
§ The synchronous hvdRunner batch size = batch_size * hvd_size
§ LR_n = LR_1 * N
§ HvdRunner's steps_per_epoch is inversely proportional to the number of GPUs
§ Same epochs * fewer steps per epoch = faster training time
§ Same epochs ~ comparable accuracy
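Worked numbers for the linear scaling rule (illustrative values, not from the benchmark):

    base_lr, base_batch = 0.1, 512           # single-GPU settings
    hvd_size = 8                             # number of GPUs
    effective_batch = base_batch * hvd_size  # 4096 samples per synchronous step
    scaled_lr = base_lr * hvd_size           # LR_8 = 0.1 * 8 = 0.8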
RECTIFIED ADAM OPTIMIZER
§ Why
§ Fast convergence
§ Accurate initial direction finding to avoid bad local optima
§ Setting
§ On the cluster, install keras-rectified-adam
§ In the notebook, set %env TF_KERAS=1
§ RAdam optimizer setting:

    optimizer = RAdam(total_steps=5000, warmup_proportion=0.1,
                      learning_rate=learning_rate * hvd.size(), min_lr=1e-5)
    callbacks = [
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
        hvd.callbacks.MetricAverageCallback(),
        hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1),
        keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1)]

On the Variance of the Adaptive Learning Rate and Beyond, Liu et al. 2020
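Note that RAdam still needs the same hvd.DistributedOptimizer wrapper as Adadelta did; a hedged sketch, assuming the keras_radam module shipped by the keras-rectified-adam package:

    from keras_radam import RAdam  # from the keras-rectified-adam package
    import horovod.tensorflow.keras as hvd

    # inside train_hvd, after model and learning_rate are defined
    optimizer = RAdam(total_steps=5000, warmup_proportion=0.1,
                      learning_rate=learning_rate * hvd.size(), min_lr=1e-5)
    optimizer = hvd.DistributedOptimizer(optimizer)  # average gradients across GPUs
    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy', metrics=['accuracy'])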
SYNCHRONIZE & CHECKPOINT
Synchronize parameters from GPU 0:

    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

Checkpoint model parameters from GPU 0:

    if hvd.rank() == 0:
        callbacks.append(keras.callbacks.ModelCheckpoint(
            checkpoint_dir + '/checkpoint-{epoch}.ckpt',
            save_weights_only=True))

[Figure: GPU 0 broadcasting updated parameters to GPU 1, GPU 2, ..., GPU 15]
Graphics for conceptual illustration purpose only, not for backend implementation
At the end of a synchronous step:
§ GPU 0 gets the averaged gradient from ring-allreduce
§ and sends the updated parameters to the rest of the GPUs (broadcast)
§ The weights from each step are saved from GPU 0
AVOID HVD.TIMELINE
§ Why: hvd.timeline = no scaling efficiency
§ How: add a timestamp to the standard output instead

Redirect HorovodRunner output to a log:

    reset_stdout()
    redirect_stdout(output_dir + filename)
    hr = HorovodRunner(np=np_setup)
    hr.run(train_hvd, learning_rate=learning_rate)
    # The checkpointed model is on the master.
    # If you want to keep your model after the cluster goes down:
    save_model_to_s3()
    move_log_to_s3()

    import logging
    def redirect_stdout(log_filename):
        class StreamToLogger
        …
        stdout_logger = logging.getLogger('STDOUT')
        sl = StreamToLogger(stdout_logger, logging.INFO)
        sys.stdout = sl
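The StreamToLogger body is elided on the slide; a common-pattern sketch of what such a class can look like (hypothetical, not the authors' exact code):

    import logging
    import sys

    class StreamToLogger(object):
        def __init__(self, logger, level=logging.INFO):
            self.logger = logger
            self.level = level
        def write(self, buf):
            # forward each printed line to the logger,
            # whose formatter can prepend a timestamp
            for line in buf.rstrip().splitlines():
                self.logger.log(self.level, line.rstrip())
        def flush(self):
            # required for file-like objects; logging flushes on its own
            pass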
EXAMPLE TIMESTAMP ADDED OUTPUT
[Screenshot: log lines annotated with hvd.rank, current step, total steps per epoch, current epoch, total epochs, and the added timestamp]
SUMMARY
HorovodRunner is great for distributed deep learning
§ Unlike Horovod, it does not require engineering resources to set up the infrastructure
§ Simplicity of coding inherited from Horovod
§ Scaling efficiency is good, with room for improvement
§ Choose instances with better network bandwidth
§ Change AWS S3 to the EC2 instance store
§ Works best if the data can be divided
§ Horovod Timeline adversely impacts performance
§ Security
§ Since Open MPI does not use encrypted communication and can launch new processes, it's recommended to use network-level security to isolate Horovod jobs from potential attackers
LINK TO CODE AND PAPER
§ Code: https://github.com/psychologyphd/horovodRunnerBenchMark_IPython
§ Paper (AAAI 2020 Workshop 8 accepted poster): http://arxiv.org/abs/2005.05510 or https://deep-learning-graphs.bitbucket.io/dlg-aaai20/accepted_papers/DLGMA_2020_paper_23.pdf
FEEDBACK
Thank you!
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
APPENDIX
Some things we found that may be useful to share:
§ When training NLP models that need fixed constraints such as a vocabulary, create the vocab first and read it on each worker; during training, each worker still processes only its own subset of the data independently.
§ Shuffle
§ The default is random shuffling, but it only works on parquet data: https://github.com/uber/petastorm. Save the dataframe to parquet and use Petastorm for data ingestion.
§ Petastorm supports n-gram readouts, which suggests it may be able to shuffle the data while preserving order.
§ Keras's data generator also shuffles randomly by default.
§ Rectified Adam: https://www.zhihu.com/question/340834465
§ Real-time serving:
§ Same model as a single-machine trained model; Kubernetes + Docker, or SageMaker. Check the other sessions today.
§ Ring-allreduce is bandwidth-optimal
§ https://databricks.com/blog/2019/08/15/how-not-to-scale-deep-learning-in-6-easy-steps.html
Some tips/takeaways:
1. You can use HorovodRunner out of the box and it works great.
2. Do not use Horovod Timeline.
3. Use the init script and disable HIPAA to run HorovodRunner.
4. Not all optimizers are well supported; some learning rates require special settings.
5. Make sure everything, including import statements, is wrapped in the function so it can be serialized.
6. Don't use many GPU instances blindly; there is a network cost. Instead, run a few smaller samples and check GPU memory usage first.
7. You will still gain performance from a single machine with multiple GPUs.
Wendao, see here:
https://stackoverflow.com/questions/44788946/shuffling-training-data-with-lstm-rnn
Stateful LSTM is a special case. Brandon, correct me if I am wrong, but I don't think HorovodRunner can handle the shuffle for a stateful LSTM.
--
Jing Pan, Ph.D
Quantitative User Researcher
eHealth

From: Wendao Liu <Wendao.Liu@ehealth.com>
Date: Tuesday, June 2, 2020 at 4:04 PM
To: Brandon Williams <brandon.williams@databricks.com>
Cc: Ryan O'Rourke <ryan.orourke@databricks.com>, Jing Pan <jing.pan@ehealth.com>
Subject: Re: [EXTERNAL] Re: Question regarding HorovodRunner architecture

Thanks Brandon for the quick reply!
The first question makes total sense; the entire process will fail.
For the second question, yes, I mean the time only; the data is organized in chronological order. Sorry my question wasn't really clear, so I am adding more context here:
Let's say we have 5 years of historical Amazon stock data, our goal is to predict the future Amazon stock price, and the data is organized in chronological order with each row at the day level. In this case, if we train a model such as an LSTM, we want to preserve the time order, and a direct random shuffle probably won't work as it breaks the sequence of the stock prices. Do you have any suggestions on how to train such a model on Horovod, especially on how to shuffle the data in a meaningful way? I hope that helps to clarify the problem.
Thanks a lot!

From: Brandon Williams <brandon.williams@databricks.com>
Date: Tuesday, June 2, 2020 at 2:47 PM
To: Wendao Liu <Wendao.Liu@ehealth.com>
Cc: Ryan O'Rourke <ryan.orourke@databricks.com>, Jing Pan <jing.pan@ehealth.com>
Subject: [EXTERNAL] Re: Question regarding HorovodRunner architecture

Hi Wendao,
+1 Jing Pan as well.
Regarding recommendations on shuffling in a meaningful way given your case, one approach is to pre-transform the series into (overlapping) arrays of contiguous time steps. Then each row is a chunk of time and can be read fairly independently, so shuffling would be fine. That may cost a fair bit of storage, but it is worth a try.
Also, Petastorm looks like it shuffles by row group, so that should be fine since the data is ordered chronologically: each row group should be contiguous in time. Following that, you should be able to generate the overlapping windows of data on the fly from each batch, as normal. Our ML team believes this is also a good approach to test out, albeit not a trivial task.
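A sketch of the "overlapping arrays of contiguous time steps" idea from the reply above (the window length is an illustrative assumption):

    import numpy as np

    def make_windows(series, window=30):
        # each row becomes a self-contained chunk of time, so rows can be
        # shuffled freely without breaking the order within a window
        return np.stack([series[i:i + window]
                         for i in range(len(series) - window + 1)])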
Logging: the log is on the master. If you want to retrieve logs from the slaves, use Databricks MLflow.