Accelerating TensorFlow with RDMA for High-
Performance Deep Learning
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Talk at DataWorks Summit Berlin 2018
by
Xiaoyi Lu
The Ohio State University
E-mail: luxi@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~luxi
DataWorks Summit@Berlin 2018 2Network Based Computing Laboratory
• Deep Learning is a subset of Machine Learning
– But it is perhaps the most radical and revolutionary subset
• Deep Learning is going through a resurgence
– Model: Excellent accuracy for deep/convolutional
neural networks
– Data: Public availability of versatile datasets like
MNIST, CIFAR, and ImageNet
– Capability: Unprecedented computing and
communication capabilities: Multi-/Many-Core,
GPGPUs, Xeon Phi, InfiniBand, RoCE, etc.
• Big Data has become one of the most important
elements in business analytics
– Increasing demand for extracting Big Value out of Big Data to drive continuously growing revenue
Why Is Deep Learning So Hot?
Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-
flexibility-scalability-and-advocacy/
[Figures: MNIST handwritten digits; a deep neural network]
DataWorks Summit@Berlin 2018 3Network Based Computing Laboratory
Application Example of DL: Flickr’s Magic View Photo Filtering
Courtesy: https://thenextweb.com/opinion/2015/05/22/flickrs-new-magic-view-photo-filtering-feature-works-so-well-it-convinced-me-to-ditch-iphoto/#.tnw_RaZEaD6g
• Image recognition to divide pictures into surprisingly accurate categories
• Magic of AI/DL: Generate accurate tags for billions of pictures
DataWorks Summit@Berlin 2018 4Network Based Computing Laboratory
• TensorFlow
• Caffe/Caffe2
• Torch
• SparkNet
• TensorFrame
• DeepLearning4J
• BigDL
• CNTK
• mmlspark
• Many others…
Examples of Deep Learning Stacks
DataWorks Summit@Berlin 2018 5Network Based Computing Laboratory
Trends of Deep Learning Stacks
• Google TensorFlow
• Microsoft CNTK
• Facebook Caffe2
• Google Search Trend (March 23, 2018)
DataWorks Summit@Berlin 2018 6Network Based Computing Laboratory
Big Data
(Hadoop, Spark,
HBase,
Memcached,
etc.)
Deep Learning
(Caffe, TensorFlow, BigDL,
etc.)
HPC
(MPI, RDMA,
Lustre, etc.)
Increasing Usage of HPC, Big Data and Deep Learning
Convergence of HPC, Big Data, and Deep Learning!!!
DataWorks Summit@Berlin 2018 7Network Based Computing Laboratory
• BLAS Libraries – the heart of math
operations
– Atlas/OpenBLAS
– NVIDIA cuBlas
– Intel Math Kernel Library (MKL)
• DNN Libraries – the heart of Convolutions!
– NVIDIA cuDNN (already reached its 7th
iteration – cudnn-v7)
– Intel MKL-DNN (MKL 2017) – recent but a
very promising development
• Communication Libraries – the heart of
model parameter updating
– RDMA
– GPUDirect RDMA
Highly-Optimized Underlying Libraries with HPC Technologies
DataWorks Summit@Berlin 2018 8Network Based Computing Laboratory
Outline
• Overview of gRPC, TensorFlow, and RDMA Technologies
• Accelerating gRPC and TensorFlow with RDMA
• Benchmarking gRPC and TensorFlow
• Performance Evaluation
• Conclusion
DataWorks Summit@Berlin 2018 9Network Based Computing Laboratory
Architecture Overview of gRPC
Key Features:
• Simple service definition
• Works across languages and platforms
• C++, Java, Python, Android Java, etc.
• Linux, Mac, Windows
• Start quickly and scale
• Bi-directional streaming and integrated
authentication
• Used by Google (several of Google’s cloud
products and Google externally facing
APIs, TensorFlow), NetFlix, Docker, Cisco,
Juniper Networks etc.
• Uses sockets for communication!
Source: http://www.grpc.io/
Large-scale distributed systems composed of
micro services
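To make the service-definition workflow concrete, here is a minimal sketch of a gRPC service in Python. The echo_pb2 / echo_pb2_grpc modules and the Echo service are hypothetical stand-ins for stubs that grpcio-tools would generate from a .proto file; they are not part of TensorFlow or of this talk.

```python
# Hedged sketch of a gRPC service; echo_pb2 / echo_pb2_grpc are assumed to be
# generated from a hypothetical echo.proto (service Echo { rpc Echo(Msg) returns (Msg); }).
from concurrent import futures
import time

import grpc
import echo_pb2
import echo_pb2_grpc


class EchoServicer(echo_pb2_grpc.EchoServicer):
    def Echo(self, request, context):
        # gRPC serializes the reply and, by default, ships it over TCP sockets
        return echo_pb2.Msg(payload=request.payload)


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    echo_pb2_grpc.add_EchoServicer_to_server(EchoServicer(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    try:
        while True:
            time.sleep(3600)   # gRPC 1.5-era servers are kept alive manually
    except KeyboardInterrupt:
        server.stop(0)


if __name__ == "__main__":
    serve()
```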
DataWorks Summit@Berlin 2018 10Network Based Computing Laboratory
Architecture Overview of Google TensorFlow
Key Features:
• Widely used for Deep Learning
• Open source software library for numerical
computation using data flow graphs
• Graph edges represent the multidimensional
data arrays
• Nodes in the graph represent mathematical
operations
• Flexible architecture allows computation to be deployed to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API
• Used by Google, Airbnb, DropBox, Snapchat,
Twitter
• Communication and Computation intensive
Source: https://www.tensorflow.org/
Architecture of TensorFlow
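As a concrete illustration of the data-flow-graph model described above, a minimal TensorFlow 1.x graph (the API version used in this talk) might look like the following sketch:

```python
# Minimal TensorFlow 1.x dataflow graph: nodes are operations, edges carry tensors.
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 3], name="x")  # edge: input tensor
w = tf.Variable(tf.ones([3, 1]), name="w")                  # node: stateful variable
y = tf.matmul(x, w, name="y")                               # node: math operation

with tf.Session() as sess:   # the runtime places ops on available CPUs/GPUs
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))    # -> [[6.]]
```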
DataWorks Summit@Berlin 2018 11Network Based Computing Laboratory
Overview of gRPC with TensorFlow
Worker services communicate among each other using gRPC, or gRPC+X!
[Figure: a client talks to a master, which coordinates worker services such as /job:PS/task:0 and /job:Worker/task:0; each worker hosts CPU and GPU devices and runs a gRPC server/client]
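A hedged sketch of how such a PS/worker deployment is expressed in TensorFlow 1.x; the job names, host:port strings, and variable used here are placeholders, and gRPC is the default transport between the processes:

```python
# Hedged sketch: one PS task and one worker task communicating over gRPC.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222"],   # /job:ps/task:0   (parameter server)
    "worker": ["node1:2222"],   # /job:worker/task:0
})

# Each process runs one of these servers; protocol="grpc" is the default channel.
server = tf.train.Server(cluster, job_name="worker", task_index=0, protocol="grpc")

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w = tf.Variable(tf.zeros([1024, 1024]))   # placed on the PS; fetched via gRPC

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
```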
DataWorks Summit@Berlin 2018 12Network Based Computing Laboratory
Drivers of Modern HPC Cluster and Data Center Architecture
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
– Single Root I/O Virtualization (SR-IOV)
• Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
[Figure: high-performance interconnects such as InfiniBand with SR-IOV (<1 usec latency, 200 Gbps bandwidth); multi-/many-core processors; accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip); SSD, NVMe-SSD, and NVRAM storage; clouds and systems such as SDSC Comet and TACC Stampede]
DataWorks Summit@Berlin 2018 13Network Based Computing Laboratory
Interconnects and Protocols in OpenFabrics Stack
[Figure: applications/middleware reach the network through either the sockets interface or the verbs interface. Sockets run over kernel-space TCP/IP on 1/10/25/40/100 GigE, hardware-offloaded TCP/IP on 10/40 GigE-TOE, kernel-space IPoIB on InfiniBand, or user-space RSockets and SDP on InfiniBand; user-space TCP/IP is also available on iWARP adapters with Ethernet switches. Verbs run over user-space RDMA on RoCE adapters with Ethernet switches and natively on InfiniBand adapters and switches]
DataWorks Summit@Berlin 2018 14Network Based Computing Laboratory
• 163 IB Clusters (32.6%) in the Nov’17 Top500 list
– (http://www.top500.org)
• Installations in the Top 50 (17 systems):
Large-scale InfiniBand Installations
19,860,000 core (Gyoukou) in Japan (4th)
241,108 core (Pleiades) at NASA/Ames (17th)
220,800 core (Pangea) in France (21st)
144,900 core (Cheyenne) at NCAR/USA (24th)
155,150 core (Jureca) in Germany (29th)
72,800 core Cray CS-Storm in US (30th)
72,800 core Cray CS-Storm in US (31st)
78,336 core (Electra) at NASA/USA (33rd)
124,200 core (Topaz) SGI ICE at ERDC DSRC in US (34th)
60,512 core (NVIDIA DGX-1/Relion) at Facebook in USA (35th)
60,512 core (DGX SATURN V) at NVIDIA/USA (36th)
72,000 core (HPC2) in Italy (37th)
152,692 core (Thunder) at AFRL/USA (40th)
99,072 core (Mistral) at DKRZ/Germany (42nd)
147,456 core (SuperMUC) in Germany (44th)
86,016 core (SuperMUC Phase 2) in Germany (45th)
74,520 core (Tsubame 2.5) at Japan/GSIC (48th)
66,000 core (HPC3) in Italy (51st)
194,616 core (Cascade) at PNNL (53rd)
and many more!
DataWorks Summit@Berlin 2018 15Network Based Computing Laboratory
• Introduced in Oct 2000
• High Performance Data Transfer
– Interprocessor communication and I/O
– Low latency (<1.0 microsec), High bandwidth (up to 25 GigaBytes/sec -> 200Gbps), and
low CPU utilization (5-10%)
• Flexibility for LAN and WAN communication
• Multiple Transport Services
– Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable
Datagram (UD), and Raw Datagram
– Provides flexibility to develop upper layers
• Multiple Operations
– Send/Recv
– RDMA Read/Write
– Atomic Operations (very unique)
• high performance and scalable implementations of distributed locks, semaphores, collective
communication operations
• Leading to big changes in designing HPC clusters, file systems, cloud computing
systems, grid computing systems, ….
Open Standard InfiniBand Networking Technology
DataWorks Summit@Berlin 2018 16Network Based Computing Laboratory
RDMA over Converged Enhanced Ethernet (RoCE)
[Figure: Network Stack Comparison: all three stacks expose IB Verbs to the application over the IB transport layer; native InfiniBand and RoCE keep the IB network layer (over the InfiniBand and Ethernet link layers, respectively), while RoCE v2 carries the IB transport over UDP/IP on the Ethernet link layer]
• Takes advantage of IB and Ethernet
– Software written with IB-Verbs
– Link layer is Converged (Enhanced) Ethernet (CE)
– 100Gb/s support from latest EDR and ConnectX-
3 Pro adapters
• Pros of RoCE vs. IB
– Works natively in Ethernet environments
• Entire Ethernet management ecosystem is available
– Has all the benefits of IB verbs
– Link layer is very similar to the link layer of
native IB, so there are no missing features
• RoCE v2: Additional Benefits over RoCE
– Traditional Network Management Tools Apply
– ACLs (Metering, Accounting, Firewalling)
– IGMP Snooping for Optimized Multicast
– Network Monitoring Tools
Courtesy: OFED, Mellanox
[Figure: Packet Header Comparison: RoCE uses an Ethernet L2 header (Ethertype), IB GRH (L3), and IB BTH+ (L4); RoCE v2 uses an Ethernet L2 header (Ethertype), IP header (L3, Proto#), UDP header (Port#), and IB BTH+ (L4)]
DataWorks Summit@Berlin 2018 17Network Based Computing Laboratory
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User Base: 280 organizations from 34 countries
• More than 26,000 downloads from the project site
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE
Also run on Ethernet
Available for x86 and OpenPOWER
Support for Singularity and Docker
DataWorks Summit@Berlin 2018 18Network Based Computing Laboratory
HiBD Release Timeline and Downloads
[Chart: cumulative number of downloads (0 to 30,000) over the release timeline, marking releases from RDMA-Hadoop 1.x 0.9.0, RDMA-Memcached 0.9.1 & OHB 0.7.1, and RDMA-Hadoop 1.x 0.9.8/0.9.9 through RDMA-Hadoop 2.x 0.9.1 to 1.3.0, RDMA-Memcached 0.9.4/0.9.6 & OHB 0.9.3, RDMA-Spark 0.9.1 to 0.9.5, and RDMA-HBase 0.9.1]
DataWorks Summit@Berlin 2018 19Network Based Computing Laboratory
• Can similar designs be done for gRPC and TensorFlow to achieve significant
performance benefits by taking advantage of native RDMA support?
• How do we benchmark gRPC and TensorFlow for both deep learning and
system researchers?
• What kind of performance benefits can we get through native RDMA-based
designs in gRPC and TensorFlow?
Motivation
DataWorks Summit@Berlin 2018 20Network Based Computing Laboratory
Outline
• Overview of gRPC, TensorFlow, and RDMA Technologies
• Accelerating gRPC and TensorFlow with RDMA
• Benchmarking gRPC and TensorFlow
• Performance Evaluation
• Conclusion
DataWorks Summit@Berlin 2018 21Network Based Computing Laboratory
Tensor Communication over gRPC channel
• Rendezvous protocol
• The TensorFlow worker (tensor-receiving process) actively requests tensors from the parameter server (tensor-sending process)
• The worker issues a Tensor RPC request to the Parameter Server (PS)
• The PS finds the requested tensor and responds to the worker
• gRPC core uses recvmsg and sendmsg
primitives for receiving and sending payloads
• Tensor Transmission uses iovec structures
R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences, BPOE, 2018.
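The iovec-based transmission can be pictured with standard scatter/gather socket calls. The sketch below is only an illustration of the mechanism (Python's sendmsg/recvmsg_into on a local socket pair), not gRPC's actual C core code, and the buffer names and sizes are made up.

```python
# Illustration of iovec-style scatter/gather I/O: several buffers go out in one
# sendmsg() call and are received into separate buffers with recvmsg_into().
import socket

sender, receiver = socket.socketpair()          # stand-in for a gRPC connection

header = b"tensor:layer1/weights"               # small metadata chunk
payload = bytes(16 * 1024)                      # tensor bytes (placeholder)

sender.sendmsg([header, payload])               # gather-write: two iovec entries

buf_hdr = bytearray(len(header))
buf_data = bytearray(len(payload))
nbytes, _ancdata, _flags, _addr = receiver.recvmsg_into([buf_hdr, buf_data])
print(nbytes, bytes(buf_hdr))                   # bytes actually scattered so far

sender.close()
receiver.close()
```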
DataWorks Summit@Berlin 2018 22Network Based Computing Laboratory
• gRPC + Verbs
– Dedicated verbs channel for tensor communication
– gRPC channel for administrative task communication
• gRPC + MPI
– Dedicated MPI channel for tensor communication
– gRPC channel for administrative task communication
• Uber Horovod
– Uber’s approach of MPI based distributed TensorFlow
• Baidu Tensorflow-Allreduce
– Baidu’s approach of MPI based distributed TensorFlow
High Performance Tensor Communication Channel
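In TensorFlow 1.x these channels are selected through the server's protocol string, provided the corresponding contrib transport was compiled in. A hedged sketch (cluster hosts and job layout are placeholders):

```python
# Hedged sketch: keep gRPC for administrative traffic but move tensor transfers
# to a dedicated verbs (or MPI) channel by changing the protocol string.
import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["node0:2222"], "worker": ["node1:2222"]})

# "grpc" (default), "grpc+verbs", or "grpc+mpi", depending on how TF was built.
server = tf.train.Server(cluster, job_name="ps", task_index=0,
                         protocol="grpc+verbs")
server.join()   # a PS task typically just serves tensors until terminated
```

Horovod and Baidu's allreduce take a different route: they bypass the parameter-server architecture and exchange gradients with MPI collectives instead.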
DataWorks Summit@Berlin 2018 23Network Based Computing Laboratory
TensorFlow workload via gRPC
iovec Buffer Distribution Observed for TensorFlow
training over gRPC
• Small, Medium, and Large indicate buffers of a few Bytes, KBytes, and MBytes in length
• A gRPC payload may contain a uniform distribution of such Small buffers
• A lot of Large buffers and a few Small buffers may create a skewed distribution of buffers in one gRPC payload
R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-
Benchmark Suite to Evaluate gRPC for TensorFlow: Early
Experiences, BPOE, 2018.
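The uniform and skewed payload shapes can be mimicked with a few lines of Python. This is a hedged sketch of the idea behind the micro-benchmark's payload generation, with buffer sizes and ratios chosen for illustration rather than taken from the paper:

```python
# Build one gRPC-like payload as a list of buffers: "uniform" = all small buffers,
# "skew" = mostly large buffers with a few small ones mixed in.
import random

SMALL = 64               # a few Bytes
LARGE = 4 * 1024 * 1024  # a few MBytes

def make_payload(scheme, total_bytes=16 * 1024 * 1024):
    buffers, used = [], 0
    while used < total_bytes:
        if scheme == "uniform":
            size = SMALL
        else:  # "skew"
            size = SMALL if random.random() < 0.1 else LARGE
        size = min(size, total_bytes - used)
        buffers.append(bytes(size))
        used += size
    return buffers

print(len(make_payload("skew")), len(make_payload("uniform")))
```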
DataWorks Summit@Berlin 2018 24Network Based Computing Laboratory
OSU AR-gRPC Architecture
R. Biswas, X. Lu, and D. K. Panda, Characterizing and Accelerating TensorFlow: Can Adaptive and RDMA-based RPC Benefit It? (Under Review)
• Adaptive RDMA gRPC
• Features
– Hybrid Communication engine
• Adaptive protocol selection between
eager and rendezvous
– Message pipelining and coalescing
• Adaptive chunking and accumulation
• Intelligent threshold detection
– Zero copy transmission
• Zero copy send/recv
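A hedged sketch of the first two ideas above; the threshold and chunk size are illustrative values, not AR-gRPC's actual parameters, and this is not the OSU implementation:

```python
# Eager vs. rendezvous selection plus chunked, zero-copy pipelining of large messages.
EAGER_THRESHOLD = 8 * 1024         # small payloads ride inline with the RPC
CHUNK_SIZE = 1 * 1024 * 1024       # unit for pipelined RDMA transfers

def choose_protocol(nbytes):
    return "eager" if nbytes <= EAGER_THRESHOLD else "rendezvous"

def pipeline_chunks(payload):
    view = memoryview(payload)               # zero-copy view of the payload
    for off in range(0, len(view), CHUNK_SIZE):
        yield view[off:off + CHUNK_SIZE]     # slices share the same buffer

msg = bytes(3 * 1024 * 1024 + 100)
print(choose_protocol(len(msg)), sum(len(c) for c in pipeline_chunks(msg)))
```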
DataWorks Summit@Berlin 2018 25Network Based Computing Laboratory
AR-gRPC Enhanced TensorFlow
DataWorks Summit@Berlin 2018 26Network Based Computing Laboratory
Outline
• Overview of gRPC, TensorFlow, and RDMA Technologies
• Accelerating gRPC and TensorFlow with RDMA
• Benchmarking gRPC and TensorFlow
• Performance Evaluation
• Conclusion
DataWorks Summit@Berlin 2018 27Network Based Computing Laboratory
Dataset | MNIST | CIFAR-10 | ImageNet
Category | Digit Classification | Object Classification | Object Classification
Resolution | 28 × 28 B&W | 32 × 32 Color | 256 × 256 Color
Classes | 10 | 10 | 1000
Training Images | 60 K | 50 K | 1.2 M
Testing Images | 10 K | 10 K | 100 K
Available Benchmarks, Models, and Datasets
Model | Layers (Conv. / Fully-connected) | Dataset | Framework
LeNet | 2 / 2 | MNIST | CaffeOnSpark, TensorFlowOnSpark
SoftMax Regression | NA / NA | MNIST | TensorFlowOnSpark
CIFAR-10 Quick | 3 / 1 | CIFAR-10 | CaffeOnSpark, TensorFlowOnSpark, MMLSpark
VGG-16 | 13 / 3 | CIFAR-10 | BigDL
AlexNet | 5 / 3 | ImageNet | CaffeOnSpark
GoogLeNet | 22 / 0 | ImageNet | CaffeOnSpark
ResNet-50 | 53 / 1 | Synthetic | TensorFlow
DataWorks Summit@Berlin 2018 28Network Based Computing Laboratory
• Current DL models and benchmarks are oriented toward deep learning research
• Example: Facebook's Caffe2 trains ImageNet in 1 hour (state of the art)1
• However, many system researchers are
focused on improving the communication
engine of deep learning frameworks
• A fast benchmark that models deep learning
characteristics is highly desirable
Are Current Benchmarks Sufficient?
1. Goyal, Priya, et al. "Accurate, large minibatch SGD: training
imagenet in 1 hour." arXiv preprint arXiv:1706.02677 (2017).
DataWorks Summit@Berlin 2018 29Network Based Computing Laboratory
TensorFlow DL Micro-benchmarks for gRPC
R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences, BPOE, 2018.
DataWorks Summit@Berlin 2018 30Network Based Computing Laboratory
Outline
• Overview of gRPC, TensorFlow, and RDMA Technologies
• Accelerating gRPC and TensorFlow with RDMA
• Benchmarking gRPC and TensorFlow
• Performance Evaluation
• Conclusion
DataWorks Summit@Berlin 2018 31Network Based Computing Laboratory
Experimental Setup
• We have used two different clusters
• Software Stack used
Cluster A (OSU-RI2-IB-EDR): Intel Broadwell, dual fourteen-core processors; 512 GB RAM; 370 GB local NVMe-SSD; InfiniBand EDR (some nodes with 40G Ethernet)
Cluster B (SDSC-Comet-IB-FDR): Intel Haswell, dual twelve-core processors; 128 GB RAM; 320 GB local SSD; InfiniBand FDR and 10G Ethernet
Stack | Version | Cluster
gRPC | 1.5.0 | A, B
AR-gRPC (OSU RDMA gRPC)1 | Based on 1.5.0 | A, B
TensorFlow | 1.4, Python 2.7 | A
DataWorks Summit@Berlin 2018 32Network Based Computing Laboratory
Performance Benefits for AR-gRPC with Micro-Benchmark
Point-to-Point Latency
• AR-gRPC (OSU design) Latency on SDSC-Comet-FDR
– Up to 2.7x speedup over Default gRPC (IPoIB) in latency for small messages
– Up to 2.8x speedup over Default gRPC (IPoIB) in latency for medium messages
– Up to 2.5x speedup over Default gRPC (IPoIB) in latency for large messages
[Charts: latency (us) vs. payload size for small (2 B to 8 KB), medium (16 KB to 512 KB), and large (1 MB to 8 MB) payloads, comparing Default gRPC (IPoIB-56Gbps) and AR-gRPC (RDMA-56Gbps)]
R. Biswas, X. Lu, and D. K. Panda, Characterizing and Accelerating TensorFlow: Can Adaptive and RDMA-based RPC Benefit It? (Under Review)
DataWorks Summit@Berlin 2018 33Network Based Computing Laboratory
TF-gRPC-P2P-Latency
[Charts: latency (ms) for Uniform, Random, and Skew payload generation schemes; Cluster A compares Default gRPC (Ethernet 40G), Default gRPC (IPoIB-100Gbps), and AR-gRPC (RDMA-100Gbps); Cluster B compares Default gRPC (Ethernet 10G), Default gRPC (IPoIB-56Gbps), and AR-gRPC (RDMA-56Gbps)]
• Cluster A: AR-gRPC (RDMA) reduces latency by 59% and 56% compared to Default gRPC over 40G Ethernet and IPoIB, respectively
• Cluster B: AR-gRPC (RDMA) reduces latency by 78% compared to Default gRPC over 10G Ethernet and by 69% compared to Default gRPC over IPoIB
DataWorks Summit@Berlin 2018 34Network Based Computing Laboratory
TF-gRPC-P2P-Bandwidth
[Charts: bandwidth (MB/s) for Uniform, Random, and Skew payload generation schemes; Cluster A compares Default gRPC (Ethernet 40G), Default gRPC (IPoIB-100Gbps), and AR-gRPC (RDMA-100Gbps); Cluster B compares Default gRPC (Ethernet 10G), Default gRPC (IPoIB-56Gbps), and AR-gRPC (RDMA-56Gbps)]
• Cluster A: AR-gRPC (RDMA) achieves a 2.14x bandwidth increase compared to Default gRPC (IPoIB and Ethernet)
• Cluster B: AR-gRPC (RDMA) achieves 3.2x bandwidth compared to Default gRPC (IPoIB) for skewed data
DataWorks Summit@Berlin 2018 35Network Based Computing Laboratory
TF-gRPC-PS-Throughput
[Charts: RPC/second for Uniform, Random, and Skew payload generation schemes; Cluster A compares Default gRPC (Ethernet 40G), Default gRPC (IPoIB-100Gbps), and AR-gRPC (RDMA-100Gbps); Cluster B compares Default gRPC (Ethernet 10G), Default gRPC (IPoIB-56Gbps), and AR-gRPC (RDMA-56Gbps)]
• Cluster A: AR-gRPC (RDMA) achieves a 3.4x throughput speedup compared to Default gRPC over IPoIB for the uniform scheme
• Cluster B: AR-gRPC (RDMA) achieves 3.6x throughput compared to Default gRPC over IPoIB for the uniform scheme
DataWorks Summit@Berlin 2018 36Network Based Computing Laboratory
Performance Benefits for AR-gRPC with Micro-Benchmark
Fully-Connected Architecture (Mimic TensorFlow communication)
• AR-gRPC (OSU design) TensorFlow Mimic test on SDSC-Comet-FDR
– Up to 60% reduction in average latency over Default gRPC (IPoIB)
– Up to 2.68x performance speedup over Default gRPC (IPoIB)
[Charts: average latency (ms) and calls/second vs. payload size (2M, 4M, 8M Bytes), comparing Default gRPC (IPoIB-56Gbps) and AR-gRPC (RDMA-56Gbps)]
DataWorks Summit@Berlin 2018 37Network Based Computing Laboratory
Performance Benefit for TensorFlow (Resnet50)
• TensorFlow Resnet50 performance evaluation on an IB EDR cluster
– Up to 30% performance speedup over Default gRPC (IPoIB) for 8 GPUs
– Up to 34% performance speedup over Default gRPC (IPoIB) for 16 GPUs
– Up to 44% performance speedup over Default gRPC (IPoIB) for 24 GPUs
[Charts: images/second vs. batch size (16, 32, 64) on 4 nodes (8 GPUs), 8 nodes (16 GPUs), and 12 nodes (24 GPUs), comparing gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps)]
DataWorks Summit@Berlin 2018 38Network Based Computing Laboratory
Performance Benefit for TensorFlow (Inception3)
• TensorFlow Inception3 performance evaluation on an IB EDR cluster
– Up to 20% performance speedup over Default gRPC (IPoIB) for 8 GPUs
– Up to 34% performance speedup over Default gRPC (IPoIB) for 16 GPUs
– Up to 37% performance speedup over Default gRPC (IPoIB) for 24 GPUs
[Charts: images/second vs. batch size (16, 32, 64) on 4 nodes (8 GPUs), 8 nodes (16 GPUs), and 12 nodes (24 GPUs), comparing gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps)]
DataWorks Summit@Berlin 2018 39Network Based Computing Laboratory
Outline
• Overview of gRPC, TensorFlow, and RDMA Technologies
• Accelerating gRPC and TensorFlow with RDMA
• Benchmarking gRPC and TensorFlow
• Performance Evaluation
• Conclusion
DataWorks Summit@Berlin 2018 40Network Based Computing Laboratory
• Presented an overview of gRPC and TensorFlow
• Presented an overview of RDMA and high-performance networks
• Discussed challenges in accelerating and benchmarking TensorFlow and gRPC
• RDMA can benefit DL workloads, as shown by our AR-gRPC and the corresponding enhanced TensorFlow
• Many other open issues need to be solved
• Will enable the Deep Learning community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
Concluding Remarks
DataWorks Summit@Berlin 2018 41Network Based Computing Laboratory
Funding Acknowledgments
Funding Support by
Equipment Support by
DataWorks Summit@Berlin 2018 42Network Based Computing Laboratory
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– R. Biswas (M.S.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– J. Hashmi (Ph.D.)
– H. Javed (Ph.D.)
– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– H. Shi (Ph.D.)
– J. Zhang (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Scientists
– X. Lu
– H. Subramoni
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– M. Arnold
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– N. Sarkauskas (B.S.)
DataWorks Summit@Berlin 2018 43Network Based Computing Laboratory
The 4th International Workshop on
High-Performance Big Data Computing (HPBDC)
HPBDC 2018 will be held with IEEE International Parallel and Distributed Processing
Symposium (IPDPS 2018), Vancouver, British Columbia, Canada, May 2018
Workshop Date: May 21st, 2018
Keynote Talk: Prof. Geoffrey Fox, Twister2: A High-Performance Big Data Programming Environment
Six Regular Research Papers and Two Short Research Papers
Panel Topic: Which Framework is the Best for High-Performance Deep Learning:
Big Data Framework or HPC Framework?
http://web.cse.ohio-state.edu/~luxi/hpbdc2018
HPBDC 2017 was held in conjunction with IPDPS’17
http://web.cse.ohio-state.edu/~luxi/hpbdc2017
HPBDC 2016 was held in conjunction with IPDPS’16
http://web.cse.ohio-state.edu/~luxi/hpbdc2016
DataWorks Summit@Berlin 2018 44Network Based Computing Laboratory
{panda, luxi}@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~luxi
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/
Contenu connexe

Tendances

コンテナ基盤であるLXC/LXDを 本番環境で運用する話
コンテナ基盤であるLXC/LXDを 本番環境で運用する話コンテナ基盤であるLXC/LXDを 本番環境で運用する話
コンテナ基盤であるLXC/LXDを 本番環境で運用する話
Nobuhiro Fujita
 
MHA for MySQLとDeNAのオープンソースの話
MHA for MySQLとDeNAのオープンソースの話MHA for MySQLとDeNAのオープンソースの話
MHA for MySQLとDeNAのオープンソースの話
Yoshinori Matsunobu
 

Tendances (20)

Introduction into Ceph storage for OpenStack
Introduction into Ceph storage for OpenStackIntroduction into Ceph storage for OpenStack
Introduction into Ceph storage for OpenStack
 
高速にコンテナを起動できるイメージフォーマット
高速にコンテナを起動できるイメージフォーマット高速にコンテナを起動できるイメージフォーマット
高速にコンテナを起動できるイメージフォーマット
 
hbstudy#88 5G+MEC時代のシステム設計
hbstudy#88 5G+MEC時代のシステム設計hbstudy#88 5G+MEC時代のシステム設計
hbstudy#88 5G+MEC時代のシステム設計
 
cluster-monitoringで困ったこと学んだこと
cluster-monitoringで困ったこと学んだことcluster-monitoringで困ったこと学んだこと
cluster-monitoringで困ったこと学んだこと
 
コンテナの作り方「Dockerは裏方で何をしているのか?」
コンテナの作り方「Dockerは裏方で何をしているのか?」コンテナの作り方「Dockerは裏方で何をしているのか?」
コンテナの作り方「Dockerは裏方で何をしているのか?」
 
PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)
PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)
PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)
 
オラクルのHPC/GPUソリューションご紹介(2021/08版)
オラクルのHPC/GPUソリューションご紹介(2021/08版)オラクルのHPC/GPUソリューションご紹介(2021/08版)
オラクルのHPC/GPUソリューションご紹介(2021/08版)
 
GKE multi-cluster Ingress
GKE multi-cluster IngressGKE multi-cluster Ingress
GKE multi-cluster Ingress
 
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料) 40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
 
Kubernetes introduction
Kubernetes introductionKubernetes introduction
Kubernetes introduction
 
Openv switchの使い方とか
Openv switchの使い方とかOpenv switchの使い方とか
Openv switchの使い方とか
 
コンテナ基盤であるLXC/LXDを 本番環境で運用する話
コンテナ基盤であるLXC/LXDを 本番環境で運用する話コンテナ基盤であるLXC/LXDを 本番環境で運用する話
コンテナ基盤であるLXC/LXDを 本番環境で運用する話
 
Apache Kafkaによるログ転送とパフォーマンスチューニング - Bonfire Backend #2 -
Apache Kafkaによるログ転送とパフォーマンスチューニング - Bonfire Backend #2 -Apache Kafkaによるログ転送とパフォーマンスチューニング - Bonfire Backend #2 -
Apache Kafkaによるログ転送とパフォーマンスチューニング - Bonfire Backend #2 -
 
分散ストレージソフトウェアCeph・アーキテクチャー概要
分散ストレージソフトウェアCeph・アーキテクチャー概要分散ストレージソフトウェアCeph・アーキテクチャー概要
分散ストレージソフトウェアCeph・アーキテクチャー概要
 
確実な再起動からはじめる クラウドネイティブオペレーション
確実な再起動からはじめる クラウドネイティブオペレーション確実な再起動からはじめる クラウドネイティブオペレーション
確実な再起動からはじめる クラウドネイティブオペレーション
 
サポート エンジニアが語る、Microsoft Azure を支えるインフラの秘密
サポート エンジニアが語る、Microsoft Azure を支えるインフラの秘密サポート エンジニアが語る、Microsoft Azure を支えるインフラの秘密
サポート エンジニアが語る、Microsoft Azure を支えるインフラの秘密
 
Hadoopエコシステムのデータストア振り返り
Hadoopエコシステムのデータストア振り返りHadoopエコシステムのデータストア振り返り
Hadoopエコシステムのデータストア振り返り
 
bicep 紹介
bicep 紹介bicep 紹介
bicep 紹介
 
Hadoopの概念と基本的知識
Hadoopの概念と基本的知識Hadoopの概念と基本的知識
Hadoopの概念と基本的知識
 
MHA for MySQLとDeNAのオープンソースの話
MHA for MySQLとDeNAのオープンソースの話MHA for MySQLとDeNAのオープンソースの話
MHA for MySQLとDeNAのオープンソースの話
 

Similaire à Accelerating TensorFlow with RDMA for high-performance deep learning

Scalable and Distributed DNN Training on Modern HPC Systems
Scalable and Distributed DNN Training on Modern HPC SystemsScalable and Distributed DNN Training on Modern HPC Systems
Scalable and Distributed DNN Training on Modern HPC Systems
inside-BigData.com
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
inside-BigData.com
 
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
inside-BigData.com
 

Similaire à Accelerating TensorFlow with RDMA for high-performance deep learning (20)

Scalable and Distributed DNN Training on Modern HPC Systems
Scalable and Distributed DNN Training on Modern HPC SystemsScalable and Distributed DNN Training on Modern HPC Systems
Scalable and Distributed DNN Training on Modern HPC Systems
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
 
Accelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesAccelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing Technologies
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Designing High performance & Scalable Middleware for HPC
Designing High performance & Scalable Middleware for HPCDesigning High performance & Scalable Middleware for HPC
Designing High performance & Scalable Middleware for HPC
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systems
 
Feec telecom-nw-softwarization-aug-2015
Feec telecom-nw-softwarization-aug-2015Feec telecom-nw-softwarization-aug-2015
Feec telecom-nw-softwarization-aug-2015
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Designing High-Performance and Scalable Middleware for HPC, AI and Data Science
Designing High-Performance and Scalable Middleware for HPC, AI and Data ScienceDesigning High-Performance and Scalable Middleware for HPC, AI and Data Science
Designing High-Performance and Scalable Middleware for HPC, AI and Data Science
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
 
XDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
XDF 2019 Xilinx Accelerated Database and Data Analytics EcosystemXDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
XDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
 
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
2017 dagstuhl-nfv-rothenberg
2017 dagstuhl-nfv-rothenberg2017 dagstuhl-nfv-rothenberg
2017 dagstuhl-nfv-rothenberg
 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
 
Scallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systemsScallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systems
 

Plus de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 

Accelerating TensorFlow with RDMA for high-performance deep learning

  • 1. Accelerating TensorFlow with RDMA for High- Performance Deep Learning Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Talk at DataWorks Summit Berlin 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi
  • 2. DataWorks Summit@Berlin 2018 2Network Based Computing Laboratory • Deep Learning is a sub-set of Machine Learning – But, it is perhaps the most radical and revolutionary subset • Deep Learning is going through a resurgence – Model: Excellent accuracy for deep/convolutional neural networks – Data: Public availability of versatile datasets like MNIST, CIFAR, and ImageNet – Capability: Unprecedented computing and communication capabilities: Multi-/Many-Core, GPGPUs, Xeon Phi, InfiniBand, RoCE, etc. • Big Data has become one of the most important elements in business analytics – Increasing demand for getting Big Value out of Big Data to drive the revenue continuously growing Why Deep Learning is so hot? Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions- flexibility-scalability-and-advocacy/ MNIST handwritten digits Deep Neural Network
  • 3. DataWorks Summit@Berlin 2018 3Network Based Computing Laboratory Application Example of DL: Flickr’s Magic View Photo Filtering Courtesy: https://thenextweb.com/opinion/2015/05/22/flickrs-new-magic-view-photo-filtering-feature-works-so-well-it-convinced-me-to-ditch-iphoto/#.tnw_RaZEaD6g • Image recognition to divide pictures into surprisingly accurate categories • Magic of AI/DL: Generate accurate tags for billions of pictures
  • 4. DataWorks Summit@Berlin 2018 4Network Based Computing Laboratory • TensorFlow • Caffe/Caffe2 • Torch • SparkNet • TensorFrame • DeepLearning4J • BigDL • CNTK • mmlspark • Many others… Examples of Deep Learning Stacks
  • 5. DataWorks Summit@Berlin 2018 5Network Based Computing Laboratory Trends of Deep Learning Stacks • Google TensorFlow • Microsoft CNTK • Facebook Caffe2 • Google Search Trend (March 23, 2018)
  • 6. DataWorks Summit@Berlin 2018 6Network Based Computing Laboratory Big Data (Hadoop, Spark, HBase, Memcached, etc.) Deep Learning (Caffe, TensorFlow, BigDL, etc.) HPC (MPI, RDMA, Lustre, etc.) Increasing Usage of HPC, Big Data and Deep Learning Convergence of HPC, Big Data, and Deep Learning!!!
  • 7. DataWorks Summit@Berlin 2018 7Network Based Computing Laboratory • BLAS Libraries – the heart of math operations – Atlas/OpenBLAS – NVIDIA cuBlas – Intel Math Kernel Library (MKL) • DNN Libraries – the heart of Convolutions! – NVIDIA cuDNN (already reached its 7th iteration – cudnn-v7) – Intel MKL-DNN (MKL 2017) – recent but a very promising development • Communication Libraries – the heart of model parameter updating – RDMA – GPUDirect RDMA Highly-Optimized Underlying Libraries with HPC Technologies
  • 8. DataWorks Summit@Berlin 2018 8Network Based Computing Laboratory Outline • Overview of gRPC, TensorFlow, and RDMA Technologies • Accelerating gRPC and TensorFlow with RDMA • Benchmarking gRPC and TensorFlow • Performance Evaluation • Conclusion
  • 9. DataWorks Summit@Berlin 2018 9Network Based Computing Laboratory Architecture Overview of gRPC Key Features: • Simple service definition • Works across languages and platforms • C++, Java, Python, Android Java etc • Linux, Mac, Windows • Start quickly and scale • Bi-directional streaming and integrated authentication • Used by Google (several of Google’s cloud products and Google externally facing APIs, TensorFlow), NetFlix, Docker, Cisco, Juniper Networks etc. • Uses sockets for communication! Source: http://www.grpc.io/ Large-scale distributed systems composed of micro services
  • 10. DataWorks Summit@Berlin 2018 10Network Based Computing Laboratory Architecture Overview of Google TensorFlow Key Features: • Widely used for Deep Learning • Open source software library for numerical computation using data flow graphs • Graph edges represent the multidimensional data arrays • Nodes in the graph represent mathematical operations • Flexible architecture allows to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API • Used by Google, Airbnb, DropBox, Snapchat, Twitter • Communication and Computation intensive Source: https://www.tensorflow.org/ Architecture of TensorFlow
  • 11. DataWorks Summit@Berlin 2018 11Network Based Computing Laboratory Overview of gRPC with TensorFlow Worker services communicate among each other using gRPC, or gRPC+X! Client Master Worker /job:PS/task:0 Worker /job:Worker/task:0 CPU GPU gRPC server/ client CPU GPU gRPC server/ client
  • 12. DataWorks Summit@Berlin 2018 12Network Based Computing Laboratory Drivers of Modern HPC Cluster and Data Center Architecture • Multi-core/many-core technologies • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) – Single Root I/O Virtualization (SR-IOV) • Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters • Accelerators (NVIDIA GPGPUs and Intel Xeon Phi) High Performance Interconnects – InfiniBand (with SR-IOV) <1usec latency, 200Gbps Bandwidth> Multi-/Many-core Processors Cloud CloudSDSC Comet TACC Stampede Accelerators / Coprocessors high compute density, high performance/watt >1 TFlop DP on a chip SSD, NVMe-SSD, NVRAM
  • 13. DataWorks Summit@Berlin 2018 13Network Based Computing Laboratory Interconnects and Protocols in OpenFabrics Stack Kernel Space Application / Middleware Verbs Ethernet Adapter Ethernet Switch Ethernet Driver TCP/IP 1/10/25/40/ 100 GigE InfiniBand Adapter InfiniBand Switch IPoIB IPoIB Ethernet Adapter Ethernet Switch Hardware Offload TCP/IP 10/40 GigE- TOE InfiniBand Adapter InfiniBand Switch User Space RSockets RSockets iWARP Adapter Ethernet Switch TCP/IP User Space iWARP RoCE Adapter Ethernet Switch RDMA User Space RoCE InfiniBand Switch InfiniBand Adapter RDMA User Space IB Native Sockets Application / Middleware Interface Protocol Adapter Switch InfiniBand Adapter InfiniBand Switch RDMA SDP SDP
  • 14. DataWorks Summit@Berlin 2018 14Network Based Computing Laboratory • 163 IB Clusters (32.6%) in the Nov’17 Top500 list – (http://www.top500.org) • Installations in the Top 50 (17 systems): Large-scale InfiniBand Installations 19,860,000 core (Gyoukou) in Japan (4th) 60,512 core (DGX SATURN V) at NVIDIA/USA (36th) 241,108 core (Pleiades) at NASA/Ames (17th) 72,000 core (HPC2) in Italy (37th) 220,800 core (Pangea) in France (21st) 152,692 core (Thunder) at AFRL/USA (40th) 144,900 core (Cheyenne) at NCAR/USA (24th) 99,072 core (Mistral) at DKRZ/Germany (42nd) 155,150 core (Jureca) in Germany (29th) 147,456 core (SuperMUC) in Germany (44th) 72,800 core Cray CS-Storm in US (30th) 86,016 core (SuperMUC Phase 2) in Germany (45th) 72,800 core Cray CS-Storm in US (31st) 74,520 core (Tsubame 2.5) at Japan/GSIC (48th) 78,336 core (Electra) at NASA/USA (33rd) 66,000 core (HPC3) in Italy (51st) 124,200 core (Topaz) SGI ICE at ERDC DSRC in US (34th) 194,616 core (Cascade) at PNNL (53rd) 60,512 core (NVIDIA DGX-1/Relion) at Facebook in USA (35th) and many more!
  • 15. DataWorks Summit@Berlin 2018 15Network Based Computing Laboratory • Introduced in Oct 2000 • High Performance Data Transfer – Interprocessor communication and I/O – Low latency (<1.0 microsec), High bandwidth (up to 25 GigaBytes/sec -> 200Gbps), and low CPU utilization (5-10%) • Flexibility for LAN and WAN communication • Multiple Transport Services – Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram – Provides flexibility to develop upper layers • Multiple Operations – Send/Recv – RDMA Read/Write – Atomic Operations (very unique) • high performance and scalable implementations of distributed locks, semaphores, collective communication operations • Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, …. Open Standard InfiniBand Networking Technology
  • 16. DataWorks Summit@Berlin 2018 16Network Based Computing Laboratory RDMA over Converged Enhanced Ethernet (RoCE) IB Verbs Application Hardware RoCE IB Verbs Application RoCE v2 InfiniBand Link Layer IB Network IB Transport IB Verbs Application InfiniBand Ethernet Link Layer IB Network IB Transport Ethernet Link Layer UDP / IP IB Transport • Takes advantage of IB and Ethernet – Software written with IB-Verbs – Link layer is Converged (Enhanced) Ethernet (CE) – 100Gb/s support from latest EDR and ConnectX- 3 Pro adapters • Pros: IB Vs RoCE – Works natively in Ethernet environments • Entire Ethernet management ecosystem is available – Has all the benefits of IB verbs – Link layer is very similar to the link layer of native IB, so there are no missing features • RoCE v2: Additional Benefits over RoCE – Traditional Network Management Tools Apply – ACLs (Metering, Accounting, Firewalling) – GMP Snooping for Optimized Multicast – Network Monitoring Tools Courtesy: OFED, Mellanox Network Stack Comparison Packet Header Comparison ETH L2 Hdr Ethertype IB GRH L3 Hdr IB BTH+ L4 Hdr RoCE ETH L2 Hdr Ethertype IP Hdr L3 Hdr IB BTH+ L4 Hdr Proto# RoCEv2 UDP Hdr Port#
  • 17. DataWorks Summit@Berlin 2018 17Network Based Computing Laboratory • RDMA for Apache Spark • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) – Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions • RDMA for Apache HBase • RDMA for Memcached (RDMA-Memcached) • RDMA for Apache Hadoop 1.x (RDMA-Hadoop) • OSU HiBD-Benchmarks (OHB) – HDFS, Memcached, HBase, and Spark Micro-benchmarks • http://hibd.cse.ohio-state.edu • Users Base: 280 organizations from 34 countries • More than 26,000 downloads from the project site The High-Performance Big Data (HiBD) Project Available for InfiniBand and RoCE Also run on Ethernet Available for x86 and OpenPOWER Support for Singularity and Docker
  • 18. DataWorks Summit@Berlin 2018 18Network Based Computing Laboratory HiBD Release Timeline and Downloads 0 5000 10000 15000 20000 25000 30000 NumberofDownloads Timeline RDMA-Hadoop2.x0.9.1 RDMA-Hadoop2.x0.9.6 RDMA-Memcached0.9.4 RDMA-Hadoop2.x0.9.7 RDMA-Spark0.9.1 RDMA-HBase0.9.1 RDMA-Hadoop2.x1.0.0 RDMA-Spark0.9.4 RDMA-Hadoop2.x1.3.0 RDMA-Memcached0.9.6&OHB-0.9.3 RDMA-Spark0.9.5 RDMA-Hadoop1.x0.9.0 RDMA-Hadoop1.x0.9.8 RDMA-Memcached0.9.1&OHB-0.7.1 RDMA-Hadoop1.x0.9.9
  • 19. DataWorks Summit@Berlin 2018 19Network Based Computing Laboratory • Can similar designs be done for gRPC and TensorFlow to achieve significant performance benefits by taking advantage of native RDMA support? • How do we benchmark gRPC and TensorFlow for both deep learning and system researchers? • What kind of performance benefits we can get through native RDMA-based designs in gRPC and TensorFlow? Motivation
  • 20. DataWorks Summit@Berlin 2018 20Network Based Computing Laboratory Outline • Overview of gRPC, TensorFlow, and RDMA Technologies • Accelerating gRPC and TensorFlow with RDMA • Benchmarking gRPC and TensorFlow • Performance Evaluation • Conclusion
  • 21. DataWorks Summit@Berlin 2018 21Network Based Computing Laboratory Tensor Communication over gRPC channel • Rendezvous protocol • TensorFlow worker (tensor receiving process) actively requests for tensors to the parameter server (tensor sending process) • Worker issues Tensor RPC request that to Parameter Server (PS) • PS finds the requested tensor, and responds to the worker • gRPC core uses recvmsg and sendmsg primitives for receiving and sending payloads • Tensor Transmission uses iovec structures R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences, BPOE, 2018.
• 22. High-Performance Tensor Communication Channel
  – gRPC + Verbs: dedicated Verbs channel for tensor communication; gRPC channel for administrative-task communication
  – gRPC + MPI: dedicated MPI channel for tensor communication; gRPC channel for administrative-task communication
  – Uber Horovod: Uber's MPI-based approach to distributed TensorFlow
  – Baidu tensorflow-allreduce: Baidu's MPI-based approach to distributed TensorFlow
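The verbs and MPI channels above are selected through TensorFlow's server protocol string (available only when TensorFlow is built with the corresponding contrib transports), while Horovod sidesteps the parameter server with MPI allreduce. A brief, hedged sketch, reusing the hypothetical host names from the previous example:

```python
# Hedged sketch of channel selection; assumes a TensorFlow 1.x build that
# includes the verbs/MPI transports and an installed Horovod.
import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["ps0.example.org:2222"],
                                "worker": ["worker0.example.org:2222"]})

# gRPC stays in charge of administrative traffic; tensors go over the
# dedicated channel named in 'protocol' ("grpc", "grpc+verbs", or "grpc+mpi").
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+verbs")

# Horovod-style alternative (no parameter server): gradients are combined
# with MPI allreduce, and the job is launched with mpirun/horovodrun.
import horovod.tensorflow as hvd
hvd.init()
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)   # wraps the optimizer with allreduce
```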
• 23. TensorFlow Workload via gRPC: iovec Buffer Distribution
  – Buffer distribution observed for TensorFlow training over gRPC
  – Small, Medium, and Large indicate buffers of a few bytes, kilobytes, and megabytes in length
  – A gRPC payload may contain a uniform distribution of such Small buffers
  – Many Large buffers together with a few Small buffers create a skewed distribution of buffers within one gRPC payload
  – R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences, BPOE, 2018.
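As a rough illustration of these two buffer mixes (the sizes and counts below are assumptions for illustration, not measurements from the paper), a payload can be synthesized as a list of iovec-like buffers:

```python
# Illustrative only: compose a gRPC-like payload from "Small" (bytes),
# "Medium" (KB) and "Large" (MB) buffers, mimicking the uniform vs. skewed
# iovec distributions observed for TensorFlow training. Sizes are assumptions.
import random

SMALL, MEDIUM, LARGE = 64, 64 * 1024, 4 * 1024 * 1024   # bytes

def make_payload(scheme, n_buffers=32):
    if scheme == "uniform":
        sizes = [random.choice([SMALL, MEDIUM, LARGE]) for _ in range(n_buffers)]
    elif scheme == "skew":
        # mostly Large buffers plus a few Small ones, as seen in TF workloads
        sizes = [LARGE] * (n_buffers - 4) + [SMALL] * 4
    else:
        raise ValueError(scheme)
    return [bytearray(s) for s in sizes]   # stands in for the iovec list

payload = make_payload("skew")
print(sum(len(b) for b in payload) / 1e6, "MB in", len(payload), "buffers")
```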
• 24. OSU AR-gRPC Architecture
  – AR-gRPC: Adaptive RDMA gRPC
  – Features: hybrid communication engine (adaptive protocol selection between eager and rendezvous); message pipelining and coalescing (adaptive chunking and accumulation, intelligent threshold detection); zero-copy transmission (zero-copy send/recv)
  – R. Biswas, X. Lu, and D. K. Panda, Characterizing and Accelerating TensorFlow: Can Adaptive and RDMA-based RPC Benefit It? (Under Review)
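The snippet below is only a conceptual sketch of the hybrid eager/rendezvous selection and chunk-based pipelining listed above; it is not the AR-gRPC implementation, and the rdma_channel object, its methods, and the threshold constants are hypothetical placeholders.

```python
# Conceptual sketch of the adaptive ideas above (not the AR-gRPC code):
# pick eager vs. rendezvous per message and chunk large buffers for pipelining.
EAGER_THRESHOLD = 32 * 1024       # assumed cutoff; AR-gRPC detects this adaptively
CHUNK_SIZE      = 1 * 1024 * 1024

def send_adaptive(rdma_channel, msg: bytes):
    """rdma_channel and its methods are hypothetical placeholders."""
    if len(msg) <= EAGER_THRESHOLD:
        # eager path: copy into a pre-registered buffer and send immediately
        rdma_channel.send_eager(msg)
    else:
        # rendezvous path: expose the registered source buffer and transfer it
        # chunk by chunk so that transmission and processing can overlap
        for off in range(0, len(msg), CHUNK_SIZE):
            rdma_channel.post_rdma_chunk(msg, off, min(CHUNK_SIZE, len(msg) - off))
        rdma_channel.notify_complete()
```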
• 25. AR-gRPC Enhanced TensorFlow
• 26. Outline
  – Overview of gRPC, TensorFlow, and RDMA Technologies
  – Accelerating gRPC and TensorFlow with RDMA
  – Benchmarking gRPC and TensorFlow
  – Performance Evaluation
  – Conclusion
• 27. Available Benchmarks, Models, and Datasets
  – Datasets:
    | Dataset | Category | Resolution | Classes | Training Images | Testing Images |
    | MNIST | Digit classification | 28 × 28 B&W | 10 | 60 K | 10 K |
    | CIFAR-10 | Object classification | 32 × 32 color | 10 | 50 K | 10 K |
    | ImageNet | Object classification | 256 × 256 color | 1000 | 1.2 M | 100 K |
  – Models:
    | Model | Layers (Conv. / Fully-connected) | Dataset | Framework |
    | LeNet | 2 / 2 | MNIST | CaffeOnSpark, TensorFlowOnSpark |
    | SoftMax Regression | NA / NA | MNIST | TensorFlowOnSpark |
    | CIFAR-10 Quick | 3 / 1 | CIFAR-10 | CaffeOnSpark, TensorFlowOnSpark, MMLSpark |
    | VGG-16 | 13 / 3 | CIFAR-10 | BigDL |
    | AlexNet | 5 / 3 | ImageNet | CaffeOnSpark |
    | GoogLeNet | 22 / 0 | ImageNet | CaffeOnSpark |
    | Resnet-50 | 53 / 1 | Synthetic | TensorFlow |
• 28. Are Current Benchmarks Sufficient?
  – Current DL models and benchmarks are oriented toward deep learning research
  – Example: Facebook's Caffe2 takes one hour to train on the ImageNet data (state of the art)1
  – However, many systems researchers focus on improving the communication engine of deep learning frameworks
  – A fast benchmark that models deep learning characteristics is therefore highly desirable
  – 1. Goyal, Priya, et al. "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv preprint arXiv:1706.02677 (2017).
• 29. TensorFlow DL Micro-benchmarks for gRPC. R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences, BPOE, 2018.
• 30. Outline
  – Overview of gRPC, TensorFlow, and RDMA Technologies
  – Accelerating gRPC and TensorFlow with RDMA
  – Benchmarking gRPC and TensorFlow
  – Performance Evaluation
  – Conclusion
• 31. Experimental Setup
  – Two different clusters were used:
    Cluster A (OSU-RI2-IB-EDR): Intel Broadwell, dual fourteen-core processors; 512 GB RAM; 370 GB local NVMe SSD; InfiniBand EDR (some nodes with 40G Ethernet)
    Cluster B (SDSC-Comet-IB-FDR): Intel Haswell, dual twelve-core processors; 128 GB RAM; 320 GB local SSD; InfiniBand FDR and 10G Ethernet
  – Software stack used:
    gRPC 1.5.0 on Clusters A and B; AR-gRPC (OSU RDMA gRPC), based on gRPC 1.5.0, on Clusters A and B; TensorFlow 1.4 with Python 2.7 on Cluster A
• 32. Performance Benefits for AR-gRPC with Micro-Benchmark: Point-to-Point Latency
  – AR-gRPC (OSU design) latency on SDSC-Comet-FDR, compared with Default gRPC (IPoIB-56Gbps):
    up to 2.7x latency speedup for small messages,
    up to 2.8x latency speedup for medium messages,
    up to 2.5x latency speedup for large messages
  – Figures: latency (us) vs. payload size for small (2 B to 8 KB), medium (16 KB to 512 KB), and large (1 MB to 8 MB) messages, Default gRPC (IPoIB-56Gbps) vs. AR-gRPC (RDMA-56Gbps)
  – R. Biswas, X. Lu, and D. K. Panda, Characterizing and Accelerating TensorFlow: Can Adaptive and RDMA-based RPC Benefit It? (Under Review)
• 33. TF-gRPC-P2P-Latency
  – Cluster A: AR-gRPC (RDMA-100Gbps) reduces latency by 59% and 56% compared to Default gRPC over 40G Ethernet and over IPoIB-100Gbps, respectively
  – Cluster B: AR-gRPC (RDMA-56Gbps) reduces latency by 78% compared to Default gRPC over 10G Ethernet and by 69% compared to Default gRPC over IPoIB-56Gbps
  – Figures: latency (ms) under the Uniform, Random, and Skew payload generation schemes for Default gRPC (Ethernet), Default gRPC (IPoIB), and AR-gRPC (RDMA) on Clusters A and B
• 34. TF-gRPC-P2P-Bandwidth
  – Cluster A: AR-gRPC (RDMA-100Gbps) achieves a 2.14x bandwidth increase compared to Default gRPC (IPoIB and Ethernet)
  – Cluster B: AR-gRPC (RDMA-56Gbps) achieves 3.2x the bandwidth of Default gRPC over IPoIB for skewed data
  – Figures: bandwidth (MB/s) under the Uniform, Random, and Skew payload generation schemes for Default gRPC (Ethernet), Default gRPC (IPoIB), and AR-gRPC (RDMA) on Clusters A and B
• 35. TF-gRPC-PS-Throughput
  – Cluster A: AR-gRPC (RDMA-100Gbps) achieves a 3.4x throughput speedup compared to Default gRPC over IPoIB for the uniform scheme
  – Cluster B: AR-gRPC (RDMA-56Gbps) achieves 3.6x the throughput of Default gRPC over IPoIB for the uniform scheme
  – Figures: throughput (RPCs/second) under the Uniform, Random, and Skew payload generation schemes for Default gRPC (Ethernet), Default gRPC (IPoIB), and AR-gRPC (RDMA) on Clusters A and B
• 36. Performance Benefits for AR-gRPC with Micro-Benchmark: Fully-Connected Architecture (Mimics TensorFlow Communication)
  – AR-gRPC (OSU design) TensorFlow-mimic test on SDSC-Comet-FDR:
    up to 60% reduction in average latency over Default gRPC (IPoIB),
    up to 2.68x speedup in calls/second over Default gRPC (IPoIB)
  – Figures: latency (ms) and calls/second for 2 MB, 4 MB, and 8 MB payloads, Default gRPC (IPoIB-56Gbps) vs. AR-gRPC (RDMA-56Gbps)
• 37. Performance Benefit for TensorFlow (Resnet50)
  – TensorFlow Resnet50 performance evaluation on an IB EDR cluster:
    up to 30% speedup over Default gRPC (IPoIB) for 8 GPUs (4 nodes),
    up to 34% speedup over Default gRPC (IPoIB) for 16 GPUs (8 nodes),
    up to 44% speedup over Default gRPC (IPoIB) for 24 GPUs (12 nodes)
  – Figures: images/second vs. batch size (16, 32, 64) for gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps); a hypothetical launch sketch for runs of this kind follows below
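Images-per-second results of this kind are commonly collected with the public tf_cnn_benchmarks scripts; whether this deck used exactly that tool and these flags is an assumption, and the host names and GPU counts below are placeholders. The --server_protocol flag selects the gRPC, Verbs, or MPI baselines shown above, while AR-gRPC presumably slots in as the underlying gRPC runtime itself rather than through a separate flag.

```python
# Hypothetical launcher for a distributed tf_cnn_benchmarks run of the kind
# reported above; the exact tool/flags used in the talk are an assumption,
# and host names, GPU counts, and ports are placeholders.
import subprocess

PS      = ["node01:50000"]
WORKERS = ["node01:50001", "node02:50001"]

def launch(job, idx):
    cmd = [
        "python", "tf_cnn_benchmarks.py",
        "--model=resnet50", "--batch_size=64", "--num_gpus=2",
        "--variable_update=parameter_server",
        "--server_protocol=grpc",    # or grpc+verbs / grpc+mpi for the RDMA channels
        "--ps_hosts=" + ",".join(PS),
        "--worker_hosts=" + ",".join(WORKERS),
        "--job_name=" + job, "--task_index=" + str(idx),
    ]
    return subprocess.Popen(cmd)

ps_proc = launch("ps", 0)
workers = [launch("worker", i) for i in range(len(WORKERS))]
for w in workers:
    w.wait()             # workers exit when the benchmark finishes
ps_proc.terminate()      # the PS process runs until killed
```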
• 38. Performance Benefit for TensorFlow (Inception3)
  – TensorFlow Inception3 performance evaluation on an IB EDR cluster:
    up to 20% speedup over Default gRPC (IPoIB) for 8 GPUs (4 nodes),
    up to 34% speedup over Default gRPC (IPoIB) for 16 GPUs (8 nodes),
    up to 37% speedup over Default gRPC (IPoIB) for 24 GPUs (12 nodes)
  – Figures: images/second vs. batch size (16, 32, 64) for gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps)
• 39. Outline
  – Overview of gRPC, TensorFlow, and RDMA Technologies
  – Accelerating gRPC and TensorFlow with RDMA
  – Benchmarking gRPC and TensorFlow
  – Performance Evaluation
  – Conclusion
• 40. Concluding Remarks
  – Presented an overview of gRPC and TensorFlow
  – Presented an overview of RDMA and high-performance networks
  – Discussed the challenges in accelerating and benchmarking TensorFlow and gRPC
  – RDMA can benefit DL workloads, as shown by our AR-gRPC and the corresponding enhanced TensorFlow
  – Many other open issues still need to be solved
  – Will enable the Deep Learning community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
• 41. Funding Acknowledgments: funding support and equipment support (sponsor logos)
  • 42. DataWorks Summit@Berlin 2018 42Network Based Computing Laboratory Personnel Acknowledgments Current Students (Graduate) – A. Awan (Ph.D.) – R. Biswas (M.S.) – M. Bayatpour (Ph.D.) – S. Chakraborthy (Ph.D.) – C.-H. Chu (Ph.D.) – S. Guganani (Ph.D.) Past Students – A. Augustine (M.S.) – P. Balaji (Ph.D.) – S. Bhagvat (M.S.) – A. Bhat (M.S.) – D. Buntinas (Ph.D.) – L. Chai (Ph.D.) – B. Chandrasekharan (M.S.) – N. Dandapanthula (M.S.) – V. Dhanraj (M.S.) – T. Gangadharappa (M.S.) – K. Gopalakrishnan (M.S.) – R. Rajachandrasekar (Ph.D.) – G. Santhanaraman (Ph.D.) – A. Singh (Ph.D.) – J. Sridhar (M.S.) – S. Sur (Ph.D.) – H. Subramoni (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Vishnu (Ph.D.) – J. Wu (Ph.D.) – W. Yu (Ph.D.) Past Research Scientist – K. Hamidouche – S. Sur Past Post-Docs – D. Banerjee – X. Besseron – H.-W. Jin – W. Huang (Ph.D.) – W. Jiang (M.S.) – J. Jose (Ph.D.) – S. Kini (M.S.) – M. Koop (Ph.D.) – K. Kulkarni (M.S.) – R. Kumar (M.S.) – S. Krishnamoorthy (M.S.) – K. Kandalla (Ph.D.) – M. Li (Ph.D.) – P. Lai (M.S.) – J. Liu (Ph.D.) – M. Luo (Ph.D.) – A. Mamidala (Ph.D.) – G. Marsh (M.S.) – V. Meshram (M.S.) – A. Moody (M.S.) – S. Naravula (Ph.D.) – R. Noronha (Ph.D.) – X. Ouyang (Ph.D.) – S. Pai (M.S.) – S. Potluri (Ph.D.) – J. Hashmi (Ph.D.) – H. Javed (Ph.D.) – P. Kousha (Ph.D.) – D. Shankar (Ph.D.) – H. Shi (Ph.D.) – J. Zhang (Ph.D.) – J. Lin – M. Luo – E. Mancini Current Research Scientists – X. Lu – H. Subramoni Past Programmers – D. Bureddy – J. Perkins Current Research Specialist – J. Smith – M. Arnold – S. Marcarelli – J. Vienne – H. Wang Current Post-doc – A. Ruhela – K. Manian Current Students (Undergraduate) – N. Sarkauskas (B.S.)
• 43. The 4th International Workshop on High-Performance Big Data Computing (HPBDC)
  – HPBDC 2018 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), Vancouver, British Columbia, Canada; workshop date: May 21st, 2018
  – Keynote talk: Prof. Geoffrey Fox, Twister2: A High-Performance Big Data Programming Environment
  – Six regular research papers and two short research papers
  – Panel topic: Which Framework is the Best for High-Performance Deep Learning: Big Data Framework or HPC Framework?
  – http://web.cse.ohio-state.edu/~luxi/hpbdc2018
  – HPBDC 2017 was held in conjunction with IPDPS'17: http://web.cse.ohio-state.edu/~luxi/hpbdc2017
  – HPBDC 2016 was held in conjunction with IPDPS'16: http://web.cse.ohio-state.edu/~luxi/hpbdc2016
• 44. Thank You!
  – {panda, luxi}@cse.ohio-state.edu
  – http://www.cse.ohio-state.edu/~panda
  – http://www.cse.ohio-state.edu/~luxi
  – Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
  – The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/