1
Network Based Computing Laboratory Global AI (Dec ‘22)
Designing High-Performance and Scalable Middleware for HPC,
AI and Data Science
Kaushik Kandadi Suresh
The Ohio State University
E-mail: kandadisuresh.1@osu.edu
http://www.cse.ohio-state.edu/~panda
A Talk at Global AI Event (December ‘22)
by
Follow us on
https://twitter.com/mvapich
3
Network Based Computing Laboratory Global AI (Dec ‘22)
Introduction to HPC, MPI, RDMA
• High Performance Computing (HPC):
– Utilization of computing power to process data and operations at high speeds
– Used for solving compute-intensive problems on multiple nodes
• Communication:
– Certain computational problems, when decomposed across multiple nodes/machines,
require data exchange across nodes/processors
– MPI is a parallel programming model that provides communication primitives for
parallel programs (a minimal sketch follows this list)
– RDMA enables direct access to a remote node's memory without CPU involvement
• Improves communication latency
• Provided by interconnects such as InfiniBand, RoCE, Slingshot
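To make the MPI primitives above concrete, here is a minimal point-to-point sketch added for this write-up (not from the original slides), assuming mpi4py is installed on top of an MPI library such as MVAPICH2 and the program is launched with, e.g., mpirun -np 2 python send_recv.py (the file name is hypothetical):

# Minimal mpi4py send/receive sketch (illustrative only).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.zeros(4, dtype='i')
if rank == 0:
    buf[:] = [1, 2, 3, 4]
    comm.Send([buf, MPI.INT], dest=1, tag=0)    # rank 0 sends four integers to rank 1
elif rank == 1:
    comm.Recv([buf, MPI.INT], source=0, tag=0)  # rank 1 receives them
    print("rank 1 received", buf)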
4
Network Based Computing Laboratory Global AI (Dec ‘22)
4
Bigger Challenge: Blood Flow in Human Vascular Network
• Cardiovascular disease accounts for about 50% of deaths in the Western world
• Formation of arterial disease is strongly correlated with blood flow patterns
Computational challenges:
Enormous problem size: in one minute, the heart pumps the entire blood supply of 5 quarts through 60,000 miles of vessels, about a quarter of the distance between the Earth and the Moon
Blood flow involves multiple scales
Courtesy: G. Em Karniadakis & L. Grinberg
5
Network Based Computing Laboratory Global AI (Dec ‘22)
Bigger Challenge: Earthquake and Flu/COVID Pandemic Simulation
Earthquake simulation
Surface velocity 75 sec after
earthquake
Flu pandemic simulation
300 million people tracked
Density of infected population, 45 days after outbreak
Courtesy: G. Em Karniadakis & L. Grinberg
6
Network Based Computing Laboratory Global AI (Dec ‘22)
Big Velocity – How Much Data Is Generated Every Minute on the Internet?
The global Internet population grew 10% from January 2021 to July 2021 and now represents 5.17 billion people.
Courtesy: https://www.domo.com/blog/data-never-sleeps-9/
7
Network Based Computing Laboratory Global AI (Dec ‘22)
AI, Machine Learning and Deep Learning?
Courtesy: https://hackernoon.com/difference-between-artificial-intelligence-machine-learning-
and-deep-learning-1pcv3zeg, https://blog.dataiku.com/ai-vs.-machine-learning-vs.-deep-learning,
https://en.wikipedia.org/wiki/Machine_learning
• Machine Learning (ML)
– “the study of computer algorithms to improve
automatically through experience and use of data”
• Deep Learning (DL) – a subset of ML
– Uses Deep Neural Networks (DNNs)
– Perhaps, the most revolutionary subset!
• Based on learning data representation
• DNN Examples: Convolutional Neural Networks,
Recurrent Neural Networks, Hybrid Networks
• Data Scientist or Developer Perspective for using DNNs
1. Identify DL as a solution to a problem
2. Determine the data set
3. Select the deep learning algorithm to use
4. Use a large data set to train the algorithm (a minimal training-loop sketch follows)
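As a hedged illustration of the last step (training on a data set), a minimal PyTorch training loop; the model, data, and hyperparameters below are placeholders chosen for this sketch, not taken from the talk:

# Toy training loop: forward pass, loss, backward pass, parameter update.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(256, 32)            # stand-in for a real data set
y = torch.randint(0, 10, (256,))    # stand-in labels

for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()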
8
Network Based Computing Laboratory Global AI (Dec ‘22)
Credit Card Fraud Detection using Machine Learning
Courtesy: https://spd.group/machine-learning/fraud-detection-with-machine-learning
https://www.sas.com/en_us/insights/articles/risk-fraud/fraud-detection-machine-learning.html
… almost $112 million due to credit card fraud in 2019.
9
Network Based Computing Laboratory Global AI (Dec ‘22)
The Impact of Deep Learning on Application Areas
Courtesy: https://github.com/alexjc/neural-doodle
Courtesy: https://arxiv.org/pdf/1808.02334.pdf
Courtesy: https://research.googleblog.com/2015/07/how-google-translate-squeezes-deep.html
Courtesy: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8065136
10
Network Based Computing Laboratory Global AI (Dec ‘22)
Self Driving Cars
Courtesy: http://www.teslarati.com/teslas-full-self-driving-capability-arrive-3-months-definitely-6-months-says-musk/
11
Network Based Computing Laboratory Global AI (Dec ‘22)
• Applications
– Prostate Cancer Detection
– Metastasis Detection in Breast Cancer
– Genetic Mutation Prediction
– Tumor Detection for Molecular Analysis
AI-Driven Digital Pathology
Courtesy: https://www.frontiersin.org/articles/10.3389/fmed.2019.00185/full
12
Network Based Computing Laboratory Global AI (Dec ‘22)
Artificial Intelligence Use Cases and Growth Trends
Courtesy: https://www.top500.org/news/market-for-artificial-intelligence-projected-to-hit-36-billion-by-2025/
13
Network Based Computing Laboratory Global AI (Dec ‘22)
High-End Computing (HEC): PetaFlop to ExaFlop
• 100 PetaFlops in 2017
• 442 PetaFlops in 2020 (Fugaku in Japan with 7.63M cores)
• 1.1 ExaFlops (HPL) and 6.88 ExaFlops (HPL-AI) in 2022 (Frontier at ORNL with 8.73M cores)
14
Network Based Computing Laboratory Global AI (Dec ‘22)
Trends for Commodity Computing Clusters in the Top 500
List (http://www.top500.org)
[Chart: number of clusters (0-500, left axis) and percentage of clusters (0-100%, right axis) in the Top500 list over time; commodity clusters now account for 98.4% of the systems]
15
Network Based Computing Laboratory Global AI (Dec ‘22)
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand, RoCE, Slingshot)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
Accelerators: high compute density, high performance/watt, >9.7 TFlop DP on a chip
High-performance interconnects (InfiniBand, Slingshot): <1 usec latency, 200-400 Gbps bandwidth
Multi-/many-core processors; SSD, NVMe-SSD, NVRAM
Example systems: Frontier, Summit, Lumi, Fugaku
16
Network Based Computing Laboratory Global AI (Dec ‘22)
Increasing Usage of HPC, Deep/Machine Learning, and Data Science
[Diagram: Deep/Machine Learning (TensorFlow, PyTorch, cuML, etc.), Big Data (Hadoop, Spark) / Data Science (Dask), and HPC (MPI, PGAS, etc.)]
Convergence of HPC, Deep/Machine Learning, and Data Science!
Increasing need to run these applications on the Cloud!!
Can MPI-driven converged middleware be designed and used for all three domains?
17
Network Based Computing Laboratory Global AI (Dec ‘22)
Designing Communication Libraries for Multi-Petaflop and
Exaflop Systems: Challenges
Application Kernels/Applications (HPC, DL, Data Science)
Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Hadoop, Spark (RDD, DAG), TensorFlow, PyTorch, etc.
Communication Library or Runtime for Programming Models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
Underlying hardware: networking technologies (InfiniBand, Ethernet, RoCE, Omni-Path, and Slingshot), multi-/many-core architectures, accelerators (GPU and FPGA)
Middleware goals: performance, scalability, resilience; co-design opportunities and challenges across the various layers
18
Network Based Computing Laboratory Global AI (Dec ‘22)
• MVAPICH Project
– MPI Library with CUDA-Awareness (GPU)
– Accelerating applications with DPU
• HiDL Project
– High-Performance Deep Learning
– High-Performance Machine Learning
• HiBD Project
– Accelerating Big Data and Data Science Applications
• Conclusions
Presentation Overview
19
Network Based Computing Laboratory Global AI (Dec ‘22)
Converged Software Stacks for HPC, Deep/Machine Learning, and Data Science
[Diagram: Deep/Machine Learning (TensorFlow, PyTorch, cuML, etc.), Big Data (Hadoop, Spark) / Data Science (Dask), and HPC (MPI, PGAS, etc.) on a converged software stack]
20
Network Based Computing Laboratory Global AI (Dec ‘22)
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library
• Support for multiple interconnects
– InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), AWS
EFA, Rockport Networks, and Slingshot
• Support for multiple platforms
– x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs (NVIDIA and AMD)
• Started in 2001, first open-source version demonstrated at SC ‘02
• Supports the latest MPI-3.1 standard
• http://mvapich.cse.ohio-state.edu
• Additional optimized versions for different systems/environments:
– MVAPICH2-X (Advanced MPI + PGAS), since 2011
– MVAPICH2-GDR with support for NVIDIA (since 2014) and AMD (since 2020) GPUs
– MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014
– MVAPICH2-Virt with virtualization support, since 2015
– MVAPICH2-EA with support for Energy-Awareness, since 2015
– MVAPICH2-Azure for Azure HPC IB instances, since 2019
– MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
• Tools:
– OSU MPI Micro-Benchmarks (OMB), since 2003
– OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Used by more than 3,200 organizations in 89 countries
• More than 1.56 Million downloads from the OSU site
directly
• Empowering many TOP500 clusters (Nov ‘21 ranking)
– 4th , 10,649,600-core (Sunway TaihuLight) at NSC, Wuxi, China
– 13th, 448,448 cores (Frontera) at TACC
– 26th, 288,288 cores (Lassen) at LLNL
– 38th, 570,020 cores (Nurion) in South Korea and many others
• Available with software stacks of many vendors and
Linux Distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 13th ranked TACC Frontera system
• Empowering Top500 systems for more than 16 years
21
Network Based Computing Laboratory Global AI (Dec ‘22)
MVAPICH2 Release Timeline and Downloads
[Chart: cumulative downloads from Sep 2004 to Mar 2021, growing to roughly 1.6 million, annotated with release milestones: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.2, MV2-X 2.3, MV2-Azure 2.3.2, MV2-AWS 2.3, MV2 2.3.6, MV2-GDR 2.3.6, OSU INAM 0.9.6]
22
Network Based Computing Laboratory Global AI (Dec ‘22)
Architecture of MVAPICH2 Software Family for HPC, DL/ML, and
Data Science
High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU)
Transport protocols (RC, SRD, UD, DC) and transport mechanisms (shared memory, CMA, IVSHMEM, XPMEM); modern features (UMR, ODP, SR-IOV, multi-rail, Optane*, NVLink, CAPI*)
* Upcoming
23
Network Based Computing Laboratory Global AI (Dec ‘22)
Optimized MVAPICH2-GDR with CUDA-Aware MPI Support
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size (1 byte - 8 KB), comparing MV2 (no GDR) with MV2-GDR 2.3; latency drops to 1.85 us, with improvements of roughly 9x-11x across the three metrics]
Platform: MVAPICH2-GDR 2.3, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct RDMA
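A hedged sketch of what CUDA-aware MPI means at the application level: with a CUDA-aware library such as MVAPICH2-GDR underneath mpi4py (assumed here, together with CuPy), GPU-resident buffers can be passed directly to MPI calls without manual device-to-host staging:

# Illustrative two-process exchange of GPU buffers; run with mpirun -np 2.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

sendbuf = cp.arange(1024, dtype=cp.float32)   # buffer lives in GPU memory
recvbuf = cp.empty_like(sendbuf)

peer = 1 - rank                               # the other rank in a 2-process job
comm.Sendrecv(sendbuf, dest=peer, recvbuf=recvbuf, source=peer)
cp.cuda.runtime.deviceSynchronize()           # ensure the GPU data is ready to use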
24
Network Based Computing Laboratory Global AI (Dec ‘22)
Enhanced DDT Support: HCA Assisted Inter-Node Scheme (UMR)
• Comparison of UMR based DDT scheme in MVAPICH2-GDR-Next with OpenMPI 4.1.3, MVAPICH2-GDR 2.3.6
• 1 GPU per Node, 2 Node experiment. Speed-up relative to OpenMPI
Platform: ThetaGPU (NVIDIA DGX-A100) (NVIDIA Ampere GPUs connected with NVSwitch), CUDA 11.0
[Charts: DDTBench-MILC (nested vector datatypes for 4-D face exchanges) and DDTBench-NASMGY (3-D face exchanges with vector and nested vector datatypes), showing speedups relative to OpenMPI for MVAPICH2-GDR-Next, MVAPICH2-GDR, and OpenMPI across several input-parameter sets; the UMR-based scheme in MVAPICH2-GDR-Next improves over MVAPICH2-GDR by about 32-35%]
K. Suresh, K. Khorassani, C. Chen, B. Ramesh, M. Abduljabbar, A. Shafi, D. Panda, Network Assisted Non-Contiguous Transfers for GPU-Aware
MPI Libraries, Hot Interconnects 29
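For readers unfamiliar with MPI derived datatypes (DDTs), the sketch below shows the basic idea the UMR scheme accelerates: describing a non-contiguous region (here, one column of a row-major array) with a vector datatype so the MPI library, not the application, handles the packing. This is a generic mpi4py illustration with made-up sizes, not the benchmark code above:

# Send one strided column of a 2-D array using an MPI vector datatype (mpi4py).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 8
grid = np.arange(n * n, dtype='d').reshape(n, n)

# n blocks of 1 element each, separated by a stride of n elements = one column.
column_t = MPI.DOUBLE.Create_vector(n, 1, n).Commit()

if rank == 0:
    comm.Send([grid, 1, column_t], dest=1, tag=7)    # library packs the column
elif rank == 1:
    comm.Recv([grid, 1, column_t], source=0, tag=7)  # and unpacks it on receipt
column_t.Free()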
25
Network Based Computing Laboratory Global AI (Dec ‘22)
• Weak-Scaling of HPC application AWP-ODC on Lassen cluster (V100 nodes)
• MPC-OPT achieves up to +18% GPU computing flops, -15% runtime per timestep
• ZFP-OPT achieves up to +35% GPU computing flops, -26% runtime per timestep
“On-the-fly” Compression Support in MVAPICH2-GDR
[Charts: GPU computing FLOPS (TFLOPS) and runtime per timestep (ms) on 8-512 GPUs for the no-compression baseline, MPC-OPT, and ZFP-OPT (rate 16 and rate 8), showing up to +35% FLOPS and -26% runtime]
Q. Zhou, C. Chu, N. Senthil Kumar, P. Kousha, M. Ghazimirsaeed, H. Subramoni, and D.K. Panda, Designing High-Performance MPI Libraries with On-the-fly Compression for
Modern GPU Clusters, 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 2021. [Best Paper Finalist]
26
Network Based Computing Laboratory Global AI (Dec ‘22)
MVAPICH Accelerates Parallel 3-D FFT at Oak Ridge
Accelerating the communication cost on parallel 3-D FFTs, Stan Tomov and Alan Ayala, The University of Tennessee, Knoxville
(http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/21/Ayala.pdf)
MVAPICH is around 10-20% faster than SpectrumMPI 10.3
for heFFTe Library
Comparison of achievable bandwidth for two-node exchange via MPI_Send
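For reference, a rough mpi4py sketch of the kind of two-process MPI_Send exchange behind such a bandwidth comparison; the message size and iteration count are arbitrary choices for this illustration:

# Simple ping-pong bandwidth estimate between rank 0 and rank 1.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nbytes = 4 * 1024 * 1024
buf = np.zeros(nbytes, dtype='b')
iters = 100

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send([buf, MPI.BYTE], dest=1)
        comm.Recv([buf, MPI.BYTE], source=1)
    else:
        comm.Recv([buf, MPI.BYTE], source=0)
        comm.Send([buf, MPI.BYTE], dest=0)
t1 = MPI.Wtime()
if rank == 0:
    print("approx. bandwidth (MB/s):", 2 * iters * nbytes / (t1 - t0) / 1e6)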
27
Network Based Computing Laboratory Global AI (Dec ‘22)
MVAPICH Drives Nuclear Energy Research at Idaho National Lab
(INL)
The MOOSE Multiphysics Computational Framework for Nuclear Power Applications: A Special Issue of Nuclear Technology
(https://www.tandfonline.com/doi/full/10.1080/00295450.2021.1915487)
MVAPICH Integration for PBS Pro, HPC Team, Idaho National Laboratory
(http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/21/inl.pdf)
28
Network Based Computing Laboratory Global AI (Dec ‘22)
Rapid adoption of MVAPICH2 on INL HPC systems
[Chart: MPI library usage (job counts, log scale from 1 to 1,000,000), 1 Jan 2021 - 1 Sep 2021, for mvapich2/2.3.5, mvapich2/2.3.3, openmpi/4.0.2, openmpi/4.0.5, and intelmpi]
M. Anderson, Aggressive Asynchronous Communication in the MOOSE framework using MVAPICH2, 10th Annual MVAPICH User
Group Conference (MUG), Aug 2022
29
Network Based Computing Laboratory Global AI (Dec ‘22)
• Near-Earth asteroids (NEAs) have caused recent and ancient global catastrophes
– LLNL scientists research ways to prevent NEA impacts using methods known as asteroid deflection
– Joint NASA-LLNL research modelled various asteroid deflection methods (NASA's DART mission)
• MVAPICH2 lived at the core of the NASA DART mission effort and enabled scalability
– It underpinned the large-scale hydrodynamical and gravitational simulations, such as Spheral models, required to compute the impact
MVAPICH2 enabling NASA's life-changing DART mission
• https://twitter.com/NASA/status/1574539270987173903?s=20&t=u_4wIV9Cui2xyn9QLj286Q
• https://www.cbsnews.com/sanfrancisco/news/i-just-could-not-believe-it-livermore-team-celebrates-nasas-historic-strike-on-distant-asteroid/
• http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/18/moody-mug-18.pdf
30
Network Based Computing Laboratory Global AI (Dec ‘22)
• LLNL’s National Ignition Facility (NIF) conducted the first
controlled fusion experiment in history![1]
• MVAPICH, being the default MPI library on the LLNL
systems, has been enabling the thousands of simulation
jobs that have led to this amazing achievement!
• [1] https://www.llnl.gov/news/national-ignition-facility-achieves-fusion-ignition
MVAPICH2 enabling Nuclear Fusion Research
The target chamber of LLNL’s National Ignition Facility, where 192
laser beams delivered more than 2 million joules of ultraviolet
energy to a tiny fuel pellet to create fusion ignition on Dec. 5, 2022.
The hohlraum that houses the type of cryogenic target used
to achieve ignition on Dec. 5, 2022, at LLNL’s National
Ignition Facility.
To create fusion ignition, the National Ignition Facility’s laser energy is
converted into X-rays inside the hohlraum, which then compress a fuel
capsule until it implodes, creating a high temperature, high pressure plasma.
31
Network Based Computing Laboratory Global AI (Dec ‘22)
• MVAPICH Project
– MPI Library with CUDA-Awareness (GPU)
– Accelerating applications with DPU
• HiDL Project
– High-Performance Deep Learning
– High-Performance Machine Learning
• HiBD Project
– Accelerating Big Data and Data Science Applications
• Conclusions
Presentation Overview
32
Network Based Computing Laboratory Global AI (Dec ‘22)
• Scale-up: Intra-node Communication
– Many improvements like:
• NVIDIA cuDNN, cuBLAS, NCCL, etc.
• CUDA Co-operative Groups
• Scale-out: Inter-node Communication
– DL Frameworks – most are optimized for
single-node only
– Distributed (Parallel) Training is an emerging
trend
• PyTorch – MPI/NCCL2
• TensorFlow – gRPC-based/MPI/NCCL2
• OSU-Caffe – MPI-based
Scale-up and Scale-out
[Diagram: scale-up performance vs. scale-out performance, positioning cuDNN, MKL-DNN, NCCL2, gRPC, Hadoop, and MPI relative to the desired combination of both]
33
Network Based Computing Laboratory Global AI (Dec ‘22)
Converged Software Stacks for HPC, Deep/Machine Learning, and Data Science
[Diagram: Deep/Machine Learning (TensorFlow, PyTorch, cuML, etc.), Big Data (Hadoop, Spark) / Data Science (Dask), and HPC (MPI, PGAS, etc.) on a converged software stack]
34
Network Based Computing Laboratory Global AI (Dec ‘22)
MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training
[Diagram: ML/DL applications running on TensorFlow, PyTorch, and MXNet via Horovod, or on PyTorch and DeepSpeed via torch.distributed, with MVAPICH2 or MVAPICH2-X for CPU training and MVAPICH2-GDR for GPU training underneath]
More details available from: http://hidl.cse.ohio-state.edu
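A hedged sketch of the Horovod path in the stack above: data-parallel PyTorch training where the gradient allreduce runs over the underlying MPI library (e.g., MVAPICH2-GDR) when launched with mpirun. The model and data are placeholders for this illustration:

# Data-parallel training with Horovod; one GPU per MPI rank.
import torch
import horovod.torch as hvd

hvd.init()                                   # uses the MPI runtime underneath
torch.cuda.set_device(hvd.local_rank())      # bind this rank to one local GPU

model = torch.nn.Linear(128, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are allreduced across ranks each step.
opt = hvd.DistributedOptimizer(opt, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

x = torch.randn(64, 128).cuda()
y = torch.randint(0, 10, (64,)).cuda()
for step in range(10):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()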
35
Network Based Computing Laboratory Global AI (Dec ‘22)
Distributed TensorFlow on ORNL Summit (1,536 GPUs)
• ResNet-50 Training using
TensorFlow benchmark on
SUMMIT -- 1536 Volta
GPUs!
• 1,281,167 (1.2 mil.) images
• Time/epoch = 3 seconds
• Total Time (90 epochs)
= 3 x 90 = 270 seconds =
4.5 minutes!
[Chart: images per second (thousands) vs. number of GPUs (1-1536) for MVAPICH2-GDR 2.3.4]
Platform: The Summit Supercomputer (#2 on Top500.org) - 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1
*We observed issues for NCCL2 beyond 384 GPUs
MVAPICH2-GDR reaches ~0.42 million images per second for ImageNet-1k! ImageNet-1k has 1.2 million images
36
Network Based Computing Laboratory Global AI (Dec ‘22)
PyTorch at Scale: Training ResNet-50 on 256 V100 GPUs
Distributed Framework | Communication Backend | Images/sec on 256 GPUs
Torch.distributed | NCCL 2.7 | 61,794
Torch.distributed | MVAPICH2-GDR | 72,120
Horovod | NCCL 2.7 | 74,063
Horovod | MVAPICH2-GDR | 84,659
DeepSpeed | NCCL 2.7 | 80,217
DeepSpeed | MVAPICH2-GDR | 88,873
• Training performance for 256 V100 GPUs on LLNL Lassen
– ~10,000 Images/sec faster than NCCL training!
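A hedged sketch of the torch.distributed path from the table: DistributedDataParallel initialized with the MPI backend (this assumes a PyTorch build with MPI support, with a library such as MVAPICH2-GDR supplying the MPI). Model and data are placeholders:

# DistributedDataParallel over the MPI backend; launch with mpirun.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="mpi")        # ranks/world size come from mpirun
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

model = torch.nn.Linear(128, 10).cuda()
ddp_model = DDP(model)                        # gradients allreduced automatically

opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
x = torch.randn(64, 128).cuda()
y = torch.randint(0, 10, (64,)).cuda()
for step in range(10):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
    loss.backward()
    opt.step()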
37
Network Based Computing Laboratory Global AI (Dec ‘22)
• Pathology whole slide image (WSI)
– Each WSI = 100,000 x 100,000 pixels
– Cannot fit in a single GPU's memory
– Tiles are extracted to make training possible
• Two main problems with tiles
– Restricted tile size because of GPU memory limitations
– Smaller tiles lose structural information
• Reduced training time significantly
– GEMS-Basic: 7.25 hours (1 node, 4 GPUs)
– GEMS-MAST: 6.28 hours (1 node, 4 GPUs)
– GEMS-MASTER: 4.21 hours (1 node, 4 GPUs)
– GEMS-Hybrid: 0.46 hours (32 nodes, 128 GPUs)
– Overall 15x reduction in training time!
Exploiting Model Parallelism in AI-Driven Digital Pathology
Courtesy: https://blog.kitware.com/digital-slide-
archive-large-image-and-histomicstk-open-source-
informatics-tools-for-management-visualization-and-
analysis-of-digital-histopathology-data/
Scaling ResNet110 v2 on 1024×1024 image tiles
using histopathology data
A. Jain, A. Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, D. K. Panda, R. Machiraju, and A. Parwani,
“GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training”, Supercomputing
(SC ‘20).
[Chart: throughput speedup (images per second) vs. number of GPUs (4-128): 1x, 1.9x, 3.6x, 7x, 12x, 22x]
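To illustrate the model-parallelism idea (not the GEMS implementation itself), a minimal two-GPU split in PyTorch where different layers live on different GPUs and activations move between them; the layer sizes and input shape are arbitrary:

# Toy model-parallel sketch: part1 on GPU 0, part2 on GPU 1.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Flatten(), nn.Linear(16 * 64 * 64, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))    # activations are copied between GPUs

model = TwoGPUModel()
out = model(torch.randn(2, 3, 64, 64))       # tiny stand-in for a large image tile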
38
Network Based Computing Laboratory Global AI (Dec ‘22)
• MVAPICH Project
– MPI Library with CUDA-Awareness (GPU)
– Accelerating applications with DPU
• HiDL Project
– High-Performance Deep Learning
– High-Performance Machine Learning
• HiBD Project
– Accelerating Big Data and Data Science Applications
• Conclusions
Presentation Overview
39
Network Based Computing Laboratory Global AI (Dec ‘22)
Converged Software Stacks for HPC, Deep/Machine Learning, and Data Science
[Diagram: Deep/Machine Learning (TensorFlow, PyTorch, cuML, etc.), Big Data (Hadoop, Spark) / Data Science (Dask), and HPC (MPI, PGAS, etc.) on a converged software stack]
40
Network Based Computing Laboratory Global AI (Dec ‘22)
• Since 2013
• RDMA for Apache Spark
• RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Kafka
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 340 organizations from 36 countries
• More than 44,000 downloads from the project site
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE
Also run on Ethernet
Available for x86 and OpenPOWER
Support for Singularity and Docker
41
Network Based Computing Laboratory Global AI (Dec ‘22)
• InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node.
– 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps)
RDMA-Spark on SDSC Comet – HiBench PageRank
32 Worker Nodes, 768 cores, PageRank Total Time 64 Worker Nodes, 1536 cores, PageRank Total Time
[Charts: HiBench PageRank total time (sec) for Huge, BigData, and Gigantic data sizes, IPoIB vs. RDMA: 37% reduction with 32 worker nodes/768 cores and 43% with 64 worker nodes/1536 cores]
42
Network Based Computing Laboratory Global AI (Dec ‘22)
• The main motivation of this work is to utilize the
communication functionality provided by
MVAPICH2 in the Apache Spark framework
• MPI4Spark relies on Java bindings of the
MVAPICH2 library
• Spark’s default ShuffleManager relies on Netty for
communication:
– Netty is a Java New I/O (NIO) client/server
framework for event-based networking applications
– The key idea is to utilize MPI-based point-to-point
communication inside Netty
MPI4Spark: Using MVAPICH2 to Optimize Apache Spark
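Because MPI4Spark swaps the transport underneath Spark's ShuffleManager, user-level Spark code is unchanged; a hedged PySpark sketch of a shuffle-heavy job that would simply run on top of it (application name and data are placeholders):

# Ordinary PySpark groupBy job; nothing here is MPI4Spark-specific.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("GroupByExample").getOrCreate()
rdd = spark.sparkContext.parallelize([(i % 100, i) for i in range(1000000)])
counts = rdd.groupByKey().mapValues(lambda vals: sum(1 for _ in vals)).collect()  # triggers a shuffle
spark.stop()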
43
Network Based Computing Laboratory Global AI (Dec ‘22)
MPI4Spark: Relative Speedups to Vanilla Spark and RDMA-
Spark on Three HPC Systems
System Name | Nodes Used | Processor | Cores Used | Sockets | Cores/socket | RAM | Interconnect
TACC Frontera | 34 | Xeon Platinum | 1792 | 2 | 28 | 192 GB | HDR (100G)
RI2 (OSU System) | 14 | Xeon Broadwell | 336 | 2 | 14 | 128 GB | EDR (100G)
MRI (OSU System) | 12 | AMD EPYC 7713 | 1280 | 2 | 64 | 264 GB | 200 Gb/sec (4X HDR)
[Charts: OHB GroupByTest (3.65x and 1.88x) and OHB SortByTest (3.52x and 1.86x) speedups for MPI4Spark relative to Vanilla Spark and RDMA-Spark, respectively]
44
Network Based Computing Laboratory Global AI (Dec ‘22)
Dask Architecture
[Diagram: Dask stack (Dask Bag, Dask Array, Dask DataFrame, Delayed, Future, Task Graph) on the Distributed layer (Scheduler, Worker, Client) with deployment options Dask-MPI, Dask-CUDA, and Dask-Jobqueue; the Comm Layer offers tcp.py (TCP), ucx.py (UCX via UCX-Py Cython wrappers), and MPI4Dask (mpi4py over MVAPICH2-GDR), running on laptops/desktops or high-performance computing hardware]
45
Network Based Computing Laboratory Global AI (Dec ‘22)
• MPI4Dask 0.2 was released in Mar '21, adding support for high-performance MPI communication to Dask:
– Can be downloaded from: http://hibd.cse.ohio-state.edu
• Features:
– Based on Dask Distributed 2021.01.0
– Compliant with user-level Dask APIs and packages
– Support for MPI-based communication in Dask for clusters of GPUs
– Implements point-to-point communication co-routines
– Efficient chunking mechanism implemented for large messages
– (NEW) Built on top of mpi4py over the MVAPICH2, MVAPICH2-X, and MVAPICH2-GDR libraries
– (NEW) Support for MPI-based communication for CPU-based Dask applications
– Supports starting execution of Dask programs using Dask-MPI
– Tested with
• (NEW) CPU-based Dask applications using numPy and Pandas data frames
• (NEW) GPU-based Dask applications using cuPy and cuDF
• Mellanox InfiniBand adapters (FDR and EDR)
• Various multi-core platforms
• NVIDIA V100 and Quadro RTX 5000 GPUs
MPI4Dask Release
46
Network Based Computing Laboratory Global AI (Dec ‘22)
Benchmark #1: Sum of cuPy Array and its Transpose (RI2)
[Charts: total execution time (s) and communication time (s) vs. number of Dask workers (2-6) for IPoIB, UCX, and MPI4Dask; MPI4Dask is 3.47x better on average in total execution time and 6.92x better in communication time]
A. Shafi , J. Hashmi , H. Subramoni , and D. K. Panda, Efficient MPI-based
Communication for GPU-Accelerated Dask Applications,
https://arxiv.org/abs/2101.08878
MPI4Dask 0.2 release
(http://hibd.cse.ohio-state.edu)
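A hedged sketch of this benchmark pattern (the sum of a CuPy-backed Dask array and its transpose), started via Dask-MPI as described on the MPI4Dask slide; array and chunk sizes are illustrative only:

# Launch with mpirun; dask_mpi.initialize() turns the MPI ranks into
# a scheduler (rank 0), this client script (rank 1), and workers (remaining ranks).
import cupy as cp
import dask.array as da
from dask_mpi import initialize
from dask.distributed import Client

initialize()
client = Client()

rs = da.random.RandomState(RandomState=cp.random.RandomState)   # CuPy-backed random arrays
x = rs.random((20000, 20000), chunks=(2000, 2000))
result = (x + x.T).sum().compute()            # sum of the array and its transpose
print(result)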
47
Network Based Computing Laboratory Global AI (Dec ‘22)
• Solutions to many current and next generation problems are dependent on
the growth of HPC and AI
• Growth and success in AI is very much dependent on HPC
• Presented an overview of the associated opportunities and challenges to
make HPC and AI accessible to all
• Presented a set of solutions to address these challenges
Concluding Remarks
48
Network Based Computing Laboratory Global AI (Dec ‘22)
Funding Acknowledgments
Funding Support by
Equipment Support by
49
Network Based Computing Laboratory Global AI (Dec ‘22)
Acknowledgments to all the Heroes (Past/Current Students and Staff)
Current Students (Graduate)
– N. Alnaasan (Ph.D.)
– Q. Anthony (Ph.D.)
– C.-C. Chun (Ph.D.)
– N. Contini (Ph.D.)
– A. Jain (Ph.D.)
Past Students
– A. Awan (Ph.D.)
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– M. Bayatpour (Ph.D.)
– R. Biswas (M.S.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– S. Chakraborthy (Ph.D.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– C.-H. Chu (Ph.D.)
– D. Shankar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– N. Sarkauskas (B.S. and M.S)
– N. Senthil Kumar (M.S.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Srivastava (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
– J. Zhang (Ph.D.)
Past Research Scientists
– K. Hamidouche
– S. Sur
– X. Lu
Past Post-Docs
– D. Banerjee
– X. Besseron
– M. S. Ghazimeersaeed
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– J. Hashmi (Ph.D.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– M. Kedia (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– K. Raj (M.S.)
– R. Rajachandrasekar (Ph.D.)
– K. S. Khorassani (Ph.D.)
– P. Kousha (Ph.D.)
– B. Michalowicz (Ph.D.)
– B. Ramesh (Ph.D.)
– K. K. Suresh (Ph.D.)
– H.-W. Jin
– J. Lin
– M. Luo
Past Senior Research Associate
– J. Hashmi
Past Programmers
– A. Reifsteck
– D. Bureddy
– J. Perkins
– E. Mancini
– K. Manian
– S. Marcarelli
Current Software Engineers
– B. Seeds
– N. Pavuk
– N. Shineman
– M. Lieber
Past Research Specialist
– M. Arnold
– J. Smith
Current Research Scientists
– M. Abduljabbar
– A. Shafi
– A. H. Tu (Ph.D.)
– S. Xu (Ph.D.)
– Q. Zhou (Ph.D.)
– K. Al Attar (M.S.)
– L. Xu (Ph.D.)
– A. Ruhela
– J. Vienne
– H. Wang
Current Students (Undergrads)
– V. Shah
– T. Chen
Current Research Specialist
– R. Motlagh
Current Faculty
– H. Subramoni
– H. Ahn (Ph.D.)
– G. Kuncham (Ph.D.)
– R. Vaidya (Ph.D.)
– J. Yao (Ph.D.)
– M. Han (M.S.)
– A. Guptha (M.S.)
50
Network Based Computing Laboratory Global AI (Dec ‘22)
• Looking for Bright and Enthusiastic Personnel to join as
– PhD Students
– Post-Doctoral Researchers
– MPI Programmer/Software Engineer
– Spark/Big Data Programmer/Software Engineer
– Deep Learning, Machine Learning, and Cloud Programmer/Software Engineer
• If interested, please send an e-mail to panda@cse.ohio-state.edu
Multiple Positions Available in MVAPICH2, BigData and
DL/ML Projects
51
Network Based Computing Laboratory Global AI (Dec ‘22)
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
kandadisuresh.1@osu.edu
The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/
Follow us on
https://twitter.com/mvapich
Contenu connexe

Similaire à Designing High performance & Scalable Middleware for HPC

00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdf00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdfaminnezarat
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale SystemsDesigning HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systemsinside-BigData.com
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersIntel® Software
 
Scallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systemsScallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systemsGanesan Narayanasamy
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3mustafa sarac
 
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17Mark Goldstein
 
Panda scalable hpc_bestpractices_tue100418
Panda scalable hpc_bestpractices_tue100418Panda scalable hpc_bestpractices_tue100418
Panda scalable hpc_bestpractices_tue100418inside-BigData.com
 
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale SystemsDesigning Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systemsinside-BigData.com
 
CC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfCC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfHasanAfwaaz1
 
2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit Mumbai2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit MumbaiAnand Haridass
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
High-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale SystemsHigh-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale Systemsinside-BigData.com
 
Other distributed systems
Other distributed systemsOther distributed systems
Other distributed systemsSri Prasanna
 
Cluster Tutorial
Cluster TutorialCluster Tutorial
Cluster Tutorialcybercbm
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming DataGeoffrey Fox
 
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)Amazon Web Services
 
Ohio LinuxFest: Crash Course in Open Source Cloud Computing
Ohio LinuxFest:  Crash Course in Open Source Cloud ComputingOhio LinuxFest:  Crash Course in Open Source Cloud Computing
Ohio LinuxFest: Crash Course in Open Source Cloud ComputingMark Hinkle
 

Similaire à Designing High performance & Scalable Middleware for HPC (20)

00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdf00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdf
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
grid computing
grid computinggrid computing
grid computing
 
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale SystemsDesigning HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Scallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systemsScallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systems
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3
 
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
 
Panda scalable hpc_bestpractices_tue100418
Panda scalable hpc_bestpractices_tue100418Panda scalable hpc_bestpractices_tue100418
Panda scalable hpc_bestpractices_tue100418
 
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale SystemsDesigning Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
 
CC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfCC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdf
 
Presentation-1.ppt
Presentation-1.pptPresentation-1.ppt
Presentation-1.ppt
 
2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit Mumbai2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit Mumbai
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
High-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale SystemsHigh-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale Systems
 
Other distributed systems
Other distributed systemsOther distributed systems
Other distributed systems
 
Cluster Tutorial
Cluster TutorialCluster Tutorial
Cluster Tutorial
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming Data
 
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
 
Ohio LinuxFest: Crash Course in Open Source Cloud Computing
Ohio LinuxFest:  Crash Course in Open Source Cloud ComputingOhio LinuxFest:  Crash Course in Open Source Cloud Computing
Ohio LinuxFest: Crash Course in Open Source Cloud Computing
 

Plus de Object Automation

RTL DESIGN IN ML WORLD_OBJECT AUTOMATION Inc
RTL DESIGN IN ML WORLD_OBJECT AUTOMATION IncRTL DESIGN IN ML WORLD_OBJECT AUTOMATION Inc
RTL DESIGN IN ML WORLD_OBJECT AUTOMATION IncObject Automation
 
CHIPS Alliance_Object Automation Inc_workshop
CHIPS Alliance_Object Automation Inc_workshopCHIPS Alliance_Object Automation Inc_workshop
CHIPS Alliance_Object Automation Inc_workshopObject Automation
 
RTL Design Methodologies_Object Automation Inc
RTL Design Methodologies_Object Automation IncRTL Design Methodologies_Object Automation Inc
RTL Design Methodologies_Object Automation IncObject Automation
 
High-Level Synthesis for the Design of AI Chips
High-Level Synthesis for the Design of AI ChipsHigh-Level Synthesis for the Design of AI Chips
High-Level Synthesis for the Design of AI ChipsObject Automation
 
AI-Inspired IOT Chiplets and 3D Heterogeneous Integration
AI-Inspired IOT Chiplets and 3D Heterogeneous IntegrationAI-Inspired IOT Chiplets and 3D Heterogeneous Integration
AI-Inspired IOT Chiplets and 3D Heterogeneous IntegrationObject Automation
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncObject Automation
 
CDAC presentation as part of Global AI Festival and Future
CDAC presentation as part of Global AI Festival and FutureCDAC presentation as part of Global AI Festival and Future
CDAC presentation as part of Global AI Festival and FutureObject Automation
 
Global AI Festivla and Future one day event
Global AI Festivla and Future one day eventGlobal AI Festivla and Future one day event
Global AI Festivla and Future one day eventObject Automation
 
Generative AI In Logistics_Object Automation
Generative AI In Logistics_Object AutomationGenerative AI In Logistics_Object Automation
Generative AI In Logistics_Object AutomationObject Automation
 
Gen AI_Object Automation_TechnologyWorkshop
Gen AI_Object Automation_TechnologyWorkshopGen AI_Object Automation_TechnologyWorkshop
Gen AI_Object Automation_TechnologyWorkshopObject Automation
 
Deploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdfDeploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdfObject Automation
 
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdf
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdfAI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdf
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdfObject Automation
 
5G Edge Computing_Object Automation workshop
5G Edge Computing_Object Automation workshop5G Edge Computing_Object Automation workshop
5G Edge Computing_Object Automation workshopObject Automation
 
Course_Object Automation.pdf
Course_Object Automation.pdfCourse_Object Automation.pdf
Course_Object Automation.pdfObject Automation
 

Plus de Object Automation (20)

RTL DESIGN IN ML WORLD_OBJECT AUTOMATION Inc
RTL DESIGN IN ML WORLD_OBJECT AUTOMATION IncRTL DESIGN IN ML WORLD_OBJECT AUTOMATION Inc
RTL DESIGN IN ML WORLD_OBJECT AUTOMATION Inc
 
CHIPS Alliance_Object Automation Inc_workshop
CHIPS Alliance_Object Automation Inc_workshopCHIPS Alliance_Object Automation Inc_workshop
CHIPS Alliance_Object Automation Inc_workshop
 
RTL Design Methodologies_Object Automation Inc
RTL Design Methodologies_Object Automation IncRTL Design Methodologies_Object Automation Inc
RTL Design Methodologies_Object Automation Inc
 
High-Level Synthesis for the Design of AI Chips
High-Level Synthesis for the Design of AI ChipsHigh-Level Synthesis for the Design of AI Chips
High-Level Synthesis for the Design of AI Chips
 
AI-Inspired IOT Chiplets and 3D Heterogeneous Integration
AI-Inspired IOT Chiplets and 3D Heterogeneous IntegrationAI-Inspired IOT Chiplets and 3D Heterogeneous Integration
AI-Inspired IOT Chiplets and 3D Heterogeneous Integration
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation Inc
 
CDAC presentation as part of Global AI Festival and Future
CDAC presentation as part of Global AI Festival and FutureCDAC presentation as part of Global AI Festival and Future
CDAC presentation as part of Global AI Festival and Future
 
Global AI Festivla and Future one day event
Global AI Festivla and Future one day eventGlobal AI Festivla and Future one day event
Global AI Festivla and Future one day event
 
Generative AI In Logistics_Object Automation
Generative AI In Logistics_Object AutomationGenerative AI In Logistics_Object Automation
Generative AI In Logistics_Object Automation
 
Gen AI_Object Automation_TechnologyWorkshop
Gen AI_Object Automation_TechnologyWorkshopGen AI_Object Automation_TechnologyWorkshop
Gen AI_Object Automation_TechnologyWorkshop
 
Deploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdfDeploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdf
 
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdf
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdfAI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdf
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdf
 
5G Edge Computing_Object Automation workshop
5G Edge Computing_Object Automation workshop5G Edge Computing_Object Automation workshop
5G Edge Computing_Object Automation workshop
 
COE AI Lab Universities
COE AI Lab UniversitiesCOE AI Lab Universities
COE AI Lab Universities
 
Bootcamp_AIApps.pdf
Bootcamp_AIApps.pdfBootcamp_AIApps.pdf
Bootcamp_AIApps.pdf
 
Bootcamp_AIApps.pdf
Bootcamp_AIApps.pdfBootcamp_AIApps.pdf
Bootcamp_AIApps.pdf
 
Bootcamp_AIAppsUCSD.pptx
Bootcamp_AIAppsUCSD.pptxBootcamp_AIAppsUCSD.pptx
Bootcamp_AIAppsUCSD.pptx
 
Course_Object Automation.pdf
Course_Object Automation.pdfCourse_Object Automation.pdf
Course_Object Automation.pdf
 
Enterprise AI_New.pdf
Enterprise AI_New.pdfEnterprise AI_New.pdf
Enterprise AI_New.pdf
 
Super AI tools
Super AI toolsSuper AI tools
Super AI tools
 

Dernier

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 

Dernier (20)

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Designing High performance & Scalable Middleware for HPC

  • 1. 1 Network Based Computing Laboratory Global AI (Dec ‘22)
  • 2. Designing High-Performance and Scalable Middleware for HPC, AI and Data Science Kaushik Kandadi Suresh The Ohio State University E-mail: kandadisuresh.1@osu.edu http://www.cse.ohio-state.edu/~panda A Talk at Global AI Event (December ‘22) by Follow us on https://twitter.com/mvapich
  • 3. 3 Network Based Computing Laboratory Global AI (Dec ‘22) Introduction to HPC, MPI, RDMA • High Performance Computing (HPC): – utilization of computing power to process data and operations at high speeds – Used for solving compute intensive problems on multiple nodes • Communication: – Certain computational problems when decomposed on multiple nodes/machines requires data exchange across nodes/processors – MPI is a parallel programming model that provides communication primitives for parallel programs – RDMA enables directly access of remote node’s memory without CPU involvement • Improve communication latency • Provided by Interconnects such as InfiniBand, ROCE, Slinghshot
  • 4. 4 Network Based Computing Laboratory Global AI (Dec ‘22) 4 Bigger Challenge: Blood Flow in Human Vascular Network • Cardiovascular disease accounts for about 50% of deaths in western world; • Formation of arterial disease strongly correlated to blood flow patterns; Computational challenges: Enormous problem size In one minute, the heart pumps the entire blood supply of 5 quarts through 60,000 miles of vessels, that is a quarter of the distance between the moon and the earth Blood flow involves multiple scales Courtesy: G. Em Karniadakis & L. Grinberg
  • 5. 5 Network Based Computing Laboratory Global AI (Dec ‘22) Bigger Challenge: Earthquake and Flu/COVID Pandemic Simulation Earthquake simulation Surface velocity 75 sec after earthquake Flu pandemic simulation 300 million people tracked Density of infected population, 45 days after breakout Courtesy: G. Em Karniadakis & L. Grinberg
  • 6. 6 Network Based Computing Laboratory Global AI (Dec ‘22) Big Velocity – How Much Data Is Generated Every Minute on the Internet? The global Internet population grew 10% (in July 2021) from Jan 2021 and now represents 5.17 Billion People. Courtesy: https://www.domo.com/blog/data-never-sleeps-9/
  • 7. 7 Network Based Computing Laboratory Global AI (Dec ‘22) AI, Machine Learning and Deep Learning? Courtesy: https://hackernoon.com/difference-between-artificial-intelligence-machine-learning- and-deep-learning-1pcv3zeg, https://blog.dataiku.com/ai-vs.-machine-learning-vs.-deep-learning, https://en.wikipedia.org/wiki/Machine_learning • Machine Learning (ML) – “the study of computer algorithms to improve automatically through experience and use of data” • Deep Learning (DL) – a subset of ML – Uses Deep Neural Networks (DNNs) – Perhaps, the most revolutionary subset! • Based on learning data representation • DNN Examples: Convolutional Neural Networks, Recurrent Neural Networks, Hybrid Networks • Data Scientist or Developer Perspective for using DNNs 1. Identify DL as solution to a problem 2. Determine Data Set 3. Select Deep Learning Algorithm to Use 4. Use a large data set to train an algorithm
  • 8. 8 Network Based Computing Laboratory Global AI (Dec ‘22) Credit Card Fraud Detection using Machine Learning Courtesy: https://spd.group/machine-learning/fraud-detection-with-machine-learning https://www.sas.com/en_us/insights/articles/risk-fraud/fraud-detection-machine-learning.html … almost $112 million due to credit card fraud in 2019.
  • 9. 9 Network Based Computing Laboratory Global AI (Dec ‘22) The Impact of Deep Learning on Application Areas Courtesy: https://github.com/alexjc/neural-doodle Courtesy: https://arxiv.org/pdf/1808.02334.pdf Courtesy: https://research.googleblog.com/2015/07/how-google-translate-squeezes-deep.html Courtesy: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8065136
  • 10. 10 Network Based Computing Laboratory Global AI (Dec ‘22) Self Driving Cars Courtesy: http://www.teslarati.com/teslas-full-self-driving-capability-arrive-3-months-definitely-6-months-says-musk/
  • 11. 11 Network Based Computing Laboratory Global AI (Dec ‘22) • Applications – Prostate Cancer Detection – Metastasis Detection in Breast Cancer – Genetic Mutation Prediction – Tumor Detection for Molecular Analysis AI-Driven Digital Pathology Courtesy: https://www.frontiersin.org/articles/10.3389/fmed.2019.00185/full
  • 12. 12 Network Based Computing Laboratory Global AI (Dec ‘22) Artificial Intelligence Use Cases and Growth Trends Courtesy: https://www.top500.org/news/market-for-artificial-intelligence-projected-to-hit-36-billion-by-2025/
  • 13. 13 Network Based Computing Laboratory Global AI (Dec ‘22) High-End Computing (HEC): PetaFlop to ExaFlop – 100 PetaFlops in 2017; 442 PetaFlops in 2020 (Fugaku in Japan with 7.63M cores); 1.1 ExaFlops (HPL) and 6.88 ExaFlops (HPL-AI) in 2022 (Frontier at ORNL with 8.73M cores)
  • 14. 14 Network Based Computing Laboratory Global AI (Dec ‘22) Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org) – chart of the number and percentage of clusters in the Top500 over time; clusters now make up 98.4% of the list
  • 15. 15 Network Based Computing Laboratory Global AI (Dec ‘22) Drivers of Modern HPC Cluster Architectures • Multi-core/many-core technologies • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand, RoCE, Slingshot) • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD • Accelerators (NVIDIA GPGPUs) • Available on HPC clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc. – Accelerators: high compute density, high performance/watt, >9.7 TFlop DP on a chip – High-performance interconnects (InfiniBand, Slingshot): <1 usec latency, 200-400 Gbps bandwidth – Multi-/many-core processors – SSD, NVMe-SSD, NVRAM – Example systems: Frontier, Summit, Lumi, Fugaku
  • 16. 16 Network Based Computing Laboratory Global AI (Dec ‘22) Deep/ Machine Learning (TensorFlow, PyTorch, cuML, etc.) Big Data (Hadoop, Spark), Data Science (Dask) HPC (MPI, PGAS, etc.) Increasing Usage of HPC, Deep/Machine Learning, and Data Science Convergence of HPC, Deep/Machine Learning, and Data Science! Increasing Need to Run these applications on the Cloud!! Can MPI-driven Converged Middleware be designed and used for all three domains?
  • 17. 17 Network Based Computing Laboratory Global AI (Dec ‘22) Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges Programming Models MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Hadoop, Spark (RDD, DAG), TensorFlow, PyTorch, etc. Application Kernels/Applications (HPC, DL, Data Science) Networking Technologies (InfiniBand, Ethernet, RoCE, Omni-Path, and Slingshot) Multi-/Many-core Architectures Accelerators (GPU and FPGA) Middleware Co-Design Opportunities and Challenges across Various Layers Performance Scalability Resilience Communication Library or Runtime for Programming Models Point-to-point Communication Collective Communication Energy- Awareness Synchronization and Locks I/O and File Systems Fault Tolerance
  • 18. 18 Network Based Computing Laboratory Global AI (Dec ‘22) • MVAPICH Project – MPI Library with CUDA-Awareness (GPU) – Accelerating applications with DPU • HiDL Project – High-Performance Deep Learning – High-Performance Machine Learning • HiBD Project – Accelerating Big Data and Data Science Applications • Conclusions Presentation Overview
  • 19. 19 Network Based Computing Laboratory Global AI (Dec ‘22) Deep/ Machine Learning (TensorFlow, PyTorch, cuML, etc.) Big Data (Hadoop, Spark), Data Science (Dask) HPC (MPI, PGAS, etc.) Converged Software Stacks for HPC, Deep/Machine Learning, and Data Science
  • 20. 20 Network Based Computing Laboratory Global AI (Dec ‘22) Overview of the MVAPICH2 Project • High-performance open-source MPI library • Support for multiple interconnects – InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), AWS EFA, Rockport Networks, and Slingshot • Support for multiple platforms – x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs (NVIDIA and AMD) • Started in 2001, first open-source version demonstrated at SC ‘02 • Supports the latest MPI-3.1 standard • http://mvapich.cse.ohio-state.edu • Additional optimized versions for different systems/environments: – MVAPICH2-X (Advanced MPI + PGAS), since 2011 – MVAPICH2-GDR with support for NVIDIA (since 2014) and AMD (since 2020) GPUs – MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014 – MVAPICH2-Virt with virtualization support, since 2015 – MVAPICH2-EA with support for Energy-Awareness, since 2015 – MVAPICH2-Azure for Azure HPC IB instances, since 2019 – MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019 • Tools: – OSU MPI Micro-Benchmarks (OMB), since 2003 – OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015 • Used by more than 3,200 organizations in 89 countries • More than 1.56 million downloads from the OSU site directly • Empowering many TOP500 clusters (Nov ‘21 ranking) – 4th: 10,649,600-core Sunway TaihuLight at NSC, Wuxi, China – 13th: 448,448 cores (Frontera) at TACC – 26th: 288,288 cores (Lassen) at LLNL – 38th: 570,020 cores (Nurion) in South Korea, and many others • Available with software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack) • Partner in the 13th-ranked TACC Frontera system • Empowering Top500 systems for more than 16 years
  • 21. 21 Network Based Computing Laboratory Global AI (Dec ‘22) MVAPICH2 Release Timeline and Downloads – chart of cumulative downloads (y-axis: 0 to 1.6 million; x-axis: Sep 2004 to Mar 2021), annotated with release milestones: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3.6, MV2-X 2.3, MV2-Virt 2.2, MV2 2.3.6, OSU INAM 0.9.6, MV2-Azure 2.3.2, and MV2-AWS 2.3
  • 22. 22 Network Based Computing Laboratory Global AI (Dec ‘22) Architecture of MVAPICH2 Software Family for HPC, DL/ML, and Data Science • High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk) • High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis • Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU) • Transport protocols (RC, SRD, UD, DC) and modern features (UMR, ODP, SR-IOV, multi-rail) • Transport mechanisms (shared memory, CMA, IVSHMEM, XPMEM) and modern features (Optane*, NVLink, CAPI*; * upcoming)
  • 23. 23 Network Based Computing Laboratory Global AI (Dec ‘22) Optimized MVAPICH2-GDR with CUDA-Aware MPI Support – charts of GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth (message sizes from 1 Byte to 8 KB) comparing MV2 (no GDR) with MV2-GDR 2.3: ~1.85 us latency (up to 11x better) and roughly 9x-10x higher uni- and bi-directional bandwidth. Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA
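As a hedged illustration of what CUDA-aware MPI means for application code (an assumption-based sketch, not taken from the slides): with a CUDA-aware build such as MVAPICH2-GDR, mpi4py can accept GPU-resident buffers (e.g., CuPy arrays) directly in Send/Recv, avoiding explicit host staging.

```python
# GPU-to-GPU send/recv of a CuPy array via a CUDA-aware MPI (illustrative sketch).
# Requires mpi4py built on top of a CUDA-aware MPI library such as MVAPICH2-GDR.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

msg = cp.arange(1 << 20, dtype=cp.float32)   # 4 MB message resident on the GPU
if rank == 0:
    comm.Send(msg, dest=1, tag=11)           # device buffer handed straight to MPI
elif rank == 1:
    recv = cp.empty_like(msg)
    comm.Recv(recv, source=0, tag=11)
    cp.cuda.Device().synchronize()
    print("rank 1 received", recv[:4])
```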
  • 24. 24 Network Based Computing Laboratory Global AI (Dec ‘22) Enhanced DDT Support: HCA-Assisted Inter-Node Scheme (UMR) • Comparison of the UMR-based derived datatype (DDT) scheme in MVAPICH2-GDR-Next with OpenMPI 4.1.3 and MVAPICH2-GDR 2.3.6 • 1 GPU per node, 2-node experiment; speedup shown relative to OpenMPI • DDTBench-MILC uses a nested vector datatype for 4D face exchanges; DDTBench-NASMGY performs 3D face exchanges with vector and nested vector datatypes • Charts across several input parameter sets show improvements of 35% and 32% over MVAPICH2-GDR 2.3.6 • Platform: ThetaGPU (NVIDIA DGX-A100, NVIDIA Ampere GPUs connected with NVSwitch), CUDA 11.0 • K. Suresh, K. Khorassani, C. Chen, B. Ramesh, M. Abduljabbar, A. Shafi, D. Panda, Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries, Hot Interconnects 29
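The "face exchanges" above are expressed with MPI derived datatypes (DDTs); the UMR design offloads their packing to the HCA. Purely as an illustration of the datatype concept (not of the MVAPICH2-GDR-Next implementation), here is a minimal mpi4py sketch of a strided vector datatype.

```python
# Sending one column (a strided, non-contiguous "face") of a 2-D array
# using an MPI vector derived datatype (illustrative sketch).
from mpi4py import MPI
import numpy as np

comm, rank = MPI.COMM_WORLD, MPI.COMM_WORLD.Get_rank()
n = 8
grid = np.arange(n * n, dtype='d').reshape(n, n)

# n blocks of 1 double, separated by a stride of n doubles = one column of the grid
column_t = MPI.DOUBLE.Create_vector(n, 1, n)
column_t.Commit()

if rank == 0:
    comm.Send([grid, 1, column_t], dest=1, tag=7)       # column sent in place, no packing
elif rank == 1:
    halo = np.zeros(n, dtype='d')
    comm.Recv([halo, n, MPI.DOUBLE], source=0, tag=7)   # received as a contiguous buffer
    print("rank 1 halo:", halo)

column_t.Free()
```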
  • 25. 25 Network Based Computing Laboratory Global AI (Dec ‘22) “On-the-fly” Compression Support in MVAPICH2-GDR • Weak scaling of the HPC application AWP-ODC on the Lassen cluster (V100 nodes), 8 to 512 GPUs, comparing a baseline (no compression) with MPC-OPT and ZFP-OPT (rate 16 and rate 8) • MPC-OPT achieves up to +18% GPU computing flops and -15% runtime per timestep • ZFP-OPT achieves up to +35% GPU computing flops and -26% runtime per timestep • Q. Zhou, C. Chu, N. Senthil Kumar, P. Kousha, M. Ghazimirsaeed, H. Subramoni, and D. K. Panda, Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters, 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 2021 [Best Paper Finalist]
  • 26. 26 Network Based Computing Laboratory Global AI (Dec ‘22) MVAPICH Accelerates Parallel 3-D FFT at Oak Ridge • Accelerating the communication cost on parallel 3-D FFTs, Stan Tomov and Alan Ayala, The University of Tennessee, Knoxville (http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/21/Ayala.pdf) • MVAPICH is around 10-20% faster than Spectrum MPI 10.3 for the heFFTe library • Chart: comparison of achievable bandwidth for a two-node exchange via MPI_Send
  • 27. 27 Network Based Computing Laboratory Global AI (Dec ‘22) MVAPICH Drives Nuclear Energy Research at Idaho National Lab (INL) The MOOSE Multiphysics Computational Framework for Nuclear Power Applications: A Special Issue of Nuclear Technology (https://www.tandfonline.com/doi/full/10.1080/00295450.2021.1915487) MVAPICH Integration for PBS Pro, HPC Team, Idaho National Laboratory (http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/21/inl.pdf)
  • 28. 28 Network Based Computing Laboratory Global AI (Dec ‘22) Rapid adoption of MVAPICH2 on INL HPC systems – chart of MPI library usage (log scale, 1 Jan 2021 to 1 Sep 2021) comparing mvapich2/2.3.5, mvapich2/2.3.3, openmpi/4.0.2, openmpi/4.0.5, and intelmpi • M. Anderson, Aggressive Asynchronous Communication in the MOOSE framework using MVAPICH2, 10th Annual MVAPICH User Group Conference (MUG), Aug 2022
  • 29. 29 Network Based Computing Laboratory Global AI (Dec ‘22) MVAPICH2 enabling life-changing NASA’s DART mission • Near-Earth asteroids (NEAs) have caused recent and ancient global catastrophes – LLNL scientists research ways to counter NEAs using methods known as asteroid deflection – Joint NASA-LLNL research modelled various asteroid deflection methods (NASA’s DART mission) • MVAPICH2 lived at the core of the NASA DART mission simulations and enabled scalability – underneath the large-scale hydrodynamical and gravitational simulations (such as Spheral models) required to compute the impact • https://twitter.com/NASA/status/1574539270987173903?s=20&t=u_4wIV9Cui2xyn9QLj286Q • https://www.cbsnews.com/sanfrancisco/news/i-just-could-not-believe-it-livermore-team-celebrates-nasas-historic-strike-on-distant-asteroid/ • http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/18/moody-mug-18.pdf
  • 30. 30 Network Based Computing Laboratory Global AI (Dec ‘22) • LLNL’s National Ignition Facility (NIF) conducted the first controlled fusion experiment in history![1] • MVAPICH, being the default MPI library on the LLNL systems, has been enabling the thousands of simulation jobs that have led to this amazing achievement! • [1] https://www.llnl.gov/news/national-ignition-facility-achieves-fusion-ignition MVAPICH2 enabling Nuclear Fusion Research The target chamber of LLNL’s National Ignition Facility, where 192 laser beams delivered more than 2 million joules of ultraviolet energy to a tiny fuel pellet to create fusion ignition on Dec. 5, 2022. The hohlraum that houses the type of cryogenic target used to achieve ignition on Dec. 5, 2022, at LLNL’s National Ignition Facility. To create fusion ignition, the National Ignition Facility’s laser energy is converted into X-rays inside the hohlraum, which then compress a fuel capsule until it implodes, creating a high temperature, high pressure plasma.
  • 31. 31 Network Based Computing Laboratory Global AI (Dec ‘22) • MVAPICH Project – MPI Library with CUDA-Awareness (GPU) – Accelerating applications with DPU • HiDL Project – High-Performance Deep Learning – High-Performance Machine Learning • HiBD Project – Accelerating Big Data and Data Science Applications • Conclusions Presentation Overview
  • 32. 32 Network Based Computing Laboratory Global AI (Dec ‘22) Scale-up and Scale-out • Scale-up: intra-node communication – many improvements, e.g.: NVIDIA cuDNN, cuBLAS, NCCL, etc.; CUDA Cooperative Groups • Scale-out: inter-node communication – DL frameworks – most are optimized for single-node only – Distributed (parallel) training is an emerging trend • PyTorch – MPI/NCCL2 • TensorFlow – gRPC-based/MPI/NCCL2 • OSU-Caffe – MPI-based • Figure: scale-up performance vs. scale-out performance, positioning cuDNN, MKL-DNN, NCCL2, gRPC, Hadoop, and MPI relative to the desired region (a data-parallel training sketch follows below)
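A minimal, assumption-laden sketch of the scale-out path with Horovod and PyTorch (illustrative only; the script name and toy model are made up). Launching it under mpirun lets an MPI library such as MVAPICH2-GDR drive the allreduce.

```python
# Data-parallel training skeleton with Horovod over MPI (illustrative sketch).
# Launch with:  mpirun -np <num_gpus> python train.py
import torch
import horovod.torch as hvd

hvd.init()                                       # one process per GPU
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across all ranks via allreduce
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for step in range(100):
    x = torch.randn(32, 1024, device='cuda')
    y = torch.randint(0, 10, (32,), device='cuda')
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()                              # gradient allreduce happens here
    optimizer.step()
```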
  • 33. 33 Network Based Computing Laboratory Global AI (Dec ‘22) Deep/ Machine Learning (TensorFlow, PyTorch, cuML, etc.) Big Data (Hadoop, Spark), Data Science (Dask) HPC (MPI, PGAS, etc.) Converged Software Stacks for HPC, Deep/Machine Learning, and Data Science
  • 34. 34 Network Based Computing Laboratory Global AI (Dec ‘22) MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training • ML/DL applications run over Horovod with TensorFlow, PyTorch, and MXNet, or over PyTorch and DeepSpeed with torch.distributed; in both cases MVAPICH2 or MVAPICH2-X is used for CPU training and MVAPICH2-GDR for GPU training • More details available from: http://hidl.cse.ohio-state.edu (an illustrative torch.distributed-over-MPI sketch follows below)
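A hedged sketch of the torch.distributed path shown above: PyTorch's process group can be initialized with the 'mpi' backend, provided PyTorch was built with MPI support (an assumption here), so its collectives map onto the underlying MPI library.

```python
# Initializing torch.distributed over an MPI backend (illustrative sketch;
# requires a PyTorch build with MPI support, launched under mpirun).
import torch
import torch.distributed as dist

dist.init_process_group(backend='mpi')          # rank/size come from the MPI runtime
rank, world = dist.get_rank(), dist.get_world_size()

t = torch.ones(4) * rank
dist.all_reduce(t, op=dist.ReduceOp.SUM)        # same collective DDP uses for gradients
print(f"rank {rank}/{world} allreduce result: {t.tolist()}")
```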
  • 35. 35 Network Based Computing Laboratory Global AI (Dec ‘22) Distributed TensorFlow on ORNL Summit (1,536 GPUs) • ResNet-50 training using the TensorFlow benchmark on Summit – 1,536 Volta GPUs • ImageNet-1k has 1,281,167 (1.2 million) images • Time/epoch = 3 seconds; total time (90 epochs) = 3 x 90 = 270 seconds = 4.5 minutes • Chart: images/second (in thousands) vs. number of GPUs (1 to 1,536) with MVAPICH2-GDR 2.3.4, reaching ~0.42 million images per second for ImageNet-1k • Platform: the Summit supercomputer (#2 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1 • *We observed issues for NCCL2 beyond 384 GPUs
  • 36. 36 Network Based Computing Laboratory Global AI (Dec ‘22) PyTorch at Scale: Training ResNet-50 on 256 V100 GPUs • Training performance (images/sec) on 256 GPUs on LLNL Lassen, NCCL 2.7 vs. MVAPICH2-GDR as the communication backend: Torch.distributed – 61,794 vs. 72,120; Horovod – 74,063 vs. 84,659; DeepSpeed – 80,217 vs. 88,873 • ~10,000 images/sec faster than NCCL training!
  • 37. 37 Network Based Computing Laboratory Global AI (Dec ‘22) Exploiting Model Parallelism in AI-Driven Digital Pathology • Pathology whole slide image (WSI) – each WSI = 100,000 x 100,000 pixels – cannot fit in a single GPU’s memory – tiles are extracted to make training possible • Two main problems with tiles – restricted tile size because of GPU memory limitations – smaller tiles lose structural information • Reduced training time significantly – GEMS-Basic: 7.25 hours (1 node, 4 GPUs) – GEMS-MAST: 6.28 hours (1 node, 4 GPUs) – GEMS-MASTER: 4.21 hours (1 node, 4 GPUs) – GEMS-Hybrid: 0.46 hours (32 nodes, 128 GPUs) – overall 15x reduction in training time! • Chart: throughput speedup scaling ResNet110 v2 on 1024x1024 image tiles using histopathology data – 1x, 1.9x, 3.6x, 7x, 12x, and 22x on 4, 8, 16, 32, 64, and 128 GPUs • A. Jain, A. Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, D. K. Panda, R. Machiraju, and A. Parwani, “GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training”, Supercomputing (SC ‘20) • Courtesy: https://blog.kitware.com/digital-slide-archive-large-image-and-histomicstk-open-source-informatics-tools-for-management-visualization-and-analysis-of-digital-histopathology-data/
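As a generic illustration of the model-parallel idea that GEMS builds on (splitting one large model across GPUs so that activations, rather than the whole model, move between devices), here is a toy PyTorch sketch; it is not the GEMS design and assumes a node with two GPUs.

```python
# Toy model parallelism: layers split across two GPUs, activations passed between them
# (generic illustration only; not the GEMS/MVAPICH2 implementation).
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 0 lives on GPU 0, stage 1 on GPU 1
        self.stage0 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).to('cuda:0')
        self.stage1 = nn.Sequential(nn.Flatten(), nn.Linear(64 * 32 * 32, 10)).to('cuda:1')

    def forward(self, x):
        x = self.stage0(x.to('cuda:0'))
        return self.stage1(x.to('cuda:1'))      # only the activation crosses GPUs

model = TwoGPUNet()
out = model(torch.randn(8, 3, 32, 32))
print(out.shape)                                # torch.Size([8, 10]) on cuda:1
```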
  • 38. 38 Network Based Computing Laboratory Global AI (Dec ‘22) • MVAPICH Project – MPI Library with CUDA-Awareness (GPU) – Accelerating applications with DPU • HiDL Project – High-Performance Deep Learning – High-Performance Machine Learning • HiBD Project – Accelerating Big Data and Data Science Applications • Conclusions Presentation Overview
  • 39. 39 Network Based Computing Laboratory Global AI (Dec ‘22) Deep/ Machine Learning (TensorFlow, PyTorch, cuML, etc.) Big Data (Hadoop, Spark), Data Science (Dask) HPC (MPI, PGAS, etc.) Converged Software Stacks for HPC, Deep/Machine Learning, and Data Science
  • 40. 40 Network Based Computing Laboratory Global AI (Dec ‘22) The High-Performance Big Data (HiBD) Project • Since 2013 • RDMA for Apache Spark • RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x) • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) – Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions • RDMA for Apache Kafka • RDMA for Apache HBase • RDMA for Memcached (RDMA-Memcached) • RDMA for Apache Hadoop 1.x (RDMA-Hadoop) • OSU HiBD-Benchmarks (OHB) – HDFS, Memcached, HBase, and Spark micro-benchmarks • http://hibd.cse.ohio-state.edu • User base: 340 organizations from 36 countries • More than 44,000 downloads from the project site • Available for InfiniBand and RoCE (also runs on Ethernet) • Available for x86 and OpenPOWER • Support for Singularity and Docker
  • 41. 41 Network Based Computing Laboratory Global AI (Dec ‘22) RDMA-Spark on SDSC Comet – HiBench PageRank • InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M, 768/1536R) • RDMA-based design for Spark 1.5.1 • RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps) – 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56 Gbps) • Charts: PageRank total time (sec) for Huge, BigData, and Gigantic data sizes, IPoIB vs. RDMA, on 32 and 64 worker nodes
  • 42. 42 Network Based Computing Laboratory Global AI (Dec ‘22) • The main motivation of this work is to utilize the communication functionality provided by MVAPICH2 in the Apache Spark framework • MPI4Spark relies on Java bindings of the MVAPICH2 library • Spark’s default ShuffleManager relies on Netty for communication: – Netty is a Java New I/O (NIO) client/server framework for event-based networking applications – The key idea is to utilize MPI-based point-to-point communication inside Netty MPI4Spark: Using MVAPICH2 to Optimize Apache Spark
  • 43. 43 Network Based Computing Laboratory Global AI (Dec ‘22) MPI4Spark: Relative Speedups to Vanilla Spark and RDMA-Spark on Three HPC Systems • Systems used – TACC Frontera: 34 nodes, Xeon Platinum, 1792 cores (2 sockets x 28 cores/socket), 192 GB RAM, HDR (100G) – RI2 (OSU system): 14 nodes, Xeon Broadwell, 336 cores (2 sockets x 14 cores/socket), 128 GB RAM, EDR (100G) – MRI (OSU system): 12 nodes, AMD EPYC 7713, 1280 cores (2 sockets x 64 cores/socket), 264 GB RAM, 200 Gb/sec (4X HDR) • Charts: OHB GroupByTest and OHB SortByTest, annotated with speedups of 3.65x, 1.88x, 3.52x, and 1.86x
  • 44. 44 Network Based Computing Laboratory Global AI (Dec ‘22) Dask Architecture • Dask collections (Bag, Array, DataFrame, Delayed, Future) build a task graph that is executed by the distributed scheduler, client, and workers • The communication layer offers tcp.py (TCP), ucx.py (UCX via the UCX-Py Cython wrappers), and MPI4Dask (mpi4py over MVAPICH2-GDR) • Deployment options include Dask-MPI, Dask-CUDA, and Dask-Jobqueue, on laptops/desktops or high-performance computing hardware
  • 45. 45 Network Based Computing Laboratory Global AI (Dec ‘22) • MPI4Dask 0.2 was released in Mar ‘21 adding support for high-performance MPI communication to Dask: – Can be downloaded from: http://hibd.cse.ohio-state.edu • Features: – Based on Dask Distributed 2021.01.0​ – Compliant with user-level Dask APIs and packages​ – Support for MPI-based communication in Dask for cluster of GPUs​ – Implements point-to-point communication co-routines​ – Efficient chunking mechanism implemented for large messages​ – (NEW) Built on top of mpi4py over the MVAPICH2, MVAPICH2-X, and MVAPICH2-GDR libraries​ – (NEW) Support for MPI-based communication for CPU-based Dask applications​ – Supports starting execution of Dask programs using Dask-MPI​ – Tested with​ • (NEW) CPU-based Dask applications using numPy and Pandas data frames • (NEW) GPU-based Dask applications using cuPy and cuDF​ • Mellanox InfiniBand adapters (FDR and EDR)​ • Various multi-core platforms​ • NVIDIA V100 and Quadro RTX 5000 GPUs​ MPI4Dask Release
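A minimal sketch of the "starting execution of Dask programs using Dask-MPI" feature listed above (illustrative; assumes the dask_mpi and distributed packages and an MPI launcher).

```python
# Launching a Dask application under MPI with Dask-MPI (illustrative sketch).
# Run with:  mpirun -np 5 python dask_app.py
# (rank 0 becomes the scheduler, rank 1 runs this client script, remaining ranks are workers)
from dask_mpi import initialize
from dask.distributed import Client
import dask.array as da

initialize()                 # turns the MPI ranks into a Dask cluster
client = Client()            # connects this rank (the client) to that cluster

x = da.random.random((20000, 20000), chunks=(2000, 2000))
print(x.mean().compute())    # task graph executed by the MPI-spawned workers
```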
  • 46. 46 Network Based Computing Laboratory Global AI (Dec ‘22) Benchmark #1: Sum of cuPy Array and its Transpose (RI2) • Charts of total execution time and communication time (seconds) vs. number of Dask workers (2 to 6) for IPoIB, UCX, and MPI4Dask • MPI4Dask is 3.47x better on average in total execution time and 6.92x better on average in communication time • A. Shafi, J. Hashmi, H. Subramoni, and D. K. Panda, Efficient MPI-based Communication for GPU-Accelerated Dask Applications, https://arxiv.org/abs/2101.08878 • MPI4Dask 0.2 release (http://hibd.cse.ohio-state.edu)
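For reference, this kind of "sum of a cuPy array and its transpose" workload can be written in a few lines of Dask; a hedged sketch using dask.array with CuPy-backed chunks, assuming a GPU-enabled Dask cluster and client have already been created (e.g., via Dask-CUDA or Dask-MPI) and that the array sizes are merely illustrative.

```python
# Sketch of a "sum of a cuPy array and its transpose" style workload
# (illustrative only; assumes a GPU-enabled Dask cluster/client already exists).
import cupy as cp
import dask.array as da

# CuPy-backed random chunks instead of NumPy-backed ones
rs = da.random.RandomState(RandomState=cp.random.RandomState)
x = rs.random_sample((40000, 40000), chunks=(4000, 4000))

result = (x + x.T).sum()     # communication-heavy: the transpose shuffles chunks across workers
print(result.compute())
```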
  • 47. 47 Network Based Computing Laboratory Global AI (Dec ‘22) • Solutions to many current and next generation problems are dependent on the growth of HPC and AI • Growth and success in AI is very much dependent on HPC • Presented an overview of the associated opportunities and challenges to make HPC and AI accessible to all • Presented a set of solutions to address these challenges Concluding Remarks
  • 48. 48 Network Based Computing Laboratory Global AI (Dec ‘22) Funding Acknowledgments Funding Support by Equipment Support by
  • 49. 49 Network Based Computing Laboratory Global AI (Dec ‘22) Acknowledgments to all the Heroes (Past/Current Students and Staffs) Current Students (Graduate) – N. Alnaasan (Ph.D.) – Q. Anthony (Ph.D.) – C.-C. Chun (Ph.D.) – N. Contini (Ph.D.) – A. Jain (Ph.D.) Past Students – A. Awan (Ph.D.) – A. Augustine (M.S.) – P. Balaji (Ph.D.) – M. Bayatpour (Ph.D.) – R. Biswas (M.S.) – S. Bhagvat (M.S.) – A. Bhat (M.S.) – D. Buntinas (Ph.D.) – L. Chai (Ph.D.) – B. Chandrasekharan (M.S.) – S. Chakraborthy (Ph.D.) – N. Dandapanthula (M.S.) – V. Dhanraj (M.S.) – C.-H. Chu (Ph.D.) – D. Shankar (Ph.D.) – G. Santhanaraman (Ph.D.) – N. Sarkauskas (B.S. and M.S) – N. Senthil Kumar (M.S.) – A. Singh (Ph.D.) – J. Sridhar (M.S.) – S. Srivastava (M.S.) – S. Sur (Ph.D.) – H. Subramoni (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Vishnu (Ph.D.) – J. Wu (Ph.D.) – W. Yu (Ph.D.) – J. Zhang (Ph.D.) Past Research Scientists – K. Hamidouche – S. Sur – X. Lu Past Post-Docs – D. Banerjee – X. Besseron – M. S. Ghazimeersaeed – T. Gangadharappa (M.S.) – K. Gopalakrishnan (M.S.) – J. Hashmi (Ph.D.) – W. Huang (Ph.D.) – W. Jiang (M.S.) – J. Jose (Ph.D.) – M. Kedia (M.S.) – S. Kini (M.S.) – M. Koop (Ph.D.) – K. Kulkarni (M.S.) – R. Kumar (M.S.) – S. Krishnamoorthy (M.S.) – K. Kandalla (Ph.D.) – M. Li (Ph.D.) – P. Lai (M.S.) – J. Liu (Ph.D.) – M. Luo (Ph.D.) – A. Mamidala (Ph.D.) – G. Marsh (M.S.) – V. Meshram (M.S.) – A. Moody (M.S.) – S. Naravula (Ph.D.) – R. Noronha (Ph.D.) – X. Ouyang (Ph.D.) – S. Pai (M.S.) – S. Potluri (Ph.D.) – K. Raj (M.S.) – R. Rajachandrasekar (Ph.D.) – K. S. Khorassani (Ph.D.) – P. Kousha (Ph.D.) – B. Michalowicz (Ph.D.) – B. Ramesh (Ph.D.) – K. K. Suresh (Ph.D.) – H.-W. Jin – J. Lin – M. Luo Past Senior Research Associate – J. Hashmi Past Programmers – A. Reifsteck – D. Bureddy – J. Perkins – E. Mancini – K. Manian – S. Marcarelli Current Software Engineers – B. Seeds – N. Pavuk – N. Shineman – M. Lieber Past Research Specialist – M. Arnold – J. Smith Current Research Scientists – M. Abduljabbar – A. Shafi – A. H. Tu (Ph.D.) – S. Xu (Ph.D.) – Q. Zhou (Ph.D.) – K. Al Attar (M.S.) – L. Xu (Ph.D.) – A. Ruhela – J. Vienne – H. Wang Current Students (Undergrads) – V. Shah – T. Chen Current Research Specialist – R. Motlagh Current Faculty – H. Subramoni – H. Ahn (Ph.D.) – G. Kuncham (Ph.D.) – R. Vaidya (Ph.D.) – J. Yao (Ph.D.) – M. Han (M.S.) – A. Guptha (M.S.)
  • 50. 50 Network Based Computing Laboratory Global AI (Dec ‘22) • Looking for Bright and Enthusiastic Personnel to join as – PhD Students – Post-Doctoral Researchers – MPI Programmer/Software Engineer – Spark/Big Data Programmer/Software Engineer – Deep Learning, Machine Learning, and Cloud Programmer/Software Engineer • If interested, please send an e-mail to panda@cse.ohio-state.edu Multiple Positions Available in MVAPICH2, BigData and DL/ML Projects
  • 51. 51 Network Based Computing Laboratory Global AI (Dec ‘22) Thank You! Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ kandadisuresh.1@osu.edu The High-Performance MPI/PGAS Project http://mvapich.cse.ohio-state.edu/ The High-Performance Deep Learning Project http://hidl.cse.ohio-state.edu/ The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/ Follow us on https://twitter.com/mvapich