1
Network Based Computing Laboratory Global AI (Dec ‘22)
Designing High-Performance and Scalable Middleware for HPC,
AI and Data Science
Kaushik Kandadi Suresh
The Ohio State University
E-mail: kandadisuresh.1@osu.edu
http://www.cse.ohio-state.edu/~panda
A Talk at Global AI Event (December ‘22)
by
Follow us on
https://twitter.com/mvapich
3
Network Based Computing Laboratory Global AI (Dec ‘22)
Introduction to HPC, MPI, RDMA
• High Performance Computing (HPC):
– Utilization of computing power to process data and operations at high speeds
– Used for solving compute-intensive problems on multiple nodes
• Communication:
– Certain computational problems, when decomposed across multiple nodes/machines,
require data exchange across nodes/processors
– MPI is a parallel programming model that provides communication primitives for
parallel programs (a minimal sketch follows this list)
– RDMA enables direct access to a remote node's memory without CPU involvement
• Improves communication latency
• Provided by interconnects such as InfiniBand, RoCE, Slingshot
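To make the MPI primitives above concrete, here is a minimal point-to-point sketch added for this write-up (not from the original slides), assuming mpi4py is installed on top of an MPI library such as MVAPICH2 and the program is launched with, e.g., mpirun -np 2 python send_recv.py (the file name is hypothetical):

# Minimal mpi4py send/receive sketch (illustrative only).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.zeros(4, dtype='i')
if rank == 0:
    buf[:] = [1, 2, 3, 4]
    comm.Send([buf, MPI.INT], dest=1, tag=0)    # rank 0 sends four integers to rank 1
elif rank == 1:
    comm.Recv([buf, MPI.INT], source=0, tag=0)  # rank 1 receives them
    print("rank 1 received", buf)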
4
Network Based Computing Laboratory Global AI (Dec ‘22)
4
Bigger Challenge: Blood Flow in Human Vascular Network
• Cardiovascular disease accounts for about 50% of deaths in the Western world
• Formation of arterial disease is strongly correlated with blood flow patterns
Computational challenges:
Enormous problem size: in one minute, the heart pumps the entire blood supply of 5 quarts through 60,000 miles of vessels, about a quarter of the distance between the Earth and the Moon
Blood flow involves multiple scales
Courtesy: G. Em Karniadakis & L. Grinberg
5
Network Based Computing Laboratory Global AI (Dec ‘22)
Bigger Challenge: Earthquake and Flu/COVID Pandemic Simulation
Earthquake simulation
Surface velocity 75 sec after
earthquake
Flu pandemic simulation
300 million people tracked
Density of infected population, 45 days after outbreak
Courtesy: G. Em Karniadakis & L. Grinberg
6
Network Based Computing Laboratory Global AI (Dec ‘22)
Big Velocity – How Much Data Is Generated Every Minute on the Internet?
The global Internet population grew 10% from January 2021 to July 2021 and now represents 5.17 billion people.
Courtesy: https://www.domo.com/blog/data-never-sleeps-9/
7
Network Based Computing Laboratory Global AI (Dec ‘22)
AI, Machine Learning and Deep Learning?
Courtesy: https://hackernoon.com/difference-between-artificial-intelligence-machine-learning-
and-deep-learning-1pcv3zeg, https://blog.dataiku.com/ai-vs.-machine-learning-vs.-deep-learning,
https://en.wikipedia.org/wiki/Machine_learning
• Machine Learning (ML)
– “the study of computer algorithms to improve
automatically through experience and use of data”
• Deep Learning (DL) – a subset of ML
– Uses Deep Neural Networks (DNNs)
– Perhaps, the most revolutionary subset!
• Based on learning data representation
• DNN Examples: Convolutional Neural Networks,
Recurrent Neural Networks, Hybrid Networks
• Data Scientist or Developer Perspective for using DNNs
1. Identify DL as a solution to a problem
2. Determine the data set
3. Select the deep learning algorithm to use
4. Use a large data set to train the algorithm (a minimal training-loop sketch follows)
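As a hedged illustration of the last step (training on a data set), a minimal PyTorch training loop; the model, data, and hyperparameters below are placeholders chosen for this sketch, not taken from the talk:

# Toy training loop: forward pass, loss, backward pass, parameter update.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(256, 32)            # stand-in for a real data set
y = torch.randint(0, 10, (256,))    # stand-in labels

for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()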
8
Network Based Computing Laboratory Global AI (Dec ‘22)
Credit Card Fraud Detection using Machine Learning
Courtesy: https://spd.group/machine-learning/fraud-detection-with-machine-learning
https://www.sas.com/en_us/insights/articles/risk-fraud/fraud-detection-machine-learning.html
… almost $112 million due to credit card fraud in 2019.
9
Network Based Computing Laboratory Global AI (Dec ‘22)
The Impact of Deep Learning on Application Areas
Courtesy: https://github.com/alexjc/neural-doodle
Courtesy: https://arxiv.org/pdf/1808.02334.pdf
Courtesy: https://research.googleblog.com/2015/07/how-google-translate-squeezes-deep.html
Courtesy: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8065136
10
Network Based Computing Laboratory Global AI (Dec ‘22)
Self Driving Cars
Courtesy: http://www.teslarati.com/teslas-full-self-driving-capability-arrive-3-months-definitely-6-months-says-musk/
11
Network Based Computing Laboratory Global AI (Dec ‘22)
• Applications
– Prostate Cancer Detection
– Metastasis Detection in Breast Cancer
– Genetic Mutation Prediction
– Tumor Detection for Molecular Analysis
AI-Driven Digital Pathology
Courtesy: https://www.frontiersin.org/articles/10.3389/fmed.2019.00185/full
12
Network Based Computing Laboratory Global AI (Dec ‘22)
Artificial Intelligence Use Cases and Growth Trends
Courtesy: https://www.top500.org/news/market-for-artificial-intelligence-projected-to-hit-36-billion-by-2025/
13
Network Based Computing Laboratory Global AI (Dec ‘22)
High-End Computing (HEC): PetaFlop to ExaFlop
• 100 PetaFlops in 2017
• 442 PetaFlops in 2020 (Fugaku in Japan with 7.63M cores)
• 1.1 ExaFlops (HPL) and 6.88 ExaFlops (HPL-AI) in 2022 (Frontier at ORNL with 8.73M cores)
14
Network Based Computing Laboratory Global AI (Dec ‘22)
Trends for Commodity Computing Clusters in the Top 500
List (http://www.top500.org)
[Chart: number of clusters (0-500, left axis) and percentage of clusters (0-100%, right axis) in the Top500 list over time; commodity clusters now account for 98.4% of the systems]
15
Network Based Computing Laboratory Global AI (Dec ‘22)
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand, RoCE, Slingshot)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
Accelerators: high compute density, high performance/watt, >9.7 TFlop DP on a chip
High-performance interconnects (InfiniBand, Slingshot): <1 usec latency, 200-400 Gbps bandwidth
Multi-/many-core processors; SSD, NVMe-SSD, NVRAM
Example systems: Frontier, Summit, Lumi, Fugaku
16
Network Based Computing Laboratory Global AI (Dec ‘22)
Increasing Usage of HPC, Deep/Machine Learning, and Data Science
[Diagram: Deep/Machine Learning (TensorFlow, PyTorch, cuML, etc.), Big Data (Hadoop, Spark) / Data Science (Dask), and HPC (MPI, PGAS, etc.)]
Convergence of HPC, Deep/Machine Learning, and Data Science!
Increasing need to run these applications on the Cloud!!
Can MPI-driven converged middleware be designed and used for all three domains?
17
Network Based Computing Laboratory Global AI (Dec ‘22)
Designing Communication Libraries for Multi-Petaflop and
Exaflop Systems: Challenges
Application Kernels/Applications (HPC, DL, Data Science)
Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Hadoop, Spark (RDD, DAG), TensorFlow, PyTorch, etc.
Communication Library or Runtime for Programming Models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
Underlying hardware: networking technologies (InfiniBand, Ethernet, RoCE, Omni-Path, and Slingshot), multi-/many-core architectures, accelerators (GPU and FPGA)
Middleware goals: performance, scalability, resilience; co-design opportunities and challenges across the various layers
18
Network Based Computing Laboratory Global AI (Dec ‘22)
• MVAPICH Project
– MPI Library with CUDA-Awareness (GPU)
– Accelerating applications with DPU
• HiDL Project
– High-Performance Deep Learning
– High-Performance Machine Learning
• HiBD Project
– Accelerating Big Data and Data Science Applications
• Conclusions
Presentation Overview
19
Network Based Computing Laboratory Global AI (Dec ‘22)
Converged Software Stacks for HPC, Deep/Machine Learning, and Data Science
[Diagram: Deep/Machine Learning (TensorFlow, PyTorch, cuML, etc.), Big Data (Hadoop, Spark) / Data Science (Dask), and HPC (MPI, PGAS, etc.) on a converged software stack]
20
Network Based Computing Laboratory Global AI (Dec ‘22)
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library
• Support for multiple interconnects
– InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), AWS
EFA, Rockport Networks, and Slingshot
• Support for multiple platforms
– x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs (NVIDIA and AMD)
• Started in 2001, first open-source version demonstrated at SC ‘02
• Supports the latest MPI-3.1 standard
• http://mvapich.cse.ohio-state.edu
• Additional optimized versions for different systems/environments:
– MVAPICH2-X (Advanced MPI + PGAS), since 2011
– MVAPICH2-GDR with support for NVIDIA (since 2014) and AMD (since 2020) GPUs
– MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014
– MVAPICH2-Virt with virtualization support, since 2015
– MVAPICH2-EA with support for Energy-Awareness, since 2015
– MVAPICH2-Azure for Azure HPC IB instances, since 2019
– MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
• Tools:
– OSU MPI Micro-Benchmarks (OMB), since 2003
– OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Used by more than 3,200 organizations in 89 countries
• More than 1.56 Million downloads from the OSU site
directly
• Empowering many TOP500 clusters (Nov ‘21 ranking)
– 4th , 10,649,600-core (Sunway TaihuLight) at NSC, Wuxi, China
– 13th, 448,448 cores (Frontera) at TACC
– 26th, 288,288 cores (Lassen) at LLNL
– 38th, 570,020 cores (Nurion) in South Korea and many others
• Available with software stacks of many vendors and
Linux Distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 13th ranked TACC Frontera system
• Empowering Top500 systems for more than 16 years
21
Network Based Computing Laboratory Global AI (Dec ‘22)
MVAPICH2 Release Timeline and Downloads
[Chart: cumulative downloads from Sep 2004 to Mar 2021, growing to roughly 1.6 million, annotated with release milestones: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.2, MV2-X 2.3, MV2-Azure 2.3.2, MV2-AWS 2.3, MV2 2.3.6, MV2-GDR 2.3.6, OSU INAM 0.9.6]
22
Network Based Computing Laboratory Global AI (Dec ‘22)
Architecture of MVAPICH2 Software Family for HPC, DL/ML, and
Data Science
High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU)
Transport protocols (RC, SRD, UD, DC) and transport mechanisms (shared memory, CMA, IVSHMEM, XPMEM); modern features (UMR, ODP, SR-IOV, multi-rail, Optane*, NVLink, CAPI*)
* Upcoming
23
Network Based Computing Laboratory Global AI (Dec ‘22)
Optimized MVAPICH2-GDR with CUDA-Aware MPI Support
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size (1 byte - 8 KB), comparing MV2 (no GDR) with MV2-GDR 2.3; latency drops to 1.85 us, with improvements of roughly 9x-11x across the three metrics]
Platform: MVAPICH2-GDR 2.3, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct RDMA
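A hedged sketch of what CUDA-aware MPI means at the application level: with a CUDA-aware library such as MVAPICH2-GDR underneath mpi4py (assumed here, together with CuPy), GPU-resident buffers can be passed directly to MPI calls without manual device-to-host staging:

# Illustrative two-process exchange of GPU buffers; run with mpirun -np 2.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

sendbuf = cp.arange(1024, dtype=cp.float32)   # buffer lives in GPU memory
recvbuf = cp.empty_like(sendbuf)

peer = 1 - rank                               # the other rank in a 2-process job
comm.Sendrecv(sendbuf, dest=peer, recvbuf=recvbuf, source=peer)
cp.cuda.runtime.deviceSynchronize()           # ensure the GPU data is ready to use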
24
Network Based Computing Laboratory Global AI (Dec ‘22)
Enhanced DDT Support: HCA Assisted Inter-Node Scheme (UMR)
• Comparison of UMR based DDT scheme in MVAPICH2-GDR-Next with OpenMPI 4.1.3, MVAPICH2-GDR 2.3.6
• 1 GPU per Node, 2 Node experiment. Speed-up relative to OpenMPI
Platform: ThetaGPU (NVIDIA DGX-A100) (NVIDIA Ampere GPUs connected with NVSwitch), CUDA 11.0
[Charts: DDTBench-MILC (nested vector datatypes for 4-D face exchanges) and DDTBench-NASMGY (3-D face exchanges with vector and nested vector datatypes), showing speedups relative to OpenMPI for MVAPICH2-GDR-Next, MVAPICH2-GDR, and OpenMPI across several input-parameter sets; the UMR-based scheme in MVAPICH2-GDR-Next improves over MVAPICH2-GDR by about 32-35%]
K. Suresh, K. Khorassani, C. Chen, B. Ramesh, M. Abduljabbar, A. Shafi, D. Panda, Network Assisted Non-Contiguous Transfers for GPU-Aware
MPI Libraries, Hot Interconnects 29
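For readers unfamiliar with MPI derived datatypes (DDTs), the sketch below shows the basic idea the UMR scheme accelerates: describing a non-contiguous region (here, one column of a row-major array) with a vector datatype so the MPI library, not the application, handles the packing. This is a generic mpi4py illustration with made-up sizes, not the benchmark code above:

# Send one strided column of a 2-D array using an MPI vector datatype (mpi4py).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 8
grid = np.arange(n * n, dtype='d').reshape(n, n)

# n blocks of 1 element each, separated by a stride of n elements = one column.
column_t = MPI.DOUBLE.Create_vector(n, 1, n).Commit()

if rank == 0:
    comm.Send([grid, 1, column_t], dest=1, tag=7)    # library packs the column
elif rank == 1:
    comm.Recv([grid, 1, column_t], source=0, tag=7)  # and unpacks it on receipt
column_t.Free()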
25
Network Based Computing Laboratory Global AI (Dec ‘22)
• Weak-Scaling of HPC application AWP-ODC on Lassen cluster (V100 nodes)
• MPC-OPT achieves up to +18% GPU computing flops, -15% runtime per timestep
• ZFP-OPT achieves up to +35% GPU computing flops, -26% runtime per timestep
“On-the-fly” Compression Support in MVAPICH2-GDR
[Charts: GPU computing FLOPS (TFLOPS) and runtime per timestep (ms) on 8-512 GPUs for the no-compression baseline, MPC-OPT, and ZFP-OPT (rate 16 and rate 8), showing up to +35% FLOPS and -26% runtime]
Q. Zhou, C. Chu, N. Senthil Kumar, P. Kousha, M. Ghazimirsaeed, H. Subramoni, and D.K. Panda, Designing High-Performance MPI Libraries with On-the-fly Compression for
Modern GPU Clusters, 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 2021. [Best Paper Finalist]
26
Network Based Computing Laboratory Global AI (Dec ‘22)
MVAPICH Accelerates Parallel 3-D FFT at Oak Ridge
Accelerating the communication cost on parallel 3-D FFTs, Stan Tomov and Alan Ayala, The University of Tennessee, Knoxville
(http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/21/Ayala.pdf)
MVAPICH is around 10-20% faster than SpectrumMPI 10.3
for heFFTe Library
Comparison of achievable bandwidth for two-node exchange via MPI_Send
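For reference, a rough mpi4py sketch of the kind of two-process MPI_Send exchange behind such a bandwidth comparison; the message size and iteration count are arbitrary choices for this illustration:

# Simple ping-pong bandwidth estimate between rank 0 and rank 1.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nbytes = 4 * 1024 * 1024
buf = np.zeros(nbytes, dtype='b')
iters = 100

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send([buf, MPI.BYTE], dest=1)
        comm.Recv([buf, MPI.BYTE], source=1)
    else:
        comm.Recv([buf, MPI.BYTE], source=0)
        comm.Send([buf, MPI.BYTE], dest=0)
t1 = MPI.Wtime()
if rank == 0:
    print("approx. bandwidth (MB/s):", 2 * iters * nbytes / (t1 - t0) / 1e6)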
27
Network Based Computing Laboratory Global AI (Dec ‘22)
MVAPICH Drives Nuclear Energy Research at Idaho National Lab
(INL)
The MOOSE Multiphysics Computational Framework for Nuclear Power Applications: A Special Issue of Nuclear Technology
(https://www.tandfonline.com/doi/full/10.1080/00295450.2021.1915487)
MVAPICH Integration for PBS Pro, HPC Team, Idaho National Laboratory
(http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/21/inl.pdf)
28
Network Based Computing Laboratory Global AI (Dec ‘22)
Rapid adoption of MVAPICH2 on INL HPC systems
[Chart: MPI library usage (job counts, log scale from 1 to 1,000,000), 1 Jan 2021 - 1 Sep 2021, for mvapich2/2.3.5, mvapich2/2.3.3, openmpi/4.0.2, openmpi/4.0.5, and intelmpi]
M. Anderson, Aggressive Asynchronous Communication in the MOOSE framework using MVAPICH2, 10th Annual MVAPICH User
Group Conference (MUG), Aug 2022
29
Network Based Computing Laboratory Global AI (Dec ‘22)
• Near-Earth asteroids (NEAs) have caused recent and ancient global catastrophes
– LLNL scientists research ways to prevent NEA impacts using methods known as asteroid deflection
– Joint NASA-LLNL research modelled various asteroid deflection methods (NASA's DART mission)
• MVAPICH2 lived at the core of the NASA DART mission effort and enabled scalability
– It underpinned the large-scale hydrodynamical and gravitational simulations, such as Spheral models, required to compute the impact
MVAPICH2 enabling NASA's life-changing DART mission
• https://twitter.com/NASA/status/1574539270987173903?s=20&t=u_4wIV9Cui2xyn9QLj286Q
• https://www.cbsnews.com/sanfrancisco/news/i-just-could-not-believe-it-livermore-team-celebrates-nasas-historic-strike-on-distant-asteroid/
• http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/18/moody-mug-18.pdf
30
Network Based Computing Laboratory Global AI (Dec ‘22)
• LLNL’s National Ignition Facility (NIF) conducted the first
controlled fusion experiment in history![1]
• MVAPICH, being the default MPI library on the LLNL
systems, has been enabling the thousands of simulation
jobs that have led to this amazing achievement!
• [1] https://www.llnl.gov/news/national-ignition-facility-achieves-fusion-ignition
MVAPICH2 enabling Nuclear Fusion Research
The target chamber of LLNL’s National Ignition Facility, where 192
laser beams delivered more than 2 million joules of ultraviolet
energy to a tiny fuel pellet to create fusion ignition on Dec. 5, 2022.
The hohlraum that houses the type of cryogenic target used
to achieve ignition on Dec. 5, 2022, at LLNL’s National
Ignition Facility.
To create fusion ignition, the National Ignition Facility’s laser energy is
converted into X-rays inside the hohlraum, which then compress a fuel
capsule until it implodes, creating a high temperature, high pressure plasma.
31
Network Based Computing Laboratory Global AI (Dec ‘22)
• MVAPICH Project
– MPI Library with CUDA-Awareness (GPU)
– Accelerating applications with DPU
• HiDL Project
– High-Performance Deep Learning
– High-Performance Machine Learning
• HiBD Project
– Accelerating Big Data and Data Science Applications
• Conclusions
Presentation Overview
32
Network Based Computing Laboratory Global AI (Dec ‘22)
• Scale-up: Intra-node Communication
– Many improvements like:
• NVIDIA cuDNN, cuBLAS, NCCL, etc.
• CUDA Co-operative Groups
• Scale-out: Inter-node Communication
– DL Frameworks – most are optimized for
single-node only
– Distributed (Parallel) Training is an emerging
trend
• PyTorch – MPI/NCCL2
• TensorFlow – gRPC-based/MPI/NCCL2
• OSU-Caffe – MPI-based
Scale-up and Scale-out
[Diagram: scale-up performance vs. scale-out performance, positioning cuDNN, MKL-DNN, NCCL2, gRPC, Hadoop, and MPI relative to the desired combination of both]
33
Network Based Computing Laboratory Global AI (Dec ‘22)
Converged Software Stacks for HPC, Deep/Machine Learning, and Data Science
[Diagram: Deep/Machine Learning (TensorFlow, PyTorch, cuML, etc.), Big Data (Hadoop, Spark) / Data Science (Dask), and HPC (MPI, PGAS, etc.) on a converged software stack]
34
Network Based Computing Laboratory Global AI (Dec ‘22)
MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training
[Diagram: ML/DL applications running on TensorFlow, PyTorch, and MXNet via Horovod, or on PyTorch and DeepSpeed via torch.distributed, with MVAPICH2 or MVAPICH2-X for CPU training and MVAPICH2-GDR for GPU training underneath]
More details available from: http://hidl.cse.ohio-state.edu
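A hedged sketch of the Horovod path in the stack above: data-parallel PyTorch training where the gradient allreduce runs over the underlying MPI library (e.g., MVAPICH2-GDR) when launched with mpirun. The model and data are placeholders for this illustration:

# Data-parallel training with Horovod; one GPU per MPI rank.
import torch
import horovod.torch as hvd

hvd.init()                                   # uses the MPI runtime underneath
torch.cuda.set_device(hvd.local_rank())      # bind this rank to one local GPU

model = torch.nn.Linear(128, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are allreduced across ranks each step.
opt = hvd.DistributedOptimizer(opt, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

x = torch.randn(64, 128).cuda()
y = torch.randint(0, 10, (64,)).cuda()
for step in range(10):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()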
35
Network Based Computing Laboratory Global AI (Dec ‘22)
Distributed TensorFlow on ORNL Summit (1,536 GPUs)
• ResNet-50 Training using
TensorFlow benchmark on
SUMMIT -- 1536 Volta
GPUs!
• 1,281,167 (1.2 mil.) images
• Time/epoch = 3 seconds
• Total Time (90 epochs)
= 3 x 90 = 270 seconds =
4.5 minutes!
[Chart: images per second (thousands) vs. number of GPUs (1-1536) for MVAPICH2-GDR 2.3.4]
Platform: The Summit Supercomputer (#2 on Top500.org) - 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1
*We observed issues for NCCL2 beyond 384 GPUs
MVAPICH2-GDR reaches ~0.42 million images per second for ImageNet-1k! ImageNet-1k has 1.2 million images
36
Network Based Computing Laboratory Global AI (Dec ‘22)
PyTorch at Scale: Training ResNet-50 on 256 V100 GPUs
Distributed Framework | Communication Backend | Images/sec on 256 GPUs
Torch.distributed | NCCL 2.7 | 61,794
Torch.distributed | MVAPICH2-GDR | 72,120
Horovod | NCCL 2.7 | 74,063
Horovod | MVAPICH2-GDR | 84,659
DeepSpeed | NCCL 2.7 | 80,217
DeepSpeed | MVAPICH2-GDR | 88,873
• Training performance for 256 V100 GPUs on LLNL Lassen
– ~10,000 Images/sec faster than NCCL training!
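A hedged sketch of the torch.distributed path from the table: DistributedDataParallel initialized with the MPI backend (this assumes a PyTorch build with MPI support, with a library such as MVAPICH2-GDR supplying the MPI). Model and data are placeholders:

# DistributedDataParallel over the MPI backend; launch with mpirun.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="mpi")        # ranks/world size come from mpirun
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

model = torch.nn.Linear(128, 10).cuda()
ddp_model = DDP(model)                        # gradients allreduced automatically

opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
x = torch.randn(64, 128).cuda()
y = torch.randint(0, 10, (64,)).cuda()
for step in range(10):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
    loss.backward()
    opt.step()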
37
Network Based Computing Laboratory Global AI (Dec ‘22)
• Pathology whole slide image (WSI)
– Each WSI = 100,000 x 100,000 pixels
– Cannot fit in a single GPU's memory
– Tiles are extracted to make training possible
• Two main problems with tiles
– Restricted tile size because of GPU memory limitations
– Smaller tiles lose structural information
• Reduced training time significantly
– GEMS-Basic: 7.25 hours (1 node, 4 GPUs)
– GEMS-MAST: 6.28 hours (1 node, 4 GPUs)
– GEMS-MASTER: 4.21 hours (1 node, 4 GPUs)
– GEMS-Hybrid: 0.46 hours (32 nodes, 128 GPUs)
– Overall 15x reduction in training time!
Exploiting Model Parallelism in AI-Driven Digital Pathology
Courtesy: https://blog.kitware.com/digital-slide-
archive-large-image-and-histomicstk-open-source-
informatics-tools-for-management-visualization-and-
analysis-of-digital-histopathology-data/
Scaling ResNet110 v2 on 1024×1024 image tiles
using histopathology data
A. Jain, A. Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, D. K. Panda, R. Machiraju, and A. Parwani,
“GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training”, Supercomputing
(SC ‘20).
[Chart: throughput speedup (images per second) vs. number of GPUs (4-128): 1x, 1.9x, 3.6x, 7x, 12x, 22x]
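To illustrate the model-parallelism idea (not the GEMS implementation itself), a minimal two-GPU split in PyTorch where different layers live on different GPUs and activations move between them; the layer sizes and input shape are arbitrary:

# Toy model-parallel sketch: part1 on GPU 0, part2 on GPU 1.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Flatten(), nn.Linear(16 * 64 * 64, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))    # activations are copied between GPUs

model = TwoGPUModel()
out = model(torch.randn(2, 3, 64, 64))       # tiny stand-in for a large image tile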
38
Network Based Computing Laboratory Global AI (Dec ‘22)
• MVAPICH Project
– MPI Library with CUDA-Awareness (GPU)
– Accelerating applications with DPU
• HiDL Project
– High-Performance Deep Learning
– High-Performance Machine Learning
• HiBD Project
– Accelerating Big Data and Data Science Applications
• Conclusions
Presentation Overview
39
Network Based Computing Laboratory Global AI (Dec ‘22)
Converged Software Stacks for HPC, Deep/Machine Learning, and Data Science
[Diagram: Deep/Machine Learning (TensorFlow, PyTorch, cuML, etc.), Big Data (Hadoop, Spark) / Data Science (Dask), and HPC (MPI, PGAS, etc.) on a converged software stack]
40
Network Based Computing Laboratory Global AI (Dec ‘22)
• Since 2013
• RDMA for Apache Spark
• RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Kafka
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 340 organizations from 36 countries
• More than 44,000 downloads from the project site
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE
Also run on Ethernet
Available for x86 and OpenPOWER
Support for Singularity and Docker
41
Network Based Computing Laboratory Global AI (Dec ‘22)
• InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node.
– 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps)
RDMA-Spark on SDSC Comet – HiBench PageRank
32 Worker Nodes, 768 cores, PageRank Total Time 64 Worker Nodes, 1536 cores, PageRank Total Time
[Charts: HiBench PageRank total time (sec) for Huge, BigData, and Gigantic data sizes, IPoIB vs. RDMA: 37% reduction with 32 worker nodes/768 cores and 43% with 64 worker nodes/1536 cores]
42
Network Based Computing Laboratory Global AI (Dec ‘22)
• The main motivation of this work is to utilize the
communication functionality provided by
MVAPICH2 in the Apache Spark framework
• MPI4Spark relies on Java bindings of the
MVAPICH2 library
• Spark’s default ShuffleManager relies on Netty for
communication:
– Netty is a Java New I/O (NIO) client/server
framework for event-based networking applications
– The key idea is to utilize MPI-based point-to-point
communication inside Netty
MPI4Spark: Using MVAPICH2 to Optimize Apache Spark
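Because MPI4Spark swaps the transport underneath Spark's ShuffleManager, user-level Spark code is unchanged; a hedged PySpark sketch of a shuffle-heavy job that would simply run on top of it (application name and data are placeholders):

# Ordinary PySpark groupBy job; nothing here is MPI4Spark-specific.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("GroupByExample").getOrCreate()
rdd = spark.sparkContext.parallelize([(i % 100, i) for i in range(1000000)])
counts = rdd.groupByKey().mapValues(lambda vals: sum(1 for _ in vals)).collect()  # triggers a shuffle
spark.stop()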
43
Network Based Computing Laboratory Global AI (Dec ‘22)
MPI4Spark: Relative Speedups to Vanilla Spark and RDMA-
Spark on Three HPC Systems
System Name | Nodes Used | Processor | Cores Used | Sockets | Cores/socket | RAM | Interconnect
TACC Frontera | 34 | Xeon Platinum | 1792 | 2 | 28 | 192 GB | HDR (100G)
RI2 (OSU System) | 14 | Xeon Broadwell | 336 | 2 | 14 | 128 GB | EDR (100G)
MRI (OSU System) | 12 | AMD EPYC 7713 | 1280 | 2 | 64 | 264 GB | 200 Gb/sec (4X HDR)
[Charts: OHB GroupByTest (3.65x and 1.88x) and OHB SortByTest (3.52x and 1.86x) speedups for MPI4Spark relative to Vanilla Spark and RDMA-Spark, respectively]
44
Network Based Computing Laboratory Global AI (Dec ‘22)
Dask Architecture
[Diagram: Dask stack (Dask Bag, Dask Array, Dask DataFrame, Delayed, Future, Task Graph) on the Distributed layer (Scheduler, Worker, Client) with deployment options Dask-MPI, Dask-CUDA, and Dask-Jobqueue; the Comm Layer offers tcp.py (TCP), ucx.py (UCX via UCX-Py Cython wrappers), and MPI4Dask (mpi4py over MVAPICH2-GDR), running on laptops/desktops or high-performance computing hardware]
45
Network Based Computing Laboratory Global AI (Dec ‘22)
• MPI4Dask 0.2 was released in Mar '21, adding support for high-performance MPI communication to Dask:
– Can be downloaded from: http://hibd.cse.ohio-state.edu
• Features:
– Based on Dask Distributed 2021.01.0
– Compliant with user-level Dask APIs and packages
– Support for MPI-based communication in Dask for clusters of GPUs
– Implements point-to-point communication co-routines
– Efficient chunking mechanism implemented for large messages
– (NEW) Built on top of mpi4py over the MVAPICH2, MVAPICH2-X, and MVAPICH2-GDR libraries
– (NEW) Support for MPI-based communication for CPU-based Dask applications
– Supports starting execution of Dask programs using Dask-MPI
– Tested with
• (NEW) CPU-based Dask applications using numPy and Pandas data frames
• (NEW) GPU-based Dask applications using cuPy and cuDF
• Mellanox InfiniBand adapters (FDR and EDR)
• Various multi-core platforms
• NVIDIA V100 and Quadro RTX 5000 GPUs
MPI4Dask Release
46
Network Based Computing Laboratory Global AI (Dec ‘22)
Benchmark #1: Sum of cuPy Array and its Transpose (RI2)
[Charts: total execution time (s) and communication time (s) vs. number of Dask workers (2-6) for IPoIB, UCX, and MPI4Dask; MPI4Dask is 3.47x better on average in total execution time and 6.92x better in communication time]
A. Shafi , J. Hashmi , H. Subramoni , and D. K. Panda, Efficient MPI-based
Communication for GPU-Accelerated Dask Applications,
https://arxiv.org/abs/2101.08878
MPI4Dask 0.2 release
(http://hibd.cse.ohio-state.edu)
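A hedged sketch of this benchmark pattern (the sum of a CuPy-backed Dask array and its transpose), started via Dask-MPI as described on the MPI4Dask slide; array and chunk sizes are illustrative only:

# Launch with mpirun; dask_mpi.initialize() turns the MPI ranks into
# a scheduler (rank 0), this client script (rank 1), and workers (remaining ranks).
import cupy as cp
import dask.array as da
from dask_mpi import initialize
from dask.distributed import Client

initialize()
client = Client()

rs = da.random.RandomState(RandomState=cp.random.RandomState)   # CuPy-backed random arrays
x = rs.random((20000, 20000), chunks=(2000, 2000))
result = (x + x.T).sum().compute()            # sum of the array and its transpose
print(result)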
47
Network Based Computing Laboratory Global AI (Dec ‘22)
• Solutions to many current and next generation problems are dependent on
the growth of HPC and AI
• Growth and success in AI is very much dependent on HPC
• Presented an overview of the associated opportunities and challenges to
make HPC and AI accessible to all
• Presented a set of solutions to address these challenges
Concluding Remarks
48
Network Based Computing Laboratory Global AI (Dec ‘22)
Funding Acknowledgments
Funding Support by
Equipment Support by
49
Network Based Computing Laboratory Global AI (Dec ‘22)
Acknowledgments to all the Heroes (Past/Current Students and Staff)
Current Students (Graduate)
– N. Alnaasan (Ph.D.)
– Q. Anthony (Ph.D.)
– C.-C. Chun (Ph.D.)
– N. Contini (Ph.D.)
– A. Jain (Ph.D.)
Past Students
– A. Awan (Ph.D.)
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– M. Bayatpour (Ph.D.)
– R. Biswas (M.S.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– S. Chakraborthy (Ph.D.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– C.-H. Chu (Ph.D.)
– D. Shankar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– N. Sarkauskas (B.S. and M.S)
– N. Senthil Kumar (M.S.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Srivastava (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
– J. Zhang (Ph.D.)
Past Research Scientists
– K. Hamidouche
– S. Sur
– X. Lu
Past Post-Docs
– D. Banerjee
– X. Besseron
– M. S. Ghazimeersaeed
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– J. Hashmi (Ph.D.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– M. Kedia (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– K. Raj (M.S.)
– R. Rajachandrasekar (Ph.D.)
– K. S. Khorassani (Ph.D.)
– P. Kousha (Ph.D.)
– B. Michalowicz (Ph.D.)
– B. Ramesh (Ph.D.)
– K. K. Suresh (Ph.D.)
– H.-W. Jin
– J. Lin
– M. Luo
Past Senior Research Associate
– J. Hashmi
Past Programmers
– A. Reifsteck
– D. Bureddy
– J. Perkins
– E. Mancini
– K. Manian
– S. Marcarelli
Current Software Engineers
– B. Seeds
– N. Pavuk
– N. Shineman
– M. Lieber
Past Research Specialist
– M. Arnold
– J. Smith
Current Research Scientists
– M. Abduljabbar
– A. Shafi
– A. H. Tu (Ph.D.)
– S. Xu (Ph.D.)
– Q. Zhou (Ph.D.)
– K. Al Attar (M.S.)
– L. Xu (Ph.D.)
– A. Ruhela
– J. Vienne
– H. Wang
Current Students (Undergrads)
– V. Shah
– T. Chen
Current Research Specialist
– R. Motlagh
Current Faculty
– H. Subramoni
– H. Ahn (Ph.D.)
– G. Kuncham (Ph.D.)
– R. Vaidya (Ph.D.)
– J. Yao (Ph.D.)
– M. Han (M.S.)
– A. Guptha (M.S.)
50
Network Based Computing Laboratory Global AI (Dec ‘22)
• Looking for Bright and Enthusiastic Personnel to join as
– PhD Students
– Post-Doctoral Researchers
– MPI Programmer/Software Engineer
– Spark/Big Data Programmer/Software Engineer
– Deep Learning, Machine Learning, and Cloud Programmer/Software Engineer
• If interested, please send an e-mail to panda@cse.ohio-state.edu
Multiple Positions Available in MVAPICH2, BigData and
DL/ML Projects
51
Network Based Computing Laboratory Global AI (Dec ‘22)
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
kandadisuresh.1@osu.edu
The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/
Follow us on
https://twitter.com/mvapich
Contenu connexe

Similaire à Designing High performance & Scalable Middleware for HPC

00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdf00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdfaminnezarat
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale SystemsDesigning HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systemsinside-BigData.com
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersIntel® Software
 
Scallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systemsScallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systemsGanesan Narayanasamy
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3mustafa sarac
 
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17Mark Goldstein
 
Panda scalable hpc_bestpractices_tue100418
Panda scalable hpc_bestpractices_tue100418Panda scalable hpc_bestpractices_tue100418
Panda scalable hpc_bestpractices_tue100418inside-BigData.com
 
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale SystemsDesigning Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systemsinside-BigData.com
 
CC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfCC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfHasanAfwaaz1
 
2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit Mumbai2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit MumbaiAnand Haridass
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
High-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale SystemsHigh-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale Systemsinside-BigData.com
 
Other distributed systems
Other distributed systemsOther distributed systems
Other distributed systemsSri Prasanna
 
Cluster Tutorial
Cluster TutorialCluster Tutorial
Cluster Tutorialcybercbm
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming DataGeoffrey Fox
 
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)Amazon Web Services
 
Ohio LinuxFest: Crash Course in Open Source Cloud Computing
Ohio LinuxFest:  Crash Course in Open Source Cloud ComputingOhio LinuxFest:  Crash Course in Open Source Cloud Computing
Ohio LinuxFest: Crash Course in Open Source Cloud ComputingMark Hinkle
 

Similaire à Designing High performance & Scalable Middleware for HPC (20)

00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdf00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdf
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
grid computing
grid computinggrid computing
grid computing
 
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale SystemsDesigning HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Scallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systemsScallable Distributed Deep Learning on OpenPOWER systems
Scallable Distributed Deep Learning on OpenPOWER systems
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3
 
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
Phoenix Data Conference - Big Data Analytics for IoT 11/4/17
 
Panda scalable hpc_bestpractices_tue100418
Panda scalable hpc_bestpractices_tue100418Panda scalable hpc_bestpractices_tue100418
Panda scalable hpc_bestpractices_tue100418
 
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale SystemsDesigning Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
 
CC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfCC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdf
 
Presentation-1.ppt
Presentation-1.pptPresentation-1.ppt
Presentation-1.ppt
 
2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit Mumbai2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit Mumbai
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
High-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale SystemsHigh-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale Systems
 
Other distributed systems
Other distributed systemsOther distributed systems
Other distributed systems
 
Cluster Tutorial
Cluster TutorialCluster Tutorial
Cluster Tutorial
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming Data
 
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
 
Ohio LinuxFest: Crash Course in Open Source Cloud Computing
Ohio LinuxFest:  Crash Course in Open Source Cloud ComputingOhio LinuxFest:  Crash Course in Open Source Cloud Computing
Ohio LinuxFest: Crash Course in Open Source Cloud Computing
 

Plus de Object Automation

RTL DESIGN IN ML WORLD_OBJECT AUTOMATION Inc
RTL DESIGN IN ML WORLD_OBJECT AUTOMATION IncRTL DESIGN IN ML WORLD_OBJECT AUTOMATION Inc
RTL DESIGN IN ML WORLD_OBJECT AUTOMATION IncObject Automation
 
CHIPS Alliance_Object Automation Inc_workshop
CHIPS Alliance_Object Automation Inc_workshopCHIPS Alliance_Object Automation Inc_workshop
CHIPS Alliance_Object Automation Inc_workshopObject Automation
 
RTL Design Methodologies_Object Automation Inc
RTL Design Methodologies_Object Automation IncRTL Design Methodologies_Object Automation Inc
RTL Design Methodologies_Object Automation IncObject Automation
 
High-Level Synthesis for the Design of AI Chips
High-Level Synthesis for the Design of AI ChipsHigh-Level Synthesis for the Design of AI Chips
High-Level Synthesis for the Design of AI ChipsObject Automation
 
AI-Inspired IOT Chiplets and 3D Heterogeneous Integration
AI-Inspired IOT Chiplets and 3D Heterogeneous IntegrationAI-Inspired IOT Chiplets and 3D Heterogeneous Integration
AI-Inspired IOT Chiplets and 3D Heterogeneous IntegrationObject Automation
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncObject Automation
 
CDAC presentation as part of Global AI Festival and Future
CDAC presentation as part of Global AI Festival and FutureCDAC presentation as part of Global AI Festival and Future
CDAC presentation as part of Global AI Festival and FutureObject Automation
 
Global AI Festivla and Future one day event
Global AI Festivla and Future one day eventGlobal AI Festivla and Future one day event
Global AI Festivla and Future one day eventObject Automation
 
Generative AI In Logistics_Object Automation
Generative AI In Logistics_Object AutomationGenerative AI In Logistics_Object Automation
Generative AI In Logistics_Object AutomationObject Automation
 
Gen AI_Object Automation_TechnologyWorkshop
Gen AI_Object Automation_TechnologyWorkshopGen AI_Object Automation_TechnologyWorkshop
Gen AI_Object Automation_TechnologyWorkshopObject Automation
 
Deploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdfDeploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdfObject Automation
 
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdf
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdfAI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdf
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdfObject Automation
 
5G Edge Computing_Object Automation workshop
5G Edge Computing_Object Automation workshop5G Edge Computing_Object Automation workshop
5G Edge Computing_Object Automation workshopObject Automation
 
Course_Object Automation.pdf
Course_Object Automation.pdfCourse_Object Automation.pdf
Course_Object Automation.pdfObject Automation
 

Plus de Object Automation (20)

RTL DESIGN IN ML WORLD_OBJECT AUTOMATION Inc
RTL DESIGN IN ML WORLD_OBJECT AUTOMATION IncRTL DESIGN IN ML WORLD_OBJECT AUTOMATION Inc
RTL DESIGN IN ML WORLD_OBJECT AUTOMATION Inc
 
CHIPS Alliance_Object Automation Inc_workshop
CHIPS Alliance_Object Automation Inc_workshopCHIPS Alliance_Object Automation Inc_workshop
CHIPS Alliance_Object Automation Inc_workshop
 
RTL Design Methodologies_Object Automation Inc
RTL Design Methodologies_Object Automation IncRTL Design Methodologies_Object Automation Inc
RTL Design Methodologies_Object Automation Inc
 
High-Level Synthesis for the Design of AI Chips
High-Level Synthesis for the Design of AI ChipsHigh-Level Synthesis for the Design of AI Chips
High-Level Synthesis for the Design of AI Chips
 
AI-Inspired IOT Chiplets and 3D Heterogeneous Integration
AI-Inspired IOT Chiplets and 3D Heterogeneous IntegrationAI-Inspired IOT Chiplets and 3D Heterogeneous Integration
AI-Inspired IOT Chiplets and 3D Heterogeneous Integration
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation Inc
 
CDAC presentation as part of Global AI Festival and Future
CDAC presentation as part of Global AI Festival and FutureCDAC presentation as part of Global AI Festival and Future
CDAC presentation as part of Global AI Festival and Future
 
Global AI Festivla and Future one day event
Global AI Festivla and Future one day eventGlobal AI Festivla and Future one day event
Global AI Festivla and Future one day event
 
Generative AI In Logistics_Object Automation
Generative AI In Logistics_Object AutomationGenerative AI In Logistics_Object Automation
Generative AI In Logistics_Object Automation
 
Gen AI_Object Automation_TechnologyWorkshop
Gen AI_Object Automation_TechnologyWorkshopGen AI_Object Automation_TechnologyWorkshop
Gen AI_Object Automation_TechnologyWorkshop
 
Deploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdfDeploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdf
 
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdf
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdfAI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdf
AI-INSPIRED IOT CHIPLETS AND 3D HETEROGENEOUS INTEGRATION.pdf
 
5G Edge Computing_Object Automation workshop
5G Edge Computing_Object Automation workshop5G Edge Computing_Object Automation workshop
5G Edge Computing_Object Automation workshop
 
COE AI Lab Universities
COE AI Lab UniversitiesCOE AI Lab Universities
COE AI Lab Universities
 
Bootcamp_AIApps.pdf
Bootcamp_AIApps.pdfBootcamp_AIApps.pdf
Bootcamp_AIApps.pdf
 
Bootcamp_AIApps.pdf
Bootcamp_AIApps.pdfBootcamp_AIApps.pdf
Bootcamp_AIApps.pdf
 
Bootcamp_AIAppsUCSD.pptx
Bootcamp_AIAppsUCSD.pptxBootcamp_AIAppsUCSD.pptx
Bootcamp_AIAppsUCSD.pptx
 
Course_Object Automation.pdf
Course_Object Automation.pdfCourse_Object Automation.pdf
Course_Object Automation.pdf
 
Enterprise AI_New.pdf
Enterprise AI_New.pdfEnterprise AI_New.pdf
Enterprise AI_New.pdf
 
Super AI tools
Super AI toolsSuper AI tools
Super AI tools
 

Dernier

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 

Dernier (20)

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Designing High performance & Scalable Middleware for HPC

  • 1. 1 Network Based Computing Laboratory Global AI (Dec ‘22)
  • 2. Designing High-Performance and Scalable Middleware for HPC, AI and Data Science Kaushik Kandadi Suresh The Ohio State University E-mail: kandadisuresh.1@osu.edu http://www.cse.ohio-state.edu/~panda A Talk at Global AI Event (December ‘22) by Follow us on https://twitter.com/mvapich
  • 3. 3 Network Based Computing Laboratory Global AI (Dec ‘22) Introduction to HPC, MPI, RDMA • High Performance Computing (HPC): – utilization of computing power to process data and operations at high speeds – Used for solving compute intensive problems on multiple nodes • Communication: – Certain computational problems when decomposed on multiple nodes/machines requires data exchange across nodes/processors – MPI is a parallel programming model that provides communication primitives for parallel programs – RDMA enables directly access of remote node’s memory without CPU involvement • Improve communication latency • Provided by Interconnects such as InfiniBand, ROCE, Slinghshot
  • 4. 4 Network Based Computing Laboratory Global AI (Dec ‘22) 4 Bigger Challenge: Blood Flow in Human Vascular Network • Cardiovascular disease accounts for about 50% of deaths in western world; • Formation of arterial disease strongly correlated to blood flow patterns; Computational challenges: Enormous problem size In one minute, the heart pumps the entire blood supply of 5 quarts through 60,000 miles of vessels, that is a quarter of the distance between the moon and the earth Blood flow involves multiple scales Courtesy: G. Em Karniadakis & L. Grinberg
  • 5. 5 Network Based Computing Laboratory Global AI (Dec ‘22) Bigger Challenge: Earthquake and Flu/COVID Pandemic Simulation Earthquake simulation Surface velocity 75 sec after earthquake Flu pandemic simulation 300 million people tracked Density of infected population, 45 days after breakout Courtesy: G. Em Karniadakis & L. Grinberg
  • 6. 6 Network Based Computing Laboratory Global AI (Dec ‘22) Big Velocity – How Much Data Is Generated Every Minute on the Internet? The global Internet population grew 10% (in July 2021) from Jan 2021 and now represents 5.17 Billion People. Courtesy: https://www.domo.com/blog/data-never-sleeps-9/
  • 7. 7 Network Based Computing Laboratory Global AI (Dec ‘22) AI, Machine Learning and Deep Learning? Courtesy: https://hackernoon.com/difference-between-artificial-intelligence-machine-learning- and-deep-learning-1pcv3zeg, https://blog.dataiku.com/ai-vs.-machine-learning-vs.-deep-learning, https://en.wikipedia.org/wiki/Machine_learning • Machine Learning (ML) – “the study of computer algorithms to improve automatically through experience and use of data” • Deep Learning (DL) – a subset of ML – Uses Deep Neural Networks (DNNs) – Perhaps, the most revolutionary subset! • Based on learning data representation • DNN Examples: Convolutional Neural Networks, Recurrent Neural Networks, Hybrid Networks • Data Scientist or Developer Perspective for using DNNs 1. Identify DL as solution to a problem 2. Determine Data Set 3. Select Deep Learning Algorithm to Use 4. Use a large data set to train an algorithm
  • 8. 8 Network Based Computing Laboratory Global AI (Dec ‘22) Credit Card Fraud Detection using Machine Learning Courtesy: https://spd.group/machine-learning/fraud-detection-with-machine-learning https://www.sas.com/en_us/insights/articles/risk-fraud/fraud-detection-machine-learning.html … almost $112 million due to credit card fraud in 2019.
  • 9. 9 Network Based Computing Laboratory Global AI (Dec ‘22) The Impact of Deep Learning on Application Areas Courtesy: https://github.com/alexjc/neural-doodle Courtesy: https://arxiv.org/pdf/1808.02334.pdf Courtesy: https://research.googleblog.com/2015/07/how-google-translate-squeezes-deep.html Courtesy: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8065136
  • 10. 10 Network Based Computing Laboratory Global AI (Dec ‘22) Self Driving Cars Courtesy: http://www.teslarati.com/teslas-full-self-driving-capability-arrive-3-months-definitely-6-months-says-musk/
  • 11. 11 Network Based Computing Laboratory Global AI (Dec ‘22) • Applications – Prostate Cancer Detection – Metastasis Detection in Breast Cancer – Genetic Mutation Prediction – Tumor Detection for Molecular Analysis AI-Driven Digital Pathology Courtesy: https://www.frontiersin.org/articles/10.3389/fmed.2019.00185/full
  • 12. 12 Network Based Computing Laboratory Global AI (Dec ‘22) Artificial Intelligence Use Cases and Growth Trends Courtesy: https://www.top500.org/news/market-for-artificial-intelligence-projected-to-hit-36-billion-by-2025/
  • 13. 13 Network Based Computing Laboratory Global AI (Dec ‘22) High-End Computing (HEC): PetaFlop to ExaFlop – 100 PetaFlops in 2017; 442 PetaFlops in 2020 (Fugaku in Japan with 7.63M cores); 1.1 ExaFlops (HPL) and 6.88 ExaFlops (HPL-AI) in 2022 (Frontier at ORNL with 8.73M cores)
  • 14. 14 Network Based Computing Laboratory Global AI (Dec ‘22) Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org) – chart of the number and percentage of clusters in the Top500 over time; clusters now make up 98.4% of the list
  • 15. 15 Network Based Computing Laboratory Global AI (Dec ‘22) Drivers of Modern HPC Cluster Architectures • Multi-core/many-core technologies • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand, RoCE, Slingshot) • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD • Accelerators (NVIDIA GPGPUs) • Available on HPC clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc. – Accelerators: high compute density, high performance/watt, >9.7 TFlop DP on a chip – High-performance interconnects (InfiniBand, Slingshot): <1 usec latency, 200-400 Gbps bandwidth – Multi-/many-core processors – SSD, NVMe-SSD, NVRAM – Example systems: Frontier, Summit, Lumi, Fugaku
  • 16. 16 Network Based Computing Laboratory Global AI (Dec ‘22) Deep/ Machine Learning (TensorFlow, PyTorch, cuML, etc.) Big Data (Hadoop, Spark), Data Science (Dask) HPC (MPI, PGAS, etc.) Increasing Usage of HPC, Deep/Machine Learning, and Data Science Convergence of HPC, Deep/Machine Learning, and Data Science! Increasing Need to Run these applications on the Cloud!! Can MPI-driven Converged Middleware be designed and used for all three domains?
  • 17. 17 Network Based Computing Laboratory Global AI (Dec ‘22) Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges Programming Models MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Hadoop, Spark (RDD, DAG), TensorFlow, PyTorch, etc. Application Kernels/Applications (HPC, DL, Data Science) Networking Technologies (InfiniBand, Ethernet, RoCE, Omni-Path, and Slingshot) Multi-/Many-core Architectures Accelerators (GPU and FPGA) Middleware Co-Design Opportunities and Challenges across Various Layers Performance Scalability Resilience Communication Library or Runtime for Programming Models Point-to-point Communication Collective Communication Energy- Awareness Synchronization and Locks I/O and File Systems Fault Tolerance
  • 18. 18 Network Based Computing Laboratory Global AI (Dec ‘22) • MVAPICH Project – MPI Library with CUDA-Awareness (GPU) – Accelerating applications with DPU • HiDL Project – High-Performance Deep Learning – High-Performance Machine Learning • HiBD Project – Accelerating Big Data and Data Science Applications • Conclusions Presentation Overview
  • 19. 19 Network Based Computing Laboratory Global AI (Dec ‘22) Deep/ Machine Learning (TensorFlow, PyTorch, cuML, etc.) Big Data (Hadoop, Spark), Data Science (Dask) HPC (MPI, PGAS, etc.) Converged Software Stacks for HPC, Deep/Machine Learning, and Data Science
  • 20. 20 Network Based Computing Laboratory Global AI (Dec ‘22) Overview of the MVAPICH2 Project • High-performance open-source MPI library • Support for multiple interconnects – InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), AWS EFA, Rockport Networks, and Slingshot • Support for multiple platforms – x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs (NVIDIA and AMD) • Started in 2001, first open-source version demonstrated at SC ‘02 • Supports the latest MPI-3.1 standard • http://mvapich.cse.ohio-state.edu • Additional optimized versions for different systems/environments: – MVAPICH2-X (Advanced MPI + PGAS), since 2011 – MVAPICH2-GDR with support for NVIDIA (since 2014) and AMD (since 2020) GPUs – MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014 – MVAPICH2-Virt with virtualization support, since 2015 – MVAPICH2-EA with support for Energy-Awareness, since 2015 – MVAPICH2-Azure for Azure HPC IB instances, since 2019 – MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019 • Tools: – OSU MPI Micro-Benchmarks (OMB), since 2003 – OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015 • Used by more than 3,200 organizations in 89 countries • More than 1.56 million downloads from the OSU site directly • Empowering many TOP500 clusters (Nov ‘21 ranking) – 4th: 10,649,600-core Sunway TaihuLight at NSC, Wuxi, China – 13th: 448,448 cores (Frontera) at TACC – 26th: 288,288 cores (Lassen) at LLNL – 38th: 570,020 cores (Nurion) in South Korea, and many others • Available with software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack) • Partner in the 13th-ranked TACC Frontera system • Empowering Top500 systems for more than 16 years
  • 21. 21 Network Based Computing Laboratory Global AI (Dec ‘22) MVAPICH2 Release Timeline and Downloads – chart of cumulative downloads (y-axis: 0 to 1.6 million; x-axis: Sep 2004 to Mar 2021), annotated with release milestones: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3.6, MV2-X 2.3, MV2-Virt 2.2, MV2 2.3.6, OSU INAM 0.9.6, MV2-Azure 2.3.2, and MV2-AWS 2.3
  • 22. 22 Network Based Computing Laboratory Global AI (Dec ‘22) Architecture of MVAPICH2 Software Family for HPC, DL/ML, and Data Science • High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk) • High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis • Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU) • Transport protocols (RC, SRD, UD, DC) and modern features (UMR, ODP, SR-IOV, multi-rail) • Transport mechanisms (shared memory, CMA, IVSHMEM, XPMEM) and modern features (Optane*, NVLink, CAPI*; * upcoming)
  • 23. 23 Network Based Computing Laboratory Global AI (Dec ‘22) Optimized MVAPICH2-GDR with CUDA-Aware MPI Support – charts of GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth (message sizes from 1 Byte to 8 KB) comparing MV2 (no GDR) with MV2-GDR 2.3: ~1.85 us latency (up to 11x better) and roughly 9x-10x higher uni- and bi-directional bandwidth. Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA
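As a hedged illustration of what CUDA-aware MPI means for application code (an assumption-based sketch, not taken from the slides): with a CUDA-aware build such as MVAPICH2-GDR, mpi4py can accept GPU-resident buffers (e.g., CuPy arrays) directly in Send/Recv, avoiding explicit host staging.

```python
# GPU-to-GPU send/recv of a CuPy array via a CUDA-aware MPI (illustrative sketch).
# Requires mpi4py built on top of a CUDA-aware MPI library such as MVAPICH2-GDR.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

msg = cp.arange(1 << 20, dtype=cp.float32)   # 4 MB message resident on the GPU
if rank == 0:
    comm.Send(msg, dest=1, tag=11)           # device buffer handed straight to MPI
elif rank == 1:
    recv = cp.empty_like(msg)
    comm.Recv(recv, source=0, tag=11)
    cp.cuda.Device().synchronize()
    print("rank 1 received", recv[:4])
```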
  • 24. 24 Network Based Computing Laboratory Global AI (Dec ‘22) Enhanced DDT Support: HCA-Assisted Inter-Node Scheme (UMR) • Comparison of the UMR-based derived datatype (DDT) scheme in MVAPICH2-GDR-Next with OpenMPI 4.1.3 and MVAPICH2-GDR 2.3.6 • 1 GPU per node, 2-node experiment; speedup shown relative to OpenMPI • DDTBench-MILC uses a nested vector datatype for 4D face exchanges; DDTBench-NASMGY performs 3D face exchanges with vector and nested vector datatypes • Charts across several input parameter sets show improvements of 35% and 32% over MVAPICH2-GDR 2.3.6 • Platform: ThetaGPU (NVIDIA DGX-A100, NVIDIA Ampere GPUs connected with NVSwitch), CUDA 11.0 • K. Suresh, K. Khorassani, C. Chen, B. Ramesh, M. Abduljabbar, A. Shafi, D. Panda, Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries, Hot Interconnects 29
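The "face exchanges" above are expressed with MPI derived datatypes (DDTs); the UMR design offloads their packing to the HCA. Purely as an illustration of the datatype concept (not of the MVAPICH2-GDR-Next implementation), here is a minimal mpi4py sketch of a strided vector datatype.

```python
# Sending one column (a strided, non-contiguous "face") of a 2-D array
# using an MPI vector derived datatype (illustrative sketch).
from mpi4py import MPI
import numpy as np

comm, rank = MPI.COMM_WORLD, MPI.COMM_WORLD.Get_rank()
n = 8
grid = np.arange(n * n, dtype='d').reshape(n, n)

# n blocks of 1 double, separated by a stride of n doubles = one column of the grid
column_t = MPI.DOUBLE.Create_vector(n, 1, n)
column_t.Commit()

if rank == 0:
    comm.Send([grid, 1, column_t], dest=1, tag=7)       # column sent in place, no packing
elif rank == 1:
    halo = np.zeros(n, dtype='d')
    comm.Recv([halo, n, MPI.DOUBLE], source=0, tag=7)   # received as a contiguous buffer
    print("rank 1 halo:", halo)

column_t.Free()
```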
  • 25. 25 Network Based Computing Laboratory Global AI (Dec ‘22) “On-the-fly” Compression Support in MVAPICH2-GDR • Weak scaling of the HPC application AWP-ODC on the Lassen cluster (V100 nodes), 8 to 512 GPUs, comparing a baseline (no compression) with MPC-OPT and ZFP-OPT (rate 16 and rate 8) • MPC-OPT achieves up to +18% GPU computing flops and -15% runtime per timestep • ZFP-OPT achieves up to +35% GPU computing flops and -26% runtime per timestep • Q. Zhou, C. Chu, N. Senthil Kumar, P. Kousha, M. Ghazimirsaeed, H. Subramoni, and D. K. Panda, Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters, 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 2021 [Best Paper Finalist]
  • 26. 26 Network Based Computing Laboratory Global AI (Dec ‘22) MVAPICH Accelerates Parallel 3-D FFT at Oak Ridge • Accelerating the communication cost on parallel 3-D FFTs, Stan Tomov and Alan Ayala, The University of Tennessee, Knoxville (http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/21/Ayala.pdf) • MVAPICH is around 10-20% faster than Spectrum MPI 10.3 for the heFFTe library • Chart: comparison of achievable bandwidth for a two-node exchange via MPI_Send
  • 27. 27 Network Based Computing Laboratory Global AI (Dec ‘22) MVAPICH Drives Nuclear Energy Research at Idaho National Lab (INL) The MOOSE Multiphysics Computational Framework for Nuclear Power Applications: A Special Issue of Nuclear Technology (https://www.tandfonline.com/doi/full/10.1080/00295450.2021.1915487) MVAPICH Integration for PBS Pro, HPC Team, Idaho National Laboratory (http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/21/inl.pdf)
  • 28. 28 Network Based Computing Laboratory Global AI (Dec ‘22) Rapid adoption of MVAPICH2 on INL HPC systems – chart of MPI library usage (log scale, 1 Jan 2021 to 1 Sep 2021) comparing mvapich2/2.3.5, mvapich2/2.3.3, openmpi/4.0.2, openmpi/4.0.5, and intelmpi • M. Anderson, Aggressive Asynchronous Communication in the MOOSE framework using MVAPICH2, 10th Annual MVAPICH User Group Conference (MUG), Aug 2022
  • 29. 29 Network Based Computing Laboratory Global AI (Dec ‘22) MVAPICH2 enabling life-changing NASA’s DART mission • Near-Earth asteroids (NEAs) have caused recent and ancient global catastrophes – LLNL scientists research ways to counter NEAs using methods known as asteroid deflection – Joint NASA-LLNL research modelled various asteroid deflection methods (NASA’s DART mission) • MVAPICH2 lived at the core of the NASA DART mission simulations and enabled scalability – underneath the large-scale hydrodynamical and gravitational simulations (such as Spheral models) required to compute the impact • https://twitter.com/NASA/status/1574539270987173903?s=20&t=u_4wIV9Cui2xyn9QLj286Q • https://www.cbsnews.com/sanfrancisco/news/i-just-could-not-believe-it-livermore-team-celebrates-nasas-historic-strike-on-distant-asteroid/ • http://mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/18/moody-mug-18.pdf
  • 30. 30 Network Based Computing Laboratory Global AI (Dec ‘22) • LLNL’s National Ignition Facility (NIF) conducted the first controlled fusion experiment in history![1] • MVAPICH, being the default MPI library on the LLNL systems, has been enabling the thousands of simulation jobs that have led to this amazing achievement! • [1] https://www.llnl.gov/news/national-ignition-facility-achieves-fusion-ignition MVAPICH2 enabling Nuclear Fusion Research The target chamber of LLNL’s National Ignition Facility, where 192 laser beams delivered more than 2 million joules of ultraviolet energy to a tiny fuel pellet to create fusion ignition on Dec. 5, 2022. The hohlraum that houses the type of cryogenic target used to achieve ignition on Dec. 5, 2022, at LLNL’s National Ignition Facility. To create fusion ignition, the National Ignition Facility’s laser energy is converted into X-rays inside the hohlraum, which then compress a fuel capsule until it implodes, creating a high temperature, high pressure plasma.
  • 31. 31 Network Based Computing Laboratory Global AI (Dec ‘22) • MVAPICH Project – MPI Library with CUDA-Awareness (GPU) – Accelerating applications with DPU • HiDL Project – High-Performance Deep Learning – High-Performance Machine Learning • HiBD Project – Accelerating Big Data and Data Science Applications • Conclusions Presentation Overview
  • 32. 32 Network Based Computing Laboratory Global AI (Dec ‘22) Scale-up and Scale-out • Scale-up: intra-node communication – many improvements, e.g.: NVIDIA cuDNN, cuBLAS, NCCL, etc.; CUDA Cooperative Groups • Scale-out: inter-node communication – DL frameworks – most are optimized for single-node only – Distributed (parallel) training is an emerging trend • PyTorch – MPI/NCCL2 • TensorFlow – gRPC-based/MPI/NCCL2 • OSU-Caffe – MPI-based • Figure: scale-up performance vs. scale-out performance, positioning cuDNN, MKL-DNN, NCCL2, gRPC, Hadoop, and MPI relative to the desired region (a data-parallel training sketch follows below)
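A minimal, assumption-laden sketch of the scale-out path with Horovod and PyTorch (illustrative only; the script name and toy model are made up). Launching it under mpirun lets an MPI library such as MVAPICH2-GDR drive the allreduce.

```python
# Data-parallel training skeleton with Horovod over MPI (illustrative sketch).
# Launch with:  mpirun -np <num_gpus> python train.py
import torch
import horovod.torch as hvd

hvd.init()                                       # one process per GPU
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across all ranks via allreduce
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for step in range(100):
    x = torch.randn(32, 1024, device='cuda')
    y = torch.randint(0, 10, (32,), device='cuda')
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()                              # gradient allreduce happens here
    optimizer.step()
```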
  • 33. 33 Network Based Computing Laboratory Global AI (Dec ‘22) Deep/ Machine Learning (TensorFlow, PyTorch, cuML, etc.) Big Data (Hadoop, Spark), Data Science (Dask) HPC (MPI, PGAS, etc.) Converged Software Stacks for HPC, Deep/Machine Learning, and Data Science
  • 34. 34 Network Based Computing Laboratory Global AI (Dec ‘22) MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training • ML/DL applications run over Horovod with TensorFlow, PyTorch, and MXNet, or over PyTorch and DeepSpeed with torch.distributed; in both cases MVAPICH2 or MVAPICH2-X is used for CPU training and MVAPICH2-GDR for GPU training • More details available from: http://hidl.cse.ohio-state.edu (an illustrative torch.distributed-over-MPI sketch follows below)
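A hedged sketch of the torch.distributed path shown above: PyTorch's process group can be initialized with the 'mpi' backend, provided PyTorch was built with MPI support (an assumption here), so its collectives map onto the underlying MPI library.

```python
# Initializing torch.distributed over an MPI backend (illustrative sketch;
# requires a PyTorch build with MPI support, launched under mpirun).
import torch
import torch.distributed as dist

dist.init_process_group(backend='mpi')          # rank/size come from the MPI runtime
rank, world = dist.get_rank(), dist.get_world_size()

t = torch.ones(4) * rank
dist.all_reduce(t, op=dist.ReduceOp.SUM)        # same collective DDP uses for gradients
print(f"rank {rank}/{world} allreduce result: {t.tolist()}")
```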
  • 35. 35 Network Based Computing Laboratory Global AI (Dec ‘22) Distributed TensorFlow on ORNL Summit (1,536 GPUs) • ResNet-50 training using the TensorFlow benchmark on Summit – 1,536 Volta GPUs • ImageNet-1k has 1,281,167 (1.2 million) images • Time/epoch = 3 seconds; total time (90 epochs) = 3 x 90 = 270 seconds = 4.5 minutes • Chart: images/second (in thousands) vs. number of GPUs (1 to 1,536) with MVAPICH2-GDR 2.3.4, reaching ~0.42 million images per second for ImageNet-1k • Platform: the Summit supercomputer (#2 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1 • *We observed issues for NCCL2 beyond 384 GPUs
  • 36. 36 Network Based Computing Laboratory Global AI (Dec ‘22) PyTorch at Scale: Training ResNet-50 on 256 V100 GPUs • Training performance (images/sec) on 256 GPUs on LLNL Lassen, NCCL 2.7 vs. MVAPICH2-GDR as the communication backend: Torch.distributed – 61,794 vs. 72,120; Horovod – 74,063 vs. 84,659; DeepSpeed – 80,217 vs. 88,873 • ~10,000 images/sec faster than NCCL training!
  • 37. 37 Network Based Computing Laboratory Global AI (Dec ‘22) Exploiting Model Parallelism in AI-Driven Digital Pathology • Pathology whole slide image (WSI) – each WSI = 100,000 x 100,000 pixels – cannot fit in a single GPU’s memory – tiles are extracted to make training possible • Two main problems with tiles – restricted tile size because of GPU memory limitations – smaller tiles lose structural information • Reduced training time significantly – GEMS-Basic: 7.25 hours (1 node, 4 GPUs) – GEMS-MAST: 6.28 hours (1 node, 4 GPUs) – GEMS-MASTER: 4.21 hours (1 node, 4 GPUs) – GEMS-Hybrid: 0.46 hours (32 nodes, 128 GPUs) – overall 15x reduction in training time! • Chart: throughput speedup scaling ResNet110 v2 on 1024x1024 image tiles using histopathology data – 1x, 1.9x, 3.6x, 7x, 12x, and 22x on 4, 8, 16, 32, 64, and 128 GPUs • A. Jain, A. Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, D. K. Panda, R. Machiraju, and A. Parwani, “GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training”, Supercomputing (SC ‘20) • Courtesy: https://blog.kitware.com/digital-slide-archive-large-image-and-histomicstk-open-source-informatics-tools-for-management-visualization-and-analysis-of-digital-histopathology-data/
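As a generic illustration of the model-parallel idea that GEMS builds on (splitting one large model across GPUs so that activations, rather than the whole model, move between devices), here is a toy PyTorch sketch; it is not the GEMS design and assumes a node with two GPUs.

```python
# Toy model parallelism: layers split across two GPUs, activations passed between them
# (generic illustration only; not the GEMS/MVAPICH2 implementation).
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 0 lives on GPU 0, stage 1 on GPU 1
        self.stage0 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).to('cuda:0')
        self.stage1 = nn.Sequential(nn.Flatten(), nn.Linear(64 * 32 * 32, 10)).to('cuda:1')

    def forward(self, x):
        x = self.stage0(x.to('cuda:0'))
        return self.stage1(x.to('cuda:1'))      # only the activation crosses GPUs

model = TwoGPUNet()
out = model(torch.randn(8, 3, 32, 32))
print(out.shape)                                # torch.Size([8, 10]) on cuda:1
```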
  • 38. 38 Network Based Computing Laboratory Global AI (Dec ‘22) • MVAPICH Project – MPI Library with CUDA-Awareness (GPU) – Accelerating applications with DPU • HiDL Project – High-Performance Deep Learning – High-Performance Machine Learning • HiBD Project – Accelerating Big Data and Data Science Applications • Conclusions Presentation Overview
  • 39. 39 Network Based Computing Laboratory Global AI (Dec ‘22) Deep/ Machine Learning (TensorFlow, PyTorch, cuML, etc.) Big Data (Hadoop, Spark), Data Science (Dask) HPC (MPI, PGAS, etc.) Converged Software Stacks for HPC, Deep/Machine Learning, and Data Science
  • 40. 40 Network Based Computing Laboratory Global AI (Dec ‘22) The High-Performance Big Data (HiBD) Project • Since 2013 • RDMA for Apache Spark • RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x) • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) – Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions • RDMA for Apache Kafka • RDMA for Apache HBase • RDMA for Memcached (RDMA-Memcached) • RDMA for Apache Hadoop 1.x (RDMA-Hadoop) • OSU HiBD-Benchmarks (OHB) – HDFS, Memcached, HBase, and Spark micro-benchmarks • http://hibd.cse.ohio-state.edu • User base: 340 organizations from 36 countries • More than 44,000 downloads from the project site • Available for InfiniBand and RoCE (also runs on Ethernet) • Available for x86 and OpenPOWER • Support for Singularity and Docker
  • 41. 41 Network Based Computing Laboratory Global AI (Dec ‘22) RDMA-Spark on SDSC Comet – HiBench PageRank • InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M, 768/1536R) • RDMA-based design for Spark 1.5.1 • RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps) – 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56 Gbps) • Charts: PageRank total time (sec) for Huge, BigData, and Gigantic data sizes, IPoIB vs. RDMA, on 32 and 64 worker nodes
  • 42. 42 Network Based Computing Laboratory Global AI (Dec ‘22) • The main motivation of this work is to utilize the communication functionality provided by MVAPICH2 in the Apache Spark framework • MPI4Spark relies on Java bindings of the MVAPICH2 library • Spark’s default ShuffleManager relies on Netty for communication: – Netty is a Java New I/O (NIO) client/server framework for event-based networking applications – The key idea is to utilize MPI-based point-to-point communication inside Netty MPI4Spark: Using MVAPICH2 to Optimize Apache Spark
  • 43. 43 Network Based Computing Laboratory Global AI (Dec ‘22) MPI4Spark: Relative Speedups to Vanilla Spark and RDMA-Spark on Three HPC Systems • Systems used – TACC Frontera: 34 nodes, Xeon Platinum, 1792 cores (2 sockets x 28 cores/socket), 192 GB RAM, HDR (100G) – RI2 (OSU system): 14 nodes, Xeon Broadwell, 336 cores (2 sockets x 14 cores/socket), 128 GB RAM, EDR (100G) – MRI (OSU system): 12 nodes, AMD EPYC 7713, 1280 cores (2 sockets x 64 cores/socket), 264 GB RAM, 200 Gb/sec (4X HDR) • Charts: OHB GroupByTest and OHB SortByTest, annotated with speedups of 3.65x, 1.88x, 3.52x, and 1.86x
  • 44. 44 Network Based Computing Laboratory Global AI (Dec ‘22) Dask Architecture • Dask collections (Bag, Array, DataFrame, Delayed, Future) build a task graph that is executed by the distributed scheduler, client, and workers • The communication layer offers tcp.py (TCP), ucx.py (UCX via the UCX-Py Cython wrappers), and MPI4Dask (mpi4py over MVAPICH2-GDR) • Deployment options include Dask-MPI, Dask-CUDA, and Dask-Jobqueue, on laptops/desktops or high-performance computing hardware
  • 45. 45 Network Based Computing Laboratory Global AI (Dec ‘22) • MPI4Dask 0.2 was released in Mar ‘21 adding support for high-performance MPI communication to Dask: – Can be downloaded from: http://hibd.cse.ohio-state.edu • Features: – Based on Dask Distributed 2021.01.0​ – Compliant with user-level Dask APIs and packages​ – Support for MPI-based communication in Dask for cluster of GPUs​ – Implements point-to-point communication co-routines​ – Efficient chunking mechanism implemented for large messages​ – (NEW) Built on top of mpi4py over the MVAPICH2, MVAPICH2-X, and MVAPICH2-GDR libraries​ – (NEW) Support for MPI-based communication for CPU-based Dask applications​ – Supports starting execution of Dask programs using Dask-MPI​ – Tested with​ • (NEW) CPU-based Dask applications using numPy and Pandas data frames • (NEW) GPU-based Dask applications using cuPy and cuDF​ • Mellanox InfiniBand adapters (FDR and EDR)​ • Various multi-core platforms​ • NVIDIA V100 and Quadro RTX 5000 GPUs​ MPI4Dask Release
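A minimal sketch of the "starting execution of Dask programs using Dask-MPI" feature listed above (illustrative; assumes the dask_mpi and distributed packages and an MPI launcher).

```python
# Launching a Dask application under MPI with Dask-MPI (illustrative sketch).
# Run with:  mpirun -np 5 python dask_app.py
# (rank 0 becomes the scheduler, rank 1 runs this client script, remaining ranks are workers)
from dask_mpi import initialize
from dask.distributed import Client
import dask.array as da

initialize()                 # turns the MPI ranks into a Dask cluster
client = Client()            # connects this rank (the client) to that cluster

x = da.random.random((20000, 20000), chunks=(2000, 2000))
print(x.mean().compute())    # task graph executed by the MPI-spawned workers
```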
  • 46. 46 Network Based Computing Laboratory Global AI (Dec ‘22) Benchmark #1: Sum of cuPy Array and its Transpose (RI2) • Charts of total execution time and communication time (seconds) vs. number of Dask workers (2 to 6) for IPoIB, UCX, and MPI4Dask • MPI4Dask is 3.47x better on average in total execution time and 6.92x better on average in communication time • A. Shafi, J. Hashmi, H. Subramoni, and D. K. Panda, Efficient MPI-based Communication for GPU-Accelerated Dask Applications, https://arxiv.org/abs/2101.08878 • MPI4Dask 0.2 release (http://hibd.cse.ohio-state.edu)
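For reference, this kind of "sum of a cuPy array and its transpose" workload can be written in a few lines of Dask; a hedged sketch using dask.array with CuPy-backed chunks, assuming a GPU-enabled Dask cluster and client have already been created (e.g., via Dask-CUDA or Dask-MPI) and that the array sizes are merely illustrative.

```python
# Sketch of a "sum of a cuPy array and its transpose" style workload
# (illustrative only; assumes a GPU-enabled Dask cluster/client already exists).
import cupy as cp
import dask.array as da

# CuPy-backed random chunks instead of NumPy-backed ones
rs = da.random.RandomState(RandomState=cp.random.RandomState)
x = rs.random_sample((40000, 40000), chunks=(4000, 4000))

result = (x + x.T).sum()     # communication-heavy: the transpose shuffles chunks across workers
print(result.compute())
```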
  • 47. 47 Network Based Computing Laboratory Global AI (Dec ‘22) • Solutions to many current and next generation problems are dependent on the growth of HPC and AI • Growth and success in AI is very much dependent on HPC • Presented an overview of the associated opportunities and challenges to make HPC and AI accessible to all • Presented a set of solutions to address these challenges Concluding Remarks
  • 48. 48 Network Based Computing Laboratory Global AI (Dec ‘22) Funding Acknowledgments Funding Support by Equipment Support by
  • 49. 49 Network Based Computing Laboratory Global AI (Dec ‘22) Acknowledgments to all the Heroes (Past/Current Students and Staffs) Current Students (Graduate) – N. Alnaasan (Ph.D.) – Q. Anthony (Ph.D.) – C.-C. Chun (Ph.D.) – N. Contini (Ph.D.) – A. Jain (Ph.D.) Past Students – A. Awan (Ph.D.) – A. Augustine (M.S.) – P. Balaji (Ph.D.) – M. Bayatpour (Ph.D.) – R. Biswas (M.S.) – S. Bhagvat (M.S.) – A. Bhat (M.S.) – D. Buntinas (Ph.D.) – L. Chai (Ph.D.) – B. Chandrasekharan (M.S.) – S. Chakraborthy (Ph.D.) – N. Dandapanthula (M.S.) – V. Dhanraj (M.S.) – C.-H. Chu (Ph.D.) – D. Shankar (Ph.D.) – G. Santhanaraman (Ph.D.) – N. Sarkauskas (B.S. and M.S) – N. Senthil Kumar (M.S.) – A. Singh (Ph.D.) – J. Sridhar (M.S.) – S. Srivastava (M.S.) – S. Sur (Ph.D.) – H. Subramoni (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Vishnu (Ph.D.) – J. Wu (Ph.D.) – W. Yu (Ph.D.) – J. Zhang (Ph.D.) Past Research Scientists – K. Hamidouche – S. Sur – X. Lu Past Post-Docs – D. Banerjee – X. Besseron – M. S. Ghazimeersaeed – T. Gangadharappa (M.S.) – K. Gopalakrishnan (M.S.) – J. Hashmi (Ph.D.) – W. Huang (Ph.D.) – W. Jiang (M.S.) – J. Jose (Ph.D.) – M. Kedia (M.S.) – S. Kini (M.S.) – M. Koop (Ph.D.) – K. Kulkarni (M.S.) – R. Kumar (M.S.) – S. Krishnamoorthy (M.S.) – K. Kandalla (Ph.D.) – M. Li (Ph.D.) – P. Lai (M.S.) – J. Liu (Ph.D.) – M. Luo (Ph.D.) – A. Mamidala (Ph.D.) – G. Marsh (M.S.) – V. Meshram (M.S.) – A. Moody (M.S.) – S. Naravula (Ph.D.) – R. Noronha (Ph.D.) – X. Ouyang (Ph.D.) – S. Pai (M.S.) – S. Potluri (Ph.D.) – K. Raj (M.S.) – R. Rajachandrasekar (Ph.D.) – K. S. Khorassani (Ph.D.) – P. Kousha (Ph.D.) – B. Michalowicz (Ph.D.) – B. Ramesh (Ph.D.) – K. K. Suresh (Ph.D.) – H.-W. Jin – J. Lin – M. Luo Past Senior Research Associate – J. Hashmi Past Programmers – A. Reifsteck – D. Bureddy – J. Perkins – E. Mancini – K. Manian – S. Marcarelli Current Software Engineers – B. Seeds – N. Pavuk – N. Shineman – M. Lieber Past Research Specialist – M. Arnold – J. Smith Current Research Scientists – M. Abduljabbar – A. Shafi – A. H. Tu (Ph.D.) – S. Xu (Ph.D.) – Q. Zhou (Ph.D.) – K. Al Attar (M.S.) – L. Xu (Ph.D.) – A. Ruhela – J. Vienne – H. Wang Current Students (Undergrads) – V. Shah – T. Chen Current Research Specialist – R. Motlagh Current Faculty – H. Subramoni – H. Ahn (Ph.D.) – G. Kuncham (Ph.D.) – R. Vaidya (Ph.D.) – J. Yao (Ph.D.) – M. Han (M.S.) – A. Guptha (M.S.)
  • 50. 50 Network Based Computing Laboratory Global AI (Dec ‘22) • Looking for Bright and Enthusiastic Personnel to join as – PhD Students – Post-Doctoral Researchers – MPI Programmer/Software Engineer – Spark/Big Data Programmer/Software Engineer – Deep Learning, Machine Learning, and Cloud Programmer/Software Engineer • If interested, please send an e-mail to panda@cse.ohio-state.edu Multiple Positions Available in MVAPICH2, BigData and DL/ML Projects
  • 51. 51 Network Based Computing Laboratory Global AI (Dec ‘22) Thank You! Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ kandadisuresh.1@osu.edu The High-Performance MPI/PGAS Project http://mvapich.cse.ohio-state.edu/ The High-Performance Deep Learning Project http://hidl.cse.ohio-state.edu/ The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/ Follow us on https://twitter.com/mvapich