Communication Frameworks for HPC and Big Data
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Talk at HPC Advisory Council Spain Conference (2015)
High-End Computing (HEC): PetaFlop to ExaFlop
HPCAC Spain Conference (Sept '15) 2
• 100-200 PFlops in 2016-2018
• 1 EFlops in 2020-2024?
Trends for Commodity Computing Clusters in the Top 500
List (http://www.top500.org)
3HPCAC Spain Conference (Sept '15)
[Chart: Number and percentage of commodity clusters in the Top500 list over time; clusters now account for 87% of the list]
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand is very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing
Accelerators / Coprocessors: high compute density, high performance/watt, >1 TFlop DP on a chip
High-Performance Interconnects (InfiniBand): <1 usec latency, >100 Gbps bandwidth
Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A
4
Multi-core Processors
HPCAC Spain Conference (Sept '15)
• 259 IB Clusters (51%) in the June 2015 Top500 list
(http://www.top500.org)
• Installations in the Top 50 (24 systems):
Large-scale InfiniBand Installations
519,640 cores (Stampede) at TACC (8th)
185,344 cores (Pleiades) at NASA/Ames (11th)
72,800 cores Cray CS-Storm in US (13th)
72,800 cores Cray CS-Storm in US (14th)
265,440 cores SGI ICE at Tulip Trading Australia (15th)
124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (16th)
72,000 cores (HPC2) in Italy (17th)
115,668 cores (Thunder) at AFRL/USA (19th)
147,456 cores (SuperMUC) in Germany (20th)
86,016 cores (SuperMUC Phase 2) in Germany (21st)
76,032 cores (Tsubame 2.5) at Japan/GSIC (22nd)
194,616 cores (Cascade) at PNNL (25th)
76,032 cores (Makman-2) at Saudi Aramco (28th)
110,400 cores (Pangea) in France (29th)
37,120 cores (Lomonosov-2) at Russia/MSU (31st)
57,600 cores (SwiftLucy) in US (33rd)
50,544 cores (Occigen) at France/GENCI-CINES (36th)
76,896 cores (Salomon) SGI ICE in Czech Republic (40th)
73,584 cores (Spirit) at AFRL/USA (42nd)
and many more!
5HPCAC Spain Conference (Sept '15)
• Scientific Computing
– Message Passing Interface (MPI), including MPI + OpenMP, is the
Dominant Programming Model
– Many discussions towards Partitioned Global Address Space
(PGAS)
• UPC, OpenSHMEM, CAF, etc.
– Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
– Focuses on large data and data analysis
– Hadoop (HDFS, HBase, MapReduce)
– Spark is emerging for in-memory computing
– Memcached is also used for Web 2.0
6
Two Major Categories of Applications
HPCAC Spain Conference (Sept '15)
Towards Exascale System (Today and Target)
Systems: Tianhe-2 (2015) vs. Exascale target (2020-2024); difference between today and Exascale in parentheses
System peak: 55 PFlop/s -> 1 EFlop/s (~20x)
Power: 18 MW (3 Gflops/W) -> ~20 MW (50 Gflops/W) (power O(1), ~15x in Gflops/W)
System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) -> 32-64 PB (~50x)
Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) -> 1.2 or 15 TF (O(1))
Node concurrency: 24-core CPU + 171-core CoP -> O(1k) or O(10k) (~5x - ~50x)
Total node interconnect BW: 6.36 GB/s -> 200-400 GB/s (~40x - ~60x)
System size (nodes): 16,000 -> O(100,000) or O(1M) (~6x - ~60x)
Total concurrency: 3.12M (12.48M threads, 4/core) -> O(billion) for latency hiding (~100x)
MTTI: Few/day -> Many/day (O(?))
Courtesy: Prof. Jack Dongarra
7HPCAC Spain Conference (Sept '15)
HPCAC Spain Conference (Sept '15) 8
Parallel Programming Models Overview
• Shared Memory Model (processes share a single memory): SHMEM, DSM
• Distributed Memory Model (each process has its own memory): MPI (Message Passing Interface)
• Partitioned Global Address Space (PGAS) (private memories plus a logical shared memory): Global Arrays, UPC, Chapel, X10, CAF, …
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
9
Partitioned Global Address Space (PGAS) Models
HPCAC Spain Conference (Sept '15)
• Key features
- Simple shared memory abstractions
- Light-weight one-sided communication (see the OpenSHMEM sketch after this list)
- Easier to express irregular communication
• Different approaches to PGAS
- Languages
• Unified Parallel C (UPC)
• Co-Array Fortran (CAF)
• X10
• Chapel
- Libraries
• OpenSHMEM
• Global Arrays
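A minimal OpenSHMEM sketch of the light-weight one-sided model above (illustrative only; it assumes an OpenSHMEM 1.0 implementation such as the one shipped with MVAPICH2-X and uses only standard calls: start_pes, _my_pe, shmem_long_put, shmem_barrier_all):

#include <stdio.h>
#include <shmem.h>

int main(void)
{
    start_pes(0);                      /* OpenSHMEM 1.0 initialization */
    int me = _my_pe();
    int npes = _num_pes();

    static long counter = 0;           /* symmetric variable: same address on every PE */
    long one = 1;

    /* One-sided put: write into the right neighbor's copy of "counter" */
    shmem_long_put(&counter, &one, 1, (me + 1) % npes);

    shmem_barrier_all();               /* complete remote updates before reading */
    printf("PE %d of %d: counter = %ld\n", me, npes, counter);
    return 0;
}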
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based
on communication characteristics
• Benefits:
– Best of Distributed Computing Model
– Best of Shared Memory Computing Model
• Exascale Roadmap*:
– “Hybrid Programming is a practical way to
program exascale systems”
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011,
International Journal of High Performance Computer Applications, ISSN 1094-3420
[Diagram: An HPC application composed of kernels; some kernels stay in MPI while others (e.g. Kernel 2, Kernel N) are re-written in PGAS. See the code sketch below]
HPCAC Spain Conference (Sept '15) 10
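A minimal hybrid MPI + OpenSHMEM sketch (illustrative only; it assumes a unified runtime such as MVAPICH2-X in which MPI and OpenSHMEM can be initialized and used in the same program):

#include <stdio.h>
#include <mpi.h>
#include <shmem.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    start_pes(0);                  /* OpenSHMEM side shares the unified runtime */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Kernel expressed in MPI: collective (two-sided) communication */
    int sum = rank;
    MPI_Allreduce(MPI_IN_PLACE, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Kernel expressed in OpenSHMEM: light-weight one-sided put to PE 0 */
    static int flag = 0;
    int one = 1;
    if (_my_pe() != 0)
        shmem_int_put(&flag, &one, 1, 0);
    shmem_barrier_all();

    if (rank == 0)
        printf("sum = %d, flag = %d\n", sum, flag);

    MPI_Finalize();
    return 0;
}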
Designing Communication Libraries for Multi-
Petaflop and Exaflop Systems: Challenges
Programming Models
MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP,
OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
Application Kernels/Applications
Networking Technologies
(InfiniBand, 40/100GigE,
Aries, and OmniPath)
Multi/Many-core
Architectures
11
Accelerators
(NVIDIA and MIC)
Middleware
HPCAC Spain Conference (Sept '15)
Co-Design
Opportunities
and
Challenges
across Various
Layers
Performance
Scalability
Fault-
Resilience
Communication Library or Runtime for Programming Models
Point-to-point Communication (two-sided and one-sided)
Collective
Communication
Energy-
Awareness
Synchronization
and Locks
I/O and
File Systems
Fault
Tolerance
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided)
• Scalable Collective communication
– Offload
– Non-blocking
– Topology-aware
• Balancing intra-node and inter-node communication for next generation
multi-core (128-1024 cores/node)
– Multiple end-points per node
• Support for efficient multi-threading
• Integrated Support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC,
MPI + OpenSHMEM, CAF, …)
• Virtualization
• Energy-Awareness
Broad Challenges in Designing Communication Libraries for
(MPI+X) at Exascale
12
HPCAC Spain Conference (Sept '15)
• Extreme Low Memory Footprint
– Memory per core continues to decrease
• D-L-A Framework
– Discover
• Overall network topology (fat-tree, 3D, …)
• Network topology for processes for a given job
• Node architecture
• Health of network and node
– Learn
• Impact on performance and scalability
• Potential for failure
– Adapt
• Internal protocols and algorithms
• Process mapping
• Fault-tolerance solutions
– Low overhead techniques while delivering performance, scalability and fault-
tolerance
Additional Challenges for Designing Exascale Software
Libraries
13
HPCAC Spain Conference (Sept '15)
• High Performance open-source MPI Library for InfiniBand, 10-40 GigE/iWARP, and RDMA over
Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Used by more than 2,450 organizations in 76 countries
– More than 286,000 downloads from the OSU site directly
– Empowering many TOP500 clusters (June ‘15 ranking)
• 8th ranked 519,640-core cluster (Stampede) at TACC
• 11th ranked 185,344-core cluster (Pleiades) at NASA
• 22nd ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
– Stampede at TACC (8th in Jun’15, 519,640 cores, 5.168 PFlops)
MVAPICH2 Software
14HPCAC Spain Conference (Sept '15)
HPCAC Spain Conference (Sept '15) 15
MVAPICH2 Software Family
Requirements MVAPICH2 Library to use
MPI with IB, iWARP and RoCE MVAPICH2
Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE MVAPICH2-X
MPI with IB & GPU MVAPICH2-GDR
MPI with IB & MIC MVAPICH2-MIC
HPC Cloud with MPI & IB MVAPICH2-Virt
Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM,
MPI + UPC, CAF, …)
• Virtualization
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Overview of A Few Challenges being Addressed by
MVAPICH2 Project for Exascale
16
HPCAC Spain Conference (Sept '15)
HPCAC Spain Conference (Sept '15) 17
Latency & Bandwidth: MPI over IB with MVAPICH2
[Chart: Small message MPI latency; 1.05, 1.15, 1.19 and 1.26 us across the four platform configurations listed below]
TrueScale-QDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-4-EDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 Back-to-back
[Chart: Unidirectional MPI bandwidth; 3387, 6356, 11497 and 12465 MBytes/sec across the same four configurations]
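A minimal latency micro-benchmark sketch in the spirit of the numbers above (illustrative only, not the OSU micro-benchmark source; the 8-byte message size and iteration count are arbitrary, and the code assumes at least two MPI ranks):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size = 8;                         /* 8-byte messages */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = calloc(size, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {                        /* rank 0 sends, then waits for the echo */
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {                 /* rank 1 echoes each message back */
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)                              /* one-way latency = half the round trip */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

    free(buf);
    MPI_Finalize();
    return 0;
}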
MVAPICH2 Two-Sided Intra-Node Performance
(Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA))
18
Latest MVAPICH2 2.2a, Intel Ivy Bridge
[Chart: Intra-node small-message latency (0-1K bytes); 0.18 us intra-socket, 0.45 us inter-socket]
[Charts: Inter-socket and intra-socket bandwidth with CMA, shared-memory and LiMIC channels; peaks of 14,250 MB/s and 13,749 MB/s]
HPCAC Spain Conference (Sept '15)
HPCAC Spain Conference (Sept '15) 19
Minimizing Memory Footprint further by DC Transport
[Diagram: Four nodes (processes P0-P7) communicating over the IB network using the DC transport]
• Constant connection cost (One QP for any peer)
• Full Feature Set (RDMA, Atomics etc)
• Separate objects for send (DC Initiator) and receive (DC
Target)
– DC Target identified by “DCT Number”
– Messages routed with (DCT Number, LID)
– Requires same “DC Key” to enable communication
• Available with MVAPICH2-X 2.2a
• DCT support available in Mellanox OFED
[Chart: NAMD Apoa1 (large data set) normalized execution time at 160-620 processes; RC, DC-Pool, UD and XRC]
[Chart: Connection memory footprint (KB) for Alltoall at 80-640 processes; RC grows with process count (up to ~97 KB) while DC-Pool and UD stay nearly constant]
H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected
Transport (DCT) of InfiniBand : Early Experiences. IEEE International Supercomputing Conference (ISC ’14).
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM,
MPI + UPC, CAF, …)
• Virtualization
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
20
HPCAC Spain Conference (Sept '15)
HPCAC Spain Conference (Sept '15) 21
Hardware Multicast-aware MPI_Bcast on Stampede
[Chart: MPI_Bcast small-message latency (2-512 bytes) at 102,400 cores; Default vs. Multicast]
ConnectX-3-FDR (54 Gbps): 2.7 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
[Chart: MPI_Bcast large-message latency (2K-128K bytes) at 102,400 cores; Default vs. Multicast]
[Chart: 16-byte MPI_Bcast latency vs. number of nodes; Default vs. Multicast]
[Chart: 32 KByte MPI_Bcast latency vs. number of nodes; Default vs. Multicast]
[Chart: Application run time (s) vs. data size (512-800)]
[Chart: Run time (s) at 64-512 processes; PCG-Default vs. Modified-PCG-Offload]
Co-Design with MPI-3 Non-Blocking Collectives and Collective
Offload Co-Direct Hardware (Available with MVAPICH2-X 2.2a)
22
Modified P3DFFT with Offload-Alltoall does up to
17% better than default version (128 Processes)
K. Kandalla, et. al.. High-Performance and Scalable Non-Blocking
All-to-All with Collective Offload on InfiniBand Clusters: A Study
with Parallel 3D FFT, ISC 2011
HPCAC Spain Conference (Sept '15)
[Chart: Normalized HPL performance vs. problem size (N as % of total memory); HPL-Offload, HPL-1ring and HPL-Host, with up to 4.5% benefit from offload]
Modified HPL with Offload-Bcast does up to 4.5%
better than default version (512 Processes)
Modified Pre-Conjugate Gradient Solver with
Offload-Allreduce does up to 21.8% better than
default version
K. Kandalla, et. al, Designing Non-blocking Broadcast with Collective
Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et. al., Designing Non-blocking Allreduce with Collective
Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient
Solvers, IPDPS ’12
21.8%
Can Network-Offload based Non-Blocking Neighborhood MPI
Collectives Improve Communication Overheads of Irregular Graph
Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne,
L. Oliker, and D. K. Panda, IWPAPS’ 12
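A minimal MPI-3 non-blocking collective sketch in the spirit of the offloaded Allreduce used in the modified PCG solver above (illustrative only; the independent_work callback is a stand-in for application computation and is an assumption of this example, not code from the solver):

#include <mpi.h>

/* Overlap an Allreduce (e.g. a dot product in PCG) with computation that does
   not depend on the reduced value. With CORE-Direct offload the collective can
   progress on the HCA while the host computes. */
void dot_with_overlap(double local_dot, double *global_dot,
                      void (*independent_work)(void), MPI_Comm comm)
{
    MPI_Request req;
    MPI_Iallreduce(&local_dot, global_dot, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    independent_work();                 /* useful work while the reduction progresses */

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* *global_dot is valid from here on */
}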
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• MPI-T Interface
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM,
MPI + UPC, CAF, …)
• Virtualization
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
23
HPCAC Spain Conference (Sept '15)
[Diagram: Data path from GPU memory through the host (CPU) memory and PCIe to the NIC and switch]
At Sender:
cudaMemcpy(s_hostbuf, s_devbuf, . . .);
MPI_Send(s_hostbuf, size, . . .);
At Receiver:
MPI_Recv(r_hostbuf, size, . . .);
cudaMemcpy(r_devbuf, r_hostbuf, . . .);
• Data movement in applications with standard MPI and CUDA interfaces
High Productivity and Low Performance
24HPCAC Spain Conference (Sept '15)
MPI + CUDA - Naive
At Sender:
for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, …);
for (j = 0; j < pipeline_len; j++) {
    cudaError_t result = cudaErrorNotReady;
    while (result != cudaSuccess) {        /* wait until chunk j has reached the host */
        result = cudaStreamQuery(…);
        if (j > 0) MPI_Test(…);            /* progress earlier sends in the meantime */
    }
    MPI_Isend(s_hostbuf + j * blksz, blksz, …);
}
MPI_Waitall(…);
<<Similar at receiver>>
• Pipelining at user level with non-blocking MPI and CUDA interfaces
Low Productivity and High Performance
25
HPCAC Spain Conference (Sept '15)
MPI + CUDA - Advanced
At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(data movement handled inside MVAPICH2)
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
26
HPCAC Spain Conference (Sept '15)
GPU-Aware MPI Library: MVAPICH2-GPU
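A complete minimal sketch of the GPU-aware model above (illustrative only; it assumes a CUDA-aware MPI build such as MVAPICH2-GDR run with MV2_USE_CUDA=1, at least two ranks, and an arbitrary buffer size):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                          /* 1M floats (illustrative size) */
    float *devbuf;
    cudaMalloc((void **)&devbuf, n * sizeof(float));

    /* Device pointers are passed directly to MPI; staging, pipelining or
       GPUDirect RDMA is handled inside the CUDA-aware library. */
    if (rank == 0)
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}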
CUDA-Aware MPI: MVAPICH2 1.8-2.2 Releases
• Support for MPI communication from NVIDIA GPU device
memory
• High performance RDMA-based inter-node point-to-point
communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point
communication for multi-GPU adapters/node (GPU-GPU,
GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in
intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers (see the sketch after this list)
27HPCAC Spain Conference (Sept '15)
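A minimal sketch of datatype-based communication from a GPU buffer (illustrative only; the row-major N x N matrix layout, the column index and the neighbor rank are assumptions of this example; with a CUDA-aware library such as MVAPICH2-GDR the non-contiguous packing of the device buffer happens inside the library):

#include <mpi.h>

/* Send one column of a row-major N x N float matrix that resides in GPU memory. */
void send_column(float *dev_matrix, int N, int col, int neighbor, MPI_Comm comm)
{
    MPI_Datatype column;
    MPI_Type_vector(N, 1, N, MPI_FLOAT, &column);   /* N blocks of 1 float, stride N */
    MPI_Type_commit(&column);

    MPI_Send(dev_matrix + col, 1, column, neighbor, 0, comm);

    MPI_Type_free(&column);
}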
• OFED with support for GPUDirect RDMA is
developed by NVIDIA and Mellanox
• OSU has a design of MVAPICH2 using GPUDirect
RDMA
– Hybrid design using GPU-Direct RDMA
• GPUDirect RDMA and Host-based pipelining
• Alleviates P2P bandwidth bottlenecks on
SandyBridge and IvyBridge
– Support for communication using multi-rail
– Support for Mellanox Connect-IB and ConnectX
VPI adapters
– Support for RoCE with Mellanox ConnectX VPI
adapters
HPCAC Spain Conference (Sept '15) 28
GPU-Direct RDMA (GDR) with CUDA
[Diagram: GPUDirect RDMA path between the IB adapter, system memory and GPU memory through the CPU/chipset.
P2P write: 5.2 GB/s, P2P read: < 1.0 GB/s on SNB E5-2670; P2P write: 6.4 GB/s, P2P read: 3.5 GB/s on IVB E5-2680V2]
29
Performance of MVAPICH2-GDR with GPU-Direct-RDMA
HPCAC Spain Conference (Sept '15)
MVAPICH2-GDR-2.1RC2
Intel Ivy Bridge (E5-2680 v2) node - 20 cores
NVIDIA Tesla K40c GPU
Mellanox Connect-IB Dual-FDR HCA
CUDA 7
Mellanox OFED 2.4 with GPU-Direct-RDMA
[Chart: GPU-GPU internode small-message latency; MV2-GDR2.1RC2, MV2-GDR2.0b and MV2 w/o GDR, with 2.15 usec using GDR and annotated gains of 9.3x and 3.5x]
[Chart: GPU-GPU internode MPI uni-directional bandwidth; annotated gains of 11x and 2x]
[Chart: GPU-GPU internode bi-directional bandwidth; annotated gains of 11x and 2x]
HPCAC Spain Conference (Sept '15) 30
Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HoomdBlue Version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0
MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768
MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1
MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
• MVAPICH2-GDR 2.1RC2 with GDR support can be downloaded from
https://mvapich.cse.ohio-state.edu/download
• System software requirements
• Mellanox OFED 2.1 or later
• NVIDIA Driver 331.20 or later
• NVIDIA CUDA Toolkit 6.0 or later
• Plugin for GPUDirect RDMA
– http://www.mellanox.com/page/products_dyn?product_family=116
– Strongly Recommended : use the new GDRCOPY module from NVIDIA
• https://github.com/NVIDIA/gdrcopy
• Has optimized designs for point-to-point communication using GDR
• Contact MVAPICH help list with any questions related to the package
mvapich-help@cse.ohio-state.edu
Using MVAPICH2-GDR Version
31HPCAC Spain Conference (Sept '15)
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM,
MPI + UPC, CAF, …)
• Virtualization
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
32
HPCAC Spain Conference (Sept '15)
MPI Applications on MIC Clusters
[Diagram: MPI job launch modes on clusters with Xeon hosts and Xeon Phi coprocessors; Host-only, Offload (/reverse offload), Symmetric (MPI programs on both host and coprocessor) and Coprocessor-only]
• Flexibility in launching MPI jobs on clusters with Xeon Phi
33HPCAC Spain Conference (Sept '15)
34
MVAPICH2-MIC 2.0 Design for Clusters with IB and
MIC
• Offload Mode
• Intranode Communication
• Coprocessor-only and Symmetric Mode
• Internode Communication
• Coprocessors-only and Symmetric Mode
• Multi-MIC Node Configurations
• Running on three major systems
• Stampede, Blueridge (Virginia Tech) and Beacon (UTK)
HPCAC Spain Conference (Sept '15)
MIC-Remote-MIC P2P Communication with Proxy-based
Communication
35HPCAC Spain Conference (Sept '15)
[Charts: MIC-remote-MIC intra-socket and inter-socket P2P with proxy-based communication; large-message latency (8K-2M bytes) and bandwidth, with peaks of 5236 and 5594 MB/sec]
36
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
HPCAC Spain Conference (Sept '15)
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance
Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14, May 2014
[Chart: 32-node Allgather (16H + 16M) small-message latency; MV2-MIC vs. MV2-MIC-Opt]
[Chart: 32-node Allgather (8H + 8M) large-message latency; MV2-MIC vs. MV2-MIC-Opt]
[Chart: 32-node Alltoall (8H + 8M) large-message latency; MV2-MIC vs. MV2-MIC-Opt]
[Chart: P3DFFT communication and computation time on 32 nodes (8H + 8M), size 2K x 2K x 1K; annotated improvements of 76%, 58% and 55%]
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM,
MPI + UPC, CAF, …)
• Virtualization
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
37
HPCAC Spain Conference (Sept '15)
MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS
Applications
HPCAC Spain Conference (Sept '15)
[Diagram: MPI, OpenSHMEM, UPC, CAF or Hybrid (MPI + PGAS) applications; MPI, OpenSHMEM, UPC and CAF calls are served by the unified MVAPICH2-X runtime over InfiniBand, RoCE and iWARP]
• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with
MVAPICH2-X 1.9 (2012) onwards!
– http://mvapich.cse.ohio-state.edu
• Feature Highlights
– Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM,
MPI(+OpenMP) + UPC
– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard
compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)
– Scalable Inter-node and intra-node communication – point-to-point and collectives
38
Application Level Performance with Graph500 and Sort
Graph500 Execution Time
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models,
International Supercomputing Conference (ISC’13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l
Conference on Parallel Processing (ICPP '12), September 2012
[Chart: Graph500 execution time at 4K-16K processes; MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM), with annotated gains of 13x and 7.6x]
• Performance of Hybrid (MPI+
OpenSHMEM) Graph500 Design
• 8,192 processes
- 2.4X improvement over MPI-CSR
- 7.6X improvement over MPI-Simple
• 16,384 processes
- 1.5X improvement over MPI-CSR
- 13X improvement over MPI-Simple
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned
Global Address Space Programming Models (PGAS '13), October 2013.
Sort Execution Time
[Chart: Sort execution time for 500GB-512 through 4TB-4K (input data : no. of processes); MPI vs. Hybrid, 51% improvement]
• Performance of Hybrid
(MPI+OpenSHMEM) Sort Application
• 4,096 processes, 4 TB Input Size
- MPI – 2408 sec; 0.16 TB/min
- Hybrid – 1172 sec; 0.36 TB/min
- 51% improvement over MPI-design
39HPCAC Spain Conference (Sept '15)
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM,
MPI + UPC, CAF, …)
• Virtualization
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
40
HPCAC Spain Conference (Sept '15)
• Virtualization has many benefits
– Job migration
– Compaction
• Not very popular in HPC due to overhead associated with
Virtualization
• New SR-IOV (Single Root – IO Virtualization) support
available with Mellanox InfiniBand adapters
• Enhanced MVAPICH2 support for SR-IOV
• Publicly available as MVAPICH2-Virt library
Can HPC and Virtualization be Combined?
41
HPCAC Spain Conference (Sept '15)
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV
based Virtualized InfiniBand Clusters? EuroPar'14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D.K. Panda, High Performance MPI Library over SR-IOV enabled
InfiniBand Clusters, HiPC’14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient
Approach to build HPC Clouds, CCGrid’15
[Chart: Graph500 execution time for problem sizes (20,10)-(22,20); MV2-SR-IOV-Def, MV2-SR-IOV-Opt and MV2-Native, with 4-9% overhead vs. native]
[Chart: NAS Class C (FT, EP, LU, BT on 64 cores) execution time; MV2-SR-IOV-Def, MV2-SR-IOV-Opt and MV2-Native, with 1-9% overhead vs. native]
• Compared to Native, 1-9% overhead for NAS
• Compared to Native, 4-9% overhead for Graph500
HPCAC Spain Conference (Sept '15) 65
Application-Level Performance (8 VM * 8 Core/VM)
NAS Graph500
HPCAC Spain Conference (Sept '15)
NSF Chameleon Cloud: A Powerful and Flexible
Experimental Instrument
• Large-scale instrument
– Targeting Big Data, Big Compute, Big Instrument research
– ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G
network
• Reconfigurable instrument
– Bare metal reconfiguration, operated as single instrument, graduated approach for
ease-of-use
• Connected instrument
– Workload and Trace Archive
– Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
– Partnerships with users
• Complementary instrument
– Complementing GENI, Grid’5000, and other testbeds
• Sustainable instrument
– Industry connections http://www.chameleoncloud.org/
43
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM,
MPI + UPC, CAF, …)
• Virtualization
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
44
HPCAC Spain Conference (Sept '15)
• MVAPICH2-EA (Energy-Aware)
• A white-box approach
• New Energy-Efficient communication protocols for pt-pt and
collective operations
• Intelligently apply the appropriate Energy saving techniques
• Application oblivious energy saving
• OEMT
• A library utility to measure energy consumption for MPI applications
• Works with all MPI runtimes
• PRELOAD option for precompiled applications
• Does not require ROOT permission:
• A safe kernel module to read only a subset of MSRs
45
Energy-Aware MVAPICH2 & OSU Energy Management Tool
(OEMT)
HPCAC Spain Conference (Sept '15)
• An energy efficient
runtime that provides
energy savings without
application knowledge
• Uses automatically and
transparently the best
energy lever
• Provides guarantees on
maximum degradation
with 5-41% savings at <=
5% degradation
• Pessimistic MPI applies
energy reduction lever to
each MPI call
46
MV2-EA : Application Oblivious Energy-Aware-MPI (EAM)
A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent,
D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing ‘15, Nov 2015 [Best Student Paper Finalist]
HPCAC Spain Conference (Sept '15)
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided RMA)
– Extremely minimal memory footprint
• Collective communication
– Hardware-multicast-based
– Offload and Non-blocking
• Integrated Support for GPGPUs
• Integrated Support for Intel MICs
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM,
MPI + UPC, CAF, …)
• Virtualization
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
47
HPCAC Spain Conference (Sept '15)
HPCAC Spain Conference (Sept '15) 48
OSU InfiniBand Network Analysis and Monitoring Tool
(INAM) – Network Level View
• Show network topology of large clusters
• Visualize traffic pattern on different links
• Quickly identify congested links/links in error state
• See the history unfold – play back historical state of the network
Full Network (152 nodes) Zoomed-in View of the Network
HPCAC Spain Conference (Sept '15) 49
Upcoming OSU INAM Tool – Job and Node Level Views
Visualizing a Job (5 Nodes) Finding Routes Between Nodes
• Job level view
• Show different network metrics (load, error, etc.) for any live job
• Play back historical data for completed jobs to identify bottlenecks
• Node level view provides details per process or per node
• CPU utilization for each rank/node
• Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
• Network metrics (e.g. XmitDiscard, RcvError) per rank/node
• Performance and Memory scalability toward 500K-1M cores
– Dynamically Connected Transport (DCT) service with Connect-IB
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF …)
• Enhanced Optimization for GPU Support and Accelerators
• Taking advantage of advanced features
– User Mode Memory Registration (UMR)
– On-demand Paging
• Enhanced Inter-node and Intra-node communication schemes for
upcoming OmniPath enabled Knights Landing architectures
• Extended RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Energy-aware point-to-point (one-sided and two-sided) and collectives
• Extended Support for MPI Tools Interface (as in MPI 3.0)
• Extended Checkpoint-Restart and migration support with SCR
MVAPICH2 – Plans for Exascale
50HPCAC Spain Conference (Sept '15)
• Scientific Computing
– Message Passing Interface (MPI), including MPI + OpenMP, is the
Dominant Programming Model
– Many discussions towards Partitioned Global Address Space
(PGAS)
• UPC, OpenSHMEM, CAF, etc.
– Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
– Focuses on large data and data analysis
– Hadoop (HDFS, HBase, MapReduce)
– Spark is emerging for in-memory computing
– Memcached is also used for Web 2.0
51
Two Major Categories of Applications
HPCAC Spain Conference (Sept '15)
Can High-Performance Interconnects Benefit Big Data
Computing?
• Most of the current Big Data systems use Ethernet
Infrastructure with Sockets
• Concerns for performance and scalability
• Usage of High-Performance Networks is beginning to draw
interest
• What are the challenges?
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs (similar
to the designs adopted for MPI)?
• Can HPC Clusters with High-Performance networks be used
for Big Data applications using Hadoop and Memcached?
HPCAC Spain Conference (Sept '15) 52
Big Data Middleware
(HDFS, MapReduce, HBase, Spark and Memcached)
Networking Technologies
(InfiniBand, 1/10/40GigE
and Intelligent NICs)
Storage Technologies
(HDD and SSD)
Programming Models
(Sockets)
Applications
Commodity Computing System
Architectures
(Multi- and Many-core
architectures and accelerators)
Other Protocols?
Communication and I/O Library
Point-to-Point
Communication
QoS
Threaded Models
and Synchronization
Fault-Tolerance
I/O and File Systems
Virtualization
Benchmarks
RDMA Protocol
Designing Communication and I/O Libraries for Big
Data Systems: Solved a Few Initial Challenges
HPCAC Spain Conference (Sept '15)
Upper level
Changes?
53
Design Overview of HDFS with RDMA
• Enables high performance RDMA communication, while supporting traditional socket
interface
• JNI Layer bridges Java based HDFS with communication library written in native code
[Diagram: HDFS write path; applications use the Java socket interface over 1/10 GigE or IPoIB, or go through JNI to the OSU verbs-based design over RDMA-capable networks (IB, 10GE/iWARP, RoCE)]
• Design Features
– RDMA-based HDFS
write
– RDMA-based HDFS
replication
– Parallel replication
support
– On-demand connection
setup
– InfiniBand/RoCE
support
HPCAC Spain Conference (Sept '15) 54
Communication Times in HDFS
• Cluster with HDD DataNodes
– 30% improvement in communication time over IPoIB (QDR)
– 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes
HPCAC Spain Conference (Sept '15)
[Chart: HDFS communication time (s) for 2-10 GB files; 10GigE, IPoIB (QDR) and OSU-IB (QDR), reduced by 30% with OSU-IB]
55
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda ,
High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in
RDMA-Enhanced HDFS, HPDC '14, June 2014
Triple-H
Heterogeneous Storage
• Design Features
– Three modes
• Default (HHH)
• In-Memory (HHH-M)
• Lustre-Integrated (HHH-L)
– Policies to efficiently utilize the
heterogeneous storage devices
• RAM, SSD, HDD, Lustre
– Eviction/Promotion based on
data usage pattern
– Hybrid Replication
– Lustre-Integrated mode:
• Lustre-based fault-tolerance
Enhanced HDFS with In-memory and Heterogeneous
Storage
[Diagram: Triple-H data placement; hybrid replication and eviction/promotion policies across RAM disk, SSD, HDD and Lustre]
N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS
on HPC Clusters with Heterogeneous Storage Architecture, CCGrid ’15, May 2015
HPCAC Spain Conference (Sept '15) 56
• For 160GB TestDFSIO in 32 nodes
– Write Throughput: 7x improvement
over IPoIB (FDR)
– Read Throughput: 2x improvement
over IPoIB (FDR)
Performance Improvement on TACC Stampede (HHH)
• For 120GB RandomWriter in 32
nodes
– 3x improvement over IPoIB (QDR)
[Chart: TestDFSIO total throughput (MBps) for write and read; IPoIB (FDR) vs. OSU-IB (FDR), increased by 7x (write) and 2x (read)]
[Chart: RandomWriter execution time (s) for 8:30, 16:60 and 32:120 (cluster size : data size); IPoIB (FDR) vs. OSU-IB (FDR), reduced by 3x]
HPCAC Spain Conference (Sept '15)
57
• For 200GB TeraGen on 32 nodes
– Spark-TeraGen: HHH has 2.1x improvement over HDFS-IPoIB (QDR)
– Spark-TeraSort: HHH has 16% improvement over HDFS-IPoIB (QDR)
58
Evaluation with Spark on SDSC Gordon (HHH)
[Chart: TeraGen execution time (s) for 8:50, 16:100 and 32:200 (cluster size : data size in GB); IPoIB (QDR) vs. OSU-IB (QDR), reduced by 2.1x]
[Chart: TeraSort execution time (s) for the same configurations; reduced by 16%]
HPCAC Spain Conference (Sept '15)
N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-
Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData ’15, October 2015
Design Overview of MapReduce with RDMA
[Diagram: MapReduce (JobTracker, TaskTracker, Map and Reduce tasks); applications use the Java socket interface over 1/10 GigE or IPoIB, or go through JNI to the OSU verbs-based design over RDMA-capable networks (IB, 10GE/iWARP, RoCE)]
HPCAC Spain Conference (Sept '15)
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based MapReduce with communication library written in native code
• Design Features
– RDMA-based shuffle
– Prefetching and caching map
output
– Efficient Shuffle Algorithms
– In-memory merge
– On-demand Shuffle
Adjustment
– Advanced overlapping
• map, shuffle, and merge
• shuffle, merge, and reduce
– On-demand connection setup
– InfiniBand/RoCE support
59
• For 240GB Sort in 64 nodes (512
cores)
– 40% improvement over IPoIB (QDR)
with HDD used for HDFS
Performance Evaluation of Sort and TeraSort
HPCAC Spain Conference (Sept '15)
[Chart: Sort job execution time in the OSU cluster; 60/120/240 GB on 16/32/64 nodes, IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR)]
[Chart: TeraSort job execution time on TACC Stampede; 80/160/320 GB on 16/32/64 nodes, IPoIB (FDR), UDA-IB (FDR) and OSU-IB (FDR)]
• For 320GB TeraSort in 64 nodes
(1K cores)
– 38% improvement over IPoIB
(FDR) with HDD used for HDFS
60
Intermediate Data Directory
Design Overview of Shuffle Strategies for MapReduce over
Lustre
HPCAC Spain Conference (Sept '15)
• Design Features
– Two shuffle approaches
• Lustre read based shuffle
• RDMA based shuffle
– Hybrid shuffle algorithm to
take benefit from both shuffle
approaches
– Dynamically adapts to the
better shuffle approach for
each shuffle request based on
profiling values for each Lustre
read operation
– In-memory merge and
overlapping of different phases
are kept similar to RDMA-
enhanced MapReduce design
[Diagram: Map tasks write intermediate data to Lustre; reduce tasks fetch it via Lustre read or RDMA and then perform in-memory merge/sort and reduce]
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN
MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015.
61
• For 500GB Sort in 64 nodes
– 44% improvement over IPoIB (FDR)
Performance Improvement of MapReduce over Lustre on
TACC-Stampede
HPCAC Spain Conference (Sept '15)
• For 640GB Sort in 128 nodes
– 48% improvement over IPoIB (FDR)
[Chart: Job execution time (s) for 300-500 GB Sort on 64 nodes; IPoIB (FDR) vs. OSU-IB (FDR), reduced by 44% at 500 GB]
[Chart: Job execution time (s) for 20-640 GB Sort on 4-128 nodes; IPoIB (FDR) vs. OSU-IB (FDR), reduced by 48% at 640 GB]
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-
based Approach Benefit?, Euro-Par, August 2014.
• Local disk is used as the intermediate data directory
62
Design Overview of Spark with RDMA
• Design Features
– RDMA based shuffle
– SEDA-based plugins
– Dynamic connection
management and sharing
– Non-blocking and out-of-
order data transfer
– Off-JVM-heap buffer
management
– InfiniBand/RoCE support
HPCAC Spain Conference (Sept '15)
• Enables high performance RDMA communication, while supporting traditional socket
interface
• JNI Layer bridges Scala based Spark with communication library written in native code
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data
Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014
63
Performance Evaluation on TACC Stampede - SortByTest
• Intel SandyBridge + FDR, 16 Worker Nodes, 256 Cores, (256M 256R)
• RDMA-based design for Spark 1.4.0
• RDMA vs. IPoIB with 256 concurrent tasks, single disk per node and
RamDisk. For SortByKey Test:
– Shuffle time reduced by up to 77% over IPoIB (56Gbps)
– Total time reduced by up to 58% over IPoIB (56Gbps)
16 Worker Nodes, SortByTest Shuffle Time / 16 Worker Nodes, SortByTest Total Time
HPCAC Spain Conference (Sept '15) 64
[Chart: Shuffle time (sec) for 16-64 GB; IPoIB vs. RDMA, up to 77% reduction]
[Chart: Total time (sec) for 16-64 GB; IPoIB vs. RDMA, up to 58% reduction]
Memcached-RDMA Design
• Server and client perform a negotiation protocol
– Master thread assigns clients to appropriate worker thread
– Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread
• Memcached Server can serve both socket and verbs clients simultaneously
– All other Memcached data structures are shared among RDMA and Sockets worker threads
• Memcached applications need not be modified; uses verbs interface if available
• High performance design of SSD-Assisted Hybrid Memory
[Diagram: Memcached-RDMA server; a master thread assigns sockets clients to sockets worker threads and verbs (RDMA) clients to verbs worker threads, and all worker threads share the Memcached data structures (slabs/items) kept in memory and on SSD]
65HPCAC Spain Conference (Sept '15)
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda,
Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11
J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for
InfiniBand Clusters using Hybrid Transport, CCGrid’12
D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads
with Memcached and MySQL, ISPASS’15
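A minimal client sketch illustrating that application code stays unchanged (illustrative only; it uses the standard libmemcached C API, the server name and port are placeholders, and the verbs transport is picked up underneath when the client is linked against the RDMA-enabled Memcached package):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libmemcached/memcached.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);

    /* "server1" and 11211 are placeholders for the Memcached-RDMA server */
    memcached_server_st *servers = memcached_server_list_append(NULL, "server1", 11211, &rc);
    memcached_server_push(memc, servers);

    const char *key = "hpc", *value = "rdma";
    memcached_set(memc, key, strlen(key), value, strlen(value), 0, 0);          /* store */

    size_t len; uint32_t flags;
    char *fetched = memcached_get(memc, key, strlen(key), &len, &flags, &rc);   /* fetch */
    if (fetched) {
        printf("%s -> %.*s\n", key, (int)len, fetched);
        free(fetched);
    }

    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}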
• ohb_memlat & ohb_memthr latency & throughput micro-benchmarks
• Memcached-RDMA can
- improve query latency by up to 70% over IPoIB (32Gbps)
- improve throughput by up to 2X over IPoIB (32Gbps)
- No overhead in using hybrid mode with SSD when all data can fit in memory
Performance Benefits on SDSC-Gordon – OHB Latency &
Throughput Micro-Benchmarks
[Chart: Memcached throughput (million trans/sec) for 64-1024 clients; IPoIB (32Gbps), RDMA-Mem (32Gbps) and RDMA-Hybrid (32Gbps), up to 2x over IPoIB]
[Chart: Average query latency (us) vs. message size; IPoIB (32Gbps), RDMA-Mem (32Gbps) and RDMA-Hyb (32Gbps)]
66
D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, Benchmarking Key-Value Stores on High-Performance
Storage and Interconnects for Web-Scale Workloads, IEEE BigData ‘15.
HPCAC Spain Conference (Sept '15)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
– HDFS and Memcached Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User Base: 125 organizations from 20 countries
• More than 13,000 downloads from the project site
• RDMA for Apache HBase and Spark
The High-Performance Big Data (HiBD) Project
HPCAC Spain Conference (Sept '15) 67
• Upcoming Releases of RDMA-enhanced Packages will support
– Spark
– HBase
• Upcoming Releases of OSU HiBD Micro-Benchmarks (OHB) will
support
– MapReduce
– RPC
• Advanced designs with upper-level changes and optimizations
• E.g. MEM-HDFS
HPCAC Spain Conference (Sept '15)
Future Plans of OSU High Performance Big Data (HiBD)
Project
68
• Exascale systems will be constrained by
– Power
– Memory per core
– Data movement cost
– Faults
• Programming Models and Runtimes for HPC and
BigData need to be designed for
– Scalability
– Performance
– Fault-resilience
– Energy-awareness
– Programmability
– Productivity
• Highlighted some of the issues and challenges
• Need continuous innovation on all these fronts
Looking into the Future ….
69
HPCAC Spain Conference (Sept '15)
HPCAC Spain Conference (Sept '15)
Funding Acknowledgments
Funding Support by
Equipment Support by
70
Personnel Acknowledgments
Current Students
– A. Augustine (M.S.)
– A. Awan (Ph.D.)
– A. Bhat (M.S.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– N. Islam (Ph.D.)
Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– S. Sur
Current Post-Doc
– J. Lin
– D. Shankar
Current Programmer
– J. Perkins
Past Post-Docs
– H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekar (Ph.D.)
– K. Kulkarni (M.S.)
– M. Li (Ph.D.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini
– S. Marcarelli
– J. Vienne
Current Senior Research Associates
– K. Hamidouche
– X. Lu
Past Programmers
– D. Bureddy
– H. Subramoni
Current Research Specialist
– M. Arnold
HPCAC Spain Conference (Sept '15) 71
panda@cse.ohio-state.edu
HPCAC Spain Conference (Sept '15)
Thank You!
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/
72
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/
73
Call For Participation
International Workshop on Extreme Scale
Programming Models and Middleware
(ESPM2 2015)
In conjunction with Supercomputing Conference
(SC 2015)
Austin, USA, November 15th, 2015
http://web.cse.ohio-state.edu/~hamidouc/ESPM2/
HPCAC Spain Conference (Sept '15)
• Looking for Bright and Enthusiastic Personnel to join as
– Post-Doctoral Researchers
– PhD Students
– Hadoop/Big Data Programmer/Software Engineer
– MPI Programmer/Software Engineer
• If interested, please contact me at this conference and/or send
an e-mail to panda@cse.ohio-state.edu
74
Multiple Positions Available in My Group
HPCAC Spain Conference (Sept '15)
Machine Learning for Weather Forecasts
inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
inside-BigData.com
 

Plus de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 

Dernier

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Dernier (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Communication Frameworks for HPC and Big Data

  • 1. Communication Frameworks for HPC and Big Data Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Talk at HPC Advisory Council Spain Conference (2015) by
  • 2. High-End Computing (HEC): PetaFlop to ExaFlop HPCAC Spain Conference (Sept '15) 2 100-200 PFlops in 2016-2018 1 EFlops in 2020-2024?
  • 3. Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org) HPCAC Spain Conference (Sept '15) 3 [chart: Number of Clusters and Percentage of Clusters in the Top500 over time; clusters now account for 87% of the list]
  • 4. Drivers of Modern HPC Cluster Architectures • Multi-core processors are ubiquitous • InfiniBand very popular in HPC clusters• • Accelerators/Coprocessors becoming common in high-end systems • Pushing the envelope for Exascale computing Accelerators / Coprocessors high compute density, high performance/watt >1 TFlop DP on a chip High Performance Interconnects - InfiniBand <1usec latency, >100Gbps Bandwidth Tianhe – 2 Titan Stampede Tianhe – 1A 4 Multi-core Processors HPCAC Spain Conference (Sept '15)
  • 5. • 259 IB Clusters (51%) in the June 2015 Top500 list (http://www.top500.org) • Installations in the Top 50 (24 systems): Large-scale InfiniBand Installations 519,640 cores (Stampede) at TACC (8th) 76,032 cores (Tsubame 2.5) at Japan/GSIC (22nd) 185,344 cores (Pleiades) at NASA/Ames (11th) 194,616 cores (Cascade) at PNNL (25th) 72,800 cores Cray CS-Storm in US (13th) 76,032 cores (Makman-2) at Saudi Aramco (28th) 72,800 cores Cray CS-Storm in US (14th) 110,400 cores (Pangea) in France (29th) 265,440 cores SGI ICE at Tulip Trading Australia (15th) 37,120 cores (Lomonosov-2) at Russia/MSU (31st) 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (16th) 57,600 cores (SwiftLucy) in US (33rd) 72,000 cores (HPC2) in Italy (17th) 50,544 cores (Occigen) at France/GENCI-CINES (36th) 115,668 cores (Thunder) at AFRL/USA (19th) 76,896 cores (Salomon) SGI ICE in Czech Republic (40th) 147,456 cores (SuperMUC) in Germany (20th) 73,584 cores (Spirit) at AFRL/USA (42nd) 86,016 cores (SuperMUC Phase 2) in Germany (21st) and many more! 5HPCAC Spain Conference (Sept '15)
  • 6. • Scientific Computing – Message Passing Interface (MPI), including MPI + OpenMP, is the Dominant Programming Model – Many discussions towards Partitioned Global Address Space (PGAS) • UPC, OpenSHMEM, CAF, etc. – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC) • Big Data/Enterprise/Commercial Computing – Focuses on large data and data analysis – Hadoop (HDFS, HBase, MapReduce) – Spark is emerging for in-memory computing – Memcached is also used for Web 2.0 6 Two Major Categories of Applications HPCAC Spain Conference (Sept '15)
  • 7. Towards Exascale System (Today and Target) Systems 2015 Tianhe-2 2020-2024 Difference Today & Exascale System peak 55 PFlop/s 1 EFlop/s ~20x Power 18 MW (3 Gflops/W) ~20 MW (50 Gflops/W) O(1) ~15x System memory 1.4 PB (1.024PB CPU + 0.384PB CoP) 32 – 64 PB ~50X Node performance 3.43TF/s (0.4 CPU + 3 CoP) 1.2 or 15 TF O(1) Node concurrency 24 core CPU + 171 cores CoP O(1k) or O(10k) ~5x - ~50x Total node interconnect BW 6.36 GB/s 200 – 400 GB/s ~40x -~60x System size (nodes) 16,000 O(100,000) or O(1M) ~6x - ~60x Total concurrency 3.12M 12.48M threads (4 /core) O(billion) for latency hiding ~100x MTTI Few/day Many/day O(?) Courtesy: Prof. Jack Dongarra 7HPCAC Spain Conference (Sept '15)
  • 8. HPCAC Spain Conference (Sept '15) 8 Parallel Programming Models Overview P1 P2 P3 Shared Memory P1 P2 P3 Memory Memory Memory P1 P2 P3 Memory Memory Memory Logical shared memory Shared Memory Model SHMEM, DSM Distributed Memory Model MPI (Message Passing Interface) Partitioned Global Address Space (PGAS) Global Arrays, UPC, Chapel, X10, CAF, … • Programming models provide abstract machine models • Models can be mapped on different types of systems – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
  • 9. 9 Partitioned Global Address Space (PGAS) Models HPCAC Spain Conference (Sept '15) • Key features - Simple shared memory abstractions - Light weight one-sided communication - Easier to express irregular communication • Different approaches to PGAS - Languages • Unified Parallel C (UPC) • Co-Array Fortran (CAF) • X10 • Chapel - Libraries • OpenSHMEM • Global Arrays
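
To make the "light weight one-sided communication" point above concrete, here is a minimal OpenSHMEM sketch (not taken from the slides): each PE writes its rank into a symmetric variable on its right neighbour with a single put, and no matching receive is posted on the target. It should build with any OpenSHMEM 1.0 library, including the one bundled with MVAPICH2-X; the variable names and the neighbour pattern are purely illustrative.

```c
#include <stdio.h>
#include <shmem.h>

int dest;                         /* symmetric: same address on every PE */

int main(void)
{
    start_pes(0);                 /* OpenSHMEM 1.0-style initialization */
    int me   = _my_pe();
    int npes = _num_pes();

    int src = me;
    /* One-sided put: write my rank into 'dest' on the next PE;
       the target PE does not post a receive */
    shmem_int_put(&dest, &src, 1, (me + 1) % npes);

    shmem_barrier_all();          /* complete outstanding puts and synchronize */
    printf("PE %d of %d received %d\n", me, npes, dest);
    return 0;
}
```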
  • 10. Hybrid (MPI+PGAS) Programming • Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics • Benefits: – Best of Distributed Computing Model – Best of Shared Memory Computing Model • Exascale Roadmap*: – “Hybrid Programming is a practical way to program exascale systems” * The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420 Kernel 1 MPI Kernel 2 MPI Kernel 3 MPI Kernel N MPI HPC Application Kernel 2 PGAS Kernel N PGAS HPCAC Spain Conference (Sept '15) 10
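
As a rough illustration of the hybrid MPI + PGAS idea (a sketch, not code from the talk), the fragment below mixes an OpenSHMEM one-sided phase with an MPI collective in one program, the way a re-written kernel might. It assumes a unified runtime such as MVAPICH2-X that allows both models in the same job; the initialization order and the toy counter example are assumptions made for illustration.

```c
#include <stdio.h>
#include <mpi.h>
#include <shmem.h>

int counter = 0;          /* symmetric variable, target of one-sided updates */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    start_pes(0);         /* unified runtime (e.g. MVAPICH2-X) lets both coexist */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Irregular one-sided phase: every PE atomically increments a counter
       on PE 0; no matching receive is needed on the target */
    shmem_int_add(&counter, 1, 0);
    shmem_barrier_all();

    /* Regular two-sided phase: an MPI collective over the same processes */
    int total = 0;
    MPI_Allreduce(&counter, &total, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    if (rank == 0)
        printf("counter on PE 0 = %d (expected %d)\n", total, size);

    MPI_Finalize();
    return 0;
}
```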
  • 11. Designing Communication Libraries for Multi- Petaflop and Exaflop Systems: Challenges Programming Models MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc. Application Kernels/Applications Networking Technologies (InfiniBand, 40/100GigE, Aries, and OmniPath) Multi/Many-core Architectures 11 Accelerators (NVIDIA and MIC) Middleware HPCAC Spain Conference (Sept '15) Co-Design Opportunities and Challenges across Various Layers Performance Scalability Fault- Resilience Communication Library or Runtime for Programming Models Point-to-point Communication (two-sided and one-sided Collective Communication Energy- Awareness Synchronization and Locks I/O and File Systems Fault Tolerance
  • 12. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) • Scalable Collective communication – Offload – Non-blocking – Topology-aware • Balancing intra-node and inter-node communication for next generation multi-core (128-1024 cores/node) – Multiple end-points per node • Support for efficient multi-threading • Integrated Support for GPGPUs and Accelerators • Fault-tolerance/resiliency • QoS support for communication and I/O • Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, …) • Virtualization • Energy-Awareness Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale 12 HPCAC Spain Conference (Sept '15)
  • 13. • Extreme Low Memory Footprint – Memory per core continues to decrease • D-L-A Framework – Discover • Overall network topology (fat-tree, 3D, …) • Network topology for processes for a given job • Node architecture • Health of network and node – Learn • Impact on performance and scalability • Potential for failure – Adapt • Internal protocols and algorithms • Process mapping • Fault-tolerance solutions – Low overhead techniques while delivering performance, scalability and fault- tolerance Additional Challenges for Designing Exascale Software Libraries 13 HPCAC Spain Conference (Sept '15)
  • 14. • High Performance open-source MPI Library for InfiniBand, 10-40Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE) – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002 – MVAPICH2-X (MPI + PGAS), Available since 2011 – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014 – Support for Virtualization (MVAPICH2-Virt), Available since 2015 – Support for Energy-Awareness (MVAPICH2-EA), Available since 2015 – Used by more than 2,450 organizations in 76 countries – More than 286,000 downloads from the OSU site directly – Empowering many TOP500 clusters (June ‘15 ranking) • 8th ranked 519,640-core cluster (Stampede) at TACC • 11th ranked 185,344-core cluster (Pleiades) at NASA • 22nd ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others – Available with software stacks of many vendors and Linux Distros (RedHat and SuSE) – http://mvapich.cse.ohio-state.edu • Empowering Top500 systems for over a decade – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> – Stampede at TACC (8th in Jun’15, 519,640 cores, 5.168 Plops) MVAPICH2 Software 14HPCAC Spain Conference (Sept '15)
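
For readers new to the stack, a trivial MPI program built and launched with the MVAPICH2 toolchain looks like the sketch below (the host file and process count are hypothetical; mpirun_rsh is MVAPICH2's launcher, and mpiexec or a batch scheduler work as well).

```c
/* hello_mv2.c -- build with MVAPICH2's compiler wrapper and launch with its
 * job launcher (host file and process count below are hypothetical):
 *   mpicc -o hello_mv2 hello_mv2.c
 *   mpirun_rsh -np 4 -hostfile hosts ./hello_mv2
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```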
  • 15. HPCAC Spain Conference (Sept '15) 15 MVAPICH2 Software Family Requirements MVAPICH2 Library to use MPI with IB, iWARP and RoCE MVAPICH2 Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE MVAPICH2-X MPI with IB & GPU MVAPICH2-GDR MPI with IB & MIC MVAPICH2-MIC HPC Cloud with MPI & IB MVAPICH2-Virt Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA
  • 16. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimal memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2 Project for Exascale 16 HPCAC Spain Conference (Sept '15)
  • 17. HPCAC Spain Conference (Sept '15) 17 Latency & Bandwidth: MPI over IB with MVAPICH2 [charts: small-message latencies of 1.05, 1.15, 1.19 and 1.26 us and unidirectional bandwidths of 3387, 6356, 11497 and 12465 MBytes/sec across TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR and ConnectX-4-EDR HCAs; 2.8 GHz deca-core (IvyBridge) Intel hosts, PCI Gen3, IB switch (ConnectX-4-EDR measured back-to-back)]
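
Latency numbers like those above are typically collected with a ping-pong micro-benchmark in the spirit of the OSU micro-benchmarks (osu_latency). The sketch below is a simplified stand-in, not the actual benchmark code: rank 0 and rank 1 bounce a small message back and forth and report the average one-way time after a warm-up phase.

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define ITERS 10000
#define SKIP  1000          /* warm-up iterations excluded from timing */

int main(int argc, char **argv)
{
    int rank, size = 8;     /* message size in bytes (illustrative) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = calloc(size, 1);

    double t0 = 0.0;
    for (int i = 0; i < ITERS + SKIP; i++) {
        if (i == SKIP) t0 = MPI_Wtime();        /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0) {
        double one_way_us = (MPI_Wtime() - t0) * 1e6 / (2.0 * ITERS);
        printf("%d bytes: %.2f us one-way latency\n", size, one_way_us);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```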
  • 18. MVAPICH2 Two-Sided Intra-Node Performance (Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA)) HPCAC Spain Conference (Sept '15) 18 [charts: latest MVAPICH2 2.2a on Intel Ivy Bridge; small-message latency of 0.18 us intra-socket and 0.45 us inter-socket; intra-socket and inter-socket bandwidth with CMA, Shmem and LiMIC, peaking at 14,250 MB/s and 13,749 MB/s]
  • 19. HPCAC Spain Conference (Sept '15) 19 Minimizing Memory Footprint further by DC Transport Node0 P1 P0 Node 1 P3 P2 Node 3 P7 P6 Node2 P5 P4 IB Network • Constant connection cost (One QP for any peer) • Full Feature Set (RDMA, Atomics etc) • Separate objects for send (DC Initiator) and receive (DC Target) – DC Target identified by “DCT Number” – Messages routed with (DCT Number, LID) – Requires same “DC Key” to enable communication • Available with MVAPICH2-X 2.2a • DCT support available in Mellanox OFED 0 0.5 1 160 320 620 NormalizedExecution Time Number of Processes NAMD - Apoa1: Large data set RC DC-Pool UD XRC 10 22 47 97 1 1 1 2 10 10 10 10 1 1 3 5 1 10 100 80 160 320 640 ConnectionMemory(KB) Number of Processes Memory Footprint for Alltoall RC DC-Pool UD XRC H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand : Early Experiences. IEEE International Supercomputing Conference (ISC ’14).
  • 20. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 20 HPCAC Spain Conference (Sept '15)
  • 21. HPCAC Spain Conference (Sept '15) 21 Hardware Multicast-aware MPI_Bcast on Stampede [charts: MPI_Bcast latency, Default vs. Multicast, for small and large messages at 102,400 cores and for 16-Byte and 32-KByte messages across node counts] ConnectX-3-FDR (54 Gbps): 2.7 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
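
One point worth making explicit is that the multicast optimization is transparent to applications: the code issues an ordinary MPI_Bcast, and the library decides at run time whether to use IB hardware multicast (controlled by runtime parameters such as MV2_USE_MCAST; consult the MVAPICH2 user guide for the exact settings of your version). A minimal sketch, not taken from the slides:

```c
#include <stdio.h>
#include <mpi.h>

/* The application does not change to benefit from hardware multicast:
 * it is an ordinary MPI_Bcast.  With MVAPICH2 the multicast-based
 * algorithm is selected at run time (e.g. MV2_USE_MCAST=1 in the
 * environment; check the user guide for your version). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char msg[16] = "hello";            /* small message, as on the slide */
    MPI_Bcast(msg, sizeof(msg), MPI_CHAR, 0, MPI_COMM_WORLD);

    printf("rank %d got \"%s\"\n", rank, msg);
    MPI_Finalize();
    return 0;
}
```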
  • 22. Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available with MVAPICH2-X 2.2a) HPCAC Spain Conference (Sept '15) 22 [charts: Pre-Conjugate Gradient run-time (PCG-Default vs. Modified-PCG-Offload) and HPL normalized performance (HPL-Offload, HPL-1ring, HPL-Host) vs. problem size] Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes): K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011. Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes): K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011. Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version: K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12. Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
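
The application-side pattern behind these offload results is the MPI-3 non-blocking collective: start the collective, compute something independent, then wait. The sketch below shows the pattern with MPI_Ialltoall (buffer sizes are illustrative; the "independent computation" is a dummy loop standing in for application work such as a local FFT stage).

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int count = 1024;                                /* doubles per peer */
    double *sendbuf = malloc(sizeof(double) * count * size);
    double *recvbuf = malloc(sizeof(double) * count * size);
    for (int i = 0; i < count * size; i++) sendbuf[i] = rank;

    MPI_Request req;
    /* MPI-3 non-blocking collective: returns immediately; with collective
       offload the transfer progresses on the HCA while the host computes */
    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                  recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    /* independent computation overlapped with the all-to-all */
    double acc = 0.0;
    for (int i = 0; i < count * size; i++) acc += sendbuf[i] * 0.5;

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete before touching recvbuf */
    if (rank == 0) printf("overlap done, acc=%f\n", acc);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```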
  • 23. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • MPI-T Interface • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 23 HPCAC Spain Conference (Sept '15)
  • 24. PCIe GPU CPU NIC Switch At Sender: cudaMemcpy(s_hostbuf, s_devbuf, . . .); MPI_Send(s_hostbuf, size, . . .); At Receiver: MPI_Recv(r_hostbuf, size, . . .); cudaMemcpy(r_devbuf, r_hostbuf, . . .); • Data movement in applications with standard MPI and CUDA interfaces High Productivity and Low Performance 24HPCAC Spain Conference (Sept '15) MPI + CUDA - Naive
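
Expanded into compilable form, the naive pattern on this slide looks roughly like the functions below (buffer names follow the slide; the message size, peer, and tag are illustrative). Every message pays for a blocking device-to-host or host-to-device copy plus the MPI transfer, with no overlap between the two.

```c
/* Expanded sketch of the "naive" MPI + CUDA pattern on this slide: the
 * application stages GPU data through host buffers itself. */
#include <cuda_runtime.h>
#include <mpi.h>

void naive_send(const void *s_devbuf, void *s_hostbuf, size_t size,
                int peer, MPI_Comm comm)
{
    /* 1. copy device -> host (blocking) */
    cudaMemcpy(s_hostbuf, s_devbuf, size, cudaMemcpyDeviceToHost);
    /* 2. send the host copy */
    MPI_Send(s_hostbuf, (int)size, MPI_BYTE, peer, 0, comm);
}

void naive_recv(void *r_devbuf, void *r_hostbuf, size_t size,
                int peer, MPI_Comm comm)
{
    /* 1. receive into the host buffer */
    MPI_Recv(r_hostbuf, (int)size, MPI_BYTE, peer, 0, comm, MPI_STATUS_IGNORE);
    /* 2. copy host -> device (blocking) */
    cudaMemcpy(r_devbuf, r_hostbuf, size, cudaMemcpyHostToDevice);
}
```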
  • 25. PCIe GPU CPU NIC Switch At Sender: for (j = 0; j < pipeline_len; j++) cudaMemcpyAsync(s_hostbuf + j * blk, s_devbuf + j * blksz, …); for (j = 0; j < pipeline_len; j++) { while (result != cudaSucess) { result = cudaStreamQuery(…); if(j > 0) MPI_Test(…); } MPI_Isend(s_hostbuf + j * block_sz, blksz . . .); } MPI_Waitall(); <<Similar at receiver>> • Pipelining at user level with non-blocking MPI and CUDA interfaces Low Productivity and High Performance 25 HPCAC Spain Conference (Sept '15) MPI + CUDA - Advanced
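
A cleaned-up version of the pipelined pseudo-code on this slide is sketched below (names such as s_devbuf, s_hostbuf, blksz, and pipeline_len follow the slide; the per-chunk CUDA streams, the fixed upper bound of 16 chunks, and the tag scheme are assumptions made to keep the sketch self-contained). The slide's point stands: the application has to manage chunking, stream queries, and MPI progress by hand.

```c
#include <cuda_runtime.h>
#include <mpi.h>

/* Pipelined sender: device->host copies of one chunk are overlapped with
 * MPI_Isend of earlier chunks.  Assumes pipeline_len <= 16. */
void pipelined_send(const char *s_devbuf, char *s_hostbuf, size_t blksz,
                    int pipeline_len, int peer, MPI_Comm comm)
{
    cudaStream_t stream[16];
    MPI_Request  req[16];

    /* kick off asynchronous D2H copies, one chunk per stream */
    for (int j = 0; j < pipeline_len; j++) {
        cudaStreamCreate(&stream[j]);
        cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz,
                        blksz, cudaMemcpyDeviceToHost, stream[j]);
    }

    /* as each chunk's copy completes, send it; progress earlier sends
       while polling the copy */
    for (int j = 0; j < pipeline_len; j++) {
        while (cudaStreamQuery(stream[j]) != cudaSuccess) {
            if (j > 0) {
                int flag;
                MPI_Test(&req[j - 1], &flag, MPI_STATUS_IGNORE);
            }
        }
        MPI_Isend(s_hostbuf + j * blksz, (int)blksz, MPI_BYTE,
                  peer, j, comm, &req[j]);
    }

    MPI_Waitall(pipeline_len, req, MPI_STATUSES_IGNORE);
    for (int j = 0; j < pipeline_len; j++) cudaStreamDestroy(stream[j]);
}
```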
  • 26. At Sender: At Receiver: MPI_Recv(r_devbuf, size, …); inside MVAPICH2 • Standard MPI interfaces used for unified data movement • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0) • Overlaps data movement from GPU with RDMA transfers High Performance and High Productivity MPI_Send(s_devbuf, size, …); 26 HPCAC Spain Conference (Sept '15) GPU-Aware MPI Library: MVAPICH2-GPU
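
With the GPU-aware design, the same exchange collapses to passing the device pointer straight to MPI, as in the sketch below (assumes two ranks with one GPU each and a CUDA-aware build run with MV2_USE_CUDA=1, the parameter also shown on slide 30); staging, pipelining, or GPUDirect RDMA happen inside the library.

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t size = 1 << 20;            /* 1 MB, illustrative */
    void *devbuf;
    cudaMalloc(&devbuf, size);
    cudaMemset(devbuf, rank, size);

    /* device pointers passed directly to MPI; valid with a CUDA-aware MPI */
    if (rank == 0)
        MPI_Send(devbuf, (int)size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, (int)size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    if (rank == 1) printf("received %zu bytes GPU-to-GPU\n", size);
    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```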
  • 27. CUDA-Aware MPI: MVAPICH2 1.8-2.2 Releases • Support for MPI communication from NVIDIA GPU device memory • High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU) • High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU) • Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node • Optimized and tuned collectives for GPU device buffers • MPI datatype support for point-to-point and collective communication from GPU device buffers 27HPCAC Spain Conference (Sept '15)
  • 28. • OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox • OSU has a design of MVAPICH2 using GPUDirect RDMA – Hybrid design using GPU-Direct RDMA • GPUDirect RDMA and Host-based pipelining • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge – Support for communication using multi-rail – Support for Mellanox Connect-IB and ConnectX VPI adapters – Support for RoCE with Mellanox ConnectX VPI adapters HPCAC Spain Conference (Sept '15) 28 GPU-Direct RDMA (GDR) with CUDA IB Adapter System Memory GPU Memory GPU CPU Chipset P2P write: 5.2 GB/s P2P read: < 1.0 GB/s SNB E5-2670 P2P write: 6.4 GB/s P2P read: 3.5 GB/s IVB E5-2680V2 SNB E5-2670 / IVB E5-2680V2
  • 29. Performance of MVAPICH2-GDR with GPU-Direct-RDMA HPCAC Spain Conference (Sept '15) 29 [charts: GPU-GPU internode small-message latency (down to 2.15 usec; 3.5X and 9.3X improvements), uni-directional bandwidth and bi-directional bandwidth (up to 11x and 2x improvements) for MV2-GDR2.1RC2, MV2-GDR2.0b and MV2 w/o GDR] Testbed: Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct-RDMA
  • 30. HPCAC Spain Conference (Sept '15) 30 Application-Level Evaluation (HOOMD-blue) • Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB) • HoomdBlue Version 1.0.5 • GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
  • 31. • MVAPICH2-GDR 2.1RC2 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download • System software requirements • Mellanox OFED 2.1 or later • NVIDIA Driver 331.20 or later • NVIDIA CUDA Toolkit 6.0 or later • Plugin for GPUDirect RDMA – http://www.mellanox.com/page/products_dyn?product_family=116 – Strongly Recommended : use the new GDRCOPY module from NVIDIA • https://github.com/NVIDIA/gdrcopy • Has optimized designs for point-to-point communication using GDR • Contact MVAPICH help list with any questions related to the package mvapich-help@cse.ohio-state.edu Using MVAPICH2-GDR Version 31HPCAC Brazil Conference (Aug ‘15)
  • 32. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 32 HPCAC Spain Conference (Sept '15)
  • 33. MPI Applications on MIC Clusters Xeon Xeon Phi Multi-core Centric Many-core Centric MPI Program MPI Program Offloaded Computation MPI Program MPI Program MPI Program Host-only Offload (/reverse Offload) Symmetric Coprocessor-only • Flexibility in launching MPI jobs on clusters with Xeon Phi 33HPCAC Spain Conference (Sept '15)
  • 34. 34 MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC • Offload Mode • Intranode Communication • Coprocessor-only and Symmetric Mode • Internode Communication • Coprocessors-only and Symmetric Mode • Multi-MIC Node Configurations • Running on three major systems • Stampede, Blueridge (Virginia Tech) and Beacon (UTK) HPCAC Spain Conference (Sept '15)
  • 35. MIC-Remote-MIC P2P Communication with Proxy-based Communication 35HPCAC Spain Conference (Sept '15) Bandwidth Better BetterBetter Latency (Large Messages) 0 1000 2000 3000 4000 5000 8K 32K 128K 512K 2M Latency(usec) Message Size (Bytes) 0 2000 4000 6000 1 16 256 4K 64K 1M Bandwidth(MB/sec) Message Size (Bytes) 5236 Intra-socket P2P Inter-socket P2P 0 2000 4000 6000 8000 10000 12000 14000 16000 8K 32K 128K 512K 2M Latency(usec) Message Size (Bytes) Latency (Large Messages) 0 2000 4000 6000 1 16 256 4K 64K 1M Bandwidth(MB/sec) Message Size (Bytes) Better 5594 Bandwidth
  • 36. 36 Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall) HPCAC Spain Conference (Sept '15) A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14, May 2014 0 5000 10000 15000 20000 25000 1 2 4 8 16 32 64 128256512 1K Latency(usecs) Message Size (Bytes) 32-Node-Allgather (16H + 16 M) Small Message Latency MV2-MIC MV2-MIC-Opt 0 500 1000 1500 8K 16K 32K 64K 128K 256K 512K 1M Latency(usecs)x1000 Message Size (Bytes) 32-Node-Allgather (8H + 8 M) Large Message Latency MV2-MIC MV2-MIC-Opt 0 200 400 600 800 4K 8K 16K 32K 64K 128K 256K 512K Latency(usecs)x1000 Message Size (Bytes) 32-Node-Alltoall (8H + 8 M) Large Message Latency MV2-MIC MV2-MIC-Opt 0 10 20 30 40 50 MV2-MIC-Opt MV2-MIC ExecutionTime(secs) 32 Nodes (8H + 8M), Size = 2K*2K*1K P3DFFT Performance Communication Computation 76% 58% 55%
  • 37. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 37 HPCAC Spain Conference (Sept '15)
  • 38. MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications HPCAC Spain Conference (Sept '15) MPI, OpenSHMEM, UPC, CAF or Hybrid (MPI + PGAS) Applications Unified MVAPICH2-X Runtime InfiniBand, RoCE, iWARP OpenSHMEM Calls MPI CallsUPC Calls • Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with MVAPICH2-X 1.9 (2012) onwards! – http://mvapich.cse.ohio-state.edu • Feature Highlights – Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH) – Scalable Inter-node and intra-node communication – point-to-point and collectives CAF Calls 38
  • 39. Application Level Performance with Graph500 and Sort Graph500 Execution Time J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC’13), June 2013 J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012 0 5 10 15 20 25 30 35 4K 8K 16K Time(s) No. of Processes MPI-Simple MPI-CSC MPI-CSR Hybrid (MPI+OpenSHMEM) 13X 7.6X • Performance of Hybrid (MPI+ OpenSHMEM) Graph500 Design • 8,192 processes - 2.4X improvement over MPI-CSR - 7.6X improvement over MPI-Simple • 16,384 processes - 1.5X improvement over MPI-CSR - 13X improvement over MPI-Simple J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013. Sort Execution Time 0 500 1000 1500 2000 2500 3000 500GB-512 1TB-1K 2TB-2K 4TB-4K Time(seconds) Input Data - No. of Processes MPI Hybrid 51% • Performance of Hybrid (MPI+OpenSHMEM) Sort Application • 4,096 processes, 4 TB Input Size - MPI – 2408 sec; 0.16 TB/min - Hybrid – 1172 sec; 0.36 TB/min - 51% improvement over MPI-design 39HPCAC Spain Conference (Sept '15)
  • 40. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 40 HPCAC Spain Conference (Sept '15)
  • 41. • Virtualization has many benefits – Job migration – Compaction • Not very popular in HPC due to overhead associated with Virtualization • New SR-IOV (Single Root – IO Virtualization) support available with Mellanox InfiniBand adapters • Enhanced MVAPICH2 support for SR-IOV • Publicly available as MVAPICH2-Virt library Can HPC and Virtualization be Combined? 41 HPCAC Spain Conference (Sept '15) J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar'14 J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC'14 J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid'15
  • 42. Application-Level Performance (8 VM * 8 Core/VM) HPCAC Spain Conference (Sept '15) 42 [charts: Graph500 execution time vs. problem size (Scale, Edgefactor) and NAS Class C (FT, EP, LU, BT on 64 cores) execution time for MV2-SR-IOV-Def, MV2-SR-IOV-Opt and MV2-Native] • Compared to Native, 1-9% overhead for NAS • Compared to Native, 4-9% overhead for Graph500
  • 43. HPCAC Spain Conference (Sept '15) NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument • Large-scale instrument – Targeting Big Data, Big Compute, Big Instrument research – ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network • Reconfigurable instrument – Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use • Connected instrument – Workload and Trace Archive – Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others – Partnerships with users • Complementary instrument – Complementing GENI, Grid’5000, and other testbeds • Sustainable instrument – Industry connections http://www.chameleoncloud.org/ 43
  • 44. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 44 HPCAC Spain Conference (Sept '15)
  • 45. • MVAPICH2-EA (Energy-Aware) • A white-box approach • New Energy-Efficient communication protocols for pt-pt and collective operations • Intelligently apply the appropriate Energy saving techniques • Application oblivious energy saving • OEMT • A library utility to measure energy consumption for MPI applications • Works with all MPI runtimes • PRELOAD option for precompiled applications • Does not require ROOT permission: • A safe kernel module to read only a subset of MSRs 45 Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT) HPCAC Spain Conference (Sept '15)
  • 46. MV2-EA: Application Oblivious Energy-Aware-MPI (EAM) HPCAC Spain Conference (Sept '15) 46 • An energy-efficient runtime that provides energy savings without application knowledge • Automatically and transparently uses the best energy lever • Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation • Pessimistic MPI applies the energy reduction lever to each MPI call. A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]
  • 47. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 47 HPCAC Spain Conference (Sept '15)
  • 48. HPCAC Spain Conference (Sept '15) 48 OSU InfiniBand Network Analysis Monitoring Tool (INAM) – Network Level View • Show network topology of large clusters • Visualize traffic pattern on different links • Quickly identify congested links/links in error state • See the history unfold – play back historical state of the network Full Network (152 nodes) Zoomed-in View of the Network
  • 49. HPCAC Spain Conference (Sept '15) 49 Upcoming OSU INAM Tool – Job and Node Level Views Visualizing a Job (5 Nodes) Finding Routes Between Nodes • Job level view • Show different network metrics (load, error, etc.) for any live job • Play back historical data for completed jobs to identify bottlenecks • Node level view provides details per process or per node • CPU utilization for each rank/node • Bytes sent/received for MPI operations (pt-to-pt, collective, RMA) • Network metrics (e.g. XmitDiscard, RcvError) per rank/node
  • 50. • Performance and Memory scalability toward 500K-1M cores – Dynamically Connected Transport (DCT) service with Connect-IB • Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF …) • Enhanced Optimization for GPU Support and Accelerators • Taking advantage of advanced features – User Mode Memory Registration (UMR) – On-demand Paging • Enhanced Inter-node and Intra-node communication schemes for upcoming OmniPath enabled Knights Landing architectures • Extended RMA support (as in MPI 3.0) • Extended topology-aware collectives • Energy-aware point-to-point (one-sided and two-sided) and collectives • Extended Support for MPI Tools Interface (as in MPI 3.0) • Extended Checkpoint-Restart and migration support with SCR MVAPICH2 – Plans for Exascale 50HPCAC Spain Conference (Sept '15)
  • 51. • Scientific Computing – Message Passing Interface (MPI), including MPI + OpenMP, is the Dominant Programming Model – Many discussions towards Partitioned Global Address Space (PGAS) • UPC, OpenSHMEM, CAF, etc. – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC) • Big Data/Enterprise/Commercial Computing – Focuses on large data and data analysis – Hadoop (HDFS, HBase, MapReduce) – Spark is emerging for in-memory computing – Memcached is also used for Web 2.0 51 Two Major Categories of Applications HPCAC Spain Conference (Sept '15)
  • 52. Can High-Performance Interconnects Benefit Big Data Computing? • Most of the current Big Data systems use Ethernet Infrastructure with Sockets • Concerns for performance and scalability • Usage of High-Performance Networks is beginning to draw interest • What are the challenges? • Where do the bottlenecks lie? • Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)? • Can HPC Clusters with High-Performance networks be used for Big Data applications using Hadoop and Memcached? HPCAC Spain Conference (Sept '15) 52
  • 53. Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Storage Technologies (HDD and SSD) Programming Models (Sockets) Applications Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Other Protocols? Communication and I/O Library Point-to-Point Communication QoS Threaded Models and Synchronization Fault-ToleranceI/O and File Systems Virtualization Benchmarks RDMA Protocol Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges HPCAC Spain Conference (Sept '15) Upper level Changes? 53
  • 54. Design Overview of HDFS with RDMA • Enables high performance RDMA communication, while supporting traditional socket interface • JNI Layer bridges Java based HDFS with communication library written in native code HDFS Verbs RDMA Capable Networks (IB, 10GE/ iWARP, RoCE ..) Applications 1/10 GigE, IPoIB Network Java Socket Interface Java Native Interface (JNI) WriteOthers OSU Design • Design Features – RDMA-based HDFS write – RDMA-based HDFS replication – Parallel replication support – On-demand connection setup – InfiniBand/RoCE support HPCAC Spain Conference (Sept '15) 54
  • 55. Communication Times in HDFS • Cluster with HDD DataNodes – 30% improvement in communication time over IPoIB (QDR) – 56% improvement in communication time over 10GigE • Similar improvements are obtained for SSD DataNodes Reduced by 30% HPCAC Spain Conference (Sept '15) 0 5 10 15 20 25 2GB 4GB 6GB 8GB 10GB CommunicationTime(s) File Size (GB) 10GigE IPoIB (QDR) OSU-IB (QDR) 55 N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012 N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
  • 56. Triple-H Heterogeneous Storage • Design Features – Three modes • Default (HHH) • In-Memory (HHH-M) • Lustre-Integrated (HHH-L) – Policies to efficiently utilize the heterogeneous storage devices • RAM, SSD, HDD, Lustre – Eviction/Promotion based on data usage pattern – Hybrid Replication – Lustre-Integrated mode: • Lustre-based fault-tolerance Enhanced HDFS with In-memory and Heterogeneous Storage Hybrid Replication Data Placement Policies Eviction/ Promotion RAM Disk SSD HDD Lustre N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid ’15, May 2015 Applications HPCAC Spain Conference (Sept '15) 56
  • 57. • For 160GB TestDFSIO in 32 nodes – Write Throughput: 7x improvement over IPoIB (FDR) – Read Throughput: 2x improvement over IPoIB (FDR) Performance Improvement on TACC Stampede (HHH) • For 120GB RandomWriter in 32 nodes – 3x improvement over IPoIB (QDR) 0 1000 2000 3000 4000 5000 6000 write read TotalThroughput(MBps) IPoIB (FDR) OSU-IB (FDR) TestDFSIO 0 50 100 150 200 250 300 8:30 16:60 32:120 ExecutionTime(s) Cluster Size:Data Size IPoIB (FDR) OSU-IB (FDR) RandomWriter HPCAC Spain Conference (Sept '15) Increased by 7x Increased by 2x Reduced by 3x 57
  • 58. • For 200GB TeraGen on 32 nodes – Spark-TeraGen: HHH has 2.1x improvement over HDFS-IPoIB (QDR) – Spark-TeraSort: HHH has 16% improvement over HDFS-IPoIB (QDR) 58 Evaluation with Spark on SDSC Gordon (HHH) 0 50 100 150 200 250 8:50 16:100 32:200 ExecutionTime(s) Cluster Size : Data Size (GB) IPoIB (QDR) OSU-IB (QDR) 0 100 200 300 400 500 600 8:50 16:100 32:200 ExecutionTime(s) Cluster Size : Data Size (GB) IPoIB (QDR) OSU-IB (QDR) TeraGen TeraSort Reduced by 16% Reduced by 2.1x HPCAC Spain Conference (Sept '15) N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In- Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData ’15, October 2015
  • 59. Design Overview of MapReduce with RDMA MapReduce Verbs RDMA Capable Networks (IB, 10GE/ iWARP, RoCE ..) OSU Design Applications 1/10 GigE, IPoIB Network Java Socket Interface Java Native Interface (JNI) Job Tracker Task Tracker Map Reduce HPCAC Spain Conference (Sept '15) • Enables high performance RDMA communication, while supporting traditional socket interface • JNI Layer bridges Java based MapReduce with communication library written in native code • Design Features – RDMA-based shuffle – Prefetching and caching map output – Efficient Shuffle Algorithms – In-memory merge – On-demand Shuffle Adjustment – Advanced overlapping • map, shuffle, and merge • shuffle, merge, and reduce – On-demand connection setup – InfiniBand/RoCE support 59
  • 60. • For 240GB Sort in 64 nodes (512 cores) – 40% improvement over IPoIB (QDR) with HDD used for HDFS Performance Evaluation of Sort and TeraSort HPCAC Spain Conference (Sept '15) 0 200 400 600 800 1000 1200 Data Size: 60 GB Data Size: 120 GB Data Size: 240 GB Cluster Size: 16 Cluster Size: 32 Cluster Size: 64 JobExecutionTime(sec) IPoIB (QDR) UDA-IB (QDR) OSU-IB (QDR) Sort in OSU Cluster 0 100 200 300 400 500 600 700 Data Size: 80 GB Data Size: 160 GBData Size: 320 GB Cluster Size: 16 Cluster Size: 32 Cluster Size: 64 JobExecutionTime(sec) IPoIB (FDR) UDA-IB (FDR) OSU-IB (FDR) TeraSort in TACC Stampede • For 320GB TeraSort in 64 nodes (1K cores) – 38% improvement over IPoIB (FDR) with HDD used for HDFS 60
  • 61. Intermediate Data Directory Design Overview of Shuffle Strategies for MapReduce over Lustre HPCAC Spain Conference (Sept '15) • Design Features – Two shuffle approaches • Lustre read based shuffle • RDMA based shuffle – Hybrid shuffle algorithm to take benefit from both shuffle approaches – Dynamically adapts to the better shuffle approach for each shuffle request based on profiling values for each Lustre read operation – In-memory merge and overlapping of different phases are kept similar to RDMA- enhanced MapReduce design Map 1 Map 2 Map 3 Lustre Reduce 1 Reduce 2 Lustre Read / RDMA In-memory merge/sort reduce M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015. In-memory merge/sort reduce 61
  • 62. • For 500GB Sort in 64 nodes – 44% improvement over IPoIB (FDR) Performance Improvement of MapReduce over Lustre on TACC-Stampede HPCAC Spain Conference (Sept '15) • For 640GB Sort in 128 nodes – 48% improvement over IPoIB (FDR) 0 200 400 600 800 1000 1200 300 400 500 JobExecutionTime(sec) Data Size (GB) IPoIB (FDR) OSU-IB (FDR) 0 50 100 150 200 250 300 350 400 450 500 20 GB 40 GB 80 GB 160 GB 320 GB 640 GB Cluster: 4 Cluster: 8 Cluster: 16 Cluster: 32 Cluster: 64 Cluster: 128 JobExecutionTime(sec) IPoIB (FDR) OSU-IB (FDR) M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA- based Approach Benefit?, Euro-Par, August 2014. • Local disk is used as the intermediate data directory Reduced by 48%Reduced by 44% 62
  • 63. Design Overview of Spark with RDMA • Design Features – RDMA based shuffle – SEDA-based plugins – Dynamic connection management and sharing – Non-blocking and out-of- order data transfer – Off-JVM-heap buffer management – InfiniBand/RoCE support HPCAC Spain Conference (Sept '15) • Enables high performance RDMA communication, while supporting traditional socket interface • JNI Layer bridges Scala based Spark with communication library written in native code X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014 63
  • 64. Performance Evaluation on TACC Stampede - SortByTest • Intel SandyBridge + FDR, 16 Worker Nodes, 256 Cores, (256M 256R) • RDMA-based design for Spark 1.4.0 • RDMA vs. IPoIB with 256 concurrent tasks, single disk per node and RamDisk. For SortByKey Test: – Shuffle time reduced by up to 77% over IPoIB (56Gbps) – Total time reduced by up to 58% over IPoIB (56Gbps) 16 Worker Nodes, SortByTest Shuffle Time 16 Worker Nodes, SortByTest Total Time HPCAC Spain Conference (Sept '15) 64 0 5 10 15 20 25 30 35 40 16 32 64 Time(sec) Data Size (GB) IPoIB RDMA 0 5 10 15 20 25 30 35 40 16 32 64 Time(sec) Data Size (GB) IPoIB RDMA 77% 58%
  • 65. Memcached-RDMA Design • Server and client perform a negotiation protocol – Master thread assigns clients to appropriate worker thread – Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread • Memcached Server can serve both socket and verbs clients simultaneously – All other Memcached data structures are shared among RDMA and Sockets worker threads • Memcached applications need not be modified; uses verbs interface if available • High performance design of SSD-Assisted Hybrid Memory Sockets Client RDMA Client Master Thread Sockets Worker Thread Verbs Worker Thread Sockets Worker Thread Verbs Worker Thread Shared Data Memory & SSD Slabs Items … 1 1 2 2 65HPCAC Spain Conference (Sept '15) J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11 J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid’12 D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS’15
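
Since the slide stresses that "Memcached applications need not be modified", a stock libmemcached client such as the sketch below works unchanged against the RDMA-capable server (the hostname and port are placeholders); whether the verbs interface is used is decided below the libmemcached API, not in application code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libmemcached/memcached.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);
    memcached_server_st *servers =
        memcached_server_list_append(NULL, "server-hostname", 11211, &rc);
    memcached_server_push(memc, servers);        /* connect to the server */

    const char *key = "hpc", *value = "rdma";
    rc = memcached_set(memc, key, strlen(key), value, strlen(value),
                       (time_t)0, (uint32_t)0);  /* ordinary SET */

    size_t len;
    uint32_t flags;
    char *got = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (got) {
        printf("GET %s -> %.*s\n", key, (int)len, got);
        free(got);
    }

    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}
```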
  • 66. Performance Benefits on SDSC-Gordon – OHB Latency & Throughput Micro-Benchmarks
    • ohb_memlat & ohb_memthr latency & throughput micro-benchmarks
    • Memcached-RDMA can
      – improve query latency by up to 70% over IPoIB (32Gbps)
      – improve throughput by up to 2X over IPoIB (32Gbps)
      – run in hybrid mode with SSD at no overhead when all data fit in memory
    [Charts: average latency (us) vs. message size (bytes), and throughput (million trans/sec) vs. number of clients (64–1024), for IPoIB (32Gbps), RDMA-Mem (32Gbps), and RDMA-Hybrid (32Gbps)]
    D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, Benchmarking Key-Value Stores on High-Performance Storage and Interconnects for Web-Scale Workloads, IEEE BigData '15.
  • 67. The High-Performance Big Data (HiBD) Project (http://hibd.cse.ohio-state.edu)
    • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
    • RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
    • RDMA for Memcached (RDMA-Memcached)
    • RDMA for Apache HBase and Spark
    • OSU HiBD-Benchmarks (OHB)
      – HDFS and Memcached micro-benchmarks
    • User base: 125 organizations from 20 countries
    • More than 13,000 downloads from the project site
  • 68. Future Plans of the OSU High-Performance Big Data (HiBD) Project
    • Upcoming releases of RDMA-enhanced packages will support
      – Spark
      – HBase
    • Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
      – MapReduce
      – RPC
    • Advanced designs with upper-level changes and optimizations
      – e.g., MEM-HDFS
  • 69. Looking into the Future ….
    • Exascale systems will be constrained by
      – Power
      – Memory per core
      – Data-movement cost
      – Faults
    • Programming models and runtimes for HPC and Big Data need to be designed for
      – Scalability
      – Performance
      – Fault resilience
      – Energy awareness
      – Programmability
      – Productivity
    • This talk highlighted some of these issues and challenges
    • Continuous innovation is needed on all of these fronts
  • 70. Funding Acknowledgments
    • Funding support by: [sponsor logos]
    • Equipment support by: [vendor logos]
  • 71. Personnel Acknowledgments
    Current Students: A. Augustine (M.S.), A. Awan (Ph.D.), A. Bhat (M.S.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), K. Kulkarni (M.S.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
    Past Students: P. Balaji (Ph.D.), D. Buntinas (Ph.D.), S. Bhagvat (M.S.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
    Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
    Past Research Scientist: S. Sur
    Past Programmers: D. Bureddy, H. Subramoni
    Current Post-Docs: J. Lin, D. Shankar
    Current Senior Research Associates: K. Hamidouche, X. Lu
    Current Programmer: J. Perkins
    Current Research Specialist: M. Arnold
  • 72. Thank You!
    panda@cse.ohio-state.edu
    Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
    The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
    The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
  • 73. Call For Participation
    International Workshop on Extreme Scale Programming Models and Middleware (ESPM2 2015)
    In conjunction with the Supercomputing Conference (SC 2015), Austin, USA, November 15th, 2015
    http://web.cse.ohio-state.edu/~hamidouc/ESPM2/
  • 74. Multiple Positions Available in My Group
    • Looking for bright and enthusiastic personnel to join as
      – Post-Doctoral Researchers
      – PhD Students
      – Hadoop/Big Data Programmer/Software Engineer
      – MPI Programmer/Software Engineer
    • If interested, please contact me at this conference and/or send an e-mail to panda@cse.ohio-state.edu