
Communication Frameworks for HPC and Big Data

In this deck from the HPC Advisory Council Spain Conference, DK Panda from Ohio State University presents: Communication Frameworks for HPC and Big Data.

Watch the video presentation: http://insidehpc.com/2015/09/video-communication-frameworks-for-hpc-and-big-data/

Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter

Learn more: http://www.hpcadvisorycouncil.com/events/2015/spain-workshop/agenda.php



Communication Frameworks for HPC and Big Data

  1. 1. Communication Frameworks for HPC and Big Data. Talk at the HPC Advisory Council Spain Conference (2015) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda
  2. 2. High-End Computing (HEC): PetaFlop to ExaFlop HPCAC Spain Conference (Sept '15) 2 100-200 PFlops in 2016-2018 1 EFlops in 2020-2024?
  3. 3. Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org): clusters now account for 87% of Top500 systems. [Chart: timeline of the number and percentage of clusters in the Top500.]
  4. 4. Drivers of Modern HPC Cluster Architectures • Multi-core processors are ubiquitous • InfiniBand very popular in HPC clusters • Accelerators/Coprocessors becoming common in high-end systems • Pushing the envelope for Exascale computing. Building blocks: multi-core processors; high-performance interconnects (InfiniBand: <1 usec latency, >100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip). Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A.
  5. 5. Large-scale InfiniBand Installations • 259 IB clusters (51%) in the June 2015 Top500 list (http://www.top500.org) • Installations in the Top 50 (24 systems): 519,640 cores (Stampede) at TACC (8th); 185,344 cores (Pleiades) at NASA/Ames (11th); 72,800 cores Cray CS-Storm in US (13th); 72,800 cores Cray CS-Storm in US (14th); 265,440 cores SGI ICE at Tulip Trading Australia (15th); 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (16th); 72,000 cores (HPC2) in Italy (17th); 115,668 cores (Thunder) at AFRL/USA (19th); 147,456 cores (SuperMUC) in Germany (20th); 86,016 cores (SuperMUC Phase 2) in Germany (21st); 76,032 cores (Tsubame 2.5) at Japan/GSIC (22nd); 194,616 cores (Cascade) at PNNL (25th); 76,032 cores (Makman-2) at Saudi Aramco (28th); 110,400 cores (Pangea) in France (29th); 37,120 cores (Lomonosov-2) at Russia/MSU (31st); 57,600 cores (SwiftLucy) in US (33rd); 50,544 cores (Occigen) at France/GENCI-CINES (36th); 76,896 cores (Salomon) SGI ICE in Czech Republic (40th); 73,584 cores (Spirit) at AFRL/USA (42nd); and many more!
  6. 6. • Scientific Computing – Message Passing Interface (MPI), including MPI + OpenMP, is the Dominant Programming Model – Many discussions towards Partitioned Global Address Space (PGAS) • UPC, OpenSHMEM, CAF, etc. – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC) • Big Data/Enterprise/Commercial Computing – Focuses on large data and data analysis – Hadoop (HDFS, HBase, MapReduce) – Spark is emerging for in-memory computing – Memcached is also used for Web 2.0 6 Two Major Categories of Applications HPCAC Spain Conference (Sept '15)
  7. 7. Towards Exascale System (Today and Target), courtesy Prof. Jack Dongarra. Comparing 2015 (Tianhe-2) with a 2020-2024 exascale target: System peak: 55 PFlop/s vs. 1 EFlop/s (~20x); Power: 18 MW (3 GFlops/W) vs. ~20 MW (50 GFlops/W), O(1) in power but ~15x in efficiency; System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) vs. 32-64 PB (~50x); Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) vs. 1.2 or 15 TF (O(1)); Node concurrency: 24-core CPU + 171-core CoP vs. O(1k) or O(10k) (~5x to ~50x); Total node interconnect BW: 6.36 GB/s vs. 200-400 GB/s (~40x to ~60x); System size (nodes): 16,000 vs. O(100,000) or O(1M) (~6x to ~60x); Total concurrency: 3.12M (12.48M threads, 4 per core) vs. O(billion) for latency hiding (~100x); MTTI: few/day vs. many/day, O(?).
  8. 8. Parallel Programming Models Overview: (i) Shared Memory Model (SHMEM, DSM) - processes P1..P3 share one memory; (ii) Distributed Memory Model (MPI, Message Passing Interface) - each process owns its memory; (iii) Partitioned Global Address Space (PGAS) - per-process memories presented as a logical shared memory (Global Arrays, UPC, Chapel, X10, CAF, ...). • Programming models provide abstract machine models • Models can be mapped on different types of systems - e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
  9. 9. 9 Partitioned Global Address Space (PGAS) Models HPCAC Spain Conference (Sept '15) • Key features - Simple shared memory abstractions - Light weight one-sided communication - Easier to express irregular communication • Different approaches to PGAS - Languages • Unified Parallel C (UPC) • Co-Array Fortran (CAF) • X10 • Chapel - Libraries • OpenSHMEM • Global Arrays
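To make the PGAS abstraction concrete, here is a minimal OpenSHMEM sketch (not part of the deck) in which every PE writes its rank into a symmetric variable on its right neighbor with a one-sided put; the file name and compiler wrapper are illustrative only.

```c
/* Minimal OpenSHMEM one-sided put example (illustrative sketch, not from the deck).
 * Compile with an OpenSHMEM wrapper, e.g.: oshcc ring_put.c -o ring_put */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric heap allocation: the same remote address is valid on every PE */
    int *dest = (int *) shmem_malloc(sizeof(int));
    *dest = -1;
    shmem_barrier_all();

    /* One-sided write into the right neighbor's symmetric memory */
    int right = (me + 1) % npes;
    shmem_int_put(dest, &me, 1, right);

    shmem_barrier_all();           /* completes the put and makes it visible */
    printf("PE %d received %d from its left neighbor\n", me, *dest);

    shmem_free(dest);
    shmem_finalize();
    return 0;
}
```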
  10. 10. Hybrid (MPI+PGAS) Programming • Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics (e.g., an HPC application with kernels 1..N in MPI, where kernels 2 and N are re-written in PGAS) • Benefits: – Best of Distributed Computing Model – Best of Shared Memory Computing Model • Exascale Roadmap*: – "Hybrid Programming is a practical way to program exascale systems" * The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
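A hedged sketch of the hybrid pattern: MPI handles the bulk-synchronous collective part of a kernel while OpenSHMEM handles the irregular one-sided part. How MPI and OpenSHMEM initialization interoperate is implementation-specific (a unified runtime such as MVAPICH2-X allows both in one job), so treat this purely as an illustration.

```c
/* Hybrid MPI + OpenSHMEM sketch (illustrative only; initialization order and
 * MPI/OpenSHMEM interoperability are implementation-specific, e.g. MVAPICH2-X). */
#include <stdio.h>
#include <mpi.h>
#include <shmem.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                      /* a unified runtime shares state with MPI */

    int me = shmem_my_pe();
    long *counter = (long *) shmem_malloc(sizeof(long));
    *counter = 0;
    shmem_barrier_all();

    /* Irregular, one-sided part of the kernel: every PE increments PE 0 */
    shmem_long_add(counter, 1L, 0);    /* newer spec name: shmem_long_atomic_add */
    shmem_barrier_all();

    /* Regular, bulk-synchronous part: use an MPI collective */
    long local = *counter, global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (me == 0) printf("sum of counters = %ld\n", global);

    shmem_free(counter);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```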
  11. 11. Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges. Layered view: Application Kernels/Applications; Programming Models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.); Middleware: Communication Library or Runtime for Programming Models (point-to-point communication (two-sided and one-sided), collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance); Networking Technologies (InfiniBand, 40/100 GigE, Aries, and OmniPath); Multi/Many-core Architectures; Accelerators (NVIDIA and MIC). Co-design opportunities and challenges across these layers for performance, scalability and fault-resilience.
  12. 12. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) • Scalable Collective communication – Offload – Non-blocking – Topology-aware • Balancing intra-node and inter-node communication for next generation multi-core (128-1024 cores/node) – Multiple end-points per node • Support for efficient multi-threading • Integrated Support for GPGPUs and Accelerators • Fault-tolerance/resiliency • QoS support for communication and I/O • Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, …) • Virtualization • Energy-Awareness Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale 12 HPCAC Spain Conference (Sept '15)
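One of the requirements listed above, support for efficient multi-threading, shows up at the application level simply as requesting the right thread level at initialization; a minimal, hedged sketch (not MVAPICH2-specific):

```c
/* Requesting full multi-threaded MPI support (MPI_THREAD_MULTIPLE),
 * corresponding to the "efficient multi-threading" requirement above
 * (illustrative sketch). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: runtime only provides thread level %d\n", provided);

    /* ... application threads may call MPI concurrently only if
     *     provided == MPI_THREAD_MULTIPLE ... */

    MPI_Finalize();
    return 0;
}
```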
  13. 13. • Extreme Low Memory Footprint – Memory per core continues to decrease • D-L-A Framework – Discover • Overall network topology (fat-tree, 3D, …) • Network topology for processes for a given job • Node architecture • Health of network and node – Learn • Impact on performance and scalability • Potential for failure – Adapt • Internal protocols and algorithms • Process mapping • Fault-tolerance solutions – Low overhead techniques while delivering performance, scalability and fault- tolerance Additional Challenges for Designing Exascale Software Libraries 13 HPCAC Spain Conference (Sept '15)
  14. 14. MVAPICH2 Software • High Performance open-source MPI Library for InfiniBand, 10/40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE) – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002 – MVAPICH2-X (MPI + PGAS), Available since 2011 – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014 – Support for Virtualization (MVAPICH2-Virt), Available since 2015 – Support for Energy-Awareness (MVAPICH2-EA), Available since 2015 – Used by more than 2,450 organizations in 76 countries – More than 286,000 downloads from the OSU site directly – Empowering many TOP500 clusters (June '15 ranking) • 8th ranked 519,640-core cluster (Stampede) at TACC • 11th ranked 185,344-core cluster (Pleiades) at NASA • 22nd ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others – Available with software stacks of many vendors and Linux Distros (RedHat and SuSE) – http://mvapich.cse.ohio-state.edu • Empowering Top500 systems for over a decade – from System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) to Stampede at TACC (8th in Jun '15, 519,640 cores, 5.168 PFlops)
  15. 15. MVAPICH2 Software Family (requirement and the library to use): MPI with IB, iWARP and RoCE: MVAPICH2; Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE: MVAPICH2-X; MPI with IB & GPU: MVAPICH2-GDR; MPI with IB & MIC: MVAPICH2-MIC; HPC Cloud with MPI & IB: MVAPICH2-Virt; Energy-aware MPI with IB, iWARP and RoCE: MVAPICH2-EA
  16. 16. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimal memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2 Project for Exascale 16 HPCAC Spain Conference (Sept '15)
  17. 17. Latency & Bandwidth: MPI over IB with MVAPICH2. Small-message latencies of 1.05-1.26 us (reported values: 1.26, 1.19, 1.05 and 1.15 us) across TrueScale-QDR, ConnectX-3-FDR, ConnectIB dual-FDR and ConnectX-4-EDR HCAs (2.8 GHz deca-core IvyBridge, Intel PCIe Gen3, IB switch; ConnectX-4-EDR measured back-to-back). Unidirectional bandwidths range from 3,387 up to 12,465 MB/s (reported values: 12,465; 3,387; 6,356; 11,497 MB/s) across the same adapters.
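Latency numbers like these are normally gathered with an OSU-style ping-pong micro-benchmark; below is a simplified sketch of that measurement loop (not the actual osu_latency source, which adds warm-up iterations and message-size sweeps).

```c
/* Simplified MPI ping-pong latency sketch in the spirit of osu_latency
 * (illustrative; run with exactly 2 ranks). */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int iters = 10000, size = 8;      /* small-message case */
    int rank, nprocs;
    char *buf = malloc(size);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half the average round-trip time */
        printf("%d bytes: %.2f us\n", size, (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}
```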
  18. 18. MVAPICH2 Two-Sided Intra-Node Performance (shared memory and kernel-based zero-copy support: LiMIC and CMA), latest MVAPICH2 2.2a on Intel Ivy Bridge. Small-message latency: 0.18 us intra-socket, 0.45 us inter-socket. Intra-socket and inter-socket bandwidth (comparing CMA, shared memory and LiMIC paths) peaks at about 14,250 MB/s and 13,749 MB/s.
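The kernel-assisted zero-copy path referred to as CMA above is built on the Linux process_vm_readv/process_vm_writev system calls; the following standalone sketch only demonstrates that primitive (it is not MVAPICH2 code, whose CMA support lives inside its intra-node channel).

```c
/* Minimal Cross Memory Attach (CMA) sketch: copy bytes from another process's
 * address space with a single process_vm_readv call (Linux >= 3.2).
 * Illustrative only; here the "remote" process is simply ourselves. */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/uio.h>
#include <sys/types.h>

/* Copy 'len' bytes from 'remote_addr' in process 'pid' into 'local_buf'. */
static ssize_t cma_read(pid_t pid, void *local_buf, void *remote_addr, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
    /* One syscall, no intermediate staging through a shared-memory segment */
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}

int main(void)
{
    /* In an MPI library, 'pid' and 'remote_addr' would arrive via a
     * shared-memory handshake; here we just show the call shape. */
    char src[] = "hello from CMA";
    char buf[64];
    ssize_t n = cma_read(getpid(), buf, src, sizeof(src));
    if (n > 0) printf("copied %zd bytes: %s\n", n, buf);
    return 0;
}
```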
  19. 19. Minimizing Memory Footprint further by DC Transport • Constant connection cost (one QP for any peer) • Full feature set (RDMA, atomics, etc.) • Separate objects for send (DC Initiator) and receive (DC Target) – DC Target identified by "DCT Number" – Messages routed with (DCT Number, LID) – Requires same "DC Key" to enable communication • Available with MVAPICH2-X 2.2a • DCT support available in Mellanox OFED. [Charts: NAMD Apoa1 (large data set) normalized execution time at 160-620 processes, and connection memory footprint for Alltoall at 80-640 processes, comparing RC, DC-Pool, UD and XRC.] H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14).
  20. 20. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 20 HPCAC Spain Conference (Sept '15)
  21. 21. Hardware Multicast-aware MPI_Bcast on Stampede (ConnectX-3-FDR, 54 Gbps; 2.7 GHz dual octa-core SandyBridge, Intel PCIe Gen3, Mellanox IB FDR switch). [Charts compare default vs. multicast-based MPI_Bcast: small-message and large-message latency at 102,400 cores, and 16-byte / 32 KByte message latency scaling with the number of nodes.]
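From the application's point of view nothing changes: the hardware-multicast path sits underneath the ordinary MPI_Bcast call and, in MVAPICH2, is enabled through a runtime parameter (MV2_USE_MCAST in the user guide; check the documentation for your version). A minimal usage sketch:

```c
/* MPI_Bcast usage is unchanged whether or not the library maps it onto
 * InfiniBand hardware multicast underneath (illustrative sketch). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    char msg[16] = "";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) snprintf(msg, sizeof(msg), "config-v1");

    /* Root 0 broadcasts a small message to all ranks; with MVAPICH2 this can
     * take the hardware-multicast path when the multicast knob is enabled. */
    MPI_Bcast(msg, sizeof(msg), MPI_CHAR, 0, MPI_COMM_WORLD);

    printf("rank %d got \"%s\"\n", rank, msg);
    MPI_Finalize();
    return 0;
}
```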
  22. 22. Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (available with MVAPICH2-X 2.2a). Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes); modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes); modified Pre-Conjugate Gradient solver with Offload-Allreduce does up to 21.8% better than the default version. K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011. K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011. K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12. Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12.
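The overlap that the offloaded designs exploit is expressed at the application level through MPI-3 non-blocking collectives. Below is a hedged sketch of the P3DFFT-style pattern (compute overlapped with MPI_Ialltoall); it is not the modified P3DFFT code itself, and the buffer sizes are arbitrary.

```c
/* Overlapping independent computation with a non-blocking all-to-all (MPI-3),
 * in the spirit of the offloaded P3DFFT described above (illustrative sketch). */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, count = 1024;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc((size_t) count * size * sizeof(double));
    double *recvbuf = malloc((size_t) count * size * sizeof(double));
    double *local   = malloc((size_t) count * sizeof(double));
    for (int i = 0; i < count * size; i++) sendbuf[i] = rank;

    MPI_Request req;
    /* Start the exchange; an offload-capable NIC can progress it in hardware */
    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                  recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    /* Computation that does not depend on the incoming data */
    for (int i = 0; i < count; i++) local[i] = i * 0.5;

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete before touching recvbuf */
    if (rank == 0) printf("recvbuf[0] = %.1f\n", recvbuf[0]);

    free(sendbuf); free(recvbuf); free(local);
    MPI_Finalize();
    return 0;
}
```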
  23. 23. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • MPI-T Interface • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 23 HPCAC Spain Conference (Sept '15)
  24. 24. MPI + CUDA - Naive: data movement in applications with standard MPI and CUDA interfaces (high productivity, low performance). At sender: cudaMemcpy(s_hostbuf, s_devbuf, . . .); MPI_Send(s_hostbuf, size, . . .); At receiver: MPI_Recv(r_hostbuf, size, . . .); cudaMemcpy(r_devbuf, r_hostbuf, . . .); Data is staged through host memory, crossing PCIe before it ever reaches the NIC.
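A slightly fuller, hedged version of the naive pattern above, staging through host buffers around the MPI calls (illustrative; error checking omitted, run with 2 ranks on GPU-equipped nodes):

```c
/* Naive GPU-to-GPU transfer: stage through host buffers around MPI calls
 * (expanded, illustrative version of the pseudocode above; no error checks). */
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const int size = 4 << 20;                /* 4 MB payload */
    int rank;
    char *hostbuf, *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    hostbuf = (char *) malloc(size);
    cudaMalloc((void **) &devbuf, size);

    if (rank == 0) {                         /* sender: device -> host -> network */
        cudaMemcpy(hostbuf, devbuf, size, cudaMemcpyDeviceToHost);
        MPI_Send(hostbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                  /* receiver: network -> host -> device */
        MPI_Recv(hostbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(devbuf, hostbuf, size, cudaMemcpyHostToDevice);
    }

    cudaFree(devbuf);
    free(hostbuf);
    MPI_Finalize();
    return 0;
}
```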
  25. 25. MPI + CUDA - Advanced: pipelining at user level with non-blocking MPI and CUDA interfaces (low productivity, high performance). At sender: for (j = 0; j < pipeline_len; j++) cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, …); for (j = 0; j < pipeline_len; j++) { while (result != cudaSuccess) { result = cudaStreamQuery(…); if (j > 0) MPI_Test(…); } MPI_Isend(s_hostbuf + j * blksz, blksz, . . .); } MPI_Waitall(); Similar logic is needed at the receiver. The application overlaps the chunk-wise device-to-host copies with the sends, at the cost of considerably more code in every application.
  26. 26. GPU-Aware MPI Library: MVAPICH2-GPU. Standard MPI interfaces are used for unified data movement: at the sender, MPI_Send(s_devbuf, size, …); at the receiver, MPI_Recv(r_devbuf, size, …); the staging and pipelining happen inside MVAPICH2. • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0) • Overlaps data movement from the GPU with RDMA transfers. High performance and high productivity.
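With a CUDA-aware MPI such as MVAPICH2-GPU the staging disappears from user code and device pointers go straight into MPI calls. A hedged sketch (assumes a CUDA-aware build; under MVAPICH2 the device path is enabled at run time, e.g. with MV2_USE_CUDA=1):

```c
/* GPU-aware MPI: pass device pointers directly to MPI (illustrative sketch;
 * requires a CUDA-aware library such as MVAPICH2-GDR; run with 2 ranks). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const int size = 4 << 20;
    int rank;
    char *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **) &devbuf, size);

    if (rank == 0)        /* the library pipelines / uses GPUDirect RDMA internally */
        MPI_Send(devbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```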
  27. 27. CUDA-Aware MPI: MVAPICH2 1.8-2.2 Releases • Support for MPI communication from NVIDIA GPU device memory • High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU) • High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU) • Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node • Optimized and tuned collectives for GPU device buffers • MPI datatype support for point-to-point and collective communication from GPU device buffers 27HPCAC Spain Conference (Sept '15)
  28. 28. GPU-Direct RDMA (GDR) with CUDA • OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox • OSU has a design of MVAPICH2 using GPUDirect RDMA – Hybrid design using GPU-Direct RDMA • GPUDirect RDMA and host-based pipelining • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge (SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IVB E5-2680 v2: P2P write 6.4 GB/s, P2P read 3.5 GB/s) – Support for communication using multi-rail – Support for Mellanox Connect-IB and ConnectX VPI adapters – Support for RoCE with Mellanox ConnectX VPI adapters
  29. 29. Performance of MVAPICH2-GDR with GPU-Direct-RDMA (MVAPICH2-GDR 2.1RC2; Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct-RDMA). GPU-GPU internode small-message latency down to 2.15 usec (9.3x better than MV2 without GDR, 3.5x better than MV2-GDR 2.0b); GPU-GPU internode unidirectional bandwidth up to 11x better than MV2 without GDR and 2x better than MV2-GDR 2.0b; similar 11x / 2x gains for bi-directional bandwidth.
  30. 30. HPCAC Spain Conference (Sept '15) 30 Application-Level Evaluation (HOOMD-blue) • Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB) • HoomdBlue Version 1.0.5 • GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
  31. 31. • MVAPICH2-GDR 2.1RC2 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download • System software requirements • Mellanox OFED 2.1 or later • NVIDIA Driver 331.20 or later • NVIDIA CUDA Toolkit 6.0 or later • Plugin for GPUDirect RDMA – http://www.mellanox.com/page/products_dyn?product_family=116 – Strongly Recommended : use the new GDRCOPY module from NVIDIA • https://github.com/NVIDIA/gdrcopy • Has optimized designs for point-to-point communication using GDR • Contact MVAPICH help list with any questions related to the package mvapich-help@cse.ohio-state.edu Using MVAPICH2-GDR Version 31HPCAC Brazil Conference (Aug ‘15)
  32. 32. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 32 HPCAC Spain Conference (Sept '15)
  33. 33. MPI Applications on MIC Clusters Xeon Xeon Phi Multi-core Centric Many-core Centric MPI Program MPI Program Offloaded Computation MPI Program MPI Program MPI Program Host-only Offload (/reverse Offload) Symmetric Coprocessor-only • Flexibility in launching MPI jobs on clusters with Xeon Phi 33HPCAC Spain Conference (Sept '15)
  34. 34. 34 MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC • Offload Mode • Intranode Communication • Coprocessor-only and Symmetric Mode • Internode Communication • Coprocessors-only and Symmetric Mode • Multi-MIC Node Configurations • Running on three major systems • Stampede, Blueridge (Virginia Tech) and Beacon (UTK) HPCAC Spain Conference (Sept '15)
  35. 35. MIC-Remote-MIC P2P Communication with Proxy-based Communication. [Charts: large-message latency and bandwidth for intra-socket and inter-socket MIC-remote-MIC paths; peak bandwidths of 5,236 MB/s and 5,594 MB/s.]
  36. 36. Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall). [Charts: 32-node Allgather (16H + 16M) small-message latency, 32-node Allgather (8H + 8M) large-message latency, 32-node Alltoall (8H + 8M) large-message latency, and P3DFFT on 32 nodes (8H + 8M, size 2K*2K*1K), comparing MV2-MIC with MV2-MIC-Opt; reported improvements of 76%, 58% and 55% across these benchmarks.] A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014.
  37. 37. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 37 HPCAC Spain Conference (Sept '15)
  38. 38. MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications: MPI, OpenSHMEM, UPC, CAF or hybrid (MPI + PGAS) applications run over a unified MVAPICH2-X runtime on InfiniBand, RoCE and iWARP, with MPI, OpenSHMEM, UPC and CAF calls all handled by the same runtime. • Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with MVAPICH2-X 1.9 (2012) onwards! – http://mvapich.cse.ohio-state.edu • Feature Highlights – Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH) – Scalable inter-node and intra-node communication – point-to-point and collectives
  39. 39. Application Level Performance with Graph500 and Sort. Graph500 execution time with the hybrid (MPI+OpenSHMEM) design: at 8,192 processes, 2.4x improvement over MPI-CSR and 7.6x over MPI-Simple; at 16,384 processes, 1.5x improvement over MPI-CSR and 13x over MPI-Simple. Sort execution time with the hybrid (MPI+OpenSHMEM) design: at 4,096 processes with 4 TB input, MPI takes 2408 sec (0.16 TB/min) vs. 1172 sec (0.36 TB/min) for the hybrid, a 51% improvement. J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013. J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012. J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.
  40. 40. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 40 HPCAC Spain Conference (Sept '15)
  41. 41. Can HPC and Virtualization be Combined? • Virtualization has many benefits – Job migration – Compaction • Not very popular in HPC due to overhead associated with virtualization • New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters • Enhanced MVAPICH2 support for SR-IOV • Publicly available as the MVAPICH2-Virt library. J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar '14. J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC '14. J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid '15.
  42. 42. Application-Level Performance (8 VMs * 8 cores/VM) for NAS and Graph500, comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt and MV2-Native • Compared to Native, 1-9% overhead for NAS (Class C: FT, EP, LU, BT at 64 processes) • Compared to Native, 4-9% overhead for Graph500 (problem size (scale, edgefactor) from (20,10) to (22,20))
  43. 43. HPCAC Spain Conference (Sept '15) NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument • Large-scale instrument – Targeting Big Data, Big Compute, Big Instrument research – ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network • Reconfigurable instrument – Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use • Connected instrument – Workload and Trace Archive – Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others – Partnerships with users • Complementary instrument – Complementing GENI, Grid’5000, and other testbeds • Sustainable instrument – Industry connections http://www.chameleoncloud.org/ 43
  44. 44. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 44 HPCAC Spain Conference (Sept '15)
  45. 45. • MVAPICH2-EA (Energy-Aware) • A white-box approach • New Energy-Efficient communication protocols for pt-pt and collective operations • Intelligently apply the appropriate Energy saving techniques • Application oblivious energy saving • OEMT • A library utility to measure energy consumption for MPI applications • Works with all MPI runtimes • PRELOAD option for precompiled applications • Does not require ROOT permission: • A safe kernel module to read only a subset of MSRs 45 Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT) HPCAC Spain Conference (Sept '15)
  46. 46. MV2-EA: Application-Oblivious Energy-Aware MPI (EAM) • An energy-efficient runtime that provides energy savings without application knowledge • Uses automatically and transparently the best energy lever • Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation • Pessimistic MPI applies the energy reduction lever to each MPI call. A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist].
  47. 47. • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking • Integrated Support for GPGPUs • Integrated Support for Intel MICs • Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …) • Virtualization • Energy-Awareness • InfiniBand Network Analysis and Monitoring (INAM) Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale 47 HPCAC Spain Conference (Sept '15)
  48. 48. HPCAC Spain Conference (Sept '15) 48 OSU InfiniBand Network Analysis Monitoring Tool (INAM) – Network Level View • Show network topology of large clusters • Visualize traffic pattern on different links • Quickly identify congested links/links in error state • See the history unfold – play back historical state of the network Full Network (152 nodes) Zoomed-in View of the Network
  49. 49. HPCAC Spain Conference (Sept '15) 49 Upcoming OSU INAM Tool – Job and Node Level Views Visualizing a Job (5 Nodes) Finding Routes Between Nodes • Job level view • Show different network metrics (load, error, etc.) for any live job • Play back historical data for completed jobs to identify bottlenecks • Node level view provides details per process or per node • CPU utilization for each rank/node • Bytes sent/received for MPI operations (pt-to-pt, collective, RMA) • Network metrics (e.g. XmitDiscard, RcvError) per rank/node
  50. 50. • Performance and Memory scalability toward 500K-1M cores – Dynamically Connected Transport (DCT) service with Connect-IB • Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF …) • Enhanced Optimization for GPU Support and Accelerators • Taking advantage of advanced features – User Mode Memory Registration (UMR) – On-demand Paging • Enhanced Inter-node and Intra-node communication schemes for upcoming OmniPath enabled Knights Landing architectures • Extended RMA support (as in MPI 3.0) • Extended topology-aware collectives • Energy-aware point-to-point (one-sided and two-sided) and collectives • Extended Support for MPI Tools Interface (as in MPI 3.0) • Extended Checkpoint-Restart and migration support with SCR MVAPICH2 – Plans for Exascale 50HPCAC Spain Conference (Sept '15)
  51. 51. • Scientific Computing – Message Passing Interface (MPI), including MPI + OpenMP, is the Dominant Programming Model – Many discussions towards Partitioned Global Address Space (PGAS) • UPC, OpenSHMEM, CAF, etc. – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC) • Big Data/Enterprise/Commercial Computing – Focuses on large data and data analysis – Hadoop (HDFS, HBase, MapReduce) – Spark is emerging for in-memory computing – Memcached is also used for Web 2.0 51 Two Major Categories of Applications HPCAC Spain Conference (Sept '15)
  52. 52. Can High-Performance Interconnects Benefit Big Data Computing? • Most of the current Big Data systems use Ethernet Infrastructure with Sockets • Concerns for performance and scalability • Usage of High-Performance Networks is beginning to draw interest • What are the challenges? • Where do the bottlenecks lie? • Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)? • Can HPC Clusters with High-Performance networks be used for Big Data applications using Hadoop and Memcached? HPCAC Spain Conference (Sept '15) 52
  53. 53. Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges. Layered view: Applications; Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached); Programming Models (Sockets, with upper-level changes and other protocols as open questions); Communication and I/O Library (point-to-point communication, threaded models and synchronization, QoS, fault-tolerance, I/O and file systems, virtualization, benchmarks, RDMA protocol); Networking Technologies (InfiniBand, 1/10/40 GigE and intelligent NICs); Commodity Computing System Architectures (multi- and many-core architectures and accelerators); Storage Technologies (HDD and SSD).
  54. 54. Design Overview of HDFS with RDMA • Enables high-performance RDMA communication while supporting the traditional socket interface • JNI layer bridges Java-based HDFS with a communication library written in native code (writes go through the OSU verbs-based design over RDMA-capable networks (IB, 10 GigE/iWARP, RoCE), while other operations use the Java socket interface over 1/10 GigE or IPoIB) • Design Features – RDMA-based HDFS write – RDMA-based HDFS replication – Parallel replication support – On-demand connection setup – InfiniBand/RoCE support
  55. 55. Communication Times in HDFS (cluster with HDD DataNodes, 2-10 GB files): 30% improvement in communication time over IPoIB (QDR) and 56% improvement over 10GigE; similar improvements are obtained for SSD DataNodes. N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012. N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014.
  56. 56. Triple-H Heterogeneous Storage • Design Features – Three modes • Default (HHH) • In-Memory (HHH-M) • Lustre-Integrated (HHH-L) – Policies to efficiently utilize the heterogeneous storage devices • RAM, SSD, HDD, Lustre – Eviction/Promotion based on data usage pattern – Hybrid Replication – Lustre-Integrated mode: • Lustre-based fault-tolerance Enhanced HDFS with In-memory and Heterogeneous Storage Hybrid Replication Data Placement Policies Eviction/ Promotion RAM Disk SSD HDD Lustre N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid ’15, May 2015 Applications HPCAC Spain Conference (Sept '15) 56
  57. 57. Performance Improvement on TACC Stampede (HHH) • For 160 GB TestDFSIO on 32 nodes: write throughput improved 7x and read throughput 2x over IPoIB (FDR) • For 120 GB RandomWriter on 32 nodes: 3x improvement over IPoIB (QDR)
  58. 58. Evaluation with Spark on SDSC Gordon (HHH) • For 200 GB TeraGen on 32 nodes: Spark-TeraGen with HHH shows a 2.1x improvement over HDFS-IPoIB (QDR); Spark-TeraSort with HHH shows a 16% improvement over HDFS-IPoIB (QDR). N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData '15, October 2015.
  59. 59. Design Overview of MapReduce with RDMA MapReduce Verbs RDMA Capable Networks (IB, 10GE/ iWARP, RoCE ..) OSU Design Applications 1/10 GigE, IPoIB Network Java Socket Interface Java Native Interface (JNI) Job Tracker Task Tracker Map Reduce HPCAC Spain Conference (Sept '15) • Enables high performance RDMA communication, while supporting traditional socket interface • JNI Layer bridges Java based MapReduce with communication library written in native code • Design Features – RDMA-based shuffle – Prefetching and caching map output – Efficient Shuffle Algorithms – In-memory merge – On-demand Shuffle Adjustment – Advanced overlapping • map, shuffle, and merge • shuffle, merge, and reduce – On-demand connection setup – InfiniBand/RoCE support 59
  60. 60. Performance Evaluation of Sort and TeraSort • For 240 GB Sort on 64 nodes (512 cores): 40% improvement over IPoIB (QDR) with HDD used for HDFS (Sort on the OSU cluster, comparing IPoIB, UDA-IB and OSU-IB at QDR) • For 320 GB TeraSort on 64 nodes (1K cores): 38% improvement over IPoIB (FDR) with HDD used for HDFS (TeraSort on TACC Stampede, comparing IPoIB, UDA-IB and OSU-IB at FDR)
  61. 61. Intermediate Data Directory Design Overview of Shuffle Strategies for MapReduce over Lustre HPCAC Spain Conference (Sept '15) • Design Features – Two shuffle approaches • Lustre read based shuffle • RDMA based shuffle – Hybrid shuffle algorithm to take benefit from both shuffle approaches – Dynamically adapts to the better shuffle approach for each shuffle request based on profiling values for each Lustre read operation – In-memory merge and overlapping of different phases are kept similar to RDMA- enhanced MapReduce design Map 1 Map 2 Map 3 Lustre Reduce 1 Reduce 2 Lustre Read / RDMA In-memory merge/sort reduce M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015. In-memory merge/sort reduce 61
  62. 62. Performance Improvement of MapReduce over Lustre on TACC-Stampede (local disk is used as the intermediate data directory) • For 500 GB Sort on 64 nodes: 44% improvement over IPoIB (FDR) • For 640 GB Sort on 128 nodes: 48% improvement over IPoIB (FDR). M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014.
  63. 63. Design Overview of Spark with RDMA • Design Features – RDMA based shuffle – SEDA-based plugins – Dynamic connection management and sharing – Non-blocking and out-of- order data transfer – Off-JVM-heap buffer management – InfiniBand/RoCE support HPCAC Spain Conference (Sept '15) • Enables high performance RDMA communication, while supporting traditional socket interface • JNI Layer bridges Scala based Spark with communication library written in native code X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014 63
  64. 64. Performance Evaluation on TACC Stampede - SortByTest • Intel SandyBridge + FDR, 16 worker nodes, 256 cores (256M 256R) • RDMA-based design for Spark 1.4.0 • RDMA vs. IPoIB with 256 concurrent tasks, single disk per node and RamDisk; for the SortByKey test at 16-64 GB data sizes, shuffle time is reduced by up to 77% and total time by up to 58% over IPoIB (56 Gbps)
  65. 65. Memcached-RDMA Design • Server and client perform a negotiation protocol – the master thread assigns clients to the appropriate worker thread – once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread • Memcached server can serve both socket and verbs clients simultaneously – all other Memcached data structures (memory and SSD slabs, items) are shared among RDMA and sockets worker threads • Memcached applications need not be modified; the verbs interface is used if available • High performance design of SSD-assisted hybrid memory. J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11. J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transport, CCGrid '12. D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit Online Data Processing Workloads with Memcached and MySQL, ISPASS '15.
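Because the RDMA design keeps the standard Memcached interface, an existing client written against libmemcached should work unchanged; below is a hedged sketch of such a client (the server address, port, key and value are placeholders, and this is not code from the HiBD packages).

```c
/* Unmodified libmemcached client: an RDMA-capable transport sits below this
 * API, so application code like this does not need to change
 * (illustrative; "127.0.0.1" and the key/value are placeholders). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libmemcached/memcached.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);

    const char *key = "user:42", *val = "hello";
    memcached_return_t rc = memcached_set(memc, key, strlen(key),
                                          val, strlen(val), (time_t) 0, 0);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    size_t len; uint32_t flags;
    char *got = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (got) { printf("got %.*s\n", (int) len, got); free(got); }

    memcached_free(memc);
    return 0;
}
```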
  66. 66. Performance Benefits on SDSC-Gordon - OHB Latency & Throughput Micro-Benchmarks (ohb_memlat & ohb_memthr) • Memcached-RDMA can improve query latency by up to 70% over IPoIB (32 Gbps) and throughput by up to 2x over IPoIB (32 Gbps), with no overhead in using hybrid mode with SSD when all data fit in memory. D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, Benchmarking Key-Value Stores on High-Performance Storage and Interconnects for Web-Scale Workloads, IEEE BigData '15.
  67. 67. • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) • RDMA for Apache Hadoop 1.x (RDMA-Hadoop) • RDMA for Memcached (RDMA-Memcached) • OSU HiBD-Benchmarks (OHB) – HDFS and Memcached Micro-benchmarks • http://hibd.cse.ohio-state.edu • Users Base: 125 organizations from 20 countries • More than 13,000 downloads from the project site • RDMA for Apache HBase and Spark The High-Performance Big Data (HiBD) Project HPCAC Spain Conference (Sept '15) 67
  68. 68. • Upcoming Releases of RDMA-enhanced Packages will support – Spark – HBase • Upcoming Releases of OSU HiBD Micro-Benchmarks (OHB) will support – MapReduce – RPC • Advanced designs with upper-level changes and optimizations • E.g. MEM-HDFS HPCAC Spain Conference (Sept '15) Future Plans of OSU High Performance Big Data (HiBD) Project 68
  69. 69. • Exascale systems will be constrained by – Power – Memory per core – Data movement cost – Faults • Programming Models and Runtimes for HPC and BigData need to be designed for – Scalability – Performance – Fault-resilience – Energy-awareness – Programmability – Productivity • Highlighted some of the issues and challenges • Need continuous innovation on all these fronts Looking into the Future …. 69 HPCAC Spain Conference (Sept '15)
  70. 70. HPCAC Spain Conference (Sept '15) Funding Acknowledgments Funding Support by Equipment Support by 70
  71. 71. Personnel Acknowledgments Current Students – A. Augustine (M.S.) – A. Awan (Ph.D.) – A. Bhat (M.S.) – S. Chakraborthy (Ph.D.) – C.-H. Chu (Ph.D.) – N. Islam (Ph.D.) Past Students – P. Balaji (Ph.D.) – D. Buntinas (Ph.D.) – S. Bhagvat (M.S.) – L. Chai (Ph.D.) – B. Chandrasekharan (M.S.) – N. Dandapanthula (M.S.) – V. Dhanraj (M.S.) – T. Gangadharappa (M.S.) – K. Gopalakrishnan (M.S.) – G. Santhanaraman (Ph.D.) – A. Singh (Ph.D.) – J. Sridhar (M.S.) – S. Sur (Ph.D.) – H. Subramoni (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Vishnu (Ph.D.) – J. Wu (Ph.D.) – W. Yu (Ph.D.) Past Research Scientist – S. Sur Current Post-Doc – J. Lin – D. Shankar Current Programmer – J. Perkins Past Post-Docs – H. Wang – X. Besseron – H.-W. Jin – M. Luo – W. Huang (Ph.D.) – W. Jiang (M.S.) – J. Jose (Ph.D.) – S. Kini (M.S.) – M. Koop (Ph.D.) – R. Kumar (M.S.) – S. Krishnamoorthy (M.S.) – K. Kandalla (Ph.D.) – P. Lai (M.S.) – J. Liu (Ph.D.) – M. Luo (Ph.D.) – A. Mamidala (Ph.D.) – G. Marsh (M.S.) – V. Meshram (M.S.) – A. Moody (M.S.) – S. Naravula (Ph.D.) – R. Noronha (Ph.D.) – X. Ouyang (Ph.D.) – S. Pai (M.S.) – S. Potluri (Ph.D.) – R. Rajachandrasekar (Ph.D.) – K. Kulkarni (M.S.) – M. Li (Ph.D.) – M. Rahman (Ph.D.) – D. Shankar (Ph.D.) – A. Venkatesh (Ph.D.) – J. Zhang (Ph.D.) – E. Mancini – S. Marcarelli – J. Vienne Current Senior Research Associates – K. Hamidouche – X. Lu Past Programmers – D. Bureddy – H. Subramoni Current Research Specialist – M. Arnold HPCAC Spain Conference (Sept '15) 71
  72. 72. panda@cse.ohio-state.edu HPCAC Spain Conference (Sept '15) Thank You! The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/ 72 Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ The MVAPICH2 Project http://mvapich.cse.ohio-state.edu/
  73. 73. 73 Call For Participation International Workshop on Extreme Scale Programming Models and Middleware (ESPM2 2015) In conjunction with Supercomputing Conference (SC 2015) Austin, USA, November 15th, 2015 http://web.cse.ohio-state.edu/~hamidouc/ESPM2/ HPCAC Spain Conference (Sept '15)
  74. 74. • Looking for Bright and Enthusiastic Personnel to join as – Post-Doctoral Researchers – PhD Students – Hadoop/Big Data Programmer/Software Engineer – MPI Programmer/Software Engineer • If interested, please contact me at this conference and/or send an e-mail to panda@cse.ohio-state.edu 74 Multiple Positions Available in My Group HPCAC Spain Conference (Sept '15)
