Mellanox HPC Update @ HPCDay 2012 Kiev
- 1. New Advancements in HPC Interconnect Technology
October 2012, HPC@mellanox.com
- 2. Leading Server and Storage Interconnect Provider
High-Performance Computing, Web 2.0, Cloud, Database, Storage
Comprehensive End-to-End 10/40/56Gb/s Ethernet and 56Gb/s InfiniBand Portfolio
ICs, Adapter Cards, Switches/Gateways, Software, Cables
Scalability, Reliability, Power, Performance
- 3. Mellanox in the TOP500
InfiniBand has become the de-facto interconnect solution for High-Performance Computing
• Most used interconnect on the TOP500 list – 210 systems
• FDR InfiniBand connects the fastest InfiniBand system on the TOP500 list – 2.9 Petaflops, 91% efficiency (LRZ)
InfiniBand connects more of the world's sustained Petaflop systems than any other interconnect (40% – 8 systems out of 20)
Most used interconnect solution in the TOP100, TOP200, TOP300, TOP400, and TOP500
• Connects 47% (47 systems) of the TOP100, while Ethernet connects only 2% (2 systems)
• Connects 55% (110 systems) of the TOP200, while Ethernet connects only 10% (20 systems)
• Connects 52.7% (158 systems) of the TOP300, while Ethernet connects only 22.7% (68 systems)
• Connects 46.3% (185 systems) of the TOP400, while Ethernet connects only 33.8% (135 systems)
FDR InfiniBand-based systems have increased 10X since Nov '11
- 4. Mellanox InfiniBand Paves the Road to Exascale
[Chart: Petaflop-scale systems timeline, with "Mellanox Connected" systems highlighted]
- 6. InfiniBand Roadmap
2002: 10Gb/s (SDR) • 2005: 20Gb/s (DDR) • 2008: 40Gb/s (QDR) • 2011: 56Gb/s (FDR)
Highest Performance, Reliability, Scalability, Efficiency
- 7. FDR InfiniBand New Features and Capabilities
Performance / Scalability:
• >12GB/s bandwidth, <0.7µsec latency
• PCI Express 3.0
• InfiniBand routing and IB-Ethernet bridging
Reliability / Efficiency:
• 64/66 link bit encoding
• Forward Error Correction
• Lower power consumption
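A quick check on what 64/66 encoding buys (the per-lane signaling rates are standard InfiniBand figures, not taken from the slide): QDR's 8b/10b encoding turns 4 lanes × 10Gb/s into 32Gb/s of data, while FDR's 64/66 encoding turns 4 × 14.0625Gb/s into roughly 54.5Gb/s of data per direction, consistent with the >12GB/s figure above when both directions are counted.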
- 8. Virtual Protocol Interconnect (VPI) Technology
VPI Adapter:
• Applications: Networking, Storage, Clustering, Management
• Acceleration Engines
• Ethernet: 10/40 Gb/s; InfiniBand: 10/20/40/56 Gb/s; PCIe 3.0
• Form factors: LOM, Adapter Card, Mezzanine Card
VPI Switch:
• Unified Fabric Manager, Switch OS Layer
• 64 ports 10GbE; 36 ports 40GbE; 48 10GbE + 12 40GbE; 36 ports IB up to 56Gb/s
• 8 VPI subnets
- 9. FDR InfiniBand PCIe 3.0 vs QDR InfiniBand PCIe 2.0
Double the Bandwidth, Half the Latency
120% Higher Application ROI
- 11. FDR InfiniBand Meets the Needs of the Changing Storage World
SSDs, the storage hierarchy, in-memory computing… Remote I/O access needs to be equal to local I/O access.
[Diagram: SMB I/O micro-benchmark, with an SMB client connected over IB FDR to an SMB server backed by FusionIO devices]
Native throughput performance over InfiniBand FDR
- 12. FDR/QDR InfiniBand Comparisons – Linpack Efficiency
• Derived from the June 2012 TOP500 list
• Highest and lowest outliers removed from each group
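For reference, Linpack efficiency is Rmax/Rpeak, sustained Linpack performance over theoretical peak (standard TOP500 terminology). The LRZ example from slide 3 works out to a peak of roughly 2.9 / 0.91 ≈ 3.2 Petaflops.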
- 13. Powered by Mellanox FDR InfiniBand
- 14. Connect-IB
The Foundation for Exascale Computing
- 15. Roadmap of Interconnect Innovations
InfiniHost (2002): World's first InfiniBand HCA. 10Gb/s InfiniBand, PCI-X host interface, 1 million msg/sec
InfiniHost III (2005): World's first PCIe InfiniBand HCA. 20Gb/s InfiniBand, PCIe 1.0, 2 million msg/sec
ConnectX (1,2,3) (2008-11): World's first Virtual Protocol Interconnect (VPI) adapter. 40/56Gb/s InfiniBand, PCIe 2.0 / 3.0 x8, 33 million msg/sec
Connect-IB (June 2012): The Exascale Foundation
- 16. Connect-IB Performance Highlights
▪ World’s first 100Gb/s interconnect adapter
• PCIe 3.0 x16, dual FDR 56Gb/s InfiniBand ports to provide >100Gb/s
▪ Highest InfiniBand message rate: 130 million messages per second
• 4X higher than other InfiniBand solutions
▪ <0.7 micro-second application latency
▪ Supports GPUDirect RDMA for direct GPU-to-GPU communication
▪ Unmatched storage performance
• 8,000,000 IOPS (1 QP), 18,500,000 IOPS (32 QPs)
▪ Enhanced congestion-control mechanism
▪ Supports Scalable HPC with MPI, SHMEM and PGAS offloads
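The >100Gb/s headline is consistent with the FDR arithmetic above: two FDR ports × roughly 54.5Gb/s of 64/66-encoded data per port ≈ 109Gb/s aggregate.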
Enter the World of Boundless Performance
- 18. Mellanox ScalableHPC Accelerates Parallel Applications
MXM:
- Reliable messaging optimized for Mellanox HCAs
- Hybrid transport mechanism
- Efficient memory registration
- Receive-side tag matching
FCA:
- Topology-aware collective optimization
- Hardware multicast
- Separate virtual fabric for collectives
- CORE-Direct hardware offload
Both build on the InfiniBand Verbs API.
- 19. Mellanox MXM – HPCC Random Ring Latency
- 21. Mellanox MPI Optimizations – Highest Scalability at LLNL
Mellanox MPI optimizations enable linear strong scaling for an LLNL application
World Leading Performance and Scalability
- 23. Collective Operation Challenges at Large Scale
Collective algorithms are not topology-aware and can be inefficient
Congestion due to many-to-many communications
[Figure: ideal vs. actual collective communication patterns]
Slow nodes and OS jitter affect scalability and increase variability
- 24. Mellanox Collectives Acceleration Components
CORE-Direct
• Adapter-based hardware offloading for collective operations
• Includes floating-point capability on the adapter for data reductions
• The CORE-Direct API is exposed through the Mellanox drivers
FCA
• FCA is a software plug-in package that integrates into available MPIs
• Provides scalable, topology-aware collective operations
• Utilizes powerful InfiniBand multicast and QoS capabilities
• Integrates the CORE-Direct collective hardware offloads (see the sketch below)
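Because FCA plugs in beneath the MPI library, the collectives it accelerates are the ordinary ones applications already issue; no FCA-specific calls appear in user code. A minimal C/MPI sketch of such collectives (generic MPI; running it under an FCA-enabled MPI is what would engage the offloads):

/* Standard MPI collectives of the kind FCA/CORE-Direct accelerate
 * transparently; the application code itself is unchanged. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes a value; with CORE-Direct the reduction
     * arithmetic can be offloaded to the adapter instead of the CPU. */
    double local = (double)rank, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Barrier is another offloadable collective. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %.1f\n", size - 1, sum);

    MPI_Finalize();
    return 0;
}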
- 26. Fabric Collective Accelerations Provide Linear Scalability
[Charts: "Barrier Collective" and "Reduce Collective" latency (µs), and "8-Byte Broadcast" bandwidth (KB×processes), plotted against processes (PPN=8) up to ~2,500, comparing "Without FCA" and "With FCA"]
- 27. Application Example: CPMD and Amber
CPMD and Amber are leading molecular dynamics applications
Result: FCA accelerates CPMD by nearly 35% and Amber by 33%
• At 16 nodes, 192 cores
• Performance benefit increases with cluster size – higher benefit expected at larger scale
*Acknowledgment: HPC Advisory Council for providing the performance results (charts: higher is better)
- 29. GPU Networking Communication (pre-GPUDirect)
In GPU-to-GPU communications, GPU applications use "pinned" buffers
• A section of host memory dedicated to the GPU
• Allows optimizations such as write-combining and overlapping GPU computation with data transfer for best performance
The InfiniBand verbs API also uses "pinned" buffers for efficient communication (a minimal registration sketch follows the diagram below)
• Zero-copy data transfers, kernel bypass
The CPU is involved in the GPU data path
• Memory copies between the different "pinned" buffers
• This slows down GPU communications and creates a communication bottleneck
[Diagram: transmit and receive paths: on each side, data is copied (steps 1 and 2) between the GPU's pinned buffer and the InfiniBand pinned buffer in system memory before the Mellanox InfiniBand adapter moves it across the network]
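For concreteness, a minimal libibverbs sketch of what "pinning" means at the verbs level (standard verbs calls; the device choice, buffer size, and access flags are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* An ordinary host buffer. */
    size_t len = 1 << 20;
    void *buf = malloc(len);

    /* Registration "pins" the pages so the HCA can DMA to/from them
     * directly: this is what enables zero-copy, kernel-bypass transfers. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("pinned %zu bytes: lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}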
- 30. NVIDIA GPUDirect 1.0
GPUDirect 1.0 was a joint development between Mellanox and NVIDIA
• Eliminates system memory copies for GPU communications across the network using the InfiniBand Verbs API
• Reduces latency by 30% for GPU communications
GPUDirect 1.0 availability was announced in May 2010
• Available in CUDA 4.0 and higher
[Diagram: transmit and receive paths with GPUDirect 1.0: the GPU and the InfiniBand adapter share one pinned buffer in system memory (step 1), eliminating the extra copy]
- 31. GPUDirect 1.0 – Application Performance
LAMMPS
• 3 nodes, 10% gain
[Charts: 3 nodes, 1 GPU per node; 3 nodes, 3 GPUs per node]
Amber – Cellulose
• 8 nodes, 32% gain
Amber – FactorX
• 8 nodes, 27% gain
- 32. GPUDirect RDMA Peer-to-Peer
GPUDirect RDMA (previously known as GPUDirect 3.0)
Enables peer-to-peer communication directly between the HCA and the GPU
Dramatically reduces overall latency for GPU-to-GPU communications (a CUDA-aware MPI sketch follows the diagram below)
[Diagram: with GPUDirect RDMA, the Mellanox HCA moves data directly to and from GPU GDDR5 memory over PCI Express 3.0, bypassing system memory and the CPU on both ends of the Mellanox VPI fabric]
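At the application level, this is what the direct path enables, sketched here with a CUDA-aware MPI. One assumption beyond the slide: the MPI library must accept device pointers, as MVAPICH2 and Open MPI later did over GPUDirect RDMA.

/* Hedged sketch: point-to-point transfer straight from GPU memory.
 * Under a CUDA-aware MPI built on GPUDirect RDMA, the HCA DMAs
 * directly to/from GDDR5 and no host staging copy is needed. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;              /* 1M floats, arbitrary size */
    float *dbuf;
    cudaMalloc((void **)&dbuf, n * sizeof(float));

    if (rank == 0)                      /* device pointer passed directly to MPI */
        MPI_Send(dbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}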
- 33. Optimizing GPU and Accelerator Communications
NVIDIA GPUs
• Mellanox was an original partner in the co-development of GPUDirect 1.0
• Recently announced support for GPUDirect RDMA, a peer-to-peer GPU-to-HCA data path
AMD GPUs
• Sharing of System Memory: AMD DirectGMA Pinned supported today
• AMD DirectGMA P2P: Peer-to-Peer GPU-to-HCA data path under development
Intel MIC
• MIC software development system enables the MIC to communicate directly over the
InfiniBand verbs API to Mellanox devices
- 34. GPU as a Service
[Diagram: "GPUs in every server" vs. "GPUs as a Service": CPU-only compute servers use virtual GPUs (VGPU) to reach a pooled set of physical GPUs over the network]
GPUs as a network-resident service
• Little to no overhead when using FDR InfiniBand
Virtualize and decouple GPU services from CPU services
• A new paradigm in cluster flexibility
• Lower cost, lower power, and ease of use with shared GPU resources
• Removes difficult physical requirements of the GPU for standard compute servers
- 36. Hadoop™ Map Reduce Framework for Unstructured Data
Map Reduce Programming Model
[Diagram: raw data is split across Map tasks, each emitting (key, 1) pairs; the pairs are sorted and grouped by key, then Reduce tasks aggregate each key's values into totals]
Map
• The map function is applied to a set of input values and calculates a set of key/value pairs
Reduce
• The reduce function aggregates the key/value pairs for each key into a scalar
• The reduce function receives all the data for an individual "key" from all the mappers (see the word-count sketch below)
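A self-contained sketch of the map, sort/shuffle, reduce pattern using the classic word-count example (plain C standing in for a Hadoop job; the input words are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX 64

typedef struct { char key[32]; int value; } Pair;

static int cmp(const void *a, const void *b) {
    return strcmp(((const Pair *)a)->key, ((const Pair *)b)->key);
}

int main(void) {
    const char *input[] = { "the", "quick", "the", "fox", "the", "fox" };
    int n = sizeof input / sizeof *input;
    Pair pairs[MAX];

    /* Map: emit a (word, 1) pair for every input value. */
    for (int i = 0; i < n; i++) {
        strncpy(pairs[i].key, input[i], sizeof pairs[i].key - 1);
        pairs[i].key[sizeof pairs[i].key - 1] = '\0';
        pairs[i].value = 1;
    }

    /* Shuffle/Sort: group pairs by key, as Hadoop does between map and reduce. */
    qsort(pairs, n, sizeof *pairs, cmp);

    /* Reduce: aggregate each key's values into a single scalar. */
    for (int i = 0; i < n; ) {
        int sum = 0, j = i;
        while (j < n && strcmp(pairs[j].key, pairs[i].key) == 0)
            sum += pairs[j++].value;
        printf("%s\t%d\n", pairs[i].key, sum);
        i = j;
    }
    return 0;
}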
- 37. Mellanox Unstructured Data Accelerator (UDA)
Plug-in architecture for Hadoop
• Hadoop applications are unmodified
• Plugs into Apache Hadoop
• Enabled via XML configuration (illustrative snippet below)
Accelerates Map Reduce operations
• Data communication over RDMA
- Using RDMA for in-memory processing
- Kernel bypass minimizes CPU overhead for faster execution
• Enables the Reduce phase to start in parallel with the Shuffle phase
- Reduces disk I/O operations
• Supports InfiniBand and Ethernet
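As an illustration of "enabled via XML configuration": Hadoop 2.x made the shuffle pluggable through the standard property mapreduce.job.reduce.shuffle.consumer.plugin.class. The Mellanox class name below is an assumption for illustration, not taken from the slide; the UDA release notes give the exact property and class names.

<!-- Hypothetical mapred-site.xml fragment; the plugin class name is an
     illustrative assumption, not confirmed by this deck. -->
<configuration>
  <property>
    <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
    <value>com.mellanox.hadoop.mapred.UdaShuffleConsumerPlugin</value>
  </property>
</configuration>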
Enables Competitive Advantage with Big Data Analytics Acceleration
- 38. UDA Performance Benefit for Map Reduce
Terasort Benchmark*, 16GB of data per node
[Chart: execution time (sec, lower is better) at 8, 10, and 12 nodes for 1 GE, 10 GE, and UDA 10GE, annotated "45%" and "~2X Acceleration"]
*TeraSort is a popular benchmark used to measure the performance of a Hadoop cluster
Fastest Job Completion!
- 39. Accelerating Big Data Analytics – EMC/GreenPlum
EMC's 1000-node analytic platform accelerates the industry's Hadoop development
24 PetaBytes of physical storage
• Half of every written word since the inception of mankind
Mellanox InfiniBand FDR solutions: Hadoop acceleration, 2X faster Hadoop job run-time
High throughput, low latency, and RDMA are critical for ROI
- 40. Thank You
HPC@mellanox.com
© 2012 Mellanox Technologies