Interconnect Your Future
© 2015 Mellanox Technologies
The Ever Growing Demand for Higher Performance
[Performance-development timeline, 2000–2020: Terascale to Petascale ("Roadrunner") to Exascale; SMP to clusters; single-core to many-core.]
[Diagram: Co-Design at the intersection of Hardware (HW), Software (SW), and Application (APP).]
The Interconnect is the Enabling Technology
Co-Design Architecture to Enable Exascale Performance
[Diagram comparing architectures:]
• CPU-Centric: limited to main CPU usage; results in performance limitation
• Co-Design: creates synergies across Software, In-CPU Computing, In-Network Computing, and In-Storage Computing; enables higher performance and scale
The Intelligence is Moving to the Interconnect
[Diagram: intelligence shifting from the CPU (past) to the interconnect (future).]
Breaking the Application Latency Wall
§ Today: Network device latencies are on the order of 100 nanoseconds
§ Challenge: Enabling the next order of magnitude improvement in application performance
§ Solution: Creating synergies between software and hardware – intelligent interconnect
Intelligent Interconnect Paves the Road to Exascale Performance
Latency evolution (network vs. communication framework):
• 10 years ago: Network ~10 µs; Communication Framework ~100 µs
• Today: Network ~0.1 µs; Communication Framework ~10 µs
• Future (Co-Design): Network ~0.05 µs; Communication Framework ~1 µs
Co-Design: Offloaded Technologies Target Application Characteristics
[Diagram – Offloaded Technologies: Intelligent Interconnect, underpinning Applications (Innovations, Scalability, Performance): RDMA, GPUDirect, Virtualization, Software-Defined Networking (SDN), Direct Communication, Programmability, Backward and Future Compatibility.]
Co-Design Requires Intelligent Interconnect
The Road to Exascale – Co-Design System Architecture
[Diagram: Co-Design links every element of the system – In-CPU Computing (CPU), In-GPU Computing (GPU), In-FPGA Computing (FPGA), and In-Network Computing (HCA and switch).]
Introducing Switch-IB 2: World's First Smart Switch
§ The world's fastest switch, with <90 nanosecond latency
§ 36 ports, 100Gb/s per port, 7.2Tb/s aggregate throughput, 7.02 billion messages/sec
§ Adaptive routing, congestion control, support for multiple topologies
World’s First Smart Switch
Built for Scalable Compute and Storage Infrastructures
10X Higher Performance with the New Switch SHArP Technology
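The headline figures above are internally consistent; a quick arithmetic check (assuming, as is common for switch capacity figures, that the 7.2Tb/s number counts both directions of all 36 ports):

```python
ports = 36
per_port_gbps = 100

# Aggregate throughput: 36 ports x 100 Gb/s, counting both directions
# (an assumption about how the capacity figure is stated).
aggregate_tbps = ports * per_port_gbps * 2 / 1000
print(aggregate_tbps)  # 7.2

# Per-port message rate implied by the 7.02 billion messages/sec figure.
msgs_per_port = 7.02e9 / ports
print(round(msgs_per_port / 1e6))  # 195 (million messages/sec per port)
```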
SHArP (Scalable Hierarchical Aggregation Protocol) Technology
Delivering 10X Performance Improvement
for MPI and SHMEM/PGAS Applications
Switch-IB 2 Enables the Switch Network to
Operate as a Co-Processor
SHArP Enables Switch-IB 2 to Manage and
Execute MPI Operations in the Network
Scalable Hierarchical Aggregation Protocol
§ Reliable Scalable General Purpose Primitive, Applicable to Multiple Use-cases
• In-network tree-based aggregation mechanism
• Large number of groups
• Multiple simultaneous outstanding operations
Accelerating HPC applications
§ Scalable High Performance Collective Offload
• Barrier, Reduce, All-Reduce, Broadcast
• Sum, Min, Max, MinLoc, MaxLoc, OR, XOR, AND
• Integer and Floating-Point, 32 / 64 bit
§ Significantly reduce MPI collective runtime
§ Increase CPU availability and efficiency
§ Enable communication and computation overlap
Accelerating MapReduce Applications
§ Prevent the Incast Traffic Pattern
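The in-network tree aggregation described above can be sketched as a toy model (plain Python, no real hardware or MPI involved; `tree_allreduce` is an illustrative name, not a SHArP API): each level of the tree combines partial results from its children, so a reduction over N nodes takes O(log N) combining steps instead of O(N) work at a single root.

```python
def tree_allreduce(values, op):
    """Toy model of in-network tree aggregation: at each level, pairs of
    partial results are combined at the parent, halving the number of
    outstanding values until one remains. Returns the reduced value and
    the number of tree levels used."""
    levels = 0
    while len(values) > 1:
        # Combine adjacent pairs; an odd value out is carried up unchanged.
        values = [op(values[i], values[i + 1]) if i + 1 < len(values) else values[i]
                  for i in range((len(values) + 1) // 2 * 0, len(values), 2)]
        levels += 1
    return values[0], levels

# Sum-allreduce over 1024 ranks: the result is correct and the tree is only
# log2(1024) = 10 levels deep, which is why in-network reduction latency
# grows slowly with node count.
total, depth = tree_allreduce(list(range(1024)), lambda a, b: a + b)
print(total, depth)  # 523776 10
```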
SHArP Performance Advantage – MiniFE Details
§ MiniFE is a Finite Element mini-application
• Implements kernels that represent implicit finite-element applications
10X to 25X Performance Improvement
AllReduce MPI Collective

| Number of Nodes | CPU-Based Latency (µs) | SHArP Latency (µs) | Ratio |
|-----------------|------------------------|--------------------|-------|
| 32              | 41.7                   | 4.24               | 9.9   |
| 64              | 49.08                  | 4.63               | 10.6  |
| 128             | 57.67                  | 4.76               | 12.1  |
| 256             | 67.76                  | 4.87               | 13.9  |
| 512             | 79.62                  | 5.09               | 15.6  |
| 1024            | 93.55                  | 5.58               | 16.8  |
| 2048            | 109.92                 | 5.63               | 19.5  |
| 4096            | 129.16                 | 5.73               | 22.5  |
| 8192            | 151.76                 | 5.94               | 25.5  |
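The Ratio column is simply the CPU-based latency divided by the SHArP latency; a quick check of a few rows (Python, values copied from the table):

```python
# (node_count, cpu_based_us, sharp_us) rows from the AllReduce table
rows = [
    (64, 49.08, 4.63),
    (1024, 93.55, 5.58),
    (8192, 151.76, 5.94),
]
for nodes, cpu_us, sharp_us in rows:
    print(nodes, round(cpu_us / sharp_us, 1))  # 10.6, 16.8, 25.5
```

Note that the SHArP column stays nearly flat (~4.2 µs to ~5.9 µs) across a 256x increase in node count, while the CPU-based latency grows steadily; that flat scaling is what drives the ratio from ~10X toward ~25X.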
SHArP Performance – First Results (Partial Implementation)
3.5X Performance Improvement on 64 Nodes
The Intelligence is Moving to the Interconnect
Communication Frameworks (MPI, SHMEM/PGAS)
The Only Approach to Deliver 10X Performance Improvements
[Diagram: interconnect offload stack beneath the applications and communication frameworks – Transport, RDMA, SR-IOV, Collectives, Peer-Direct, GPUDirect, MPI/SHMEM Offloads, and more; delivery milestones marked Q1'16 and Q3'16.]
Introducing ConnectX-4 Lx Programmable Adapter
Scalable, Efficient, High-Performance and Flexible Solution
Security
Cloud/Virtualization
Storage
High Performance Computing
Precision Time Synchronization
Networking + FPGA: Mellanox Acceleration Engines and FPGA Programmability on One Adapter
InfiniBand Router – In Progress
§ Isolation between InfiniBand subnets
§ Simple connectivity between different topologies
• Enable sharing a common storage network by multiple disconnected subnets
§ Supports 2^128 nodes (practically unlimited system size)
SB7780
§ Router implements GID to LID mapping
§ SM allocates Alias GID to HCA
§ Address resolution
• IP based applications
- Name to IP (standard), IP to GID using new API
• Pure IB applications
- Upon LID assignment change, GID DNS is updated
InfiniBand Router Details
[Diagram: three IB subnets interconnected through routers sharing a GID DNS; each subnet runs an SM with SRPM and SRTM components, and each HCA hosts a GID DNS Agent; the router side comprises the RTM and per-port RPAs.]
RTM: Routing Table Manager
SRTM: Subnet Routing Table Manager
RPA: Router Port Agent
SRPM: Subnet Router Port Manager
GID DNS: IP to GID resolution
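The address-resolution flow described above can be sketched as a toy model (plain Python; all names and table entries are made-up illustrative values, and a real deployment would use InfiniBand management and verbs APIs rather than dictionaries): a node name resolves to an IP through standard DNS, the IP resolves to a GID through the GID DNS, and the router maps the GID to the destination subnet's current LID, so a LID reassignment by the SM only requires updating the router's table.

```python
# Toy model of the router's address-resolution chain.
name_to_ip = {"compute-17": "10.0.2.17"}           # standard DNS step
ip_to_gid = {"10.0.2.17": "fe80::0002:c903:17"}    # GID DNS step
gid_to_lid = {"fe80::0002:c903:17": 0x2A}          # router's GID-to-LID mapping

def resolve(hostname):
    """Resolve a node name to the (GID, LID) pair the router forwards to."""
    ip = name_to_ip[hostname]
    gid = ip_to_gid[ip]
    return gid, gid_to_lid[gid]

print(resolve("compute-17"))  # ('fe80::0002:c903:17', 42)

# When the SM reassigns LIDs in the destination subnet, only the router's
# table changes; the application-visible GID stays stable.
gid_to_lid["fe80::0002:c903:17"] = 0x51
print(resolve("compute-17")[1])  # 81
```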
Multi-Host Socket Direct – Low Latency Socket Communication
§ Each CPU with direct network access
§ QPI avoidance for I/O – improves performance
§ Enables GPU / peer direct on both sockets
§ Solution is transparent to software
[Diagram: dual-socket servers in which each CPU has a direct path to the network adapter, avoiding the QPI hop for I/O.]
Multi-Host Socket Direct Performance
50% Lower CPU Utilization
20% Lower Latency
Multi Host Evaluation Kit
Lower Application Latency, Free-up CPU
Mellanox InfiniBand Leadership Over Future Competition
• Message rate: 44% higher
• Switch latency: 20% lower
• Power consumption per switch port: 25% lower
• Scalability / CPU efficiency: 2X higher
• 2014: 100 Gb/s link speed – Gain Competitive Advantage Today
• 2017: 200 Gb/s link speed – Protect Your Future
Smart Network for Smart Systems: RDMA, Acceleration Engines, Programmability
Higher Performance, Unlimited Scalability, Higher Resiliency – Proven!
Technology Roadmap – One-Generation Lead over the Competition
[Roadmap timeline, 2000–2020, Terascale to Petascale to Exascale: 20G, 40G, 56G, and 100G (2015) generations shipping, 200G next, Mellanox 400G beyond; TOP500 milestones include Virginia Tech (Apple), 3rd in 2003, and "Roadrunner", 1st – both Mellanox Connected.]