- 1. Reducing the Runtime of Collective Communications
ISC’10 Birds of a Feather Session
June 3, 2010
© 2010 Voltaire Inc.
- 2. Agenda
► Scalability Challenges for Group Communication
► Voltaire Fabric Collective Accelerator™ (FCA™)
• Yaron Haviv, CTO, Voltaire
► Customer Experience: University of Braunschweig
• Josef Schüle
- 3. About Voltaire (NASDAQ: VOLT)
► Leading provider of scale-out data center fabrics
• Used by more than 30% of Fortune 100 companies
• Hundreds of installations of over 1000 servers
► Addressing the challenges of HPC, virtualized data centers
and clouds
► More than half of TOP500 InfiniBand sites
► InfiniBand and 10GbE scale-out fabrics
End-to-End Scale-out Fabric Product Line
- 4. MPI Collectives
► Collective Operations = Group Communication (All-to-All, One-to-All, All-to-One)
► Synchronous by nature = consume many “Wait” cycles on large clusters
► Popular examples:
• Reduce
• Allreduce
• Barrier
• Bcast
• Gather
• Allgather
[Chart: Collective Operations % of MPI Job Runtime, 0–100%, for ANSYS FLUENT, SAGE, CPMD, LSTC LS-DYNA, CD-Adapco STAR-CD, Dacapo]
Your cluster might be spending half its time on idle collective cycles
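The “Wait cycles” point can be illustrated with a toy model (plain Python, not Voltaire code): a synchronous collective cannot complete until its slowest participant arrives, so jitter on any single rank stalls every rank. The rank count and 20% jitter figure below are illustrative assumptions.

```python
# Toy model: per-rank wait time in a synchronous collective is the gap
# between that rank's arrival and the LAST rank's arrival.
import random

random.seed(0)

def collective_wait(arrival_times):
    """Each rank waits from its own arrival until the slowest rank arrives."""
    finish = max(arrival_times)
    return [finish - t for t in arrival_times]

# 1024 ranks whose compute phases vary by up to 20% (OS jitter, slow nodes)
arrivals = [100 * (1 + random.uniform(0, 0.2)) for _ in range(1024)]
waits = collective_wait(arrivals)

avg_wait = sum(waits) / len(waits)
print(f"average wait: {avg_wait:.1f} time units per collective")
```

The slowest rank waits zero time; everyone else pays for it, which is why the chart above shows collectives dominating MPI runtime at scale.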
- 5. Collective Example - Allreduce
► Allreduce – The Concept
• Perform a specific operation on all arguments, and distribute the result to all processes. Example with the SUM operation: inputs 8, 7, 6 and 9 reduce to 30, and every process receives 30.
► Allreduce on a 4-node cluster
[Diagram: four nodes holding ranks 1–8 exchange partial sums until every rank holds the global result]
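A conventional server-based allreduce can be sketched as a recursive-doubling exchange (a common textbook algorithm, shown here as a plain-Python simulation rather than any particular MPI implementation). With the inputs 8, 7, 6 and 9, every rank ends up holding 30.

```python
# Simulate recursive-doubling allreduce with SUM: in each round, rank r
# exchanges its partial sum with rank (r XOR step); after log2(P) rounds
# every rank holds the global sum.
def allreduce_sum(values):
    """Return the per-rank results of a recursive-doubling allreduce."""
    vals = list(values)
    p = len(vals)
    assert p & (p - 1) == 0, "power-of-two rank count, for simplicity"
    step = 1
    while step < p:
        new = vals[:]
        for rank in range(p):
            partner = rank ^ step          # exchange partner this round
            new[rank] = vals[rank] + vals[partner]
        vals = new
        step *= 2
    return vals

print(allreduce_sum([8, 7, 6, 9]))  # -> [30, 30, 30, 30]
```

Note the cost: every rank sends and receives in every round, and the round count grows with the process count, which is what breaks down at petascale on the next slide.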
- 6. Now try running it on a Petascale machine…
[Diagram: tens of thousands of cores behind hundreds of edge switches (1 hop) and dozens of core switches (3 hops)]
Single Operation > 3000usec – Not Scalable
- 7. The Challenge:
Collective Operations Scalability
► Grouping algorithms are unaware of the topology and inefficient
► Network congestion due to “All-to-All” communication
► Slow nodes & OS involvement impair scalability and predictability
► The more powerful servers get (GPUs, more cores), the poorer collectives scale in the fabric
[Chart: expected vs. actual collective scaling]
- 8. The Voltaire InfiniBand Fabric:
Equipped for the Challenge
► Grid Director Switches: Fabric Processing Power
► Unified Fabric Manager (UFM): Topology-Aware Orchestrator
Fabric computing in use to address the collective challenge
- 9. Introducing:
Voltaire Fabric Collective Accelerator
► Grid Director Switches: collective operations offloaded to switch CPUs
► FCA Manager: topology-based collective tree; separate virtual network for result distribution; IB multicast; integration with job schedulers
► Unified Fabric Manager (UFM): fabric topology-aware orchestrator
► FCA Agent: inter-core processing localized & optimized
Breakthrough performance with no additional hardware
- 10. Efficient Collectives with FCA
1. Pre-config
2. Inter-core processing
3. 1st tier offload
4. 2nd tier offload (result at root)
5. Result distribution (single message)
6. Allreduce on 100K cores in 25 usec
[Diagram: first-tier sums (36) combine into edge-switch sums (648) and a root result (11664), which is multicast back to every rank]
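The numbered steps above can be sketched as a two-tier reduction tree (a simplified plain-Python model of the flow; the function name, tier sizes, and node layout are illustrative assumptions, not FCA’s actual protocol):

```python
# Model of a hierarchical allreduce: reduce inside each node, then at
# the edge-switch tier, then at the root, then fan the result back out
# in a single multicast-style step.
def fabric_allreduce(node_values, nodes_per_switch=2):
    # Step 2: inter-core reduction inside each node
    node_sums = [sum(cores) for cores in node_values]
    # Step 3: first-tier offload at the edge switches
    edge_sums = [sum(node_sums[i:i + nodes_per_switch])
                 for i in range(0, len(node_sums), nodes_per_switch)]
    # Step 4: second-tier offload; the result lands at the root
    root = sum(edge_sums)
    # Step 5: one multicast message distributes the result to all ranks
    return [[root] * len(cores) for cores in node_values]

nodes = [[1, 2, 5, 6], [3, 4, 7, 8]]   # 2 nodes x 4 cores, as in the diagram
print(fabric_allreduce(nodes))          # every core ends with 36
```

Each tier sends one message per child instead of all-to-all exchanges, which is where the congestion and message-count savings come from.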
- 11. UFM Integrated With Job Schedulers
Matching Jobs Automatically
► Job submitted in scheduler → created in UFM with matching:
• QoS
• Routing
• Placement
• Collectives
► Application-level monitoring & optimization measurements
► Fabric-wide policy pushed to match application requirements
- 12. FCA Benefits:
Slashing Job Runtime
► Slashing Runtime
[Chart: IMB Allreduce on 2048 cores – Open MPI: >3000usec vs. FCA: <30usec]
► Eliminating Runtime Variation
• OS jitter – eliminated in switches
• Traffic congestion – significantly lower number of messages
• Cross-application interference – collectives offloaded on a private virtual network
[Chart: completion time distribution, server-based vs. FCA-based collectives]
- 13. FCA Benefits:
Unprecedented Scalability on HPC Clusters
[Chart, log scale: latency (1–10000 usec) vs. process count (0–1200) for ompi-Allreduce-bynode, ompi-Barrier-bynode, FCA-Allreduce and FCA-Barrier; annotations: > 180X, > 50%]
► Extreme performance improvement on raw collectives
► Scale according to number of switch hops, not number of nodes – O(log18)
► As process count increases, the % of time spent in MPI increases, and the % of time spent in collectives increases
Enabling capability computing on HPC clusters
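The O(log18) claim can be made concrete with a back-of-the-envelope model (an assumption on my part, not a Voltaire formula: the 18 is read as the per-switch fan-out of the reduction tree). The number of offload steps then tracks the tree depth, not the node count.

```python
# Depth of a radix-18 reduction tree: each switch combines up to 18
# children, so capacity multiplies by 18 per level. Integer arithmetic
# avoids floating-point log() edge cases at exact powers of 18.
def tree_depth(nodes, radix=18):
    """Number of tree levels needed to cover `nodes` leaves."""
    depth, capacity = 1, radix
    while capacity < nodes:
        depth += 1
        capacity *= radix
    return depth

for n in (18, 324, 5832, 100_000):
    print(f"{n:>7} nodes -> {tree_depth(n)} tree levels")
```

Going from 18 nodes to 100,000 adds only three tree levels, which is why collective latency stays nearly flat in the chart while the flat (per-node) algorithms keep climbing.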
- 14. Additional Benefits
► Simple, fully integrated
• No changes to application required
► Tolerance to higher oversubscription (blocking) ratio
• Same performance at lower cost
► Enables use of non-blocking collectives
• Part of future MPI implementations
• FCA guarantees no computation power penalty
► Reduced fabric congestion
• Avoids interference with other jobs
- 16. About University of Braunschweig
► General Overview
• Founded in 1745
• 120 institutes with ca. 2900 employees
• Ca. 13000 students
► Main Fields of Research
• Mobility and transport (road, rail, air and space)
• Biological and biotechnological research
• Digital television
- 17. System Configuration
Newest installation:
► Node type: NEC HPC 1812Rb-2
• CPU: 2 x Intel X5550, MEM: 6 x 2GB, IB: 1 x Infinihost DDR onboard
► System Configuration: 186 nodes
• 24 nodes per switch (DDR), 12 QDR links to tier2 switches (non-blocking)
► OS: CentOS 5.4
► Open MPI: 1.4.1
► FCA: 1.0_RC3 rev 2760
► UFM: 2.3 RC7
► Switch: 3.0.629
[Diagram: edge switches with 24 x DDR node links and 4 x QDR uplinks each]
- 18. FCA Performance:
A Real Cluster Example with 2048 Ranks
[Chart: collective latency (usec, log scale 10–10000) vs. number of ranks (0–2500, 16 ranks per node) for ompi-Allreduce and ompi-Barrier vs. FCA-Allreduce and FCA-Barrier – 180x faster]
- 19. Real Application Results
► OpenFOAM
• Open source CFD solver produced by a commercial company, OpenCFD
• Used by many leading automotive companies
[Chart: OpenFOAM CFD aerodynamic benchmark (64 cores), runtime in seconds: Open MPI 1.4.1 vs. Open MPI 1.4.1 + FCA – 41% better]
► Expected benefits for several other applications
• e.g. DLPOLY (molecular dynamics)
- 20. Voltaire Fabric Collective Accelerator
Summary
► Fully Integrated Fabric computing offload
• Combination of SW & HW in a single solution
• Offloading blocking computational tasks
• Algorithms leveraging the topology for computation (trees)
► Extreme MPI performance & scalability
• Capability computing on commodity clusters
• Two orders of magnitude (100x) faster collective runtime
• Scale by number of hops, not number of nodes
• Variation eliminated - Consistent results
► Transparent to the application
• Plug & play - No need for code changes
Accelerate your fabric!