- 1. Reducing the Runtime of Collective Communications
ISC’10 Birds of a Feather Session
June 3, 2010
© 2010 Voltaire Inc.
- 2. Agenda
► Scalability Challenges for Group Communication
► Voltaire Fabric Collective Accelerator™ (FCA™)
• Yaron Haviv, CTO, Voltaire
► Customer Experience: University of Braunschweig
• Josef Schüle
- 3. About Voltaire (NASDAQ: VOLT)
► Leading provider of scale-out data center fabrics
• Used by more than 30% of Fortune 100 companies
• Hundreds of installations of over 1000 servers
► Addressing the challenges of HPC, virtualized data centers
and clouds
► More than half of TOP500 InfiniBand sites
► InfiniBand and 10GbE scale-out fabrics
End-to-End Scale-out Fabric Product Line
- 4. MPI Collectives
► Collective Operations = Group Communication (All-to-All, One-to-All, All-to-One)
► Synchronous by nature = consume many “Wait” cycles on large clusters
► Popular examples:
• Reduce
• Allreduce
• Barrier
• Bcast
• Gather
• Allgather
[Chart: Collective Operations % of MPI Job Runtime, 0–100%, for ANSYS FLUENT, SAGE, CPMD, LSTC LS-DYNA, CD-Adapco STAR-CD, Dacapo]
Your cluster might be spending half its time on idle collective cycles
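The “Wait cycles” point can be illustrated with a toy model (plain Python, not Voltaire code): a synchronous collective cannot complete until its slowest participant arrives, so jitter on any single rank stalls every rank. The rank count and 20% jitter figure below are illustrative assumptions.

```python
# Toy model: per-rank wait time in a synchronous collective is the gap
# between that rank's arrival and the LAST rank's arrival.
import random

random.seed(0)

def collective_wait(arrival_times):
    """Each rank waits from its own arrival until the slowest rank arrives."""
    finish = max(arrival_times)
    return [finish - t for t in arrival_times]

# 1024 ranks whose compute phases vary by up to 20% (OS jitter, slow nodes)
arrivals = [100 * (1 + random.uniform(0, 0.2)) for _ in range(1024)]
waits = collective_wait(arrivals)

avg_wait = sum(waits) / len(waits)
print(f"average wait: {avg_wait:.1f} time units per collective")
```

The slowest rank waits zero time; everyone else pays for it, which is why the chart above shows collectives dominating MPI runtime at scale.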
- 5. Collective Example - Allreduce
► Allreduce – The Concept
• Perform a specific operation on all arguments, and distribute the result to all processes. Example with the SUM operation: inputs 8, 7, 6 and 9 reduce to 30, and every process receives 30.
► Allreduce on a 4-node cluster
[Diagram: four nodes holding ranks 1–8 exchange partial sums until every rank holds the global result]
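A conventional server-based allreduce can be sketched as a recursive-doubling exchange (a common textbook algorithm, shown here as a plain-Python simulation rather than any particular MPI implementation). With the inputs 8, 7, 6 and 9, every rank ends up holding 30.

```python
# Simulate recursive-doubling allreduce with SUM: in each round, rank r
# exchanges its partial sum with rank (r XOR step); after log2(P) rounds
# every rank holds the global sum.
def allreduce_sum(values):
    """Return the per-rank results of a recursive-doubling allreduce."""
    vals = list(values)
    p = len(vals)
    assert p & (p - 1) == 0, "power-of-two rank count, for simplicity"
    step = 1
    while step < p:
        new = vals[:]
        for rank in range(p):
            partner = rank ^ step          # exchange partner this round
            new[rank] = vals[rank] + vals[partner]
        vals = new
        step *= 2
    return vals

print(allreduce_sum([8, 7, 6, 9]))  # -> [30, 30, 30, 30]
```

Note the cost: every rank sends and receives in every round, and the round count grows with the process count, which is what breaks down at petascale on the next slide.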
- 6. Now try running it on a Petascale machine…
[Diagram: tens of thousands of cores behind hundreds of edge switches (1 hop) and dozens of core switches (3 hops)]
Single Operation > 3000usec – Not Scalable
- 7. The Challenge:
Collective Operations Scalability
► Grouping algorithms are unaware of the topology and inefficient
► Network congestion due to “All-to-All” communication
► Slow nodes & OS involvement impair scalability and predictability
► The more powerful servers get (GPUs, more cores), the poorer collectives scale in the fabric
[Chart: expected vs. actual collective scaling]
- 8. The Voltaire InfiniBand Fabric:
Equipped for the Challenge
► Grid Director Switches: Fabric Processing Power
► Unified Fabric Manager (UFM): Topology-Aware Orchestrator
Fabric computing in use to address the collective challenge
- 9. Introducing:
Voltaire Fabric Collective Accelerator
► Grid Director Switches: collective operations offloaded to switch CPUs
► FCA Manager: topology-based collective tree; separate virtual network for result distribution; IB multicast; integration with job schedulers
► Unified Fabric Manager (UFM): fabric topology-aware orchestrator
► FCA Agent: inter-core processing localized & optimized
Breakthrough performance with no additional hardware
- 10. Efficient Collectives with FCA
1. Pre-config
2. Inter-core processing
3. 1st tier offload
4. 2nd tier offload (result at root)
5. Result distribution (single message)
6. Allreduce on 100K cores in 25 usec
[Diagram: first-tier sums (36) combine into edge-switch sums (648) and a root result (11664), which is multicast back to every rank]
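The numbered steps above can be sketched as a two-tier reduction tree (a simplified plain-Python model of the flow; the function name, tier sizes, and node layout are illustrative assumptions, not FCA’s actual protocol):

```python
# Model of a hierarchical allreduce: reduce inside each node, then at
# the edge-switch tier, then at the root, then fan the result back out
# in a single multicast-style step.
def fabric_allreduce(node_values, nodes_per_switch=2):
    # Step 2: inter-core reduction inside each node
    node_sums = [sum(cores) for cores in node_values]
    # Step 3: first-tier offload at the edge switches
    edge_sums = [sum(node_sums[i:i + nodes_per_switch])
                 for i in range(0, len(node_sums), nodes_per_switch)]
    # Step 4: second-tier offload; the result lands at the root
    root = sum(edge_sums)
    # Step 5: one multicast message distributes the result to all ranks
    return [[root] * len(cores) for cores in node_values]

nodes = [[1, 2, 5, 6], [3, 4, 7, 8]]   # 2 nodes x 4 cores, as in the diagram
print(fabric_allreduce(nodes))          # every core ends with 36
```

Each tier sends one message per child instead of all-to-all exchanges, which is where the congestion and message-count savings come from.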
- 11. UFM Integrated With Job Schedulers
Matching Jobs Automatically
► Job submitted in scheduler → created in UFM with matching:
• QoS
• Routing
• Placement
• Collectives
► Application-level monitoring & optimization measurements
► Fabric-wide policy pushed to match application requirements
- 12. FCA Benefits:
Slashing Job Runtime
► Slashing Runtime
[Chart: IMB Allreduce on 2048 cores – Open MPI: >3000usec vs. FCA: <30usec]
► Eliminating Runtime Variation
• OS jitter – eliminated in switches
• Traffic congestion – significantly lower number of messages
• Cross-application interference – collectives offloaded on a private virtual network
[Chart: completion time distribution, server-based vs. FCA-based collectives]
- 13. FCA Benefits:
Unprecedented Scalability on HPC Clusters
[Chart, log scale: latency (1–10000 usec) vs. process count (0–1200) for ompi-Allreduce-bynode, ompi-Barrier-bynode, FCA-Allreduce and FCA-Barrier; annotations: > 180X, > 50%]
► Extreme performance improvement on raw collectives
► Scale according to number of switch hops, not number of nodes – O(log18)
► As process count increases, the % of time spent in MPI increases, and the % of time spent in collectives increases
Enabling capability computing on HPC clusters
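The O(log18) claim can be made concrete with a back-of-the-envelope model (an assumption on my part, not a Voltaire formula: the 18 is read as the per-switch fan-out of the reduction tree). The number of offload steps then tracks the tree depth, not the node count.

```python
# Depth of a radix-18 reduction tree: each switch combines up to 18
# children, so capacity multiplies by 18 per level. Integer arithmetic
# avoids floating-point log() edge cases at exact powers of 18.
def tree_depth(nodes, radix=18):
    """Number of tree levels needed to cover `nodes` leaves."""
    depth, capacity = 1, radix
    while capacity < nodes:
        depth += 1
        capacity *= radix
    return depth

for n in (18, 324, 5832, 100_000):
    print(f"{n:>7} nodes -> {tree_depth(n)} tree levels")
```

Going from 18 nodes to 100,000 adds only three tree levels, which is why collective latency stays nearly flat in the chart while the flat (per-node) algorithms keep climbing.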
- 14. Additional Benefits
► Simple, fully integrated
• No changes to application required
► Tolerance to higher oversubscription (blocking) ratio
• Same performance at lower cost
► Enables use of non-blocking collectives
• Part of future MPI implementations
• FCA guarantees no computation power penalty
► Reduced fabric congestion
• Avoids interference with other jobs
- 16. About University of Braunschweig
► General Overview
• Founded in 1745
• 120 institutes with ca. 2900 employees
• Ca. 13000 students
► Main Fields of Research
• Mobility and transport (road, rail, air and space)
• Biological and biotechnological research
• Digital television
- 17. System Configuration
Newest installation:
► Node type: NEC HPC 1812Rb-2
• CPU: 2 x Intel X5550, MEM: 6 x 2GB, IB: 1 x Infinihost DDR onboard
► System Configuration: 186 nodes
• 24 nodes per switch (DDR), 12 QDR links to tier2 switches (non-blocking)
► OS: CentOS 5.4
► Open MPI: 1.4.1
► FCA: 1.0_RC3 rev 2760
► UFM: 2.3 RC7
► Switch: 3.0.629
[Diagram: edge switches with 24 x DDR node links and 4 x QDR uplinks each]
- 18. FCA Performance:
A Real Cluster Example with 2048 Ranks
[Chart: collective latency (usec, log scale 10–10000) vs. number of ranks (0–2500, 16 ranks per node) for ompi-Allreduce and ompi-Barrier vs. FCA-Allreduce and FCA-Barrier – 180x faster]
- 19. Real Application Results
► OpenFOAM
• Open source CFD solver produced by a commercial company, OpenCFD
• Used by many leading automotive companies
[Chart: OpenFOAM CFD aerodynamic benchmark (64 cores), runtime in seconds: Open MPI 1.4.1 vs. Open MPI 1.4.1 + FCA – 41% better]
► Expected benefits for several other applications
• e.g. DLPOLY (molecular dynamics)
- 20. Voltaire Fabric Collective Accelerator
Summary
► Fully Integrated Fabric computing offload
• Combination of SW & HW in a single solution
• Offloading blocking computational tasks
• Algorithms leveraging the topology for computation (trees)
► Extreme MPI performance & scalability
• Capability computing on commodity clusters
• Two orders of magnitude (100x) faster collective runtime
• Scale by number of hops, not number of nodes
• Variation eliminated - Consistent results
► Transparent to the application
• Plug & play - No need for code changes
Accelerate your fabric!