Reducing the Runtime of Collective Communications
ISC’10 Birds of a Feather Session

June 3, 2010
© 2010 Voltaire Inc.
Agenda


    ►      Scalability Challenges for Group Communication

    ►      Voltaire Fabric Collective Accelerator™ (FCA™)

             • Yaron Haviv, CTO, Voltaire


    ►      Customer Experience:

           University of Braunschweig

             • Josef Schüle



© 2010 Voltaire Inc.                        Confidential - Internal   2
About Voltaire (NASDAQ: VOLT)

    ►      Leading provider of scale-out data center fabrics
             • Used by more than 30% of Fortune 100 companies
             • Hundreds of installations of over 1000 servers

    ►      Addressing the challenges of HPC, virtualized data centers
           and clouds
    ►      More than half of TOP500 InfiniBand sites
    ►      InfiniBand and 10GbE scale-out fabrics

        End-to-End Scale-out Fabric Product Line




MPI Collectives

    ►      Collective Operations = Group Communication (All to All, One to
           All, All to One)
    ►      Synchronous by nature = consume many “Wait” cycles on large
           clusters
    ►      Popular examples:
             • Reduce
             • Allreduce
             • Barrier
             • Bcast
             • Gather
             • Allgather

    [Chart: Collective Operations % of MPI Job Runtime. Y axis: Percentage (0 to 100). Applications: ANSYS FLUENT, SAGE, CPMD, LSTC LS-DYNA, CD-Adapco STAR-CD, Dacapo]




                Your cluster might be spending half its time on idle collective cycles
Collective Example - Allreduce

    ►      Allreduce – The Concept
             • Perform specific operation on all arguments, and distribute result to all
               processes. Example with SUM operation:


    [Diagram: four processes hold 8, 7, 6 and 9. Pairwise partial sums are 15 and 15. Every process ends with the total, 30.]

    ►      Allreduce on a 4-node cluster




    [Diagram: four nodes, each with eight cores holding the values 1 to 8. After the Allreduce, every core on every node holds the total, 144.]
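The SUM example on this slide can be sketched in a few lines of Python. This is a minimal simulation, not FCA or Open MPI code: it assumes a recursive-doubling exchange pattern (one common server-based Allreduce algorithm) and a power-of-two process count. The function name `allreduce` and the variable names are illustrative only.

```python
from operator import add


def allreduce(values, op=add):
    """Simulate a recursive-doubling Allreduce over a list of per-process values.

    Assumption: the number of processes is a power of two.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch handles power-of-two process counts only"
    current = list(values)
    step = 1
    while step < n:
        # In each round, rank r combines its partial result with rank (r XOR step).
        current = [op(current[r], current[r ^ step]) for r in range(n)]
        step *= 2
    return current


# Four processes holding 8, 7, 6 and 9 (the values in the concept diagram).
four_processes = allreduce([8, 7, 6, 9])

# Four nodes with eight cores each holding 1..8: sum inside each node first,
# then run the Allreduce across the four per-node sums.
node_sums = [sum(range(1, 9)) for _ in range(4)]
four_nodes = allreduce(node_sums)

print(four_processes)  # [30, 30, 30, 30]
print(four_nodes)      # [144, 144, 144, 144]
```

After the first round every process in the first example holds 15, which matches the partial sums in the diagram. The number of rounds grows with log2 of the process count, and every round involves every server.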

Now try running it on a Petascale machine…



    [Diagram: a multi-tier fabric, top to bottom]
             • Dozens of core switches (3 hops)
             • Hundreds of edge switches (1 hop)
             • Tens of thousands of cores




                              Single Operation > 3000usec – Not Scalable
The Challenge:
   Collective Operations Scalability

    ►      Grouping algorithms are unaware of the topology
           and inefficient


    ►      Network congestion due to “All-to-All”
           communication


    ►      Slow nodes & OS involvement impair scalability
           and predictability
           [Figure labels: Expected, Actual]




    ►      The more powerful servers get (GPUs, more
           cores), the poorer collectives scale in the fabric
The Voltaire InfiniBand Fabric:
   Equipped for the Challenge

    ►      Grid Director Switches: Fabric Processing Power

    ►      Unified Fabric Manager (UFM): Topology Aware Orchestrator



                   Fabric computing in use to address the collective challenge
Introducing:
   Voltaire Fabric Collective Accelerator

    ►      Grid Director Switches:
             • Collective operations offloaded to switch CPUs

    ►      FCA Manager:
             • Topology-based collective tree
             • Separate virtual network
             • IB multicast for result distribution
             • Integration with job schedulers

    ►      FCA Agent:
             • Inter-core processing localized & optimized



                       Breakthrough performance with no additional hardware
Efficient Collectives with FCA

    1. Pre-config
    2. Inter-core processing
    3. 1st tier offload
    4. 2nd tier offload (result at root)
    5. Result distribution (single message)
    6. Allreduce on 100K cores in 25 usec

    [Diagram: cores on each node hold the values 1 to 8. Each node's partial sum is 36, each first-tier switch holds 648, and the root holds 11664, which is then distributed back to every core.]
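The partial sums in the diagram (36, 648, 11664) can be reproduced with a short Python sketch of a tree reduction. It rests on assumptions that the slide does not state outright: a fan-out of 18 children per switch tier is inferred from 648 / 36 = 18 and 11664 / 648 = 18, and the function name `tier_sums` is hypothetical. It models only the arithmetic of steps 2 to 4, not the FCA protocol.

```python
def tier_sums(core_values, children_per_tier):
    """Return the partial sum held at each level of the reduction tree, bottom-up.

    core_values: values held by the cores of one node (every node is assumed identical).
    children_per_tier: assumed number of children under a switch at each tier.
    """
    partial = sum(core_values)              # step 2: inter-core processing on one node
    sums = [partial]
    for children in children_per_tier:      # steps 3 and 4: one pass per switch tier
        partial = sum(partial for _ in range(children))  # the switch adds its children
        sums.append(partial)
    return sums


sums = tier_sums(range(1, 9), [18, 18])     # 8 cores per node, two switch tiers
cores = 8 * 18 * 18                         # cores covered by this illustrative tree

print(sums)   # [36, 648, 11664]
print(cores)  # 2592
```

The root value is what step 5 sends back to all cores in a single message. Adding a tier multiplies the number of cores covered by the fan-out while adding only one more reduction step.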


UFM Integrated With Job Schedulers

    ►      Job Submitted in Scheduler
    ►      Matching Jobs Automatically Created in UFM
    ►      Fabric-wide Policy Pushed to Match Application Requirements
             • QoS
             • Routing
             • Placement
             • Collectives
    ►      Application Level Monitoring & Optimization Measurements
FCA Benefits:
   Slashing Job Runtime

    ►      Slashing Runtime
             [Chart: IMB Allreduce, 2048 Cores. Y axis: usec (0 to 4000). Open MPI: >3000 usec. FCA: <30 usec]




    ►      Eliminating Runtime Variation
             • OS jitter – eliminated in switches
             • Traffic congestion – significantly lower number of messages
             • Cross-application interference – collectives offloaded on a private virtual network

             [Chart: Completion Time Distribution. Series: FCA-based Collectives, Server-based Collectives]
FCA Benefits:
   Unprecedented Scalability on HPC Clusters
    [Chart: log-scale y axis from 1 to 10000, x axis from 0 to 1200. Series: ompi-Allreduce-bynode, ompi-Barrier-bynode, FCA-Allreduce, FCA-Barrier. Annotations: > 180X, > 50%]

    ►      Extreme performance improvement on raw collectives
    ►      Scale according to number of switch hops, not number
           of nodes – O(log18)

    ►      As process count increases
             • % of time spent in MPI increases
             • % of time spent in collectives increases



                             Enabling capability computing on HPC clusters
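To make "scale according to number of switch hops, not number of nodes" concrete, the sketch below counts how many tiers a reduction tree needs for a given number of endpoints. It assumes that "O(log18)" means a logarithm to base 18 of the endpoint count, which is a reading of the slide rather than something it states. The radix-2 column is a hypothetical comparison with a pairwise, server-based exchange. The function name `tree_depth` is illustrative.

```python
def tree_depth(endpoints, radix):
    """Smallest number of tiers t such that radix**t >= endpoints (integer arithmetic)."""
    depth, reach = 0, 1
    while reach < endpoints:
        reach *= radix
        depth += 1
    return depth


# Endpoint counts chosen for illustration; 18, 324 and 5832 are powers of 18.
for endpoints in (18, 324, 5832, 100_000):
    print(endpoints, tree_depth(endpoints, 18), tree_depth(endpoints, 2))
# 18 1 5
# 324 2 9
# 5832 3 13
# 100000 4 17
```

Under these assumptions, growing from 18 to 100,000 endpoints adds three tiers to a radix-18 tree, while a pairwise scheme goes from 5 rounds to 17.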
Additional Benefits


    ►      Simple, fully integrated
             • No changes to application required

    ►      Tolerance to higher oversubscription (blocking) ratio
             • Same performance at lower cost

    ►      Enables use of non-blocking collectives
             • Part of future MPI implementations

             • FCA guarantees no computation power penalty

    ►      Reduce fabric congestion
             • Avoid interference to other jobs


Customer Experience
                       University of Braunschweig


                                          June 3, 2010
About University of Braunschweig

    ►      General Overview
             • Founded in 1745
             • 120 institutes with ca. 2900 employees
             • Ca. 13000 students
    ►      Main Fields of Research
             • Mobility and transport (road, rail, air and space)
             • Biological and biotechnological research
             • Digital television




System Configuration

    Newest installation:
    ►      Node type: NEC HPC 1812Rb-2
               •       CPU: 2 x Intel X5550, MEM: 6 x 2GB, IB: 1 x Infinihost DDR onboard
    ►      System Configuration: 186 nodes
               •       24 nodes per switch (DDR), 12 QDR links to tier2 switches (non-blocking)
    ►      OS: CentOS 5.4
    ►      Open MPI: 1.4.1
    ►      FCA: 1.0_RC3 rev 2760
    ►      UFM: 2.3 RC7
    ►      Switch: 3.0.629

    [Diagram labels: 4 x QDR, 24 x DDR]
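The "non-blocking" claim can be checked with simple arithmetic. The sketch below assumes 4X link width, so the signalling rates are 20 Gb/s for DDR and 40 Gb/s for QDR, and it assumes edge switches are filled with 24 nodes each. The constant and function names are illustrative, not part of any Voltaire tool.

```python
import math

# Assumed 4X InfiniBand signalling rates in Gb/s.
DDR_4X_GBPS = 20
QDR_4X_GBPS = 40


def oversubscription(down_links, down_gbps, up_links, up_gbps):
    """Ratio of node-facing bandwidth to uplink bandwidth on one edge switch."""
    return (down_links * down_gbps) / (up_links * up_gbps)


# 24 DDR node links against 12 QDR uplinks per edge switch.
ratio = oversubscription(24, DDR_4X_GBPS, 12, QDR_4X_GBPS)

# Edge switches needed for 186 nodes at 24 nodes per switch.
edge_switches = math.ceil(186 / 24)

print(ratio)          # 1.0
print(edge_switches)  # 8
```

A ratio of 1.0 means the uplinks can carry everything the nodes can inject, which is what non-blocking refers to here. A ratio above 1.0 would indicate an oversubscribed (blocking) configuration.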




FCA Performance:
   A Real Cluster Example with 2048 Ranks

    [Chart: Collective latency (usec). Y axis: Latency (us), log scale from 10 to 10000. X axis: Number of ranks (16 ranks per node), 0 to 2500. Series: ompi-Allreduce, ompi-Barrier, FCA-Allreduce, FCA-Barrier. Annotations: 4000 microseconds, 180x faster]



Real Application Results

    ►      OpenFOAM
             • Open source CFD solver produced by a commercial company, OpenCFD
             • Used by many leading automotive companies

    [Chart: OpenFOAM CFD Aerodynamic Benchmark (64 cores). Y axis: Seconds (0 to 5000). Series: Open MPI 1.4.1, Open MPI 1.4.1 + FCA. Annotation: 41% better]



    ►      Expected benefits for several other applications
             • e.g. DL_POLY (molecular dynamics)
Voltaire Fabric Collective Accelerator
   Summary

    ►      Fully Integrated Fabric computing offload
             • Combination of SW & HW in a single solution
             • Offloading blocking computational tasks
             • Algorithms leveraging the topology for computation (trees)

    ►      Extreme MPI performance & scalability
             • Capability computing on commodity clusters
             • Two orders of magnitude, hundred-times faster collective runtime
             • Scale by number of hops, not number of nodes
             • Variation eliminated - Consistent results

    ►      Transparent to the application
             • Plug & play - No need for code changes


                                Accelerate your fabric!
Q&A





Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Voltaire - Reducing the Runtime of Collective Communications

  • 1. Reducing the Runtime of Collective Communications ISC’10 Birds of a Feather Session June 3, 2010 © 2010 Voltaire Inc.
  • 2. Agenda
      ► Scalability Challenges for Group Communication
      ► Voltaire Fabric Collective Accelerator™ (FCA™)
          • Yaron Haviv, CTO, Voltaire
      ► Customer Experience: University of Braunschweig
          • Josef Schüle
      © 2010 Voltaire Inc. Confidential - Internal
  • 3. About Voltaire (NASDAQ: VOLT)
      ► Leading provider of scale-out data center fabrics
          • Used by more than 30% of Fortune 100 companies
          • Hundreds of installations of over 1000 servers
      ► Addressing the challenges of HPC, virtualized data centers and clouds
      ► More than half of TOP500 InfiniBand sites
      ► InfiniBand and 10GbE scale-out fabrics
      End-to-End Scale-out Fabric Product Line
  • 4. MPI Collectives
      ► Collective Operations = Group Communication (All to All, One to All, All to One)
      ► Synchronous by nature = consume many “Wait” cycles on large clusters
      ► Popular examples:
          • Reduce
          • Allreduce
          • Barrier
          • Bcast
          • Gather
          • Allgather
      [Chart: Collective Operations % of MPI Job Runtime; y-axis: Percentage (0 to 100); applications: ANSYS FLUENT, SAGE, CPMD, LSTC LS-DYNA, CD-Adapco STAR-CD, Dacapo]
      Your cluster might be spending half its time on idle collective cycles
  • 5. Collective Example - Allreduce
      ► Allreduce – The Concept
          • Perform a specific operation on all arguments, and distribute the result to all processes.
          • Example with SUM operation: inputs 8, 7, 6 and 9 give partial sums 15 and 15, and the result 30 is delivered to every process.
      ► Allreduce on a 4-node cluster
      [Diagram: four nodes, each with eight cores holding the values 1 to 8; after the Allreduce every core holds 144]
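The SUM example on slide 5 can be traced with a small simulation. The sketch below is illustrative only: it models each process as a list entry and uses recursive doubling, one common Allreduce algorithm (the slides do not say which algorithm the MPI library uses). The function name `allreduce_sum` is hypothetical, and the process count is assumed to be a power of two.

```python
def allreduce_sum(values):
    """Simulate a SUM Allreduce over len(values) processes.

    Assumption for this sketch: the process count is a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two process count")
    current = list(values)
    step = 1
    while step < n:
        # In each round, rank r adds the partial sum held by rank (r XOR step).
        current = [current[r] + current[r ^ step] for r in range(n)]
        step *= 2
    return current


# Inputs from the slide: after round one every rank holds 15,
# after round two every rank holds 30.
print(allreduce_sum([8, 7, 6, 9]))
```

Every rank ends up with the same value, which is what distinguishes Allreduce from Reduce, where only the root holds the result.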
  • 6. Now try running it on a Petascale machine…
      [Diagram: dozens of core switches (3 hops), hundreds of edge switches (1 hop), tens of thousands of cores]
      Single Operation > 3000 usec – Not Scalable
  • 7. The Challenge: Collective Operations Scalability
      ► Grouping algorithms are unaware of the topology and inefficient
      ► Network congestion due to “All-to-All” communication
      ► Slow nodes & OS involvement impair scalability and predictability
      ► The more powerful servers get (GPUs, more cores), the poorer collectives scale in the fabric
      [Figure: Expected vs. Actual]
  • 8. The Voltaire InfiniBand Fabric: Equipped for the Challenge
      ► Grid Director Switches: Fabric Processing Power
      ► Unified Fabric Manager (UFM): Topology Aware Orchestrator
      Fabric computing in use to address the collective challenge
  • 9. Introducing: Voltaire Fabric Collective Accelerator
      ► Grid Director Switches: Fabric Processing Power
          • Collective operations offloaded to switch CPUs
      ► Unified Fabric Manager (UFM): Topology Aware Orchestrator
      ► FCA Manager:
          • Topology-based collective tree
          • Separate Virtual network for result distribution
          • IB multicast
          • Integration with job schedulers
      ► FCA Agent:
          • Inter-core processing localized & optimized
      Breakthrough performance with no additional hardware
  • 10. Efficient Collectives with FCA
      1. Pre-config
      2. Inter-core processing
      3. 1st tier offload
      4. 2nd tier offload (result at root)
      5. Result distribution (single message)
      6. Allreduce on 100K cores in 25 usec
      [Diagram: cores hold the values 1 to 8; partial sums are 36 per node, 648 after the 1st tier and 11664 at the root; 11664 is then distributed to every core]
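The numbers in the slide 10 diagram (36, 648, 11664) can be reproduced with a short tree reduction. This is a sketch under stated assumptions, not Voltaire's implementation: the function name `hierarchical_allreduce` is hypothetical, and the layout of 18 nodes per edge switch and 18 edge switches is inferred from the ratios 648/36 and 11664/648 rather than stated on the slide.

```python
def hierarchical_allreduce(cluster):
    """cluster[s][n] is the list of core values on node n under edge switch s."""
    # Step 2: inter-core processing, one partial sum per node.
    node_sums = [[sum(cores) for cores in switch] for switch in cluster]
    # Step 3: 1st tier offload, each edge switch combines its nodes.
    switch_sums = [sum(nodes) for nodes in node_sums]
    # Step 4: 2nd tier offload, the root combines the edge switches.
    root = sum(switch_sums)
    # Step 5: result distribution, every core receives the same value.
    result = [[[root] * len(cores) for cores in switch] for switch in cluster]
    return node_sums, switch_sums, root, result


cores = [1, 2, 5, 6, 3, 4, 7, 8]  # values shown on each node in the diagram
# Assumed layout: 18 edge switches with 18 nodes each.
cluster = [[list(cores) for _ in range(18)] for _ in range(18)]
node_sums, switch_sums, root, result = hierarchical_allreduce(cluster)
print(node_sums[0][0], switch_sums[0], root)
```

The number of combining stages depends on the depth of the switch tree, not on how many nodes hang off each switch.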
  • 11. UFM Integrated With Job Schedulers
      ► Job Submitted in Scheduler
      ► Matching Jobs Automatically Created in UFM
      ► Fabric-wide Policy Pushed to Match Application Requirements
          • QoS
          • Routing
          • Placement
          • Collectives
      ► Application Level Monitoring & Optimization Measurements
  • 12. FCA Benefits: Slashing Job Runtime
      ► Slashing Runtime
          • IMB Allreduce, 2048 cores: Open MPI > 3000 usec, FCA < 30 usec
      ► Eliminating Runtime Variation
          • OS jitter – eliminated in switches
          • Traffic congestion – significantly lower number of messages
          • Cross-application interference – collectives offloaded on a private virtual network
      [Chart: Completion Time Distribution, Server-based Collectives vs. FCA-based Collectives]
  • 13. FCA Benefits: Unprecedented Scalability on HPC Clusters
      [Chart: ompi-Allreduce-bynode, ompi-Barrier-bynode, FCA-Allreduce, FCA-Barrier; logarithmic y-axis from 1 to 10000, x-axis from 0 to 1200; annotations: > 180X, > 50%]
      ► Extreme performance improvement on raw collectives
      ► Scale according to number of switch hops, not number of nodes – O(log18)
      ► As process count increases
          • % of time spent in MPI increases
          • % of time spent in collectives increases
      Enabling capability computing on HPC clusters
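To make the "switch hops, not number of nodes" claim concrete, the sketch below counts how many tiers a radix-18 tree needs to cover a given number of endpoints and compares that with the rounds of a recursive-doubling exchange between hosts. Both function names are hypothetical, the radix of 18 is read from the slide's O(log18), and this is a simple counting model rather than a performance prediction.

```python
def tree_depth(endpoints, radix=18):
    """Smallest number of tiers t such that radix**t >= endpoints."""
    if endpoints < 1:
        raise ValueError("endpoints must be positive")
    depth, reach = 0, 1
    while reach < endpoints:
        reach *= radix
        depth += 1
    return depth


def doubling_rounds(processes):
    """Rounds of a recursive-doubling exchange: ceil(log2(processes))."""
    return tree_depth(processes, radix=2)


for n in (1024, 100_000):
    print(n, tree_depth(n), doubling_rounds(n))
```

Under these assumptions, growing from 1024 to 100,000 endpoints adds one tier to the radix-18 tree, while the host-based exchange needs seven more rounds.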
  • 14. Additional Benefits
      ► Simple, fully integrated
          • No changes to application required
      ► Tolerance to higher oversubscription (blocking) ratio
          • Same performance at lower cost
      ► Enables use of non-blocking collectives
          • Part of future MPI implementations
          • FCA guarantees no computation power penalty
      ► Reduce fabric congestion
          • Avoid interference to other jobs
  • 15. Customer Experience: University of Braunschweig, June 3, 2010
  • 16. About University of Braunschweig
      ► General Overview
          • Founded in 1745
          • 120 institutes with ca. 2900 employees
          • Ca. 13000 students
      ► Main Fields of Research
          • Mobility and transport (road, rail, air and space)
          • Biological and biotechnological research
          • Digital television
  • 17. System Configuration
      Newest installation:
      ► Node type: NEC HPC 1812Rb-2
          • CPU: 2 x Intel X5550, MEM: 6 x 2GB, IB: 1 x InfiniHost DDR onboard
      ► System Configuration: 186 nodes
          • 24 nodes per switch (DDR), 12 QDR links to tier2 switches (non-blocking)
      ► OS: CentOS 5.4
      ► Open MPI: 1.4.1
      ► FCA: 1.0_RC3 rev 2760
      ► UFM: 2.3 RC7
      ► Switch: 3.0.629
      [Diagram labels: 4 x QDR, 24 x DDR]
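The "non-blocking" label on slide 17 can be checked with simple arithmetic: 24 DDR downlinks carry the same aggregate bandwidth as 12 QDR uplinks. The sketch below assumes 4x-wide links (20 Gbit/s signalling for DDR, 40 Gbit/s for QDR); the slide does not state the link width, and the helper name `oversubscription` is hypothetical.

```python
# Assumed 4x InfiniBand signalling rates in Gbit/s.
DDR_4X_GBPS = 20
QDR_4X_GBPS = 40


def oversubscription(down_links, down_rate, up_links, up_rate):
    """Ratio of downlink to uplink bandwidth; 1.0 means non-blocking."""
    return (down_links * down_rate) / (up_links * up_rate)


# Edge switch from the slide: 24 DDR links to nodes, 12 QDR links to tier 2.
ratio = oversubscription(24, DDR_4X_GBPS, 12, QDR_4X_GBPS)
print(ratio)
```

A ratio above 1.0 would indicate an oversubscribed (blocking) edge switch, the case slide 14 says FCA tolerates better.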
  • 18. FCA Performance: A Real Cluster Example with 2048 Ranks
      [Chart: Collective latency (usec); y-axis: Latency (us), logarithmic from 10 to 10000; x-axis: Number of ranks (16 ranks per node), 0 to 2500; series: ompi-Allreduce, ompi-Barrier, FCA-Allreduce, FCA-Barrier; annotations: 4000 Microsecond, 180x Faster]
  • 19. Real Application Results
      ► OpenFoam
          • Open source CFD solver produced by a commercial company, OpenCFD
          • Used by many leading automotive companies
      [Chart: Open Foam CFD Aerodynamic Benchmark (64 cores); y-axis: Seconds (0 to 5000); series: Open MPI 1.4.1, Open MPI 1.4.1 + FCA; annotation: 41% better]
      ► Expected benefits for several other applications
          • e.g. DLPOLY (molecular dynamics)
  • 20. Voltaire Fabric Collective Accelerator Summary
      ► Fully Integrated Fabric computing offload
          • Combination of SW & HW in a single solution
          • Offloading blocking computational tasks
          • Algorithms leveraging the topology for computation (trees)
      ► Extreme MPI performance & scalability
          • Capability computing on commodity clusters
          • Two orders of magnitude, hundred-times faster collective runtime
          • Scale by number of hops, not number of nodes
          • Variation eliminated - Consistent results
      ► Transparent to the application
          • Plug & play - No need for code changes
      Accelerate your fabric!
  • 21. Q&A