WHITE PAPER

Understanding Low and Scalable Message Passing Interface Latency
Latency Benchmarks for High Performance Computing
QLogic InfiniBand Solutions Offer 70% Advantage Over the Competition


Executive Summary

Considerable improvements in InfiniBand® (IB) interconnect technology for High Performance Computing (HPC) applications have pushed bandwidth to a point where streaming large amounts of data off-node is nearly as fast as within a node. However, latencies for small-message transfers have not kept up with memory subsystems, and are increasingly the bottleneck in high performance clusters.

Different IB solutions provide dramatically varying latencies, especially as cluster sizes scale upward. Understanding how latencies will scale as your cluster grows is critical to choosing a network that will optimize your time to solution.

The traditional latency benchmarks, which send 0-byte messages between two adjacent systems, result in similar latency measurements of about 1.4 microseconds (µs) for emerging DDR IB Host Channel Adapters (HCAs) from QLogic® and its competitors. However, on larger messages, or across more nodes in a cluster, QLogic shows a 60-70% latency advantage over competitive offerings. These scalable latency measurements indicate why QLogic IB products provide a significant advantage on real HPC applications.


Key Findings

• The QLogic QLE7140 and QLE7280 HCAs outperform the Mellanox® ConnectX™ HCA in osu_latency at the 128-byte and 1024-byte message sizes by as much as 70%.

• The QLogic QLE7140 and QLE7280 HCAs outperform the ConnectX HCA in "scalable latency" by as much as 70% as the number of MPI processes increases.


Introduction

Today's HPC applications are overwhelmingly implemented using a parallel programming model known as the Message Passing Interface (MPI). To achieve maximum performance, HPC applications require a high-performing MPI solution, involving both a high-performance interconnect and highly tuned MPI libraries. InfiniBand has rapidly become the HPC interconnect of choice, appearing on 128 systems in the June 2007 Top 500 list. This rapid upswing was due to its high (2 GB/s) maximum bandwidth and its low (~1.4-3 µs) latency. High bandwidth is important because it allows an application to move large amounts of data very quickly. Low latency is important because it allows rapid synchronization and exchange of small amounts of data.



This white paper compares several benchmark results. For all of these results, the test bed consists of eight servers with standard "off-the-shelf" components and a QLogic SilverStorm® 9024 24-port DDR IB Switch.

Servers

• 2-socket rack-mounted servers
• 2.6 GHz dual-core AMD™ Opteron® 2218 processors
• 8 GB of DDR2-667 memory
• Tyan® Thunder n3600R (S2912) motherboards

The HCAs benchmarked were:

• Mellanox MHGH28-XTC (ConnectX) DDR HCA
• QLogic QLE7140 SDR HCA
• QLogic QLE7280 DDR HCA

All benchmarks were run using MVAPICH-0.9.9 as the MPI. For the Mellanox ConnectX HCAs, MVAPICH was run over the user-space verbs provided by the OFED-1.2.5 release. For the QLE7140 and QLE7280, MVAPICH was run over the InfiniPath™ 2.2 software stack, using the QLogic PSM API and OFED-1.2 based drivers.


Motivation for Studying Latency

Bandwidths over the network are approaching memory bandwidths within a system. Running the bandwidth microbenchmark from Ohio State (osu_bw) on a node, using the MVAPICH-0.9.9 implementation of MPI, measures a large-message intra-node (socket-to-socket) MPI bandwidth of 2 GB/s with message sizes of 512 KB or smaller. This bandwidth is at a 1:1 ratio with the bandwidth available from a DDR IB connection.

In contrast, socket-to-socket MPI latency within a system is 0.40 µs, while the fastest inter-node IB MPI latency achievable is 1.3-3 µs, a ratio of roughly 3x to 7x. Thus, small-message latency is one of the areas where there is a significant penalty for going off-node. Though there are some "back-to-back" 2-node benchmarks available to help, the latency they observe does not always represent the latency required from a high-performance cluster.


Different Ways to Measure Latency

MPI latency is often measured with one of a number of common microbenchmarks, such as osu_latency, the ping-pong component of the Intel® MPI Benchmarks (formerly the Pallas MPI Benchmarks), or the ping-pong latency component of the High Performance Computing Challenge (HPCC) suite of benchmarks. All of these microbenchmarks have the same basic pattern: each runs a single ping-pong test sending a 0- or 1-byte message between two cores on different cluster nodes, reporting the latency as half the time of one round-trip. The results discussed below come from running osu_latency with the three IB HCAs.
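As a concrete illustration of the ping-pong pattern just described, the following is a minimal sketch of such a latency test. It is not the osu_latency source; the iteration and warm-up counts are illustrative assumptions.

    /* Minimal MPI ping-pong latency sketch (illustrative; not the osu_latency
     * source). Rank 0 sends a small message to rank 1 and waits for the echo;
     * latency is reported as half the average round-trip time. Launch with two
     * ranks placed on different nodes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int warmup = 100;        /* assumed warm-up iterations */
        const int iters  = 10000;      /* assumed timed iterations */
        char buf[1] = {0};             /* 1-byte payload, as in the benchmarks above */
        int rank;
        double t0 = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < warmup + iters; i++) {
            if (i == warmup)
                t0 = MPI_Wtime();      /* start the clock after warm-up */
            if (rank == 0) {
                MPI_Send(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0) {
            double half_rtt_us = (MPI_Wtime() - t0) / iters / 2.0 * 1e6;
            printf("latency: %.2f us\n", half_rtt_us);
        }

        MPI_Finalize();
        return 0;
    }

Because only two processes ever communicate, a number produced this way says little about how an interconnect behaves at larger message sizes or larger process counts, which is exactly what the following sections examine.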



Judging from this test, the QLE7280, QLE7140, and ConnectX HCAs are all similar with respect to 0-byte latency. However, as the message size increases, significant differences are observed. For example, with a 128-byte message size the QLE7280 has a latency of 1.7 µs, whereas the ConnectX DDR adapter has a latency of 2.7 µs, a 60% performance advantage for the QLE7280. With a 1024-byte message size, the QLE7280's latency is 2.80 µs, a 70% advantage over ConnectX's latency of 4.74 µs.

Another test that measures latency is the RandomRing latency benchmark, part of the High Performance Computing Challenge (HPCC) suite of benchmarks. The benchmark tests latency across a series of randomly assigned rings, averaging across all of them.¹ It forces each process to talk to every other process in the cluster. This is important because it exposes a substantial difference in scalability at large core counts between HCAs that seemed so similar when running osu_latency.

As demonstrated, the QLE7280 and QLE7140 latencies remain largely flat with increasing process count. The ConnectX HCA's latency, however, rises as the number of processes increases. At 32 cores, the RandomRing latency of the QLogic QLE7280 DDR HCA is 1.33 µs, compared to 2.26 µs for the ConnectX HCA. This amounts to 70% better performance for the QLE7280, and the trend is toward larger differences at larger core counts. Since low latency is required even at large core counts to scale application performance to the greatest extent possible, the QLogic HCAs' consistently low latency is referred to as "scalable latency."

¹ The measurement differs from the ping-pong case in that messages are sent by two processes calling MPI_Sendrecv, rather than one process calling MPI_Send followed by MPI_Recv.




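The sketch below shows the communication pattern the footnote describes, with each process calling MPI_Sendrecv with its ring neighbors. It is a simplified illustration rather than the HPCC source: the real RandomRing benchmark averages over several randomly permuted rings, while this sketch uses the natural rank order, and the iteration count is an assumption.

    /* Sketch of a ring-latency measurement in the spirit of HPCC RandomRing
     * (simplified; the real benchmark averages over several randomly permuted
     * rings). Every process exchanges a small message with both ring neighbors
     * using MPI_Sendrecv, and the slowest per-exchange time is reported. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int iters = 1000;                /* assumed iteration count */
        char sbuf[1] = {0}, rbuf[1] = {0};
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;         /* ring neighbors */
        int left  = (rank - 1 + size) % size;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            /* send right while receiving from the left, then reverse direction */
            MPI_Sendrecv(sbuf, 1, MPI_CHAR, right, 0,
                         rbuf, 1, MPI_CHAR, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(sbuf, 1, MPI_CHAR, left,  0,
                         rbuf, 1, MPI_CHAR, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        double per_iter = (MPI_Wtime() - t0) / iters;

        double max_time;                       /* report the slowest process */
        MPI_Reduce(&per_iter, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("ring latency per exchange: %.2f us\n", max_time / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }

The point of the pattern is that every process is sending and receiving at once, so per-partner state and receive-side processing costs that a two-process ping-pong never exercises begin to show up in the measured latency.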



Understanding Why Latency Scalability Varies

To understand why latency scalability differs, it helps to understand, at least at a basic level, how MPI works. The following is the basic path of an MPI message from a sending application process to a receiving application process:

1. The sending process has data for some remote process.

2. The sender places the data in a buffer and passes a pointer to the MPI stack, along with an indication of who the receiver is and a tag for identifying the message.

3. A 'context' or 'communication id' identifies the communicator over which the point-to-point communication happens; only messages in the same communicator can be matched (there is no "any" communicator).

There are some variations in how this process is implemented, often based on the underlying mechanism for data transfer.

With many interconnects offering high-performance RDMA, there is a push towards utilizing it to improve MPI performance. RDMA is a one-sided communication model, allowing data to be transferred from one host to another without the involvement of the remote CPU. This has the advantage of reducing CPU utilization, but it requires the RDMA initiator to know where it is writing to or reading from, which in turn requires an exchange of information before the data can be sent.

Another mechanism is what is known as the Send/Recv model. This is a two-sided communication model in which the receiver maintains a single queue where all messages go initially, and the receiver is then involved in directing messages from that queue to their final destination. This has the advantage of not requiring remote knowledge to begin a transfer, as each side only needs to know about its own buffers, but at the cost of involving the CPU on both sides.

Most high performance interconnects provide mechanisms for both of these models, but make different optimization choices in tuning them. Almost all implementations use RDMA for large messages, where the setup cost of the initial information exchange is small relative to the cost of involving the CPU in transferring large amounts of data.

Thus, most MPIs implement a 'rendezvous protocol' for large messages, where the sender sends a 'request to send', the receiver pins the final destination buffer and sends back a key, and the sender does an RDMA write to the final location. MPIs implemented on OpenFabrics verbs do this explicitly, while the PSM layer provided with the QLogic QLE7100 and QLE7200 series HCAs does it behind the scenes.

However, for small messages the latency cost of that initial setup is large compared to the cost of sending the message itself. A round-trip on the wire can triple the cost of sending a small message, while copying a couple of cache lines from a receive buffer to their final location costs very little. This leads most implementors to use a Send/Recv based approach for small messages. However, on HCAs that have been tuned for RDMA to the exclusion of Send/Recv, this causes a large slowdown, resulting in poor latency. An RDMA write is much faster, but it requires that costly setup. The following describes a mechanism used to sidestep this problem.


Achieving Low Latency with RDMA

For interconnects that have been optimized for Remote Direct Memory Access (RDMA), it can be desirable to use RDMA not only for large messages but also for small messages. This is done without incurring the setup latency cost by mimicking a receive mailbox in memory. For each MPI process, the MPI library sets up a temporary memory location for every process in the job. The setup and coordination are done at initialization time, so by the time communication starts every MPI process knows the memory location it should write to, and can use RDMA. When receiving, the MPI library in the receiving process then checks each temporary memory location and copies any messages that have arrived to the correct buffers.

This can work well in small clusters or jobs, such as when running the common point-to-point microbenchmarks. Each receiving process has only one memory location to check, and can very quickly find and copy any incoming message.


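To see why this strategy is fast for a two-process microbenchmark but degrades as the job grows, the sketch below shows the receive-side polling loop of a per-peer RDMA mailbox scheme. The slot layout and field names are hypothetical illustrations of the general idea described above, not the data structures of any particular MPI library.

    /* Hypothetical receive-side polling for a per-peer RDMA "mailbox" scheme.
     * Each remote process owns one slot in local memory and deposits small
     * messages into it with an RDMA write; the 'ready' flag is assumed to be
     * written last so the receiver can detect a complete message. */
    #include <stdint.h>

    #define SLOT_PAYLOAD 128               /* assumed small-message size limit */

    typedef struct {
        volatile uint32_t ready;           /* set by the sender's RDMA write */
        uint32_t          length;
        uint8_t           payload[SLOT_PAYLOAD];
    } mailbox_slot_t;

    /* One slot per remote process, so each poll is O(number of peers). */
    void poll_mailboxes(mailbox_slot_t *slots, int npeers,
                        void (*deliver)(const void *msg, uint32_t len))
    {
        for (int peer = 0; peer < npeers; peer++) {
            if (slots[peer].ready) {
                deliver(slots[peer].payload, slots[peer].length);
                slots[peer].ready = 0;     /* hand the slot back to the sender */
            }
        }
    }

With one remote peer the loop touches a single slot, which is why two-node ping-pong numbers look excellent; with thousands of MPI processes every poll must walk thousands of slots, which is the scaling problem examined next.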



The issue with this approach is that it doesn't scale. With RDMA, each remote process needs its own temporary memory location to write to. Thus, as a cluster grows, the receiving process has to check an additional memory location for every remote process. In today's world of multicore processors and large clusters, the number of memory locations to check grows very quickly.

The per-local-process memory and host software time requirements of this algorithm go up linearly with the number of processors in the cluster. This means that in a cluster made up of N nodes with M cores each, per-process memory use and latency grow as O(M * (N-1)), while per-node memory use grows even faster, as O(M² * (N-1)).


A Scalable Solution: Send/Recv

A more scalable solution is to use Send/Recv. Because the location in memory where messages are placed is determined locally, all messages can go into a single queue with a single place to check, instead of requiring a memory location per remote process. The messages are then copied out, in the order they arrive, to the memory buffers posted by the application. Thus, the per-local-process memory requirements of this approach are constant, and the per-node memory requirements increase only with the size of the node.


Connection State

A final element, harder to measure but apparent in very large clusters, is the advantage of a connectionless protocol. PSM is based on a connectionless protocol, as opposed to the connected protocol (RC) used by most verbs-based MPIs.

The effect of a connected protocol is to require some amount of per-partner state, both on the host and on the chip. When the number of processes scales up, this can lead to unfavorable caching effects as data is sent to and received from the HCA. This can be mitigated to some extent using methods like Shared Receive Queues (SRQ) and Scalable RC, but it remains a problem for very large clusters using RC-based MPIs.

The QLogic approach with the PSM API sidesteps this by using a connectionless protocol and keeping only the minimum state necessary to ensure reliability. Investigations at Ohio State showed the advantages of a connectionless protocol at scale compared to an RC-based protocol, although their implementation was limited by the small MTU and lack of reliability of the UD IB protocol.² In another paper, the OSU investigators showed the need for a 'UD RDMA' approach in order to achieve full bandwidth.³

² http://nowlab.cse.ohio-state.edu/publications/conf-papers/2007/koop-ics07.pdf
³ http://nowlab.cse.ohio-state.edu/publications/conf-papers/2007/koop-cluster07.pdf

PSM takes account of all of these issues behind the scenes. It gives the MPI implementor all of the scalability of a connectionless protocol, without the need to develop yet another implementation of segmentation and reliability, and without running into the high-end bandwidth performance issues seen with UD.


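To put rough numbers on the scaling expressions above, the short program below evaluates O(M * (N-1)) and O(M² * (N-1)) for a hypothetical cluster; the node and core counts are assumptions chosen for illustration, not a measured configuration.

    /* Worked illustration of the scaling expressions above. N and M are
     * assumed values for a hypothetical cluster, not measurements. A per-peer
     * scheme (RDMA mailboxes, or RC connection state) needs one entry per
     * remote process, while a Send/Recv single queue is constant per process. */
    #include <stdio.h>

    int main(void)
    {
        const long N = 512;                  /* assumed number of nodes */
        const long M = 4;                    /* assumed MPI processes per node */

        long per_process = M * (N - 1);      /* O(M * (N-1))   -> 2,044 entries */
        long per_node    = M * M * (N - 1);  /* O(M^2 * (N-1)) -> 8,176 entries */

        printf("per-process mailbox slots / connections: %ld\n", per_process);
        printf("per-node total:                          %ld\n", per_node);
        printf("Send/Recv queues per process:            1\n");
        return 0;
    }

Even at this modest scale, each process in a per-peer scheme has a couple of thousand locations to poll or connections to track, while the connectionless Send/Recv path keeps a single queue to check regardless of job size, which is the behavior the "scalable latency" results reflect.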




Summary and Conclusion

This white paper explores latency measurements and illustrates how benchmarks measuring point-to-point latency may not be representative of the latencies that applications on large-scale clusters require. It also explains some of the underlying architectural reasons behind the varying approaches to low MPI latency, and how the QLogic QLE7140 and QLE7280 IB HCAs scale efficiently to large node counts.

The current trend towards RDMA in high-performance interconnects is very useful for applications with large amounts of data to move. Because system resources are already constrained, it is vital to limit CPU usage when moving large amounts of data through the system. However, a large and growing number of applications are more latency-bound than bandwidth-bound, and for those an approach to low latency that scales is necessary. The QLogic QLE7100 and QLE7200 series IB HCAs provide that scalable low latency.




Disclaimer
Reasonable efforts have been made to ensure the validity and accuracy of these performance tests. QLogic Corporation is not liable for any error in this published white paper or the results thereof. Variation in results may be a result of change in configuration or in the environment. QLogic specifically disclaims any warranty, expressed or implied, relating to the test results and their accuracy, analysis, completeness or quality.




www.qlogic.com

Corporate Headquarters: QLogic Corporation, 26650 Aliso Viejo Parkway, Aliso Viejo, CA 92656, 949.389.6000

Europe Headquarters: QLogic (UK) LTD., Surrey Technology Centre, 40 Occam Road, Guildford, Surrey GU2 7YG UK, +44 (0)1483 295825

© 2007 QLogic Corporation. Specifications are subject to change without notice. All rights reserved worldwide. QLogic, the QLogic logo, and SilverStorm are registered trademarks of QLogic Corporation. InfiniBand is a registered trademark of the InfiniBand Trade Association. AMD and Opteron are trademarks or registered trademarks of Advanced Micro Devices. Tyan is a registered trademark of Tyan Computer Corporation. Mellanox and ConnectX are trademarks or registered trademarks of Mellanox Technologies, Inc. InfiniPath is a trademark of PathScale, Inc. Intel is a registered trademark of Intel Corporation. All other brand and product names are trademarks or registered trademarks of their respective owners. Information supplied by QLogic Corporation is believed to be accurate and reliable. QLogic Corporation assumes no responsibility for any errors in this brochure. QLogic Corporation reserves the right, without notice, to make changes in product design or specifications.




