WHITE PAPER
Understanding Low and Scalable Message Passing Interface Latency
Latency Benchmarks for High Performance Computing
QLogic InfiniBand Solutions Offer 70% Advantage Over the Competition
Key Findings
• The QLogic QLE7140 and QLE7280 HCAs outperform the Mellanox® ConnectX™ HCA in osu_latency at the 128-byte and 1024-byte message sizes by as much as 70%.
• The QLogic QLE7140 and QLE7280 HCAs outperform the ConnectX HCA in "scalable latency" by as much as 70% as the number of MPI processes increases.

Executive Summary
Considerable improvements in InfiniBand® (IB) interconnect technology for High Performance Computing (HPC) applications have pushed bandwidth to a point where streaming large amounts of data off-node is nearly as fast as within a node. However, latencies for small-message transfers have not kept up with memory subsystems, and are increasingly the bottleneck in high performance clusters.

Different IB solutions provide dramatically varying latencies, especially as cluster sizes scale upward. Understanding how latencies will scale as your cluster grows is critical to choosing a network that will optimize your time to solution.

The traditional latency benchmarks, which send 0-byte messages between two adjacent systems, result in similar latency measurements of about 1.4 microseconds (µs) for emerging DDR IB Host Channel Adapters (HCAs) from QLogic® and competitors. However, on larger messages, or across more nodes in a cluster, QLogic shows a 60-70% latency advantage over competitive offerings. These scalable latency measurements indicate why QLogic IB products provide a significant advantage on real HPC applications.

Introduction
Today's HPC applications are overwhelmingly implemented using a parallel programming model known as the Message Passing Interface (MPI). To achieve maximum performance, HPC applications require a high-performing MPI solution, involving both a high-performance interconnect and highly tuned MPI libraries. InfiniBand has rapidly become the HPC interconnect of choice, appearing on 128 systems in the June 2007 Top 500 list. This rapid upswing was due to its high (2 GB/s) maximum bandwidth and its low (~1.4-3 µs) latency. High bandwidth is important because it allows an application to move large amounts of data very quickly. Low latency is important because it allows rapid synchronization and exchanges of small amounts of data.
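To put those two numbers in perspective with a rough, illustrative calculation: at 2 GB/s, streaming a 1 MB message takes roughly 500 µs, so a ~1.4 µs startup latency is negligible; an 8-byte synchronization message, by contrast, spends essentially all of its time in that fixed latency. Which factor dominates therefore depends entirely on an application's mix of message sizes.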
This white paper compares several benchmark results. For all of these results, the test bed consists of eight servers with standard "off-the-shelf" components, and a QLogic SilverStorm® 9024 24-port DDR IB Switch.

Servers
• 2-socket rack-mounted servers
• 2.6 GHz dual-core AMD™ Opteron® 2218 processors
• 8 GB of DDR2-667 memory
• Tyan® Thunder n3600R (S2912) motherboards

The HCAs benchmarked were:
• Mellanox MHGH28-XTC (ConnectX) DDR HCA
• QLogic QLE7140 SDR HCA
• QLogic QLE7280 DDR HCA

All benchmarks were run using MVAPICH-0.9.9 as the MPI. For the Mellanox ConnectX HCAs, MVAPICH was run over the user-space verbs provided by the OFED-1.2.5 release. For the QLE7140 and QLE7280, MVAPICH was run over the InfiniPath™ 2.2 software stack, using the QLogic PSM API and OFED-1.2 based drivers.

Motivation for Studying Latency
Bandwidths over the network are approaching memory bandwidths within a system. Running the bandwidth microbenchmark from Ohio State (osu_bw) on a node, using the MVAPICH-0.9.9 implementation of MPI, measures a large-message intra-node (socket-to-socket) MPI bandwidth of 2 GB/s with message sizes of 512 KB or smaller. This bandwidth is at a 1:1 ratio with the bandwidth available from a DDR IB connection.

In contrast, socket-to-socket MPI latency in either system is 0.40 µs, while the fastest achievable inter-node IB MPI latency is 1.3-3 µs: a ratio of roughly 3x to 7x between socket-to-socket and IB. Thus, small-message latency is one of the areas where there is a significant penalty for going off-node. Though there are some "back-to-back" 2-node benchmarks available to help, the latency they observe does not always represent the latency deliverable by a full high-performance cluster.

Different Ways to Measure Latency
MPI latency is often measured by one of a number of common microbenchmarks, such as osu_latency, the ping-pong component of the Intel® MPI Benchmarks (formerly the Pallas MPI Benchmarks), or the ping-pong latency component of the High Performance Computing Challenge (HPCC) suite of benchmarks. All of these microbenchmarks have the same basic pattern: each runs a single ping-pong test sending a 0- or 1-byte message between two cores on different cluster nodes, reporting the latency as half the time of one round-trip. The following graphs show the results of running osu_latency using the three IB HCAs.
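The ping-pong pattern that all of these microbenchmarks share is simple enough to sketch directly. The following is a minimal illustration in C with MPI, not the actual osu_latency source; the iteration count, the 1-byte message size, and the lack of a warm-up phase are simplifications made for the sketch.

    /* pingpong.c -- minimal MPI ping-pong latency sketch.
       Launch with exactly two ranks, one on each of two nodes. */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 10000

    int main(int argc, char **argv)
    {
        int rank;
        char buf[1] = {0};                     /* 1-byte message */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {                   /* rank 0 initiates */
                MPI_Send(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {                           /* rank 1 echoes back */
                MPI_Recv(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        /* one-way latency = half the average round-trip time */
        if (rank == 0)
            printf("latency: %.2f us\n",
                   (t1 - t0) * 1e6 / ITERS / 2);
        MPI_Finalize();
        return 0;
    }

Note that only two processes are ever involved, which is precisely why this style of test says little about behavior at scale.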
Judging from this test, the QLE7280, QLE7140, and ConnectX HCAs are all similar with respect to 0-byte latency. However, as the message size increases, significant differences are observed. For example, with a 128-byte message size, the QLE7280 has a latency of 1.7 µs, whereas the ConnectX DDR adapter has a latency of 2.7 µs, a 60% performance advantage for the QLE7280. With a 1024-byte message size, the QLE7280's latency is 2.80 µs, a 70% advantage over ConnectX's latency of 4.74 µs.

Another test that measures latency is the RandomRing latency benchmark, part of the High Performance Computing Challenge (HPCC) suite of benchmarks. The benchmark tests latency across a series of randomly assigned rings, averaging across all of them.[1] The benchmark forces each process to talk to every other process in the cluster. This is important because it reveals a substantial difference in scalability at large core counts between HCAs that seemed so similar when running osu_latency.

As demonstrated, the QLE7280 and QLE7140 latencies remain largely flat with increasing process count. The ConnectX HCA's latency, however, rises as the number of processes increases. At 32 cores, the RandomRing latency of the QLogic QLE7280 DDR HCA is 1.33 µs, compared to 2.26 µs for the ConnectX HCA. This amounts to 70% better performance for the QLE7280. The trend is toward larger differences at larger core counts. Since low latency is required even at large core counts to scale application performance to the greatest extent possible, the QLogic HCAs' consistently low latency is referred to as "scalable latency."

[1] The measurement differs from the ping-pong case in that the messages are sent by two processes calling MPI_Sendrecv, rather than one calling MPI_Send followed by MPI_Recv.
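The MPI_Sendrecv pattern described in the footnote can also be sketched. The following is an illustrative ring-latency measurement, not the HPCC source: the real RandomRing benchmark averages over many randomly permuted rings, whereas this sketch uses a single natural-order ring and an arbitrary iteration count.

    /* ring_latency.c -- sketch of a ring latency test in the spirit
       of HPCC RandomRing.  Run with one rank per core across the
       cluster; every rank exchanges with both ring neighbors. */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 10000

    int main(int argc, char **argv)
    {
        int rank, size;
        char sbuf[1] = {0}, rbuf[1];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;         /* ring neighbors */
        int left  = (rank - 1 + size) % size;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            /* every rank sends and receives at once, first one way
               around the ring and then the other */
            MPI_Sendrecv(sbuf, 1, MPI_CHAR, right, 0,
                         rbuf, 1, MPI_CHAR, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(sbuf, 1, MPI_CHAR, left,  0,
                         rbuf, 1, MPI_CHAR, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        /* each iteration performs two exchanges */
        double local_us = (MPI_Wtime() - t0) * 1e6 / ITERS / 2;

        double sum_us;                         /* average across ranks */
        MPI_Reduce(&local_us, &sum_us, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("ring latency: %.2f us\n", sum_us / size);
        MPI_Finalize();
        return 0;
    }

Unlike the two-rank ping-pong test, every process here communicates simultaneously, which is what exposes the scaling behavior discussed above.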
Understanding Why Latency Scalability Varies
To understand why latency scalability differs, it helps to understand, at least at a basic level, how MPI works. The following is the basic path of an MPI message, from a sending application process to a receiving application process.

1. The sending process has data for some remote process.
2. The sender places the data in a buffer and passes a pointer to the MPI stack, along with an indication of who the receiver is and a tag for identifying the message.
3. A 'context' or 'communicator id' identifies the context over which the point-to-point communication happens -- only messages in the same communicator can be matched (there is no "any" communicator).

There are some variations in how this process is implemented, often based on the underlying mechanism for data transfer.

With many interconnects offering high-performance RDMA, there is a push toward utilizing it to improve MPI performance. RDMA is a one-sided communication model, allowing data to be transferred from one host to another without the involvement of the remote CPU. This has the advantage of reducing CPU utilization, but requires the RDMA initiator to know where it is writing to or reading from. This requires an exchange of information before the data can be sent.

Another mechanism is the Send/Recv model. This is a two-sided communication model in which the receiver maintains a single queue where all messages go initially, and the receiver is then involved in directing messages from that queue to their final destination. This has the advantage of not requiring remote knowledge to begin a transfer, as each side only needs to know about its own buffers, but at the cost of involving the CPU on both sides.

Most high performance interconnects provide mechanisms for both of these models, but make different optimization choices in tuning them. Almost all implementations use RDMA for large messages, where the setup cost of the initial information exchange is small relative to the cost of involving the CPU in transferring large amounts of data.

Thus, most MPIs implement a 'rendezvous protocol' for large messages: the sender sends a 'request to send', the receiver pins the final-location buffer and sends back a key, and the sender then does an RDMA write to the final location. MPIs implemented on OpenFabrics verbs do this explicitly, while the PSM layer provided with the QLogic QLE7100 and QLE7200 series HCAs does it behind the scenes.

However, for small messages the latency cost of that initial setup is large compared to the cost of sending the message itself. A round-trip on the wire can triple the cost of sending a small message, while copying a couple of cache lines from a receive buffer to their final location costs very little. This leads most implementors to use a Send/Recv-based approach for small messages. However, on HCAs that have been tuned for RDMA to the exclusion of Send/Recv, this causes a large slowdown, resulting in poor latency. An RDMA write is much faster, but it requires that costly setup. The following describes a mechanism used to sidestep this problem.

Achieving Low Latency with RDMA
For interconnects that have been optimized for Remote Direct Memory Access (RDMA), it can be desirable to use RDMA not only for large messages but also for small messages. This is done without incurring the setup latency cost by mimicking a receive mailbox in memory. For each MPI process, the MPI library sets up a temporary memory location for every other process in the job. The setup and coordination are done at initialization time, so by the time communication starts, every MPI process knows the memory location to write to, and can use RDMA. When receiving, the MPI library in the receiving process checks each temporary memory location and copies any messages that have arrived to the correct buffers.

This can work well in small clusters or jobs, such as when running the common point-to-point microbenchmarks. Each receiving process has only one memory location to check, and can very quickly find and copy any incoming message.
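The receive side of this mailbox scheme can be sketched as follows. This is an illustrative sketch only: the mailbox layout, the single slot per peer, and the deliver callback are assumptions made for brevity, and a real implementation would also handle memory registration, multiple slots per peer, and flow control through the verbs API.

    /* Illustrative receive-side polling for a per-peer RDMA mailbox
       scheme.  The sender RDMA-writes the payload and length, then
       writes the 'arrived' flag last, so the receiver never sees a
       partially written message. */
    #include <stdint.h>

    #define MAX_MSG 2048

    typedef struct {
        volatile uint32_t arrived;   /* set last by the sender's RDMA write */
        uint32_t          len;       /* bytes valid in payload[] */
        char              payload[MAX_MSG];
    } mailbox_t;

    /* One mailbox per remote process; the addresses are exchanged
       once, at MPI initialization time, so later sends need no
       per-message setup. */
    static mailbox_t *mailboxes;     /* nremote entries */
    static int        nremote;

    /* Sweep every peer's mailbox, delivering anything that arrived.
       Note the cost: each sweep touches nremote cache lines, whether
       or not any message is waiting. */
    static void poll_mailboxes(void (*deliver)(int src,
                                               const char *msg,
                                               uint32_t len))
    {
        for (int src = 0; src < nremote; src++) {
            mailbox_t *mb = &mailboxes[src];
            if (mb->arrived) {
                deliver(src, mb->payload, mb->len);
                mb->arrived = 0;     /* free the slot for the next message */
            }
        }
    }

With two processes, the sweep is a single flag check; the next section examines what happens as the number of remote processes grows.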
The issue with this approach is that it doesn't scale. With RDMA, each remote process needs its own temporary memory location to write to. Thus, as a cluster grows, the receiving process has to check an additional memory location for every remote process. In today's world of multicore processors and large clusters, the number of memory locations to check grows very quickly.

The per-local-process memory and host software time requirements of this algorithm go up linearly with the number of processors in the cluster. This means that in a cluster made up of N nodes with M cores each, per-process memory use and latency grow as O(M * (N-1)), while per-node memory use grows even faster, as O(M² * (N-1)). For example, in a cluster of 64 quad-core nodes (N = 64, M = 4), each process must poll 4 * 63 = 252 mailboxes, and each node must host 4 * 252 = 1,008 of them.

A Scalable Solution: Send/Recv
A more scalable solution is to use Send/Recv. Because the location in memory where messages are placed is determined locally, all messages can go into a single queue with a single place to check, instead of requiring a memory location per remote process. The results are then copied out, in the order they arrive, to the memory buffers posted by the application. Thus, the per-local-process memory requirements of this approach are constant, and the per-node memory requirements increase only with the size of the node.

Connection State
A final element, which is harder to measure but apparent in very large clusters, is the advantage of a connectionless protocol. PSM is based on a connectionless protocol, as opposed to the connected protocol (RC) used by most verbs-based MPIs.

The effect of a connected protocol is to require some amount of per-partner state, both on the host and on the chip. When the number of processes scales up, this can lead to adverse caching effects as data is sent to and received from the HCA. This can be mitigated to some extent using methods like Shared Receive Queues (SRQ) and Scalable RC, but it remains a problem for very large clusters using RC-based MPIs.

The QLogic approach with the PSM API sidesteps this by using a connectionless protocol and keeping the minimum state necessary to ensure reliability. Investigations at Ohio State showed the advantages of a connectionless protocol at scale when compared to an RC-based protocol, but were limited by the small MTU and lack of reliability in the UD IB protocol.[2] In another paper, the investigators at OSU showed that a 'UD RDMA' approach was needed in order to achieve full bandwidth.[3]

PSM takes account of all of these issues behind the scenes. It gives the MPI implementor access to all of the scalability of a connectionless protocol, without the need to develop yet another implementation of segmentation and reliability, or to run into any of the high-end bandwidth performance issues seen with UD.

[2] http://nowlab.cse.ohio-state.edu/publications/conf-papers/2007/koop-ics07.pdf
[3] http://nowlab.cse.ohio-state.edu/publications/conf-papers/2007/koop-cluster07.pdf