DSMP
Overview of the
Distributed Symmetric Multiprocessing
Software Architecture
By Peter Robinson
Technical Marketing Manager
Symmetric Computing
Venture Development Center
University of Massachusetts - Boston
Boston MA 02125
Introduction
Distributed Symmetric Multiprocessing, or DSMP, is a new kernel extension (or kernel
enhancement) that extends the capabilities of the legacy Linux operating system so that it can support a
scalable, shared-memory architecture over a 40Gb InfiniBand-attached cluster. DSMP comprises
two unique software components: the host operating system (OS), which runs on the head node, and a
unique lightweight micro-kernel OS, which runs on all “other” servers that make up the cluster. The
host OS consists of a Linux image plus a new DSMP kernel, creating a new derivative work as noted in
Figure 1. The micro-kernel is a non-Linux based image that extends the function of the host OS over the
entire cluster. These two OS images (host and micro-kernel) are designed to run on commodity
Symmetric Multiprocessing (SMP) servers based on the AMD64 processor.
[Figure 1 – Host DSMP Software architecture. Blocks shown: System Call Interface (SCI), Process
Management (PM), Virtual File System (VFS), Memory Management (MM), Network Stack, DSMP Gasket
Interface, ARCH, Device Drivers (DD).]
The AMD64 architecture was selected over competing platforms for a number of reasons, the primary
one being price performance. Back in 2005, when we conceived DSMP, the AMD Opteron™ Processor was
the only x86 solution that supported a high-density, 4P direct connect architecture in a 1U form-factor.
As of 4Q09, AMD continues to provide the best value for 4P 1U servers and continues to offer the only
commercially viable 4P solution on the market today.
A look at supercomputing today
Supercomputing can be divided into two camps: proprietary shared-memory systems and
commodity Message Passing Interface (MPI) clusters. Shared-memory systems are based on commodity
processors such as the PowerPC, Itanium, or the ever-popular x86, and on commodity memory (DRAM
DIMMs). At the core of most shared-memory systems is a proprietary fabric. This fabric physically
extends the host processor’s coherency scheme over multiple nodes, providing low-latency inter-node
communication while maintaining system-wide coherency. These ultra-expensive, hardened
shared-memory supercomputers are designed to accommodate concurrent, enterprise, or transactional
processing applications. These applications (VMware, Oracle, dbase, SAP, etc.) can utilize one to 512+
processor cores and terabytes of shared memory. Most of these applications are optimized for the host OS
and the micro-architecture of the host processor, but not for the macro-architecture of the target system.
Shared-memory systems are also a great deal easier to develop applications for. In fact, it is rarely
necessary to modify code-sets or data-sets to run on a shared-memory system, because most SMP software
is plug-and-play, which is why shared-memory supercomputers are in such high demand.
Conversely, MPI clusters are comprised entirely of commodity servers connected via Ethernet,
InfiniBand, or similar communication fabrics. However, these commodity networks introduce tremendous
latency compared to the proprietary fabrics of OEM shared-memory supercomputers. Additionally, cluster
computing challenges application providers to comply with the strict rules of MPI and to work
within the memory limitations of the SMP nodes that make up the cluster. Despite the computational
and porting overhead, the cost benefits of commodity-based computing solutions make MPI clusters a
staple of university and small-business research labs.
Although MPI is the platform of choice for universities and research labs, data-sets in
bioinformatics, oil and gas, atmospheric modeling, etc. are becoming too large for single-node Symmetric
Multiprocessing (SMP) systems and are impractical for an MPI cluster, due to the problems that arise
when data-sets are partitioned across nodes. The alternative is to purchase time on a National Lab
shared-memory supercomputer (such as the ORNL peta-scale Cray XT4/XT5 Jaguar supercomputer). The
problems with the Jaguar option are cost, time, and overkill. In short, the reliability, availability, and
serviceability (RAS) of enterprise computing is quite different from what a researcher wants. For
example, researchers and academics:
• Do not need a hardened, enterprise-class, 9-9s reliable platform;
• Do not run multiple applications concurrently and have no need for virtualization;
• Run applications that are single-process, multiple-thread;
• Have an aversion to spending the time, dollars, and staff-hours needed to apply for access to these
National Lab machines;
• Do not want to wait weeks on end in a queue to run their application;
• Are willing to optimize their applications for the target hardware to get the most out of the run;
• Ultimately want unencumbered 24/7 access to an affordable shared-memory machine, just like
their MPI cluster.
Enter Symmetric Computing
The design team of Symmetric Computing came out of the research community. As such, they
are very aware of the problems researchers face today and will face in the future. This awareness drove
the development of DSMP and the need to base it on commodity hardware. Our intent is nothing short of
having DSMP do for shared-memory supercomputing what the Beowulf project (MPI) did for cluster
computing.
How DSMP works
As stated in the introduction, DSMP is software that transforms an InfiniBand-connected cluster
of homogeneous 1U/4P commodity servers into a shared-memory supercomputer. Although there are two
unique kernels (a host kernel and a micro-kernel), for this discussion we will ignore the difference
between them because, from the programmer’s perspective, there is only one OS image and one kernel.
The DSMP kernel provides seven (7) enhancements that transform a cluster into a distributed symmetric
multiprocessing platform. They are:
1. The shared-memory system;
2. The optimized InfiniBand driver which supports a shared-memory architecture;
3. An application driven, memory page coherency scheme;
4. An enhanced multi-threading service, based on the POSIX thread standard;
5. A distributed MuTeX;
6. A memory-based distributed disk-queue; and
7. A distributed disk array.
The shared-memory system: The centerpiece of DSMP is its shared-memory architecture. For our
example we will assume a three-node 4P system with 64GB of physical memory per node. The three nodes
are networked via 40Gb InfiniBand and there is no switch. This is, in fact, our value Treo™
Departmental Supercomputer product offering.
[Image: Treo™ Departmental Supercomputer]
Figure 2 presents a macro view of the DSMP memory architecture. What becomes quite obvious from
viewing this graphic is the use of two memory segments, i.e., local-memory and global-memory.
[Figure 2 - DSMP memory architecture. Each SMP node (SMP 0, SMP 1, ... SMP n) holds 64GB of physical
memory spread over processors P0 to P3, split into 16GB of local memory and the remaining global
memory (labeled 4GB and 12GB per processor). The nodes are linked through InfiniBand TX/RX ports.]
Both coexist in the SMP physical memory and are evenly distributed over the four AMD64 processors on
each of the three servers. However, the memory management unit (MMU) on the AMD Opteron™
processor sees only the local memory (as noted in blue). Local memory is statically allocated by the
kernel. For our Treo™ example we will assume 1GB of local memory for every AMD64 core within the
server. Hence, with four quad-core processors (16 cores) per server, there are 16GB of local-memory per
server, or 48GB of local-memory allocated from the 192GB of available system-wide memory. The
remaining 144GB is global-memory, which is concurrently viewable and accessible by all 48 processor
cores within the Treo™ Departmental Supercomputer.
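The local/global split above is simple arithmetic, and the short Python sketch below reproduces it. It is only an illustration of the numbers quoted in the text; the function and variable names are ours and are not part of DSMP.

```python
# Illustrative arithmetic only; names are hypothetical, figures come from the Treo example.
NODES = 3
PROCESSORS_PER_NODE = 4      # 4P server
CORES_PER_PROCESSOR = 4      # quad-core AMD64
PHYSICAL_GB_PER_NODE = 64
LOCAL_GB_PER_CORE = 1        # static local-memory allocation assumed in the text


def partition_memory(nodes, procs_per_node, cores_per_proc, physical_gb, local_gb_per_core):
    """Return the local/global memory split per node and system-wide, in GB."""
    cores_per_node = procs_per_node * cores_per_proc
    local_per_node = cores_per_node * local_gb_per_core
    global_per_node = physical_gb - local_per_node
    return {
        "cores_total": nodes * cores_per_node,
        "physical_total": nodes * physical_gb,
        "local_per_node": local_per_node,
        "local_total": nodes * local_per_node,
        "global_per_node": global_per_node,
        "global_total": nodes * global_per_node,
    }


treo = partition_memory(NODES, PROCESSORS_PER_NODE, CORES_PER_PROCESSOR,
                        PHYSICAL_GB_PER_NODE, LOCAL_GB_PER_CORE)
for key, value in treo.items():
    print(f"{key}: {value}")
```

Changing LOCAL_GB_PER_CORE shows how a larger local allocation trades directly against the global pool.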
All memory (local and global) is partitioned into 4,096-byte pages, each equal to 64 AMD64 cache-lines
of 64 bytes. When there is a cache-line miss from local-memory (a page fault), the kernel identifies a
least recently used (LRU) memory-page and swaps in the missing memory-page from global-memory. That
happens, across the InfiniBand fabric, in just under 5 µs, and even faster if the page is on the same
physical node.
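To make the LRU behavior concrete, here is a minimal Python sketch of local memory treated as an LRU cache of pages. It is a toy model under our own assumptions, not the DSMP kernel code: it tracks only page numbers, ignores write-back and timing, and the class and method names are hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4096     # bytes per memory-page (from the text)
CACHE_LINE = 64      # bytes per AMD64 cache-line


class LocalMemory:
    """Toy model: local memory as an LRU set of resident page numbers."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.resident = OrderedDict()   # page number -> None, least recently used first
        self.faults = 0
        self.evicted = []               # page numbers swapped out, in order

    def touch(self, address):
        """Access a byte address; return True if the access caused a page fault."""
        page_no = address // PAGE_SIZE
        if page_no in self.resident:
            self.resident.move_to_end(page_no)      # now the most recently used page
            return False
        self.faults += 1
        if len(self.resident) >= self.capacity:
            victim, _ = self.resident.popitem(last=False)   # evict the LRU page
            self.evicted.append(victim)
        self.resident[page_no] = None               # missing page swapped in
        return True


local = LocalMemory(capacity_pages=2)
trace = [0, 4096, 100, 8192, 4096]
fault_flags = [local.touch(addr) for addr in trace]
print(fault_flags, local.faults, local.evicted)
```

In this trace the access to address 100 hits page 0 and refreshes it, so page 1 is the one evicted when page 2 arrives.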
The Optimized InfiniBand Drivers: The entire success of DSMP revolved around the existence
of a low latency, commercially available network fabric. It wasn’t that long ago, with the exit of Intel
from InfiniBand, that the industry experts were forecasting its demise. Today InfiniBand is the fabric of
choice for most High Performance Computing (HPC) clusters due to its low latency and high bandwidth.
To squeeze every last nanosecond of performance out of the fabric, the designer of DSMP
bypassed the Linux InfiniBand protocol stack and wrote his own low-level driver. In addition, he
developed a set of drivers that leverage the native RDMA capabilities of the InfiniBand host channel
adapter (HCA). This allows the HCA to service memory-page requests and move memory-pages without
processor intervention. Hence, RDMA eliminates the overhead of message construction and
deconstruction, reducing system-wide latency.
An application driven, memory page coherency scheme: As stated earlier, all
proprietary supercomputers maintain memory consistency and/or coherency via a hardware extension of
the host processor. DSMP takes a different approach, maintaining two separate levels of coherency
within the system. First, there is cache-line coherency within the local SMP server. Coherency at this
level is maintained by the MMU and the SMP logic native to the AMD64 processor, i.e., Cache-coherent
HyperTransport™ Technology. Second, global memory-page coherency and consistency are controlled
and maintained by the programmer. This approach may seem counterintuitive at first. However, the
target market segment for DSMP is technical computing, not enterprise, and it is assumed that the end
user is familiar with the algorithm and how to optimize it for the target platform (in the same way code
was optimized for a Beowulf cluster). The high skill level of the end users, combined with the need to
use only commodity hardware, drove system-level design decisions that keep a DSMP cluster both
affordable and fast. To attain these goals, new and enhanced Linux primitives were developed. Hence,
with some simple, intuitive programming rules, augmented with the new primitives, porting an
application to a DSMP platform (while maintaining coherency) is simple and manageable. Those rules
are as follows:
• Be sensitive to the fact that memory-pages are swapped into and out of local memory from global
memory in 4K pages and that it takes roughly 5 µs to complete each swap.
• Be careful not to overlap or allocate multiple data sets within the same memory page. To help
prevent this, a new Alloc( ) primitive is provided to assure alignment.
• Because of the way local and global memory are partitioned (within physical memory), care
should be taken to distribute processes/threads and their associated data evenly over the four
processors. In short, try not to pile up processes/threads on one processor/memory unit, but rather
distribute them evenly over the system. POSIX thread primitives are provided to support the
distribution of these threads.
• If there is a data-set which is “modified-shared” and accessed by multiple processes/threads which
are on an adjacent server, then it will be necessary to use a set of new Linux primitives
to maintain coherency, i.e., Sync( ), Lock( ) and Release( ).
Multi-Threading: The “gold standard” for parallelizing Linux C/C++ source code is the
POSIX thread library, or Pthreads. POSIX is an acronym for Portable Operating System Interface. The
latest version, POSIX.1 (IEEE Std 1003.1, 2004 Edition), was developed by the Austin Common
Standards Revision Group (CSRG). To ensure that Pthreads would work with DSMP, each of the two
dozen or so POSIX routines was either tested against or modified for DSMP and the Treo™ platform.
The common method for parallelizing a process is via the Fork( ) primitive. Within DSMP there
is a flag associated with Fork( ). This flag determines whether the forked thread is to stay local (with
the current process on the primary server) or run on one of the remote servers. This allows the
programmer to specify how many threads of a given process are serviced by the head node. Simple
analysis will show just how many threads can run concurrently before performance flattens out due to
the memory-wall effect or other conditions. Once this value is understood, the remote flag can be used
to evenly distribute threads over all the servers within the DSMP system. By default, each successive
instance of Fork( ) causes that thread to be associated with the next server in the DSMP system, in a
round-robin fashion. Hence, a remote Fork( ) of three threads on Treo™ would place the current process
on each of the three servers, with one thread per server. The kernel manages the consistency of the
process to ensure it executes with the same environment and associated state variables.
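The placement policy described above can be sketched in a few lines of Python. This is a toy model, not the DSMP Fork( ) interface: the "local"/"remote" strings stand in for the flag, the server names are made up, and we assume that the round-robin starts at the first server in the list, which the text does not specify.

```python
from collections import Counter
from itertools import cycle


def place_threads(flags, servers):
    """Return the server chosen for each forked thread.

    flags   -- one "local" or "remote" string per fork (hypothetical stand-in for the flag)
    servers -- server names; servers[0] plays the role of the head node
    """
    next_remote = cycle(servers)        # round-robin over every server in the system
    placement = []
    for flag in flags:
        if flag == "local":
            placement.append(servers[0])            # stay with the current process
        elif flag == "remote":
            placement.append(next(next_remote))     # next server in round-robin order
        else:
            raise ValueError(f"unknown flag: {flag!r}")
    return placement


treo_servers = ["smp0", "smp1", "smp2"]
three_remote = place_threads(["remote"] * 3, treo_servers)
mixed = place_threads(["local", "local", "remote", "remote", "remote", "remote"], treo_servers)
print(three_remote)
print(Counter(mixed))
```

Three remote forks land one per server, matching the Treo™ example in the text, while the mixed case shows how local forks concentrate load on the head node.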
Coherency at the memory-page level is the responsibility of the programmer. A lot of this is
common sense: if a memory page is accessed by multiple threads and updated (modified-exclusive),
then it will be necessary to hold off pending threads until the current thread has updated the page in
question. To facilitate this, three DSMP Linux primitives are provided: Sync( ), Lock( ) and
Release( ).
• Sync( ): as the name implies, synchronizes one (1) local private memory-page with its
source global-memory page.
• Lock( ): prevents any other process thread from accessing and subsequently
modifying the memory-page. Lock( ) also invalidates all other copies of the locked
memory-page within the system. If a process thread on an adjacent server accesses a locked memory
page, its execution is suspended until the page is released.
• Release( ): as the name implies, releases a previously locked memory page.
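The interplay of the three primitives can be illustrated with a small single-threaded Python model of one shared page. It is a sketch under our own assumptions and not the DSMP API: the class and method names are hypothetical, a blocked reader is represented by a None return value instead of a suspended thread, and we assume Sync( ) copies the local page back to its global source.

```python
class SharedPage:
    """Toy model of one global-memory page and its per-server local copies."""

    def __init__(self, value=0):
        self.global_value = value
        self.local_copies = {}      # server name -> locally cached value
        self.owner = None           # server currently holding the lock, if any

    def read(self, server):
        """Return the value the server sees, or None if it would be suspended."""
        if self.owner is not None and self.owner != server:
            return None             # stands in for "execution is suspended"
        if server not in self.local_copies:
            self.local_copies[server] = self.global_value   # page swapped in
        return self.local_copies[server]

    def lock(self, server):
        if self.owner is not None:
            raise RuntimeError("page is already locked")
        self.owner = server
        # Invalidate every other copy of the locked page.
        self.local_copies = {s: v for s, v in self.local_copies.items() if s == server}

    def write(self, server, value):
        if self.owner != server:
            raise RuntimeError("write attempted without holding the lock")
        self.local_copies[server] = value

    def sync(self, server):
        """Assumed direction: push the server's local copy to the global page."""
        self.global_value = self.local_copies[server]

    def release(self, server):
        if self.owner != server:
            raise RuntimeError("release attempted by a non-owner")
        self.owner = None


page = SharedPage(10)
before = (page.read("smp0"), page.read("smp1"))
page.lock("smp0")
page.write("smp0", page.read("smp0") + 1)
blocked = page.read("smp1")     # smp1 would be suspended here
page.sync("smp0")
page.release("smp0")
after = page.read("smp1")       # fresh copy, because the old one was invalidated
print(before, blocked, after)
```

The ordering lock, modify, sync, release is what lets the second server observe the updated value once it is allowed to proceed.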
Lastly, to ensure that data structures do not overlap, a new DSMP Alloc( ) primitive is provided to
force alignment of a given data-structure on a 4K boundary. This primitive assures that the end of one
data-structure does not share a memory page with an adjacent data-structure.
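The effect of 4K alignment can be checked with plain offset arithmetic. The Python sketch below is not the Alloc( ) primitive itself; it only shows, with hypothetical helper names, how rounding each start offset up to a page boundary guarantees that no two non-empty structures share a page.

```python
PAGE_SIZE = 4096


def align_up(offset, alignment=PAGE_SIZE):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def layout(sizes, alignment=PAGE_SIZE):
    """Return (start, end) byte offsets, with every structure starting on a page boundary."""
    spans = []
    cursor = 0
    for size in sizes:
        start = align_up(cursor, alignment)
        spans.append((start, start + size))
        cursor = start + size
    return spans


def pages_of(span):
    """Set of page numbers touched by a non-empty (start, end) span."""
    start, end = span
    return set(range(start // PAGE_SIZE, (end - 1) // PAGE_SIZE + 1))


spans = layout([100, 5000, 4096])
page_sets = [pages_of(span) for span in spans]
print(spans)
print(page_sets)
```

Without the alignment step, the 100-byte and 5,000-byte structures would both occupy page 0, which is exactly the overlap the rule warns against.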
Distributed MuTeX: Wikipedia describes MuTeX, or mutual exclusion, as a set of algorithms
used in concurrent programming to avoid the simultaneous use of a common resource, such as
a global variable or a critical section. A distributed MuTeX is nothing more than a DSMP kernel
enhancement which ensures that MuTeX functions as expected within the DSMP system. From a
programmer’s point of view, there are no changes or modifications to MuTeX; it just works.
Memory based distributed disk-queue: A new DSMP primitive, D_file( ), provides a
high-bandwidth/low-latency elastic queue for data which is intended to be written to a low-bandwidth
interface, such as a Hard Disk Drive (HDD) or the network. This distributed input/output queue is a
memory (DRAM) based storage buffer which effectively eliminates the bottlenecks that occur when
multiple threads compete for a low-bandwidth device such as an HDD. Once the current process retires,
the contents of the queue are sent to the target I/O device and the queue is released.
A Distributed disk array: A distributed disk array is implemented by the kernel through
enhancements made to the Linux striped volume manager. These enhancements extend the Linux volume
manager over the entire network interface, presenting to the OS a single consolidated drive. On Treo™
the distributed disk array is made up of six (6) 1TB drives, two per server, for a single 6TB storage
device.
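As a rough illustration of striping across the six drives, the Python sketch below maps a logical block number to a server, a drive, and a block on that drive. The mapping is a generic RAID-0-style assumption on our part (one-block stripe unit, drives numbered consecutively, two per server); the text does not describe the actual layout used by the DSMP volume manager.

```python
NUM_DRIVES = 6            # six 1TB drives on Treo (from the text)
DRIVES_PER_SERVER = 2     # two drives per server (from the text)
DRIVE_CAPACITY_TB = 1


def locate_block(logical_block, num_drives=NUM_DRIVES, drives_per_server=DRIVES_PER_SERVER):
    """Map a logical block to (server index, drive index, block index on that drive)."""
    drive = logical_block % num_drives              # consecutive blocks rotate over the drives
    block_on_drive = logical_block // num_drives
    server = drive // drives_per_server
    return server, drive, block_on_drive


total_capacity_tb = NUM_DRIVES * DRIVE_CAPACITY_TB
first_stripe = [locate_block(block) for block in range(NUM_DRIVES)]
print(total_capacity_tb)
print(first_stripe)
print(locate_block(7))
```

Under this assumed mapping, six consecutive blocks touch all six drives and all three servers, which is where the aggregate bandwidth of a striped volume comes from.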
DSMP Performance
Performance of a supercomputer is a function of two metrics:
1) Processor performance (computational throughput);
2) Global Memory Read/Write performance - which can be further divided into:
a. Stream performance – continuous R/W memory bandwidth and
b. Random read/write performance (memory R/W latency).
The extraordinary thing about DSMP™ is the fact that it is based on commodity components.
That’s important, because DSMP performance scales with the performance of the commodity components
from which it is made. As an example, random read/write latency for Treo™ went down 40% with the
availability of 40Gb InfiniBand. Furthermore, this move from a 20Gb to a 40Gb fabric caused no
appreciable increase in the cost of a Treo™ system (and no changes to the DSMP software were needed).
Also, within this same timeframe, AMD64 processor density went from quad-core to six-core, again
without any appreciable increase in the cost of the total system. Therefore, over time the performance gap
between DSMP™ shared-memory supercomputers and proprietary shared-memory systems will close.
Today proprietary shared-memory system providers have intra-node bandwidth numbers on the
order of 2.5GB/sec. and random access times on the order of 1 µs. That’s an advantage of ~4:1 in
bandwidth and ~5:1 in R/W latency over DSMP™. At first glance, this much of a disparity might appear
to be a disadvantage, but that is not necessarily the case, for three reasons. First, DSMP random R/W
latency is based on the time it takes to move 4,096B, versus the 64B or 128B moved in <1 µs by SGI and
others; that’s a 64:1 or 32:1 difference in the size of the unit transferred (a page versus a cache-line).
In addition, the processors used in these proprietary systems might have enhanced floating-point
capabilities, but they might run slower, in some cases much slower, than a 2.8GHz quad-core AMD
Opteron™ Processor. So performance is not tied entirely to memory latency or processor performance
but is a function of many system variables as well as the algorithm and the way the data is structured.
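The transfer-size argument can be put into numbers with a few lines of Python. The latencies are the round figures quoted in the text (about 5 µs per page for DSMP, about 1 µs per cache-line for proprietary systems); the per-byte comparison is our own illustration and only holds when the application actually uses most of each page it fetches.

```python
PAGE_BYTES = 4096
DSMP_PAGE_LATENCY_US = 5.0            # round figure from the text
PROPRIETARY_LINE_LATENCY_US = 1.0     # round figure from the text


def size_ratio(line_bytes):
    """How many cache-lines of the given size fit in one DSMP page."""
    return PAGE_BYTES // line_bytes


def ns_per_byte(bytes_moved, latency_us):
    """Latency amortized over every byte moved, in nanoseconds."""
    return latency_us * 1000.0 / bytes_moved


ratios = {line: size_ratio(line) for line in (64, 128)}
dsmp_ns_per_byte = ns_per_byte(PAGE_BYTES, DSMP_PAGE_LATENCY_US)
line64_ns_per_byte = ns_per_byte(64, PROPRIETARY_LINE_LATENCY_US)

print(ratios)
print(f"DSMP page fetch:        {dsmp_ns_per_byte:.2f} ns per byte")
print(f"64B cache-line fetch:   {line64_ns_per_byte:.2f} ns per byte")
```

For access patterns with poor locality, where only a few bytes of each page are used, the amortized figure no longer applies and the raw 5:1 latency gap dominates.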
A second and more important reason why DSMP performance is not a problem is access:
that is, having open and unencumbered 24/7 access to a shared-memory system. As an example, let’s
assume it takes 24 hours to run a job on the ORNL Jaguar supercomputer with an allocation of 48
processors and 150GB of shared-memory. However, it takes months to submit the proposal and gain
approval. Then there’s the additional wait in the queue of around 14 days to access the system, which is
typical for this type of engagement. If we assume the DSMP™ shared-memory supercomputer has 1/5 the
performance of the one at Oak Ridge (due to memory latency, bandwidth and related factors), then it
would take five times longer to get the same results: 120 hours versus 24. However, when you take into
account the two-week queue time, the results are available 10 days sooner. In the same time-frame, you
could have run the job three times over.
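The time-to-result comparison works out as follows. This Python sketch uses only the assumptions stated in the text (a 24-hour run, a 14-day queue, a 5x slowdown on DSMP) and deliberately leaves out the months spent on the proposal; the function name is ours.

```python
HOURS_PER_DAY = 24


def time_to_result_hours(run_hours, queue_days=0):
    """Wall-clock hours from job submission to results."""
    return queue_days * HOURS_PER_DAY + run_hours


jaguar_run_hours = 24
jaguar_queue_days = 14
dsmp_slowdown = 5                      # DSMP assumed to deliver 1/5 the performance

jaguar_total = time_to_result_hours(jaguar_run_hours, jaguar_queue_days)
dsmp_total = time_to_result_hours(jaguar_run_hours * dsmp_slowdown)

days_sooner = (jaguar_total - dsmp_total) // HOURS_PER_DAY
runs_in_same_window = jaguar_total // dsmp_total

print(f"Jaguar: {jaguar_total} h, DSMP: {dsmp_total} h")
print(f"DSMP results arrive {days_sooner} days sooner")
print(f"DSMP runs possible in the same window: {runs_in_same_window}")
```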
The third and final reason is value. Today, an entry-level Treo™ departmental supercomputer
costs only $49,950 (University pricing), configured with 144GB of shared memory, 48 2.8GHz AMD64
processor cores and 6TB of disk storage. A comparable shared-memory platform from an OEM would
approach $1,000,000 (not including maintenance and licensing fees); that’s 1/20 of the price at 1/5 the
performance. With the introduction of the Treo™ departmental supercomputer, universities and
researchers have a new option which is based on the same market forces that drove the emergence of the
MPI cluster, i.e., commodity hardware, value and availability. Today, Symmetric Computing is offering
four unique configurations of Treo™, from 48 to 72 AMD64 cores and 144GB to 336GB of
shared-memory (see the table below).
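The value claim reduces to two ratios, worked through in the short Python sketch below. The prices and the 1/5 relative performance are the figures stated in the text; the derived performance-per-dollar number is our own arithmetic on them.

```python
TREO_PRICE_USD = 49_950
OEM_PRICE_USD = 1_000_000          # "would approach" this figure, per the text
TREO_RELATIVE_PERFORMANCE = 1 / 5  # Treo assumed to deliver 1/5 the OEM performance

price_ratio = OEM_PRICE_USD / TREO_PRICE_USD
perf_per_dollar_advantage = TREO_RELATIVE_PERFORMANCE * price_ratio

print(f"OEM price is {price_ratio:.1f}x the Treo price")
print(f"Treo delivers about {perf_per_dollar_advantage:.1f}x the performance per dollar")
```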
Treo™ P/N     Quad-core 2.8GHz   Six-core 2.6GHz   4GB PC5300 DIMMs   8GB PC5300 DIMMs   Total Shared Memory
SCA161604-3   269 Giga-flops     -                 192GB              -                  144GB
SCA241604-3   -                  374 Giga-flops    192GB              -                  120GB
SCA241608-3   -                  374 Giga-flops    -                  384GB              312GB
SCA161608-3   269 Giga-flops     -                 -                  384GB              336GB
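The table values can be cross-checked against the rest of the document. The Python sketch below assumes three nodes with four processors each and 1GB of local memory per core, as in the Treo™ example earlier. The factor of 2 floating-point operations per core per cycle is not stated anywhere in the text; it is inferred here because it reproduces the listed Giga-flops figures, and should be read as an assumption.

```python
NODES = 3
PROCESSORS_PER_NODE = 4
LOCAL_GB_PER_CORE = 1     # from the Treo example in the text
FLOPS_PER_CYCLE = 2       # inferred from the table, not a stated specification

# (part number, cores per processor, clock in GHz, installed GB, listed Gflops, listed shared GB)
CONFIGS = [
    ("SCA161604-3", 4, 2.8, 192, 269, 144),
    ("SCA241604-3", 6, 2.6, 192, 374, 120),
    ("SCA241608-3", 6, 2.6, 384, 374, 312),
    ("SCA161608-3", 4, 2.8, 384, 269, 336),
]


def derive(cores_per_processor, clock_ghz, installed_gb):
    """Return (total cores, peak Gflops, shared GB) for one configuration."""
    cores = NODES * PROCESSORS_PER_NODE * cores_per_processor
    peak_gflops = round(cores * clock_ghz * FLOPS_PER_CYCLE)
    shared_gb = installed_gb - cores * LOCAL_GB_PER_CORE
    return cores, peak_gflops, shared_gb


results = {pn: derive(cpp, ghz, gb) for pn, cpp, ghz, gb, _, _ in CONFIGS}
for part_number, (cores, gflops, shared) in results.items():
    print(f"{part_number}: {cores} cores, {gflops} Gflops peak, {shared}GB shared")
```

The six-core parts give up more memory to local allocation (72GB instead of 48GB), which is why SCA241604-3 lists less shared memory than the quad-core part with the same DIMMs.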
Looking forward to 1Q10, the Symmetric Computing engineering staff will introduce a 10-node
blade center delivering 1.2 Tera-flops of peak throughput with 640GB or 1.28TB of system memory. In
addition, we are working with our partners to deliver turn-key platforms tuned for application-specific
missions, such as next-generation sequencing, HMMER, BLAST, etc.
Conclusion
Symmetric Computing’s overall goal is to make supercomputing accessible and affordable to a
broad range of end users. We believe that DSMP is to shared-memory computing what Beowulf/MPI was
to distributed-memory computing. We are focused on delivering affordable, commodity-based
technical computing solutions that serve an entirely new market: the Departmental
Supercomputer. Our initial focus is to provide open applications optimized to run under DSMP and on
Treo™, to accelerate scientific developments in biosciences and bioinformatics. We continue to expand
our scope of applications and remain committed to delivering Supercomputing to the Masses.
About Symmetric Computing
Symmetric Computing is a Boston-based software company with offices at the Venture
Development Center on the campus of the University of Massachusetts – Boston. We design software to
accelerate the use and application of shared-memory computing systems for Bioinformatics, Oil & Gas,
Post Production Editing, Financial analysis and related fields. Symmetric Computing is dedicated to
delivering standards-based, customer-focused technical computing solutions for users, ranging from
Universities to enterprises. For more information, visit www.symmetriccomputing.com.